Sample Size Calculator

Metricuno

May 17, 2026

5 min read

Quick answer

Work out how many visitors per variant your A/B test needs to detect a target lift at 80% power and 95% confidence — before you ship the experiment.

Definition

Experimentation

Sample Size Calculator

A tool that returns the visitors-per-variant an A/B test needs to detect a target lift at chosen power and significance levels.

A sample size calculator is the pre-flight check for any A/B test. You feed in your baseline conversion rate, the smallest lift you care about (the minimum detectable effect, or MDE), your significance threshold, and your statistical power. The calculator returns the number of visitors each variant needs before the result is trustworthy.

Skipping this step is one of the most common sources of false reads in conversion rate optimization (CRO). A test stopped early on a flattering p-value and a test sized for a 20% lift when the real effect is 3% lead to the same outcome: shipping changes that don't actually move revenue.

Also known as

A/B test sample size calculator

Test duration calculator

Power calculator

Calculator

A/B test sample size calculator

Inputs

Baseline conversion rate: Your current conversion rate, before the test.

Minimum detectable effect (relative): Smallest relative lift you want to detect. 10% means a 5% baseline lifting to 5.5%.

Significance level (α): 0.05 = 5% false-positive rate (standard).

Statistical power (1−β): 0.80 = 80% chance of detecting a true effect (standard).


Inputs assume a two-sided z-test for proportions. Daily visitor input is split evenly across variants to estimate test duration.

The widget above uses the standard two-proportion z-test, which is the right model for binary outcomes like add-to-cart, checkout completion, or email signup. For continuous metrics like average order value or revenue per visitor, the math changes — you'd use a t-test variant that factors in the variance of the metric, not just its mean.
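For a continuous metric the same logic applies, with the metric's standard deviation taking the place of the binomial variance. The sketch below is a minimal Python version of the usual normal-approximation formula, n = 2 * (Z_alpha/2 + Z_beta)^2 * sigma^2 / delta^2. It is an assumption about how such a calculation is commonly done, not something this widget implements, and the function name and order-value figures are hypothetical.

```python
import math
from statistics import NormalDist


def visitors_per_variant_continuous(std_dev, min_abs_diff, alpha=0.05, power=0.80):
    """Per-variant sample size for comparing two means (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = nd.inv_cdf(power)           # z-score matching the chosen power
    n = 2 * (z_alpha + z_beta) ** 2 * std_dev ** 2 / min_abs_diff ** 2
    return math.ceil(n)


# Hypothetical inputs: order values with a standard deviation of 60 (any currency)
# and a smallest difference worth detecting of 4 (5% of an assumed mean of 80).
orders_needed = visitors_per_variant_continuous(std_dev=60.0, min_abs_diff=4.0)
print(orders_needed)
```

The unit of the result is whatever the metric is averaged over: orders for AOV, visitors for revenue per visitor. Revenue data is usually heavily skewed, so treat the output as a rough guide rather than a guarantee.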

The formula behind the number

Formula

n = 2 * ( (Z_alpha/2 + Z_beta)^2 * p_bar * (1 - p_bar) ) / (p1 - p2)^2

Variables

n

Sample size per variant

Visitors needed in each arm of the test

p1

Baseline conversion rate

Your current conversion rate on the control

p2

Treatment conversion rate

Baseline plus the minimum detectable effect

p_bar

Pooled rate

The average of p1 and p2

Z_alpha/2

Significance z-score

1.96 for α = 0.05, two-sided

Z_beta

Power z-score

0.84 for 80% power

Worked example

A Shopify apparel store wants to test a new product page layout. Baseline checkout conversion is 3.5%, and the team will only ship a change that delivers at least a 10% relative lift (so p2 = 3.85%).

Baseline conversion (p1): 3.5%

Target conversion (p2): 3.85%

Significance (α): 0.05 two-sided

Power (1 − β): 0.80

Pooled rate (p_bar): 3.675%

→ ≈ 45,400 visitors per variant (≈ 90,700 total)

At 4,000 daily sessions split evenly across two variants, the test needs roughly 23 days to collect that sample. Shipping the winner earlier on a flattering interim p-value would be peeking, which pushes the false-positive rate well above the nominal 5%.
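The worked example can be reproduced in a few lines. This is a minimal Python sketch of the formula as written above, using exact z-scores from the standard library. The function names are illustrative and this is not the widget's own code, so small differences from a calculator's output are expected if it uses rounded z-scores or a different variance term. Rounding the duration up to whole weeks is an added assumption, in line with the usual advice to cover full weekly cycles.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Two-sided, two-proportion z-test sample size per variant."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(power)            # about 0.84 for 80% power
    p1 = baseline
    p2 = baseline * (1 + relative_mde)    # treatment rate implied by the relative MDE
    p_bar = (p1 + p2) / 2                 # pooled rate
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2
    return math.ceil(n)


def duration_days(n_per_variant, daily_visitors, variants=2):
    """Days needed when daily traffic is split evenly across all variants."""
    return math.ceil(n_per_variant * variants / daily_visitors)


n = visitors_per_variant(baseline=0.035, relative_mde=0.10)
days = duration_days(n, daily_visitors=4000)
weeks = math.ceil(days / 7)  # assumption: round up to whole weeks
print(f"{n:,} per variant, {2 * n:,} total, {days} days, {weeks} whole weeks")
```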

Two intuitions are worth internalising from this formula. First, sample size scales with the inverse square of the effect you're trying to detect — halving your MDE quadruples the required visitors. Second, lower baseline conversion rates need more traffic, because the same relative lift represents a smaller absolute change that's harder to distinguish from noise.

How sample size scales with baseline CVR and MDE

Benchmark

Visitors per variant required at 80% power, α = 0.05 (two-sided), for common baseline conversion rates and minimum detectable effects.

Baseline CVR	MDE 5% relative	MDE 10% relative	MDE 20% relative	MDE 30% relative
1.5%	422,000	108,000	28,300	13,100
2.5%	251,000	64,200	16,800	7,800
3.5%	177,000	45,400	11,900	5,500
5.0%	122,000	31,200	8,200	3,800
8.0%	73,900	18,900	4,900	2,300
12.0%	47,000	12,000	3,100	1,400

The cell that usually surprises store owners: at a 2.5% baseline checkout rate, detecting a 5% relative lift needs about a quarter of a million visitors per variant, or roughly half a million in total. Most Shopify stores in the €1M–€5M revenue band don't have that traffic in a reasonable test window, which is why you usually need to test higher up the funnel (product detail page click-through, add-to-cart), where baseline rates are higher and the same relative MDE needs far fewer visitors.
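The funnel argument can be checked by turning the question around: given the traffic you actually have, what is the smallest relative lift the test could detect? The sketch below bisects on the formula from the previous section. The traffic figures, the 10% add-to-cart baseline, and the function names are hypothetical, and holding traffic constant across funnel steps is a simplifying assumption.

```python
from statistics import NormalDist


def visitors_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Required visitors per variant for a two-sided two-proportion z-test."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    return 2 * z_sum ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2


def smallest_detectable_mde(baseline, visitors_available, max_mde=1.0):
    """Bisect for the smallest relative MDE that fits the available sample.

    Assumes baseline <= 0.5 so the treatment rate stays a valid proportion.
    """
    if visitors_per_variant(baseline, max_mde) > visitors_available:
        return None  # even a 100% relative lift is out of reach
    low, high = 1e-6, max_mde
    for _ in range(60):
        mid = (low + high) / 2
        if visitors_per_variant(baseline, mid) > visitors_available:
            low = mid    # effect too small for this much traffic
        else:
            high = mid   # detectable, try a smaller effect
    return high


# Hypothetical store: 2,000 daily visitors for 28 days, split across two variants.
per_variant = 2000 * 28 / 2
checkout_mde = smallest_detectable_mde(0.025, per_variant)    # 2.5% checkout baseline
add_to_cart_mde = smallest_detectable_mde(0.10, per_variant)  # assumed 10% add-to-cart baseline
print(f"checkout: {checkout_mde:.1%}, add-to-cart: {add_to_cart_mde:.1%}")
```

With the same traffic, the higher-baseline step can resolve a noticeably smaller relative lift, which is the practical reason to move the test up the funnel.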

Choosing inputs you can defend

MDE is the input most teams get wrong. The right number isn't "the lift I hope to see"; it's "the smallest lift that would still justify shipping this change." If a 3% checkout lift on your apparel store would pay for the engineering work and you'd happily ship it, set MDE at 3%, and don't inflate it to 10% just because the test would otherwise take too long. A test sized for a larger effect simply won't detect smaller, real wins.

Power and significance are more conventional. Eighty percent power and 5% significance are the industry defaults and a reasonable starting point. Raise power to 90% when the cost of a missed winner is high (a homepage test you can't easily re-run); tighten significance to 1% when you're testing many variants in parallel and want to control the family-wise error rate.
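The traffic cost of stricter settings is easy to quantify, because with baseline and MDE held fixed everything in the formula cancels except the (Z_alpha/2 + Z_beta)^2 term. A small Python sketch under that assumption, with an illustrative function name:

```python
from statistics import NormalDist


def sample_size_multiplier(alpha, power, base_alpha=0.05, base_power=0.80):
    """Required sample relative to the 5% / 80% defaults, same baseline and MDE."""
    nd = NormalDist()

    def z_term(a, p):
        # Squared sum of the two-sided significance z-score and the power z-score
        return (nd.inv_cdf(1 - a / 2) + nd.inv_cdf(p)) ** 2

    return z_term(alpha, power) / z_term(base_alpha, base_power)


higher_power = sample_size_multiplier(alpha=0.05, power=0.90)
stricter_alpha = sample_size_multiplier(alpha=0.01, power=0.80)
print(f"90% power: x{higher_power:.2f}, alpha 0.01: x{stricter_alpha:.2f}")
```

Moving to 90% power costs about a third more traffic, and tightening α to 0.01 costs about half as much again, so both choices are worth budgeting for before the test starts.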

Peeking inflates your false-positive rate

Checking the test daily and stopping the moment p drops below 0.05 is the single most common way A/B tests lie. With repeated looks at α = 0.05, the real false-positive rate can climb to 20–30% over a few weeks of daily checks, because every extra look is another chance for noise to cross the threshold. Commit to a sample size up front, run to completion, and only ship on the final read, or use a sequential test design that's built to be peeked at.
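A quick simulation makes the inflation visible. The sketch below runs A/A tests (no true difference) and stops at the first look where |z| exceeds the critical value. It relies on two simplifying assumptions: looks happen after equal-sized batches of traffic, and the test statistic is approximated by a running sum of standard normal increments. The function name and defaults are illustrative.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks=20, simulations=2000, alpha=0.05, seed=0):
    """Share of A/A tests declared significant at any of `looks` interim checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)   # one more batch of null data
            z = running_sum / look ** 0.5        # z-statistic after `look` batches
            if abs(z) > z_crit:
                false_positives += 1             # stopped early on noise
                break
    return false_positives / simulations


single_look = peeking_false_positive_rate(looks=1)
daily_looks = peeking_false_positive_rate(looks=20)
print(f"one final read: {single_look:.1%}, 20 peeks: {daily_looks:.1%}")
```

With a single final read the rate stays near the nominal 5%. With 20 looks it typically lands several times higher, in line with the 20–30% range quoted above.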

Frequently asked

Sample size calculator FAQ

What's the difference between absolute and relative MDE?

Absolute MDE is the percentage-point change (3.5% → 4.5% is a 1.0 percentage-point absolute MDE). Relative MDE is the percentage change relative to baseline (3.5% → 4.5% is a 28.6% relative MDE). Most calculators, including this one, default to relative MDE because it's easier to reason about across different baselines.

Why does lower baseline conversion need more traffic?

At low baselines, conversions are rare events and the noise is large relative to the signal. A 10% relative lift on a 1% baseline (1.0% → 1.1%) is a 0.1 percentage-point change, which is much harder to distinguish from random fluctuation than a 0.5-point change on a 5% baseline. In the formula, the squared difference (p1 - p2)^2 in the denominator shrinks faster than the pooled-variance term in the numerator, so n grows.

Should I use a one-sided or two-sided test?

Use two-sided unless you have a strong prior reason to only care about lifts in one direction. Two-sided is more conservative and protects against the case where your change actively hurts conversion. It's the default in this calculator and in most experimentation platforms.

How do I size a test with more than two variants?

Each additional variant needs its own full sample, and you should tighten your significance threshold to control for multiple comparisons. A common Bonferroni-style correction: divide your α by the number of comparisons. With three variants vs control, run α = 0.05 / 3 ≈ 0.017.

What if my traffic varies a lot day-to-day?

Size on a conservative daily-visitor estimate (e.g. the 25th percentile of recent daily traffic) so the test runs at least one full business cycle. Always run for whole weeks and never stop mid-week, because conversion patterns differ between weekdays and weekends.

Can I just stop the test as soon as it hits significance?

No. Continuous peeking at a fixed α inflates the false-positive rate well above the nominal 5%. Either commit to the full sample size up front, or switch to a sequential testing method (such as group sequential boundaries or always-valid p-values) that's mathematically built for early stopping.

What's a realistic MDE for an e-commerce A/B test?

Most winning tests on product detail page (PDP) or cart-page changes deliver 2–8% relative lifts; only large redesigns or pricing changes routinely move conversion by 10% or more. If your calculator says you need 500k visitors per variant, the issue is usually that you've set MDE too low for your traffic. Consider testing further up the funnel where baselines are higher.

Does this work for revenue per visitor or AOV?

Not directly. Revenue and AOV are continuous metrics with their own variance, so they need a t-test (or bootstrap) approach that takes the standard deviation of the metric into account, not just its mean. The two-proportion z-test this widget uses is correct for any binary metric: conversion, click-through, add-to-cart, email signup.

How does Metricuno handle sample size in practice?

Metricuno pulls your historical GA4 baseline conversion rates automatically, so when you set up a test the platform pre-fills the calculator and warns you if your declared MDE is unrealistic given your traffic. You set the experiment to ship only when the sample is hit, so no manual cut-off discipline is required.

What if I can't reach the required sample size?

You have three honest options: raise MDE (only test bigger changes), test higher up the funnel where baseline rates are larger, or accept lower power (say 70%) knowing you'll miss more true winners. Don't loosen significance to α = 0.10: that just trades a real problem for a hidden one.

Test ideas before you ship them

Run unlimited A/B tests, attach hypotheses to outcomes, and build a searchable archive of what works and what doesn't.

Launch your first experiment