Sample Size
Sample size is the number of visitors an A/B test needs to detect a target effect with chosen power and significance — calculated before the test starts.
Sample size is the minimum number of visitors (per variant) an experiment needs before the result is statistically trustworthy. It is a function of three inputs you choose up front: the baseline conversion rate, the minimum detectable effect (MDE) you want to catch, and the false-positive and false-negative rates you are willing to tolerate.
You calculate sample size before the test launches, not after. Without a pre-committed number, there is no honest stopping rule — you end up peeking at the dashboard and calling winners on noise. Treat the sample-size number as the experiment's exit criterion.
Skipping the sample-size step is the single most common mistake on a low-velocity testing program. Tests get called after a few days because one variant is up 14%, then quietly regress to flat once real traffic accumulates. Calculating sample size is what stops that pattern.
Three numbers drive the result: your baseline conversion rate, the smallest lift you genuinely care about (MDE), and your tolerance for being wrong. A lower baseline, a smaller MDE, or stricter thresholds each push the required sample up, usually faster than people expect.
n = 2 * ((z_alpha + z_beta)^2 * p * (1 - p)) / (p * mde)^2
| Symbol | Name | Meaning |
|---|---|---|
| n | Sample size per variant | Visitors required in each variant (control and treatment). |
| p | Baseline conversion rate | Current conversion rate of the page or flow you are testing, as a decimal. |
| mde | Minimum detectable effect | Smallest relative lift you want to be able to detect, as a decimal (e.g. 0.10 for a 10% relative lift). |
| z_alpha | Significance z-score | Z-score for your significance level. 1.96 for a two-sided test at α = 0.05. |
| z_beta | Power z-score | Z-score for your statistical power. 0.84 for 80% power. |
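The formula translates directly into a few lines of Python. This is a minimal sketch, not a replacement for a vetted calculator: the function names `z_scores` and `sample_size_per_variant` are illustrative, and the defaults are the rounded z-scores 1.96 and 0.84.

```python
import math
from statistics import NormalDist

def z_scores(alpha=0.05, power=0.80):
    """Return (z_alpha, z_beta) for a two-sided test from the standard normal."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2), nd.inv_cdf(power)

def sample_size_per_variant(p, mde, z_alpha=1.96, z_beta=0.84):
    """n = 2 * (z_alpha + z_beta)^2 * p * (1 - p) / (p * mde)^2, rounded up."""
    delta = p * mde  # absolute lift implied by the relative MDE
    n = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2
    return math.ceil(n)

# 3.2% baseline, 10% relative MDE, rounded z-scores
print(sample_size_per_variant(0.032, 0.10))

# Same inputs with unrounded z-scores for alpha = 0.05 and 80% power
z_a, z_b = z_scores(alpha=0.05, power=0.80)
print(sample_size_per_variant(0.032, 0.10, z_a, z_b))
```

With the rounded z-scores this gives roughly 47,400 visitors per variant for a 3.2% baseline and a 10% relative MDE; unrounded z-scores give a slightly larger figure, which is one reason two calculators rarely agree to the last digit.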
A Shopify apparel store wants to test a new product-page hero. The PDP currently converts at 3.2%. The team wants to detect a 10% relative lift (so the variant would need to hit ~3.52%) at 95% significance and 80% power.
Baseline conversion rate (p): 0.032
MDE (relative): 0.10
z_alpha (95% two-sided): 1.96
z_beta (80% power): 0.84
→ ≈ 47,400 visitors per variant (≈ 94,900 total)
At 30,000 PDP visitors per week, that is roughly a 3-week test. If the store only gets 8,000 PDP visitors a week, the same test takes ~12 weeks — at which point you either widen the MDE, broaden the test surface, or pick a higher-traffic page.
Two patterns fall out of the formula. First, sample size scales with 1/MDE², so halving the MDE quadruples the sample size; chasing 5% lifts on a low-traffic site is usually a non-starter. Second, sample size scales with (1 − p)/p, so halving the baseline conversion rate roughly doubles the visitors needed. Low-converting surfaces are therefore slow to test, and deep-funnel steps such as checkout are slowed further by their lower traffic.
Required sample size per variant by baseline conversion rate and target MDE (95% significance, 80% power)
| Baseline CVR | MDE 5% | MDE 10% | MDE 15% | MDE 20% |
|---|---|---|---|---|
| 1.5% | ~412,000 | ~103,000 | ~45,800 | ~25,700 |
| 2.5% | ~245,000 | ~61,200 | ~27,200 | ~15,300 |
| 3.5% | ~173,000 | ~43,200 | ~19,200 | ~10,800 |
| 5.0% | ~119,000 | ~29,800 | ~13,200 | ~7,450 |
| 8.0% | ~72,100 | ~18,000 | ~8,010 | ~4,510 |
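A grid like this can be regenerated for any set of baselines and MDEs. The sketch below applies the per-variant formula from this page with the rounded z-scores 1.96 and 0.84; `sample_size_grid` is an illustrative helper name, and calculators that use a different formula variant or unrounded z-scores will return somewhat different figures.

```python
import math

def sample_size_per_variant(p, mde, z_alpha=1.96, z_beta=0.84):
    """Per-variant n for baseline rate p and relative MDE (both as decimals)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / (p * mde) ** 2)

def sample_size_grid(baselines, mdes):
    """Map each baseline to a list of per-variant sample sizes, one per MDE."""
    return {p: [sample_size_per_variant(p, m) for m in mdes] for p in baselines}

baselines = [0.015, 0.025, 0.035, 0.05, 0.08]
mdes = [0.05, 0.10, 0.15, 0.20]

grid = sample_size_grid(baselines, mdes)
print("Baseline | " + " | ".join(f"MDE {m:.0%}" for m in mdes))
for p, row in grid.items():
    print(f"{p:.1%} | " + " | ".join(f"{n:,}" for n in row))
```

Reading across a row shows the inverse-square effect of the MDE (the 5% column is about four times the 10% column); reading down a column shows the sample shrinking as the baseline rises.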
When the math says your test would take six months, you have four levers: pick a higher-traffic surface (PDP > checkout step 3), widen the MDE you are willing to detect, accept lower power (e.g. 70%), or pool similar variants. What you should not do is launch the test anyway and stop it early when it looks good — that is how false positives ship to production.
Sample size FAQ
Relative MDE (e.g. "a 10% lift on a 3% baseline") is the e-commerce standard and what most calculators expect. Absolute MDE (e.g. "+0.3 percentage points") is fine too, but make sure your calculator and your reporting use the same convention — mixing them silently is a common source of underpowered tests.
95% significance (α = 0.05) and 80% power are the conventional defaults and the right starting point for most A/B tests. Raise power to 90% if a false negative is genuinely expensive (e.g. you'd kill a real winner). Loosening significance below 90% is rarely worth it — you're trading test discipline for speed.
Do not stop a fixed-horizon frequentist test early just because it has reached significance: peeking inflates your false-positive rate well above the nominal 5%. If you need the option to stop early, run a sequential test (e.g. sequential probability ratio test) or a Bayesian setup explicitly designed for continuous monitoring.
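The inflation is easy to see in a simulated A/A test, where both arms share the same true conversion rate and every "significant" result is a false positive. The sketch below is illustrative only: the 10% conversion rate, 2,000 visitors per arm, ten evenly spaced looks, and the function names are all assumptions chosen to keep the run short.

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; 0.0 when the pooled variance is zero."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

def simulate_aa(p=0.10, n_per_arm=2000, looks=10, runs=500, z_crit=1.96, seed=1):
    """Return (any-look, final-look) false-positive rates for an A/A test."""
    rng = random.Random(seed)
    step = n_per_arm // looks
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        a = b = 0
        flagged = False
        for look in range(1, looks + 1):
            # Add one batch of visitors to each arm, then test the running totals.
            a += sum(rng.random() < p for _ in range(step))
            b += sum(rng.random() < p for _ in range(step))
            significant = abs(z_stat(a, look * step, b, look * step)) > z_crit
            flagged = flagged or significant
        peeking_hits += flagged       # "winner" called at any look
        fixed_hits += significant     # result at the final look only
    return peeking_hits / runs, fixed_hits / runs

peeking_rate, fixed_rate = simulate_aa()
print(f"any-look false-positive rate:   {peeking_rate:.1%}")
print(f"final-look false-positive rate: {fixed_rate:.1%}")
```

The final-look rate should land near the nominal 5%, while the any-look rate is typically several times higher, even though nothing differs between the two arms.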
Duration = sample size ÷ traffic per day, then rounded up to whole business cycles. Always run for at least one full week, ideally two, to absorb weekday/weekend behavior. A test that hits its sample size on a Wednesday should still run through the weekend before you call it.
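The rounding rule can be sketched as a small helper. The name `duration_days` and its parameters are illustrative, and the inputs in the example are hypothetical round numbers rather than measurements.

```python
import math

def duration_days(total_sample, visitors_per_day, cycle_days=7, min_cycles=1):
    """Days to run: raw duration rounded up to whole cycles, with a floor."""
    raw_days = total_sample / visitors_per_day
    cycles = max(min_cycles, math.ceil(raw_days / cycle_days))
    return cycles * cycle_days

# Hypothetical inputs: ~94,900 total visitors needed, 30,000 visitors per week.
days = duration_days(total_sample=94_900, visitors_per_day=30_000 / 7)
print(f"Run for {days} days ({days // 7} full weeks)")
```

Here about 3.2 weeks of raw traffic becomes a 4-week run once rounded up to whole weeks; setting `min_cycles=2` enforces the two-week floor for tests that would otherwise finish in a few days.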
To set the baseline conversion rate, pull the last 4-8 weeks of data for the exact surface and audience you'll test on, not site-wide CVR, which is usually misleading. If the page is brand new, run two weeks of traffic first to establish a baseline, then calculate sample size for the actual experiment.
The calculated n is per variant, and it assumes a 50/50 split, which is the most statistically efficient allocation. Uneven splits require a larger total sample to reach the same power (a 90/10 split needs roughly 2.8 times the total of a 50/50 split under the formula above), and most calculators assume equal allocation unless you tell them otherwise.
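The cost of an uneven split can be estimated by generalizing the formula. The sketch below assumes the same baseline variance p(1 − p) in both arms, as the formula on this page does; `total_sample_for_split` and `treatment_share` are illustrative names.

```python
import math

def total_sample_for_split(p, mde, treatment_share=0.5, z_alpha=1.96, z_beta=0.84):
    """Total visitors across both arms when `treatment_share` goes to the variant."""
    delta = p * mde
    # 1/share + 1/(1 - share) equals 4 at 50/50 and grows as the split gets lopsided.
    allocation_factor = 1 / treatment_share + 1 / (1 - treatment_share)
    total = (z_alpha + z_beta) ** 2 * p * (1 - p) * allocation_factor / delta ** 2
    return math.ceil(total)

even = total_sample_for_split(0.032, 0.10, treatment_share=0.5)
uneven = total_sample_for_split(0.032, 0.10, treatment_share=0.1)
print(f"50/50: {even:,} total; 90/10: {uneven:,} total ({uneven / even:.2f}x)")
```

At a 50/50 split the total is simply twice the per-variant n, so the helper reduces to the original formula in the equal-allocation case.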
Each additional variant adds another comparison, which inflates the family-wise error rate. Either apply a Bonferroni correction to your significance threshold (e.g. α = 0.05 ÷ 3 ≈ 0.017 for three comparisons against control) and recalculate, or use a proper multi-arm method. Both approaches push sample size up.
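To see how much a Bonferroni correction costs, recompute the significance z-score from the adjusted α and rerun the formula. This is a sketch under the same approximations as the formula on this page; `sample_size_bonferroni` and `comparisons` are illustrative names, and the z-scores are computed unrounded with the standard library.

```python
import math
from statistics import NormalDist

def sample_size_bonferroni(p, mde, comparisons, alpha=0.05, power=0.80):
    """Per-variant n after dividing alpha across the given number of comparisons."""
    adjusted_alpha = alpha / comparisons
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / (p * mde) ** 2
    return math.ceil(n)

single = sample_size_bonferroni(0.032, 0.10, comparisons=1)
triple = sample_size_bonferroni(0.032, 0.10, comparisons=3)
print(f"1 comparison:  {single:,} per variant")
print(f"3 comparisons: {triple:,} per variant ({triple / single:.2f}x)")
```

The per-variant requirement rises by about a third for three comparisons, and the total rises further because there are more arms to fill.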
Sample size can be calculated for revenue-per-visitor tests too, but the formula changes: revenue is continuous and high-variance, so you need the standard deviation of revenue per visitor, not just a baseline rate. Revenue tests typically need 2-5x the sample size of a CVR test on the same page; budget accordingly.
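For a continuous metric, a common textbook counterpart replaces p(1 − p) with the variance of the metric. The sketch below assumes equal variance in both arms and a normal approximation, which is rough for heavily skewed revenue data; the function name and the revenue figures are hypothetical.

```python
import math

def sample_size_continuous(mean, std_dev, mde, z_alpha=1.96, z_beta=0.84):
    """Per-variant n to detect a relative lift `mde` in a mean with the given std dev."""
    delta = mean * mde  # absolute difference in the mean to detect
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * std_dev ** 2 / delta ** 2)

# Hypothetical revenue per visitor: mean $2.50, standard deviation $18.00.
n = sample_size_continuous(mean=2.50, std_dev=18.00, mde=0.10)
print(f"{n:,} visitors per variant")
```

The ratio of standard deviation to mean drives the result: the wider the spread of order values relative to average revenue per visitor, the more traffic a revenue test needs.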
If you have 50,000 monthly PDP visitors at 3% CVR, a 4-week test can credibly detect about a 15% relative lift. Below 25,000 monthly visitors on the test surface, MDEs under 20% become impractical — pick bigger swings or test higher up the funnel.
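These thresholds come from solving the sample-size formula for the MDE instead of n. A minimal sketch, assuming a 50/50 split, the rounded z-scores 1.96 and 0.84, and a test that runs for about one month of traffic; `detectable_mde` is an illustrative name.

```python
import math

def detectable_mde(p, n_per_variant, z_alpha=1.96, z_beta=0.84):
    """Smallest relative lift detectable with n_per_variant visitors in each arm."""
    return math.sqrt(2 * (z_alpha + z_beta) ** 2 * (1 - p) / (p * n_per_variant))

for monthly_visitors in (50_000, 25_000):
    n = monthly_visitors // 2  # 50/50 split over a roughly four-week test
    print(f"{monthly_visitors:,} visitors/month -> MDE ~ {detectable_mde(0.03, n):.1%}")
```

Running the numbers this way before writing a hypothesis shows whether the expected lift is even detectable on the chosen surface.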
Sample size is the planning half of statistical analysis — it's what you compute before the test. The reporting half (significance, confidence intervals, effect sizes) is what you compute after. Skipping the planning half makes the reporting half meaningless, because you have no pre-committed stopping rule.