Sample Size Calculator
Work out how many visitors per variant your A/B test needs to detect a target lift at 80% power and 95% confidence — before you ship the experiment.
Sample Size Calculator
A tool that returns the visitors-per-variant an A/B test needs to detect a target lift at chosen power and significance levels.
A sample size calculator is the pre-flight check for any A/B test. You feed in your baseline conversion rate, the smallest lift you care about (the minimum detectable effect, or MDE), your significance threshold, and your statistical power. The calculator returns the number of visitors each variant needs before the result is trustworthy.
Running a test without this step is the most common source of false reads in CRO. Tests stopped early on a flattering p-value, or tests sized for a 20% lift when the real effect is 3%, both lead to the same outcome: shipping changes that don't actually move revenue.
A/B test sample size calculator
Inputs:
Baseline conversion rate (%): your current conversion rate, before the test.
Minimum detectable effect, relative (%): the smallest relative lift you want to detect. 10% means a 5% baseline lifting to 5.5%.
Significance level (α): 0.05 = 5% false-positive rate (standard).
Statistical power (1−β): 0.80 = 80% chance of detecting a true effect (standard).
Outputs:
Visitors needed per variant
Total visitors (control + variant)
Inputs assume a two-sided z-test for proportions. Daily visitor input is split evenly across variants to estimate test duration.
The widget above uses the standard two-proportion z-test, which is the right model for binary outcomes like add-to-cart, checkout completion, or email signup. For continuous metrics like average order value or revenue per visitor, the math changes — you'd use a t-test variant that factors in the variance of the metric, not just its mean.
The formula behind the number
n = 2 * ( (Z_alpha/2 + Z_beta)^2 * p_bar * (1 - p_bar) ) / (p1 - p2)^2
| Symbol | Name | Meaning |
|---|---|---|
| n | Sample size per variant | Visitors needed in each arm of the test |
| p1 | Baseline conversion rate | Your current conversion rate on the control |
| p2 | Treatment conversion rate | Baseline plus the minimum detectable effect |
| p_bar | Pooled rate | The average of p1 and p2 |
| Z_alpha/2 | Significance z-score | 1.96 for α = 0.05, two-sided |
| Z_beta | Power z-score | 0.84 for 80% power |
A Shopify apparel store wants to test a new product page layout. Baseline checkout conversion is 3.5%, and the team will only ship a change that delivers at least a 10% relative lift (so p2 = 3.85%).
Baseline conversion (p1): 3.5%
Target conversion (p2): 3.85%
Significance (α): 0.05 two-sided
Power (1 − β): 0.80
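Plugging in p_bar = 0.03675 and p2 − p1 = 0.0035: n = 2 × (1.96 + 0.84)² × 0.03675 × 0.96325 / 0.0035² ≈ 45,300
→ ≈ 45,300 visitors per variant (≈ 90,600 total)
At 4,000 daily sessions split evenly across two variants (2,000 per variant per day), the test needs roughly 23 days to reach that sample size, or four full weeks if you only stop on whole weeks. Shipping the winner earlier on a flattering interim p-value would be peeking, which pushes the false-positive rate well above the nominal 5%.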
Two intuitions are worth internalising from this formula. First, sample size scales with the inverse square of the effect you're trying to detect — halving your MDE quadruples the required visitors. Second, lower baseline conversion rates need more traffic, because the same relative lift represents a smaller absolute change that's harder to distinguish from noise.
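The formula and both intuitions can be checked with a few lines of Python. This is a minimal sketch, not the widget's actual implementation: the function name `sample_size_per_variant` is an illustrative choice, and it uses exact z-scores from the standard library rather than the rounded 1.96 and 0.84, so results can differ slightly from a hand calculation.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided, two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # treatment rate implied by the MDE
    p_bar = (p1 + p2) / 2                          # pooled rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    n_10 = sample_size_per_variant(0.035, 0.10)  # 3.5% baseline, 10% relative MDE
    n_05 = sample_size_per_variant(0.035, 0.05)  # same baseline, half the MDE
    print(n_10, n_05, round(n_05 / n_10, 2))
```

The printed ratio comes out close to 4 rather than exactly 4, because the pooled rate p_bar also shifts slightly when the MDE changes.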
How sample size scales with baseline CVR and MDE
Visitors per variant required at 80% power, α = 0.05 (two-sided), for common baseline conversion rates and minimum detectable effects.
| Baseline CVR | MDE 5% relative | MDE 10% relative | MDE 20% relative | MDE 30% relative |
|---|---|---|---|---|
| 1.5% | 422,000 | 108,000 | 28,300 | 13,100 |
| 2.5% | 251,000 | 64,100 | 16,800 | 7,800 |
| 3.5% | 177,000 | 45,300 | 11,800 | 5,500 |
| 5.0% | 122,000 | 31,200 | 8,150 | 3,780 |
| 8.0% | 73,800 | 18,900 | 4,920 | 2,270 |
| 12.0% | 47,000 | 12,000 | 3,120 | 1,440 |
The cell that usually surprises store owners: at a 2.5% baseline checkout rate, detecting a 5% relative lift needs roughly a quarter of a million visitors per variant, or about half a million across both arms. Most Shopify stores in the €1M–€5M revenue band don't have that traffic in a reasonable test window, which is why you usually need to test higher up the funnel (PDP click-through, add-to-cart), where baseline rates are higher and the same relative MDE needs fewer visitors.
Choosing inputs you can defend
MDE is the input most teams get wrong. The right number isn't "the lift I hope to see"; it's "the smallest lift that would still justify shipping this change." If a 3% checkout lift on your apparel store would pay for the engineering work and you'd happily ship it, set MDE at 3%, not at 10% just because the 3% test would take too long. A test sized for a 10% effect is underpowered for smaller, real wins and will usually miss them.
Power and significance are more conventional. Eighty percent power and 5% significance are the industry defaults and a reasonable starting point. Raise power to 90% when the cost of a missed winner is high (a homepage test you can't easily re-run); tighten significance to 1% when you're testing many variants in parallel and want to control the family-wise error rate.
Peeking inflates your false-positive rate
Checking the test daily and stopping the moment p drops below 0.05 is the single most common way A/B tests lie. Under repeated peeking at α = 0.05, your real false-positive rate can climb to 20–30%, depending on how many times you look. Commit to a sample size up front, run to completion, and only ship on the final read, or use a sequential test design that's built to be peeked at.
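A small simulation makes the inflation visible. The sketch below is an illustration under stated assumptions, not a model of any particular platform: it simulates A/A tests (no true effect), assumes traffic arrives in equal-sized batches between peeks, and uses the normal approximation, so the z-statistic after k batches is a sum of k independent standard normal draws divided by sqrt(k). The function name and parameters are hypothetical.

```python
import math
import random


def peeking_false_positive_rate(looks, simulations=20_000, z_crit=1.96, seed=0):
    """Share of A/A tests declared significant at any of `looks` equally spaced peeks."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(simulations):
        cumulative = 0.0
        for k in range(1, looks + 1):
            # Under the null, each batch adds an independent N(0, 1) increment,
            # so the z-statistic after k batches is cumulative / sqrt(k).
            cumulative += rng.gauss(0.0, 1.0)
            if abs(cumulative / math.sqrt(k)) > z_crit:
                false_positives += 1  # stopped early on a spurious "win"
                break
    return false_positives / simulations


if __name__ == "__main__":
    for looks in (1, 5, 20):
        print(looks, peeking_false_positive_rate(looks))
```

With a single look the rate stays near the nominal 5%; with 20 looks it lands in the 20–30% range described above.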
Sample size calculator FAQ
What's the difference between absolute and relative MDE?
Absolute MDE is the change in percentage points (3.5% → 4.5% is a 1.0-point absolute MDE). Relative MDE is the percentage change relative to baseline (3.5% → 4.5% is a 28.6% relative MDE). Most calculators, including this one, default to relative MDE because it's easier to reason about across different baselines.
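The conversion between the two is simple arithmetic. As a quick sketch (the helper name `mde_absolute_and_relative` is illustrative, not part of the calculator):

```python
def mde_absolute_and_relative(baseline, target):
    """Return (absolute MDE in percentage points, relative MDE as a fraction)."""
    absolute_points = (target - baseline) * 100
    relative = (target - baseline) / baseline
    return absolute_points, relative


abs_pts, rel = mde_absolute_and_relative(0.035, 0.045)
print(f"{abs_pts:.1f} points absolute, {rel:.1%} relative")
# prints: 1.0 points absolute, 28.6% relative
```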
Why do low baseline conversion rates need more traffic?
At low baselines, conversions are rare events and their variance dominates the signal. A 10% relative lift on a 1% baseline (1.0% → 1.1%) is a 0.1 percentage-point change, which is much harder to distinguish from random fluctuation than a 0.5-point change on a 5% baseline. The formula's pooled-variance term reflects this directly.
Should I use a one-sided or a two-sided test?
Use two-sided unless you have a strong prior reason to only care about lifts in one direction. Two-sided is more conservative and protects against the case where your change actively hurts conversion. It's the default in this calculator and in most experimentation platforms.
How do I size a test with more than two variants?
Each additional variant needs its own full sample, and you should tighten your significance threshold to control for multiple comparisons. A common Bonferroni-style correction: divide your α by the number of comparisons. With three variants vs control, run α = 0.05 / 3 ≈ 0.017.
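To see what the correction does to the significance z-score, here is a short sketch. The helper names `bonferroni_alpha` and `two_sided_z` are hypothetical, and the calculation assumes the plain Bonferroni division described above.

```python
from statistics import NormalDist


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison significance level under a Bonferroni correction."""
    return alpha / comparisons


def two_sided_z(alpha):
    """Critical z-score for a two-sided test at significance level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


adjusted = bonferroni_alpha(0.05, 3)  # three variants, each compared with control
print(round(adjusted, 4), round(two_sided_z(0.05), 2), round(two_sided_z(adjusted), 2))
```

The critical z-score rises from about 1.96 to about 2.39, so each comparison needs a larger sample than an uncorrected two-variant test would.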
How long should I run the test?
Size on a conservative daily-visitor estimate (e.g. the 25th percentile of recent daily traffic) so the test runs at least one full business cycle. Always run for whole weeks and never stop mid-week, because conversion patterns differ across weekdays and weekends.
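One way to turn that rule into a duration estimate is sketched below. The function names, the 14-day traffic history, and the 45,000-visitor requirement are all hypothetical values chosen for illustration, and the sketch assumes traffic is split evenly across variants.

```python
import math
from statistics import quantiles


def conservative_daily_visitors(daily_history):
    """25th percentile of recent daily traffic, used as a cautious planning figure."""
    return quantiles(daily_history, n=4, method="inclusive")[0]


def duration_days(n_per_variant, daily_visitors, variants=2):
    """Days needed to reach the per-variant sample, rounded up to whole weeks."""
    per_variant_per_day = daily_visitors / variants
    raw_days = math.ceil(n_per_variant / per_variant_per_day)
    return math.ceil(raw_days / 7) * 7


# Hypothetical two weeks of daily sessions.
history = [4200, 3900, 4100, 4600, 5000, 3600, 3400,
           4300, 4000, 4150, 4700, 5100, 3700, 3500]
daily = conservative_daily_visitors(history)
print(daily, duration_days(45_000, daily))  # 45,000 is a hypothetical requirement
```

Rounding up to a multiple of seven keeps every weekday equally represented in both arms.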
Can I stop the test early once it reaches significance?
No. Continuous peeking at a fixed α inflates the false-positive rate well above the nominal 5%. Either commit to the full sample size up front, or switch to a sequential testing method (such as group sequential boundaries or always-valid p-values) that's mathematically built for early stopping.
What's a realistic MDE to plan for?
Most winning tests on PDP or cart-page changes deliver 2–8% relative lifts; only large redesigns or pricing changes routinely move conversion 10%+. If your calculator says you need 500k visitors per variant, the issue is usually that you've set MDE too low for your traffic. Consider testing further up the funnel, where baselines are higher.
Can I use this calculator for revenue or average order value?
Not directly. Revenue and AOV are continuous metrics with their own variance, so they need a t-test (or bootstrap) approach that takes the standard deviation of the metric into account, not just its mean. The two-proportion z-test this widget uses is correct for any binary metric: conversion, click-through, add-to-cart, email signup.
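For a rough planning figure on a continuous metric, a common approach is the normal approximation to the two-sample t-test. The sketch below is an assumption-laden illustration, not something this widget does: it assumes equal arm sizes, equal variance in both arms, and a standard deviation you estimate from historical data. The function name and the €50 / €5 figures are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_continuous(std_dev, min_detectable_diff, alpha=0.05, power=0.80):
    """Approximate visitors per variant to detect a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance z-score
    z_beta = NormalDist().inv_cdf(power)           # power z-score
    n = 2 * (z_alpha + z_beta) ** 2 * std_dev ** 2 / min_detectable_diff ** 2
    return math.ceil(n)


# Hypothetical: AOV standard deviation of 50, smallest difference worth detecting of 5.
print(sample_size_continuous(50, 5))
```

Revenue metrics are often heavily skewed, so treat the result as a starting point and check it against a bootstrap on your own data.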
How does Metricuno help with sample sizing?
Metricuno pulls your historical GA4 baseline conversion rates automatically, so when you set up a test the platform pre-fills the calculator and warns you if your declared MDE is unrealistic given your traffic. You set the experiment to ship only when the sample is hit, so no manual cut-off discipline is required.
What if I don't have enough traffic to reach the required sample size?
You have three honest options: raise MDE (only test bigger changes), test higher up the funnel where baseline rates are larger, or accept lower power (say 70%) knowing you'll miss more true winners. Don't loosen the significance threshold to 10%: that just trades a real problem for a hidden one.