How to use Power Analysis
Power analysis tells you whether a test can actually detect the effect you care about. This guide covers the math, the 80% convention, and how to size experiments on real DTC traffic.
Power Analysis
The calculation of how likely an A/B test is to detect a real effect — industry standard is 80% power.
Power analysis is the discipline of working out, before a test runs, whether it can actually find the effect you care about. Statistical power is the probability that a test correctly rejects the null hypothesis when there is a true difference between variants — in plain terms, the odds you'll spot a winner that really exists.
The convention is 80% power, meaning if the new variant truly beats control by at least the effect size you planned for, you have an 80% chance of detecting it at your chosen significance level. The further you drop below that, the closer you get to flipping a coin on real wins. Power analysis turns that goal into a concrete sample size, given your baseline conversion rate and the smallest effect worth caring about.
Most underperforming A/B test programs aren't undone by bad hypotheses — they're undone by underpowered tests. A test with 40% power isn't a test; it's a coin flip dressed up as data. You'll ship inconclusive results, kill real winners, and burn weeks of traffic doing it.
Power analysis sits inside the broader discipline of statistical analysis, alongside significance testing and confidence intervals. It's the planning step: before you launch, you decide what effect size you can realistically detect with the traffic you have.
What statistical power actually means
Every A/B test has two ways it can go wrong. A false positive (Type I error) means calling a winner when the variants are actually identical — controlled by your significance level, usually 5%. A false negative (Type II error) means missing a real winner — controlled by power.
Power is defined as 1 minus the false-negative rate. At 80% power, you accept a 20% chance of missing a true effect of the size you specified. Push to 90% power and you cut that miss rate to 10%, but you need roughly 33% more traffic to get there.
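The traffic cost of moving from 80% to 90% power can be checked with a few lines of Python. This is a minimal sketch, assuming a two-sided test and the usual normal approximation, under which required sample is proportional to the squared sum of two z-scores; the helper name `z_multiplier` is illustrative, not part of any tool.

```python
from statistics import NormalDist

def z_multiplier(alpha: float, power: float) -> float:
    """Return (z_{1-alpha/2} + z_{power})^2, the factor sample size scales with."""
    z = NormalDist()  # standard normal distribution
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2

# Ratio of required sample at 90% power versus 80% power, both at alpha = 5%.
ratio = z_multiplier(0.05, 0.90) / z_multiplier(0.05, 0.80)
print(f"Extra sample needed for 90% power: {ratio - 1:.0%}")
```

The ratio comes out at about one third more sample, whatever the baseline or MDE, because those inputs cancel out of the comparison.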
The number that matters most in power analysis is the minimum detectable effect (MDE) — the smallest lift you want to be able to spot reliably. A test sized for a 10% relative lift will routinely miss a 3% lift, even though a 3% lift compounded across a checkout funnel is genuinely valuable revenue.
The 'we ran it for two weeks' trap
Runtime is not power. A test can run for a month and still be hopelessly underpowered if traffic is thin or the effect is small. Always size by required sample, not by calendar time — then convert sample to runtime using your actual weekly sessions.
The four levers that decide your sample size
Sample size in a two-variant test is driven by four inputs: baseline conversion rate, minimum detectable effect, significance level, and power. Change any one and the required sample shifts — sometimes dramatically. Halving your MDE roughly quadruples the sample you need.
Baseline conversion rate matters because lower baselines are noisier in relative terms. A checkout page converting at 60% needs far less traffic to detect a 5% relative lift than a product page converting at 3%. That's why bottom-of-funnel tests usually reach significance faster than top-of-funnel ones.
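To make the four levers concrete, here is a rough sketch of the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_variant` and its defaults are assumptions for illustration; commercial calculators may use slightly different variants of the formula (pooled variance, continuity corrections), so expect small differences.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided, two-proportion z-test (approximate)."""
    p1 = baseline                        # control conversion rate
    p2 = baseline * (1 + relative_mde)   # variant rate at the minimum detectable effect
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # significance lever
    z_beta = z.inv_cdf(power)            # power lever
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Halving the MDE on the same baseline roughly quadruples the sample.
n_10 = sample_size_per_variant(0.05, 0.10)
n_5 = sample_size_per_variant(0.05, 0.05)
print(f"5% lift needs {n_5 / n_10:.1f}x the sample of a 10% lift")

# A high-baseline page needs far less traffic than a low-baseline one.
print(sample_size_per_variant(0.60, 0.05) < sample_size_per_variant(0.03, 0.05))
```

Changing any single argument while holding the others fixed shows how sensitive the required sample is to that lever.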
[Figure: statistical power vs. sample size per variant. Two curves: 5% baseline detecting a 10% relative lift, and 5% baseline detecting a 5% relative lift.]
Notice the diminishing-returns curve: the sample that gets you to 50% power has to roughly double to reach 80%, and going from 80% to 90% costs about a third more again. That's why pushing past 90% power rarely pays for itself: the runtime cost outpaces the marginal reduction in miss rate.
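The shape of that curve can be traced numerically. The sketch below inverts the usual sample-size formula to get approximate power for a given sample; `power_for_sample` is a hypothetical helper, it assumes a two-sided test with the normal approximation, and it ignores the negligible chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist

def power_for_sample(n_per_variant, baseline, relative_mde, alpha=0.05):
    """Approximate power of a two-sided, two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    # Standard error of the difference in conversion rates.
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

# 5% baseline, 10% relative lift, at a few illustrative sample sizes.
for n in (5_000, 15_000, 30_000, 45_000):
    print(f"{n:>6} visitors per variant -> power {power_for_sample(n, 0.05, 0.10):.0%}")
```

Printing power at evenly spaced sample sizes makes the flattening visible: each extra block of traffic buys less power than the one before.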
How much traffic do you actually need?
The table below shows sample size per variant for an 80%-power, 95%-confidence two-sided test, across common baseline conversion rates and minimum detectable effects. These are the numbers most experimentation tools quietly produce when you click 'calculate sample size'.
Read it as a sanity check against your weekly traffic. If your product detail page sees 20,000 visitors a week and you want to detect a 5% relative lift on a 3% baseline, you need roughly 208,000 visitors per variant. With traffic split evenly across two variants, that is about 21 weeks of runtime, before you even account for seasonality or weekly cycles.
Visitors required per variant for 80% power, 95% confidence (two-sided test, normal approximation for two proportions, lifts are relative)
| Baseline conversion rate | Detect 20% lift | Detect 10% lift | Detect 5% lift | Detect 2% lift |
|---|---|---|---|---|
| 1% (cold traffic) | ~43,000 | ~163,000 | ~637,000 | ~3,900,000 |
| 3% (PDP add-to-cart) | ~13,900 | ~53,000 | ~208,000 | ~1,280,000 |
| 5% (collection page) | ~8,200 | ~31,000 | ~122,000 | ~753,000 |
| 15% (checkout step) | ~2,400 | ~9,300 | ~36,000 | ~224,000 |
| 50% (returning users) | ~390 | ~1,600 | ~6,300 | ~39,000 |
Two things jump out. First, tests on rare events (cold-traffic conversion, refunds, support contacts) need eye-watering volume, often more than a mid-market store sees in a quarter. Second, micro-conversions further down the funnel are much cheaper to test on, which is why the quickest, cheapest wins usually live in cart and checkout.
Fixing an underpowered test program
If your sample sizes are dwarfing your traffic, the answer isn't to lower power below 80% — it's to change what you're testing. Move measurement upstream to a higher-volume event (add-to-cart instead of purchase), increase the MDE you target by testing bolder changes, or stop testing pages that simply don't get enough traffic to ever resolve.
Another lever is segmenting smarter. Running one test across mobile and desktop dilutes any device-specific effect; running parallel tests by device sometimes finds wins that aggregated tests bury in the average. The catch is that each segment needs its own sample budget — so this only helps where total traffic is high.
A working rule of thumb
Before launching any test, write down the required sample per variant, multiply it by the number of variants, and divide by your average weekly sessions over the last 4 weeks on the page being tested. If that gives you more than 4 weeks of runtime, redesign the test before you ship it: bolder hypothesis, earlier funnel event, or higher-traffic page. Don't run tests you can't finish.
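One way to turn this rule of thumb into a pre-launch check is sketched below. It assumes traffic is split evenly across variants and that recent weekly sessions are a fair guide to the coming weeks; the function names and the example figures are hypothetical.

```python
from math import ceil

def weeks_to_finish(sample_per_variant, weekly_sessions, variants=2):
    """Whole weeks needed to collect the full sample across all variants."""
    total_needed = sample_per_variant * variants
    return ceil(total_needed / weekly_sessions)

def should_redesign(sample_per_variant, weekly_sessions, variants=2, max_weeks=4):
    """True when the test cannot finish within the runtime budget."""
    return weeks_to_finish(sample_per_variant, weekly_sessions, variants) > max_weeks

# Hypothetical test: 10,000 visitors per variant on a page with 4,000 sessions a week.
print(weeks_to_finish(10_000, 4_000))   # weeks of runtime
print(should_redesign(10_000, 4_000))   # over the 4-week budget?
```

Rounding up to whole weeks keeps the test aligned with weekly traffic cycles rather than stopping mid-week.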
Frequently asked questions
The 80% standard is a convention, not a law. Cohen proposed it in the 1960s as a reasonable balance between false-negative risk and the cost of collecting data. In practice, 80% power means accepting a 1-in-5 chance of missing a real winner, which most teams find tolerable. Higher-stakes tests (pricing, checkout) sometimes use 90%.
Significance answers 'is this result unlikely under the null?' after the test runs. Power analysis answers 'will this test be able to detect a real effect?' before the test runs. You need both: significance protects you from false positives, power protects you from false negatives.
You can run tests at lower power, say 60%, but then you'll miss 40% of real winners, which destroys the economics of your test program over time. The hidden cost of a low-power test isn't the test that runs; it's the winning variant you killed that you'll never know existed. Change what you test, not your power threshold.
Sample size scales roughly with 1/MDE². Halving the MDE quadruples the required sample. That's why detecting a 2% relative lift on a low-baseline conversion is so expensive — the math is unforgiving at small effects.
Power analysis also works for revenue metrics, but the formula changes. For revenue or AOV you size based on the metric's mean and standard deviation, not a conversion rate. Variance in order value is usually high, which is why revenue tests typically need 3-5x the sample of conversion-rate tests on the same page.
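For a continuous metric, a minimal sketch of the usual two-sample formula looks like this. It assumes both variants share the same standard deviation and that the sample mean is approximately normal; the function name and the AOV figures in the example are made up for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(mean, std_dev, relative_mde, alpha=0.05, power=0.80):
    """Observations per variant to detect a relative lift in a mean (approximate)."""
    delta = mean * relative_mde          # absolute difference to detect
    z = NormalDist()
    k = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return ceil(2 * k * std_dev ** 2 / delta ** 2)

# Hypothetical AOV of 80 with a standard deviation of 60, detecting a 5% lift.
print(sample_size_for_mean(mean=80, std_dev=60, relative_mde=0.05))
```

The result is a count of observations of the metric (orders, in the AOV case), so it still has to be converted into visitors using the conversion rate. For heavy-tailed revenue data the normal assumption is shaky, and the output is better treated as a lower bound.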
Sequential testing (with proper corrections like always-valid p-values) lets you peek at results without inflating false-positive rates. It can shorten runtime when effects are large, but it doesn't change the underlying power-versus-sample relationship — small effects still need lots of data.
Without a sequential-testing framework, peeking and stopping inflates your false-positive rate dramatically — a nominal 5% significance level can balloon to 15-25% in practice. The variants that look significant early are disproportionately noise that hasn't regressed to the mean yet.
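The inflation from peeking can be illustrated with a small simulation. This sketch is a simplification: rather than simulating visitors, it simulates the z-statistic of an A/A test directly as a running sum of standard normal increments, checked at equally spaced looks. The function name, the number of looks, and the seed are arbitrary choices for the illustration.

```python
import random
from statistics import NormalDist

def false_positive_rate(looks, simulations=5000, alpha=0.05, seed=0):
    """Share of A/A tests declared significant when stopping at the first 'win'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(simulations):
        running = 0.0
        for k in range(1, looks + 1):
            running += rng.gauss(0, 1)            # new batch of data, no true effect
            if abs(running / k ** 0.5) > z_crit:  # z-statistic at look k
                hits += 1                         # stopped early on a false positive
                break
    return hits / simulations

fixed = false_positive_rate(looks=1)     # one analysis at the end
peeking = false_positive_rate(looks=10)  # ten interim checks, stop on significance
print(f"single look: {fixed:.1%}, ten looks: {peeking:.1%}")
```

With a single look the rate stays near the nominal 5%; with repeated looks and early stopping it climbs well above it, which is the effect sequential methods are designed to correct.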
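With more than two variants, each pairwise comparison against control needs its own sample, and you typically apply a multiple-comparisons correction (Bonferroni, Holm) to the significance level. With a Bonferroni correction, a test with three challengers and one control needs roughly a third more traffic per variant than a two-variant test to maintain the same power against control, and close to 2.7x the total traffic because there are four arms to fill.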
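A rough sketch of how a Bonferroni correction changes the per-variant sample is below. It assumes each challenger is compared only against control, a two-sided test, and the normal approximation; `bonferroni_multiplier` is an illustrative name, and Holm's procedure would be somewhat less conservative than what is shown.

```python
from statistics import NormalDist

def bonferroni_multiplier(challengers, alpha=0.05, power=0.80):
    """Per-variant sample relative to a plain two-variant test at the same power."""
    z = NormalDist()
    z_beta = z.inv_cdf(power)
    base = (z.inv_cdf(1 - alpha / 2) + z_beta) ** 2
    # Bonferroni: split alpha evenly across the comparisons against control.
    adjusted = (z.inv_cdf(1 - alpha / (2 * challengers)) + z_beta) ** 2
    return adjusted / base

for challengers in (1, 2, 3, 4):
    per_variant = bonferroni_multiplier(challengers)
    # Total traffic also grows with the number of arms (challengers + control).
    total = per_variant * (challengers + 1) / 2
    print(f"{challengers} challenger(s): {per_variant:.2f}x per variant, {total:.2f}x total")
```

The per-variant multiplier grows slowly, so most of the extra cost of a multi-variant test comes from having more arms to fill rather than from the correction itself.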
Standard power formulas assume the sampling distribution of the test statistic is approximately normal, which holds for conversion rates above ~1% with reasonable sample sizes (Central Limit Theorem). For very rare events or heavy-tailed metrics like revenue, simulation-based power analysis is more reliable than closed-form formulas.
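Simulation-based power analysis can be sketched in plain Python: generate many fake experiments with a known true lift, run the planned test on each, and count how often it detects the lift. The example below uses a conversion metric and a pooled two-proportion z-test purely for illustration; the function name, sample size, and seed are assumptions, and for revenue you would swap the data generator for draws from your own order-value history.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulated_power(baseline, relative_mde, n_per_variant,
                    simulations=1000, alpha=0.05, seed=0):
    """Estimate power by simulating experiments where the lift is real."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_control = baseline
    p_variant = baseline * (1 + relative_mde)
    detections = 0
    for _ in range(simulations):
        # Draw conversions for each arm, one visitor at a time.
        conv_c = sum(rng.random() < p_control for _ in range(n_per_variant))
        conv_v = sum(rng.random() < p_variant for _ in range(n_per_variant))
        pooled = (conv_c + conv_v) / (2 * n_per_variant)
        se = sqrt(2 * pooled * (1 - pooled) / n_per_variant)
        if se == 0:
            continue  # no variation observed, nothing to test
        z_stat = (conv_v - conv_c) / n_per_variant / se
        if abs(z_stat) > z_crit:
            detections += 1
    return detections / simulations

# 50% baseline, 20% relative lift, 385 visitors per variant.
estimate = simulated_power(0.50, 0.20, n_per_variant=385)
print(f"Estimated power: {estimate:.0%}")
```

Because the estimate is itself based on random draws, it carries sampling noise of a percentage point or two at 1,000 simulations; more simulations tighten it at the cost of runtime.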
To set your baseline, pull the last 4-8 weeks of data for the exact metric and segment you'll be testing, not a sitewide average. If the page is new or the metric is volatile, run a 1-2 week observation period before designing the test. Sizing on a wrong baseline can leave your sample estimate off by 2-3x in either direction.