Statistical Analysis
A working framework for the statistics that decide whether a winning variant is real — significance, power, sample size, and the validity traps that quietly inflate false positives.
Statistical Analysis (in experimentation)
The math that decides whether an A/B test result is a real effect or random noise.
Statistical analysis in experimentation is the set of methods that turn raw conversion counts into a defensible decision: ship the variant, kill it, or keep testing. It quantifies how likely the observed difference between control and variant would be if the variant had no real effect, and how confident you can be in the size of that effect.
In practice it spans four jobs — picking a decision rule (frequentist significance or a Bayesian posterior), sizing the test before it starts (power and sample size), interpreting the result (p-values and confidence intervals), and protecting it from threats to validity (peeking, sample-ratio mismatch, novelty effects). Get those four right and your win rate stops being a coin flip.
Most A/B test disappointments are not bad ideas; they are statistical errors dressed up as wins. At a typical low single-digit conversion rate, a 12% lift on 800 sessions per variant is statistically indistinguishable from noise, but the dashboard shows green and the variant ships. Three weeks later, revenue is flat and nobody can explain why.
The fix is not more statistics jargon. It is a small, repeatable framework that lives behind every test you run: a decision rule you committed to before launch, a sample size you actually reached, and a checklist of validity threats you ruled out. The rest of this page walks through those three layers.
Layer 1: The decision rule
Pick how you will decide before you start the test. The two mainstream options are frequentist null-hypothesis testing — which produces a p-value and a confidence interval — and Bayesian testing, which produces a posterior probability that the variant beats control. Both are defensible. Mixing them mid-test is not.
Frequentist testing is the default in most A/B tools: you set a significance threshold (typically α = 0.05), a power target (typically 80%), and a minimum detectable effect, then read the p-value at the planned end date. Bayesian testing replaces the p-value with statements like "94% probability variant B is better," which read more naturally to stakeholders but require sensible priors and the same discipline on sample size.
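To make the two decision rules concrete, here is a minimal sketch that reads the same data both ways. It assumes a two-sided two-proportion z-test for the frequentist side and uniform Beta(1, 1) priors for the Bayesian side; the function names and the conversion counts are hypothetical, and a real testing tool may use different tests or priors.

```python
import math
import random
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for a difference in conversion rates, plus a Wald interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval on the difference.
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return p_value, (diff - z_crit * se_diff, diff + z_crit * se_diff)


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: 24 vs 27 conversions on 800 sessions each (a 12.5% relative lift).
p_value, ci = two_proportion_test(24, 800, 27, 800)
posterior = prob_b_beats_a(24, 800, 27, 800)
print(f"p-value = {p_value:.2f}, 95% CI for the difference = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"P(B beats A) = {posterior:.2f}")
```

On these made-up counts the p-value sits far above 0.05, the interval straddles zero, and the posterior probability is only modestly above one half. Both rules say the same thing: there is not enough data to call a winner.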
Layer 2: Planning the sample size
A test that ends underpowered is not a test — it is a guess with a confidence interval wide enough to drive a truck through. Before you launch, run a power analysis using four numbers: your baseline conversion rate, the minimum detectable effect you care about, your significance level, and your target power. The output is the required sample size per variant.
The non-obvious bit: required sample size scales roughly with 1/MDE². Halving the effect you want to detect quadruples the traffic you need. On a Shopify apparel store doing 4,000 checkouts per week, detecting a 2% lift on a 3% baseline conversion rate takes about six weeks per variant — not the "we'll check Friday" timeline most teams assume.
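The 1/MDE² scaling can be checked with the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions: the MDE is treated as a relative lift on the baseline, the test is two-sided, and `sample_size_per_variant` is a hypothetical helper, not any tool's calculator. Commercial calculators may use slightly different formulas and return somewhat different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate sessions per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # assumption: MDE is a relative lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# 3% baseline, 95% confidence, 80% power; halve the MDE at each step.
for mde in (0.20, 0.10, 0.05):
    n = sample_size_per_variant(0.03, mde)
    print(f"relative MDE {mde:.0%}: {n:,} sessions per variant")
```

Each halving of the MDE roughly quadruples the required sessions per variant, which is why small effects are so expensive to detect.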
Don't peek at uncorrected p-values
Checking a frequentist test every day and stopping the first time p < 0.05 inflates your real false-positive rate from 5% to roughly 25-30%. If you need early-stopping, use sequential testing or always-valid p-values designed for it — not the default test, eyeballed daily.
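The inflation is easy to reproduce with a small simulation of A/A tests, where there is no real difference and every "win" is a false positive. The sketch below makes simplifying assumptions: equal-sized daily batches, a z-statistic that accumulates independent standard-normal increments, and a hypothetical 28-day run. The exact rate depends on how many times you look.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(looks, alpha=0.05, simulations=4_000, seed=1):
    """Share of A/A tests flagged significant at ANY of `looks` equally spaced checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    flagged = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            # Each day's batch adds an independent standard-normal increment,
            # so the cumulative z-statistic after `look` days is sum / sqrt(look).
            running_sum += rng.gauss(0.0, 1.0)
            if abs(running_sum / math.sqrt(look)) > z_crit:
                flagged += 1  # stopped at the first "significant" peek
                break
    return flagged / simulations


print(f"one planned look: {false_positive_rate(1):.1%}")
print(f"28 daily peeks:   {false_positive_rate(28):.1%}")
```

With a single planned look the rate stays near the nominal 5%; with daily peeks and stop-on-first-significance it climbs to several times that.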
Layer 3: Protecting validity
Statistical significance is a necessary condition for a win, not a sufficient one. A test can be perfectly significant and still wrong if traffic was split unevenly (sample-ratio mismatch), if one variant loaded slower, if the test ran through Black Friday, or if users saw the variant on mobile but the control on desktop. Experiment validity is the audit layer that catches these before you ship.
Build a pre-ship checklist: no sample-ratio mismatch (an SRM chi-square p-value below 0.01 is a red flag), no overlap with other live tests on the same surface, a minimum run length of one full business cycle (usually two weeks), and segment-level sanity checks on the top three traffic sources. If any one fails, the headline number is suspect: re-run rather than rationalise.
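The SRM item on the checklist is a chi-square goodness-of-fit test on the session counts per arm. A minimal sketch, assuming a two-arm test with a known intended split; `srm_p_value` and the counts are hypothetical, and the 0.01 cut-off is the threshold named above:

```python
import math


def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the traffic split."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom the chi-square tail probability equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for a test configured as a 50/50 split.
print(srm_p_value(50_000, 50_400))  # p is well above 0.01: no evidence of mismatch
print(srm_p_value(50_000, 51_500))  # p is far below 0.01: investigate before trusting the result
```

A low p-value here does not say which arm is wrong, only that the split is unlikely to be the one you configured. The usual next step is to look for redirects, bot filtering, or tracking loss that affects one arm more than the other.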
Required sample size per variant by minimum detectable effect (3% baseline conversion, 95% confidence, 80% power)
Statistical analysis FAQ
A p-value of 0.04 means that if the variant had no real effect, you'd see a difference at least this large about 4% of the time by random chance alone. It is not the probability the variant is better, and it says nothing about how big the lift is, so read it alongside the confidence interval.
For most e-commerce decisions on reversible changes, 95% confidence with 80% power is enough; it is the working standard. For high-stakes, hard-to-reverse changes (checkout flow rewrites, pricing), tighten to 99% confidence and 90% power, and accept the longer runtime.
Significance asks "is this difference too large to be chance?" Validity asks "is the difference measuring what we think it is?" A test can be significant but invalid (sample-ratio mismatch, contamination, novelty bias). Both have to clear before you ship.
Choose either frequentist or Bayesian testing and stay consistent. Frequentist is the default in most platforms and what stats textbooks teach. Bayesian testing communicates more naturally ("93% probability B wins"), handles early-stopping more cleanly, and is worth adopting if your team finds p-values genuinely confusing, but it doesn't fix underpowered tests.
Peeking at results is risky because checking repeatedly inflates the false-positive rate well above your stated 5%. If you want to stop early, use sequential testing or always-valid p-values, which are designed to keep the error rate honest across many peeks.
Run a power analysis with four inputs: baseline conversion rate, minimum detectable effect, significance level (usually 0.05), and power (usually 0.80). Most testing tools include a calculator; the output is required sessions per variant. Required size scales with 1/MDE², so small effects need a lot of traffic.
For typical Shopify and WooCommerce stores, 5-10% relative lift is realistic for meaningful UX changes; below 3% requires enterprise-scale traffic to detect reliably. Plan tests around effects worth shipping — there's no point detecting a 0.5% lift you couldn't act on.
A false positive is when a test declares a winner that has no real effect. At α = 0.05 you accept a 5% rate per test by design, so if you run 20 tests on changes that have no real effect, expect roughly one to be declared a winner anyway. This is why post-launch holdouts and re-tests of big wins matter.
An automated winner flag can still be wrong: most tools report significance but don't enforce sample-size pre-commitment, validity checks, or peeking corrections. Treat the flag as a prompt to run the validity checklist, not as the decision itself.
Run a test long enough to reach the planned sample size and cover at least one full business cycle (usually two weeks for e-commerce, to catch weekday vs weekend behaviour). Ending early, even at significance, risks day-of-week bias and novelty effects skewing the result.
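The run-length rule combines the two constraints above: reach the planned sample size, and never run for less than a full business cycle. A small sketch, assuming traffic is split evenly across variants and rounding up to whole weeks so that every weekday is covered equally; `planned_run_weeks` and the traffic figures are hypothetical.

```python
import math


def planned_run_weeks(sessions_per_variant, weekly_sessions, variants=2, min_weeks=2):
    """Whole weeks needed to reach the planned sample size, never below `min_weeks`."""
    weekly_per_variant = weekly_sessions / variants
    weeks_for_sample = math.ceil(sessions_per_variant / weekly_per_variant)
    return max(weeks_for_sample, min_weeks)


# Hypothetical store: 30,000 eligible sessions per week split across two variants.
print(planned_run_weeks(54_000, 30_000))  # 4
print(planned_run_weeks(14_000, 30_000))  # 2
```

In the second case the sample size would be reached within the first week, but the two-week floor still applies.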