Statistical Significance
Statistical significance tells you how likely an A/B test result is to be a real effect rather than random noise. Here's how p-values work, why 0.05 is the convention, and what sample size you need.
Statistical Significance
The probability that an observed test result occurred by chance under the null hypothesis — conventionally 'significant' when p < 0.05.
Statistical significance is the formal way experimenters decide whether a difference between two variants is real or a roll of the dice. It is expressed as a p-value: the probability of seeing a result at least as extreme as the observed one if the variants actually performed identically. A p-value below a pre-set threshold — usually 0.05, meaning a 5% false-positive risk — is reported as 'statistically significant'.
In an e-commerce A/B test, significance is what separates a checkout redesign that genuinely lifts conversion from a two-day spike caused by a paid-traffic burst. It is one of the core outputs of any statistical analysis applied to experimentation data.
The 5% threshold is a convention, not a law. R.A. Fisher proposed it in the 1920s as a reasonable cut-off for agricultural trials, and the rest of science copied it. In practice, many CRO teams use 90% confidence (p < 0.10) on cheap, reversible tests, and 99% (p < 0.01) on changes that touch pricing or core checkout logic where a false positive is expensive.
Significance is not effect size. A test on 800,000 sessions can flag a 0.1% lift as 'highly significant' yet the lift is too small to matter commercially. Always read significance alongside the absolute uplift and the confidence interval — the range of plausible true effects given your data.
z = (p_B - p_A) / sqrt( p_pool * (1 - p_pool) * (1/n_A + 1/n_B) )
p_A
Control conversion rate
Conversions in variant A divided by sessions in variant A.
p_B
Variant conversion rate
Conversions in variant B divided by sessions in variant B.
p_pool
Pooled conversion rate
Total conversions across both variants divided by total sessions.
n_A
Control sample size
Number of sessions assigned to variant A.
n_B
Variant sample size
Number of sessions assigned to variant B.
z
z-score
Convert to a p-value via the standard normal distribution. |z| ≥ 1.96 corresponds to p < 0.05 (two-tailed).
An apparel store on Shopify tests a sticky 'Add to bag' button against the standard one.
Control sessions (n_A): 18,000
Control conversions: 540 (3.00%)
Variant sessions (n_B): 18,000
Variant conversions: 612 (3.40%)
→ z ≈ 2.21, p ≈ 0.027
p < 0.05, so the 0.40-percentage-point lift is statistically significant at the conventional threshold. The team can ship the sticky button with a 2.7% chance the lift was random noise.
A common misread: 'p = 0.027 means there's a 97.3% chance the variant is better.' It doesn't. The p-value is the probability of the data given the null hypothesis, not the probability of the hypothesis given the data. The practical takeaway is similar in most cases, but the distinction matters when you start peeking at results mid-test.
Sessions per variant needed to detect a given relative lift at 95% confidence and 80% power, by baseline conversion rate.
| Baseline conversion rate | Detect +5% lift | Detect +10% lift | Detect +20% lift |
|---|---|---|---|
| 1.5% (cold traffic landing page) | ~210,000 | ~53,000 | ~13,500 |
| 3.0% (apparel store homepage) | ~103,000 | ~26,000 | ~6,600 |
| 5.0% (beauty PDP, warm traffic) | ~60,000 | ~15,200 | ~3,900 |
| 8.0% (returning-customer checkout) | ~36,500 | ~9,200 | ~2,400 |
Significance assumes you fixed your sample size before the test and looked once at the end. Peeking — checking the p-value daily and stopping when it crosses 0.05 — inflates your false-positive rate well above 5%. If you want to look early, use sequential testing or Bayesian methods designed for it.
Frequently asked questions
If the two variants truly performed identically, you would see a difference at least as large as the one you observed less than 5% of the time. It is a statement about the data under the null hypothesis, not about the probability that your variant is better.
Yes — '95% confidence' and 'p < 0.05' are two ways of expressing the same threshold. Confidence intervals and p-values are mathematically linked outputs of the same statistical analysis.
Often. A 0.2% lift on a high-traffic homepage can clear p < 0.05 yet not justify the engineering cost or the risk of cannibalising another metric. Always weigh significance against absolute effect size and business impact.
Either the true effect is smaller than your test was powered to detect, or there's no real effect. Calculate the minimum detectable effect for your sample, and call the test inconclusive rather than 'negative' if the confidence interval still spans zero meaningfully.
Long enough to hit your pre-calculated sample size, and at least one full business cycle (typically 7-14 days) to absorb day-of-week effects. Decide both upfront; don't stop early just because the p-value crossed 0.05.
Two-tailed is the safer default — it accounts for the possibility that your variant could be worse, not just better. Most experimentation platforms report two-tailed p-values by default.
Significance (alpha) controls false positives — calling a flat result a winner. Power (1 - beta) controls false negatives — missing a real winner. Standard practice is 95% significance with 80% power, meaning you accept a 5% false-positive rate and a 20% miss rate.
It replaces them with a different metric — typically 'probability variant B is better than A'. Bayesian methods handle early stopping more gracefully, but you still need a decision threshold (e.g. 95% probability to ship) and a pre-defined sample budget.
Segment-level results have smaller samples and wider confidence intervals, so significance is harder to reach. Mobile may also have genuinely different behaviour. Pre-register segments you care about and power the test for the smallest segment, not the overall.
Not with classical (frequentist) testing — peeking inflates your real false-positive rate to 20-30%. If you need to monitor and stop early, use sequential testing (e.g. group sequential or always-valid p-values) designed to preserve the error rate under repeated looks.
Test ideas before you ship them
Run unlimited A/B tests, attach hypotheses to outcomes, and build a searchable archive of what works — and what doesn't.