Statistical Significance

Metricuno
May 19, 2026
Quick answer

Statistical significance tells you whether an A/B test result is larger than random noise alone would plausibly produce. Here's how p-values work, why 0.05 is the convention, and what sample size you need.

Definition

A measure of how surprising an observed test result would be if the null hypothesis were true, expressed as a p-value and conventionally called 'significant' when p < 0.05.

Statistical significance is the formal way experimenters decide whether a difference between two variants is real or a roll of the dice. It is expressed as a p-value: the probability of seeing a result at least as extreme as the observed one if the variants actually performed identically. A p-value below a pre-set threshold — usually 0.05, meaning a 5% false-positive risk — is reported as 'statistically significant'.

In an e-commerce A/B test, significance is what separates a checkout redesign that genuinely lifts conversion from a two-day spike caused by a paid-traffic burst. It is one of the core outputs of any statistical analysis applied to experimentation data.

Also known as
p-value significance
statistical confidence

The 5% threshold is a convention, not a law. R.A. Fisher proposed it in the 1920s as a reasonable cut-off for agricultural trials, and the rest of science copied it. In practice, many CRO teams use 90% confidence (p < 0.10) on cheap, reversible tests, and 99% (p < 0.01) on changes that touch pricing or core checkout logic where a false positive is expensive.
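
The confidence levels mentioned above each map to a cut-off on the z-score. A minimal sketch of that mapping, assuming a two-tailed test and using only Python's standard library (the helper name `critical_z` is illustrative):

```python
from statistics import NormalDist

def critical_z(confidence: float) -> float:
    """Two-tailed critical z-value for a confidence level such as 0.95."""
    alpha = 1.0 - confidence
    # Split alpha across both tails, then invert the standard normal CDF.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: p < {1 - level:.2f}, |z| >= {critical_z(level):.3f}")
# 90% confidence: p < 0.10, |z| >= 1.645
# 95% confidence: p < 0.05, |z| >= 1.960
# 99% confidence: p < 0.01, |z| >= 2.576
```

Raising the confidence level raises the bar the z-score has to clear, which is why stricter thresholds need more traffic to reach.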

Significance is not effect size. A test on 800,000 sessions can flag a 0.1% lift as 'highly significant' yet the lift is too small to matter commercially. Always read significance alongside the absolute uplift and the confidence interval — the range of plausible true effects given your data.
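
To make the gap between significance and effect size concrete, the sketch below computes a confidence interval for the absolute lift. It assumes a simple Wald interval with an unpooled standard error, and the traffic figures are hypothetical: 800,000 sessions split evenly, converting at 3.0% and 3.1% (a 0.1-percentage-point lift). The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the absolute difference p_B - p_A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error: each variant keeps its own variance estimate.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se

# Hypothetical data: 12,000 of 400,000 (control) vs 12,400 of 400,000 (variant).
low, high = lift_confidence_interval(12_000, 400_000, 12_400, 400_000)
print(f"95% CI for absolute lift: {low:.4%} to {high:.4%}")
```

With these inputs the interval excludes zero, so the result is significant, yet even its upper end stays below 0.2 percentage points. Whether that is worth shipping is a commercial question, not a statistical one.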

Formula

z = (p_B - p_A) / sqrt( p_pool * (1 - p_pool) * (1/n_A + 1/n_B) )

Variables

p_A: Control conversion rate. Conversions in variant A divided by sessions in variant A.

p_B: Variant conversion rate. Conversions in variant B divided by sessions in variant B.

p_pool: Pooled conversion rate. Total conversions across both variants divided by total sessions.

n_A: Control sample size. Number of sessions assigned to variant A.

n_B: Variant sample size. Number of sessions assigned to variant B.

z: z-score. Convert to a p-value via the standard normal distribution. |z| ≥ 1.96 corresponds to p < 0.05 (two-tailed).

Worked example

An apparel store on Shopify tests a sticky 'Add to bag' button against the standard one.

Control sessions (n_A): 18,000

Control conversions: 540 (3.00%)

Variant sessions (n_B): 18,000

Variant conversions: 612 (3.40%)

z ≈ 2.16, p ≈ 0.031

p < 0.05, so the 0.40-percentage-point lift is statistically significant at the conventional threshold. The team can ship the sticky button, knowing that a gap at least this large would appear only about 3.1% of the time if the two buttons truly performed the same.

A common misread: 'p = 0.031 means there's a 96.9% chance the variant is better.' It doesn't. The p-value is the probability of the data given the null hypothesis, not the probability of the hypothesis given the data. The practical takeaway is similar in most cases, but the distinction matters when you start peeking at results mid-test.
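
The worked example can be reproduced with a few lines of code. This is a minimal sketch of the pooled two-proportion z-test from the Formula section, using only Python's standard library; the function name `two_proportion_z_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-tailed p-value) for the pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under the null hypothesis (p_A == p_B).
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed: probability mass in both tails beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Inputs from the worked example: 540/18,000 (control) vs 612/18,000 (variant).
z, p_value = two_proportion_z_test(540, 18_000, 612, 18_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The normal approximation used here is reasonable at sample sizes like these; with very few conversions per variant an exact test would be a safer choice.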

Benchmark

Sessions per variant needed to detect a given relative lift at 95% confidence and 80% power, by baseline conversion rate.

| Baseline conversion rate | Detect +5% lift | Detect +10% lift | Detect +20% lift |
| --- | --- | --- | --- |
| 1.5% (cold traffic landing page) | ~210,000 | ~53,000 | ~13,500 |
| 3.0% (apparel store homepage) | ~103,000 | ~26,000 | ~6,600 |
| 5.0% (beauty PDP, warm traffic) | ~60,000 | ~15,200 | ~3,900 |
| 8.0% (returning-customer checkout) | ~36,500 | ~9,200 | ~2,400 |
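
Benchmark figures like these depend on which approximation a calculator uses. As a minimal sketch, assuming the common two-sample normal approximation with a two-tailed test (the function name `sessions_per_variant` and its parameters are illustrative, not taken from any specific tool), the required sample can be estimated like this:

```python
from math import ceil
from statistics import NormalDist

def sessions_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Sessions needed in EACH variant to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    # Each variant contributes its own binomial variance to the difference.
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

n = sessions_per_variant(baseline=0.03, relative_lift=0.10)
print(f"{n:,} sessions per variant")
```

Under these assumptions the sketch returns about 53,000 sessions per variant for a 3.0% baseline and a +10% relative lift, which is larger than the tabulated value. The table therefore appears to rest on a less conservative approximation, so it is worth checking which formula your own planning tool applies.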

Significance assumes you fixed your sample size before the test and looked once at the end. Peeking — checking the p-value daily and stopping when it crosses 0.05 — inflates your false-positive rate well above 5%. If you want to look early, use sequential testing or Bayesian methods designed for it.
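
The effect of peeking can be seen in a small simulation of A/A tests, where any 'significant' result is by definition a false positive. This is a sketch under a simplifying assumption: each look adds an equal batch of data, modelled under the null hypothesis as one standard-normal increment, so no real traffic is simulated. The function name and the choice of 14 looks are illustrative.

```python
import random
from math import sqrt

def false_positive_rate(looks, simulations=2000, z_crit=1.96, seed=7):
    """Share of A/A tests declared 'significant' when checked at every look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(simulations):
        running_sum = 0.0
        for look in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one more batch of null data
            z = running_sum / sqrt(look)        # z-score of all data seen so far
            if abs(z) >= z_crit:                # stop at the first "significant" look
                false_positives += 1
                break
    return false_positives / simulations

single_look = false_positive_rate(looks=1)
daily_peeks = false_positive_rate(looks=14)
print(f"one look: {single_look:.1%}, 14 daily looks: {daily_peeks:.1%}")
```

A single look should land near the nominal 5%, while stopping at the first of 14 looks to cross the threshold pushes the rate several times higher.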

Frequently asked questions

What does p < 0.05 actually mean?

If the two variants truly performed identically, you would see a difference at least as large as the one you observed less than 5% of the time. It is a statement about the data under the null hypothesis, not about the probability that your variant is better.

Is 95% confidence the same as p < 0.05?

Yes. '95% confidence' and 'p < 0.05' are two ways of expressing the same threshold. Confidence intervals and p-values are mathematically linked outputs of the same statistical analysis.

Can a result be statistically significant but not worth acting on?

Often. A 0.2% lift on a high-traffic homepage can clear p < 0.05 yet not justify the engineering cost or the risk of cannibalising another metric. Always weigh significance against absolute effect size and business impact.

What does it mean if my test doesn't reach significance?

Either the true effect is smaller than your test was powered to detect, or there's no real effect. Calculate the minimum detectable effect for your sample, and call the test inconclusive rather than 'negative' if the confidence interval still includes both zero and effects large enough to matter.
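
A rough minimum detectable effect (MDE) can be estimated from the sample size you actually collected. The sketch below assumes a two-tailed test and, as a simplification, that both variants share the baseline's variance; the function name is illustrative. The inputs reuse the 3.0% baseline and 18,000 sessions per variant from the worked example.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(baseline, n_per_variant, alpha=0.05, power=0.80):
    """Approximate smallest absolute lift detectable with the given sample."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Simplification: both variants are assumed to share the baseline variance.
    se = sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    return (z_alpha + z_beta) * se

mde = minimum_detectable_effect(baseline=0.03, n_per_variant=18_000)
print(f"absolute MDE: {mde:.2%}, relative MDE: {mde / 0.03:.1%}")
```

With these inputs the MDE comes out at roughly half a percentage point, so a test of that size cannot reliably rule out smaller real effects.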

How long should I run an A/B test?

Long enough to hit your pre-calculated sample size, and at least one full business cycle (typically 7-14 days) to absorb day-of-week effects. Decide both upfront; don't stop early just because the p-value crossed 0.05.

Should I use a one-tailed or a two-tailed test?

Two-tailed is the safer default because it accounts for the possibility that your variant could be worse, not just better. Most experimentation platforms report two-tailed p-values by default.

What is the difference between significance and power?

Significance (alpha) controls false positives, that is, calling a flat result a winner. Power (1 - beta) controls false negatives, that is, missing a real winner. Standard practice is a 5% significance level (95% confidence) with 80% power, meaning you accept a 5% false-positive rate and a 20% miss rate.

Does Bayesian testing do away with p-values?

It replaces them with a different metric, typically 'probability variant B is better than A'. Bayesian methods handle early stopping more gracefully, but you still need a decision threshold (e.g. 95% probability to ship) and a pre-defined sample budget.
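
As an illustration of the 'probability variant B is better than A' metric, here is a minimal Monte Carlo sketch. It assumes independent uniform Beta(1, 1) priors on each conversion rate, which is one common choice rather than the only one, and it reuses the counts from the worked example. The function name is illustrative.

```python
import random

def probability_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

prob = probability_b_beats_a(540, 18_000, 612, 18_000)
print(f"P(B better than A) = {prob:.1%}")
```

The result is a direct probability statement about the variants, which a p-value is not, but it depends on the prior and still needs a pre-agreed shipping threshold.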

Why is my test significant overall but not for a segment such as mobile?

Segment-level results have smaller samples and wider confidence intervals, so significance is harder to reach. Mobile may also have genuinely different behaviour. Pre-register the segments you care about and power the test for the smallest segment, not the overall sample.

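Can I stop a test as soon as it reaches significance?

Not with classical (frequentist) testing. Peeking can inflate your real false-positive rate to roughly 20-30% when you check daily over a few weeks, and the more often you look, the worse it gets. If you need to monitor and stop early, use sequential testing (e.g. group sequential or always-valid p-values) designed to preserve the error rate under repeated looks.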
