Statistical Significance Calculator

Metricuno
May 17, 2026
Quick answer

Drop in visitor and conversion counts from your A/B test and get a p-value, confidence level, and lift interval — so you know whether the win is real before you ship it.

Definition
Experimentation

Statistical Significance Calculator

A tool that takes two A/B variants' visitor and conversion counts and returns a p-value, confidence level, and lift interval.

A statistical significance calculator compares the conversion rates of two test variants and tells you how likely a difference at least as large as the one you observed would be if the variants actually performed the same, that is, if the gap were just random noise. Under the hood it runs a two-proportion z-test on the visitor and conversion counts you provide, then converts the resulting z-score into a p-value and a confidence interval around the lift.

The job of the calculator is binary at decision time: did the variant beat the control with enough certainty to roll out, or do you need more traffic? It's the standard post-test gate before you ship a change to 100% of buyers.

Also known as
A/B test significance calculator
p-value calculator
conversion test calculator
Calculator

A/B test significance calculator

Inputs

Control visitors

Control conversions

Variant visitors

Variant conversions

Significance level (α): 0.05 = 5% false-positive rate (standard).

Result

Statistical significance: p = 0.0284 (Significant)

Z-score: 2.192

Relative lift: 16.67%

Conversion rates: Control 3.60%, Variant 4.20%

Enter visitor and conversion counts for each variant. The calculator runs a two-proportion z-test and returns the p-value, the variant's lift over control, and a 95% confidence interval on that lift. Default confidence threshold is 95% (α = 0.05), two-tailed.

Use this calculator at the end of a test — after you've hit your pre-registered sample size and run for at least one full business cycle. Checking significance mid-test, then stopping early when you see a green number, is the single most common way to ship a fake win.

The math behind the calculator

Formula

z = (p_b - p_a) / sqrt( p_pool * (1 - p_pool) * (1/n_a + 1/n_b) )

Variables

p_a: Control conversion rate. Conversions in control divided by visitors in control.

p_b: Variant conversion rate. Conversions in variant divided by visitors in variant.

n_a: Control visitors. Total unique visitors assigned to the control.

n_b: Variant visitors. Total unique visitors assigned to the variant.

p_pool: Pooled conversion rate, (conversions_a + conversions_b) / (n_a + n_b). The shared baseline used under the null hypothesis.

z: Z-score. Standardised difference between the two rates; converts to a p-value via the normal distribution.
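The formula and variables above translate fairly directly into code. The following is a minimal Python sketch, not the calculator's actual implementation; the function name `two_proportion_z_test` and its argument names are illustrative assumptions. It relies on the identity that the two-tailed normal p-value equals erfc(|z| / sqrt(2)).

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test. Returns (z, two-tailed p-value)."""
    p_a = conv_a / n_a  # control conversion rate
    p_b = conv_b / n_b  # variant conversion rate
    # Pooled rate: the shared baseline assumed under the null hypothesis.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value: P(|Z| >= |z|) for a standard normal Z.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Example call with 360 of 12,000 control conversions vs 420 of 12,000 variant conversions.
z, p = two_proportion_z_test(360, 12_000, 420, 12_000)
print(f"z = {z:.3f}, two-tailed p = {p:.4f}")
```

The function assumes both arms have at least one visitor and that the pooled rate is strictly between 0 and 1; otherwise the standard error is zero and the division fails.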

Worked example

A Shopify apparel store tests a new product-page hero against control. Control: 12,000 visitors, 360 add-to-carts (3.00%). Variant: 12,000 visitors, 420 add-to-carts (3.50%).

Control visitors (n_a): 12000

Control conversions: 360

Variant visitors (n_b): 12000

Variant conversions: 420

Pooled rate (p_pool): 0.0325

z ≈ 2.18, two-tailed p ≈ 0.029

p ≈ 0.029 is below the 0.05 threshold, so you'd reject the null at 95% confidence. The variant's 16.7% relative lift is unlikely to be noise, so ship it and continue monitoring revenue per visitor in the 2 weeks after rollout.

The z-test assumes independent visitors, a binary outcome per visitor (converted / didn't), and large enough samples that the normal approximation holds — generally at least ~30 conversions per arm and conversion rates not pinned near 0% or 100%. For sparse events (refunds, high-AOV checkout completions on low traffic), Fisher's exact test is more honest.
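For the sparse-event case, a two-sided Fisher's exact test can be sketched with the standard library alone. This is an illustrative implementation under the usual convention of summing every table no more likely than the observed one; the function name and the example counts are hypothetical, and this page's calculator does not necessarily offer this test.

```python
from math import comb


def fisher_exact_two_sided(conv_a, n_a, conv_b, n_b):
    """Two-sided Fisher's exact test p-value for a 2x2 conversion table."""
    total_conv = conv_a + conv_b
    total_n = n_a + n_b
    denom = comb(total_n, total_conv)

    def prob(k):
        # Hypergeometric probability that arm A holds k of the total conversions.
        return comb(n_a, k) * comb(n_b, total_conv - k) / denom

    observed = prob(conv_a)
    k_min = max(0, total_conv - n_b)
    k_max = min(n_a, total_conv)
    # Sum every table that is as likely as, or less likely than, the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical sparse event: 4 refunds in 400 orders vs 12 refunds in 400 orders.
print(f"Fisher exact p = {fisher_exact_two_sided(4, 400, 12, 400):.4f}")
```

Because it enumerates exact probabilities, this approach stays valid at low counts, at the cost of being slower on very large samples.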

What a real test result looks like

Benchmark

Sample A/B test scenarios on an apparel store running PDP variants — what the calculator returns

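| Scenario | Control CR | Variant CR | Visitors per arm | Relative lift | p-value | Call |
|---|---|---|---|---|---|---|
| Clear winner | 3.00% | 3.60% | 15,000 | +20.0% | 0.004 | Ship |
| Borderline | 3.00% | 3.30% | 15,000 | +10.0% | 0.137 | Extend test |
| Underpowered | 3.00% | 3.45% | 4,000 | +15.0% | 0.255 | Inconclusive |
| Flat | 3.00% | 3.05% | 20,000 | +1.7% | 0.770 | No effect |
| Negative | 3.00% | 2.70% | 15,000 | -10.0% | 0.118 | Kill variant |
| Large win, small N | 3.00% | 5.00% | 1,200 | +66.7% | 0.012 | Replicate before shipping |

The p-values are two-tailed results of the pooled two-proportion z-test on the rates and visitor counts shown.

Two patterns repeat in the table. First, big relative lifts on small samples (the last row) are often regression-to-the-mean traps, so replicate or extend before rolling out. Second, a p-value that lands above the threshold, as in the Borderline row, isn't "almost significant": it means you need more data, not a softer threshold.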

Common ways teams misread the output

A p-value answers one specific question: if the variants were truly identical, how often would you see a difference this large by chance? It does not tell you the probability the variant is better, the size of the effect, or whether the change will hold up at full traffic. For those, look at the confidence interval on the lift — if it spans zero, you don't have a reliable directional read yet.

Don't peek, don't stop early

Checking significance every morning and ending the test the first time p drops below 0.05 inflates your false-positive rate dramatically — by some estimates from 5% to 25%+. Decide your sample size upfront, run until you hit it, then evaluate once. If you need sequential monitoring, switch to a Bayesian or always-valid test framework, not the classical z-test in this calculator.
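A small simulation can make the inflation concrete. The sketch below is an illustration under assumed parameters (A/A tests with a 5% true conversion rate, 10 looks, 200 new visitors per arm between looks); all names and defaults are hypothetical. It counts how often an experiment with no real difference crosses p < 0.05 at any look, compared with testing once at the end.

```python
import math
import random


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value from the pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    if p_pool in (0.0, 1.0):
        return 1.0  # no variation observed yet, so nothing to test
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_peeking(experiments=300, looks=10, visitors_per_look=200,
                     true_rate=0.05, alpha=0.05, seed=0):
    """Return (false-positive rate with peeking, false-positive rate testing once)."""
    rng = random.Random(seed)
    crossed_at_any_look = 0
    significant_at_end = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        crossed = False
        for _ in range(looks):
            # Both arms share the same true rate, so any "win" is a false positive.
            conv_a += sum(rng.random() < true_rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < true_rate for _ in range(visitors_per_look))
            n += visitors_per_look
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                crossed = True
        crossed_at_any_look += crossed
        significant_at_end += p < alpha  # p holds the final look's value here
    return crossed_at_any_look / experiments, significant_at_end / experiments


peeking_rate, fixed_rate = simulate_peeking()
print(f"peeking: {peeking_rate:.1%}, single final test: {fixed_rate:.1%}")
```

The peeking rate can never be lower than the single-test rate, because the final look is also one of the peeks; the size of the gap depends on how many looks you take.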

Frequently asked

Statistical significance calculator FAQ

What confidence level should I use?

95% (α = 0.05) is the standard default and what this calculator uses out of the box. Drop to 90% only for low-risk, easily reversible changes like copy tweaks where shipping a near-miss is cheap. Bump to 99% for high-risk changes, such as checkout flow or pricing pages, where a false positive costs real revenue.

How is this different from a sample size calculator?

A sample size calculator runs before the test to tell you how much traffic you need to detect a given lift. A significance calculator runs after the test to evaluate the data you actually collected. You should use both: sample size to plan, significance to decide.
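For the planning side, a common textbook approximation for the per-arm sample size of a two-tailed two-proportion test can be sketched as follows. This is an assumption-laden illustration, not the method of any particular sample size calculator: the 80% power default, the function name, and the example inputs are all hypothetical choices.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-tailed two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Hypothetical plan: 3% baseline conversion rate, 10% relative lift to detect.
print(sample_size_per_arm(0.03, 0.10))
```

Halving the lift you want to detect roughly quadruples the required traffic, which is why small expected effects need long tests.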

Should I use a one-tailed or two-tailed test?

Two-tailed is the safe default and what most CRO teams use, because it accounts for the variant being either better or worse than control. Use one-tailed only when a worse-than-control outcome would lead to exactly the same decision as a flat outcome. That's rare in practice, since you usually want to know if your variant tanked.

My result came back at p = 0.07. Can I ship the variant?

Not on the strength of this test alone. p = 0.07 means the data is suggestive but doesn't clear the 95% threshold. Your options are: extend the test to gather more traffic, lower your confidence threshold to 90% if the change is genuinely low-risk and pre-agreed, or treat the result as a hypothesis to retest with a sharper variant.

Can I use this calculator for revenue per visitor or average order value (AOV)?

No. This calculator runs a two-proportion z-test for binary outcomes (converted yes/no). For continuous metrics like revenue per visitor or AOV you need a t-test or a bootstrap, because the distribution is skewed and a few high-value orders distort the variance. Test conversion rate here, then sanity-check revenue separately.
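As a rough illustration of the bootstrap option for a continuous metric, the sketch below builds a percentile confidence interval for the difference in mean revenue per visitor. The function name is hypothetical and the revenue lists are synthetic placeholder data, not results from any real test.

```python
import random


def bootstrap_mean_diff_ci(control, variant, resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(variant) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(resamples):
        # Resample each arm with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        v = rng.choices(variant, k=len(variant))
        diffs.append(sum(v) / len(v) - sum(c) / len(c))
    diffs.sort()
    lo_idx = int((1 - confidence) / 2 * resamples)
    hi_idx = int((1 + confidence) / 2 * resamples) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Synthetic per-visitor revenue: most visitors spend nothing, a few place an order.
control_revenue = [0.0] * 95 + [40.0] * 5
variant_revenue = [0.0] * 94 + [40.0] * 5 + [250.0]
low, high = bootstrap_mean_diff_ci(control_revenue, variant_revenue)
print(f"95% CI for revenue-per-visitor difference: [{low:.2f}, {high:.2f}]")
```

Note how a single large order in the synthetic variant data widens the interval, which is the skew problem described above.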

How many conversions do I need before I can trust the result?

At least 30 conversions per arm for the normal approximation to behave, and ideally 200-400+ before you trust the result for a business decision. Below 30, switch to Fisher's exact test. The exact threshold depends on your minimum detectable effect: smaller lifts need much larger samples.

What does the confidence interval on the lift tell me?

It's the range of relative lifts consistent with your data at the chosen confidence level. A 95% CI of [+4%, +28%] means you can be 95% confident the true lift is somewhere in that range. If the interval includes zero (e.g. [-2%, +15%]), the result isn't statistically significant regardless of the point estimate.
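One common way to put an interval around a relative lift is the log-ratio (delta) method. The sketch below assumes that method; this page does not say which interval its calculator uses, so treat the function name and the approach as illustrative rather than a description of the tool.

```python
import math
from statistics import NormalDist


def relative_lift_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for relative lift (p_b / p_a - 1) via the log-ratio method."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    ratio = p_b / p_a
    # Delta-method standard error of ln(p_b / p_a); needs at least one conversion per arm.
    se_log = math.sqrt(1 / conv_a - 1 / n_a + 1 / conv_b - 1 / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = math.exp(math.log(ratio) - z_crit * se_log) - 1
    upper = math.exp(math.log(ratio) + z_crit * se_log) - 1
    return lower, upper


# Example call with 360 of 12,000 control conversions vs 420 of 12,000 variant conversions.
low, high = relative_lift_ci(360, 12_000, 420, 12_000)
print(f"95% CI on relative lift: [{low:+.1%}, {high:+.1%}]")
```

Because this interval is built on the ratio of rates rather than the pooled difference, it can disagree with the z-test p-value in very close calls near the threshold.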

Can I use this calculator for a test with three or more variants?

Not directly. Comparing 3+ variants pairwise with this calculator inflates your false-positive rate through multiple comparisons. Either apply a Bonferroni correction (divide your α by the number of comparisons) or use an omnibus multi-variant test (for conversion rates, typically a chi-square test across all arms). For most cases, run sequential two-variant tests instead.
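The Bonferroni correction is simple enough to sketch in a few lines. The helper name and the example p-values below are hypothetical; the only rule applied is the one stated above, dividing α by the number of comparisons.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the adjusted alpha and which comparisons remain significant."""
    adjusted_alpha = alpha / len(p_values)
    return adjusted_alpha, [p < adjusted_alpha for p in p_values]


# Hypothetical p-values from three variant-vs-control comparisons.
adjusted, flags = bonferroni_significant([0.012, 0.030, 0.200])
print(f"adjusted alpha = {adjusted:.4f}, significant = {flags}")
```

In this made-up case only the first comparison clears the stricter bar, even though two of the three would pass an uncorrected 0.05 threshold.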

Should I use a Bayesian test instead?

Bayesian results are easier to communicate ("94% probability variant B wins") and don't suffer from peeking the same way. They're a reasonable choice, especially for stakeholders who find p-values unintuitive. The underlying decision is similar in most realistic scenarios, so pick one framework and stick with it across your program rather than switching when results disagree.
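To show where a statement like "probability variant B wins" can come from, here is a minimal Beta-Binomial sketch. It assumes independent uniform Beta(1, 1) priors on each arm's conversion rate and estimates the probability by Monte Carlo sampling; the prior, the function name, and the number of draws are illustrative assumptions, not a description of any specific Bayesian tool.

```python
import random


def prob_variant_beats_control(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_b > rate_a) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Example call with 360 of 12,000 control conversions vs 420 of 12,000 variant conversions.
print(f"P(variant > control) = {prob_variant_beats_control(360, 12_000, 420, 12_000):.3f}")
```

The result is an estimate that varies slightly with the seed and the number of draws; more draws tighten it at the cost of runtime.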

Why did my A/A test come back significant?

About 5% of the time, an A/A test will return p < 0.05 by pure chance. That's what a 5% false-positive rate (α = 0.05) means. If you see it more often, check for instrumentation bugs: uneven traffic split, sample ratio mismatch, bot traffic in one arm, or events firing twice. A well-run A/A test is the fastest way to expose a broken tracking setup.
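A sample ratio mismatch check is one of the easier diagnostics to automate. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom against an intended split; the function name, the 50/50 default, and the example counts are assumptions for illustration.

```python
import math


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the traffic split."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical split that was meant to be 50/50.
print(f"SRM p-value: {srm_p_value(12_000, 12_600):.5f}")
```

A very small p-value here suggests the split itself is off, in which case the conversion comparison should not be trusted until the assignment or tracking issue is found. Teams often use a stricter cutoff than 0.05 for this check, since it runs on every test.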
