False Positives

Metricuno
May 19, 2026
Quick answer

A false positive is an A/B test that declares a winner when no real effect exists. Here's how often they happen, why running multiple tests inflates the risk, and what to do about it.

Definition
Statistical Analysis

False Positives

An A/B test result that declares a winner when no true effect exists — the statistical 'type I error.'

A false positive happens when a test reaches statistical significance purely by chance, even though the variant has no real impact on conversion. With the standard 5% significance threshold (α = 0.05), roughly one in twenty tests of a variant with no real effect will look like a winner when nothing actually changed.

The risk compounds quickly when you run many tests, peek at results early, or test multiple metrics per experiment. Multiple-comparison corrections like Bonferroni or Benjamini–Hochberg exist specifically to keep the false-positive rate in check as your experimentation program scales.

Also known as
Type I error
False discovery
Alpha error

Every significance threshold is a deliberate trade-off. Set α = 0.05 and you're saying: I accept a 5% chance of calling a winner when there isn't one. That feels conservative on a single test, but it's the per-test risk, not the program-level risk.

Run twenty independent A/B tests on truly inert variants and you should expect roughly one to cross the significance line. Run a hundred and you should expect about five phantom wins shipped to production: variants that won't reproduce when you re-test or roll out fully.

Formula

FWER = 1 - (1 - α)^k

Variables

FWER

Family-wise error rate

The probability of at least one false positive across the full set of tests.

α

Per-test significance level

The chosen alpha for an individual test, typically 0.05.

k

Number of tests

How many independent comparisons you run in the family.

Worked example

An apparel store on Shopify runs 10 independent A/B tests on the product detail page in a quarter, each at α = 0.05.

Per-test α: 0.05

Number of tests (k): 10

FWER ≈ 40.1%

Even if none of the ten variants has any real effect, there's about a 40% chance at least one will look like a winner. That's why teams with high test velocity need either lower per-test α, multiple-comparison correction, or pre-registered primary metrics.
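The arithmetic in the worked example can be checked with a few lines of Python. This is a minimal sketch, assuming the tests are independent as the formula requires; the function names are illustrative and not part of any library. The second helper inverts the same formula to find the per-test α that keeps FWER at a target level, an adjustment usually called the Šidák correction.

```python
def family_wise_error_rate(alpha: float, k: int) -> float:
    """FWER = 1 - (1 - alpha)^k for k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def per_test_alpha_for_target(target_fwer: float, k: int) -> float:
    """Invert the FWER formula (Sidak correction): the per-test alpha
    that keeps the family-wise error rate at target_fwer across k tests."""
    return 1 - (1 - target_fwer) ** (1 / k)


# Worked example from above: 10 tests at alpha = 0.05
print(f"{family_wise_error_rate(0.05, 10):.1%}")  # 40.1%

# Per-test alpha needed to hold FWER at 5% across the same 10 tests
print(round(per_test_alpha_for_target(0.05, 10), 5))
```

The required per-test α comes out close to 0.05 / 10, which is why Bonferroni's simpler division is a common stand-in.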

The fix isn't to stop testing — it's to budget your error rate. Bonferroni correction divides α by the number of tests; Benjamini–Hochberg controls the false discovery rate while staying less punitive on power. For most CRO programs, picking one primary metric per test and pre-committing to it eliminates the worst offender: silent multiple-comparison inflation.
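To make the difference between the two corrections concrete, here is a minimal sketch of both, written from their textbook definitions. The function names and the example p-values are made up for illustration; a real program would more likely use a statistics library than hand-rolled code.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank whose p-value sits under its threshold rank/m * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cut-off
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from five tests in one family
p_values = [0.001, 0.008, 0.02, 0.045, 0.30]
print(bonferroni(p_values))          # [True, True, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, False, False]
```

On these made-up numbers Bonferroni keeps two winners and Benjamini–Hochberg keeps three, which is the "less punitive on power" trade-off in practice.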

Benchmark

Expected false positives at α = 0.05 as test volume scales

Tests run on null variants    Expected false positives    P(at least one false positive)
1                             0.05                        5.0%
5                             0.25                        22.6%
10                            0.50                        40.1%
20                            1.00                        64.2%
50                            2.50                        92.3%
100                           5.00                        99.4%
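These benchmark figures can also be approximated by simulation rather than by formula. The sketch below relies on one standard fact: when a variant has no real effect, a well-calibrated test's p-value is uniformly distributed between 0 and 1. It assumes independent tests, and the function name, seed, and program count are arbitrary choices for illustration.

```python
import random


def simulate_null_program(k, alpha=0.05, n_programs=20000, seed=1):
    """Simulate many testing programs of k null tests each.

    Returns (average false positives per program,
             share of programs with at least one false positive).
    """
    rng = random.Random(seed)
    total_false_positives = 0
    programs_with_any = 0
    for _ in range(n_programs):
        # Under the null, each p-value is a uniform draw on [0, 1)
        false_positives = sum(1 for _ in range(k) if rng.random() < alpha)
        total_false_positives += false_positives
        if false_positives > 0:
            programs_with_any += 1
    return total_false_positives / n_programs, programs_with_any / n_programs


for k in (1, 5, 10, 20):
    mean_fp, p_any = simulate_null_program(k)
    print(k, round(mean_fp, 2), f"{p_any:.1%}")
```

The simulated values should land close to the table's figures, with small differences from run-to-run sampling noise.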

Beyond multiple comparisons, two operational habits inflate false positives: peeking at results before reaching the planned sample size, and stopping the test the moment p crosses 0.05. Both let randomness do the talking. Fix the sample size up front, or use a sequential testing method that's designed to handle interim looks.
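A small simulation shows how much peeking costs. This is a rough sketch under simplifying assumptions that are not from the article: the metric is idealized as normally distributed with known unit variance, both arms receive the same traffic at each look, and the variant has no true effect. The function name, look count, and sample sizes are illustrative.

```python
import math
import random

Z_CRIT = 1.96  # approximate two-sided critical value for alpha = 0.05


def simulate_peeking(n_sims=4000, looks=14, n_per_look=500, seed=0):
    """Simulate A/A tests and return (fixed_rate, peeking_rate).

    fixed_rate:   share of tests significant at the final look only.
    peeking_rate: share of tests significant at any of the interim looks.
    """
    rng = random.Random(seed)
    # Sum of n_per_look (variant - control) differences has variance 2 * n_per_look
    step_sd = math.sqrt(2 * n_per_look)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        diff_sum = 0.0  # running sum of variant minus control
        n = 0           # visitors per arm so far
        z = 0.0
        crossed_at_any_look = False
        for _ in range(looks):
            diff_sum += rng.gauss(0.0, step_sd)
            n += n_per_look
            z = diff_sum / math.sqrt(2 * n)  # z-statistic for the difference in means
            if abs(z) > Z_CRIT:
                crossed_at_any_look = True
        if abs(z) > Z_CRIT:
            fixed_hits += 1
        if crossed_at_any_look:
            peeking_hits += 1
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_peeking()
print(f"fixed sample size: {fixed_rate:.1%}, stop at first significant look: {peeking_rate:.1%}")
```

With a fixed sample size the false-positive rate should stay near the nominal 5%, while stopping at the first significant daily look should push it well above that.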

Frequently asked

False positives in A/B testing — FAQ

A false positive is a test result that crosses your significance threshold even though the variant has no real effect on the metric. The signal you saw was random noise that happened to align in the variant's favour during the test window.

A false positive and a type I error are the same thing in different vocabularies. 'Type I error' is the formal statistics term; 'false positive' is the plain-language version. Both describe rejecting the null hypothesis when the null is actually true.

A 5% significance threshold caps the false-positive risk at 5% only per test. Across a program of many independent tests on variants with no real effect, the probability that at least one is a false positive grows quickly: about 40% after ten tests and over 90% after fifty. That program-level risk is the family-wise error rate.

A false positive (type I error) declares a winner when none exists. A false negative (type II error) misses a real winner because the test didn't detect the effect — usually due to under-powering or stopping early.

If you check the test repeatedly and stop the moment p drops below 0.05, you're effectively running multiple comparisons across time. The true false-positive rate ends up much higher than 5% — often 20–30% with daily peeking on a two-week test.

A multiple-comparison correction is an adjustment to your significance threshold that accounts for running several tests or comparing several metrics. Use it when you test more than two variants in one experiment, evaluate multiple secondary metrics, or analyse many subgroups. Bonferroni and Benjamini–Hochberg are the two most common methods.

Bonferroni is simple and strict — divide α by the number of comparisons. It protects against any false positive but kills power. Benjamini–Hochberg controls the proportion of false discoveries among declared winners, which is usually the right trade-off for CRO programs with high test velocity.

Bayesian tests don't use the same null-hypothesis framework, but the analogue exists: the probability that the chosen variant is actually worse than control. Decision rules like 'ship when P(variant > control) > 95%' will still be wrong some of the time, and that error rate also inflates with more tests.
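For readers who want to see what such a decision rule looks like, here is a hedged sketch of one common way to estimate P(variant > control) for conversion rates. It assumes a uniform Beta(1, 1) prior on each arm and uses Monte Carlo draws from the resulting Beta posteriors. The function name, the prior, and the example counts are illustrative assumptions, not something the article prescribes.

```python
import random


def prob_variant_beats_control(conv_c, n_c, conv_v, n_v, draws=20000, seed=0):
    """Estimate P(variant rate > control rate) under Beta(1, 1) priors.

    conv_c, n_c: conversions and visitors in control.
    conv_v, n_v: conversions and visitors in the variant.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions)
        rate_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        rate_v = rng.betavariate(1 + conv_v, 1 + n_v - conv_v)
        if rate_v > rate_c:
            wins += 1
    return wins / draws


# Hypothetical counts: 50/1000 conversions in control, 65/1000 in the variant
p_better = prob_variant_beats_control(50, 1000, 65, 1000)
print(f"P(variant > control) is about {p_better:.1%}")
print("ship" if p_better > 0.95 else "keep testing")
```

Applying the 95% rule to many inert variants, or re-checking it every day, inflates the share of wrong calls in the same way as in the frequentist case.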

To check whether a past win was a false positive, re-test it. A real effect roughly reproduces in a re-run; a false positive shrinks toward zero or flips sign. You can also check whether the original test stopped early, had unusually low sample size, or was one of many simultaneous tests without correction.

Most mature programs aim for a per-test α of 0.05 with either a pre-registered primary metric or Benjamini–Hochberg correction across the test family. If shipping a bad change is costly (checkout, pricing), tighten to α = 0.01 on those tests rather than accepting the default.
