Experiment Validity

Metricuno
May 18, 2026
Quick answer

Experiment validity is the property that decides whether an A/B test result is real and whether it will hold up beyond the conditions you tested in. Here's how to protect both internal and external validity.

Definition

Experiment Validity

The degree to which an A/B test result reflects a true causal effect (internal validity) and generalises beyond the test conditions (external validity).

Experiment validity is the quality bar that separates a measured lift from a decision you can defend. It has two halves. Internal validity asks whether the difference you observed was actually caused by the variant, or whether a tracking bug, a sample ratio mismatch, a confound, or an early peek at the results manufactured it. External validity asks whether the effect will repeat outside the test window: for new visitors, on Black Friday traffic, on mobile Safari, in a different market.

A test can be statistically significant and still invalid on either axis. Most rollback stories in e-commerce trace back to a validity failure, not a maths failure — which is why disciplined teams treat validity checks as a gate before reading the p-value, not after.

Also known as: test validity, trustworthy experimentation

Internal validity is what makes the causal claim defensible. The randomisation has to actually be random, the variant has to be the only thing that changed, and the measurement has to fire identically for both groups. If any of those break, your observed lift is a measurement artefact dressed up as insight.

External validity is what makes the rollout decision defensible. A 6% checkout lift on desktop traffic during a quiet sales week tells you very little about what will happen on a peak mobile day with paid social cohorts. The further the rollout context drifts from the test context, the more validity you have to argue for explicitly.

Formula

Adjusted Lift = Observed Lift × (1 − Validity Discount)

Variables

Observed Lift (raw measured lift): The relative difference between variant and control on the primary metric.

Validity Discount (combined risk factor): A 0–1 haircut reflecting known internal threats (SRM, tracking gaps) and external threats (novelty, seasonality, segment mismatch).

Adjusted Lift (decision-grade lift): The lift you should plan revenue forecasts against once validity risk is priced in.

Worked example

A Shopify apparel store tests a new product-page layout. The test runs for 9 days during a discount campaign and shows a +5.2% lift on add-to-cart. Internal checks are clean, but the team flags two external threats: overlap with the discount campaign (estimated 10% haircut on the lift) and a novelty effect from heavy returning-visitor exposure (estimated 15% haircut). The haircuts compound, so the lift that survives is 0.90 × 0.85 = 0.765 of the observed value, and the combined validity discount is 1 − 0.765 = 0.235.

Observed Lift: 5.2%

Validity Discount: 0.235

Adjusted Lift = 5.2% × (1 − 0.235) ≈ 4.0%

The team forecasts the rollout against a 4.0% lift, not 5.2%, and plans a holdback to validate the gap in steady-state traffic post-campaign.
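The arithmetic in the worked example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names are hypothetical, and multiplying the per-threat haircuts together assumes the threats act independently, which is the same assumption the worked example makes.

```python
from math import prod


def combined_validity_discount(discounts):
    """Combine per-threat haircuts (each between 0 and 1) multiplicatively."""
    for d in discounts:
        if not 0.0 <= d <= 1.0:
            raise ValueError(f"discount must be between 0 and 1, got {d}")
    # Each threat leaves (1 - d) of the lift intact; the combined
    # discount is whatever share of the lift is lost overall.
    return 1.0 - prod(1.0 - d for d in discounts)


def adjusted_lift(observed_lift, discounts):
    """Adjusted Lift = Observed Lift x (1 - Validity Discount)."""
    return observed_lift * (1.0 - combined_validity_discount(discounts))


# Values from the worked example: campaign overlap 10%, novelty 15%.
discount = combined_validity_discount([0.10, 0.15])
lift = adjusted_lift(0.052, [0.10, 0.15])
print(f"Validity discount: {discount:.3f}")  # 0.235
print(f"Adjusted lift: {lift:.1%}")          # 4.0%
```

Keeping the haircuts as a list makes each assumption visible, so a reviewer can challenge a single number without redoing the whole forecast.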

Discount factors are judgement calls, not maths — but writing them down forces the conversation. The alternative is rolling out the raw lift, missing the forecast, and blaming the test rather than the validity assumptions baked into it.

Benchmark

Common validity threats in e-commerce A/B tests and how often they materially distort the result

| Validity threat | Type | Typical incidence | Typical lift distortion |
| --- | --- | --- | --- |
| Sample Ratio Mismatch (SRM) | Internal | 5–10% of tests | Unbounded; invalidates result |
| Tracking fires only on one variant | Internal | 3–6% of tests | +20 to +200% spurious lift |
| Peeking / early stopping | Internal | 30–50% of tests | +30 to +80% inflated effect |
| Novelty effect (returning visitors) | External | 20–30% of tests | −30 to −60% decay over 30 days |
| Seasonality / campaign overlap | External | 15–25% of tests | ±15–40% vs steady state |
| Segment generalisation (desktop→mobile) | External | Most tests | Effect can flip sign |

The table makes the case for a pre-readout checklist. SRM is cheap to detect (a chi-squared on assignment counts) and catastrophic if missed. Peeking is the most common own-goal — the fix is committing to a sample size before launch and refusing to read significance until you hit it. Novelty is the one most teams underestimate because the early numbers look so good.
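Committing to a sample size before launch needs a number to commit to. The sketch below uses the standard normal-approximation formula for comparing two conversion rates. The function name, the 10% baseline rate, and the defaults (two-sided alpha of 0.05, 80% power) are illustrative assumptions rather than values from this article.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm to detect a relative lift.

    Uses the normal approximation for a two-sided test on two proportions.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 10% add-to-cart rate, 5% relative lift to detect.
n = sample_size_per_arm(baseline_rate=0.10, relative_lift=0.05)
print(f"Visitors needed per arm: {n}")
```

Halving the lift you want to detect roughly quadruples the required sample, which is why the minimum detectable effect is worth agreeing on before the test starts.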

Frequently asked

Experiment validity FAQ

Internal validity asks whether the variant truly caused the observed effect inside your test — no bugs, no confounds, no broken randomisation. External validity asks whether that effect will hold up for different audiences, devices, seasons, or traffic mixes once you roll out. You need both before shipping.

Statistical significance is one piece of internal validity: it tells you that random chance alone is an unlikely explanation for the difference. It says nothing about tracking bugs, SRM, peeking, novelty, or whether the result generalises. A test can be highly significant and completely invalid, which is why validity checks belong in the broader practice of statistical analysis, not just the p-value calculation.

SRM is when the actual traffic split between variants differs significantly from the intended split (e.g., you set 50/50 but get 53/47). It signals that randomisation or bucketing is broken, which means the two groups aren't comparable to begin with — so any lift you measure could be a population difference rather than a treatment effect. Run a chi-squared SRM check before reading any result.
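A chi-squared SRM check needs only the assignment counts. The sketch below is a minimal version for a two-variant test using the standard library. The function name and the example counts are hypothetical, and the 0.001 alert threshold mentioned in the comments is a common convention, not a figure from this article.

```python
from math import erfc, sqrt


def srm_p_value(control_count, variant_count, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for a two-variant traffic split."""
    total = control_count + variant_count
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = (
        (control_count - expected_control) ** 2 / expected_control
        + (variant_count - expected_variant) ** 2 / expected_variant
    )
    # With 1 degree of freedom, the chi-squared tail probability
    # equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2.0))


# Hypothetical counts: a 53/47 split on 10,000 assigned visitors.
p = srm_p_value(5300, 4700)
print(f"SRM p-value: {p:.2e}")
if p < 0.001:  # assumed alert threshold
    print("Likely SRM: investigate bucketing before reading the result.")
```

The same percentage split can be harmless on a few hundred visitors and alarming on tens of thousands, so the check should run on counts, not on percentages.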

To detect a novelty effect, segment your results by new vs returning visitors and plot the daily lift over time. A novelty effect typically shows a strong initial lift among returning visitors that decays over 2–4 weeks, while new-visitor lift stays flat. If the effect is concentrated in returning users and trending down, discount the rolled-out forecast accordingly.
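One way to put a number on "trending down" is the least-squares slope of daily lift for each segment. The sketch below is a rough heuristic under stated assumptions: the function names, the decay threshold, and the daily lift values are all made up for illustration, and a real analysis would also account for noise in each day's estimate.

```python
def lift_trend_per_day(daily_lifts):
    """Least-squares slope of lift against day index (lift change per day)."""
    n = len(daily_lifts)
    if n < 2:
        raise ValueError("need at least two days of data")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_lifts) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(daily_lifts))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def looks_like_novelty(returning_lifts, new_lifts, decay_threshold=-0.001):
    """Flag the pattern: returning-visitor lift decays, new-visitor lift stays flat.

    decay_threshold is an assumed cut-off in lift points per day.
    """
    returning_slope = lift_trend_per_day(returning_lifts)
    new_slope = lift_trend_per_day(new_lifts)
    return returning_slope <= decay_threshold and new_slope > decay_threshold


# Illustrative daily lifts (as fractions) over one week.
returning = [0.12, 0.11, 0.09, 0.08, 0.06, 0.05, 0.04]
new = [0.03, 0.03, 0.02, 0.03, 0.03, 0.02, 0.03]
print(f"Returning slope: {lift_trend_per_day(returning):+.4f} per day")
print(f"New slope: {lift_trend_per_day(new):+.4f} per day")
print("Novelty pattern:", looks_like_novelty(returning, new))
```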

Repeatedly checking significance and stopping the moment the p-value drops below 0.05 can inflate your false positive rate from the nominal 5% to 20–30% or more, depending on how often you look. The maths assumes you look once, at a pre-committed sample size. Either commit to that sample size upfront, or use a sequential / always-valid testing method that's designed for continuous monitoring.
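The inflation is easy to see in a simulation of A/A tests, where any "significant" result is a false positive by construction. The sketch below is a simplified model, not a full traffic simulation: it assumes equal-sized batches between looks, models each batch's contribution to the test statistic as a standard normal draw, and uses hypothetical function and parameter names.

```python
import random
from math import sqrt


def false_positive_rate(num_looks, peek, num_simulations=2000,
                        z_critical=1.96, seed=0):
    """Share of simulated A/A tests declared significant.

    peek=True stops at the first look where |z| exceeds the critical value;
    peek=False tests only once, at the final look.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_simulations):
        cumulative = 0.0
        for look in range(1, num_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # no true effect in either arm
            z = cumulative / sqrt(look)        # z-statistic on all data so far
            is_final = look == num_looks
            if (peek or is_final) and abs(z) > z_critical:
                false_positives += 1
                break
    return false_positives / num_simulations


print(f"Single look at the end: {false_positive_rate(20, peek=False):.1%}")
print(f"Stop at first of 20 looks: {false_positive_rate(20, peek=True):.1%}")
```

With 20 looks the peeking rate should land well above the nominal 5%, while the single-look rate stays close to it. The exact figures depend on the seed and the number of simulations.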

At minimum, run the test for two full business weeks to cover weekday/weekend cycles, and ideally for a full purchase cycle for your category (often 3–4 weeks). For categories with strong seasonality or paid-campaign-driven traffic, run across at least one campaign-on and campaign-off period before generalising the result.

Yes, a statistically significant test can still be invalid, and it happens often. A tracking pixel that only fires on the variant, an SRM caused by a redirect, or peeking on day 3 of a planned 14-day test can all produce a clean 95% confidence result that's pure artefact. Significance is a necessary condition, not a sufficient one.

Desktop results often do not carry over to mobile. Interaction patterns, scroll behaviour, and form friction differ enough that lifts frequently shrink, vanish, or flip on mobile. If mobile is more than 30% of your traffic, segment your test analysis by device, or run a confirmatory mobile-only test before rolling out a desktop-tested change globally.

A pre-readout checklist has four gates: (1) SRM check on assignment counts, (2) tracking-parity check on at least one upstream event, (3) confirm you reached the pre-committed sample size before reading significance, (4) segment the lift by new/returning and by device to spot novelty or device-specific effects. If any gate fails, treat the result as inconclusive.
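The four gates can be encoded as a single function that runs before anyone looks at the p-value. The sketch below is one possible shape, with every name hypothetical. The thresholds (SRM alpha of 0.001, 2% tracking tolerance) and the pass/fail rule for gate 4 (all segments must agree on the sign of the lift) are assumptions chosen for illustration; the article itself does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    srm_p_value: float          # from an SRM check on assignment counts
    control_event_rate: float   # upstream event fires per assigned visitor
    variant_event_rate: float
    observed_sample_size: int   # per arm
    planned_sample_size: int    # per arm, committed before launch
    segment_lifts: dict         # segment name -> observed lift


def failed_gates(readout, srm_alpha=0.001, parity_tolerance=0.02):
    """Return the names of the gates that failed (empty list means all passed)."""
    failures = []
    # Gate 1: sample ratio mismatch.
    if readout.srm_p_value < srm_alpha:
        failures.append("srm")
    # Gate 2: the upstream event should fire at a similar rate in both arms.
    highest = max(readout.control_event_rate, readout.variant_event_rate)
    gap = abs(readout.control_event_rate - readout.variant_event_rate)
    if highest == 0 or gap / highest > parity_tolerance:
        failures.append("tracking_parity")
    # Gate 3: no reading significance before the committed sample size.
    if readout.observed_sample_size < readout.planned_sample_size:
        failures.append("sample_size")
    # Gate 4 (assumed rule): segments must not disagree on direction.
    directions = {lift > 0 for lift in readout.segment_lifts.values()}
    if len(directions) > 1:
        failures.append("segments")
    return failures


# Illustrative readout with made-up numbers.
readout = Readout(0.40, 0.98, 0.975, 60000, 57760,
                  {"new": 0.03, "returning": 0.05, "mobile": 0.02, "desktop": 0.06})
failures = failed_gates(readout)
print("Inconclusive:" if failures else "Safe to read:", failures)
```

Returning the list of failed gates, rather than a single boolean, tells the team which check to investigate first.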

The observed lift is almost always optimistic compared to what you'll see post-rollout — because novelty fades, campaign tailwinds disappear, and the rollout audience is broader than the test audience. Apply an explicit validity discount (often 20–40%) when projecting revenue impact, and plan a holdback group to measure the steady-state lift.
