Sequential Testing

Metricuno
May 19, 2026
Quick answer

Sequential testing is a family of A/B test designs that let you peek at results during a run — and stop early when the answer is clear — without breaking your false-positive rate.

Definition

A/B test designs that let you check results repeatedly and stop early without inflating the false-positive rate.

Sequential testing is a family of statistical test designs built to handle the fact that experimenters look at their dashboards before the planned end date. In a fixed-horizon test, every peek that can trigger a stop inflates the false-positive rate; with five equally spaced looks at a nominal 5% significance level, the effective false-positive rate is about 14%, not 5%. Sequential methods (group sequential boundaries, alpha-spending functions, mixture sequential probability ratio tests (mSPRT), and always-valid p-values) adjust the decision threshold so that repeated analyses keep the overall error rate at the level you declared. The trade-off is sample size: to match the power of a perfectly disciplined fixed-horizon test, a sequential design needs a larger maximum sample, from well under 10% more with O'Brien-Fleming-style boundaries to roughly 20-30% more with Pocock-style ones.

Also known as
group sequential testing
always-valid inference
alpha-spending tests

The problem sequential testing solves is peeking. Every time you look at a fixed-horizon test and decide whether to stop, you're running another implicit hypothesis test. The probability that at least one of those looks crosses the significance threshold by chance grows with every check, and most teams check daily.
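To see the inflation for yourself, here is a minimal simulation sketch. It assumes an A/A test (no real difference), equally sized batches between looks, and a z-statistic checked against the unadjusted 1.96 threshold at every look. The function name and its defaults are illustrative, not part of any library.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks=5, alpha=0.05, trials=20000, seed=0):
    """Estimate how often an A/A test looks significant at ANY of the looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    false_positives = 0
    for _ in range(trials):
        cumulative = 0.0
        for k in range(1, looks + 1):
            # Each batch adds one standard-normal increment under the null,
            # so the z-statistic after k equal batches is sum / sqrt(k).
            cumulative += rng.gauss(0.0, 1.0)
            if abs(cumulative / k ** 0.5) > z_crit:
                false_positives += 1
                break  # the team would have stopped here
    return false_positives / trials


for looks in (1, 5, 14):
    rate = peeking_false_positive_rate(looks)
    print(f"{looks:>2} looks: false-positive rate ~ {rate:.3f}")
```

With one look the rate should land near the nominal 5%; adding looks pushes it up, which is the effect sequential designs are built to cancel.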

Sequential designs handle this by spending your error budget across the planned looks. Instead of one 5% false-positive allowance at the end, you might spend 0.5% at look one, 1% at look two, and so on, with the boundary tightening or loosening based on the alpha-spending function you chose (O'Brien-Fleming and Pocock are the two classics).

Formula

alpha_spent(t) = alpha * (t / T)^rho

Variables

alpha_spent(t): Alpha spent by look t. Cumulative false-positive budget consumed by analysis point t.

alpha: Total alpha. Overall false-positive rate you're willing to accept (typically 0.05).

t: Information at the current look. Sample size or time collected so far, in the same units as T; the ratio t / T is the information fraction.

T: Planned total information. Final sample size or duration at which the test would otherwise end.

rho: Shape parameter. Controls how front-loaded or back-loaded the spending is (rho = 1 spends alpha evenly and is roughly Pocock-like; rho = 3 saves most of the alpha for late looks and is roughly O'Brien-Fleming-like).
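The formula translates directly into a few lines of code. This is a minimal sketch, assuming t and T are measured in the same units (visitors, days, weeks); the function name and the input check are illustrative choices.

```python
def alpha_spent(t, T, alpha=0.05, rho=1.0):
    """Cumulative alpha spent after collecting t out of T planned units."""
    if not 0 < t <= T:
        raise ValueError("t must satisfy 0 < t <= T")
    # (t / T) is the information fraction; rho bends the spending curve.
    return alpha * (t / T) ** rho


# Halfway through the plan, a flat (rho = 1) curve has released half the
# budget, while a back-loaded (rho = 3) curve has released one eighth of it.
print(alpha_spent(50, 100, rho=1))
print(alpha_spent(50, 100, rho=3))
```

Whatever rho you pick, the curve reaches the full alpha at t = T, so the overall false-positive budget never exceeds what you declared.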

Worked example

A Shopify apparel store runs a checkout test planned for 4 weeks (T = 4) with alpha = 0.05 and a Pocock-like spending function (rho = 1). By the end of week 1, they want to know how much false-positive budget is available if they stop early.

alpha: 0.05

t (weeks elapsed): 1

T (planned weeks): 4

rho: 1

alpha_spent(1) = 0.05 * (1/4)^1 = 0.0125

After week 1, the team can declare a winner only if the p-value at that look is below 0.0125, not 0.05. The remaining 0.0375 of alpha is available for later looks. With an O'Brien-Fleming-style boundary (rho ≈ 3), the week-1 threshold would be far stricter (~0.0008), saving almost the full 0.05 for the planned end.
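Extending the same arithmetic to every week of the 4-week plan gives the full spending schedule. The sketch below assumes weekly looks and prints both the cumulative alpha and the slice newly released at each look; the helper name is hypothetical. Note that only the first look's p-value threshold equals its slice directly. Thresholds at later looks depend on the correlation between looks and need numerical integration, which this sketch does not attempt.

```python
def spending_schedule(T, alpha=0.05, rho=1.0):
    """Return (look, cumulative alpha, alpha newly released) for looks 1..T."""
    rows = []
    previous = 0.0
    for t in range(1, T + 1):
        cumulative = alpha * (t / T) ** rho
        rows.append((t, cumulative, cumulative - previous))
        previous = cumulative
    return rows


for rho in (1, 3):
    print(f"rho = {rho}")
    for week, cumulative, released in spending_schedule(4, rho=rho):
        print(f"  week {week}: cumulative {cumulative:.5f}, released {released:.5f}")
```

Comparing the two printouts shows the difference in temperament: rho = 1 releases the same slice every week, while rho = 3 holds most of the budget back for the final look.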

Which spending function you choose changes the personality of the test. O'Brien-Fleming makes early stopping hard but barely penalises the final look — good when you'd rather run the full duration unless something is overwhelming. Pocock makes every look roughly equal, which trades a slightly larger final-look penalty for faster early stops.

Benchmark

Approximate stopping thresholds (two-sided z-statistic) for a 5-look group sequential test at alpha = 0.05

Look #   Information fraction   O'Brien-Fleming z   Pocock z   Fixed-horizon z
1        0.20                   4.56                2.41       1.96
2        0.40                   3.23                2.41       1.96
3        0.60                   2.63                2.41       1.96
4        0.80                   2.28                2.41       1.96
5        1.00                   2.04                2.41       1.96
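The O'Brien-Fleming column follows a simple pattern: each threshold is the final-look threshold divided by the square root of the information fraction. The sketch below rebuilds the column from that rule and converts thresholds to nominal two-sided p-values. The constants 2.04 and 2.41 are taken from the table above, not derived here; computing them from scratch requires numerical integration over the joint distribution of the looks.

```python
from statistics import NormalDist

FINAL_OBF_Z = 2.04  # final-look O'Brien-Fleming threshold, from the table
POCOCK_Z = 2.41     # constant Pocock threshold, from the table


def obrien_fleming_z(fractions, final_z=FINAL_OBF_Z):
    """O'Brien-Fleming-shaped boundary: z_k = final_z / sqrt(fraction_k)."""
    return [final_z / f ** 0.5 for f in fractions]


def two_sided_p(z):
    """Nominal two-sided p-value that corresponds to a z threshold."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
for f, z in zip(fractions, obrien_fleming_z(fractions)):
    print(f"fraction {f:.2f}: OBF z {z:.2f} (p < {two_sided_p(z):.6f}), "
          f"Pocock z {POCOCK_Z:.2f} (p < {two_sided_p(POCOCK_Z):.4f})")
```

Reading the thresholds as p-values makes the contrast concrete: the first O'Brien-Fleming look demands an extremely small p-value, while Pocock asks for the same moderately strict p-value at every look.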

The cost of buying yourself permission to peek is a modest hit to sensitivity. With five looks, a Pocock test typically needs about 20-30% more samples than a fixed-horizon test to reach the same statistical power if it runs to the end, and an O'Brien-Fleming test only a few percent more (about 3%). Most CRO teams happily trade that for the option to stop a clearly-winning checkout variant in week two instead of waiting four.

Frequently asked

Sequential testing FAQ

You can check a fixed-horizon test before it ends, but you can't make stop/continue decisions based on what you see without inflating your false-positive rate. If you peek daily on a 14-day fixed-horizon test at 95% confidence and stop at the first significant result, your effective false-positive rate climbs to roughly 20-30%. Sequential designs are the correct fix; promising not to peek is the wrong one.

Always-valid p-values (the approach behind Optimizely's Stats Engine, which builds on the mSPRT) are a flavour of sequential testing where the p-value is constructed to stay valid at any sample size, so you can stop whenever you want. Group sequential testing instead pre-specifies a finite number of look times with adjusted thresholds at each one. Both belong to the same family of statistical analysis methods.
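For intuition, here is a minimal sketch of an always-valid p-value built from a normal-mixture SPRT. It is a simplified one-sample version under loud assumptions: observations are normal with known standard deviation sigma, the null mean is theta0, and alternatives are mixed over a normal prior with standard deviation tau. The function name and the default sigma and tau are illustrative, and this is not a description of any vendor's implementation.

```python
import math


def always_valid_p_values(observations, theta0=0.0, sigma=1.0, tau=1.0):
    """Running always-valid p-values from a normal-mixture SPRT (sketch)."""
    p = 1.0
    total = 0.0  # running sum of (x - theta0)
    p_values = []
    for n, x in enumerate(observations, start=1):
        total += x - theta0
        var_sum = sigma ** 2 + n * tau ** 2
        # Log of the mixture likelihood ratio Lambda_n against the null.
        log_lr = 0.5 * math.log(sigma ** 2 / var_sum) + (
            tau ** 2 * total ** 2 / (2 * sigma ** 2 * var_sum)
        )
        # The p-value never increases: min of its past value and 1 / Lambda_n.
        p = min(p, math.exp(-log_lr))
        p_values.append(p)
    return p_values


# Hypothetical per-observation differences between variant and control.
differences = [0.4, 0.9, -0.1, 0.7, 0.6, 0.8, 0.3, 1.1, 0.5, 0.9]
for n, p in enumerate(always_valid_p_values(differences), start=1):
    print(n, round(p, 4))
```

Because the running p-value can only go down, you can compare it to alpha after every observation and stop the first time it dips below, without a pre-set schedule of looks.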

Bayesian posterior probabilities don't suffer from peeking the same way frequentist p-values do, so 'sequential testing' as a frequentist correction isn't strictly needed. But you still need a stopping rule that controls decision quality, often expected loss falling below a threshold. The practical effect — being able to stop early — is the same.
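An expected-loss stopping rule can be sketched in a few lines. This example assumes binary conversions, uniform Beta(1, 1) priors on each arm, and a Monte Carlo estimate; the function name, the sample counts, and the 0.0005 threshold are all hypothetical choices for illustration.

```python
import random


def expected_loss(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo expected loss of shipping A or B under Beta(1, 1) priors."""
    rng = random.Random(seed)
    loss_a = loss_b = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        loss_a += max(rate_b - rate_a, 0.0)  # regret if we ship A but B is better
        loss_b += max(rate_a - rate_b, 0.0)  # regret if we ship B but A is better
    return loss_a / draws, loss_b / draws


LOSS_THRESHOLD = 0.0005  # hypothetical: tolerate 0.05 points of conversion rate
loss_a, loss_b = expected_loss(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
if min(loss_a, loss_b) < LOSS_THRESHOLD:
    print("stop and ship", "A" if loss_a < loss_b else "B")
else:
    print("keep collecting data")
```

The threshold plays the role alpha plays in a frequentist design: it is the amount of conversion rate you are willing to risk losing by stopping now.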

Use O'Brien-Fleming when you want to discourage early stops and only halt on dramatic effects — it preserves nearly the full alpha for the final look. Use Pocock when you genuinely want the option to stop early and don't mind a slightly stricter final-look threshold. O'Brien-Fleming is the safer default for most CRO programs.

Between 4 and 10 looks is the practical sweet spot. Fewer than 4 and you lose most of the early-stopping benefit; more than 10 and the boundaries get punitively strict at every look without adding much flexibility. Many teams use weekly looks over an 8-week max horizon.

If both tests run to their planned end with no early stop, a sequential test does need more samples than a fixed-horizon one: Pocock costs ~20-30% more, O'Brien-Fleming only a few percent more. But sequential tests usually finish earlier on real winners, so average sample size across a portfolio of tests is typically lower, not higher.

Switching a running fixed-horizon test to a sequential design is valid only if you haven't already made decisions based on peeks. The boundaries assume a pre-specified analysis plan. If you've been informally watching the dashboard and almost stopped twice, retrofitting a sequential design doesn't fully fix the inflated error rate. Pre-register the plan before you start.

Sequential testing is not the same thing as a multi-armed bandit. Sequential testing is about deciding when to stop a fixed comparison; bandits are about dynamically reallocating traffic to better-performing variants while learning. Bandits optimise reward during the test; sequential tests optimise the speed and validity of a stop/ship decision. They solve different problems.

To size a sequential test, calculate the fixed-horizon sample size for your minimum detectable effect, then multiply by an inflation factor: roughly 1.03 for O'Brien-Fleming with 5 looks, 1.21 for Pocock with 5 looks. That gives the maximum sample size you might need; you'll often stop well before it.
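As a rough sketch of that calculation, the code below uses the standard normal-approximation formula for a two-sided test of two conversion rates and then applies an inflation factor you supply. The function names and the 4% to 5% conversion rates are illustrative assumptions, and the formula is an approximation, not the output of a dedicated power-analysis tool.

```python
import math
from statistics import NormalDist


def fixed_horizon_n_per_arm(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def max_sequential_n_per_arm(fixed_n, inflation_factor):
    """Maximum per-arm sample size once the design's inflation factor is applied."""
    return math.ceil(fixed_n * inflation_factor)


fixed_n = fixed_horizon_n_per_arm(p_control=0.04, p_variant=0.05)
pocock_max = max_sequential_n_per_arm(fixed_n, inflation_factor=1.21)
print("fixed-horizon n per arm:", fixed_n)
print("Pocock (5 looks) maximum n per arm:", pocock_max)
```

The inflated figure is a ceiling for planning traffic and runtime, not the sample you should expect to use on a test with a real effect.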

You still need a minimum runtime with a sequential test. Sequential testing controls statistical error from peeking, but it can't fix bias from running only on Mondays or only during a sale. Set a minimum runtime (typically one full week, often two) before early stopping is allowed, so weekday and weekend buyers are both represented.
