Bayesian Testing

Metricuno
May 19, 2026
Quick answer

Bayesian testing reports the probability that variant B beats A and updates as data arrives — letting you stop early without inflating false positives the way frequentist peeking does.

Definition

An A/B testing framework that reports the posterior probability a variant beats control, updated continuously as new data arrives.

Bayesian testing is an approach to experiment analysis that combines a prior belief about conversion rates with observed data to produce a posterior probability — most often phrased as 'P(B > A)' or 'probability to be best'. Instead of asking 'how surprising would this result be if there were no real effect?' (the frequentist p-value question), it answers the question the reader actually wants: 'given what I've seen, how likely is variant B to win?'

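Because each posterior is a valid probability statement conditional on the data observed so far, its interpretation does not depend on how many times you have looked. A p-value is different: its false-positive guarantee assumes a single, pre-planned analysis, which is why repeated peeking inflates frequentist error rates. The trade-off: you must pick a prior, and you interpret expected loss and credible intervals instead of significance thresholds.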

Also known as
Bayesian A/B testing
Posterior probability testing

The frequentist alternative — t-tests, z-tests, p-values — assumes a fixed sample size and treats each look at the data as a separate hypothesis test. Peek ten times and your real false-positive rate balloons well past the nominal 5%. Bayesian testing sidesteps this because the posterior is just a current belief, not a repeated-trials statement.
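The peeking problem described above can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not a description of any particular tool: it models an A/A test (no real effect) in which each new batch of traffic contributes an independent standard-normal z-score, and the analyst runs a two-sided test at α = 0.05 at every look. The function name `false_positive_rate` and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def false_positive_rate(looks: int, experiments: int = 20_000,
                        alpha: float = 0.05, seed: int = 0) -> float:
    """Share of no-effect experiments declared significant at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    hits = 0
    for _ in range(experiments):
        running_sum = 0.0
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # new batch, no true effect
            z = running_sum / k ** 0.5          # z-score on all data so far
            if abs(z) > z_crit:                 # analyst stops and ships
                hits += 1
                break
    return hits / experiments


single_look = false_positive_rate(looks=1)
ten_looks = false_positive_rate(looks=10)
print(f"1 look:   {single_look:.3f}")
print(f"10 looks: {ten_looks:.3f}")
```

With a single look the rate should sit near the nominal 5%; with ten looks it comes out well above that, which is the inflation the paragraph refers to.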

In practice this matters for online stores running short cycles. You can ship a variant at 95% probability-to-be-best after one weekend of traffic without violating the math, whereas a fixed-horizon frequentist test would still demand the pre-registered sample. The cost is conceptual overhead: stakeholders need to understand that 'P(B > A) = 92%' is a direct probability statement about the variants, not a 92% confidence level.

Formula

P(θ_A, θ_B | data) ∝ P(data | θ_A, θ_B) × P(θ_A) × P(θ_B)

P(θ_B > θ_A | data) = ∬ 1[θ_B > θ_A] × P(θ_A, θ_B | data) dθ_A dθ_B

Variables

θ_A

Control conversion rate

The unknown true conversion rate for variant A, modelled as a probability distribution.

θ_B

Variant conversion rate

The unknown true conversion rate for variant B, modelled as a probability distribution.

P(θ)

Prior

Your belief about conversion rates before seeing data — typically a Beta(α, β) distribution for binary conversion events.

P(data | θ)

Likelihood

How probable the observed conversions are given a hypothesised conversion rate.

P(θ | data)

Posterior

Updated belief about the conversion rate after combining prior and data — the basis for P(B > A).
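For binary conversion events with a Beta prior, the posterior has a closed form: add the observed conversions to α and the non-converting sessions to β. The sketch below shows that update; the `BetaDist` class and the traffic numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaDist:
    """Beta(alpha, beta) belief about a conversion rate."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def update(self, conversions: int, sessions: int) -> "BetaDist":
        """Conjugate update: successes go to alpha, failures to beta."""
        if not 0 <= conversions <= sessions:
            raise ValueError("conversions must be between 0 and sessions")
        return BetaDist(self.alpha + conversions,
                        self.beta + (sessions - conversions))


prior = BetaDist(1, 1)  # flat prior over conversion rates
posterior = prior.update(conversions=30, sessions=1_000)  # hypothetical data
print(posterior, round(posterior.mean, 4))
```

Because the update is just addition, feeding the data in one batch or day by day gives the same posterior.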

Worked example

A Shopify apparel store tests a new product-page hero against control. After 7 days: control 8,400 sessions / 252 conversions (3.00%), variant 8,200 sessions / 287 conversions (3.50%). Using a weak Beta(1, 1) prior, the posterior for control is Beta(253, 8149) and for variant is Beta(288, 7914).

Control conversions / sessions: 252 / 8,400

Variant conversions / sessions: 287 / 8,200

Prior: Beta(1, 1) — uninformative

P(B > A) ≈ 96.8%, expected uplift ≈ +16.7%, expected loss of choosing B ≈ 0.02%

There's a 96.8% probability variant B has a higher true conversion rate than control, and the cost of being wrong is negligible. That is a reasonable basis to ship even though a frequentist test falls short here: a two-sided two-proportion z-test on the same counts gives p ≈ 0.07, above α = 0.05.
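One way to reproduce figures like these is Monte Carlo sampling from the two Beta posteriors. The sketch below is a minimal version under stated assumptions: it uses the counts from the worked example with a Beta(1, 1) prior, and it reports expected loss in absolute conversion-rate units (an assumption, since the unit is not specified above). The function name `compare_variants` is hypothetical. Sampling noise and rounding mean the printed values will be close to, but not identical to, the quoted ones.

```python
import random


def compare_variants(conv_a, n_a, conv_b, n_b,
                     prior=(1.0, 1.0), draws=50_000, seed=42):
    """Monte Carlo summary of two Beta posteriors (A = control, B = variant)."""
    rng = random.Random(seed)
    alpha0, beta0 = prior
    b_wins = 0
    loss_b = 0.0  # shortfall if we ship B but A is truly better
    loss_a = 0.0  # shortfall if we keep A but B is truly better
    for _ in range(draws):
        theta_a = rng.betavariate(alpha0 + conv_a, beta0 + n_a - conv_a)
        theta_b = rng.betavariate(alpha0 + conv_b, beta0 + n_b - conv_b)
        b_wins += theta_b > theta_a
        loss_b += max(theta_a - theta_b, 0.0)
        loss_a += max(theta_b - theta_a, 0.0)
    return {
        "p_b_beats_a": b_wins / draws,
        "expected_loss_b": loss_b / draws,
        "expected_loss_a": loss_a / draws,
    }


result = compare_variants(252, 8_400, 287, 8_200)
for name, value in result.items():
    print(f"{name}: {value:.5f}")
```

Computing the loss for both choices makes the asymmetry visible: shipping B risks very little, while keeping A gives up most of the observed difference on average.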

Pair P(B > A) with expected loss to decide. Probability-to-be-best tells you which variant is likely better; expected loss tells you how bad it would be to pick the loser. A common stopping rule: ship when P(B > A) > 95% AND expected loss of the chosen variant < 0.1% of baseline conversion rate.
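The stopping rule above can be written as a small decision function. This is a sketch, assuming expected loss is measured in absolute conversion-rate units and the tolerance is a fraction of the baseline rate; the name `should_ship`, its parameters, and the example inputs are hypothetical.

```python
def should_ship(p_b_beats_a: float, expected_loss: float,
                baseline_rate: float, prob_threshold: float = 0.95,
                loss_tolerance: float = 0.001) -> bool:
    """Ship only if both the probability and the loss criteria pass.

    loss_tolerance is relative: 0.001 means 0.1% of the baseline rate.
    """
    max_loss = loss_tolerance * baseline_rate
    return p_b_beats_a > prob_threshold and expected_loss < max_loss


# Hypothetical readings on a 3% baseline.
print(should_ship(0.97, 0.00002, baseline_rate=0.03))  # both criteria pass
print(should_ship(0.97, 0.00010, baseline_rate=0.03))  # loss too high
print(should_ship(0.90, 0.00001, baseline_rate=0.03))  # probability too low
```

Requiring both conditions guards against shipping a variant that is probably better but carries a costly downside, or one with a tiny downside but weak evidence.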

Benchmark

How Bayesian and frequentist testing behave on the same e-commerce experiment

Behavior | Bayesian | Frequentist (fixed-horizon)
--- | --- | ---
Output | P(B > A), expected loss, credible interval | p-value, confidence interval
Peeking allowed? | Yes: posterior updates are self-contained | No: inflates Type I error past nominal α
Stopping rule | P(B > A) > 95% and expected loss < threshold | Reach pre-registered sample size
Typical time to decision (3% baseline, +10% MDE) | 10-14 days | 16-21 days
Requires a prior? | Yes: weak/uninformative is fine | No
Handles small samples | Gracefully (prior regularises) | Poorly (relies on asymptotic approximations)
Stakeholder interpretation | Direct: 'B is 96% likely to win' | Indirect: 'p=0.03 under H0'

Most modern experimentation tools — including Metricuno — default to Bayesian reporting because it matches how operators actually make decisions. You still need guardrails: don't ship on day one with 30 conversions per arm just because P(B > A) clears 95%, and document your prior so results stay reproducible.

Frequently asked

Bayesian testing FAQ

Frequentist testing asks how surprising your data would be if there were no real effect, expressed as a p-value with a fixed sample size. Bayesian testing combines a prior with the data to give a direct probability that one variant beats another, and can be checked at any time. The frameworks usually agree at large sample sizes; they diverge most when samples are small or you peek often.

Peeking is safe in the sense that each posterior is a valid probability statement on its own, so there's no p-value inflation. But peeking still tempts you to stop early on noise. Best practice is to combine P(B > A) with an expected-loss threshold and a minimum sample size so a single lucky day doesn't end the test prematurely.

For checkout or product-page conversion, a weak Beta(1, 1) or Beta(2, 50) prior works fine — it lets the data dominate quickly. If you have strong historical data (say, 2 years of GA4 import showing a stable 2.8% baseline), an informative prior centered on that rate tightens credible intervals and shortens tests, but you should pre-register the choice.
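To see how quickly the data dominates a weak prior, compare posterior means under different priors at small and large sample sizes. In the sketch below, Beta(28, 972) is a hypothetical informative prior centered on 2.8% and worth roughly 1,000 sessions of evidence; that strength is an assumption, as are the two data sets.

```python
def posterior_mean(prior, conversions, sessions):
    """Mean of the Beta posterior after a conjugate update."""
    alpha, beta = prior
    return (alpha + conversions) / (alpha + beta + sessions)


priors = {
    "Beta(1, 1)": (1, 1),
    "Beta(2, 50)": (2, 50),
    "Beta(28, 972)": (28, 972),  # hypothetical informative prior, mean 2.8%
}

# Hypothetical data: 1 conversion in 20 sessions vs 252 in 8,400.
small = {name: posterior_mean(p, 1, 20) for name, p in priors.items()}
large = {name: posterior_mean(p, 252, 8_400) for name, p in priors.items()}

for name in priors:
    print(f"{name:14s} small: {small[name]:.4f}  large: {large[name]:.4f}")
```

With 20 sessions the three priors give noticeably different estimates; with 8,400 sessions they nearly agree, which is why the prior choice matters most early in a test.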

Run for at least one full business cycle (usually 7 or 14 days) to cover weekday/weekend mix and traffic-source variation. Then ship once P(B > A) > 95% and expected loss is below your tolerance — typically 0.05-0.1% of baseline CVR. For a store on a 3% baseline, that's usually 8,000-15,000 sessions per arm.

Bayesian tests reach a decision slightly faster on average, and the gap widens when you'd otherwise need to pre-register a conservative sample size. The bigger gain is flexibility: you stop when the evidence is sufficient, not when an arbitrary horizon is reached. Expect 20-30% faster decisions on typical e-commerce experiments.

Expected loss is the conversion rate you would give up, on average, by shipping a variant that turns out to be worse: the shortfall max(θ_other − θ_chosen, 0) averaged over the posterior. P(B > A) of 96% sounds great until you realise the 4% downside scenario could cost you 8% of revenue. Expected loss combines the probability of being wrong and the size of the mistake into a single 'how bad is being wrong' figure.

Bayesian testing extends to more than two variants: you compute 'probability to be best' across all arms simultaneously, and expected loss for each. Multi-armed bandits use the same machinery to dynamically allocate traffic toward the leading variant. For standard A/B/n tests, just make sure each arm gets enough traffic before reading results.
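For an A/B/n test, 'probability to be best' can be estimated by drawing one conversion rate per arm from its posterior and counting how often each arm comes out on top. The sketch below assumes a Beta(1, 1) prior on every arm; the arm names, counts, and the function `prob_to_be_best` are hypothetical.

```python
import random


def prob_to_be_best(arms, prior=(1.0, 1.0), draws=20_000, seed=7):
    """arms maps name -> (conversions, sessions); returns name -> P(best)."""
    rng = random.Random(seed)
    alpha0, beta0 = prior
    wins = {name: 0 for name in arms}
    for _ in range(draws):
        samples = {
            name: rng.betavariate(alpha0 + conv, beta0 + n - conv)
            for name, (conv, n) in arms.items()
        }
        wins[max(samples, key=samples.get)] += 1  # arm with the top draw
    return {name: count / draws for name, count in wins.items()}


arms = {
    "control": (300, 10_000),
    "variant_b": (330, 10_000),
    "variant_c": (360, 10_000),
}
best = prob_to_be_best(arms)
for name, p in best.items():
    print(f"{name}: {p:.3f}")
```

The probabilities sum to one across arms, so adding more variants spreads the total and usually means each arm needs more traffic before any of them clears a 95% threshold.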

Phrase it as a betting odds statement: 'There's a 96% chance the new checkout converts better than the current one, and if we're wrong, we lose about 0.03% of conversions.' Avoid mixing it with p-value language — calling it a 'confidence level' will create exactly the misunderstanding you're trying to avoid.

VWO uses Bayesian stats by default (their SmartStats engine), Optimizely uses sequential frequentist (Stats Engine), and Google Optimize used Bayesian before it was sunset in 2023. Metricuno reports Bayesian P(B > A) and expected loss alongside frequentist p-values, so you can use whichever framework your team is comfortable with.

If your organisation has regulatory or scientific-publication requirements that mandate p-values (rare in DTC, common in pharma or academia), stick with frequentist. Also avoid Bayesian if you can't commit to documenting and defending your prior — undisclosed priors are the easiest way for results to look more conclusive than they are.
