What is A/B Testing

Metricuno

May 19, 2026

4 min read

Quick answer

A plain-English definition of A/B testing for online stores, with the underlying statistics, realistic sample-size benchmarks, and answers to the questions people actually ask.

Definition

Experimentation

A/B Testing

A/B testing splits traffic between two variants of a page or element to measure which one performs better against a defined goal.

A/B testing (also called split testing) is a controlled experiment where visitors are randomly assigned to one of two variants — the original (control) and a modified version (treatment) — so you can attribute any difference in conversion rate, revenue per visitor, or another KPI directly to the change you made.

The discipline matters because it converts conversion rate optimisation (CRO) from an opinion contest into evidence. Instead of arguing about whether a sticky add-to-cart bar will help, you ship both versions to live traffic, wait until each has reached the sample size you planned for, and let the significance test decide. Used well, it compounds: each shipped winner becomes the new baseline for the next test.

Also known as

Split testing

Online controlled experiment

Bucket testing

Mechanically, an A/B test does three things at once: it randomises visitors into buckets so the two groups are comparable, it tracks a primary metric (usually conversion rate or revenue per visitor) for each bucket, and it applies a statistical test to decide whether the gap between them is real or just noise.
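The three mechanics can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular testing tool works: the function names, the hash-based bucketing scheme, the pooled two-proportion z-test, and the visitor counts are all assumptions chosen for the example.

```python
import hashlib
from statistics import NormalDist


def assign_variant(visitor_id: str, experiment: str) -> str:
    """Deterministically bucket a visitor so repeat visits see the same variant."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    # SHA-256 output is close to uniform, so parity gives a roughly 50/50 split.
    return "control" if int(digest, 16) % 2 == 0 else "treatment"


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the gap between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # shared rate under "no difference"
    std_err = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 540 of 18,000 control visitors converted vs 630 of 18,000.
z, p_value = two_proportion_z_test(540, 18_000, 630, 18_000)
print(assign_variant("visitor-123", "pdp-hero"), round(z, 2), round(p_value, 4))
```

Hashing the visitor ID together with the experiment name keeps assignment stable across visits without storing state, and gives each experiment an independent split.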

The reason teams formalise this rather than just eyeballing dashboards is that conversion rates are noisy. A 3.1% vs 2.9% gap on a Shopify product page can flip direction over a weekend. Without a sample-size target and a significance threshold, you'll ship losers and kill winners in roughly equal measure.
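To put a number on that noise, the sketch below asks how many standard errors a 2.9% vs 3.1% gap represents at different sample sizes. The visitor counts are hypothetical, the function name is illustrative, and the calculation assumes equal bucket sizes and a normal approximation.

```python
from math import sqrt


def z_score_for_gap(rate_a, rate_b, visitors_per_variant):
    """Gap between two observed rates, measured in standard errors."""
    pooled = (rate_a + rate_b) / 2  # valid because both buckets are the same size
    std_err = sqrt(pooled * (1 - pooled) * 2 / visitors_per_variant)
    return (rate_b - rate_a) / std_err


# A gap needs |z| of about 1.96 or more to count as significant at 95% confidence.
for n in (2_000, 20_000, 200_000):
    print(n, round(z_score_for_gap(0.029, 0.031, n), 2))
```

Under these assumptions the same 0.2-point gap sits well inside the noise at 2,000 or 20,000 visitors per variant and only clears the 1.96 threshold at around 200,000.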

Formula

n = 16 * p * (1 - p) / MDE^2

Variables

n: Sample size per variant. Visitors needed in each bucket to detect the effect at 95% confidence and 80% power.

p: Baseline conversion rate. Current conversion rate of the control, as a decimal (e.g. 0.03 for 3%).

MDE: Minimum detectable effect. Smallest absolute lift you want to be able to detect, as a decimal (e.g. 0.005 for half a percentage point).

Worked example

An apparel store wants to test a new product-page hero on a category currently converting at 3%, and wants to catch lifts of 0.5 percentage points or more.

Baseline conversion rate (p): 0.03

Minimum detectable effect (MDE): 0.005

→ n = 16 * 0.03 * 0.97 / 0.005^2 ≈ 18,624 visitors per variant

With ~3,000 visitors a week per variant, that's a little over six weeks of testing (18,624 / 3,000 ≈ 6.2). If the team only has two weeks of patience, they need a bigger expected lift, a higher-traffic page, or a lower confidence level.

The formula above is the rough rule-of-thumb version of a two-sample proportion test at 95% confidence and 80% power. It's good enough for planning — it tells you whether a test is feasible on your traffic before you spend two weeks running it. Use a full sample-size calculator for the actual go/no-go decision.
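As a rough check on where the 16 comes from, the sketch below puts the rule of thumb next to the normal-approximation formula it rounds. The function names are illustrative, and the second function makes a simplifying assumption: it uses the baseline variance for both variants, which a full calculator would not.

```python
from statistics import NormalDist


def rule_of_thumb_n(p: float, mde: float) -> float:
    """Per-variant sample size from n = 16 * p * (1 - p) / MDE^2."""
    return 16 * p * (1 - p) / mde ** 2


def normal_approx_n(p: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Same calculation with the constant derived from alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    # 2 * (1.96 + 0.84)^2 is roughly 15.7, which the rule of thumb rounds up to 16.
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde ** 2


print(round(rule_of_thumb_n(0.03, 0.005)))   # worked example: 3% baseline, 0.5-point MDE
print(round(normal_approx_n(0.03, 0.005)))
```

The two results differ by about 2%, which is why the shortcut is fine for feasibility planning but not a substitute for a proper calculator.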

Benchmark

Required weekly sessions per variant to finish an A/B test in 2-4 weeks, by baseline conversion rate and target lift

Baseline CR	Detect 5% relative lift	Detect 10% relative lift	Detect 20% relative lift
1.5% (cold traffic LP)	~210k / week	~52k / week	~13k / week
3% (apparel PDP)	~103k / week	~26k / week	~6.5k / week
5% (beauty PDP)	~61k / week	~15k / week	~3.8k / week
8% (checkout step)	~37k / week	~9.2k / week	~2.3k / week
12% (returning-visitor cart)	~24k / week	~6.0k / week	~1.5k / week

Two things to notice. First, the higher up the funnel you test, the more traffic you need — homepage tests are brutal because the conversion rate is low. Second, chasing small lifts (5% relative) is a luxury reserved for very high-traffic stores; under €5M revenue, most teams should be hunting for changes worth 10-20% on focused funnel steps.
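The benchmark figures can be approximately reproduced from the rule-of-thumb formula. The sketch below assumes a two-week test, which is the duration the table's numbers appear to match (a four-week run would need about half the weekly sessions). The function name and the `weeks` default are assumptions for illustration.

```python
def weekly_sessions_needed(baseline: float, relative_lift: float, weeks: int = 2) -> float:
    """Weekly sessions per variant needed to finish within `weeks`."""
    mde = baseline * relative_lift  # convert the relative lift to an absolute one
    n_per_variant = 16 * baseline * (1 - baseline) / mde ** 2
    return n_per_variant / weeks


for baseline in (0.015, 0.03, 0.05, 0.08, 0.12):
    row = [weekly_sessions_needed(baseline, lift) for lift in (0.05, 0.10, 0.20)]
    print(f"{baseline:.1%}", [f"{value / 1000:.1f}k" for value in row])
```

Because the required sample scales with 1 / MDE^2, halving the target lift quadruples the traffic needed, which is the pattern visible across each row.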

Frequently asked

Frequently asked questions about A/B testing

Q: How long should an A/B test run?

Long enough to hit your pre-calculated sample size and to cover at least one full business cycle. That usually means two weeks minimum, so weekday/weekend and payday patterns are represented. Stopping early on a 'significant' result is one of the most common causes of false winners.
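A minimal sketch of that rule: take the weeks needed to reach the sample size, round up to whole weeks so every weekday is equally represented, and never go below the two-week floor. The function name and the `min_weeks` parameter are illustrative assumptions.

```python
from math import ceil


def planned_duration_weeks(n_per_variant, weekly_visitors_per_variant, min_weeks=2):
    """Whole weeks to run: enough for the sample size, never less than min_weeks."""
    weeks_for_sample = ceil(n_per_variant / weekly_visitors_per_variant)
    return max(weeks_for_sample, min_weeks)


# Worked example from this page: 18,624 visitors needed at ~3,000 a week per variant.
print(planned_duration_weeks(18_624, 3_000))
# A high-traffic page that hits its sample in days still runs the two-week minimum.
print(planned_duration_weeks(5_000, 10_000))
```

Fixing the duration before launch, rather than checking for significance daily, is what guards against the early-stopping problem.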

Q: What's the difference between A/B testing and split URL testing?

A/B testing typically modifies elements on the same URL via JavaScript or server logic, while split URL testing sends traffic to entirely separate pages. Split URL is the right choice for radically different layouts or different tech stacks; A/B is faster to ship for component-level changes.

Q: How is A/B testing different from multivariate testing?

A/B testing compares two complete variants. Multivariate testing (MVT) tests multiple element combinations at once (say, three headlines × two hero images = six variants) to isolate which element drives the lift. Traffic is split across every combination, so the traffic MVT needs grows roughly in proportion to the number of combinations, and most stores under €10M revenue should stick to A/B.

Q: What statistical significance level should I use?

95% confidence (p < 0.05) is the e-commerce default and pairs well with 80% statistical power. Lowering to 90% cuts the required sample size by roughly 20% (at 80% power) but doubles your false-positive rate from 5% to 10%. That trade is fine for low-risk UI tweaks and risky for checkout or pricing changes.
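To see how the confidence level feeds into required traffic, the sketch below compares the sample-size multiplier at 90% and 95% confidence under the standard normal approximation with 80% power. The function name is illustrative, and the approximation ignores continuity corrections.

```python
from statistics import NormalDist


def sample_size_multiplier(confidence: float, power: float = 0.80) -> float:
    """The (z_alpha + z_beta)^2 term that required sample size is proportional to."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2


ratio = sample_size_multiplier(0.90) / sample_size_multiplier(0.95)
print(round(ratio, 2))
```

Under these assumptions the ratio comes out near 0.79, so dropping to 90% confidence saves about a fifth of the visitors while the false-positive rate goes from 5% to 10%.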

Q: Does an A/B test slow down my Shopify store?

It depends on the tool. Older client-side platforms inject a render-blocking script that can add 200-600ms to LCP and cause flicker. Lightweight modern snippets (or edge/server-side rendering) keep the impact under 50ms. Measure with Lighthouse before and after install.

Q: Can I A/B test with low traffic?

Yes, but be honest about what you can detect. Under ~5,000 weekly sessions per variant you'll only reliably catch 20%+ lifts. Focus on bottom-of-funnel pages (checkout, cart, PDP) where the baseline conversion rate is higher and big swings are more likely.
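One way to be honest about it is to invert the rule-of-thumb formula and ask what lift your traffic can actually detect. The sketch below assumes n = 16 * p * (1 - p) / MDE^2 at 95% confidence and 80% power; the function name and the example inputs are illustrative.

```python
from math import sqrt


def detectable_relative_lift(baseline, weekly_sessions_per_variant, weeks):
    """Smallest relative lift detectable with the traffic available."""
    n = weekly_sessions_per_variant * weeks
    mde_absolute = sqrt(16 * baseline * (1 - baseline) / n)
    return mde_absolute / baseline


# Hypothetical store: 5,000 weekly sessions per variant, two-week test.
print(f"{detectable_relative_lift(0.03, 5_000, 2):.0%}")  # 3% baseline product page
print(f"{detectable_relative_lift(0.08, 5_000, 2):.0%}")  # 8% baseline checkout step
```

With the same traffic, the higher-baseline checkout step can detect a noticeably smaller relative lift than the product page, which is the reason to favour bottom-of-funnel pages.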

Q: What should I test first?

Pages with the highest revenue-at-risk and the clearest qualitative signal. For most online stores that's the product page or the first checkout step. Use heatmap and session-replay data to find a real friction point, then design a test around fixing it rather than around copy preference.

Q: How many A/B tests fail?

Industry data suggests 70-85% of A/B tests either show no statistically significant winner or produce a clear loser. That's normal and expected: the value of testing lies in avoiding the losers you would otherwise have shipped, not just in shipping winners.

Q: Do I need to A/B test if I have a strong hypothesis?

Yes, especially then. Strong hypotheses based on user research, drop-off data, or competitor analysis are exactly the ones worth validating with real traffic, because they're the ones you're tempted to ship without proof. Testing protects you from a confident but wrong hypothesis.

Q: What's the difference between A/B testing and A/B/n testing?

A/B/n testing is the same methodology with more than one treatment variant, for example one control and three new hero designs. The trade-off: each extra variant either lengthens the test or reduces your power to detect a winner, since in that example traffic is split four ways instead of two.
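The traffic-splitting cost can be sketched as follows. This assumes an even split across variants and the same per-variant sample size as a two-variant test; it deliberately leaves out any multiple-comparison correction, which would push the requirement higher. The function name and traffic figures are hypothetical.

```python
from math import ceil


def abn_duration_weeks(n_per_variant, total_weekly_visitors, num_variants):
    """Whole weeks needed when total traffic is split evenly across all variants."""
    weekly_per_variant = total_weekly_visitors / num_variants
    return ceil(n_per_variant / weekly_per_variant)


# Hypothetical page with 6,000 weekly visitors and 18,624 needed per variant.
print(abn_duration_weeks(18_624, 6_000, 2))  # control + one treatment
print(abn_duration_weeks(18_624, 6_000, 4))  # control + three treatments
```

Going from two to four variants roughly doubles the run time on the same page, so extra variants are best reserved for pages with traffic to spare.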
