Ecommerce Experimentation

Metricuno
May 20, 2026
Quick answer

A practical reference on running A/B tests on an ecommerce site: platform friction, traffic realities, and the test designs that survive a real purchase cycle.

Definition
Conversion Rate Optimization

Ecommerce Experimentation

Running controlled A/B (and multivariate) tests on an online store to measure how design, copy, and UX changes affect revenue per visitor.

Ecommerce experimentation is the discipline of testing changes on a live storefront — PDP layout, checkout copy, shipping thresholds, upsell modules — by splitting traffic between a control and one or more variants, then measuring the lift on a commercial metric.

It differs from generic A/B testing in three ways: the primary metric is almost always revenue per visitor (not conversion rate in isolation), traffic volumes are modest compared to ad-tech or SaaS marketing sites, and the testing surface is constrained by the platform — Shopify themes, WooCommerce hooks, or a headless front end each impose their own limits on what you can change and how fast.
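As a rough illustration of the primary metric, the sketch below computes revenue per visitor for each arm and the relative lift between them. The function names and all figures are hypothetical and chosen only to show the arithmetic; they are not taken from any real store.

```python
def revenue_per_visitor(revenue: float, visitors: int) -> float:
    """Total revenue divided by all visitors in the arm, buyers or not."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return revenue / visitors


def relative_lift(control: float, variant: float) -> float:
    """Relative change of the variant over the control, as a decimal."""
    if control == 0:
        raise ValueError("control value must be non-zero")
    return (variant - control) / control


# Hypothetical figures, for illustration only.
control_rpv = revenue_per_visitor(revenue=52_000.0, visitors=40_000)
variant_rpv = revenue_per_visitor(revenue=54_600.0, visitors=40_000)
lift = relative_lift(control_rpv, variant_rpv)
print(f"control RPV {control_rpv:.3f}, variant RPV {variant_rpv:.3f}, lift {lift:+.1%}")
```

Because the denominator is every visitor, the metric moves with both conversion rate and order value, which is why it is preferred here over conversion rate in isolation.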

Also known as
Online store A/B testing
Retail experimentation
DTC testing

Most stores in the €1M–€15M revenue band sit in an awkward middle: enough traffic to run meaningful tests, not enough to run dozens in parallel. A typical Shopify apparel store with 80k monthly sessions and a 2.4% conversion rate has roughly one test slot per page template at a time — try to run three concurrent tests on the PDP and you'll spend the quarter chasing noise.

This is why ecommerce experimentation lives downstream of broader ecommerce CRO: the prioritisation framework, hypothesis quality, and analytics setup decide whether the test queue is worth running at all. A fast test that ships the wrong hypothesis is just expensive randomness.

Formula

n_per_variant = 16 * p * (1 - p) / (p * mde)^2

Variables

n_per_variant: Sample size per variant. Visitors needed in each arm (control and variant) to detect the effect.

p: Baseline conversion rate. Current conversion rate as a decimal (e.g. 0.024 for 2.4%).

mde: Minimum detectable effect. Smallest relative lift you want to detect, as a decimal (e.g. 0.10 for +10%).

Worked example

A Shopify apparel store wants to test a new PDP image gallery. Baseline conversion is 2.4%, and the team wants to reliably detect a +10% relative lift (2.4% → 2.64%) at 80% power and 95% confidence, the settings that the constant 16 in the formula approximates.

Baseline conversion rate (p): 0.024

Minimum detectable effect (mde): 0.10 (relative)

n_per_variant = 16 * 0.024 * 0.976 / (0.024 * 0.10)^2 ≈ 65,100

~65,100 sessions per variant, roughly 130,100 sessions total.

At 80k monthly sessions split 50/50, that's roughly a 7-week test, and longer if only part of the traffic reaches the PDP. If the team wants results in 2 weeks, they need either a larger MDE (test bigger changes), more traffic, or to accept lower confidence.
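The formula and the duration estimate can be sketched in a few lines of Python. This is a minimal sketch that applies the rule-of-thumb formula exactly as written above. The helper `weeks_to_complete` and its `share_in_test` parameter are assumptions added for illustration, to model the share of sessions that actually enter the experiment.

```python
import math


def sample_size_per_variant(p: float, mde: float) -> int:
    """Rule-of-thumb sample size per arm: 16 * p * (1 - p) / (p * mde)^2.

    p   -- baseline conversion rate as a decimal (0.024 for 2.4%)
    mde -- minimum detectable relative lift as a decimal (0.10 for +10%)
    The constant 16 approximates 80% power at 95% confidence.
    """
    if not 0 < p < 1:
        raise ValueError("p must be between 0 and 1")
    if mde <= 0:
        raise ValueError("mde must be positive")
    absolute_effect = p * mde  # relative lift converted to an absolute difference
    return math.ceil(16 * p * (1 - p) / absolute_effect ** 2)


def weeks_to_complete(n_per_variant: int, monthly_sessions: int,
                      variants: int = 2, share_in_test: float = 1.0) -> float:
    """Weeks needed to fill every arm (hypothetical helper).

    share_in_test is the fraction of sessions that enter the experiment,
    e.g. the share of sessions that reach the tested template.
    """
    weekly_sessions = monthly_sessions * share_in_test * 12 / 52
    return n_per_variant * variants / weekly_sessions


n = sample_size_per_variant(p=0.024, mde=0.10)
print(n, round(weeks_to_complete(n, monthly_sessions=80_000), 1))
```

Doubling the MDE cuts the required sample to a quarter, since the effect size is squared in the denominator. That is the quickest lever for a store that needs shorter tests.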

The single biggest mistake at this revenue band is calling tests too early. A variant that's +18% after week one almost always regresses — the early sessions are dominated by your repeat buyers, who behave differently from cold paid traffic. Always pre-commit to a sample size and a stop date, and only call the result when both are hit.
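One way to make that pre-commitment hard to bypass is to encode it as a check that must pass before anyone reads the result. The sketch below is illustrative only. The function name and the example dates are assumptions, not part of any testing tool.

```python
from datetime import date


def ready_to_call(sessions_per_variant: list[int], required_per_variant: int,
                  today: date, stop_date: date) -> bool:
    """True only when every arm has hit the sample size AND the stop date has arrived."""
    enough_sample = all(n >= required_per_variant for n in sessions_per_variant)
    reached_stop_date = today >= stop_date
    return enough_sample and reached_stop_date


# Hypothetical test state: both arms are full, but the stop date is still ahead.
print(ready_to_call([66_000, 65_500], required_per_variant=65_067,
                    today=date(2026, 6, 10), stop_date=date(2026, 6, 15)))
```

Requiring both conditions means a test that fills up early still waits for its stop date, and a test that reaches the date short of its sample size is extended instead of being called.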

Benchmark

Realistic experimentation cadence by platform and vertical (stores in the €1M–€15M band)

Segment | Monthly sessions | Avg. conv. rate | Tests / quarter | Typical winner rate
Shopify - apparel | 60k–150k | 1.8–2.8% | 6–10 | 20–25%
Shopify - beauty / skincare | 80k–200k | 2.5–3.5% | 8–12 | 22–28%
WooCommerce - home goods | 40k–100k | 1.4–2.2% | 4–7 | 18–22%
Magento - electronics | 100k–300k | 1.2–1.8% | 5–8 | 15–20%
Headless (Shopify Hydrogen / Next) | 80k–250k | 2.0–3.0% | 4–6 | 20–25%

Notice the headless row: more traffic, fewer tests. That's the trade-off teams underestimate when they replatform — every variant ships through the dev queue instead of a no-code editor, so test velocity drops 30–50% in the first two quarters. If experimentation is a core KPI, factor that into the replatform business case.

Frequently asked

Ecommerce experimentation FAQ

Does my store have enough traffic to run A/B tests?

Roughly yes if you have 30k+ monthly sessions on the page you want to test and a baseline conversion rate above 1.5%. Below that, focus tests on the highest-traffic template (usually the PDP or cart) and accept longer cycles: 4 to 6 weeks per test is normal at this size.

Will an A/B testing tool slow down my store?

Most legacy tools add 80–250ms via a blocking script in the <head>. That's enough to push your Largest Contentful Paint into a worse Core Web Vitals bucket and cost you 3–7% of paid traffic conversions. Look for tools that run the snippet async or via an edge worker; the performance cost should be under 30ms.

How is ecommerce experimentation different from generic A/B testing?

Three things: revenue per visitor is the metric that matters (not click-through or signup), the purchase cycle means you often need to wait days for the conversion to actually fire, and platform constraints (Shopify checkout, theme structure) cap what you can change. Test designs that ignore those realities produce winners that don't replicate.

Can I A/B test the Shopify checkout?

Only on Shopify Plus, and only via Checkout Extensibility; the legacy checkout.liquid is locked on standard plans. Most teams test up to the cart, then test post-purchase pages (thank-you upsells), which are fully editable on every plan.

What is a healthy win rate?

20–25% of tests producing a statistically significant winner is healthy. If you're above 40%, you're probably calling tests too early or your hypotheses are too cosmetic. Below 10%, your hypothesis quality is the problem, not your tooling.

How long should a test run?

A minimum of two full business cycles (usually 14 days) to capture the weekly purchase pattern, and until you hit the pre-calculated sample size. Never call a test inside 7 days, even if it 'looks significant': early-week buyers and weekend buyers convert differently.

Should I analyse mobile and desktop separately?

Yes, almost always. Mobile and desktop behave like different stores: mobile carries 60–75% of sessions but converts at half the desktop rate, and a winner on one device often loses on the other. Pre-segment your analysis or run the variants device-specifically.
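A pre-segmented readout can be as simple as grouping sessions by device and arm before computing rates. The sketch below assumes a hypothetical export of `(device, arm, converted)` tuples. Real analytics exports will have a different shape, and the tiny sample is only there to show the grouping.

```python
from collections import defaultdict


def conversion_by_segment(sessions):
    """Return {(device, arm): conversion_rate} from (device, arm, converted) tuples."""
    counts = defaultdict(lambda: [0, 0])  # [sessions, conversions] per segment
    for device, arm, converted in sessions:
        counts[(device, arm)][0] += 1
        counts[(device, arm)][1] += int(converted)
    return {key: conv / total for key, (total, conv) in counts.items()}


# Hypothetical sessions, for illustration only.
sample = [
    ("mobile", "control", False), ("mobile", "control", True),
    ("mobile", "variant", False), ("mobile", "variant", False),
    ("desktop", "control", True), ("desktop", "control", False),
    ("desktop", "variant", True), ("desktop", "variant", True),
]
for segment, rate in sorted(conversion_by_segment(sample).items()):
    print(segment, f"{rate:.0%}")
```

Each device segment needs its own sample-size check, since splitting the traffic also splits the statistical power.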

How should on-site tests work alongside Klaviyo flows?

On-site tests and Klaviyo flow tests are separate systems and shouldn't share success metrics. Run the on-site variant cleanly, let Klaviyo's flow logic fire post-purchase, and analyse email lift inside Klaviyo. Mixing them in one report makes both look noisy.

Should I run tests during peak season?

Yes, but only tests that are pre-validated or extremely low-risk. Peak traffic is a great learning window, but the cost of a buggy variant is also 5–10x normal. Freeze new code 10 days before peak, and only run tests where the loser scenario is acceptable.

Should I use Bayesian or frequentist statistics?

For most stores at this revenue band, it doesn't matter: both produce the same calls 90% of the time on tests run to a proper sample size. Bayesian is easier to communicate to non-analysts ('72% chance B is better') but doesn't fix bad hypothesis quality or undersized tests.
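For readers curious where a statement like "72% chance B is better" comes from, here is a minimal sketch of one common approach: a Monte Carlo comparison of Beta posteriors. It assumes uniform Beta(1, 1) priors and treats conversion as a simple yes/no outcome per session. The function name and the input counts are hypothetical, and commercial tools may use different models.

```python
import random


def prob_variant_beats_control(conv_a: int, n_a: int, conv_b: int, n_b: int,
                               draws: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: 2.4% vs 2.6% conversion on 10,000 sessions per arm.
print(prob_variant_beats_control(240, 10_000, 260, 10_000))
```

The output is a probability, not a verdict. On an undersized test it drifts from day to day just as a p-value does, so the sample-size discipline applies under either approach.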
