How to use Segment Analysis

Q: How many segments can I safely analyse in one test?

Two to three pre-registered segments is the working ceiling. Beyond that the multiple-comparisons risk becomes material and you should apply a correction (Bonferroni for a small number of slices, Benjamini-Hochberg for larger families). Exploratory slices on top are fine if you label them as such.
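To make the two corrections concrete, here is a minimal sketch of both in plain Python. The segment names and p-values are hypothetical, chosen only to show where the two methods disagree; in practice you'd likely reach for a statistics library rather than hand-rolling this.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a slice only if its p-value clears alpha divided by the number of slices."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose sorted p-value satisfies p_(k) <= k/m * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is rejected.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for three pre-registered slices.
segment_p = {"mobile": 0.01, "desktop": 0.08, "returning": 0.03}
names = list(segment_p)
pvals = [segment_p[n] for n in names]

bonf = dict(zip(names, bonferroni(pvals)))
bh = dict(zip(names, benjamini_hochberg(pvals)))
```

With these inputs Bonferroni keeps only the mobile slice, while Benjamini-Hochberg also keeps the returning-visitor slice. That is the usual trade-off: Bonferroni is stricter, Benjamini-Hochberg gives up some strictness for power as the family grows.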

Q: What's the difference between segment analysis and personalisation?

Segment analysis tells you whether a single variant affects different audiences differently. Personalisation acts on that finding by serving different experiences to different segments. Segment analysis is the diagnostic; personalisation is one possible treatment, alongside 'ship to everyone' or 'ship to one segment only'.

Q: If mobile wins and desktop loses, should I ship to mobile only?

Often yes, but verify it in a confirmatory follow-up that targets mobile as the primary population. A segment-level finding inside a sitewide test is one data point. Re-running it as a mobile-only test gives you a clean, fully-powered read before you commit to a permanent split experience.

Q: How does segment analysis fit into broader experiment analysis?

It's the readout layer that sits underneath the sitewide effect. A complete experiment analysis covers: sitewide effect, pre-registered segments, guardrail metrics (revenue, AOV, bounce), and exploratory slices for hypothesis generation. Skipping segments means you accept whatever the average tells you.

Q: Do I need a Bonferroni correction for every segmented test?

Only when you're treating multiple segments as confirmatory. If you pre-register one primary effect and treat segments as exploratory, no correction is needed because you're not making multiple ship-decisions. The correction matters when each slice could independently trigger a rollout.

Q: Which segments tend to differ most in e-commerce tests?

Device class (mobile vs desktop) shows the largest dispersion in our experience, followed by new-vs-returning visitor and paid-vs-organic source. Geography matters when you ship localised content or pricing. Browser, OS, and hour-of-day rarely move the needle enough to be worth pre-registering.

Q: Can I segment by Klaviyo list or email cohort?

Yes, if you can join the experiment exposure data to the email identifier. This is common for returning-customer tests where a Klaviyo VIP segment behaves very differently from a cold-list re-engagement segment. Treat list-based slices like any other segment: pre-register them or label them exploratory.
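The join itself can be sketched as below. Everything here is an assumption for illustration: the field names, the list labels, and the idea that you have exported list membership as an email-to-segment mapping. It does not use any Klaviyo API.

```python
from collections import defaultdict

# Hypothetical exposure log: one row per visitor assigned to a variant.
exposures = [
    {"email": "Ana@example.com", "variant": "control", "converted": True},
    {"email": "ben@example.com", "variant": "treatment", "converted": False},
    {"email": "cy@example.com", "variant": "treatment", "converted": True},
    {"email": "dee@example.com", "variant": "control", "converted": False},
]

# Hypothetical list membership exported from the email platform.
list_membership = {
    "ana@example.com": "vip",
    "ben@example.com": "vip",
    "cy@example.com": "cold_reengagement",
}


def conversion_by_list_segment(exposures, list_membership):
    """Count visitors and conversions per (list segment, variant) pair."""
    counts = defaultdict(lambda: {"visitors": 0, "conversions": 0})
    for row in exposures:
        # Normalise the identifier so casing differences don't break the join.
        email = row["email"].strip().lower()
        segment = list_membership.get(email, "unmatched")
        key = (segment, row["variant"])
        counts[key]["visitors"] += 1
        counts[key]["conversions"] += int(row["converted"])
    return dict(counts)


summary = conversion_by_list_segment(exposures, list_membership)
```

Keeping an explicit "unmatched" bucket is a deliberate choice: if most exposed visitors fall into it, the list-based slice is too thin to read as anything but directional.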

Q: What sample size does each segment need?

Roughly the same as a standalone test for that minimum detectable effect. If sitewide needs 50,000 sessions per variant to detect a 3% lift at 80% power, a segment that's 25% of traffic needs four times the calendar duration to reach the same precision. Plan test length around your smallest pre-registered segment.
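A rough sketch of that arithmetic, using the standard normal-approximation sample-size formula for comparing two conversion rates. The baseline rate and daily traffic are hypothetical inputs you would replace with your own, and the sketch assumes the segment converts at the same baseline as the site, which is rarely exactly true.

```python
import math
from statistics import NormalDist


def sessions_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate sessions per variant for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def days_to_reach(sessions_needed, daily_sessions_per_variant, segment_share=1.0):
    """Calendar days until a segment with the given traffic share hits the target."""
    return sessions_needed / (daily_sessions_per_variant * segment_share)


# Hypothetical inputs: 5% baseline conversion, 3% relative lift, 1,000 sessions/day/variant.
needed = sessions_per_variant(baseline_rate=0.05, relative_lift=0.03)
sitewide_days = days_to_reach(needed, daily_sessions_per_variant=1000)
segment_days = days_to_reach(needed, daily_sessions_per_variant=1000, segment_share=0.25)
```

The per-variant requirement is identical for the sitewide read and the segment. Only the rate at which the segment accumulates sessions changes, which is where the fourfold duration for a 25% segment comes from.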

Q: Should I look at segments before the test ends?

No. Interim segment peeking compounds the multiple-comparisons problem with sequential-testing inflation. Wait until the test hits its pre-defined duration or sample-size stopping rule, then read all segments at once. If you must peek, use a sequential-testing framework (e.g. mSPRT, the mixture sequential probability ratio test) that accounts for it.
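For a sense of what such a framework computes, below is a minimal sketch of the normal-approximation form of mSPRT with a Gaussian mixing prior centred on zero effect. It assumes a known per-observation variance and equal sample sizes per arm; the function name and the example numbers are hypothetical, and a production implementation would need more care than this.

```python
import math


def msprt_always_valid_p(checks, sigma2, tau2):
    """Always-valid p-values for a two-arm test under a normal approximation.

    checks: iterable of (n_per_arm, observed_mean_difference), one per peek.
    sigma2: assumed known per-observation variance in each arm.
    tau2:   variance of the Gaussian mixing prior on the true difference.
    """
    p = 1.0
    p_values = []
    for n, diff in checks:
        v = 2 * sigma2  # variance of a single treatment-minus-control difference
        # Mixture likelihood ratio against the null of zero difference.
        likelihood_ratio = math.sqrt(v / (v + n * tau2)) * math.exp(
            (n ** 2) * tau2 * diff ** 2 / (2 * v * (v + n * tau2))
        )
        # The always-valid p-value can only decrease as evidence accumulates.
        p = min(p, 1 / likelihood_ratio)
        p_values.append(p)
    return p_values


# Hypothetical peeks: conversion near 5% (variance ~0.0475), prior sd of 1 point.
peeks = [(5000, 0.002), (10000, 0.006), (20000, 0.010)]
p_path = msprt_always_valid_p(peeks, sigma2=0.0475, tau2=0.0001)
```

The design point is that you can stop the first time the running p-value drops below your threshold without inflating the false-positive rate. It does not remove the multiple-comparisons issue across segments: each peeked slice still counts as its own comparison.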

Q: How do I report a segmented result to stakeholders?

Lead with the sitewide effect and its confidence interval. Then show the pre-registered segments as a small table with lift, CI, and sample share. Flag exploratory slices separately, with the caveat that they're hypothesis-generating. The ship-decision should reference only the confirmatory layer.
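One way to build the rows of that table is sketched below. The counts are hypothetical, and the interval is a plain normal-approximation CI on the absolute difference in conversion rate, which is one reasonable choice among several.

```python
import math
from statistics import NormalDist


def segment_row(name, control, variant, total_sessions, confidence=0.95):
    """Build one readout row. control and variant are (sessions, conversions)."""
    n_c, x_c = control
    n_v, x_v = variant
    p_c, p_v = x_c / n_c, x_v / n_v
    diff = p_v - p_c
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "segment": name,
        "relative_lift": diff / p_c,
        "diff": diff,
        "diff_ci": (diff - z * se, diff + z * se),
        "sample_share": (n_c + n_v) / total_sessions,
    }


# Hypothetical counts for one pre-registered slice in a 104,000-session test.
row = segment_row("mobile", control=(40000, 1200), variant=(40000, 1290),
                  total_sessions=104000)
```

Putting sample share next to the interval lets a stakeholder see at a glance why a thin segment has a wide CI, which pre-empts most "but tablet looks great" conversations.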

Metricuno

May 19, 2026

Quick answer

Sitewide-neutral A/B tests often hide large per-segment wins and losses. This guide shows how to run disciplined segment analysis without inflating false positives.

Definition

Segment Analysis

Breaking down A/B test results by traffic source, device, geography, or visitor cohort to find hidden wins and losses.

Segment analysis is the practice of splitting an experiment's results into sub-populations — mobile vs desktop, paid vs organic, new vs returning, EU vs US — to see whether the average effect masks a strong response in one group and a flat or negative response in another.

It is the most underused tool in experiment analysis and the easiest to abuse. Done well, it turns a flat sitewide test into a targeted rollout. Done badly, it manufactures statistically significant noise by slicing the data until something looks like a winner. The discipline is in deciding which segments to inspect before the test ships, and treating exploratory slices as hypotheses for the next test rather than verdicts on this one.

Also known as

subgroup analysis

segmented test results

cohort-level lift analysis

Most A/B tests are read at the aggregate level: one variant, one number, one verdict. That works when the effect is uniform across visitors. It rarely is. A checkout change that lifts mobile by 8% can drag desktop down by 3%, and the sitewide number then reads as a roughly 1% lift that you'd probably ship to everyone by mistake, desktop included.

Segment analysis is the corrective. It belongs in the same conversation as broader experiment analysis: not a separate exercise, but the part of the readout where you ask which visitors the effect actually applies to. The risk is that with enough segments, something will always look significant by chance — so the technique only works inside guardrails.

When sitewide-neutral hides a winner

The classic case is a test that comes back at +1.2% with p=0.42, which is statistically nothing. You're about to call it inconclusive. Then you split by device: mobile is +6.1% (p=0.01), desktop is -2.4% (p=0.08). The aggregate averaged away a clear mobile gain and a probable desktop loss pointing in opposite directions.
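The split read can be sketched with a pooled two-proportion z-test per segment. The counts below are hypothetical and are not the figures quoted above; they are chosen only so that the two devices move in opposite directions.

```python
import math
from statistics import NormalDist


def two_proportion_test(n_c, x_c, n_v, x_v):
    """Return (relative lift, two-sided p-value) for variant vs control."""
    p_c, p_v = x_c / n_c, x_v / n_v
    pooled = (x_c + x_v) / (n_c + n_v)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return (p_v - p_c) / p_c, p_value


# Hypothetical (control sessions, control conversions, variant sessions, variant conversions).
segments = {
    "mobile": (40000, 1200, 40000, 1290),
    "desktop": (12000, 600, 12000, 570),
}

results = {name: two_proportion_test(*counts) for name, counts in segments.items()}

# The sitewide read is the same test on the summed counts.
totals = [sum(col) for col in zip(*segments.values())]
sitewide = two_proportion_test(*totals)
```

The sitewide lift always lands between the segment lifts, pulled toward whichever segment contributes more conversions. That is the mechanism by which an aggregate can look unremarkable while the slices disagree.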

This pattern is common in apparel and beauty stores where mobile traffic shares are 70-80% but desktop drives a quarter of revenue at twice the AOV. A new product page layout optimised for thumb-reach can be a clear mobile win while breaking a desktop comparison flow that bigger spenders rely on.

The same logic applies to paid vs organic traffic, branded vs non-branded search, and new vs returning visitors. Each is a fundamentally different audience landing on the same page with different intent. Treating them as one population is a modelling choice — usually a bad one when the variance between segments is large.

The multiple-comparisons trap

If you test 20 segments at α=0.05, you should expect about one false-positive 'significant' segment on average even when the variant does nothing, and the chance of seeing at least one is roughly 64% if the slices are independent. Three slices (device, source, returning) is usually defensible. Twelve is fishing. Apply a Bonferroni or Benjamini-Hochberg correction when you go beyond your pre-registered slices.
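The arithmetic behind that warning is short enough to check yourself. This sketch assumes the segment tests are independent, which real slices usually are not (the same visitor sits in a device slice and a source slice), so treat the outputs as orientation rather than exact rates.

```python
def expected_false_positives(num_segments, alpha=0.05):
    """Average number of 'significant' slices when the variant truly does nothing."""
    return num_segments * alpha


def chance_of_any_false_positive(num_segments, alpha=0.05):
    """Probability that at least one slice looks significant, assuming independence."""
    return 1 - (1 - alpha) ** num_segments


# Compare the defensible case (3), the fishing case (12), and the 20-segment example.
risk = {m: chance_of_any_false_positive(m) for m in (3, 12, 20)}
```

Under these assumptions the chance of at least one spurious winner is about 14% with three slices, about 46% with twelve, and about 64% with twenty.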

Pre-register the segments that matter

The cleanest defence against p-hacking is to write down — before the test ships — the two or three segments you commit to analysing. Pre-registration converts those slices from exploratory to confirmatory. Anything else you look at afterwards is exploratory: useful for generating the next hypothesis, not for shipping a decision.

A reasonable default for an e-commerce store: device class (mobile / desktop / tablet), traffic source bucket (paid / organic / direct / email), and visitor recency (new / returning within 30 days). That's three slices, each with 2-4 levels — small enough that you're not multiplying yourself into false significance.
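A pre-registration does not need tooling, but writing it as a small immutable record makes the confirmatory/exploratory line hard to blur later. The sketch below is one hypothetical shape for it; the class, field, and test names are illustrative and not part of any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreRegistration:
    """Segments committed to before the test ships. Frozen so it can't be edited after."""
    test_name: str
    confirmatory_segments: tuple

    def classify(self, segment_dimension):
        """Label a slice by whether it was committed to up front."""
        if segment_dimension in self.confirmatory_segments:
            return "confirmatory"
        return "exploratory"


# Hypothetical plan mirroring the default above: device, source, recency.
plan = PreRegistration(
    test_name="pdp_layout_v2",
    confirmatory_segments=("device_class", "traffic_source", "visitor_recency"),
)

labels = {dim: plan.classify(dim) for dim in ("device_class", "browser", "hour_of_day")}
```

Anything not named in the record is exploratory by construction, so a browser or hour-of-day slice picked up after the fact cannot quietly become a ship-decision.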

Chart: Lift dispersion across segments (same test, different audiences)

The sitewide aggregate above is the bar you'd report in a flat readout. It hides the fact that the variant is a clear mobile win and a likely desktop loss. Shipping to 100% of traffic gives you the +1.3%. Shipping to mobile only gives you closer to +6% on the segment that drives most of your sessions.

Which segments are worth the sample cost

Every segment you analyse needs enough traffic to be adequately powered inside that slice, not just across the whole test. Required duration scales with the inverse of the segment's share of sessions, so a segment that's 8% of your sessions needs roughly 12x the test duration (1 / 0.08 ≈ 12.5) to detect the same minimum effect at the same power as the sitewide read. That's why small segments rarely pay off as primary slices.

The table below gives a rough orientation for a mid-size Shopify store running tests on the product detail page. Use it to decide which slices deserve to be pre-registered and which are too thin to read as anything but directional.

Benchmark

Typical segment share of sessions and analysis viability — mid-size Shopify apparel store

Segment	Share of sessions	Weeks to detect a 5% MDE (sitewide: 2 weeks)	Read as
Mobile (iOS + Android)	72%	~3 weeks	Primary slice
Desktop	23%	~9 weeks	Primary slice (with patience)
Paid social	31%	~6.5 weeks	Primary slice
Branded organic	18%	~11 weeks	Secondary
Returning < 30 days	27%	~7.5 weeks	Primary slice
Tablet	5%	~40 weeks	Directional only
Single country (e.g. NL)	9%	~22 weeks	Directional only

The pattern is consistent: anything under 10-15% of your sessions is exploratory at best. If a niche segment matters strategically — a launch market, a high-AOV cohort — you either need a much longer test window or a dedicated test that targets that segment as the population, not a slice of a broader one.
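The week counts in the benchmark table are consistent with a simple rule: divide the sitewide duration by the segment's share of sessions. The sketch below applies that rule to the table's shares. It assumes each segment has the same baseline conversion rate and the same minimum detectable effect as the sitewide read, which is a simplification.

```python
def weeks_to_same_precision(sitewide_weeks, share_of_sessions):
    """Weeks for a segment to collect as many sessions as the sitewide test did."""
    return sitewide_weeks / share_of_sessions


# Shares of sessions taken from the benchmark table.
shares = {
    "Mobile": 0.72,
    "Desktop": 0.23,
    "Paid social": 0.31,
    "Branded organic": 0.18,
    "Returning < 30 days": 0.27,
    "Tablet": 0.05,
    "Single country": 0.09,
}

weeks = {name: round(weeks_to_same_precision(2, share), 1)
         for name, share in shares.items()}
```

Swap in your own shares and sitewide duration to see which slices you can afford to pre-register before you commit to a test length.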

Exploratory vs confirmatory: write down which is which

After your pre-registered slices, you'll often want to look further: by browser, by landing page, by hour-of-day. Do it. Just label these reads exploratory in the readout. An exploratory finding becomes the hypothesis for a follow-up test, not a justification for shipping the current one to that segment.

This separation is what protects the integrity of your testing program over time. Teams that mix exploratory wins into confirmatory rollouts end up with a roadmap full of features that don't replicate. Teams that re-test exploratory findings end up with a backlog of validated, segment-specific wins.

A workable rule of thumb

Pre-register 2-3 segments. Report those as primary. Run any number of exploratory slices but label them clearly and feed the interesting ones back into the hypothesis backlog. Never ship a variant based on an exploratory segment alone — re-test it as the primary read of its own experiment.

Frequently asked questions