How to use Segment Analysis

Metricuno
May 19, 2026
Quick answer

Sitewide-neutral A/B tests often hide large per-segment wins and losses. This guide shows how to run disciplined segment analysis without inflating false positives.

Definition
Segment Analysis

Breaking down A/B test results by traffic source, device, geography, or visitor cohort to find hidden wins and losses.

Segment analysis is the practice of splitting an experiment's results into sub-populations — mobile vs desktop, paid vs organic, new vs returning, EU vs US — to see whether the average effect masks a strong response in one group and a flat or negative response in another.

It is the most underused tool in experiment analysis and the easiest to abuse. Done well, it turns a flat sitewide test into a targeted rollout. Done badly, it manufactures statistically significant noise by slicing the data until something looks like a winner. The discipline is in deciding which segments to inspect before the test ships, and treating exploratory slices as hypotheses for the next test rather than verdicts on this one.

Also known as
subgroup analysis
segmented test results
cohort-level lift analysis

Most A/B tests are read at the aggregate level: one variant, one number, one verdict. That works when the effect is uniform across visitors. It rarely is. A checkout change that lifts mobile by 8% can drag desktop down by 3%, and the sitewide number then reads as a flat 1% lift: small enough to shelve a real mobile win, or to ship a desktop regression to everyone by mistake.

Segment analysis is the corrective. It belongs in the same conversation as broader experiment analysis: not a separate exercise, but the part of the readout where you ask which visitors the effect actually applies to. The risk is that with enough segments, something will always look significant by chance — so the technique only works inside guardrails.

When sitewide-neutral hides a winner

The classic case is a test that comes back at +1.2% with p=0.42, which is statistically indistinguishable from zero. You're about to call it inconclusive. Then you split by device: mobile is +6.1% (p=0.01), desktop is -2.4% (p=0.08). The aggregate blended a clear mobile gain with a probable desktop loss, and the two effects, pointing in opposite directions, largely cancelled out.
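A minimal sketch of that device split, assuming a pooled two-proportion z-test and entirely hypothetical conversion counts (the function name, the counts, and the choice of test are illustrative assumptions, not the method behind the figures quoted above):

```python
from math import erfc, sqrt


def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Return (relative lift of B over A, two-sided p-value) from a pooled z-test."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return rate_b / rate_a - 1, p_value


# Hypothetical counts per segment:
# (control conversions, control sessions, variant conversions, variant sessions)
segments = {
    "mobile": (3600, 144_000, 3820, 144_000),
    "desktop": (2300, 46_000, 2244, 46_000),
}

# The sitewide read is the column-wise sum of the segment counts.
sitewide = tuple(sum(column) for column in zip(*segments.values()))

results = {
    name: two_proportion_test(*counts)
    for name, counts in {"sitewide": sitewide, **segments}.items()
}

for name, (lift, p_value) in results.items():
    print(f"{name:9s} lift={lift:+.1%}  p={p_value:.3f}")
```

With these made-up counts the sitewide read is not significant at α=0.05 while the mobile slice is, which is the shape of result described above. Real data would need the same split applied to logged exposure and conversion events.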

This pattern is common in apparel and beauty stores where mobile traffic shares are 70-80% but desktop drives a quarter of revenue at twice the AOV. A new product page layout optimised for thumb-reach can be a clear mobile win while breaking a desktop comparison flow that bigger spenders rely on.

The same logic applies to paid vs organic traffic, branded vs non-branded search, and new vs returning visitors. Each is a fundamentally different audience landing on the same page with different intent. Treating them as one population is a modelling choice — usually a bad one when the variance between segments is large.

The multiple-comparisons trap

If you test 20 segments at α=0.05, you should expect about one false-positive 'significant' segment on average even when the variant does nothing; if the slices were independent, the chance of seeing at least one would be roughly 64%. Three slices (device, source, returning) is usually defensible. Twelve is fishing. Apply a Bonferroni or Benjamini-Hochberg correction when you go beyond your pre-registered slices.
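The arithmetic behind that warning, and the two corrections named above, can be sketched in a few lines. This is an illustrative implementation that assumes independent tests for the family-wise error rate; the p-values in the example are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of false positives when no slice has a true effect."""
    return n_tests * alpha


def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive), assuming the tests are independent."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from six segment reads of one test.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2, 0.6]

print(expected_false_positives(20))          # average false positives at 20 slices
print(round(family_wise_error_rate(20), 3))  # chance of at least one
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected share of false positives among the slices flagged, so it keeps more power when the family of slices is large.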

Pre-register the segments that matter

The cleanest defence against p-hacking is to write down — before the test ships — the two or three segments you commit to analysing. Pre-registration converts those slices from exploratory to confirmatory. Anything else you look at afterwards is exploratory: useful for generating the next hypothesis, not for shipping a decision.

A reasonable default for an e-commerce store: device class (mobile / desktop / tablet), traffic source bucket (paid / organic / direct / email), and visitor recency (new / returning within 30 days). That's three slices, each with 2-4 levels — small enough that you're not multiplying yourself into false significance.
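One way to make the commitment concrete is to store the pre-registered slices alongside the experiment before launch. The sketch below is a hypothetical structure, not a feature of any particular testing tool: the class name, field names, and experiment identifier are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreRegistration:
    """Segments committed to before the test ships (hypothetical structure)."""
    experiment: str
    segments: dict  # dimension name -> tuple of levels

    def classify(self, dimension):
        """Label a slice by whether it was written down before launch."""
        return "confirmatory" if dimension in self.segments else "exploratory"

    def n_levels(self):
        """Total number of segment-level reads the plan commits to."""
        return sum(len(levels) for levels in self.segments.values())


plan = PreRegistration(
    experiment="pdp-layout-test",  # hypothetical experiment name
    segments={
        "device_class": ("mobile", "desktop", "tablet"),
        "traffic_source": ("paid", "organic", "direct", "email"),
        "visitor_recency": ("new", "returning_30d"),
    },
)

print(plan.classify("device_class"))  # pre-registered dimension
print(plan.classify("browser"))       # anything else is exploratory
print(plan.n_levels())
```

Freezing the dataclass is a small guard against reassigning the plan's fields after launch; it does not stop the underlying dict from being mutated, so a real setup would more likely rely on a timestamped record in version control or the experiment tool itself.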

Chart

Lift dispersion across segments — same test, different audiences

[Bar chart of conversion-rate lift (%) by segment, axis from -4% to 8%. Segments shown: Mobile / Paid, Mobile / Organic, Desktop / Paid, Desktop / Organic, Tablet / All, Sitewide aggregate.]

The sitewide aggregate above is the bar you'd report in a flat readout. It hides the fact that the variant is a clear mobile win and a likely desktop loss. Shipping to 100% of traffic gives you the +1.3%. Shipping to mobile only gives you closer to +6% on the segment that drives most of your sessions.

Which segments are worth the sample cost

Every segment you analyse needs enough traffic to reach significance inside that slice — not the whole test. A segment that's 8% of your sessions needs roughly 12x the test duration to detect the same minimum effect at the same power as the sitewide read. That's why small segments rarely pay off as primary slices.
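The scaling is simple division: a slice collects sessions at a fraction of the sitewide rate, so it needs the sitewide duration divided by its share of traffic. The sketch below assumes steady traffic and that the slice has the same baseline conversion rate and minimum detectable effect as the sitewide read; the function name is hypothetical, and the 2-week sitewide baseline is the one used in the benchmark table.

```python
def segment_duration_weeks(sitewide_weeks, segment_share):
    """Weeks a slice needs to collect the sessions the sitewide test
    collects in `sitewide_weeks`, given its share of total sessions."""
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    return sitewide_weeks / segment_share


for share in (0.72, 0.23, 0.08, 0.05):
    weeks = segment_duration_weeks(2, share)
    print(f"{share:.0%} of sessions -> ~{weeks:.1f} weeks ({1 / share:.1f}x sitewide)")
```

A slice whose baseline conversion rate differs from the sitewide rate would need its own sample-size calculation rather than this proportional shortcut.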

The table below gives a rough orientation for a mid-size Shopify store running tests on the product detail page. Use it to decide which slices deserve to be pre-registered and which are too thin to read as anything but directional.

Benchmark

Typical segment share of sessions and analysis viability — mid-size Shopify apparel store

Segment                  | Share of sessions | Weeks to reach a 5% MDE (sitewide: 2 weeks) | Read as
Mobile (iOS + Android)   | 72%               | ~3 weeks                                    | Primary slice
Desktop                  | 23%               | ~9 weeks                                    | Primary slice (with patience)
Paid social              | 31%               | ~6.5 weeks                                  | Primary slice
Branded organic          | 18%               | ~11 weeks                                   | Secondary
Returning < 30 days      | 27%               | ~7.5 weeks                                  | Primary slice
Tablet                   | 5%                | ~40 weeks                                   | Directional only
Single country (e.g. NL) | 9%                | ~22 weeks                                   | Directional only

The pattern is consistent: anything under 10-15% of your sessions is exploratory at best. If a niche segment matters strategically — a launch market, a high-AOV cohort — you either need a much longer test window or a dedicated test that targets that segment as the population, not a slice of a broader one.

Exploratory vs confirmatory: write down which is which

After your pre-registered slices, you'll often want to look further: by browser, by landing page, by hour-of-day. Do it. Just label these reads exploratory in the readout. An exploratory finding becomes the hypothesis for a follow-up test, not a justification for shipping the current one to that segment.

This separation is what protects the integrity of your testing program over time. Teams that mix exploratory wins into confirmatory rollouts end up with a roadmap full of features that don't replicate. Teams that re-test exploratory findings end up with a backlog of validated, segment-specific wins.

A workable rule of thumb

Pre-register 2-3 segments. Report those as primary. Run any number of exploratory slices but label them clearly and feed the interesting ones back into the hypothesis backlog. Never ship a variant based on an exploratory segment alone — re-test it as the primary read of its own experiment.

Frequently asked questions

Two to three pre-registered segments is the working ceiling. Beyond that the multiple-comparisons risk becomes material and you should apply a correction (Bonferroni for a small number of slices, Benjamini-Hochberg for larger families). Exploratory slices on top are fine if you label them as such.

Segment analysis tells you whether a single variant affects different audiences differently. Personalisation acts on that finding by serving different experiences to different segments. Segment analysis is the diagnostic; personalisation is one possible treatment, alongside 'ship to everyone' or 'ship to one segment only'.

If a variant wins only on mobile, shipping it to mobile only is often the right call, but verify it first in a confirmatory follow-up that targets mobile as the primary population. A segment-level finding inside a sitewide test is one data point. Re-running it as a mobile-only test gives you a clean, fully-powered read before you commit to a permanent split experience.

Segment analysis is the readout layer that sits underneath the sitewide effect. A complete experiment analysis covers: sitewide effect, pre-registered segments, guardrail metrics (revenue, AOV, bounce), and exploratory slices for hypothesis generation. Skipping segments means you accept whatever the average tells you.

A multiple-comparisons correction is needed only when you're treating multiple segments as confirmatory. If you pre-register one primary effect and treat segments as exploratory, no correction is needed because you're not making multiple ship-decisions. The correction matters when each slice could independently trigger a rollout.

Device class (mobile vs desktop) shows the largest dispersion in our experience, followed by new-vs-returning visitor and paid-vs-organic source. Geography matters when you ship localised content or pricing. Browser, OS, and hour-of-day rarely move the needle enough to be worth pre-registering.

You can segment by email list if you can join the experiment exposure data to the email identifier. This is common for returning-customer tests where a Klaviyo VIP segment behaves very differently from a cold-list re-engagement segment. Treat list-based slices like any other segment: pre-register them or label them exploratory.

A segment needs roughly the same sample as a standalone test for that minimum detectable effect. If sitewide needs 50,000 sessions per variant to detect a 3% lift at 80% power, a segment that's 25% of traffic needs four times the calendar duration to reach the same precision. Plan test length around your smallest pre-registered segment.

Don't read segments before the test finishes. Interim segment peeking compounds the multiple-comparisons problem with sequential-testing inflation. Wait until the test hits its pre-defined duration or sample-size stopping rule, then read all segments at once. If you must peek, use a sequential-testing framework (e.g. mSPRT) that accounts for it.

When you report segment results, lead with the sitewide effect and its confidence interval. Then show the pre-registered segments as a small table with lift, CI, and sample share. Flag exploratory slices separately, with the caveat that they're hypothesis-generating. The ship-decision should reference only the confirmatory layer.
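A rough sketch of that readout order, assuming the log risk-ratio approximation for the lift interval and hypothetical counts throughout (the function names, layer labels, and the browser slice are illustrative assumptions):

```python
from math import exp, log, sqrt


def lift_with_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Relative lift of B over A with an approximate 95% CI,
    using the normal approximation on the log of the rate ratio."""
    ratio = (conv_b / n_b) / (conv_a / n_a)
    se_log = sqrt(1 / conv_a - 1 / n_a + 1 / conv_b - 1 / n_b)
    low = exp(log(ratio) - z * se_log)
    high = exp(log(ratio) + z * se_log)
    return ratio - 1, low - 1, high - 1


# Sitewide first, then pre-registered slices, then exploratory ones.
LAYER_ORDER = {"sitewide": 0, "confirmatory": 1, "exploratory": 2}


def build_readout(sitewide, slices):
    """Rows of (name, layer, sample share, lift, ci_low, ci_high)."""
    total_sessions = sitewide[1] + sitewide[3]
    rows = [("sitewide", "sitewide", 1.0) + lift_with_ci(*sitewide)]
    for name, layer, counts in slices:
        share = (counts[1] + counts[3]) / total_sessions
        rows.append((name, layer, share) + lift_with_ci(*counts))
    return sorted(rows, key=lambda row: LAYER_ORDER[row[1]])


# Hypothetical counts:
# (control conversions, control sessions, variant conversions, variant sessions)
sitewide = (5900, 190_000, 6064, 190_000)
slices = [
    ("browser: safari", "exploratory", (1500, 60_000, 1590, 60_000)),
    ("mobile", "confirmatory", (3600, 144_000, 3820, 144_000)),
    ("desktop", "confirmatory", (2300, 46_000, 2244, 46_000)),
]

readout = build_readout(sitewide, slices)
for name, layer, share, lift, low, high in readout:
    print(f"{name:16s} {layer:12s} share={share:.0%} "
          f"lift={lift:+.1%} CI=[{low:+.1%}, {high:+.1%}]")
```

Keeping the layer label on every row makes it harder for an exploratory slice to be quoted later as if it were a confirmatory result.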
