Experiment Backlogs
An experiment backlog is the prioritized pipeline of test ideas your CRO program runs from. Here's how to structure it, score it, and keep it from rotting.
Experiment Backlog
A structured, prioritized list of test ideas — each with a hypothesis, surface, score, and status — that feeds an experimentation program.
An experiment backlog is the single source of truth for every test idea your team has considered, from "swap the hero image" to "redesign the checkout shipping step." Each row carries the metadata needed to decide what runs next: a slug, a falsifiable hypothesis, the surface or page it targets, a primary KPI, a priority score (usually ICE or RICE), and a status (idea → queued → live → analysing → shipped/killed).
A backlog isn't a wishlist. It's a working artifact of your experiment prioritization process — disciplined enough that two people independently scoring the same idea land within a few points, and pruned ruthlessly enough that stale ideas don't crowd out fresh ones.
Most teams discover the value of a backlog the hard way. After six months of running tests, nobody remembers what was tried on the product detail page in March, why it lost, or whether the variant copy is worth resurrecting on a different surface. A backlog with clean status history answers all three questions in 30 seconds.
The structural minimum is seven fields per idea: slug (a short URL-safe handle), hypothesis ("if we change X for users Y, metric Z will move because…"), surface (PDP, cart, checkout step 2), primary KPI, ICE or RICE score, status, and owner. Without those, you have a Notion page, not a backlog.
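As a minimal sketch of that seven-field record, the structure could be modelled like this. The class and function names (`Status`, `BacklogIdea`, `missing_fields`) are illustrative assumptions, not part of any particular tool; the status values mirror the lifecycle described earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Lifecycle stages named in the text: idea -> queued -> live -> analysing -> shipped/killed
    IDEA = "idea"
    QUEUED = "queued"
    LIVE = "live"
    ANALYSING = "analysing"
    SHIPPED = "shipped"
    KILLED = "killed"


@dataclass
class BacklogIdea:
    slug: str          # short URL-safe handle
    hypothesis: str    # "if we change X for users Y, metric Z will move because..."
    surface: str       # e.g. "PDP", "cart", "checkout step 2"
    primary_kpi: str
    score: float       # ICE or RICE score
    status: Status
    owner: str


def missing_fields(idea: BacklogIdea) -> list[str]:
    """Return the names of text fields that were left blank."""
    text_fields = ("slug", "hypothesis", "surface", "primary_kpi", "owner")
    return [name for name in text_fields if not getattr(idea, name).strip()]
```

A check like `missing_fields` could run whenever an idea is added, so that rows without an owner or hypothesis are flagged before they are scored.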
RICE = (Reach × Impact × Confidence) / Effort
Reach
Number of users or sessions exposed to the change per time period (e.g. monthly visitors to the affected surface).
Impact
Expected lift on the primary KPI, on a scored scale: 3 = massive, 2 = high, 1 = medium, 0.5 = low, 0.25 = minimal.
Confidence
How sure you are the hypothesis is right, as a percentage (0-100%), entered in the formula as a decimal (70% = 0.7). High = backed by analytics, session replays, or prior wins; low = pure hunch.
Effort
Person-weeks (design + dev + QA + analysis) to ship the variant. Lower is better.
A Shopify apparel store wants to score a test that replaces a static hero image on the PDP with a 6-second product video. Monthly PDP traffic is 80,000 sessions. The team thinks the impact is medium (1), they're 70% confident based on a prior win on the homepage, and it'll take 2 person-weeks to build.
Reach (monthly sessions): 80000
Impact: 1
Confidence: 0.7
Effort (person-weeks): 2
→ RICE = (80000 × 1 × 0.7) / 2 = 28,000
A RICE of 28,000 is a strong contender for the next sprint. Compare it against the other queued ideas — anything scoring above the median goes into the run queue; everything below sits in 'icebox' until the inputs change.
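The calculation and the median cut can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, and an idea scoring exactly at the median is sent to the icebox, a tie-break the text does not specify.

```python
from statistics import median


def rice(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE = (Reach x Impact x Confidence) / Effort, with confidence as a decimal."""
    if effort <= 0:
        raise ValueError("effort must be a positive number of person-weeks")
    return reach * impact * confidence / effort


def split_by_median(scores: dict[str, float]) -> tuple[list[str], list[str]]:
    """Split slugs into (run_queue, icebox) around the median score."""
    cutoff = median(scores.values())
    run_queue = [slug for slug, score in scores.items() if score > cutoff]
    # Assumption: ties with the median wait in the icebox.
    icebox = [slug for slug, score in scores.items() if score <= cutoff]
    return run_queue, icebox


# The PDP video example: 80,000 sessions, medium impact, 70% confidence, 2 person-weeks.
pdp_video_score = rice(reach=80_000, impact=1, confidence=0.7, effort=2)  # about 28,000
```

Because RICE scores are only meaningful relative to each other, `split_by_median` should be re-run whenever an input changes or a new idea is scored.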
RICE works better than ICE for stores with several distinct traffic surfaces, because Reach forces you to weigh a checkout test (small but high-intent audience) against a homepage test (large but lower-intent). On a single-funnel site, plain ICE is fine and faster to score.
Backlog health benchmarks by program maturity
| Program stage | Ideas in backlog | % scored | Avg idea age (days) | Tests shipped / quarter |
|---|---|---|---|---|
| Just starting (0-3 months) | 10-25 | 60-80% | 30 | 2-4 |
| Established (6-18 months) | 40-80 | 85-95% | 60 | 8-12 |
| Mature (2+ years) | 80-150 | 90-100% | 90 | 15-25 |
| Neglected (warning sign) | 200+ | <50% | 180+ | 0-3 |
The neglected row matters most. A backlog that balloons past 200 ideas, most of them unscored, is the canonical failure mode: every PM and marketer dumped ideas into it, nobody pruned, and now no one trusts it. Quarterly grooming (kill stale ideas, re-score the top 20, archive anything older than 6 months that hasn't moved) is what separates a useful backlog from a graveyard.
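A grooming pass along these lines can be partly automated. The sketch below is hypothetical: the thresholds come from the "Neglected" row of the table, six months is approximated as 180 days, and the function names and input shapes are assumptions.

```python
from datetime import date, timedelta


def stale_slugs(last_moved: dict[str, date], today: date, max_age_days: int = 180) -> list[str]:
    """Return slugs whose status has not changed in roughly six months."""
    cutoff = today - timedelta(days=max_age_days)
    return [slug for slug, moved in last_moved.items() if moved < cutoff]


def warning_signs(total_ideas: int, share_scored: float, avg_age_days: float) -> list[str]:
    """Compare backlog stats against the 'Neglected' row of the benchmark table."""
    signs = []
    if total_ideas >= 200:
        signs.append("200+ ideas in backlog")
    if share_scored < 0.5:
        signs.append("less than 50% scored")
    if avg_age_days >= 180:
        signs.append("average idea age 180+ days")
    return signs
```

Anything returned by `stale_slugs` is a candidate for the archive, and a non-empty `warning_signs` list suggests the grooming session is overdue.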
Frequently asked questions
Seven: slug, hypothesis, surface, primary KPI, priority score, status, and owner. Optional but valuable additions are evidence links (analytics screenshot, session replay clip), expected runtime, and a guardrail metric to monitor for negative impact.
Use ICE (Impact × Confidence × Ease) when most ideas target the same surface or audience — it's fast and the scoring is consistent. Use RICE when ideas span very different audience sizes (homepage vs. checkout step 3), because Reach prevents you from over-weighting flashy but low-traffic tests.
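For comparison with RICE, ICE is a plain product of three ratings. The sketch below assumes each factor is rated on a 1-10 scale; the text does not fix a scale, so treat the bounds as a placeholder for whatever your rubric defines.

```python
def ice(impact: float, confidence: float, ease: float) -> float:
    """ICE = Impact x Confidence x Ease, each rated on an assumed 1-10 scale."""
    for name, value in (("impact", impact), ("confidence", confidence), ("ease", ease)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return impact * confidence * ease
```

Rejecting out-of-range inputs keeps scores comparable across scorers, which matters more than the particular scale chosen.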
Once a quarter at minimum. Re-score the top 20 ideas (Confidence and Effort drift as you learn), archive anything untouched for 6 months, and kill ideas whose hypothesis was invalidated by another test. A 30-minute grooming session every two weeks works even better.
One person — typically the CRO lead or Head of E-commerce. Multiple contributors can add ideas, but a single owner decides what gets queued, what gets killed, and what scores need re-validation. Shared ownership consistently produces neglected backlogs.
Under 30 ideas, a spreadsheet is fine. Past that, you'll want filtering, status workflows, and a results history — Notion, Airtable, or a dedicated tool like Metricuno's experiment hub. The tool matters less than whether you actually update statuses after each test.
A roadmap commits to specific work in specific quarters. A backlog is a ranked pool of options — what runs next depends on score, current learnings, and available capacity. Backlogs are dynamic; roadmaps are deliberately stickier.
Roughly 5:1 to 10:1. If you ship 12 tests a quarter, you want 60-120 active ideas to choose from. Below 3:1 means you're starving the program; above 15:1 means you're hoarding ideas you'll never run.
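Those ratio bands translate directly into a quick check. This is a sketch: the function name is hypothetical, and the "borderline" label for ratios between the stated bands (3:1 to 5:1 and 10:1 to 15:1) is an assumption, since the text does not name those ranges.

```python
def backlog_ratio_status(ideas_in_backlog: int, tests_per_quarter: int) -> str:
    """Classify the ratio of backlog ideas to tests shipped per quarter."""
    if tests_per_quarter <= 0:
        raise ValueError("tests_per_quarter must be positive")
    ratio = ideas_in_backlog / tests_per_quarter
    if ratio < 3:
        return "starving"    # too few ideas to choose from
    if ratio > 15:
        return "hoarding"    # more ideas than will ever run
    if 5 <= ratio <= 10:
        return "healthy"     # the 5:1 to 10:1 target band
    return "borderline"      # assumption: between the named bands


status = backlog_ratio_status(ideas_in_backlog=90, tests_per_quarter=12)  # ratio 7.5
```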
Move them to a separate 'learnings' or 'shipped' archive — don't delete. Past losers are gold for designing future tests: a checkout test that lost on free-shipping copy in March tells you a lot about what to try (and avoid) in November.
Treat them like any other source — they enter as 'idea' status, need a written hypothesis and evidence link, and get scored before being queued. The advantage is volume and that they're anchored to real funnel drop-off data; the discipline of scoring still applies so the backlog doesn't bloat.
Not scoring on a consistent scale. If one person scores Impact as a 1-5 and another as a 1-10, the priority ranking is garbage. Write a one-page scoring rubric with examples, calibrate two scorers on the same five ideas, and revisit the rubric every six months.
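The calibration step can be checked mechanically. In this sketch, two scorers rate the same ideas and any gap larger than a chosen tolerance is surfaced for discussion; the function name, the input shape, and the tolerance value are assumptions to adapt to your own rubric.

```python
def calibration_gaps(
    scorer_a: dict[str, float],
    scorer_b: dict[str, float],
    tolerance: float,
) -> dict[str, float]:
    """Return {slug: gap} for ideas both scorers rated where the gap exceeds tolerance."""
    shared = sorted(scorer_a.keys() & scorer_b.keys())
    gaps = {slug: abs(scorer_a[slug] - scorer_b[slug]) for slug in shared}
    return {slug: gap for slug, gap in gaps.items() if gap > tolerance}
```

An empty result means the two scorers agree within tolerance on every shared idea; anything else points at the rubric entries that need a clearer example.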