Experiment Backlogs

Q: What fields does an experiment backlog need at minimum?

Seven: slug, hypothesis, surface, primary KPI, priority score, status, and owner. Optional but valuable additions are evidence links (analytics screenshot, session replay clip), expected runtime, and a guardrail metric to monitor for negative impact.

Q: Should I use ICE or RICE to score my backlog?

Use ICE (Impact × Confidence × Ease) when most ideas target the same surface or audience — it's fast and the scoring is consistent. Use RICE when ideas span very different audience sizes (homepage vs. checkout step 3), because Reach prevents you from over-weighting flashy but low-traffic tests.
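
As a minimal sketch of the ICE arithmetic, assuming every input is scored on a shared 1-10 scale (the scale, the function name `ice_score`, and the example numbers are illustrative assumptions, not something the scoring method prescribes):

```python
def ice_score(impact: float, confidence: float, ease: float) -> float:
    """ICE = Impact x Confidence x Ease, all on the same (assumed) 1-10 scale."""
    for name, value in (("impact", impact), ("confidence", confidence), ("ease", ease)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return impact * confidence * ease


# Two hypothetical ideas on the same surface, scored with the same rubric.
print(ice_score(impact=7, confidence=6, ease=8))  # 336
print(ice_score(impact=9, confidence=3, ease=4))  # 108
```

The flashier second idea loses here because low Confidence and low Ease drag the product down, which is the behaviour you want when audience size is roughly constant across ideas.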

Q: How often should the backlog be groomed?

Once a quarter at minimum. Re-score the top 20 ideas (Confidence and Effort drift as you learn), archive anything untouched for 6 months, and kill ideas whose hypothesis was invalidated by another test. A 30-minute grooming session every two weeks works even better.
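
The grooming pass can be sketched roughly as follows. The `last_touched` and `score` keys, the 182-day cutoff standing in for "6 months", and the sample ideas are all assumptions made for illustration:

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=182)  # assumed stand-in for "6 months"


def split_stale(ideas: list[dict], today: date) -> tuple[list[dict], list[dict]]:
    """Return (keep, archive) based on each idea's last_touched date."""
    keep, archive = [], []
    for idea in ideas:
        bucket = archive if today - idea["last_touched"] > STALE_AFTER else keep
        bucket.append(idea)
    return keep, archive


def top_to_rescore(ideas: list[dict], n: int = 20) -> list[dict]:
    """The n highest-scoring ideas, which are the ones to re-score first."""
    return sorted(ideas, key=lambda idea: idea["score"], reverse=True)[:n]


# Hypothetical backlog rows.
ideas = [
    {"slug": "pdp-hero-video", "score": 28000, "last_touched": date(2026, 4, 30)},
    {"slug": "cart-trust-badges", "score": 9000, "last_touched": date(2025, 9, 1)},
]
keep, archive = split_stale(ideas, today=date(2026, 5, 18))
print([idea["slug"] for idea in archive])  # ['cart-trust-badges']
print([idea["slug"] for idea in top_to_rescore(keep)])  # ['pdp-hero-video']
```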

Q: Who owns the backlog?

One person — typically the CRO lead or Head of E-commerce. Multiple contributors can add ideas, but a single owner decides what gets queued, what gets killed, and what scores need re-validation. Shared ownership consistently produces neglected backlogs.

Q: Where should I store the backlog — spreadsheet, Notion, or a dedicated tool?

Under 30 ideas, a spreadsheet is fine. Past that, you'll want filtering, status workflows, and a results history — Notion, Airtable, or a dedicated tool like Metricuno's experiment hub. The tool matters less than whether you actually update statuses after each test.

Q: How is an experiment backlog different from a roadmap?

A roadmap commits to specific work in specific quarters. A backlog is a ranked pool of options — what runs next depends on score, current learnings, and available capacity. Backlogs are dynamic; roadmaps are deliberately stickier.

Q: What's a healthy ratio of ideas to shipped tests?

Roughly 5:1 to 10:1. If you ship 12 tests a quarter, you want 60-120 live ideas to choose from. Below 3:1 means you're starving the program; above 15:1 means you're hoarding ideas you'll never run.
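
A small check of that ratio might look like the sketch below. The thresholds come from the answer above; the "borderline" label for the 3:1 to 5:1 and 10:1 to 15:1 gaps is an assumption, since those ranges are not named, and `backlog_ratio_status` is a hypothetical helper:

```python
def backlog_ratio_status(live_ideas: int, tests_per_quarter: int) -> str:
    """Classify the ideas-to-shipped-tests ratio for one quarter."""
    if tests_per_quarter <= 0:
        raise ValueError("tests_per_quarter must be positive")
    ratio = live_ideas / tests_per_quarter
    if ratio < 3:
        return "starving"
    if ratio > 15:
        return "hoarding"
    if 5 <= ratio <= 10:
        return "healthy"
    return "borderline"  # assumed label for the unnamed 3-5 and 10-15 ranges


print(backlog_ratio_status(60, 12))   # healthy
print(backlog_ratio_status(30, 12))   # starving
print(backlog_ratio_status(200, 12))  # hoarding
```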

Q: Should losing tests stay in the backlog?

Move them to a separate 'learnings' or 'shipped' archive — don't delete. Past losers are gold for designing future tests: a checkout test that lost on free-shipping copy in March tells you a lot about what to try (and avoid) in November.

Q: How do AI-generated test ideas fit into a backlog?

Treat them like any other source — they enter as 'idea' status, need a written hypothesis and evidence link, and get scored before being queued. The advantage is volume and that they're anchored to real funnel drop-off data; the discipline of scoring still applies so the backlog doesn't bloat.

Q: What's the biggest mistake teams make with backlogs?

Not scoring on a consistent scale. If one person scores Impact on a 1-5 scale and another on a 1-10 scale, the priority ranking is garbage. Write a one-page scoring rubric with examples, calibrate two scorers on the same five ideas, and revisit the rubric every six months.
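
One way to run that calibration, sketched under assumptions: both scorers rate the same five ideas on the same scale, and any gap larger than 2 points is flagged for discussion. The tolerance, the slugs, and the scores are hypothetical:

```python
def calibration_gaps(
    scorer_a: dict[str, int], scorer_b: dict[str, int], tolerance: int = 2
) -> dict[str, int]:
    """Return the ideas where two scorers differ by more than `tolerance` points."""
    if scorer_a.keys() != scorer_b.keys():
        raise ValueError("both scorers must rate the same ideas")
    gaps = {slug: abs(scorer_a[slug] - scorer_b[slug]) for slug in scorer_a}
    return {slug: gap for slug, gap in gaps.items() if gap > tolerance}


# Hypothetical Impact scores from two people on the same five ideas.
scorer_a = {"pdp-hero-video": 7, "cart-trust-badges": 5, "checkout-shipping-copy": 8,
            "home-social-proof": 4, "plp-sticky-filters": 6}
scorer_b = {"pdp-hero-video": 6, "cart-trust-badges": 9, "checkout-shipping-copy": 7,
            "home-social-proof": 4, "plp-sticky-filters": 3}

print(calibration_gaps(scorer_a, scorer_b))
# {'cart-trust-badges': 4, 'plp-sticky-filters': 3}
```

The flagged ideas are the ones where the rubric needs a clearer example.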

Metricuno

May 18, 2026

Quick answer

An experiment backlog is the prioritized pipeline of test ideas your CRO program runs from. Here's how to structure it, score it, and keep it from rotting.

Definition

Experimentation

Experiment Backlog

A structured, prioritized list of test ideas — each with a hypothesis, surface, score, and status — that feeds an experimentation program.

An experiment backlog is the single source of truth for every test idea your team has considered, from "swap the hero image" to "redesign the checkout shipping step." Each row carries the metadata needed to decide what runs next: a slug, a falsifiable hypothesis, the surface or page it targets, a primary KPI, a priority score (usually ICE or RICE), a status (idea → queued → live → analysing → shipped/killed), and an owner.

A backlog isn't a wishlist. It's a working artifact of your experiment prioritization process — disciplined enough that two people independently scoring the same idea land within a few points, and pruned ruthlessly enough that stale ideas don't crowd out fresh ones.

Also known as

test backlog

CRO backlog

experimentation pipeline

Most teams discover the value of a backlog the hard way. After six months of running tests, nobody remembers what was tried on the product detail page in March, why it lost, or whether the variant copy is worth resurrecting on a different surface. A backlog with clean status history answers all three questions in 30 seconds.

The structural minimum is seven fields per idea: slug (a short URL-safe handle), hypothesis ("if we change X for users Y, metric Z will move because…"), surface (PDP, cart, checkout step 2), primary KPI, ICE or RICE score, status, and owner. Without those, you have a Notion page, not a backlog.
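
Those seven fields can be sketched as a small record type. The class and attribute names, the `Status` enum values (taken from the status flow described earlier), and the example row are illustrative assumptions rather than a required schema:

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IDEA = "idea"
    QUEUED = "queued"
    LIVE = "live"
    ANALYSING = "analysing"
    SHIPPED = "shipped"
    KILLED = "killed"


@dataclass
class BacklogItem:
    slug: str          # short URL-safe handle
    hypothesis: str    # "if we change X for users Y, metric Z will move because..."
    surface: str       # e.g. PDP, cart, checkout step 2
    primary_kpi: str
    score: float       # ICE or RICE
    status: Status
    owner: str

    def __post_init__(self) -> None:
        # Reject rows with empty or missing fields so the backlog stays usable.
        missing = [name for name, value in vars(self).items() if value in ("", None)]
        if missing:
            raise ValueError(f"missing required fields: {', '.join(missing)}")


# Hypothetical row; the KPI and owner are made up for the example.
item = BacklogItem(
    slug="pdp-hero-video",
    hypothesis="If we replace the static PDP hero with a short product video, "
               "add-to-cart rate will rise because shoppers see the product in use.",
    surface="PDP",
    primary_kpi="add-to-cart rate",
    score=28000.0,
    status=Status.IDEA,
    owner="cro-lead",
)
print(item.slug, item.status.value)  # pdp-hero-video idea
```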

Formula

RICE = (Reach × Impact × Confidence) / Effort

Variables

Reach

Number of users or sessions exposed to the change per time period (e.g. monthly visitors to the affected surface).

Impact

Expected lift on the primary KPI, on a scored scale: 3 = massive, 2 = high, 1 = medium, 0.5 = low, 0.25 = minimal.

Confidence

How sure you are the hypothesis is right, as a percentage (0-100%), entered in the formula as a decimal (70% = 0.7). High = backed by analytics, session replays, or prior wins; low = pure hunch.

Effort

Person-weeks (design + dev + QA + analysis) to ship the variant. Lower is better.

Worked example

A Shopify apparel store wants to score a test that replaces a static hero image on the PDP with a 6-second product video. Monthly PDP traffic is 80,000 sessions. The team thinks the impact is medium (1), they're 70% confident based on a prior win on the homepage, and it'll take 2 person-weeks to build.

Reach (monthly sessions): 80,000

Impact: 1

Confidence: 0.7

Effort (person-weeks): 2

→ RICE = (80,000 × 1 × 0.7) / 2 = 28,000

A RICE of 28,000 only means something relative to the rest of the backlog. Compare it against the other queued ideas: anything scoring above the median goes into the run queue; everything below sits in 'icebox' until the inputs change.
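
The worked example and the median cut can be sketched as below. Only the PDP video inputs come from the example; the other three scores and slugs are hypothetical, and sending an idea that sits exactly on the median to the icebox is an assumption:

```python
from statistics import median


def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE = (Reach x Impact x Confidence) / Effort, with Confidence as a decimal."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be a decimal between 0 and 1")
    return reach * impact * confidence / effort


def split_by_median(scores: dict[str, float]) -> tuple[list[str], list[str]]:
    """Return (run_queue, icebox); ideas exactly at the median go to the icebox."""
    cutoff = median(scores.values())
    run_queue = [slug for slug, score in scores.items() if score > cutoff]
    icebox = [slug for slug, score in scores.items() if score <= cutoff]
    return run_queue, icebox


pdp_video = rice_score(reach=80_000, impact=1, confidence=0.7, effort=2)
print(f"{pdp_video:,.0f}")  # 28,000

# Hypothetical scores for the other queued ideas.
queued = {
    "pdp-hero-video": pdp_video,
    "cart-trust-badges": 9_000.0,
    "home-social-proof": 40_000.0,
    "plp-sticky-filters": 12_000.0,
}
run_queue, icebox = split_by_median(queued)
print(run_queue)  # ['pdp-hero-video', 'home-social-proof']
print(icebox)     # ['cart-trust-badges', 'plp-sticky-filters']
```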

RICE works better than ICE for stores with several distinct traffic surfaces, because Reach forces you to weigh a checkout test (small but high-intent audience) against a homepage test (large but lower-intent audience). On a single-funnel site, plain ICE is fine and faster to score.

Benchmark

Backlog health benchmarks by program maturity

Program stage	Ideas in backlog	% scored	Avg idea age (days)	Tests shipped / quarter
Just starting (0-3 months)	10-25	60-80%	30	2-4
Established (6-18 months)	40-80	85-95%	60	8-12
Mature (2+ years)	80-150	90-100%	90	15-25
Neglected (warning sign)	200+	<50%	180+	0-3

The neglected row matters most. A backlog that balloons past 200 ideas, most of them unscored, is the canonical failure mode: every PM and marketer dumped ideas into it, nobody pruned, and now no one trusts it. Quarterly grooming (kill stale ideas, re-score the top 20, archive anything older than 6 months that hasn't moved) is what separates a useful backlog from a graveyard.
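
The neglected-row thresholds from the table can be turned into a rough health check. Treating each threshold as an independent flag is an assumption (a brand-new program would trip the shipped-tests flag on its own), and `neglect_warnings` is a hypothetical helper:

```python
def neglect_warnings(
    total_ideas: int, scored_ideas: int, avg_age_days: float, tests_per_quarter: int
) -> list[str]:
    """List which 'Neglected' thresholds from the benchmark table a backlog trips."""
    warnings = []
    if total_ideas >= 200:
        warnings.append("200+ ideas in backlog")
    if total_ideas > 0 and scored_ideas / total_ideas < 0.5:
        warnings.append("under 50% of ideas scored")
    if avg_age_days >= 180:
        warnings.append("average idea age of 180+ days")
    if tests_per_quarter <= 3:
        warnings.append("3 or fewer tests shipped this quarter")
    return warnings


# Hypothetical backlogs: one neglected, one established and healthy.
print(len(neglect_warnings(240, 90, 210, 2)))  # 4
print(neglect_warnings(60, 55, 60, 10))        # []
```

Several flags at once are the real signal; a single flag is a prompt to look closer, not a verdict.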
