Experiment Backlog Template Checklist
A structured backlog template for tracking CRO test ideas — hypothesis, surface, ICE/RICE score, status, owner, and results — so your program runs from one source of truth instead of three half-finished spreadsheets.
An experiment backlog template is a pre-built spreadsheet (or Notion / Airtable doc) that captures every test idea your team generates, scores them against a consistent prioritisation framework like ICE or RICE, and tracks each one through draft, live, and concluded states. It's the operational backbone of a CRO program: hypotheses live here before they ship, and learnings live here after they end.
Most teams rebuild this themselves in months three to six of their program, usually after losing a great hypothesis in someone's DMs or shipping two contradictory tests in the same week. A template lets you skip that detour.
If you're running more than two tests a month, ideas stop fitting in your head. You'll have one hypothesis half-written in a Loom comment, a second sitting in last quarter's QBR deck, and a third that the email team mentioned in standup but no one wrote down. A backlog is where those go to actually become tests.
A good template does three things: it forces a hypothesis into a testable shape before it gets scored, it keeps scoring consistent across people so the loudest voice doesn't always win, and it leaves a paper trail of what you've already tried — including the losers. That last part is what separates a program that compounds from one that re-tests the same PDP layout every nine months.
ICE scores drift fast
If three people score the same idea, you'll get three different numbers — and within a quarter, the same person scores the same idea differently. Lock the rubric (what does a 7 on Impact actually mean in €?) on the second tab of the sheet, or your backlog becomes a popularity contest. RICE is slightly more defensible because Reach forces a number you can sanity-check against analytics.
What every column does — and why
Hypothesis and surface. Write the hypothesis in the form 'Because we saw [data], we believe [change] for [audience] will cause [metric] to move, measured by [KPI].' Surface is where it ships — PDP, cart drawer, checkout step 2, the post-purchase upsell. Forcing both in writing kills 30% of ideas before they're scored, which is the point.
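If your backlog lives somewhere scriptable, the hypothesis template above could be enforced with something like the following. This is a minimal sketch, not part of the template itself: the `Hypothesis` class and its field names are hypothetical, chosen to mirror the bracketed slots in the sentence.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Hypothesis:
    """One backlog row's hypothesis: the five template slots plus the surface.

    Class and field names are illustrative assumptions, not a fixed schema.
    """
    data: str       # the observation that prompted the idea
    change: str     # what will be changed
    audience: str   # who sees the change
    metric: str     # the metric expected to move
    kpi: str        # how the movement is measured
    surface: str    # where the test ships, e.g. "PDP" or "cart drawer"

    def __post_init__(self) -> None:
        # Reject rows with any blank slot so half-formed ideas never reach scoring.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"hypothesis is missing: {', '.join(missing)}")

    def sentence(self) -> str:
        # Render the row in the template's sentence form.
        return (
            f"Because we saw {self.data}, we believe {self.change} "
            f"for {self.audience} will cause {self.metric} to move, "
            f"measured by {self.kpi}."
        )
```

A row with an empty `surface` or `kpi` fails at creation time, which is the same filter the written template applies by hand.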
ICE or RICE score. Pick one and stick with it for at least two quarters before swapping. ICE (Impact, Confidence, Ease) is faster for small teams; RICE (Reach, Impact, Confidence, Effort) is better once you're juggling tests across PDP, checkout, and email at the same time. Both are subjective — the rubric tab matters more than the formula.
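For reference, here is a sketch of the two formulas as they are commonly computed. It assumes ICE is the product of three 1-10 ratings (some teams average them instead) and that RICE takes confidence as a fraction and effort in person-weeks; the function names and the input ranges are assumptions, so adapt them to whatever your rubric tab defines.

```python
def ice_score(impact: float, confidence: float, ease: float) -> float:
    """ICE as the product of three 1-10 ratings (an assumed convention)."""
    for name, value in (("impact", impact), ("confidence", confidence), ("ease", ease)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return impact * confidence * ease


def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE as (reach * impact * confidence) / effort.

    reach:      people or sessions affected per period
    impact:     per-user impact rating
    confidence: a fraction between 0 and 1
    effort:     person-weeks (or any consistent unit), must be positive
    """
    if effort <= 0:
        raise ValueError("effort must be positive")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be a fraction between 0 and 1")
    return reach * impact * confidence / effort


# Hypothetical numbers: two tests identical except for reach.
homepage = rice_score(reach=100_000, impact=1, confidence=0.8, effort=2)
upsell = rice_score(reach=8_000, impact=1, confidence=0.8, effort=2)
```

With everything else held equal, the score scales directly with reach, which is why a number you can check against analytics makes RICE easier to defend.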
Status and owner. Status moves through Draft → Prioritised → In QA → Live → Concluded → Shipped/Killed. Owner is one person, not a team — backlogs with shared ownership stall. A separate 'design owner' column is fine; what isn't fine is leaving the accountability column blank because 'we'll figure it out in standup'.
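The status flow and the single-owner rule can be sketched as a small state machine. This version assumes strictly forward-only moves in the order listed above; your team may reasonably allow exceptions (sending a test back from In QA, or killing an idea before it goes live), and the `advance` helper is a hypothetical name.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "Draft"
    PRIORITISED = "Prioritised"
    IN_QA = "In QA"
    LIVE = "Live"
    CONCLUDED = "Concluded"
    SHIPPED = "Shipped"
    KILLED = "Killed"


# Forward-only transitions, following the order given in the text (an assumption).
ALLOWED = {
    Status.DRAFT: {Status.PRIORITISED},
    Status.PRIORITISED: {Status.IN_QA},
    Status.IN_QA: {Status.LIVE},
    Status.LIVE: {Status.CONCLUDED},
    Status.CONCLUDED: {Status.SHIPPED, Status.KILLED},
    Status.SHIPPED: set(),
    Status.KILLED: set(),
}


def advance(current: Status, new: Status, owner: str) -> Status:
    """Return the new status, or raise if the move or the owner is invalid."""
    if not owner.strip():
        raise ValueError("every row needs one named owner")
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new
```

Refusing to move a row with a blank owner is the scripted version of not letting accountability wait for standup.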
Results and learning. Capture the lift, the confidence interval, the segment cuts that mattered, and — most importantly — a one-line learning. 'Free shipping bar lifted AOV +4.2% on mobile but flat on desktop; mobile users anchor to threshold messaging more' is worth more in six months than 'winner, shipped'. This column is what makes the backlog an asset rather than a to-do list.
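If you want to sanity-check the lift and interval before writing them into the results column, a rough calculation for a conversion-rate test might look like this. It is a sketch under stated assumptions: it uses a normal-approximation (Wald) interval for the difference between two proportions, which may differ from the method your testing platform reports, and it does not cover revenue metrics such as AOV. The function name and the counts in the example are hypothetical.

```python
import math


def conversion_lift(control_conv: int, control_n: int,
                    variant_conv: int, variant_n: int,
                    z: float = 1.96) -> dict:
    """Relative lift plus a Wald interval for the absolute difference.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if control_n <= 0 or variant_n <= 0:
        raise ValueError("sample sizes must be positive")
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    if p_c == 0:
        raise ValueError("control conversion rate is zero; relative lift is undefined")
    diff = p_v - p_c
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_c * (1 - p_c) / control_n + p_v * (1 - p_v) / variant_n)
    return {
        "relative_lift": diff / p_c,
        "abs_diff": diff,
        "ci_low": diff - z * se,
        "ci_high": diff + z * se,
    }


# Hypothetical counts: 500/10,000 conversions in control, 550/10,000 in variant.
result = conversion_lift(500, 10_000, 550, 10_000)
```

When the interval for the absolute difference spans zero, as it does for these made-up counts, the honest entry in the results column is "inconclusive" rather than "winner".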
Frequently asked questions
Use ICE if you're a team of one or two and most tests touch the same surfaces. Switch to RICE once you're running across PDP, checkout, email, and ads at the same time: Reach forces you to admit a homepage test affects 100% of sessions while a post-purchase upsell test affects maybe 8%, which fundamentally changes prioritisation.
Stay in a spreadsheet until you hit roughly 50 tests in the backlog or three concurrent test owners; Google Sheets handles it. Move to Notion or Airtable once you want filterable views per surface and embedded screenshots. A dedicated tool only earns its seat once you're running 5+ tests in parallel and need governance on who can edit scores.
Healthy backlogs sit around 30-60 ideas, with maybe 8-12 prioritised. Below 20 and you're idea-starved: your program will stall the moment a test ends inconclusively. Above 100 and the backlog is a graveyard; archive anything older than two quarters that hasn't been touched.
Everyone should be able to submit ideas: CX, paid, email, even the warehouse team if they spot a returns pattern. Restrict scoring and prioritisation to two or three people. Open intake, narrow triage. Open intake is the single biggest source of test ideas most teams under-use.
Review the backlog on a fixed cadence: a 30-minute weekly triage to score new ideas and move statuses, plus a 60-minute monthly prioritisation to set the next sprint of tests. Quarterly, archive concluded tests and re-score anything older than 90 days that's still sitting in Prioritised, because the world has moved on since it was scored.
Keep losing tests in the backlog: they're the most valuable rows. A killed test with a clear learning prevents the same idea coming back next quarter wearing a slightly different hat. Mark them Concluded → Killed and write the learning in plain language, not stats jargon.
The backlog is the inventory; the roadmap is the next 4-8 weeks of what's actually shipping. Roadmap items should all have a backlog ID so you can trace any live test back to its hypothesis, score, and original data signal. If a test on the roadmap doesn't have a backlog row, that's a process smell.
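The traceability check described here is easy to automate if both lists can be exported. A minimal sketch, assuming each roadmap item is a dict with hypothetical `name` and `backlog_id` keys:

```python
def untraceable_roadmap_items(roadmap: list[dict], backlog_ids: set[str]) -> list[str]:
    """Return names of roadmap items whose backlog_id is missing or unknown."""
    return [
        item["name"]
        for item in roadmap
        # .get() returns None for a missing key, which is never a valid ID.
        if item.get("backlog_id") not in backlog_ids
    ]
```

Anything this returns is a test on the roadmap that cannot be traced back to a hypothesis, score, or data signal.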
A good data signal is a specific observation tied to a number: 'mobile cart abandonment is 73% on the shipping step, 18pp above desktop' beats 'I think the cart is confusing'. If you can't point to a GA4 funnel, heatmap, session replay, or survey response, the idea goes in a separate 'unvalidated' tab until it has evidence.
Two rules keep the backlog from bloating: archive anything that has sat in Draft status for more than 60 days, and never let Prioritised exceed three months of capacity. If an idea has been 'next quarter' for two quarters running, either ship it or kill it; leaving it pending tells the team that scoring doesn't actually drive what happens.
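Both rules can be checked mechanically during the weekly triage. The sketch below assumes each row is a dict with hypothetical `id`, `status`, and `status_since` (a `date`) keys, and that you know roughly how many tests you can run per month; none of these names come from the template.

```python
from datetime import date, timedelta

DRAFT_MAX_AGE = timedelta(days=60)


def stale_drafts(rows: list[dict], today: date) -> list[str]:
    """IDs of rows that have sat in Draft for more than 60 days."""
    return [
        row["id"]
        for row in rows
        if row["status"] == "Draft" and today - row["status_since"] > DRAFT_MAX_AGE
    ]


def prioritised_over_capacity(rows: list[dict], tests_per_month: float) -> int:
    """How many Prioritised rows exceed three months of capacity (0 if none)."""
    prioritised = sum(1 for row in rows if row["status"] == "Prioritised")
    limit = int(tests_per_month * 3)
    return max(0, prioritised - limit)
```

A non-empty result from `stale_drafts`, or a positive number from `prioritised_over_capacity`, is the cue to archive or kill rather than defer again.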
The backlog works alongside your existing analytics and testing tools: it is upstream of where tests run, so it sits happily next to GA4, your heatmap tool, and your testing platform. The connection point is the data-signal column: every prioritised hypothesis should reference a specific report, segment, or replay that justified it, which makes the post-test analysis much faster.