Learning Systems
A learning system is the org muscle that captures what each experiment taught, so your CRO program compounds knowledge instead of re-running the same test.
Learning System
The process and tooling that captures what each experiment taught — including losing tests — so the team compounds insight instead of repeating it.
A learning system is the operational discipline that turns individual A/B tests into durable, searchable knowledge for the whole team. It covers how hypotheses are written, how results are documented, where insights live, and how past learnings feed the next test backlog.
Most programs treat tests as one-off projects: ship a variant, declare a winner, move on. A learning system treats each test as an evidence event — winners, losers, and inconclusive runs all contribute. Done well, it cuts repeat tests, sharpens hypothesis quality, and prevents the same lesson being relearned every time a new optimiser joins.
Learning compounds only when it's written down in a format the next person can search. A Slack message, a closed Notion doc, or a screenshot in someone's Drive does not count. The artefact has to survive team turnover and be discoverable in under a minute.
A working learning system has four parts: a hypothesis template that forces a falsifiable claim, a result record that captures lift, confidence, segment splits, and qualitative observations, a tagging schema (page, audience, principle), and a review cadence where past results inform the next backlog. Miss any one and the system leaks.
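As a rough illustration of the first three parts, here is a minimal Python sketch. The class names, field names, and the "If we ..., then ..., because ..." wording are assumptions made for this example, not a prescribed schema; adapt them to whatever your team already records.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hypothesis:
    """Hypothetical template: every field must be filled to form a falsifiable claim."""
    change: str
    metric: str
    expected_direction: str  # "increase" or "decrease"
    audience: str
    rationale: str

    def statement(self) -> str:
        # Render the claim in one fixed sentence shape so records read alike.
        return (
            f"If we {self.change}, then {self.metric} will "
            f"{self.expected_direction} for {self.audience}, "
            f"because {self.rationale}."
        )


@dataclass
class ResultRecord:
    """Result record: lift, confidence, segment splits, qualitative notes, tags."""
    hypothesis: Hypothesis
    outcome: str                       # "win", "loss", or "inconclusive"
    lift: Optional[float] = None       # relative lift, e.g. 0.04 for +4%
    confidence: Optional[float] = None
    segment_notes: str = ""
    qualitative_notes: str = ""
    tags: dict = field(default_factory=dict)  # keys: page, audience, principle


# Made-up example values, for illustration only.
h = Hypothesis(
    change="move reviews above the fold",
    metric="add-to-cart rate",
    expected_direction="increase",
    audience="mobile visitors",
    rationale="shoppers look for social proof before committing",
)
record = ResultRecord(
    hypothesis=h,
    outcome="loss",
    tags={"page": "PDP", "audience": "mobile", "principle": "social proof"},
)
print(h.statement())
```

The fourth part, the review cadence, is a team habit rather than a data structure, so it is not modelled here.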
Knowledge Yield = (Documented Insights / Tests Run) × Reuse Rate
Documented Insights
Number of tests in the period with a complete result record (hypothesis, outcome, segment notes, tags).
Tests Run
Total experiments completed in the period, including inconclusive and losing tests.
Reuse Rate
Share of new hypotheses in the next quarter that explicitly cite a prior documented insight.
A Shopify apparel brand ran 24 tests last quarter on its PDP and checkout. 18 had complete result records; the other 6 were declared and forgotten. Of the 20 hypotheses written for the next quarter, 9 cited a past learning.
Documented Insights: 18
Tests Run: 24
Reuse Rate: 9 / 20 = 0.45
→ Knowledge Yield = (18 / 24) × 0.45 = 0.75 × 0.45 ≈ 0.34
A knowledge yield of 0.34 is mid-range. Above 0.5 indicates a program where past learning actively shapes future tests; below 0.2 means the team is effectively starting from scratch each quarter.
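The calculation and the rough bands can be sketched in a few lines of Python. The function names are hypothetical, and the band labels simply restate the thresholds given in the text.

```python
def knowledge_yield(documented: int, tests_run: int,
                    citing: int, new_hypotheses: int) -> float:
    """(Documented Insights / Tests Run) x Reuse Rate."""
    if tests_run <= 0 or new_hypotheses <= 0:
        raise ValueError("tests_run and new_hypotheses must be positive")
    documentation_rate = documented / tests_run
    reuse_rate = citing / new_hypotheses
    return documentation_rate * reuse_rate


def interpret(yield_value: float) -> str:
    """Map a yield value to the bands described in the text."""
    if yield_value > 0.5:
        return "past learning actively shapes future tests"
    if yield_value < 0.2:
        return "effectively starting from scratch each quarter"
    return "mid-range"


# Figures from the apparel-brand example.
score = knowledge_yield(documented=18, tests_run=24, citing=9, new_hypotheses=20)
print(round(score, 2), interpret(score))
```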
The metric only matters because it forces two behaviours: documenting losers (which raises the numerator) and citing prior work in new hypotheses (which raises reuse). Teams that track it for a quarter usually find they were over-counting wins and under-counting institutional memory.
Learning-system maturity by program stage
| Maturity stage | Tests documented | Avg. time to find a past result | Repeat-test rate |
|---|---|---|---|
| Ad-hoc (no system) | 20-40% | 20+ minutes or never found | 30-45% |
| Spreadsheet log | 50-70% | 5-10 minutes | 15-25% |
| Tagged repository | 80-90% | Under 2 minutes | 5-10% |
| Insight-driven backlog | 95%+ | Under 1 minute | Under 5% |
The jump from spreadsheet to tagged repository is where most teams stall. The fix is rarely better software. It is assigning one person to own the schema and enforcing a 'no merge without a result record' rule as part of the team's experimentation strategy rituals.
Learning systems FAQ
Losers tell you which levers don't move the needle on a given page or audience — that's the more valuable half of the data. A documented loss prevents a teammate from re-running essentially the same idea six months later and burning two weeks of traffic on a known dead end.
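One way to catch a near-repeat before it launches is to look up past records that share tags with the new idea. The sketch below assumes records are plain dictionaries with a `tags` mapping; the record layout, IDs, and the function name are made up for illustration.

```python
def find_prior_tests(records, page, principle, audience=None):
    """Return past records that share page and principle (and audience, if given)."""
    matches = []
    for rec in records:
        tags = rec.get("tags", {})
        if tags.get("page") != page or tags.get("principle") != principle:
            continue
        if audience is not None and tags.get("audience") != audience:
            continue
        matches.append(rec)
    return matches


# Made-up archive entries, for illustration only.
archive = [
    {"id": "T-01", "outcome": "loss",
     "tags": {"page": "checkout", "audience": "returning", "principle": "urgency"}},
    {"id": "T-02", "outcome": "win",
     "tags": {"page": "pdp", "audience": "new", "principle": "social proof"}},
    {"id": "T-03", "outcome": "inconclusive",
     "tags": {"page": "checkout", "audience": "new", "principle": "urgency"}},
]

prior = find_prior_tests(archive, page="checkout", principle="urgency")
print([r["id"] for r in prior])  # ['T-01', 'T-03']
```

Exact tag matching is deliberately crude. It only works if the tagging schema is applied consistently, which is why the schema needs an owner.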
A learning system is the feedback loop inside your experimentation strategy. The strategy sets what to test and why; the learning system records what you found and feeds it back into prioritisation. Without it, the strategy degrades into a backlog of guesses within a quarter or two.
A complete result record contains the hypothesis statement, the primary metric and its result with confidence level, segment splits where relevant, two or three sentences on what you think happened, and tags for page type, audience, and design principle. Anything less and the record won't be searchable; anything more rarely gets filled in.
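A simple completeness check makes this enforceable. The sketch below assumes each record is a dictionary, for example one spreadsheet row; the key names are hypothetical. Segment splits are left optional because they only apply where relevant.

```python
REQUIRED_FIELDS = ("hypothesis", "primary_metric", "result",
                   "confidence", "interpretation")
REQUIRED_TAGS = ("page_type", "audience", "design_principle")


def missing_fields(record: dict) -> list:
    """List the required fields and tags that are absent or empty."""
    missing = [name for name in REQUIRED_FIELDS
               if record.get(name) in (None, "")]
    tags = record.get("tags") or {}
    missing += [f"tags.{tag}" for tag in REQUIRED_TAGS if not tags.get(tag)]
    return missing


# Made-up record with two gaps, for illustration only.
draft = {
    "hypothesis": "If we shorten the checkout form, completion rate will increase.",
    "primary_metric": "checkout completion rate",
    "result": "-0.4% relative",
    "confidence": 0.62,
    "interpretation": "",
    "tags": {"page_type": "checkout", "audience": "all"},
}
print(missing_fields(draft))  # ['interpretation', 'tags.design_principle']
```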
A spreadsheet, a Notion workspace, or a dedicated tool can all work if you enforce structure. A spreadsheet with required columns beats a beautiful Notion page that nobody fills in. Dedicated tools earn their cost once you cross roughly 30 tests a year and need filtering by tag, audience, or page.
The system needs one named owner, usually the CRO lead or a senior optimiser. Shared ownership becomes no ownership. The owner's job isn't to write every record but to enforce the schema and run the monthly review where past learnings inform next month's backlog.
Tag every insight with the date and the page version it was tested on. When the page is redesigned, mark linked insights as 'context changed'. A learning from a 2022 checkout flow may not apply to your 2024 Shop Pay-enabled flow, and the system should make that obvious.
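A minimal sketch of that rule, assuming each insight stores the date and page version it was tested on. The class, the field names, and the version labels are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Insight:
    summary: str
    tested_on: date
    page: str
    page_version: str
    status: str = "current"


def mark_context_changed(insights, page, new_version):
    """Flag current insights tested on an older version of `page`; return the count."""
    flagged = 0
    for insight in insights:
        if (insight.page == page
                and insight.page_version != new_version
                and insight.status == "current"):
            insight.status = "context changed"
            flagged += 1
    return flagged


# Made-up insights, for illustration only.
insights = [
    Insight("Trust badges near the pay button had no effect", date(2022, 5, 10), "checkout", "v1"),
    Insight("Size guide link lifted add-to-cart", date(2023, 11, 2), "pdp", "v3"),
]
print(mark_context_changed(insights, page="checkout", new_version="v2"))  # 1
```

Flagged insights stay in the archive. The status only tells the next reader to re-check the finding before relying on it.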
Review past learnings monthly for active programs (10+ tests per quarter) and quarterly for slower ones. The review should produce three to five candidate hypotheses for the next sprint, each citing a specific prior result. If a review generates zero new hypotheses, your tagging schema is too coarse.
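These rules of thumb are easy to encode as a checklist for the review owner. The sketch below is only an illustration; the function names and the `cites` key are assumptions.

```python
def review_cadence(tests_per_quarter: int) -> str:
    """Monthly for active programs (10+ tests per quarter), quarterly otherwise."""
    return "monthly" if tests_per_quarter >= 10 else "quarterly"


def review_issues(candidates: list) -> list:
    """Check a review's output: 3 to 5 candidates, each citing a prior result."""
    issues = []
    if not 3 <= len(candidates) <= 5:
        issues.append(f"expected 3 to 5 candidate hypotheses, got {len(candidates)}")
    for candidate in candidates:
        if not candidate.get("cites"):
            issues.append(f"{candidate.get('title', '<untitled>')!r} cites no prior result")
    return issues


# Made-up review output, for illustration only.
candidates = [
    {"title": "Shorter shipping copy", "cites": ["T-01"]},
    {"title": "Sticky add-to-cart", "cites": []},
]
print(review_cadence(12))         # monthly
print(review_issues(candidates))
```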
Session recordings, heatmaps, and survey quotes belong in the same record as the quantitative result. They explain why a variant won or lost and travel with the insight when someone searches it later. A test record with only a lift number is half a learning.
Document inconclusive tests with the same rigour as winners and losers. Note the observed effect size, why you couldn't reach significance (low traffic, short run, noisy metric), and what conditions would make the test worth repeating. Inconclusive is information, not failure.
Expect roughly one quarter for behavioural change (people start citing past tests in standups) and two to three quarters for measurable impact on win rate and velocity. Programs that stick with it typically see hypothesis quality improve before raw win rate does.