Canary Releases

Metricuno
May 18, 2026
Quick answer

Canary releases route a small slice of production traffic to a new version while everyone else stays on the stable build. Here's how the rollout math and ramp schedules actually work.

Definition

A deployment pattern that routes a small slice of production traffic to a new version while the rest stays on the stable one.

A canary release pushes a new build of your storefront, checkout, or backend service to production but exposes it to only a small fraction of real traffic, typically 1% to 5% at first. You watch error rates, latency, and conversion on that slice; if it holds up, you ramp the percentage. If something breaks, you roll back before most shoppers ever see it.

The name comes from the canary in a coal mine: the small group is the early warning. Unlike a blue-green deploy (instant 100% cutover) or a feature flag (code-level toggle inside a single build), a canary is an infrastructure-level traffic split between two running versions of the same service.

Also known as
canary deployment
progressive rollout
incremental rollout

Canary releases live in the same family as feature experimentation, but they answer a different question. An A/B test asks: does variant B convert better? A canary asks: does this build break anything in production? The success metric is the absence of regressions, not lift.

On a typical Shopify or headless storefront, the traffic split happens at the CDN or load balancer — Cloudflare Workers, Fastly, or an internal proxy routes a percentage of requests to the new origin. Feature flags, by contrast, ship one build and toggle behaviour inside the running code. You often use both: a canary to derisk the deploy, flags to derisk individual features inside it.

Formula

Exposed Sessions = Daily Sessions × Canary % × Hours Live / 24

Variables

Daily Sessions (daily sessions): total store sessions in a normal 24-hour window.

Canary % (canary traffic share): fraction of traffic routed to the new version (e.g. 0.05 for 5%).

Hours Live (hours the canary has been live): time elapsed at this traffic share before ramping or rolling back.

Worked example

An apparel store doing 40,000 sessions/day puts a new checkout build behind a 5% canary for 4 hours before deciding whether to ramp.

Daily Sessions: 40,000

Canary %: 5% (0.05)

Hours Live: 4

Exposed Sessions = 40,000 × 0.05 × 4 / 24 ≈ 333 exposed sessions
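
The formula can be written as a small helper. This is a minimal sketch, not a reference implementation: the function name `exposed_sessions` is made up for illustration, and it assumes traffic is spread evenly across the 24 hours, which the formula itself implies.

```python
def exposed_sessions(daily_sessions: float, canary_share: float, hours_live: float) -> float:
    """Sessions that reach the canary, assuming traffic is even across the day."""
    if not 0.0 <= canary_share <= 1.0:
        raise ValueError("canary_share must be a fraction between 0 and 1")
    return daily_sessions * canary_share * hours_live / 24


# Worked example from the text: 40,000 sessions/day, 5% canary, 4 hours live.
print(round(exposed_sessions(40_000, 0.05, 4)))  # 333
```

If your traffic is strongly peaked, the real count for a 4-hour window can sit well above or below this estimate.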

333 sessions is enough to catch hard errors (500s, broken payment) but not enough to detect a small conversion regression. For statistically meaningful conversion comparisons you'd ramp longer or higher.
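
To get a feel for why a few hundred sessions cannot reveal a small conversion regression, here is a rough sample-size sketch. It uses the textbook normal-approximation formula for comparing two proportions with equal-sized arms, which is a simplification because a canary split is usually unequal. The 2.5% baseline conversion rate and the 10% relative drop are assumptions chosen for illustration, not figures from the worked example.

```python
import math
from statistics import NormalDist


def sessions_per_arm(baseline_rate: float, relative_drop: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sessions needed in EACH arm to detect a conversion drop.

    Two-sided test, equal-sized arms, normal approximation.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 - relative_drop)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical store: 2.5% baseline conversion, looking for a 10% relative drop.
print(sessions_per_arm(baseline_rate=0.025, relative_drop=0.10))
```

Under these assumptions the answer lands in the tens of thousands of sessions per arm, far beyond the 333 exposed sessions in the worked example.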

That formula tells you the blast radius of a bad deploy: how many real shoppers see the broken version before you pull it. Most teams choose canary % so this number stays small for the first hour, then ramp aggressively once the early signals look clean.

Benchmark

Typical canary ramp schedule for a mid-size store

| Stage | Traffic share | Min. duration | What you watch |
|---|---|---|---|
| 1. Smoke | 1% | 15-30 min | 5xx rate, JS errors, checkout API errors |
| 2. Early | 5% | 1-2 hours | Add-to-cart rate, payment success, p95 latency |
| 3. Ramp | 25% | 2-4 hours | Conversion rate vs control, refund/dispute signals |
| 4. Majority | 50% | 4-12 hours | Full funnel parity, mobile vs desktop deltas |
| 5. Full | 100% | n/a | Decommission old version, keep rollback ready 24h |
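
The schedule can also be held as data, so you can estimate how many sessions each stage exposes at your own traffic level. This is a sketch under stated assumptions: the `Stage` class and `exposure_by_stage` function are hypothetical names, the durations are the lower bounds from the benchmark schedule, and traffic is treated as even across the day.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    share: float      # fraction of traffic on the canary
    min_hours: float  # lower bound of the stage's minimum duration


# Shares and lower-bound durations taken from the benchmark schedule.
RAMP = [
    Stage("Smoke", 0.01, 0.25),
    Stage("Early", 0.05, 1.0),
    Stage("Ramp", 0.25, 2.0),
    Stage("Majority", 0.50, 4.0),
]


def exposure_by_stage(daily_sessions: float, ramp=RAMP):
    """Return (stage name, sessions in stage, cumulative sessions) per stage."""
    rows, total = [], 0.0
    for stage in ramp:
        sessions = daily_sessions * stage.share * stage.min_hours / 24
        total += sessions
        rows.append((stage.name, round(sessions), round(total)))
    return rows


for name, sessions, cumulative in exposure_by_stage(40_000):
    print(f"{name:<9} {sessions:>6} {cumulative:>6}")
```

Swap in your own daily sessions and durations; the per-stage counts show how much evidence each stage can realistically accumulate before you ramp.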

The right ramp depends on traffic volume and risk. A €10M Shopify store can push a checkout change through this whole schedule in a day. A smaller store with 5,000 sessions a day needs longer at each stage just to accumulate enough events to trust the signal, or it should canary only lower-risk changes.

Frequently asked

Canary releases FAQ

What's the difference between a canary release and a feature flag?

A canary is an infrastructure split between two builds of your service; a feature flag is a code-level toggle inside one build. Canaries derisk the deployment itself (did this commit break something?), flags derisk individual features (should this UI element be visible to this user?). Most mature teams use both together.

Is a canary release the same as an A/B test?

No. A/B tests measure whether a change improves a business metric, with random assignment, statistical significance, and a fixed sample size. Canaries measure whether a change breaks production, with traffic ramped progressively and rolled back on regressions. The infrastructure looks similar, but the success criteria differ: an A/B test looks for lift, a canary looks for the absence of regressions.

Do I need canary releases for Shopify theme changes?

Usually no. Shopify's preview themes plus a staging review handle most theme work. Canaries make sense for changes that only manifest under real traffic: app embeddings, checkout extensions, third-party scripts, or anything touching the order flow on Shopify Plus.

What traffic percentage should a canary start at?

Start at 1% for high-risk changes (checkout, payment, authentication) and 5% for lower-risk ones (PDP layout, marketing pages). The goal is for the exposed-sessions count to be high enough to surface hard errors within 15-30 minutes but low enough that a complete failure is recoverable.

Which metrics should I watch during a canary?

Two tiers. Hard signals: HTTP 5xx rate, JS error rate, payment API errors, p95 latency. Any of these spiking against the control version means roll back now. Soft signals: add-to-cart rate, checkout completion, conversion rate. These need more volume but flag silent regressions.
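
The two tiers can be sketched as a simple decision function. Everything here is illustrative: the `Metrics` fields, the `evaluate` function, and all three thresholds are assumptions rather than recommended values, and a real check would also account for sample size before trusting a soft signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_5xx_rate: float      # share of requests returning HTTP 5xx
    js_error_rate: float       # share of sessions with a JS error
    payment_error_rate: float  # share of payment API calls that fail
    p95_latency_ms: float
    conversion_rate: float     # soft signal


# Hypothetical tolerances; tune them to your own baselines.
HARD_RATIO_LIMIT = 1.5    # canary may be at most 1.5x control on a hard signal
RATE_NOISE_FLOOR = 0.001  # ignore error rates below 0.1%
SOFT_DROP_LIMIT = 0.10    # flag a relative conversion drop above 10%


def spiked(control_value: float, canary_value: float, floor: float = 0.0) -> bool:
    """True when the canary value is above the floor and well above control."""
    return canary_value > floor and canary_value > control_value * HARD_RATIO_LIMIT


def evaluate(control: Metrics, canary: Metrics) -> str:
    """Return 'rollback' on a hard signal, 'investigate' on a soft one, else 'continue'."""
    hard_rates = [
        (control.error_5xx_rate, canary.error_5xx_rate),
        (control.js_error_rate, canary.js_error_rate),
        (control.payment_error_rate, canary.payment_error_rate),
    ]
    if any(spiked(c, n, RATE_NOISE_FLOOR) for c, n in hard_rates):
        return "rollback"
    if spiked(control.p95_latency_ms, canary.p95_latency_ms):
        return "rollback"
    if control.conversion_rate > 0:
        relative_drop = 1 - canary.conversion_rate / control.conversion_rate
        if relative_drop > SOFT_DROP_LIMIT:
            return "investigate"
    return "continue"
```

Hard signals short-circuit to a rollback, while a soft-signal drop only asks for a closer look, which mirrors the idea that soft signals need more volume before they can be trusted.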

How do I split traffic for a canary?

Most stores do it at the edge: Cloudflare Workers, Fastly VCL, or AWS ALB weighted target groups. Headless setups can use Vercel's edge middleware or Netlify split testing. The router hashes a stable identifier (session ID, customer ID) so the same shopper sticks to one version across requests.
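
The sticky-hash idea is independent of any particular edge platform. The sketch below shows it in plain Python rather than in a Worker or VCL; the function names and the 10,000-bucket resolution are assumptions for illustration.

```python
import hashlib

BUCKETS = 10_000  # resolution of 0.01% of traffic


def bucket_for(stable_id: str) -> int:
    """Map a stable identifier to a bucket in [0, BUCKETS) deterministically."""
    digest = hashlib.sha256(stable_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def routes_to_canary(stable_id: str, canary_share: float) -> bool:
    """True when this identifier falls inside the canary's share of buckets."""
    return bucket_for(stable_id) < round(canary_share * BUCKETS)


# The same session ID always gets the same answer at a given share.
print(routes_to_canary("session-abc123", 0.05))
```

Because the canary owns the lowest buckets, raising the share from 5% to 25% only adds shoppers to the canary; nobody who was already on the new version is moved back to the stable one mid-ramp.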

Can I run A/B tests while a canary is live?

Yes, but layer them carefully. The canary controls who sees the new build; the A/B test runs inside whichever build a session lands on. Don't let the canary and test split on the same dimension or you'll contaminate the experiment. Most teams gate experiments to the stable build until the canary hits 100%.
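
One way to keep the two splits from colliding is to hash the same identifier with a different salt for each decision, and to skip experiment assignment for canary sessions until the rollout is complete. This is a sketch of that idea only; the salts `canary-rollout` and `checkout-test`, the 50/50 variant split, and the function names are all hypothetical.

```python
import hashlib
from typing import Optional, Tuple


def bucket(stable_id: str, salt: str, buckets: int = 100) -> int:
    """Deterministic bucket for an identifier; different salts give unrelated splits."""
    digest = hashlib.sha256(f"{salt}:{stable_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def assign(stable_id: str, canary_percent: int) -> Tuple[str, Optional[str]]:
    """Return (build, experiment variant); variant is None while a session is gated."""
    in_canary = bucket(stable_id, salt="canary-rollout") < canary_percent
    if in_canary and canary_percent < 100:
        # Gate experiments to the stable build until the canary reaches 100%.
        return "canary", None
    variant = "B" if bucket(stable_id, salt="checkout-test") < 50 else "A"
    return ("canary" if in_canary else "stable"), variant


print(assign("session-abc123", canary_percent=5))
```

Using separate salts means knowing a session's canary bucket tells you nothing about its experiment variant, so the experiment's split stays balanced inside the stable build.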

Do canary releases work for low-traffic stores?

Below ~5,000 sessions/day, canaries detect hard errors but rarely detect conversion regressions in a useful window. Compensate by keeping the canary at a higher percentage (10-25%) for longer, or restrict canary use to changes where production-only bugs are the main risk: payment integrations, third-party scripts, server-rendered pages.

How long should each canary stage run?

Long enough to accumulate a few hundred sessions per stage at minimum, and to cover at least one full traffic cycle (e.g. peak hour vs off-peak). For a mid-size store, 15-30 minutes at 1%, 1-2 hours at 5%, and a few hours each at 25% and 50% is a workable default. High-risk changes get longer dwell times.
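
Rearranging the exposed-sessions formula gives a rough dwell time for a target session count. In this sketch the target of 300 sessions is one reading of "a few hundred", the function name is hypothetical, and traffic is again assumed to be even across the day.

```python
def hours_to_reach(target_sessions: float, daily_sessions: float, canary_share: float) -> float:
    """Hours needed at a given share to expose target_sessions, assuming even traffic."""
    if daily_sessions <= 0 or canary_share <= 0:
        raise ValueError("daily_sessions and canary_share must be positive")
    return target_sessions * 24 / (daily_sessions * canary_share)


# Compare a 40,000 sessions/day store with a 5,000 sessions/day store at a 5% canary.
for daily in (40_000, 5_000):
    print(daily, round(hours_to_reach(300, daily, 0.05), 1))
```

At the same 5% share the smaller store needs eight times as long to see the same number of sessions, which is why low-traffic stores are better served by a higher share or a longer stage.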

Canary releases are one technique inside the broader feature experimentation toolkit, alongside feature flags, A/B tests, and multivariate tests. Canaries protect the deploy, flags control exposure, and A/B tests measure impact. Treating them as separate stages of the same release pipeline is how mature teams ship fast without breaking conversion.
