Summary of "The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)"
High-level summary
Practical, tactical guide to A/B testing and building an experimentation culture from Ronny Kohavi (ex-Amazon, Microsoft, Airbnb). Focus areas covered:
- When to start testing and what to test
- Experiment platform best practices and organizational rollout
- Statistical pitfalls and diagnostic safeguards
- How to make experimentation drive long-term product decisions rather than only local optimizations
Key technological concepts, methods, and product/practice recommendations
Test-everything philosophy
- Treat every code change or new feature as an experiment when possible. Small/bug-fix changes can have surprising, outsized effects—once platform cost is low, you cannot “experiment too much.”
Portfolio approach
- Run many small, incremental experiments and keep a controlled allocation for high-risk/high-reward bets (expect ~70–80% failure for big bets).
When to start A/B testing
- You need volume. Rules of thumb:
  - Start when you have tens of thousands of users.
  - ~200k users is a practical “magic” point to reliably detect ~5% effects for common consumer metrics.
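A rough way to sanity-check the “~200k users” rule of thumb is a standard two-proportion power calculation. The sketch below uses illustrative values (5% baseline conversion, 5% relative lift, two-sided α = 0.05, 80% power) that are assumptions, not figures from the episode:

```python
from math import ceil, sqrt
from scipy.stats import norm

def users_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Illustrative: 5% baseline conversion, detect a 5% relative lift.
n = users_per_variant(0.05, 0.05)
print(n, "per variant,", 2 * n, "total")  # ≈122k per variant, ≈244k total: same ballpark as ~200k
```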
Overall Evaluation Criterion (OEC)
- Define a single, causal metric (or constrained objective) representative of user lifetime value.
- Use guardrail/countervailing metrics (retention, time-to-success, satisfaction) so short-term revenue increases don’t damage long-term value.
- Consider constraints (e.g., ad real-estate budget) rather than pure revenue maximization.
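To make “single OEC plus guardrails” concrete, here is a minimal ship/no-ship decision sketch; the metric names, thresholds, and structure are illustrative assumptions, not an API described in the episode:

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    relative_lift: float  # e.g., +0.02 means +2%
    p_value: float

def ship_decision(oec: MetricResult, guardrails: list[MetricResult],
                  alpha: float = 0.01, max_guardrail_drop: float = -0.01) -> bool:
    """Ship only if the OEC improves significantly and no guardrail regresses badly."""
    oec_wins = oec.relative_lift > 0 and oec.p_value < alpha
    guardrails_ok = all(
        not (g.relative_lift < max_guardrail_drop and g.p_value < alpha)
        for g in guardrails
    )
    return oec_wins and guardrails_ok

# Illustrative: revenue is up, but 7-day retention is significantly down, so do not ship.
print(ship_decision(
    MetricResult("revenue_per_user", +0.03, 0.002),
    [MetricResult("retention_7d", -0.02, 0.004),
     MetricResult("time_to_success", 0.00, 0.60)],
))  # False
```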
Long-term effects
- Use long-term holdouts, forecasting models, or lift-to-LTV models when impact arrives later (e.g., post-booking ratings).
- Replicate promising results and combine evidence rather than relying on a single test.
Experiment platform and workflow recommendations
- Drive marginal experiment cost toward zero: self-service setup, templated scorecards/metric sets, automated checks and diagnostics, and fast, reliable scorecards so teams don’t need an analyst each time.
- Key platform features:
  - Randomization integrity tests (sample ratio checks)
  - Automated guardrail and diagnostic metrics
  - Variance-reduction tooling
  - Searchable past experiments (institutional memory)
  - Automated replication follow-ups
- Build vs buy: commercial experimentation platforms are mature; many teams use hybrid approaches (buy core, build integrations or vice-versa).
- Organizational rollout: start with a high-velocity team, demonstrate surprising wins, share learnings, run quarterly reviews of “most interesting” experiments and document results.
Statistical and diagnostic pitfalls (and safeguards)
- Sample Ratio Mismatch (SRM): a common integrity issue where allocation differs from the intended ratio. Causes include bots, instrumentation issues, filtering in data pipelines, sign-in placement, and campaign routing. Always automatically test for SRM and treat results as untrustworthy until diagnosed (a minimal automated check is sketched after this list).
- Twyman’s law: “If a figure looks interesting or different, it is usually wrong.”
  - Big, surprising lifts are more likely due to experimental flaws than real effects—investigate first.
- Misinterpreting p-values:
  - The p-value is P(data at least this extreme | null hypothesis is true), not P(null | data). Teams often misread 1 − p as “the probability the treatment is better.”
  - Without priors, the false positive risk is higher than p suggests. Example: with an 8% prior success rate, p < 0.05 can imply ≈26% false-positive risk.
  - Remedies: lower p-value thresholds (e.g., < 0.01), require replication, or combine experiments (Fisher / Stouffer).
- Early stopping and peeking:
  - Real-time p-value checking without correction inflates false positives. Use proper sequential testing methods or pre-specified stopping rules.
- Diagnostic practices:
  - Blank or flag invalid scorecards.
  - Force engineers/PMs to acknowledge SRM warnings.
  - Require replication for marginal or surprising wins.
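The SRM integrity test mentioned above can be automated with a chi-square goodness-of-fit test on observed versus intended assignment counts. A minimal sketch, with made-up counts and an illustrative alert threshold:

```python
from scipy.stats import chisquare

def srm_check(control_users, treatment_users, expected_ratio=(1, 1), threshold=1e-3):
    """Flag a Sample Ratio Mismatch: is the observed split consistent with the design?"""
    total = control_users + treatment_users
    weight = sum(expected_ratio)
    expected = [total * r / weight for r in expected_ratio]
    _, p_value = chisquare([control_users, treatment_users], f_exp=expected)
    # A tiny p-value means this split is very unlikely under the intended ratio;
    # treat the scorecard as untrustworthy until the cause is diagnosed.
    return p_value, p_value < threshold

p, srm = srm_check(821_588, 815_482)  # a 50/50 design; counts are illustrative
print(f"p = {p:.1e}, SRM detected: {srm}")  # p ≈ 1.8e-06, SRM detected: True
```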
Ways to speed experiments and improve power
- Variance reduction techniques to lower metric variance and shorten required sample size/duration (a minimal sketch follows this list):
  - Cap or handle long-tailed metrics (e.g., cap extremely large purchase values or nights booked).
  - Pre-experiment covariate adjustment (CUPED: Controlled-experiment Using Pre-Experiment Data) using pre-period data.
- Focus on metric design (OEC) and use templates so the platform computes guardrails automatically.
- Provide rapid, ideally automated, scorecards to avoid analysis backlogs.
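As an illustration of the two variance-reduction ideas above (capping long tails and CUPED), here is a minimal sketch on synthetic data; the cap value and the single-covariate setup are assumptions, not the platform implementation discussed in the episode:

```python
import numpy as np

def winsorize(values, cap):
    """Cap extreme values of a long-tailed metric (e.g., very large purchase amounts)."""
    return np.minimum(values, cap)

def cuped_adjust(metric, pre_metric):
    """CUPED: remove the part of the metric explained by pre-experiment data.

    theta is the regression coefficient of the in-experiment metric on the
    pre-period covariate; the adjusted metric keeps its mean but has lower variance.
    """
    theta = np.cov(metric, pre_metric)[0, 1] / np.var(pre_metric, ddof=1)
    return metric - theta * (pre_metric - pre_metric.mean())

# Synthetic example: pre-period spend is correlated with in-experiment spend.
rng = np.random.default_rng(0)
pre = rng.gamma(2.0, 50.0, size=10_000)               # pre-period spend per user
post = 0.8 * pre + rng.gamma(2.0, 20.0, size=10_000)  # correlated in-experiment spend
post = winsorize(post, cap=500.0)                     # cap the long tail
adjusted = cuped_adjust(post, pre)
print(np.var(adjusted) / np.var(post))                # variance ratio well below 1
```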
Design and process advice for experiments and launches
- One-factor-at-a-time (OAT): prefer incremental launches and test changes separately; decompose big redesigns rather than bundling many changes.
- Allocate some budget to big bets but accept high fail rates; have abort/replicate plans and data to stop quickly if negative.
- Document and surface both surprising winners and surprising losers—they both teach product and infrastructure lessons.
- Use constraints/optimization framing (e.g., pixel budget for ads) to avoid perverse incentives that boost short-term metrics at long-term cost.
Concrete resources, guides, and artifacts
- Book: Trustworthy Online Controlled Experiments (Kohavi et al.) — practical, focused on trust and operational issues.
- Ronny Kohavi’s papers/talks:
- “Rules of Thumb” (patterns from thousands of experiments)
- Paper on diagnosing Sample Ratio Mismatch (SRM)
- Practical defaults talk (sample sizes and thresholds)
- CUPED variance-reduction article
- GoodUI.org — a collection of UI A/B testing patterns and aggregated experiment outcomes (150+ patterns).
- Vendor examples: Mixpanel, Eppo (geteppo.com)
- Historical caution: early misuse of Optimizely led to inflated false positives, which underscores the need for correct statistics in tools.
- Methods/tactics: Fisher’s method or pooled-replication methods to combine p-values, automated scorecards, searchable experiment index.
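For the “combine experiments” remedy (Fisher / Stouffer) mentioned under the statistical pitfalls, scipy has a ready-made helper; a minimal sketch with illustrative p-values from two replications:

```python
from scipy.stats import combine_pvalues

# Two independent replications of the same test, each only marginally significant.
p_values = [0.04, 0.03]

_, p_fisher = combine_pvalues(p_values, method='fisher')
_, p_stouffer = combine_pvalues(p_values, method='stouffer')
print(p_fisher, p_stouffer)  # ≈0.009 and ≈0.005: stronger evidence than either test alone
```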
Representative empirical examples
- Bing: a tiny UI change (promoting the second line of ad copy into the first line) produced a ~12% revenue increase; an unexpected, large win from an idea that had sat low-priority in the backlog.
- Bing / Hotmail / MSN: opening links in a new tab increased engagement/revenue—replicated across products and later at Airbnb.
- Bing relevance team: many small, incremental monthly gains compounding to multi-percent yearly improvements.
- Airbnb: ~250 search-relevance experiments produced ~6% revenue improvement overall; 92% of experiments failed to improve the primary metric.
- Microsoft: company-level experiments had ~66% failure rate overall; Bing saw ~85% failure; experiment scale reached ~20–25k experiments/year at one point.
- Amazon email campaigns: adjusting models to account for unsubscribe costs (LTV counterfactual) reduced harmful campaigns and led to features like default unsubscribe-from-author.
Operational tips and cultural advice
- Build experiment “trust”: automated checks, transparent failures, and a platform that flags invalid tests so stakeholders stop trusting flaky outputs.
- Institutionalize knowledge: searchable experiment repository, quarterly “surprising experiments” reviews, and documentation of replications and lessons.
- Convincing skeptics: start with a high-velocity team, surface surprising wins, cross-pollinate learnings, and show how frequent small experiments catch regressions.
- Avoid “ship-on-flat” fallacy: flat (non-significant) experiments shouldn’t be shipped just because work was invested—maintenance and complexity are real costs; only ship when positive or required.
Statistical numbers to remember
- Typical failure rates (ideas not improving primary metric): ~66% (Microsoft overall), ~85% (Bing), and 80–92% in mature domains—winning ideas are a small minority.
- Sample size guidance: tens of thousands of users are a useful starting point; ~200k users to detect ~5% changes reliably.
- False positive example: with an ~8% prior success rate, p < 0.05 corresponds to ≈26% false-positive risk (not 5%).
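The ≈26% figure follows from a simple Bayes calculation over “idea works” vs “idea does not work.” The power (80%) and one-sided alpha (0.025, since only positive results ship) used below are plausible assumptions, not numbers stated in this summary:

```python
def false_positive_risk(prior_success_rate, alpha=0.025, power=0.80):
    """P(true effect is null | experiment declared a significant win), via Bayes' rule."""
    works = prior_success_rate
    false_pos = alpha * (1 - works)  # null ideas that still cross the significance threshold
    true_pos = power * works         # genuinely good ideas that are detected
    return false_pos / (false_pos + true_pos)

print(false_positive_risk(0.08))  # ≈0.26, i.e. roughly 26% rather than 5%
```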
Practical checklist / short playbook
- Set OEC early and include countervailing/guardrail metrics tied to lifetime value.
- Ensure randomization and instrumentation integrity checks (SRM) run automatically; block scorecards if integrity fails.
- Start small: OAT testing; avoid big-bang launches where possible.
- Use variance-reduction and pre-period covariates to speed tests.
- Require replication for surprising/large lifts; combine experiments to reduce false-positive risk.
- Document experiments in a searchable archive and run regular reviews of surprising winners/losers.
- Decide build vs buy based on platform maturity, team resources, and experiment volume.
Resources to follow / learn from
- Trustworthy Online Controlled Experiments (Kohavi et al.) — book
- GoodUI.org — UI experimentation patterns library
- Kohavi’s papers/talks: Rules of Thumb; SRM diagnosis; CUPED; practical defaults
- Vendors/platforms to evaluate: Mixpanel (analytics), Eppo (experimentation)
Speakers and sponsors
- Guest: Ronny Kohavi — A/B testing & experimentation expert (ex-Amazon, Microsoft, Airbnb), author of Trustworthy Online Controlled Experiments.
- Host: Lenny (Lenny’s Podcast).
- Sponsors / tools mentioned: Mixpanel, Round, Eppo (geteppo.com), Optimizely (historical example).
Optional follow-ups
- Extract specific papers/links mentioned (SRM paper, Rules-of-Thumb, CUPED) and provide direct links.
- Produce a short 1-page engineer/PM checklist to use before launching experiments.
Category
Technology