Summary of "LISA17 - Queueing Theory in Practice: Performance Modeling for the Working Engineer"
Overview
- Speaker: Eben Freeman (engineer at Honeycomb).
- Talk goal: Build small experiments plus simple analytic models to extrapolate performance, identify bottlenecks, and guide design/scale decisions. Emphasis on validating models with real data and being explicit about assumptions.
- Key takeaway: Use small experiments and simple queueing models to reason about latency, capacity, and scaling without large-scale load testing.
Serial (single-server) modeling
- Experiment setup:
  - Measure latency versus throughput on one CPU core.
  - Send requests from many independent clients (arrivals assumed random) and observe latencies.
- Model:
  - Simplest single-server queueing model: random arrivals at rate λ, constant service time s, one job processed at a time (M/D/1-like assumptions).
  - Area-under-curve reasoning gives a closed-form expression for the average wait time W as a function of λ and s (see the formula sketch after this list).
  - The model reproduces measured latency curves and reveals the "knee": the utilization level beyond which latency rises rapidly.
- Practical result: When restricted to a consistent performance regime, the simple model fits real data surprisingly well.
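For reference, a minimal sketch of the closed-form result under these assumptions (Poisson arrivals at rate λ, fixed service time s, utilization ρ = λs < 1); this is the standard M/D/1 mean-wait formula, and the talk's area-under-curve derivation arrives at the same shape:

```latex
% Average time a request spends waiting in queue (M/D/1, rho = lambda * s < 1):
W \;=\; \frac{\lambda s^{2}}{2\,(1 - \lambda s)} \;=\; \frac{\rho}{1 - \rho} \cdot \frac{s}{2}
```

The 1/(1 − ρ) factor is what produces the knee: waits stay small at moderate utilization and grow without bound as ρ approaches 1.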
Key lessons from serial modeling
- Reduce per-request work (service time s): the biggest win for both latency and capacity. For example, halving s can allow much higher throughput while maintaining similar latency (see the worked example after this list).
- Variability is harmful: variable inter-arrival times and variable job sizes increase queuing and latency. Constant-size jobs and uniform arrivals minimize queuing.
- Mitigations for variability:
  - Batching
  - Aggressive preemption/timeouts
  - Client-side backpressure or concurrency control
  - Reducing per-request variance
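A worked example of the "reduce s" lesson, using the formula above with illustrative numbers (not figures from the talk):

```latex
% s = 1 ms, lambda = 800 req/s:
\rho = \lambda s = 0.8, \qquad W = \frac{\rho}{1-\rho}\cdot\frac{s}{2} = 4 \times 0.5\ \text{ms} = 2\ \text{ms}
% Halving s to 0.5 ms at the same arrival rate:
\rho = 0.4, \qquad W = \frac{0.4}{0.6}\cdot 0.25\ \text{ms} \approx 0.17\ \text{ms}
% The original rho = 0.8 (and comparable queueing) is not reached again until lambda = 1600 req/s,
% i.e. halving the per-request work roughly doubles capacity at similar latency.
```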
Parallel systems and scaling
- Naive assumption: N servers give N× single-server capacity — not necessarily true. The outcome depends on task assignment and load balancing.
- Optimal assignment (always pick the least-busy server) minimizes queuing at high utilization, but finding and maintaining that choice incurs coordination costs.
- Coordination costs:
  - Assign-time overhead (α)
  - Costs that grow with number of servers (e.g., probing each backend adds β·N)
- As N increases, coordination overhead can dominate and reduce net throughput.
- Universal Scalability Law: a compact model that captures throughput versus parallelism, including serialization/coordination costs; it explains diminishing or negative returns when adding servers without careful design (formula sketch below).
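For reference, the standard form of the Universal Scalability Law (σ is the serialization/contention coefficient, κ the crosstalk/coherency coefficient; both are fit from measured throughput data):

```latex
% Throughput at concurrency N, relative to single-server throughput X(1):
X(N) \;=\; \frac{N \cdot X(1)}{1 + \sigma\,(N - 1) + \kappa\,N\,(N - 1)}
% sigma = kappa = 0 : linear scaling
% sigma > 0         : diminishing returns
% kappa > 0         : retrograde scaling; throughput eventually falls as N grows
```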
Design patterns to improve scalability
- Power-of-two-choices (pick-two):
  - Pick two servers at random and send the task to the less-loaded one (see the first sketch after this list).
  - Constant overhead independent of N; dramatically reduces the maximum load imbalance (roughly from ~log N / log log N under purely random placement to ~log log N).
  - Used in large-scale schedulers/load balancers (and in systems like Sparrow).
- Partitioning + hierarchical aggregation (iterative parallelization):
  - Avoid aggregating all partial results at a single node (aggregation cost ∝ N).
  - Use intermediate aggregators in a tree-shaped reduce, cutting aggregation cost to O(log N) (see the second sketch after this list).
  - Scan time shrinks as data is spread across more partitions, while aggregation cost grows with the number of partial results; a tree-shaped reduce lowers the aggregation term and improves overall scaling.
- Randomized approximation and iterative partitioning:
  - Balance coordination cost against latency/throughput by trading exactness for lower coordination.
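A minimal balls-into-bins sketch of power-of-two-choices, assuming "load" is simply the number of tasks already queued on a server (illustrative Python, not code from the talk):

```python
import random

def max_load(num_servers: int, num_tasks: int, choices: int) -> int:
    """Assign each task to the least-loaded of `choices` randomly sampled
    servers; return the largest queue length that results."""
    load = [0] * num_servers
    for _ in range(num_tasks):
        candidates = random.sample(range(num_servers), choices)
        target = min(candidates, key=lambda i: load[i])
        load[target] += 1
    return max(load)

if __name__ == "__main__":
    n = 10_000
    print("purely random placement, max load:", max_load(n, n, choices=1))
    print("power of two choices,    max load:", max_load(n, n, choices=2))
```

The per-task cost is constant no matter how many servers there are, which is the point: near-optimal balance without probing every backend.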
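And a small sketch of tree-shaped aggregation, assuming partial results can be combined with an associative merge function; `tree_reduce` and `fan_in` are illustrative names, not from any system mentioned in the talk:

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def tree_reduce(parts: List[T], merge: Callable[[T, T], T], fan_in: int = 2) -> T:
    """Merge partial results level by level. Each level shrinks the list by a
    factor of `fan_in`, so no single node merges more than `fan_in` inputs and
    the aggregation depth is O(log N) rather than one coordinator doing O(N) work."""
    while len(parts) > 1:
        next_level = []
        for i in range(0, len(parts), fan_in):
            group = parts[i:i + fan_in]
            acc = group[0]
            for item in group[1:]:
                acc = merge(acc, item)
            next_level.append(acc)
        parts = next_level
    return parts[0]

if __name__ == "__main__":
    partials = list(range(64))                                   # e.g. per-partition counts
    print(tree_reduce(partials, lambda a, b: a + b, fan_in=4))   # prints 2016
```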
Practical guidance / workflow
- State goals and assumptions explicitly.
- Run small tests and microbenchmarks; collect instrumentation data needed to fit models.
- Validate simple models against production-like data before trusting extrapolations.
- If unsure, draw the system timeline/queueing picture, write a small simulation, or consult standard queueing results/textbooks (a minimal simulation sketch follows this list).
- Watch for unbounded queues (which lead to unbounded latency) and minimize variance. The simplest capacity improvement is reducing work per request.
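As an example of the "write a small simulation" advice, here is a minimal single-server queue simulation assuming Poisson arrivals and a constant service time (the M/D/1-style setup from the serial-modeling section; parameters are illustrative):

```python
import random

def avg_wait(arrival_rate: float, service_time: float,
             num_requests: int = 100_000, seed: int = 1) -> float:
    """Average time a request spends queued before its service starts."""
    rng = random.Random(seed)
    clock = 0.0           # arrival time of the current request
    server_free_at = 0.0  # time the server finishes its current job
    total_wait = 0.0
    for _ in range(num_requests):
        clock += rng.expovariate(arrival_rate)   # exponential inter-arrival times
        start = max(clock, server_free_at)       # queue if the server is busy
        total_wait += start - clock
        server_free_at = start + service_time    # constant service time
    return total_wait / num_requests

if __name__ == "__main__":
    s = 0.001  # 1 ms of work per request
    for rho in (0.5, 0.8, 0.9, 0.95):
        lam = rho / s
        print(f"rho={rho:.2f}  avg wait={avg_wait(lam, s) * 1000:.2f} ms")
```

The simulated waits should track the closed-form curve above and show the same knee as utilization approaches 1.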
References / concepts / systems mentioned
- Theoretical concepts:
  - Poisson/random arrivals
  - Single-server queue (M/D/1-like assumptions)
  - Variability effects
  - Universal Scalability Law
- Algorithms / systems:
  - Power-of-two-choices (pick-two) load balancing
  - Sparrow scheduler
  - Facebook Scuba (and general map-reduce / distributed query partitioning + aggregation)
- Papers:
  - Literature on two-choice load balancing and related schedulers
- Follow-up talk:
  - Baron Schwartz: deeper dive on the Universal Scalability Law
Main speaker / sources
- Main speaker: Eben Freeman (engineer at Honeycomb), who presented the experiments and models.
- Referenced speaker: Baron Schwartz, whose follow-up talk covers the Universal Scalability Law.
- Referenced systems/papers: power-of-two-choices literature, Sparrow scheduler, Facebook Scuba, and papers on randomized load balancing and distributed query aggregation.