Summary of "LISA17 - Queueing Theory in Practice: Performance Modeling for the Working Engineer"
Overview
- Speaker: Eben Freeman (engineer at Honeycomb).
- Talk goal: Build small experiments plus simple analytic models to extrapolate performance, identify bottlenecks, and guide design/scale decisions. Emphasis on validating models with real data and being explicit about assumptions.
- Key takeaway: Use small experiments and simple queueing models to reason about latency, capacity, and scaling without large-scale load testing.
Serial (single-server) modeling
- Experiment setup:
  - Measure latency versus throughput on one CPU core.
  - Send requests from many independent clients (arrivals assumed random) and observe latencies.
- Model:
  - Simplest single-server queueing model: random arrivals at rate λ, constant service time s, one job processed at a time (M/D/1-like assumptions).
  - Area-under-curve reasoning gives a closed-form expression for the average wait time W as a function of λ and s (see the formula sketch after this list).
  - The model reproduces measured latency curves and reveals the "knee": the utilization level beyond which latency rises rapidly.
- Practical result: When restricted to a consistent performance regime, the simple model fits real data surprisingly well.
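For reference, a minimal sketch of the closed-form result under these assumptions (Poisson arrivals at rate λ, fixed service time s, utilization ρ = λs < 1); this is the standard M/D/1 mean-wait formula, and the talk's area-under-curve derivation arrives at the same shape:

```latex
% Average time a request spends waiting in queue (M/D/1, rho = lambda * s < 1):
W \;=\; \frac{\lambda s^{2}}{2\,(1 - \lambda s)} \;=\; \frac{\rho}{1 - \rho} \cdot \frac{s}{2}
```

The 1/(1 − ρ) factor is what produces the knee: waits stay small at moderate utilization and grow without bound as ρ approaches 1.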
Key lessons from serial modeling
- Reduce per-request work (service time s): the biggest win for both latency and capacity. For example, halving s can allow much higher throughput while maintaining similar latency (see the worked example after this list).
- Variability is harmful: variable inter-arrival times and variable job sizes increase queuing and latency. Constant-size jobs and uniform arrivals minimize queuing.
- Mitigations for variability:
  - Batching
  - Aggressive preemption/timeouts
  - Client-side backpressure or concurrency control
  - Reducing per-request variance
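A worked example of the "reduce s" lesson, using the formula above with illustrative numbers (not figures from the talk):

```latex
% s = 1 ms, lambda = 800 req/s:
\rho = \lambda s = 0.8, \qquad W = \frac{\rho}{1-\rho}\cdot\frac{s}{2} = 4 \times 0.5\ \text{ms} = 2\ \text{ms}
% Halving s to 0.5 ms at the same arrival rate:
\rho = 0.4, \qquad W = \frac{0.4}{0.6}\cdot 0.25\ \text{ms} \approx 0.17\ \text{ms}
% The original rho = 0.8 (and comparable queueing) is not reached again until lambda = 1600 req/s,
% i.e. halving the per-request work roughly doubles capacity at similar latency.
```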
Parallel systems and scaling
- Naive assumption: N servers give N× single-server capacity — not necessarily true. The outcome depends on task assignment and load balancing.
- Optimal assignment (always pick the least-busy server) minimizes queuing at high utilization, but finding and maintaining that choice incurs coordination costs.
- Coordination costs:
  - Assign-time overhead (α)
  - Costs that grow with number of servers (e.g., probing each backend adds β·N)
- As N increases, coordination overhead can dominate and reduce net throughput.
- Universal Scalability Law: a compact model that captures throughput versus parallelism, including serialization/coordination costs; it explains diminishing or negative returns when adding servers without careful design (formula sketch below).
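For reference, the standard form of the Universal Scalability Law (σ is the serialization/contention coefficient, κ the crosstalk/coherency coefficient; both are fit from measured throughput data):

```latex
% Throughput at concurrency N, relative to single-server throughput X(1):
X(N) \;=\; \frac{N \cdot X(1)}{1 + \sigma\,(N - 1) + \kappa\,N\,(N - 1)}
% sigma = kappa = 0 : linear scaling
% sigma > 0         : diminishing returns
% kappa > 0         : retrograde scaling; throughput eventually falls as N grows
```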
Design patterns to improve scalability
- Power-of-two-choices (pick-two):
  - Pick two servers at random and send the task to the less-loaded one (see the first sketch after this list).
  - Constant overhead independent of N; dramatically reduces the maximum load imbalance (roughly from ~log N / log log N under purely random placement to ~log log N).
  - Used in large-scale schedulers/load balancers (and in systems like Sparrow).
- Partitioning + hierarchical aggregation (iterative parallelization):
  - Avoid aggregating all partial results at a single node (aggregation cost ∝ N).
  - Use intermediate aggregators in a tree-shaped reduce, cutting aggregation cost to O(log N) (see the second sketch after this list).
  - Scan time shrinks as data is spread across more partitions, while aggregation cost grows with the number of partial results; a tree-shaped reduce lowers the aggregation term and improves overall scaling.
- Randomized approximation and iterative partitioning:
  - Balance coordination cost against latency/throughput by trading exactness for lower coordination.
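A minimal balls-into-bins sketch of power-of-two-choices, assuming "load" is simply the number of tasks already queued on a server (illustrative Python, not code from the talk):

```python
import random

def max_load(num_servers: int, num_tasks: int, choices: int) -> int:
    """Assign each task to the least-loaded of `choices` randomly sampled
    servers; return the largest queue length that results."""
    load = [0] * num_servers
    for _ in range(num_tasks):
        candidates = random.sample(range(num_servers), choices)
        target = min(candidates, key=lambda i: load[i])
        load[target] += 1
    return max(load)

if __name__ == "__main__":
    n = 10_000
    print("purely random placement, max load:", max_load(n, n, choices=1))
    print("power of two choices,    max load:", max_load(n, n, choices=2))
```

The per-task cost is constant no matter how many servers there are, which is the point: near-optimal balance without probing every backend.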
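And a small sketch of tree-shaped aggregation, assuming partial results can be combined with an associative merge function; `tree_reduce` and `fan_in` are illustrative names, not from any system mentioned in the talk:

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def tree_reduce(parts: List[T], merge: Callable[[T, T], T], fan_in: int = 2) -> T:
    """Merge partial results level by level. Each level shrinks the list by a
    factor of `fan_in`, so no single node merges more than `fan_in` inputs and
    the aggregation depth is O(log N) rather than one coordinator doing O(N) work."""
    while len(parts) > 1:
        next_level = []
        for i in range(0, len(parts), fan_in):
            group = parts[i:i + fan_in]
            acc = group[0]
            for item in group[1:]:
                acc = merge(acc, item)
            next_level.append(acc)
        parts = next_level
    return parts[0]

if __name__ == "__main__":
    partials = list(range(64))                                   # e.g. per-partition counts
    print(tree_reduce(partials, lambda a, b: a + b, fan_in=4))   # prints 2016
```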
Practical guidance / workflow
- State goals and assumptions explicitly.
- Run small tests and microbenchmarks; collect instrumentation data needed to fit models.
- Validate simple models against production-like data before trusting extrapolations.
- If unsure, draw the system timeline/queueing picture, write a small simulation, or consult standard queueing results/textbooks (a minimal simulation sketch follows this list).
- Watch for unbounded queues (which lead to unbounded latency) and minimize variance. The simplest capacity improvement is reducing work per request.
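As an example of the "write a small simulation" advice, here is a minimal single-server queue simulation assuming Poisson arrivals and a constant service time (the M/D/1-style setup from the serial-modeling section; parameters are illustrative):

```python
import random

def avg_wait(arrival_rate: float, service_time: float,
             num_requests: int = 100_000, seed: int = 1) -> float:
    """Average time a request spends queued before its service starts."""
    rng = random.Random(seed)
    clock = 0.0           # arrival time of the current request
    server_free_at = 0.0  # time the server finishes its current job
    total_wait = 0.0
    for _ in range(num_requests):
        clock += rng.expovariate(arrival_rate)   # exponential inter-arrival times
        start = max(clock, server_free_at)       # queue if the server is busy
        total_wait += start - clock
        server_free_at = start + service_time    # constant service time
    return total_wait / num_requests

if __name__ == "__main__":
    s = 0.001  # 1 ms of work per request
    for rho in (0.5, 0.8, 0.9, 0.95):
        lam = rho / s
        print(f"rho={rho:.2f}  avg wait={avg_wait(lam, s) * 1000:.2f} ms")
```

The simulated waits should track the closed-form curve above and show the same knee as utilization approaches 1.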
References / concepts / systems mentioned
- Theoretical concepts:
  - Poisson/random arrivals
  - Single-server queue (M/D/1-like assumptions)
  - Variability effects
  - Universal Scalability Law
- Algorithms / systems:
  - Power-of-two-choices (pick-two) load balancing
  - Sparrow scheduler
  - Facebook Scuba (and general map-reduce / distributed query partitioning + aggregation)
- Papers:
  - Literature on two-choice load balancing and related schedulers
- Follow-up talk:
  - Baron Schwartz: deeper dive on the Universal Scalability Law
Main speaker / sources
- Main speaker: Eben Freeman (engineer at Honeycomb), who presented the experiments and models.
- Referenced speaker: Baron Schwartz, whose follow-up talk covers the Universal Scalability Law.
- Referenced systems/papers: power-of-two-choices literature, Sparrow scheduler, Facebook Scuba, and papers on randomized load balancing and distributed query aggregation.