The nonlinear relationship between utilization and tail latency
Note | Jupyter notebook for this post: stats-demos/single-queue-sim.ipynb |
Most engineers know that high utilization is bad. What’s less intuitive is how bad — and specifically, that the relationship between utilization and tail latency is nonlinear. A system running at 50% utilization behaves nothing like one running at 90%, even though both sound "fine" by naive capacity planning standards.
This post uses simulation to explore just how nonlinear that relationship gets.
Queueing theory gives us a precise language for this. The M/M/1 queue is the minimal model that captures the essential dynamics of any system where jobs compete for a shared resource such as a connection pool, a lock, a CPU core, or a database.
While the analytic solution for M/M/1 exists and is well-known, we opt to use simulations rather than closed-form analysis. Simulation earns its keep when we want to extend the model to non-exponential service times, multiple servers, or priority queues. Building the simulation first gives us the scaffolding for such extensions.
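To make that scaffolding concrete, here's a minimal sketch of the kind of simulator involved (illustrative, not the notebook's actual code; the function name and structure are my own). It uses the Lindley recursion, which computes each job's wait time directly from the previous job's departure in a single-server FIFO queue:

```python
import numpy as np

def simulate_mm1(n, lam, mu, seed=0):
    """Simulate n jobs through a FIFO M/M/1 queue via the Lindley recursion."""
    rng = np.random.default_rng(seed)
    # numpy's exponential() takes the scale (mean), not the rate.
    interarrivals = rng.exponential(1.0 / lam, size=n)  # mean 1/lambda
    services = rng.exponential(1.0 / mu, size=n)        # mean 1/mu
    arrivals = np.cumsum(interarrivals)

    waits = np.zeros(n)
    for i in range(1, n):
        # A job waits until the previous job departs, or not at all.
        prev_departure = arrivals[i - 1] + waits[i - 1] + services[i - 1]
        waits[i] = max(0.0, prev_departure - arrivals[i])
    return arrivals, waits, services

arrivals, waits, services = simulate_mm1(4000, lam=16, mu=32)
response = waits + services
print(f"mean response time: {response.mean():.4f}")  # theory: 1/(mu - lam) = 0.0625
```

Swapping in a different service-time distribution is a one-line change here, which is exactly the extensibility argument for simulating rather than relying on the closed form.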
M/M/1 overview
The simplest type of queue is the M/M/1 queue. Its name is Kendall's notation for a straightforward setup:
The queue has a single waiting area for jobs, and a single server.
Jobs arrive according to a Poisson process with a fixed rate \(\lambda\), which means the interarrival times are exponentially distributed with mean \(1/\lambda\).
Service times are also exponentially distributed, with a fixed rate \(\mu\) (so the mean service time is \(1/\mu\)).
Schematically:

Simulation runs
In what follows, we hold \(n = 4000\) and \(\lambda = 16\) fixed across all three runs, varying only \(\mu\). This isolates utilization as the independent variable. The three runs correspond to utilization \(\rho \approx 0.5\), \(\rho \approx 0.9\), and \(\rho > 1.0\) — light, moderate, and overloaded.
Before we get into the specific runs, let's look at the arrivals for the first run. Since \(n\) and \(\lambda\) are shared across runs, the arrival streams for the other two runs are statistically identical.

The interarrival times are the times between successive jobs. The parameter 𝜆=16 means that on average, 16 jobs arrive per unit time. (The unit of time is totally arbitrary.) This means that the average interarrival time is 1/𝜆 = 0.0625. You can see the exponentially distributed interarrival times in the first histogram above, and the more or less constant inflow of jobs over the course of the simulation. The arrivals happen in roughly 250 time units since 4000/16 = 250. Keep in mind that the generated numbers vary a bit from the model parameters since this is, after all, a simulation.
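As a sketch of how such an arrival stream can be generated (assuming numpy-based sampling; the names here are illustrative, not the notebook's), we draw exponential interarrival gaps and accumulate them:

```python
import numpy as np

rng = np.random.default_rng(42)
n, lam = 4000, 16
# numpy parameterizes exponential() by the mean, so pass 1/lambda.
interarrivals = rng.exponential(1.0 / lam, size=n)  # mean 1/16 = 0.0625
arrivals = np.cumsum(interarrivals)                 # absolute arrival times

print(f"mean interarrival time: {interarrivals.mean():.4f}")  # close to 0.0625
print(f"last arrival time: {arrivals[-1]:.1f}")               # close to 250
```

The printed values wobble around the theoretical 0.0625 and 250, which is the sampling noise mentioned above.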
Run #1: Lightly loaded system (n=4000, 𝜆=16, 𝜇=32)

The chart above says "partial view" because there isn't enough space to plot all 4,000 jobs. I grabbed 40 jobs (job IDs 2000 through 2039) and displayed each one's wait time (gray) and service time (green). The system is lightly loaded, so there's not much contention for the server. (We can see a bit of contention around t=128.0 and a little more around t=128.7 or so.) There are plenty of gaps in the timeline where the system doesn't contain any jobs at all.
We can use a plot to see job behavior more holistically:

For the most part, the number of jobs is flat—it’s just a hair over 1.0, though it’s hard to see that in the plot. Occasionally there are spikes. These are due to the random variance in the arrival and service times.
Response time is wait time plus service time:

In the histograms above, the single most common wait time is 0, meaning the system was empty when the job arrived. Nonzero waits occur too, but they stay small when they do. In fact, for an M/M/1 queue in steady state, the response time is itself exponentially distributed with rate \(\mu - \lambda\), so the response-time histogram has the same shape as the service-time histogram, just with a longer mean (here \(1/16 = 0.0625\) versus \(1/32 = 0.03125\)).
Simulation statistics
---------------------
total_duration = 254.6792
arrival_rate_mean = 15.7136
interarrival_time_mean = 0.0636
response_time
mean = 0.0646
var = 0.0048
p50 = 0.0428
p75 = 0.0878
p90 = 0.1469
p95 = 0.1979
p99 = 0.3307
wait_time_mean = 0.0331
service_rate_mean = 31.7642
service_time_mean = 0.0315
num_jobs_in_system_mean = 1.0150
num_jobs_in_queue_mean = 0.5206
throughput_mean = 15.6863
utilization = 0.4945
Little's Law: E[N] = lambda * E[T]
----------------------------------
num_jobs_in_system_mean = 1.0150
arrival_rate_mean * response_time_mean = 1.0155 (= 15.7136 * 0.0646)
Utilization Law, version 1: rho_i = lambda_i / mu_i
---------------------------------------------------
utilization = 0.4945
arrival_rate_mean = 15.7136
service_rate_mean = 31.7642
arrival_rate_mean / service_rate_mean = 0.4947 (= 15.7136 / 31.7642)
Utilization Law, version 2: rho_i = X_i * E[S]
----------------------------------------------
utilization = 0.4945
throughput_mean = 15.6863
service_time_mean = 0.0315
throughput_mean * service_time_mean = 0.4938 (= 15.6863 * 0.0315)

These are essentially performance metrics for the system. We care about measuring things like the mean, variance, and tail behavior of the response time; the mean number of jobs in the overall system; the job throughput; and the server utilization.
Note that num_jobs_in_queue refers specifically to the waiting area in the queue. (Sometimes "queue" means the waiting area plus the server, and sometimes it just means the waiting area. I wasn't entirely consistent in my terminology; hence this clarification.)
After the simulation statistics, we have a couple of so-called operational laws: Little's Law and two flavors of the Utilization Law. These are key results in queueing theory (there are others as well) that constrain the relationships between these quantities whenever the system reaches steady state, which requires \(\lambda < \mu\). Since we're dealing with a simulation, we use sample means to estimate the expected values. We can see that our simulation's behavior is consistent with both Little's Law and the Utilization Law.
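A check like this is easy to reproduce from scratch. The sketch below (illustrative numpy code, not the notebook's) re-simulates run #1's parameters via the Lindley recursion and confirms that the measured busy fraction of the server lands near \(\lambda/\mu = 0.5\), as the Utilization Law predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, mu = 4000, 16, 32
arrivals = np.cumsum(rng.exponential(1.0 / lam, n))  # scale = mean = 1/rate
services = rng.exponential(1.0 / mu, n)

# Lindley recursion: each job waits for the previous job's departure, if at all.
waits = np.zeros(n)
for i in range(1, n):
    prev_departure = arrivals[i - 1] + waits[i - 1] + services[i - 1]
    waits[i] = max(0.0, prev_departure - arrivals[i])

departures = arrivals + waits + services
duration = departures[-1]

# Utilization = fraction of the run during which the server is busy.
utilization = services.sum() / duration
print(f"measured utilization: {utilization:.4f}  (lambda/mu = {lam / mu:.4f})")
```

The two numbers won't match exactly in any finite run, but they converge as \(n\) grows.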
Notice from the stats that even at \(\rho \approx 0.5\), mean wait time (0.0331) is already about equal to mean service time (0.0315): queueing roughly doubles the mean response time relative to an uncontended server. Both numbers are small in absolute terms, though, so the system still feels fast.
Run #2: Moderately loaded system (n=4000, 𝜆=16, 𝜇=18)
There are two ways to increase load: either increase the arrival rate or decrease the service rate. Either way leads to decreased performance. Here I opted to leave the arrival rate the same and simply decrease the service rate from 𝜇=32 to 𝜇=18.
Here’s the result:

If you compare the plot above to the corresponding plot in the first run, you can see that the job density is higher here, which means that there is more contention for the server. We can see the same thing below:

The job counts go quite a bit higher here than they did in run #1. Sometimes they get quite backed up, either because lots of jobs arrived in a short span of time or else because some of the jobs just took a lot longer to service. Or both.
Now let’s see how this translates to response times:

Wait time 0 is still the most common value, but many more jobs now have nonzero waits. The service times are longer since we decreased 𝜇, but they're still cleanly exponential. Adding the two together, though, the response-time distribution starts to deform: the curve isn't as smooth, and the tail is a lot longer.
Caution | The following statistics include the warm-up phase, during which the simulation starts from an empty queue and hasn’t yet reached steady-state. At \(\rho \approx 0.9\) this introduces a modest optimistic bias — true steady-state statistics would show somewhat higher queue depths and longer response times. A more rigorous treatment would discard the warm-up period before computing statistics. |
Here are the stats for run #2:
Simulation statistics
---------------------
total_duration = 247.8173
arrival_rate_mean = 16.1498
interarrival_time_mean = 0.0619
response_time
mean = 0.4744
var = 0.1787
p50 = 0.3437
p75 = 0.6958
p90 = 1.0541
p95 = 1.3654
p99 = 1.8403
wait_time_mean = 0.4187
service_rate_mean = 17.9385
service_time_mean = 0.0557
num_jobs_in_system_mean = 7.6579
num_jobs_in_queue_mean = 6.7581
throughput_mean = 16.1290
utilization = 0.8998
Little's Law: E[N] = lambda * E[T]
----------------------------------
num_jobs_in_system_mean = 7.6579
arrival_rate_mean * response_time_mean = 7.6621 (= 16.1498 * 0.4744)
Utilization Law, version 1: rho_i = lambda_i / mu_i
---------------------------------------------------
utilization = 0.8998
arrival_rate_mean = 16.1498
service_rate_mean = 17.9385
arrival_rate_mean / service_rate_mean = 0.9003 (= 16.1498 / 17.9385)
Utilization Law, version 2: rho_i = X_i * E[S]
----------------------------------------------
utilization = 0.8998
throughput_mean = 16.1290
service_time_mean = 0.0557
throughput_mean * service_time_mean = 0.8991 (= 16.1290 * 0.0557)

Notice that wait_time_mean is much larger than before, which in turn increases num_jobs_in_system_mean and num_jobs_in_queue_mean. Also notice that utilization has gone up significantly. That's because there's much less dead time between jobs than there was in run #1.
At \(\rho \approx 0.9\), mean response time is ~7x higher than in run #1, even though \(\mu\) dropped by less than 2x (from 32 to 18). That's the nonlinearity.
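The closed-form M/M/1 result makes the same point: in steady state, mean response time is \(E[T] = 1/(\mu - \lambda)\). Plugging in the two runs' parameters:

```python
# Closed-form M/M/1 mean response time: E[T] = 1 / (mu - lambda).
lam = 16
for mu in (32, 18):
    rho = lam / mu
    mean_response = 1.0 / (mu - lam)
    print(f"mu={mu:2d}  rho={rho:.2f}  E[T]={mean_response:.4f}")

# mu=32 gives E[T] = 1/16 = 0.0625; mu=18 gives E[T] = 1/2 = 0.5.
# That's an 8x jump in mean response time from a ~1.8x drop in service
# rate, consistent with the simulated means of 0.0646 and 0.4744.
```

The simulation slightly undershoots the analytic 8x ratio, which is expected for a finite run that starts from an empty system.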
Run #3: Overloaded system (n=4000, 𝜆=16, 𝜇=15)
In an overloaded system we have 𝜆 > 𝜇. In this run, the difference in rates isn’t that large, but the difference in system dynamics is huge:

In the chart above, we can see that by the time we get to job ID 2000, the jobs are totally backed up. The jobs are arriving faster than the server can service them. Here’s how it looks when we plot the number of jobs over time:

In this case there are only 4,000 jobs total, so when arrivals stop around \(t = 4000/16 = 250\), the job count eventually drops off as the server catches up.
But now look at how the backup negatively impacts the wait time and hence the response time:

Finally, here are the stats for run #3:
Simulation statistics
---------------------
total_duration = 269.3577
arrival_rate_mean = 16.0300
interarrival_time_mean = 0.0624
response_time
mean = 11.9850
var = 34.1118
p50 = 12.8022
p75 = 15.7642
p90 = 20.2362
p95 = 21.0827
p99 = 22.6113
wait_time_mean = 11.9178
service_rate_mean = 14.8768
service_time_mean = 0.0672
num_jobs_in_system_mean = 177.9786
num_jobs_in_queue_mean = 176.9804
throughput_mean = 14.8148
utilization = 0.9982
Little's Law: E[N] = lambda * E[T]
----------------------------------
num_jobs_in_system_mean = 177.9786
arrival_rate_mean * response_time_mean = 192.1188 (= 16.0300 * 11.9850)
Utilization Law, version 1: rho_i = lambda_i / mu_i
---------------------------------------------------
utilization = 0.9982
arrival_rate_mean = 16.0300
service_rate_mean = 14.8768
arrival_rate_mean / service_rate_mean = 1.0775 (= 16.0300 / 14.8768)
Utilization Law, version 2: rho_i = X_i * E[S]
----------------------------------------------
utilization = 0.9982
throughput_mean = 14.8148
service_time_mean = 0.0672
throughput_mean * service_time_mean = 0.9958 (= 14.8148 * 0.0672)

This time, wait_time_mean and response_time_mean are extreme. Unsurprisingly, server utilization is pegged at nearly 100%.
The operational laws break down here since the system is overloaded. Little’s Law assumes steady-state — a condition an overloaded system never reaches — so the breakdown here is expected, not a simulation artifact.
Above \(\rho = 1.0\), the queue would grow without bound for an infinite-job M/M/1 with \(\lambda > \mu\), but we have finite jobs (\(n = 4000\)), so the queue peaks and then drains. Response time is no longer a property of the server — it’s a property of how long the system has been overloaded.
What this means in production
The three runs tell a clear story. Reducing the service rate by less than 2x, from \(\mu = 32\) to \(\mu = 18\), produced a 7x increase in mean response time and a proportionally worse tail. Moving from \(\rho \approx 0.5\) to \(\rho \approx 0.9\) didn't just degrade performance; it changed the character of the latency distribution. That's the utilization cliff in practice.
M/M/1 is an idealization. Real systems have non-exponential service times, multiple servers, and correlated arrivals. But the qualitative behavior holds: latency is a convex function of utilization, and the tail degrades faster than the mean. Engineers who set SLOs based on mean response time at current load, without headroom for utilization growth, are implicitly assuming a linearity that doesn’t exist.
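The convexity is easy to see numerically. Here's the closed-form mean response time \(E[T] = 1/(\mu - \lambda) = 1/(\mu(1 - \rho))\) swept across utilization levels, holding \(\mu = 32\) fixed:

```python
# E[T] = 1/(mu - lambda): convex in rho, exploding as rho approaches 1.
mu = 32
for rho in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    lam = rho * mu
    mean_response = 1.0 / (mu - lam)
    print(f"rho={rho:.2f}  E[T]={mean_response:.4f}")
```

Going from \(\rho = 0.5\) to \(\rho = 0.9\) multiplies mean response time by 5x; going from 0.9 to 0.99 multiplies it by another 10x, and the tail quantiles degrade faster still.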
The practical implication: treat utilization as a leading indicator, not a lagging one. By the time p99 is hurting, you’re already past the cliff. A system running at 70-80% utilization on a shared resource deserves scrutiny — not because it’s broken, but because the margin before nonlinear degradation is thin.