$ wwblog


Latency Regression Detection is a Change Point Detection Problem

For latency optimization efforts — where every 5–10 ms improvement matters — it’s important not to give those gains back to regressions. Detecting such regressions, however, is harder than it seems. Latency distributions are typically long-tailed and volatile, making it difficult to notice when something like an additional SQL round-trip quietly adds 3 ms to request latency.

A natural approach is to treat latency regression detection as an anomaly detection problem. In the simplest form, we define a threshold and look for excursions in the latency time series. If the series has seasonal components, we can deseason it first and apply thresholds to the residuals. Either way, the framing is the same: regressions are treated as unusually large observations.

Unfortunately, this framing is wrong. Latency regressions usually arrive as step changes, not spikes.

Latency series are naturally volatile, so thresholds tuned tightly enough to catch small regressions generate many false positives (poor precision). But thresholds loose enough to avoid false positives miss the small regressions that accumulate over time — the classic "death by a thousand cuts" problem (poor recall).

Change point detection provides a better framing. Instead of watching for rare events, we look for distributional changes. In practice this means identifying spans where the latency level shifts upward or downward, regardless of whether the shift is large or small.

Some shifts will correspond to real regressions, while others may reflect traffic changes or other external effects. That’s OK — we’re looking for good precision and recall, not perfect attribution.

Note

Variance shifts can also matter. In many systems variance correlates more strongly with perceived performance than mean latency. There are variance-based kernels for change point detection, but level shifts are the dominant regression signature in practice. For that reason, we focus here on level shifts.

A note on scope: this post focuses specifically on latency regression detection. For other signal types — error rates, for example — anomaly detection can be highly effective. A sudden error rate spike is genuinely a rare event, and threshold-based detection is often the right tool. The argument here is about framing fit, not about anomaly detection being universally wrong.

Experimentation via simulation

To explore the behavior of these approaches we use simulations involving synthetic latency series.

Latency time series

We generate the latency series from a mixture model. The mixture structure produces the heavy tail and spike behavior that characterizes latency time series and makes anomaly detection on such series problematic.

The mixture model has three log-normal components:

| Component | Explanation |
| --- | --- |
| Fast path | Dominant happy path. 80% of requests. Tight variance — roughly ±5 ms around a 60 ms baseline. |
| Normal path | Slightly slower requests. Cache misses, heavier queries, minor contention. 15% of requests. |
| Slow path | Tail events — GC pauses, cold connections, lock contention. 5% of requests. Widest variance. |

σ is specified in log-space, so the spread scales proportionally with the level rather than being a fixed ms value. All medians shift proportionally when an intervention changes BASE + offset.
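As a concrete illustration, here is a minimal sketch of such a generator. The 80/15/5 weights and the ~60 ms fast-path baseline follow the description above; the normal/slow median multipliers and the log-space σ values are illustrative assumptions, not the post's exact simulator code.

```python
import math
import random

BASE = 60.0  # ms; fast-path baseline median

# (weight, median multiplier vs. baseline, sigma in log-space)
# Weights and the fast-path values follow the prose; the normal/slow
# multipliers and sigmas are illustrative assumptions.
COMPONENTS = [
    (0.80, 1.0, 0.08),  # fast path: roughly ±5 ms around the baseline
    (0.15, 1.8, 0.25),  # normal path: cache misses, heavier queries
    (0.05, 4.0, 0.60),  # slow path: GC pauses, cold connections; widest
]

def sample_latency(rng, offset_ms=0.0):
    """One latency sample in ms; offset_ms models an intervention.
    Sigma lives in log-space, so all medians shift proportionally
    when the level changes."""
    level = BASE + offset_ms
    u, acc = rng.random(), 0.0
    for weight, mult, sigma in COMPONENTS:
        acc += weight
        if u < acc:
            return rng.lognormvariate(math.log(level * mult), sigma)
    # numerical fallback: use the last component
    _, mult, sigma = COMPONENTS[-1]
    return rng.lognormvariate(math.log(level * mult), sigma)

rng = random.Random(42)
series = [sample_latency(rng) for _ in range(2000)]
```

Because the mixture is dominated by the tight fast path, the series median sits near the baseline while the upper tail stretches well beyond it — exactly the heavy-tail shape that trips up fixed thresholds.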

The interventions

Each simulation offers various interventions to perturb the time series in interesting ways:

| Intervention | Behavior |
| --- | --- |
| Transient Spike | Adds +120 ms to a single sample. Tests whether the detector fires on an isolated outlier. |
| Level Shift | Adds +40 ms to the baseline while held, then snaps back. A persistent but reversible regression. |
| Micro Shift | Adds +12 ms to the baseline while held. A subtle regression near the threshold of detectability. |
| Gradual Ramp | Adds +3 ms per tick while held, capped at +300 ms, then snaps to zero on release. Models a slow resource leak or gradual degradation. |
| Rebase | Permanently shifts the baseline to a user-selected target level until reset. Models a deployment that genuinely changes process performance — up or down. |
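Each intervention reduces to a simple offset schedule added to the baseline. A hypothetical sketch of a few of them (tick indices and hold windows are illustrative; the micro shift follows the same pattern as the level shift with +12 ms):

```python
# Offset (in ms) contributed by each intervention at tick t.
# These are illustrative schedules matching the table, not the
# post's simulator code.

def transient_spike(t, spike_tick):
    """+120 ms on exactly one sample."""
    return 120.0 if t == spike_tick else 0.0

def level_shift(t, start, end):
    """+40 ms while held, snapping back to zero on release."""
    return 40.0 if start <= t < end else 0.0

def gradual_ramp(t, start, end):
    """+3 ms per tick while held, capped at +300 ms, zero on release."""
    if start <= t < end:
        return min(3.0 * (t - start), 300.0)
    return 0.0
```

Summing the active schedules at each tick yields the offset fed into the baseline (the BASE + offset term above), so interventions compose naturally.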

Let’s start with the anomaly detection simulation.

Anomaly detection (constant threshold)

Experiment with the following constant threshold-based anomaly detector.

Algo parameters

There are two parameters: a constant threshold in milliseconds, and the number k of consecutive excursions required before the detector declares an anomaly.

Note

There are other approaches, such as m of the previous n, but we keep things simple here.
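The detector itself is only a few lines. Here is a minimal sketch, assuming a fixed millisecond threshold and an alarm after k consecutive excursions (the reset-after-firing behavior is an illustrative choice):

```python
def threshold_detector(series, threshold_ms, k=1):
    """Return the indices at which the detector declares an anomaly:
    an alarm fires after k consecutive samples exceed threshold_ms."""
    alarms, run = [], 0
    for i, x in enumerate(series):
        run = run + 1 if x > threshold_ms else 0  # count consecutive excursions
        if run >= k:
            alarms.append(i)
            run = 0  # reset after firing to avoid continuous alarms
    return alarms
```

With k = 1 a single tail-event sample fires the alarm, which is precisely why heavy-tailed latency series make this detector so noisy.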

Experiment with the simulation above. Keep in mind that our goal is regression detection. Step changes — both large and small — are therefore interesting, while transient spikes are not. Gradual ramps matter too: sometimes they are traffic-driven, but sometimes they involve a progressive rollout.

Algo behavior

Here is the typical behavior when k = 1. You can experiment with other values of k.

| Intervention | Behavior |
| --- | --- |
| Transient Spike | 🔴 Frequent false positives. (Spikes are not regressions.) |
| Level Shift | 🟡 Algo fires intermittently. Regression boundaries not clearly identified. |
| Micro Shift | 🔴 Often undetected, even when the shift is visually apparent. |
| Gradual Ramp | 🟡 Detection delayed until the ramp crosses the threshold. |
| Rebase | 🟡 The threshold usually has to be reset, either manually or automatically. |

Overall, anomaly detection based on constant thresholds is a poor match for latency regression detection. It is too sensitive to transient tail events, yet not sensitive enough to small persistent shifts. So now let’s turn to change point detection.

Note

There are, of course, other anomaly detector types, such as detectors with adaptive or dynamic thresholds. These can behave better in certain cases (for example, by accommodating "new normals" that occur when rebasing). But fundamentally, the problem with anomaly detection in a latency detection context is that it flags rare data points, when what we care about is persistent changes.

Change point detection

The following simulation allows you to experiment with change point detection. (Specifically, this uses the CUSUM online variant.)

Algo parameters

Here are the algorithm parameters:

| Parameter | Description | Tuning |
| --- | --- | --- |
| Allowance (k), default = 0.12 | Sensitivity to deviations from the segment mean. Each sample contributes (log(x) - μ ± k) to the CUSUM statistic. Higher k requires larger deviations to accumulate score. | Increase to reduce false positives. Decrease to detect smaller shifts. |
| Threshold (h), default = 0.6 | CUSUM score level at which an alarm is raised. Detection fires when the score exceeds h for confirm consecutive ticks. | Lower to detect smaller regressions at the cost of more false positives. Raise for larger expected shifts. |
| Confirm ticks, default = 5 | Consecutive ticks the score must remain above threshold before a change point is declared. Guards against transient spikes. | Increase to suppress spike-driven false positives. Decrease for faster response. |
| Min segment, default = 15 | Minimum samples a segment must contain before a change point can be declared. Prevents degenerate single-sample segments. | Increase if spurious short segments appear between closely spaced change points. |
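The parameters above map directly onto a small online detector. Here is a hedged sketch of a two-sided CUSUM over log-latency with an incrementally updated segment mean; the bookkeeping details (reset-on-detection, the running-mean estimate of μ) are illustrative choices, not necessarily the simulation's exact implementation:

```python
import math

def cusum_detect(series, k=0.12, h=0.6, confirm=5, min_segment=15):
    """Return the indices at which a change point is declared."""
    change_points = []
    s_pos = s_neg = 0.0   # one-sided CUSUM scores (upward / downward shifts)
    mu, count = 0.0, 0    # running mean of log-latency in the current segment
    over = 0              # consecutive ticks with score above h
    for i, x in enumerate(series):
        lx = math.log(x)
        count += 1
        mu += (lx - mu) / count                    # incremental segment mean
        s_pos = max(0.0, s_pos + (lx - mu - k))    # accumulate deviations
        s_neg = max(0.0, s_neg + (mu - lx - k))    # beyond the allowance k
        over = over + 1 if max(s_pos, s_neg) > h else 0
        if over >= confirm and count >= min_segment:
            change_points.append(i)
            # declare a change point and start a fresh segment
            s_pos = s_neg = 0.0
            mu, count, over = 0.0, 0, 0
    return change_points
```

Note the contrast with the constant-threshold detector: a single spike inflates the score only briefly (the allowance k then bleeds it back off), while a persistent level shift keeps accumulating score until the confirm window is satisfied.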

Algo behavior

| Intervention | Behavior |
| --- | --- |
| Transient Spike | 🟢 Algo ignores transient spikes. |
| Level Shift | 🟢 Level shifts detected. |
| Micro Shift | 🟢 Micro shifts detected with appropriate param settings. |
| Gradual Ramp | 🟢 Early detection. |
| Rebase | 🟢 Algo automatically adapts. |

The table above highlights the key ways in which change point detection is a good match for latency regression detection. It is not, however, a silver bullet. Online algo variants have to make point-in-time detection decisions, which means they can’t take advantage of the global view available to an offline algo. And it can be tricky to get the parameters right, regardless of variant. Seasonal effects introduce further complications.

We’ve now had the chance to experiment with two approaches to regression detection: one based on constant threshold anomalies, and another based on change points. The interactive simulations allowed us to conduct highly specific experiments. Let’s now zoom back out to understand what it all means.

Takeaways

The key takeaway is that latency regression detection is a change point detection problem, not an anomaly detection problem. To be clear, we can improve regression detection with supplemental techniques, such as signal cross-correlation and event correlation. Even anomaly detection is useful as a supplemental signal: it can provide early visibility of large pops/drops for high-value signals, where we might be prepared to tolerate a higher false positive rate. But fundamentally, latency regression detection involves distributional shifts rather than rare events.

It’s hard to overstate the operational value of getting the frame right. The right frame determines what you build, how you measure it, and whether engineers trust it. Anomaly detection is likely to lead to a large number of false positives, which trains engineers to ignore the alerts. Improving precision/recall allows you to establish a workable process for handling regression alerts, which will drive down the mean time to recovery (MTTR). Also, adopting change point detection requires less manual intervention than anomaly detection typically does.

Change point detection isn’t magic. It doesn’t always get the boundaries right. And dialing in precision/recall might require you to accept a higher time to detect (TTD), since a longer observation allows you to better distinguish persistent changes from transient events. Fortunately, in a regression detection context, it’s often the case that you have more time than you would have in a live site context, since builds make their way through test and staging environments before landing in production.

In a future post, I’ll show how you can use change points as events that you can correlate with other events, such as other change points, deployments, kill switches, and so forth. This allows you to build a causal story around your time series data. Until then, enjoy experimenting with the simulations.