$ wwblog


A research lab for automating incident triage

For some time now, I’ve been wanting to set up a research lab for automating incident triage. Well, I finally did it. (I mean that I finally set up the lab, not that I finally automated incident triage.)

In this post, I’ll cover the following:

  • incident triage and why we might want to automate it,

  • why we need a research lab and what its capabilities should be, and

  • just a little bit about how I implemented it.

Let’s dive in.

Incident triage, and why we want to automate it

When an incident happens, we want to resolve it quickly. The typical workflow goes roughly like this:

New incident → Triage → Investigate → Mitigate → Resolve → Post-mortem

Each of the steps is important, and triage is no exception. Triage is where uncertainty becomes directed investigation.

Here are some things we care about during triage:

  • From a user or customer perspective, what are the scope and impact of the incident?

  • How severe is the incident, and how much priority should we attach to it?

  • Which part of the technical system appears to be the right starting point for the investigation, and who owns that part of the system?

It’s important to answer these questions quickly. We might get some of them wrong, but rapid forward motion is what really matters. Basically, we want to route the incident to the right team, attaching any information that might help accelerate the investigation and get us to a mitigation.

With that background in place, one reason for automating triage is immediately clear:

  • It should speed things up.

But there are two more reasons:

  • It should allow us to be more systematic in our approach, decreasing variance and improving quality.

  • It should reduce cognitive load for the on-call engineers.

OK, that’s the case for automating triage. Of course, it’s an open question whether we can actually do it. That’s where having a lab comes in handy.

The research lab: motivation and capabilities

I have an idea I want to explore. I don’t know whether it will work, which is precisely why I frame this as a research program. Without getting too deep in the weeds, I want to model incidents as trajectories through a state space induced by graph representations of the system. Concretely, this means that a failure in one service manifests as a localized perturbation that propagates along dependency edges and appears in traces as a structured sequence of spans. The intuition is that we can understand the unfolding incident at a high level by pattern matching its initial trajectory in an observable state space.

Note

In future posts, I’ll go into more of the modeling specifics. But for our current purpose, it’s enough to understand that I think that graph representations will be fruitful, and that’s what’s driving the initial lab design.

For readers interested in the mathematical direction, I plan to explore graph Laplacians and their eigenbases as a representation for incident dynamics.
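As a minimal sketch of what that direction looks like, here is a Laplacian eigenbasis computation for a small dependency graph. The four services, their edges, and the undirected treatment of the graph are all illustrative assumptions, not details of the lab.

```python
import numpy as np

# Invented four-service topology; we treat the dependency graph as
# undirected for the purposes of the Laplacian.
services = ["frontend", "cart", "payment", "db"]
edges = [("frontend", "cart"), ("frontend", "payment"), ("payment", "db")]

idx = {s: i for i, s in enumerate(services)}
n = len(services)
A = np.zeros((n, n))
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0

D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # combinatorial graph Laplacian

# The eigenvectors form an orthonormal basis ("graph Fourier modes")
# in which a per-service health signal could be expressed.
eigvals, eigvecs = np.linalg.eigh(L)

# A connected graph has exactly one (near-)zero Laplacian eigenvalue.
print(np.round(eigvals, 3))
```

The eigenvalues encode connectivity structure; the hope, to be tested in later posts, is that projecting health signals onto this basis surfaces incident dynamics.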

So why do we need a lab?

Why we need a lab

One key reason for having a lab is to be able to run repeatable, controlled experiments in a context where the ground truth is known. For example, if I want to try out different approaches to modeling node or edge health, and evaluate those approaches based on how close they get me to ground truth, that’s much easier to do in a lab environment.

In a lab simulation, I can introduce controlled faults with known causes, observe their propagation, and evaluate diagnostic approaches against ground truth.
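To make the shape of that loop concrete, here is a hedged sketch with invented service names and a deliberately naive diagnostic (blame the service with the highest error rate). The point is the experimental structure: the injected fault gives us ground truth to score against.

```python
import random

def inject_fault(baseline, target, severity=0.5):
    """Return a copy of per-service error rates with a fault injected
    into `target`, which we record as the ground-truth root cause."""
    faulty = dict(baseline)
    faulty[target] = min(1.0, faulty[target] + severity)
    return faulty

def diagnose(error_rates):
    """Deliberately naive diagnostic: blame the highest error rate."""
    return max(error_rates, key=error_rates.get)

# Baseline error rates for an invented toy topology.
baseline = {"frontend": 0.01, "cart": 0.02, "payment": 0.01, "db": 0.01}

# The experiment loop: inject a known fault, diagnose, score.
random.seed(0)
trials = 100
correct = 0
for _ in range(trials):
    target = random.choice(list(baseline))   # known ground truth
    observed = inject_fault(baseline, target)
    correct += diagnose(observed) == target

# The naive diagnostic trivially succeeds on this toy setup; the point
# is the loop structure, not the result.
print(f"accuracy: {correct / trials:.2f}")
```

In the real lab, `inject_fault` corresponds to toggling a fault via configuration and `diagnose` is whatever modeling approach is under evaluation; the scoring step is the same.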

A second important reason for a lab is that production environments are epistemically hostile. What I mean is that production environments make it hard to know what’s going on:

  • Production environments are noisy.

  • The infrastructure and components are heterogeneous.

  • We rarely have access to ground truth.

  • Failures are costly and often not repeatable.

  • Causality is ambiguous.

  • Interventions perturb the system in ways that are hard to predict and interpret (e.g., restarting a service may clear caches, rebalance queues, or trigger autoscaling).

  • Observability signals are incomplete and often lossy. For example, sampling, missing spans, and aggregation can obscure a failure’s true propagation path.

There is, of course, a tradeoff in relying upon simulations: we lose fidelity. That’s OK though. We use the lab as an instrument for exploring incident dynamics under controlled conditions, and then try it out in production after it looks like it’s working. The lab is therefore not intended to replicate production perfectly, but to serve as an epistemic instrument for studying incident dynamics.

Now let’s consider the capabilities we’d like to realize.

Lab capabilities

Here are some general desiderata for the research lab:

  • a system simulation with a diverse set of components and behaviors

  • controlled incident generation

  • synthetic, controllable load

  • isolatable, repeatable experiments

  • high-fidelity measurements

  • ground truth on interventions and their effects (e.g., health)

And here are some additional needs, based upon my specific, graph-based approach to modeling:

  • a visible, near-real-time topology, available for querying

  • support for dynamic topology (i.e., nodes and edges blinking in and out of existence, as in the cloud: autoscaling events, rolling deployments, and ephemeral job workers)

  • topology inferred from trace data, rather than maintained manually
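To illustrate the last point, here is a simplified sketch of trace-based topology inference: join each child span to its parent and emit a caller-to-callee edge. The span tuples are invented and far simpler than real OTel spans, which carry much more structure.

```python
from collections import defaultdict

# Minimal span records: (trace_id, span_id, parent_span_id, service).
spans = [
    ("t1", "a", None, "frontend"),
    ("t1", "b", "a", "cart"),
    ("t1", "c", "a", "payment"),
    ("t1", "d", "c", "db"),
]

def infer_edges(spans):
    """Derive caller -> callee edges by joining child spans to parents."""
    by_id = {(tid, sid): svc for tid, sid, _, svc in spans}
    edges = defaultdict(int)
    for tid, sid, parent, svc in spans:
        if parent is not None:
            caller = by_id.get((tid, parent))
            if caller and caller != svc:
                edges[(caller, svc)] += 1  # count calls per edge
    return dict(edges)

print(infer_edges(spans))
# {('frontend', 'cart'): 1, ('frontend', 'payment'): 1, ('payment', 'db'): 1}
```

Running this over a stream of spans, with counts decayed or windowed over time, is one plausible way to get a dynamic topology without any manual registration.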

Together, these capabilities allow the lab to function as a controllable generator of incident dynamics rather than merely a demonstration environment.

The good news is that I was able to put something together that meets all these needs. Let’s take a look.

High-level implementation notes

In the intro to this post, I noted that I "finally" set up a lab. What stood in the way was the sheer amount of work involved in building all of that.

Fortunately, it turns out that I didn’t have to: the OpenTelemetry team put together an OpenTelemetry Demo that serves my purposes wonderfully. Specifically, it provides the following capabilities:

  • incident generation, via flagd feature flags (e.g., toggling a partial payment failure or a shopping cart failure)

  • synthetic load, via the Locust load generator

  • repeatable, controllable experiments

  • measurements and ground truth

The demo includes Jaeger, and my initial assumption was that it would serve my needs: Jaeger provides a service dependency graph visualization, and it computes the topology dynamically from traces.

But despite those features, the Jaeger graph turned out to be too limited. First, it’s a snapshot, and I want to be able to watch the system evolve. Second, Jaeger computes the graph client-side without exposing a queryable representation.

To fill that gap, I implemented my own service dependency graph, which I call DepViz. At this point it’s more a simple proof of concept than a robust service dependency graph, but it provides the key capability I need: a queryable, near-real-time topology representation.

Figure 1 shows the current DepViz visualization.

DepViz service dependency graph
Figure 1. DepViz service dependency graph

You can peruse the repo yourself, but here are some key aspects of the implementation:

  • It runs alongside the OpenTelemetry demo in Kubernetes.

  • The OTel collector exports trace data to the DepViz receiver, which builds the service dependency graph and makes it available for querying.

  • It updates in near-real time so we can observe how toggling feature flags results in health changes.
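To give a flavor of what near-real-time health tracking can look like, here is a hypothetical sketch (not DepViz’s actual code): keep a sliding window of recent calls per dependency edge and derive an error rate from it, so that toggling a fault flag shows up as a rising error rate within the window.

```python
import time
from collections import deque

class EdgeHealth:
    """Error rate over a sliding time window for one dependency edge."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.calls = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, is_error, now=None):
        """Record one observed call on this edge."""
        now = time.monotonic() if now is None else now
        self.calls.append((now, is_error))
        self._evict(now)

    def error_rate(self, now=None):
        """Fraction of calls within the window that were errors."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        if not self.calls:
            return 0.0
        return sum(err for _, err in self.calls) / len(self.calls)

    def _evict(self, now):
        # Drop calls that have aged out of the window.
        while self.calls and self.calls[0][0] < now - self.window:
            self.calls.popleft()

# Explicit timestamps keep the example deterministic.
edge = EdgeHealth(window_seconds=60.0)
edge.record(False, now=0.0)
edge.record(True, now=10.0)
print(edge.error_rate(now=10.0))  # 0.5: one error out of two recent calls
```

A structure like this, maintained per edge as spans arrive, is one way to turn a trace stream into the health-over-time view that the feature-flag experiments call for.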

Figure 2 shows the high-level lab architecture, with implementation choices.

Experimental loop architecture of the research lab
Figure 2. Experimental loop architecture of the research lab

Looking ahead

The lab I’ve described here is intentionally modest in scope. Its purpose is not perfect production fidelity, but rather to provide a setting for running controlled experiments on incident dynamics. We need to be able to generate incidents, observe the impacts, and reason about the experiment in light of known ground truth.

In future posts, I plan to explore how different representations of system structure and behavior influence our ability to detect, classify, and route incidents. In particular, I am interested in modeling incidents as trajectories through graph-induced state spaces and evaluating whether these trajectories exhibit patterns that can inform automated triage.

For now, the important result is simply that the experimental substrate exists. We have a system that can generate incidents, expose their propagation through observable topology, and provide a reference point for evaluating diagnostic approaches against ground truth. That is enough to begin asking more interesting questions. For example, can early trace trajectories distinguish between localized failures (e.g., a single payment service outage) and systemic degradation (e.g., cascading retries across services) before human triage begins?