$ wwblog


Incident Geometry: Why Service Topology Constrains Failure States

In A Research Lab for Automating Incident Triage, I introduced the idea that service topology constrains incident behavior. The idea is intuitive: faults propagate across edges, so topology shapes the space of possible failures. The implications, however, are less obvious.

This post develops a propagation-aware lens for representing and reasoning about incident dynamics. We’ll cover the concepts and mathematics required to make these dynamics intuitive and operationally useful.

Motivation: too much data, not enough meaning

All operators are familiar with the problem of too much data and not enough meaning. There are too many dashboards, dropdowns, charts, and tables, but a lot of it feels like noise. You can spend minutes or even hours trying to find actionable insights.

The standard response to this missing meaning is more dashboards and more data. While some of this is vendor-driven, it’s natural to think that adding more data will produce the missing meaning. Unfortunately, the additional data often exacerbates the issue.

An underappreciated source of the "too much data, not enough meaning" problem is redundancy. The redundancy is informational: the charts might be different, but the information is the same. For example, suppose that every time a certain scheduled job runs, the database slows down, and then web latency increases. From a data perspective, the corresponding charts highlight different time series behaviors. But they all mean the same thing: that scheduled job and its missing database index are slowing things down again. These correlated signals are in fact projections of a single propagation event.

For the operator, it would accelerate understanding and reduce cognitive load if the observability system led with the underlying propagation behavior, and then presented the charts as supporting artifacts. Instead, most real-world systems present lots of charts and leave interpretation to the operator. The result is noise, confusion, and misdiagnosis, ultimately slowing incident response and increasing demands on attention.

Here’s the fundamental mismatch: real-world observability systems present node-level representations, but not the propagation behavior most immediately useful for reasoning about the incident. Mathematically, these systems expose node-space coordinates, while operators need propagation-space coordinates.

From nodes to propagation

The following is an incident simulation. On top is a node-space view of errors over a simple service dependency graph for a hypothetical e-commerce checkout service. On the bottom are propagation-space views of the same incident. Press Play to watch the incident unfold or drag the slider for finer control. Don’t worry if it doesn’t make complete sense yet — we’ll unpack the ideas shortly.

The simulation offers a new way to see the geometry underlying failure propagation. In the node-space pane, services with elevated error rates appear red, services with relatively few errors appear blue, and white falls somewhere in between.

The bottom panes express the same incident in terms of propagation patterns that reveal how failure spreads through the system. Each pane shows the time-varying contribution of a single propagation mode. Together they expose the structured dynamics that combine to produce node-level behavior.

Importantly, the service graph structure determines the propagation modes. The incident itself plays no role in defining the modes; it merely excites them. The service topology defines the geometry of the incident state space, and the incident itself traces a trajectory within that geometry.

Let’s take a closer look at the operator behind these propagation patterns: the graph Laplacian.

The graph Laplacian: intuitions and behavioral patterns

This section approaches the graph Laplacian from two angles:

  • how it measures "disagreement" between neighboring nodes, and

  • how this disagreement drives dynamic fault propagation behavior.

We’ll start with a simulator to build some intuitions about the operator’s behavior. After that, we’ll highlight the behavioral patterns most relevant for incident diagnostics.

Building intuitions

Try holding down various nodes with centered colors enabled. You can wait for the node activations to decay naturally or press Reset if you’re impatient. Be sure to observe the behavior of the various propagation modes as you experiment.

Behavioral patterns for incident diagnostics

In the simulation, holding down a node causes it and its immediate neighbors to turn red, while more distant nodes turn blue. Neighboring nodes tend to take on similar values, reflecting agreement. Intuitively, when nodes in a service graph break, their neighbors tend to break too.

Note

In real systems, directionality matters: dependencies tend to break their dependents, but not the reverse. For now we make the simplifying assumption that edges are undirected. We’ll revisit this point toward the end of the post.

The Laplacian’s fundamental action is to remove disagreement along edges. Densely connected regions tend to remove disagreement more quickly. This action drives each of the following behaviors.

Neighbor coupling. The Laplacian captures neighbor coupling. When you inject energy (a localized disturbance) into a given node, it spreads to neighboring nodes. Over time, this local coupling produces diffusion-like propagation whose extent reflects the underlying topology.

Essentially, the energized node pulls neighbors upward while the neighbors pull the node downward. Diffusion is mutual, which explains both the spread of a disturbance and its decay.

Cluster coherence. Dense regions behave as coherent units under diffusion: energizing any node within the cluster quickly equalizes activation across the region, causing the cluster to move as a single structural component. This supports service grouping and blast-radius reasoning.

Tip
Hold down node C or node N to see cluster coherence in action.

Bottlenecks. A bottleneck is a weakly connected bridge between two dense regions. Diffusion inside each region is fast (see cluster coherence above), but diffusion across the bridge is slow.

Tip
Hold down node H to observe bottleneck behavior and observe modes 2 and 3. Also try holding a node on either side of H to see mode 2 activate. Notice that for mode 2, H separates positive values from negative values: H is the bottleneck.

The core mechanic is that dense regions enforce agreement through averaging, whereas sparse bridges restrict it. Bottlenecks preserve disagreement by creating structural resistance to propagation. This is why partial outages happen: failure can saturate one side of a bridge while the other side stays comparatively healthy.

Smooth vs sharp disturbances (spatial gradients). When you energize a node, diffusion behavior depends on structural features of the graph. If you energize a node in a clique-like region, energy spreads across many redundant paths, causing neighboring nodes to take on similar values. This produces a spatially smooth, low-frequency disturbance. By contrast, energizing a sparsely connected node creates strong contrasts with its neighbors because few averaging paths exist. The resulting disturbance is spatially sharp and activates higher-frequency modes.

Note
Frequency here means spatial variation across neighbors, not temporal oscillation.
Tip
To see a high-frequency mode, scroll to mode 15 and then hold down node E.

Smoothness involves agreement across neighbors, while sharpness involves disagreement across neighbors.

The graph Laplacian: mathematics and spectral interpretation

The behavioral patterns in the previous section — neighbor coupling, cluster coherence, bottlenecks, smooth and sharp disturbances — all emerge from a single underlying mathematical operator: the graph Laplacian. To understand how, we need to examine what it measures, how it drives dynamics, and how its eigenbasis provides a natural coordinate system for incident states.

We’ll cover the mathematics in three parts: the static view (what the Laplacian measures), the dynamic view (how it drives diffusion), and the spectral view (how its eigenbasis decouples propagation into independent modes). With that machinery in place, we’ll revisit the behavioral patterns for a deeper interpretation.

The static view

The graph Laplacian is a linear operator acting on node-level signals. For a given vector of node values, it computes how much each node disagrees with its neighbors. Small outputs mean that the node’s value closely matches the values of its adjacent nodes, while large outputs indicate the opposite. Effectively, the Laplacian measures local curvature or roughness over the graph.

This static interpretation already explains much of what we observed in the simulator. Dense regions tend to suppress disagreement quickly because many edges enforce averaging. Sparse bridges, by contrast, allow disagreement to persist.

The formal definition makes this intuition concrete.

Definition. For an undirected graph with weighted edges, the (unnormalized) graph Laplacian is defined by

\[L = D - W\]

where \(D\) is the weighted degree matrix and \(W\) is the weight matrix. For unweighted edges, set the weights to 1.

In coordinates, this means

\[(Lx)_i = d_i x_i - \sum_{j \sim i} w_{ij} x_j\]

so each component aggregates the weighted disagreement between node \(i\) and its neighbors: \((Lx)_i = \sum_{j \sim i} w_{ij}(x_i - x_j)\).

Note
There is also a normalized variant for degree-heterogeneous graphs, but we won’t cover that here.
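To make the definition concrete, here is a minimal numpy sketch. The four-node chain and its edge weights are hypothetical:

```python
import numpy as np

# A minimal sketch of L = D - W on a hypothetical 4-service chain,
# with one edge weighted more heavily than the others.
n = 4
W = np.zeros((n, n))
for i, j, w in [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 1.0)]:
    W[i, j] = W[j, i] = w          # undirected: symmetric weights

D = np.diag(W.sum(axis=1))         # weighted degree matrix
L = D - W

# A constant signal has no disagreement along any edge, so L maps it to zero.
print(L @ np.ones(n))              # -> [0. 0. 0. 0.]
```

For a real service graph, \(W\) would come from the discovered dependency topology, and a sparse representation becomes important at scale.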

The dynamic view

The Laplacian also has a dynamic interpretation. If we evolve the system in the direction of decreasing disagreement, we get diffusion: the system flows toward states that reduce edge-wise contrast. We saw this behavior in the simulation.

As above, we can consolidate the intuition by considering the underlying mathematics. A simple diffusion dynamic takes the form

\[\frac{dx}{dt} = -Lx\]

or, in discrete time,

\[x_{t+1} = x_t - \alpha L x_t,\]

where \(\alpha\) is a global diffusion rate (or step size).

Because \(Lx\) measures disagreement along edges, the negative sign drives the system toward states that reduce edge-wise contrast. In other words, the dynamics smooth the signal over the graph.
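As a sketch, the discrete update can be iterated directly. The five-node path graph is hypothetical, and \(\alpha\) is chosen small enough for stability (roughly \(\alpha < 2/\lambda_{\max}\)):

```python
import numpy as np

# Path graph on 5 nodes (unweighted), as a hypothetical example.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

# Localized disturbance at one end, then repeated steps x <- x - alpha*L*x.
x = np.zeros(n)
x[0] = 1.0
alpha = 0.1                      # small step keeps the update stable
for _ in range(500):
    x = x - alpha * (L @ x)

# Total signal is conserved (the all-ones vector is in L's null space),
# and the state smooths toward the uniform average.
print(x.sum())                   # ~1.0, conserved
print(x.round(3))                # close to [0.2, 0.2, 0.2, 0.2, 0.2]
```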

Propagation modes and the eigenbasis

The Laplacian is a linear operator. That means there are special directions in state space that it does not mix — along those directions, it simply scales. These special directions are the Laplacian’s eigenvectors.

Each eigenvector represents a distinct pattern of agreement and disagreement over the graph. Those with lower eigenvalues vary slowly across the topology (smooth, low-frequency patterns). Those with higher eigenvalues flip sign rapidly across edges (sharp, high-frequency patterns). The graph topology induces these patterns. If you go back to the simulations above, you can see the eigenvector visualizations at the bottom.

When we express an incident state in this eigenbasis, the dynamics decouple. Instead of tracking a coupled system over nodes, we track independent activations of propagation modes. Under diffusion, each mode simply decays at a rate determined by its eigenvalue.

The eigenbasis provides a natural coordinate system for propagation dynamics: it is the coordinate system in which the operator becomes the simplest. The eigenbasis rotates node space into a basis aligned with the operator’s intrinsic propagation modes.

Note

By the spectral theorem, the Laplacian decomposes as

\[L = V \Lambda V^{\mathsf T}\]

where \(V\) contains the orthonormal eigenvectors and \(\Lambda\) the eigenvalues. In this basis, the operator becomes diagonal: \(V^{\mathsf T}\) rotates into propagation coordinates, \(\Lambda\) scales them independently, and \(V\) rotates back.

If we write \(z = V^{\mathsf T} x\), the diffusion dynamics reduce to

\[\frac{dz}{dt} = - \Lambda z,\]

so each mode decays independently at a rate determined by its eigenvalue.

Because each coordinate corresponds to a structurally meaningful propagation pattern, this basis also acts as a diagnostic filter: low-frequency modes capture coherent, system-level behavior, while high-frequency modes isolate localized disagreement and noise.
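Here is a sketch of the decoupling on a hypothetical six-node graph (two triangles joined by a bridge). The closed-form modal solution is cross-checked against direct Euler integration of the coupled dynamics:

```python
import numpy as np

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3.
n = 6
W = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

# Spectral theorem: L = V diag(lam) V^T, with eigenvalues sorted ascending.
lam, V = np.linalg.eigh(L)

# Rotate a localized disturbance into propagation coordinates z = V^T x.
x0 = np.zeros(n); x0[0] = 1.0
z0 = V.T @ x0

# Under dx/dt = -Lx each mode decays independently: z_k(t) = z_k(0) e^{-lam_k t}.
t = 0.5
x_t = V @ (z0 * np.exp(-lam * t))

# Cross-check against brute-force Euler integration of the coupled system.
x, dt = x0.copy(), 1e-4
for _ in range(int(t / dt)):
    x = x - dt * (L @ x)
print(np.abs(x - x_t).max())     # tiny: the two solutions agree
```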

Revisiting propagation patterns

Neighbor coupling as primary operator action. Neighbor coupling is the primitive behavior upon which everything else builds. Mechanically, \(L = D - W\) computes each node's weighted disagreement with its neighbors, so \(-L\) removes that disagreement, and removing disagreement is what couples neighbors.

Seen through the eigenbasis, this coupling has a natural spectral interpretation: neighbor coupling is what determines which modes get excited. A localized disturbance at a single node has high spatial contrast with its neighbors, which means it projects onto high-frequency eigenvectors. The \(-L\) action then decays those high-frequency components fastest, which is how localized energy spreads — the sharp initial disturbance smooths out as high-frequency modes collapse first.

Dense regions as low-frequency subspaces. We saw earlier that a cluster in the graph tends to act as a coherent unit. This is because dense regions enforce agreement by suppressing high-frequency modes.

To see this mathematically, consider the Rayleigh quotient, given by

\[\lambda = \frac{v^{\mathsf T} L v}{v^{\mathsf T} v} = \frac{\sum_{i \sim j} w_{ij}(v_i - v_j)^2}{\sum_i v_i^2}\]

\(\lambda\) measures the "frequency" of an arbitrary vector \(v\): the weighted squared disagreement across edges, normalized by the vector's energy. When \(v\) is an eigenvector, \(\lambda\) is its eigenvalue.

A smooth eigenvector — one that varies slowly across the graph — produces small differences \((v_i - v_j)^2\) at most edges, and hence \(\lambda\) is small. A dense cluster forces many edges to contribute near-zero terms, so any vector that’s roughly constant within the cluster is automatically low-frequency.

The implication: the low-frequency eigenvectors are exactly the ones that respect cluster structure. The first few eigenvectors after the zero mode are approximately cluster indicator functions. Dense clusters suppress disagreement so fast that only cluster-level variation survives as slow dynamics.
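To make the quotient concrete, here is a sketch on a hypothetical graph of two triangles joined by a bridge. A cluster-respecting vector pays only for the bridge edge, while a sign-alternating vector pays at almost every edge:

```python
import numpy as np

# Hypothetical graph: two triangles (0-1-2, 3-4-5) joined by bridge 2-3.
n = 6
W = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

def rayleigh(L, v):
    """Rayleigh quotient: edge-wise squared disagreement over signal energy."""
    return (v @ L @ v) / (v @ v)

# Constant within each cluster: only the bridge edge contributes.
v_cluster = np.array([1., 1., 1., -1., -1., -1.])
# Sign-alternating: many edges contribute large squared differences.
v_sharp = np.array([1., -1., 1., -1., 1., -1.])

print(rayleigh(L, v_cluster))    # 4/6: low frequency
print(rayleigh(L, v_sharp))      # 20/6: high frequency
```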

Bottlenecks and the Fiedler mode. The eigenvector corresponding to the second-smallest eigenvalue, \(\lambda_2\), is called the Fiedler mode (or Fiedler vector). It is the first non-trivial mode, and it has a specific geometric meaning: it's the smoothest non-constant pattern the graph supports. The eigenvalue \(\lambda_2\) itself is called the algebraic connectivity of the graph.

For a graph with a bottleneck, the Fiedler mode takes positive values on one side of the bridge and negative values on the other. It’s the graph’s natural "fault line". A small \(\lambda_2\) means the graph is nearly disconnected; that is, a strong bottleneck exists.

Note

If you hold node H, you will see that the Fiedler mode doesn’t activate. H is the bottleneck not because H activates the Fiedler mode strongly, but because H sits at the bottleneck’s zero crossing. This tells us not only that a cut exists but where it is. If you hold either G or I, you will see that H sits at the boundary between two coherent regions.
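The same structure can be sketched numerically on a hypothetical barbell graph: two 4-cliques joined through a single middle node playing the role of H:

```python
import numpy as np

# Two 4-cliques (nodes 0-3 and 5-8) joined through a middle node 4.
n = 9
W = np.zeros((n, n))
for clique in [(0, 1, 2, 3), (5, 6, 7, 8)]:
    for a in clique:
        for b in clique:
            if a < b:
                W[a, b] = W[b, a] = 1.0
W[3, 4] = W[4, 3] = 1.0            # left clique to the bridge node
W[4, 5] = W[5, 4] = 1.0            # bridge node to the right clique

L = np.diag(W.sum(axis=1)) - W
lam, V = np.linalg.eigh(L)
fiedler = V[:, 1]                  # eigenvector of the 2nd-smallest eigenvalue

print(lam[1].round(4))             # algebraic connectivity: small
print(fiedler.round(3))            # +/- split across the cut, ~0 at node 4
```

The middle node's entry sits near the Fiedler vector's zero crossing, mirroring the simulator's behavior at H.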

Smooth vs. sharp disturbances as spectral support. We saw earlier that energizing a node in a dense cluster produces a smooth disturbance, while energizing a sparsely connected node produces a sharp one. The eigenbasis makes this precise.

A smooth disturbance (one that varies slowly across the graph) has small Rayleigh quotient, meaning its energy concentrates in low-frequency eigenvectors. A sharp disturbance (strong local contrast with neighbors) has large Rayleigh quotient, meaning its energy projects onto high-frequency eigenvectors. The spectral support of a disturbance is simply which eigenvectors carry its energy.

Under diffusion, high-frequency modes decay faster because their eigenvalues are larger — recall that each mode decays at rate \(\lambda_k\). A sharp disturbance therefore smooths out quickly: the high-frequency components collapse first, leaving only the slow, coherent modes behind. A smooth disturbance decays slowly because its energy was already concentrated in low-frequency modes to begin with.

This connects directly to classification. Two incidents with different spectral support have structurally different propagation signatures, even if their node-space representations look similar at a given instant. The modal energy distribution — which eigenvectors carry how much energy — is a topology-aligned fingerprint of the disturbance type.
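As a sketch of such a fingerprint, compare the modal energy distributions of a cluster-wide disturbance and a single-node spike on a hypothetical two-triangle graph:

```python
import numpy as np

# Hypothetical graph: two triangles (0-1-2, 3-4-5) joined by bridge 2-3.
n = 6
W = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
lam, V = np.linalg.eigh(L)              # modes sorted low to high frequency

def fingerprint(x):
    """Fraction of signal energy carried by each propagation mode."""
    e = (V.T @ x) ** 2
    return e / e.sum()

x_cluster = np.array([1., 1., 1., 0., 0., 0.])  # cluster-wide disturbance
x_spike = np.array([0., 1., 0., 0., 0., 0.])    # single-node spike

print(fingerprint(x_cluster).round(3))  # energy concentrated in low modes
print(fingerprint(x_spike).round(3))    # energy spread into high modes
```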

Diagnostic implications. The spectral view of propagation behavior has direct diagnostic value. Four implications stand out.

Blast-radius detection. The Fiedler mode identifies structural boundaries in the service graph. When an incident activates modes that concentrate energy on one side of a bottleneck, the blast radius is structurally bounded — the incident is unlikely to cross the cut. This gives triage systems a topology-grounded answer to "how far will this spread?" without waiting for symptoms to materialize downstream.

Incident classification via spectral support. The modal energy distribution is a topology-aligned fingerprint of the incident type. Incidents that activate similar modes in similar proportions are structurally similar, even if their node-space representations differ. This enables classification based on propagation geometry rather than surface-level symptom patterns.

Noise suppression. High-frequency modes decay fastest and are most sensitive to localized, transient disturbances. Trimming high-frequency modal contributions from an incident state filters out noise while preserving the coherent, system-level behavior that actually requires a response.

Structural vs. local failures. Low-frequency activation indicates a coherent, system-wide disturbance — a structural failure propagating across the graph. High-frequency activation indicates a localized anomaly confined to a small neighborhood. The spectral support of an incident state distinguishes these two failure classes, which have different triage paths.

Finally, the topology constrains which incident states are physically realizable. Not all node-level patterns are equally plausible — the service graph rules out large classes of symptom combinations that would require energy to propagate against the grain of the topology. Representing incidents in propagation space makes this constraint explicit, filtering the hypothesis space that a triage system needs to consider.
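The noise-suppression implication can be sketched as a spectral low-pass filter. The graph, signals, and cutoff \(k = 2\) are all hypothetical:

```python
import numpy as np

# Hypothetical graph: two triangles (0-1-2, 3-4-5) joined by bridge 2-3.
n = 6
W = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
lam, V = np.linalg.eigh(L)

def lowpass(x, k):
    """Keep only the k lowest-frequency modal contributions of x."""
    z = V.T @ x
    z[k:] = 0.0                  # trim high-frequency modes
    return V @ z

# Coherent cluster-level state plus sharp, localized noise.
x_clean = np.array([1., 1., 1., 0., 0., 0.])
x_noisy = x_clean + 0.3 * np.array([0., 1., -1., 0., 1., -1.])

x_filt = lowpass(x_noisy, k=2)
print(np.linalg.norm(x_noisy - x_clean).round(3))  # error before filtering
print(np.linalg.norm(x_filt - x_clean).round(3))   # smaller after filtering
```

The filtered state is closer to the clean signal because the injected noise is spatially sharp and so lives almost entirely in the trimmed high-frequency modes.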

Spectral compressibility hypothesis

The diagnostic implications above rest on an assumption worth making explicit: that real incident states actually concentrate their energy in a small number of propagation modes. If incident error states were spectrally diffuse — energy spread uniformly across all modes — the eigenbasis would still be a valid coordinate system. But it would offer no compression, no natural separation of signal from noise, and no low-dimensional fingerprint for classification.

The spectral compressibility hypothesis asserts the opposite: that topology-constrained failure propagation tends to produce incident states whose energy concentrates in a small number of low-frequency modes. Formally, if we project an incident error state \(x\) into the eigenbasis via \(z = V^{\mathsf T} x\), we expect most of the energy \(\|z\|^2\) to be carried by a small number of coordinates.

This would follow naturally from the structure we’ve already described. Failures propagate along edges, clusters act as coherent units, and bottlenecks bound the spread. These constraints mean that real incidents don’t excite the full state space uniformly — they follow the grain of the topology, which is precisely what the low-frequency modes capture.

If the hypothesis holds, the implications are concrete: incident states can be represented faithfully in a low-dimensional propagation space, noise can be suppressed by discarding high-frequency residuals, and classification can operate on compact spectral fingerprints rather than high-dimensional node vectors.
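A sketch of that measurement on a hypothetical 20-node topology (four 5-node cliques joined in a ring), with a simulated incident built as a diffused spike plus small measurement noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical topology: four 5-node cliques joined in a ring by single edges.
n = 20
W = np.zeros((n, n))
for c in range(4):
    for a in range(5 * c, 5 * c + 5):
        for b in range(a + 1, 5 * c + 5):
            W[a, b] = W[b, a] = 1.0
for c in range(4):
    a, b = 5 * c, (5 * c + 5) % n     # one bridge per consecutive pair
    W[a, b] = W[b, a] = 1.0
L = np.diag(W.sum(axis=1)) - W
lam, V = np.linalg.eigh(L)

# Simulated incident: a single-node spike diffused for a while, plus noise.
x0 = np.zeros(n); x0[7] = 1.0
x = V @ (np.exp(-lam * 1.0) * (V.T @ x0)) + 0.01 * rng.standard_normal(n)

# Energy retained by the top-k modes, ranked by energy.
energy = np.sort((V.T @ x) ** 2)[::-1]
retained = np.cumsum(energy) / energy.sum()
print(retained[:5].round(3))          # most energy in a handful of modes
```

Under these assumptions the diffused state is highly compressible; the open question is whether realistic failure simulations behave the same way.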

Modeling assumptions

The hypothesis fundamentally posits incident compressibility into a low-dimensional spectral space. For simplicity this post makes a number of modeling assumptions that are not themselves part of the hypothesis. Relaxing them is part of the research agenda. For now, they offer a starting point for the investigation.

Here are some initial assumptions:

  • Undirectedness: edges are undirected. The symmetric Laplacian captures structural coupling but not causal direction, which can produce spurious backward propagation.

  • Linearity: the diffusion dynamic \(dx/dt = -Lx\) is linear. Real failure propagation is likely nonlinear (thresholds, circuit breakers, load shedding), and nonlinear operators don't have the same clean eigenbasis decomposition.

  • Time-invariant topology: the graph topology is fixed over time. Many real systems have evolving topology (auto-scaling, circuit breakers, service mesh changes), and the eigenbasis shifts with the topology.

  • Homogeneous edges: all edges are structurally equivalent. Real edges have different latency, bandwidth, error-rate, and criticality characteristics that affect propagation differently.

  • Node signal model: there is a single scalar per node. Real services emit multivariate signals (latency, throughput, error rate, saturation) that propagate differently along the same edge.

  • Stationarity: incidents are drawn from a distribution that's stable enough to cluster. Architectural changes to the system can break stationarity.

These considerations motivate the first experiment: measuring the spectral compressibility of incident error states across a range of simulated failure scenarios, and asking how much energy is retained as we restrict to the top \(k\) modes.

Why this matters operationally

The machinery developed in this post — graph Laplacian diffusion, eigenbasis decomposition, spectral support — might seem abstract relative to the day-to-day reality of incident response, but the operational payoff is concrete.

The core problem, stated at the outset, is that real observability systems present node-level representations while operators need propagation-level understanding. Every implication in this post is a consequence of closing that gap.

Noise reduction. Most observability noise is high-frequency: transient, localized disturbances that don’t reflect coherent system behavior. Projecting incident states into propagation space and trimming high-frequency modal contributions filters this noise structurally, without hand-tuned thresholds or signal-specific heuristics.

Incident classification. Node-space representations make classification hard: two structurally identical incidents can look different at the node level depending on which services happen to be involved. Propagation-space representations normalize for this — incidents with similar spectral fingerprints are structurally similar regardless of their node-space appearance.

Blast-radius prediction. The Fiedler mode and bottleneck structure give triage systems a topology-grounded answer to blast-radius questions before downstream symptoms materialize. This shifts incident response from reactive (waiting for failures to propagate) to anticipatory (predicting where they will propagate).

Topology-aware anomaly detection. Standard anomaly detection operates on individual time series. Propagation-space anomaly detection operates on graph signals, distinguishing structural disturbances from localized noise and identifying which propagation modes are anomalously activated.

Cognitive load reduction. Fewer, more meaningful signals. The propagation-space representation surfaces what’s really happening — the underlying structural dynamics — and presents node-level charts as supporting artifacts rather than the primary view. This is the inversion the motivation section called for.

Taken together, these capabilities point toward a topology-aligned representation layer sitting between raw telemetry and the operator. Not a replacement for existing observability tooling, but a lens that organizes its output around propagation geometry rather than node-level coordinates.

Representing incident states in propagation space provides a topology-aligned coordinate system for graph signals. This enables more reliable classification of incident types, improved prediction of failure propagation and recovery dynamics, and topology-aware anomaly detection that distinguishes structural disturbances from localized noise. The value arises not merely from dimensionality reduction, but from aligning the representation with the underlying geometry imposed by service dependencies.

This motivates the first experiment: measuring spectral compressibility of incident error states.