AI/ML observability that cut noise and revealed true root causes

Case Study:
Telco Service Assurance

Designing confidence through root cause clarity
Overview

Telco Cloud Service Assurance was positioned as an operational control system for multi-domain telecom networks that had crossed a complexity threshold: physical infrastructure, virtualized network functions, containerized workloads, and service layers were now interdependent, yet owned and operated by different teams with different tooling, incentives, and accountability.

The product’s promise was not incremental improvement to alerting. It was a structural shift in how outages were understood and acted upon: from symptom-driven response to causal investigation that could be trusted under production pressure.

From a design standpoint, the core problem was not whether correlation algorithms existed. It was whether operators could see and believe the system’s reasoning fast enough to change behavior. In environments where mean time to restore service is a management metric and customer impact escalates quickly, opaque automation is functionally equivalent to noise.

This work reframed service assurance as an investigation system, not an alerting surface. The experience was designed to make causal structure legible across domains so operators could build shared understanding, align on the true failure point, and act with warranted confidence — even when upstream visibility was incomplete.

The result was not simply a better interface. It was a shift in how the organization treated automation: from a black box that operators learned to work around, to a reasoning partner that could be challenged, validated, and trusted.

Problem

Telecommunications operators were running networks that had crossed a practical complexity boundary. Physical transport, virtualized network functions, containerized platforms, and service layers were no longer separable in day-to-day operations. Failures routinely propagated across domains that were monitored by different tools, managed by different teams, and governed by different escalation paths.

Traditional assurance experiences flattened this complexity into high volumes of disconnected alerts. While correlation logic often existed somewhere in the stack, operators primarily experienced the system as a stream of symptoms. The result was not just inefficiency. It created a structural failure in how incidents were understood: teams responded to what was loud, not what was causal.

Under production pressure, operators were forced to manually assemble mental models of failure from fragmented signals. This pushed diagnosis into tribal knowledge and individual expertise rather than shared, auditable reasoning. Two operators could see the same incident and arrive at different conclusions, depending on which tools they trusted and which symptoms they prioritized.

This fragmentation had direct operational consequences. It increased the likelihood of misdirected remediation, extended outages when the true underlying cause was not addressed, and made it difficult for teams to learn systematically from incidents. The system optimized for surfacing alarms, not for building a coherent understanding of failure.

The problem, therefore, was not alert volume alone. It was cognitive load under time pressure. Operators were being asked to perform causal reasoning in environments where the tooling actively obscured causal structure.

From an organizational perspective, this created a compounding risk: as networks became more software-defined and multi-vendor, the cost of poor correlation experiences increased. When outages spanned organizational and vendor boundaries, the absence of a shared causal narrative turned technical incidents into coordination failures.

COMPANY

VMware

2023

ROLE

Staff UX Designer — Led UX strategy, innovation design, prototyping, and system design for AI/ML observability in 5G & SD-WAN

EXPERTISE

UX strategy, information architecture, interaction design, prototyping, design systems, AI/ML observability

Hands-on telco service assurance persona

Role

I led experience design for Telco Cloud Service Assurance across the investigation lifecycle: real-time monitoring, cross-domain correlation, anomaly detection, and root cause analysis. My accountability was not limited to interface design. It included shaping how automated intelligence, topology context, and service impact data were operationalized into workflows that operators could trust under escalation pressure.

My role sat at the boundary between platform capabilities and operational reality. Engineering teams were building correlation and detection systems that could operate across physical infrastructure, virtual network functions, and containerized platforms simultaneously. Operations teams were accountable for restoring service quickly and explaining outages to leadership and customers. The design work had to reconcile these pressures: translating system intelligence into decision support that fit how incidents were actually handled.

This meant defining investigation flows that aligned with escalation paths, interaction models that reflected how operators reason about topology and services, and experience principles governing when the system should assert conclusions versus when it should expose uncertainty.

A key part of the role was negotiating how much authority the system was allowed to claim. Product and platform stakeholders were incentivized to present stronger automation outcomes. Operations stakeholders were accountable for the consequences of being wrong. The experience design became the mechanism for resolving that tension: encoding rules for confidence, explanation, and operator override into the product itself.

In practice, this required working across product management, platform engineering, and operations-facing teams to ensure that the experience reflected real incident dynamics — not just idealized automation flows.

Recommended solutions for console notifications

A-ha Moment — Noise vs. Root Cause

A critical inflection point emerged during operational reviews and early deployments: the same underlying correlation capabilities produced radically different operator behavior depending on how results were surfaced.

When alerts appeared as flat, unstructured signals, operators treated the system as another noise source. Even when correlation logic was technically correct, the experience did not make the reasoning visible. Operators could not tell why a particular alert was being promoted, which forced them to fall back on familiar tools and personal heuristics.

In these scenarios, automation did not reduce cognitive load. It added another layer of arbitration. Operators spent time validating the system rather than using it, effectively duplicating work under time pressure.

In contrast, when the experience surfaced causal structure — showing how symptoms were related across domains and how they converged on a likely failure point — operator behavior changed materially. The same alerts were no longer treated as noise. They became time-saving orientation tools.

The difference was not algorithmic accuracy. It was legibility. Operators needed to see enough of the system’s reasoning to trust that it aligned with their mental model of the network. Confidence without explanation was experienced as noise. Explanation created shared understanding.

This reframed the design problem. The question was no longer “How do we generate better alerts?” It became “How do we make causal structure visible fast enough to change how incidents are worked?”

That reframing shifted investment away from alert optimization and toward investigation experience: how chains were visualized, how topology and service impact were integrated, and how operators could validate or challenge the system’s conclusions in real time.

Filtering user interface pattern that encouraged browsing

Core Tension

The platform’s technical strength was its ability to correlate fault and performance data across layers — from physical infrastructure and transport, through virtual infrastructure and container platforms, to network functions and services. In theory, this enabled a unified view of failure across domains.

In practice, that power created a new class of risk.

Operations teams were accountable for restoring service and meeting service level commitments. Product and platform teams were incentivized to demonstrate automation maturity and efficiency. These incentives converged on a pressure to present stronger, more definitive system conclusions — to appear faster, more automated, and more decisive.

At the same time, upstream visibility was often partial. Multi-vendor environments, incomplete integrations, and delayed telemetry meant that correlation could be directionally correct without being fully certain. In those conditions, over-assertive conclusions were operationally dangerous.

The tension was not abstract. It was measurable:

     ·  Faster, more assertive conclusions could reduce apparent time to action.

     ·  Incorrect conclusions increased the likelihood of misdirected remediation, secondary escalations, and extended outages when the real cause was not addressed.

     ·  Slower, more transparent reasoning improved trust and correctness, but risked higher perceived time to respond in executive and customer-facing metrics.

This created a structural design dilemma: how to encode confidence in a way that balanced speed, correctness, and organizational accountability.

The experience became the enforcement layer for this trade-off. Decisions about what the system asserted, what it suggested, and what it explicitly marked as uncertain were not cosmetic. They determined whether automation reduced operational risk — or amplified it.

Designing for this tension required treating confidence as a first-class interaction concept, not a byproduct of correlation accuracy.


Approach

The experience design treated automated detection as an input to investigation, not as an authoritative verdict. This was not a philosophical stance. It was a structural response to how incidents actually unfold in multi-domain, multi-vendor networks.

Three system-level rules governed the approach.

Correlation before action.
Alerts were not allowed to stand alone. The system grouped events into causal structures anchored in topology and service impact before they were promoted as actionable. This reduced raw noise, but more importantly, it established a shared object of reasoning: a chain that operators could inspect, discuss, and validate.

Root cause as orientation, not proclamation.
Likely underlying causes were surfaced to focus attention, but they were always presented in the context of related symptoms and contributing signals. The system showed its work. Operators could see how the system arrived at a conclusion and decide whether it matched their understanding of the network.

Progressive commitment.
The interface supported different depths of engagement depending on risk. In low-impact scenarios, operators could act quickly on high-confidence patterns. In high-impact or ambiguous scenarios, the system made it easy to slow down, inspect evidence, and challenge assumptions before committing to remediation.

These rules encoded a hierarchy of confidence into the product. They created explicit gates between detection, orientation, and action — reflecting how experienced operators already behaved, but making that structure visible and shared.
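
To make that hierarchy concrete, the sketch below shows one way the gates could be expressed as a decision over a correlated chain. It is a minimal illustration, not the product's implementation: the Symptom, CausalChain, and gate_for names, the 0.8 threshold, and the weakest-link confidence aggregation are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Gate(Enum):
    DETECTION = "detection"      # symptoms observed, not yet promoted
    ORIENTATION = "orientation"  # causal chain assembled; likely cause shown with evidence
    ACTION = "action"            # confidence and impact justify acting on the conclusion


@dataclass
class Symptom:
    source: str          # originating domain, e.g. "transport", "vnf", "k8s"
    topology_node: str   # node in the topology graph the signal attaches to
    confidence: float    # 0.0-1.0 confidence in this individual inference


@dataclass
class CausalChain:
    symptoms: list[Symptom] = field(default_factory=list)
    likely_cause: Symptom | None = None
    service_impact: str = "unknown"

    def confidence(self) -> float:
        # naive aggregate: the chain is only as credible as its weakest link
        return min((s.confidence for s in self.symptoms), default=0.0)


def gate_for(chain: CausalChain, high_impact: bool) -> Gate:
    """Decide how far the system is allowed to commit for this chain."""
    if not chain.symptoms or chain.likely_cause is None:
        return Gate.DETECTION        # nothing coherent to orient around yet
    if high_impact or chain.confidence() < 0.8:
        return Gate.ORIENTATION      # show the reasoning, do not prescribe remediation
    return Gate.ACTION               # low-impact, high-confidence: fast path
```

In this framing, the gate is a property of the correlated chain and its blast radius, not of the raw alert stream; the interface then decides how much authority to render at each gate.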

This approach aligned with platform realities. End-to-end assurance depended on assembling signals from multiple collectors and sources, often with gaps. By designing investigation as a staged process, the system could be useful even when certainty was unavailable.

The result was not simply better UX. It was a redefinition of what “automation” meant in practice: not replacing operator judgment, but scaffolding it so that reasoning could be faster, more consistent, and more defensible.


When Visibility Was Incomplete

Partial visibility was not an edge case. It was a normal operating condition.

In multi-vendor environments, telemetry coverage lagged behind operational reality. Integrations were uneven, some domains emitted delayed or coarse-grained signals, and certain failure modes could be detected without enough upstream context to determine a single, definitive cause.

Early scenarios made this concrete. The system could reliably identify abnormal behavior and surface correlated symptoms, but it could not always assert a root cause with high confidence. In those situations, a naïve automation posture would have been to still present a single “best” answer.

That path was explicitly rejected.

If the system presented a definitive diagnosis under partial visibility, operators would act on it — and in several observed cases, that would have sent remediation in the wrong direction. The failure would not have looked like a tooling issue. It would have looked like operator error, even though the system had over-claimed certainty.

Instead, the experience was designed to surface contributing signals and their relationships, making uncertainty visible and actionable. Operators could see what the system knew, what it did not, and how confident it was in each inference.

This preserved the system’s usefulness without training operators to blindly trust it. It also changed escalation dynamics. When incidents crossed teams or vendors, operators could bring a structured causal narrative to the conversation, even if a final root cause was still under investigation.

The design decision reframed uncertainty from a weakness into a shared artifact. Rather than hiding gaps, the system made them explicit — enabling better cross-team coordination and reducing the risk of confidently wrong actions.

Outcome


The experience changes altered how incidents were worked, not just how alerts were displayed.

With causal structure visible and investigation flows standardized, operators were able to orient faster and converge on a shared understanding of failure. Instead of parallel, tool-specific interpretations, teams increasingly worked from the same causal narrative.

This reduced the operational overhead of incident arbitration. Fewer cycles were spent reconciling conflicting interpretations across tools and teams. The product became a coordination surface, not just a monitoring surface.

More importantly, it shifted where mistakes occurred. When errors happened, they were more often errors of judgment with visible evidence, rather than errors caused by hidden system assumptions. This made incidents easier to review, explain, and improve.

The platform’s role evolved accordingly. Telco Cloud Service Assurance was no longer experienced primarily as an alert source. It became a system of record for investigation: a place where incident reasoning was constructed, validated, shared, and referred back to. In environments where outages crossed team and vendor boundaries, that shift had direct coordination value.

This repositioning mattered organizationally. As networks modernized and cross-domain dependencies increased, the cost of opaque tooling grew. The experience direction aligned the product with how operators and managers evaluated operational maturity: not by alert volume, but by how quickly and coherently teams could explain what was happening and why.


Second-Order Outcome — Risk Prevented

One failure mode posed a systemic adoption risk: presenting authoritative root cause conclusions when upstream visibility was incomplete.

There was sustained internal pressure to surface stronger, more definitive diagnoses. From a product perspective, this supported narratives of automation maturity and operational efficiency. From an operations perspective, it created a different risk: confidently executing the wrong remediation in environments where visibility gaps were known and unavoidable.

This was not hypothetical. In early scenarios, partial telemetry would have produced plausible but incorrect root cause assertions. Had those been presented as authoritative, operators would have acted on them. The system would have appeared fast — and wrong.

I enforced a concrete rule: the system could not present a single authoritative root cause when required upstream evidence was missing. In those cases, it had to surface competing contributing signals and explicitly mark confidence limits.
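
As an illustration of how that rule might be encoded, the sketch below gates the presentation mode on evidence completeness. The Candidate and present_diagnosis names and the 0.9 threshold are assumptions made for the example; the actual evidence requirements were a product decision, as discussed below.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    cause: str
    confidence: float
    missing_evidence: list[str] = field(default_factory=list)  # required upstream signals not collected


def present_diagnosis(candidates: list[Candidate], threshold: float = 0.9) -> dict:
    """Return either a single root cause claim or a set of contributing signals."""
    if not candidates:
        return {"mode": "no_diagnosis"}
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    top = ranked[0]
    # The rule: if required upstream evidence is missing, or confidence is below the
    # agreed threshold, the system may not assert a single authoritative root cause.
    if top.missing_evidence or top.confidence < threshold:
        return {
            "mode": "contributing_signals",
            "candidates": [
                {"cause": c.cause, "confidence": c.confidence, "missing": c.missing_evidence}
                for c in ranked
            ],
        }
    return {"mode": "root_cause", "cause": top.cause, "confidence": top.confidence}
```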

This rule blocked a class of product behaviors that would have optimized for perceived speed at the cost of long-term trust. It also changed internal decision-making. Product discussions shifted from “How do we make this look more automated?” to “What evidence threshold is required to claim certainty?”

The prevented failure mode was not only incorrect remediation. Operators who experience a system as confidently wrong do not partially discount it — they route around it entirely. A rule blocking over-assertion under partial visibility protected adoption as much as it protected accuracy.

This decision traded short-term metrics optics for durable adoption. It preserved the system’s role as a reasoning partner rather than allowing it to become a brittle authority that operators learned to ignore.

RODERICK SLOAN

career@sloan.design

+1 650 863 4382

1209 Sanchez Avenue, Burlingame, California

Copyright © 2026 R Sloan
