
Observability tools weren’t built for AI debugging

AI debugging has a data problem.
May 12, 2026

Estimated reading time: 10 minutes

Key takeaways:

  • AI debugging has a data problem, not a model problem. Smarter models won’t fix bad observability. Garbage in, garbage out.
  • You’re drowning in telemetry but starving for signal.
  • Collect the right data upfront, not the wrong data at scale. Session-based, correlated collection gives AI agents the complete context they need.

Most engineering teams are using AI to write code faster than ever. They’re also shipping bugs at the same speed.

Here’s what that workflow actually looks like end to end:

[Auto-instrument everything via OTel]
→ [Collector samples/filters some]
  → [Store remaining data]
  → [Developer notices bug]
    → [Manually copy-paste error OR query via MCP server]
    → [AI gets incomplete/noisy context]
      → [AI suggests fix based on partial data]
      → [Human reviews]
        → [Code “looks plausible”]
        → [Deploy]
          → [Discover edge case not addressed by AI]
          → [More bugs in production]

There are several places this workflow breaks down. Ultimately, the root cause is a data and context problem. Until we fix it, no amount of AI capability is going to close the gap.

If I, as a developer, don’t have the right data from my observability tools out of the box, there’s no magic solution that will come from passing it to an AI agent. The problem is what “the right data” actually looks like and how far current tools are from providing it.

Current observability tools are pretty good at gathering:

  • System-level metrics: uptime, latency, error rates.
  • Service-to-service traces: which services called which, and how long each hop took.

However, they consistently leave gaps where it matters most for debugging complex issues:

  • Sampled traces and logs: aggressively thinned out to control storage costs.
  • Session-based telemetry: no correlation of user journeys with backend behavior.
  • Request/response payloads: missing for internal services and external dependencies.

As this information isn’t provided by default, AI tools end up working with a partial view of reality. They can still be genuinely useful, pattern-matching against common issues and getting it right for well-scoped, well-understood problems.

AI debugging falls apart the moment you hit a complex, distributed systems issue. The kind where you need full runtime context to understand what actually went wrong. In those cases, engineers still need to fire up the local dev environment, pull in all the tooling, and debug manually.

Current observability tools are fundamentally broken for AI debugging, and the gap is about to get much worse.

According to Harness’s State of Software Delivery 2025 report, 67% of developers now spend more time debugging AI-generated code than they did before they started using AI tools. Stack Overflow’s 2025 Developer Survey tells the same story from a different angle: 66% of developers say they’re spending more time fixing “almost-right” AI-generated code, and 45% list it as their top frustration with AI tools overall.

Think about what happens as more teams adopt AI-coding tools or implement agentic AI to build their applications. The volume of code goes up. The speed of shipping goes up. Yet the observability tooling underneath hasn’t changed. The result? More bugs, harder to diagnose, and an (unnecessary) proliferation of observability tools.

Here’s what fixing it actually looks like.

Where the process breaks down

Too much irrelevant data

Companies’ approach to telemetry collection is precautionary: they collect EVERYTHING because they don’t know what they’ll need when an issue happens. Instrumenting on demand (after an issue exposes a gap) is expensive and time-consuming, so the path of least resistance is to cast the widest net up front.

The problem is that casting a wide net and actually catching the right fish are two very different things. The result is companies hoarding massive volumes of generic telemetry and trying to wrangle the storage costs that come with it.

“Observability is eating our infrastructure budget” is a complaint I hear regularly on customer calls, and it’s not an anomaly – it’s the norm.

OpenTelemetry has made this worse, not better. Its democratized instrumentation is a genuine win for standardization, but it accidentally created a tragedy of the commons. When it’s trivially easy for every team to auto-instrument everything, everyone does. Nobody stops to ask whether the data is actually useful.

You end up with a system that’s drowning in telemetry but starving for signal. Michele Mancioppi, head of product at Dash0, put it bluntly during a recent LeadDev panel on agentic observability: “At KubeCon, I shared that, of the logs getting into observability tools, less than 1% are fatal errors. All the rest is probably a waste of money.”

This is where AI debugging hits its first wall. You can’t just point an AI agent at your observability data and expect it to work. There are two reasons why.

First, there’s cost. AI models charge per token processed, and feeding them a firehose of telemetry is expensive. Second, there’s a hard technical limit. AI models have a finite context window. You have to choose what you feed them. If you feed them raw OTel traces and logs, the agent either gets lost in the noise, or it misses the 1% that actually matters because that slice was sampled away or was never collected in the first place.
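To make that limit concrete, here’s a back-of-the-envelope sketch in Python. Every figure in it (fleet volume, tokens per log line, context window size) is an illustrative assumption, not a measurement:

```python
# Back-of-the-envelope: does a day of raw telemetry fit in a model's context?
# Every figure here is an illustrative assumption, not a measurement.

LOG_LINES_PER_DAY = 50_000_000    # assumed volume for a mid-sized fleet
TOKENS_PER_LOG_LINE = 40          # assumed average for a structured log line
CONTEXT_WINDOW_TOKENS = 200_000   # assumed large-model context window

raw_tokens = LOG_LINES_PER_DAY * TOKENS_PER_LOG_LINE
print(f"Raw telemetry: {raw_tokens:,} tokens")                             # 2,000,000,000
print(f"Context windows needed: {raw_tokens // CONTEXT_WINDOW_TOKENS:,}")  # 10,000

# Even keeping only the ~1% of lines that are real errors still overflows:
error_tokens = raw_tokens // 100
print(f"Errors only: {error_tokens:,} tokens, "
      f"{error_tokens // CONTEXT_WINDOW_TOKENS:,} windows")                # 100 windows
```

Under those assumptions, even an aggressive errors-only filter leaves you two orders of magnitude over a single context window. What you collect matters more than how much you can afford to store.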

Missing critical debugging data

Even if your AI agent isn’t drowning in irrelevant data, it still doesn’t have the full picture. The context it needs is either scattered across disconnected tools or was never collected. For example:

  • Request/response payloads aren’t captured by default due to privacy and cost concerns. You can see that an API call was made, but not what data was sent or what came back.
  • Headers (authentication tokens, routing info, custom metadata) are often redacted for security reasons, removing critical debugging context.
  • External API exchanges are black boxes. Distributed tracing stops at your system boundary. When your backend calls Stripe, Twilio, or AWS, you see the call happened and how long it took, but not what you sent them or what they returned.
  • Session correlation is fragmented at best. Frontend monitoring (RUM) lives in one tool, backend observability in another. Connecting “this user clicked A → then B → then backend failed” requires manual stitching across platforms.

Current observability tools can show you that something happened and where it happened, but they struggle to automatically connect the dots across your entire stack. Getting full correlation requires extensive instrumentation and gets prohibitively expensive at scale, especially when capturing the actual data (payloads, headers, full context) flowing through the system.

Let’s assume for a moment that all the data exists and nothing is missing. It’s just siloed across different tools.

In theory, an AI agent could query multiple observability platforms (frontend errors in Sentry, backend traces in Datadog, external API logs in Stripe, session replays in LogRocket) and correlate them automatically. In practice, this almost never happens (at least, not yet).

Each tool has different APIs, authentication methods, and data formats. More critically, the correlation keys that tie events together – request IDs, trace IDs, session IDs – rarely propagate across tool boundaries. The agent would need to guess which frontend error corresponds to which backend trace based on timestamps, an error-prone approach that breaks down under any real load or clock skew.

So even with perfect API access to every tool, the agent still can’t see the full picture unless the data was already correlated when it was collected.

This leaves developers doing the correlation work manually: copying request IDs between tools, matching timestamps across dashboards, piecing together the story, and only then pasting the relevant fragments into an AI tool for analysis.
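For contrast, here’s a minimal sketch of what “correlated at collection time” could look like. The names (record_event, the x-session-id header) are hypothetical, not any particular vendor’s API:

```python
import uuid

# Hypothetical sketch: stamp every event with one session ID as it's recorded,
# so frontend and backend records share a join key from the start.

SESSION_HEADER = "x-session-id"  # assumed custom header, sent on every backend call

events: list[dict] = []  # stand-in for wherever telemetry actually lands

def record_event(session_id: str, source: str, **fields) -> None:
    """Every event carries the session ID; no timestamp-guessing later."""
    events.append({"session_id": session_id, "source": source, **fields})

# Minted once in the frontend when the user's session starts, then propagated
# under SESSION_HEADER through the whole request chain.
sid = str(uuid.uuid4())

record_event(sid, "frontend", action="clicked_checkout")
record_event(sid, "backend", error="payment provider returned 502")

# "Correlation" becomes a trivial filter instead of cross-tool detective work.
session_events = [e for e in events if e["session_id"] == sid]
```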

Human code reviews can’t keep up with AI code generation

As every engineer knows, reading code is harder than writing it. You have to reverse-engineer someone else’s thought process, understand why they made certain decisions, and spot edge cases they might have missed. Even reviewing your own code weeks later isn’t much easier: our memory is fallible, and documentation is rarely as complete as it should be.

Now add an AI assistant that writes code faster than any human can. AI can generate 100 lines of plausible-looking code in 30 seconds. A thorough review of those same 100 lines could take 15-20 minutes. The math doesn’t work. Teams either bottleneck on review or let things slip through.

And things do slip through. AI-generated code often works for the happy path the model imagined, but it doesn’t think defensively the way experienced developers do. It pattern-matches based on what it’s seen before rather than reasoning about edge cases, error states, or unexpected inputs.

The result? Even if the bug rate per line of code stays constant, you’re shipping more code faster, which means more bugs in absolute terms. Because AI-generated code compiles cleanly and looks structurally correct, those bugs are harder to catch in review. The code looks fine. It just fails in production under conditions the AI never considered.

This is where the “AI makes you 10x more productive” narrative breaks down. “It compiles on the first try” doesn’t mean the code is correct. It means you’ve deferred the debugging work to later when the bug hits production.

The observability gap compounds this problem. Without complete runtime context (the actual payloads, the external API responses, the session state), you’re reverse-engineering logic you didn’t write, based on incomplete information, under time pressure.

What needs to change?

The observability industry is starting to recognize these problems, but the solutions being proposed address different parts of the puzzle.

Approach 1: smart filtering at the collector layer

Some vendors are advocating for intelligent collection pipelines:

[App]
→ [Send everything]
  → [Smart AI collector]
  → [Decides what to keep/drop/escalate]
    → [Storage]
      → [AI gets clean data]

The idea is to use AI at the collector layer to filter out noise before it hits storage. Keep fatal errors, drop health checks, and intelligently sample based on what looks important.
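Expressed as plain logic rather than any specific collector’s configuration language, the rule set looks something like this sketch; the severities, routes, and sample rate are assumptions:

```python
import random

# Sketch of collector-layer filtering: keep failures, drop health-check noise,
# sample the rest. Severities, routes, and the 5% rate are illustrative.

KEEP_SAMPLE_RATE = 0.05
NOISY_ROUTES = {"/healthz", "/readyz"}

def should_keep(record: dict) -> bool:
    if record.get("severity") in ("ERROR", "FATAL"):
        return True                               # always keep failures
    if record.get("route") in NOISY_ROUTES:
        return False                              # drop health checks outright
    return random.random() < KEEP_SAMPLE_RATE     # sample ordinary telemetry

records = [
    {"severity": "INFO", "route": "/healthz"},
    {"severity": "FATAL", "route": "/checkout"},
    {"severity": "INFO", "route": "/api/items"},
]
kept = [r for r in records if should_keep(r)]     # the FATAL record always survives
```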

This solves the “too much irrelevant data” problem. Storage costs go down because you’re not keeping everything. AI agents get cleaner data because the noise has been filtered out.

However, it doesn’t solve the “missing critical debugging data” problem, nor does it address ingestion costs. Also, if the collector drops a request because it looked unimportant, and that request turns out to be the one that caused a production incident, the data is gone.

Approach 2: AI agents embedded in observability platforms

Other vendors are building AI directly into their platforms: agents that live inside the tool and have native access to all the telemetry data collected there.

[App]
→ [Collect everything]
  → [Observability Platform Storage]
    → [Built-in AI Agent queries platform data]
      → [AI answers questions using available data]

This partially solves the data correlation problem: if a vendor collects session replays, metrics, logs, and traces in one place, their AI agent can correlate across all of it without needing to query multiple APIs.

However, we’re back to the “too much irrelevant data” problem. Missing data is still missing, and now we also have vendor lock-in to contend with.

Approach 3: session-based, on-demand collection

There’s a third approach that flips the model: instead of collecting everything and deciding what to keep later, only collect what you need, when you need it, for the specific context that matters.

The model looks like this:

[User reports bug]
→ [Trigger recording for user-specific session]
  → [Capture everything: frontend + backend + request/response payloads]
  → [Auto-correlate by session]
    → [AI gets complete context]

This solves for ingestion and storage costs, and everything captured during a session is already correlated by default, with nothing sampled away or redacted after the fact.
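A minimal sketch of that trigger-and-capture flow, with hypothetical names throughout; the point is that full-fidelity capture is scoped to a single flagged session:

```python
# Hypothetical sketch of on-demand, session-scoped capture: full payloads are
# recorded only for sessions explicitly flagged after a bug report.

recording: set[str] = set()          # session IDs currently being recorded
captured: dict[str, list] = {}       # session ID -> fully correlated events

def start_recording(session_id: str) -> None:
    """Flipped on when a user reports a bug (or a trigger condition fires)."""
    recording.add(session_id)
    captured[session_id] = []

def handle_request(session_id: str, request: dict, response: dict) -> None:
    """Unflagged sessions pass through untouched; flagged ones get everything."""
    if session_id in recording:
        captured[session_id].append({
            "request": request,      # full payload, not a sampled summary
            "response": response,    # including external API exchanges
        })

start_recording("sess-42")
handle_request(
    "sess-42",
    {"path": "/checkout", "body": {"amount": 999}},
    {"status": 502, "body": {"error": "upstream timeout"}},
)
# captured["sess-42"] now holds complete, already-correlated context to hand
# to a human or an AI agent.
```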


AI debugging has a data problem

These are indisputable facts:

  1. We’re collecting too much irrelevant (and costly) telemetry data.
  2. We’re missing critical debugging data due to instrumentation costs and siloed tooling.
  3. AI tools need better data (unsampled and session-correlated) to be useful when debugging.

What’s missing is the recognition that this is fundamentally a data and context problem, not an AI problem. You can’t fix it by throwing smarter models at bad data.

AI agents are only as good as the data they can access. Giving an agent a firehose of sampled, noisy telemetry doesn’t help. Giving it correlated, complete session data, where every piece is already linked and nothing is missing, changes the game.

Instead of “here are some logs and a trace ID, figure out what happened,” you can hand the AI: “here’s the full recording of exactly what this user did, what the system sent and received, and where it broke.”

The path forward is collecting the right data and context, at the right time, in a format that’s already correlated and ready to use, whether the consumer is a human debugging an issue or an AI agent helping them do it faster.