The Missing Sauce for Your Agents
A thought piece on engineering culture, agentic systems, and what separates great products from everything else
There is a ritual that plays out in engineering teams across the world, dozens of times a day. A pull request gets merged. A deployment goes green. Someone posts a celebratory emoji in Slack. The team moves on. This moment is treated as the finish line. Velocity is celebrated. Sprint points are counted. Roadmaps inch forward. And somewhere, a user is silently having a broken experience that no one on the team will ever know about.
The best engineering teams in the world don’t build this way. For them, shipping is not the finish line. It is the starting gun. The real work, the unglamorous, obsessive, never-ending work, begins the moment code touches production. And the teams that understand this at a cultural level are the ones that build products that are genuinely, durably great. The central question that separates good products from great ones isn’t “how fast can we ship?” (especially with agents, everyone can ship fast). It is “how deeply do we understand what our product is doing right now, for real users, in the real world?”
The Obsession No One Talks About
Ask most people what makes a great engineering team, and you’ll hear answers about talent, technical architecture, agile processes, or engineering velocity. These things matter. But they are not the differentiator. They are table stakes.
The defining characteristic of truly great engineering cultures is something far less glamorous: an obsession with knowing. Knowing what users are experiencing. Knowing where the system is struggling. Knowing what paths people take, where they abandon, where they get confused, where something silently breaks. This obsession is not a project that gets prioritized occasionally; it is a constant, daily, cultural practice that permeates everything the team does. Dashboards are their morning newspaper. Engineers don’t start the day by opening email or jumping into tickets; they open dashboards. Not because someone told them to, but because they are genuinely curious about what happened while they were sleeping. Metrics are their shared language, the way the team talks about the product, makes decisions, and argues about priorities. User sessions are not something only the product manager watches; engineers watch them too, because stories are often hiding in the recordings that no metric will ever surface.
What This Obsession Actually Looks Like in Practice
Great engineering teams don’t just set up alerts and wait for them to fire. They go looking for trouble. Proactively. Regularly. Engineers manually scroll through session replays not because there’s a known bug, but because they want to see what users are actually doing and that observation always, without exception, surfaces something worth fixing. They read system logs the way a careful reader reads a novel, looking for patterns and anomalies that no automated alert was configured to catch, because you can only configure alerts for problems you already know about.
The majority of engineering effort in these teams isn’t writing new features. It is studying behavior, closing feedback loops, and making the existing product more reliable, more understandable, and more predictable. Internal tooling gets built not to ship to customers, but to give the engineering team better visibility into what is happening. Dashboards are not the exclusive domain of the on-call engineer; they are open on screens across the team, all day, because everyone feels a shared responsibility for what the system is doing right now.
When something goes wrong, the response isn’t just “fix the bug and deploy.” It is “understand why this happened, understand who was affected, understand what the user experienced, and build a check so this class of problem is visible earlier next time.” Every incident is a deposit into institutional knowledge. Every user session reviewed is a window into reality that no design doc or product spec could have predicted.
This is what the obsession looks like in practice. Not a culture of heroes fixing fires. A culture of detectives who are perpetually curious and perpetually unsatisfied.
When Software Costs Nothing, Quality Is Everything
Here is where this conversation gets urgent in a way it wasn’t five years ago. The cost of writing software is collapsing toward zero. AI-assisted development means that features that used to take weeks take days, and features that used to take days take hours. The marginal cost of shipping code is approaching zero in a way that would have been unimaginable a decade ago. This is genuinely exciting, and it is also genuinely dangerous, because it means the old competitive advantages are evaporating fast. You cannot compete on “we built it faster” anymore, because everyone can build fast. You cannot compete on “we have more features,” because features are cheap now. The differentiator, the only moat that is actually hard to replicate, is reliability, trust, and quality. It is the product that works, consistently, under pressure, in edge cases, at scale, every time. That is hard. That is what takes years of compounding investment in monitoring and feedback loops to achieve. And that is exactly what most teams are not doing.
When the cost of building goes to zero, the value of what you build is determined almost entirely by how well it works and how much users trust it. In that world, your monitoring culture and your feedback loops are not an engineering function; they are a business strategy. Teams that have spent years building this obsession will have an enormous, durable advantage over teams that treated monitoring as an afterthought.
Ship fast, watch obsessively, and compound your reliability over time.
Agents Are a Different Beast Entirely
Everything above applies to traditional software. For agentic systems, the stakes are higher by an order of magnitude, and most teams building agents today are dramatically underestimating this.
A button either works or it doesn’t. A form submits or it fails. Traditional software fails loudly, visibly, in ways that users notice and developers can track. Agentic systems fail in an entirely different way: they fail quietly, plausibly, and with apparent confidence. An agent isn’t a single operation; it is a chain of decisions, each one building on the last. Each step in that chain can introduce a subtle error that compounds downstream. The agent might complete a task. It might even appear to complete it correctly. But somewhere in the middle of that decision chain, it took a wrong turn that the user won’t discover for hours, or days, or maybe never. It might have sent an email that wasn’t quite right, booked the wrong thing, summarized a document with a critical omission, or made an assumption that seemed reasonable but wasn’t. This is categorically different from a broken UI or a failed API call. There is no error code. There is no red screen. The agent just quietly did the wrong thing with full conviction. And if you are not watching deeply, obsessively, at the behavioral level, you will have no idea it is happening.
For agentic systems, behavioral monitoring is not merely important. It is existential. You aren’t just monitoring uptime and error rates. You are monitoring intent alignment: did the agent do what the user actually needed? You are monitoring decision quality: did the reasoning chain hold up? You are monitoring task completion integrity: did the outcome match the goal? These are not metrics that come out of the box. They require teams to build deep, custom, thoughtful observability around the specific behaviors of their agents. And almost no one is doing this well yet.
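To make this concrete, here is a minimal sketch of what custom behavioral checks might look like over a recorded agent session. Everything in it is hypothetical, the `AgentStep` and `Session` records and both check functions are invented for illustration, and a real system would back the harder questions (intent alignment, decision quality) with LLM graders and human review rather than simple rules.

```python
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    action: str       # e.g. "tool_call", "message", "final_answer"
    tool: str | None  # which tool was invoked, if any
    summary: str      # the agent's own description of the step

@dataclass
class Session:
    user_goal: str
    steps: list[AgentStep] = field(default_factory=list)
    outcome: str = ""

def check_completion_integrity(session: Session) -> bool:
    """Task completion integrity, crude version: did the agent
    actually produce a final answer at all?"""
    return any(s.action == "final_answer" for s in session.steps)

def check_tool_boundaries(session: Session, allowed: set[str]) -> list[AgentStep]:
    """Decision quality, crude version: flag every step that used a
    tool this task was never supposed to touch."""
    return [s for s in session.steps
            if s.action == "tool_call" and s.tool not in allowed]
```

Even checks this crude surface failures that uptime dashboards never will; the point is that they are written against your agent's specific behaviors, not against generic infrastructure signals.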
For Agents, “Good Enough” Monitoring Is a Lie
The standard monitoring playbook (p99 latency, error rates, uptime dashboards) is not just insufficient for agentic systems. It is actively misleading. A system can have perfect uptime and a 0% error rate while an agent is systematically doing the wrong thing for every single user. This is the uncomfortable truth that teams building agents need to confront: your existing monitoring tells you whether your system is running. It tells you almost nothing about whether it is working.
What agentic systems actually require is a completely different category of observability. You need to know what the agent decided at each step, and whether that decision was appropriate for the context. You need to know where the agent deviated from the expected path, and whether that deviation was a reasonable adaptation or a failure mode. You need to know what the downstream consequence of each decision was. And you need to review sessions manually, at significant volume, because there is no automated system sophisticated enough to catch every subtle behavioral failure that matters to users.
The best teams use a combination: automated behavioral evaluations, human review pipelines, red-teaming, and structured logging of reasoning chains. They treat every unusual agent decision as a signal worth investigating. They build internal tools specifically designed to make agent behavior legible to engineers and product teams. They are paranoid in the best possible way.
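As one illustration of that last item, structured logging of a reasoning chain can start as simply as one JSON record per decision step. This is a sketch under assumptions: the `log_step` helper and its field names are invented here, not a standard schema.

```python
import json
import time
import uuid

def log_step(session_id: str, step_index: int, decision: str,
             context: dict, output: str) -> None:
    """Emit one JSON line per agent decision so every step of the
    reasoning chain is reviewable after the fact."""
    record = {
        "ts": time.time(),
        "event_id": str(uuid.uuid4()),
        "session_id": session_id,
        "step": step_index,
        "decision": decision,  # what the agent chose to do at this step
        "context": context,    # what it knew when it chose
        "output": output,      # what the step produced
    }
    print(json.dumps(record))  # stdout here; any log pipeline in practice

# Example: recording a single tool-call decision mid-task
log_step("sess-42", 3, "call_tool:search_flights",
         {"user_goal": "book the cheapest nonstop"}, "found 4 options")
```

One line per decision is trivially greppable, loadable into any log pipeline, and, crucially, readable by a human scrolling through a session.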
The Case for a Real-Time Reliability Engine
Here is where we want to push the argument even further, because we believe the industry is not thinking big enough yet. Monitoring, even the deep, obsessive, behavioral kind we’ve been describing, is still fundamentally retrospective. You watch what happened. You review what the agent did. You find the failure, trace it back, and fix it for next time. This is enormously valuable, and most teams aren’t even doing it. But for agentic systems operating at scale, on behalf of real users, with real consequences, retrospective monitoring is not the ceiling. It is the floor.
We would argue that the next frontier for teams building serious agentic products is a real-time reliability engine: a dedicated layer that doesn’t just watch what the agent is doing, but actively intervenes in the moment.
Think about what this means in practice. As the agent moves through its decision chain (retrieving context, reasoning, taking actions, calling tools, producing outputs), the reliability engine is running alongside it, evaluating each step in real time. It is asking: does this action align with the user’s actual intent? Does this decision fall within acceptable boundaries? Is this output coherent and consistent with what came before? And critically, if the answer to any of those questions is no, the reliability engine doesn’t just log it for later review. It intervenes. It halts the action before it is taken. It surfaces a check to the user before something irreversible happens. It triggers a correction mid-task rather than waiting for a failure report.
This is a fundamentally different architecture from monitoring. Monitoring assumes a separation between observation and action: you watch, you log, you review, you fix. A real-time reliability engine collapses that separation. Observation and intervention are simultaneous. The system is not just aware of what the agent is doing; it is an active participant in ensuring what the agent does is trustworthy.
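Here is a toy sketch of that collapsed loop, with every name invented for illustration. The single guard below is deliberately trivial (flag anything irreversible); a production engine would run calibrated, low-latency evaluators instead. The shape is the point: every proposed action passes through guards that can allow, confirm, or block before it executes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Verdict(Enum):
    ALLOW = "allow"
    CONFIRM = "confirm"  # surface a check to the user before acting
    BLOCK = "block"      # halt the action before it is taken

@dataclass
class ProposedAction:
    kind: str            # e.g. "send_email", "delete_records"
    target: str
    irreversible: bool

Guard = Callable[[ProposedAction], Verdict]

def irreversibility_guard(action: ProposedAction) -> Verdict:
    """Anything irreversible gets surfaced to the user first."""
    return Verdict.CONFIRM if action.irreversible else Verdict.ALLOW

def run_with_guards(action: ProposedAction, guards: list[Guard],
                    execute: Callable[[ProposedAction], None]) -> str:
    verdicts = [guard(action) for guard in guards]
    if Verdict.BLOCK in verdicts:
        return "blocked"      # trigger a mid-task correction instead
    if Verdict.CONFIRM in verdicts:
        return "awaiting_confirmation"
    execute(action)           # observation and action happen together
    return "executed"
```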
Consider the kinds of failures this makes it possible to prevent. An agent about to send an email to the wrong recipient. An agent about to delete records it was not authorized to touch. An agent that has drifted mid-task, hallucinated a fact three steps back, and is now confidently executing downstream actions based on that hallucination. An agent that is technically completing the task it was given, but in a way that the user, if they could see the plan, would immediately want to stop. These are not hypothetical edge cases; they are the kinds of failures that erode user trust permanently, often without the user ever understanding exactly what went wrong. A real-time reliability engine is the architecture that prevents them before they become user experiences at all.
Building this is hard. It requires an evaluation and intervention layer that runs with low enough latency not to destroy the user experience, is sophisticated enough to catch subtle behavioral failures and not just obvious ones, and is calibrated carefully enough to intervene when it should without over-blocking the agent’s legitimate autonomy. These are genuinely difficult engineering problems. But they are the problems that the best teams building production-grade agentic systems will be solving over the next few years, and the teams that solve them first will build agents that users actually trust with consequential tasks.
Monitoring tells you what went wrong. A real-time reliability engine stops it from going wrong in the first place. For agents acting in the world on behalf of users (sending communications, making decisions, triggering workflows, handling sensitive data), the difference between those two things is the difference between a product users cautiously tolerate and a product users genuinely trust.
Rewire Your Culture
Here is the question worth sitting with honestly: what percentage of your engineering team’s effort right now is going into understanding what your product is actually doing, in production, for real users, today? Not building new features. Not refactoring old code. Not planning the next sprint. Understanding. Watching. Reviewing. Closing loops.
If the answer is less than a third, you are building blind. You are shipping into a void and hoping for the best. You are accumulating silent failures, failures users are experiencing right now, that your team will never know about unless you change something fundamental about how you operate. The missing sauce is not a tool, a framework, or a methodology. There is no library you can install that fixes this. It is a cultural shift, deep, uncomfortable, and essential: from “ship and move on” to “ship and watch obsessively.” It means promoting engineers who are great at understanding system behavior, not just shipping features. It means building dashboards before you build features. It means reviewing user sessions as a team ritual. It means treating every anomaly as a mystery worth solving rather than an edge case to be ignored. And for agentic systems, it means accepting that behavioral observability is not a product feature; it is the foundation without which nothing else you build can be trusted.
Your agent probably sucks right now. Not because your engineers aren’t talented. Not because your architecture is wrong. But because you’re not watching it closely enough to know exactly how it sucks, which means you have no idea how to make it better.
Start watching. That is the missing sauce.