Dispatch №04Mar 13, 202625 min read

Stop Shipping Code. Start Building Factories.

The next generation of software companies will not be defined by the code they write, but by the automated assembly lines they build.

By Nivedit JainOriginally published on jainnivedit.substack.com

Contents · 6 sections

01II. The Toyota Moat
02III. Inside the Factory: Anatomy of an AI-Driven Assembly Line
03IV. The Burden of Proof (Evidence as the Factory’s Currency)
04V. Quality and Reliability as First-Order Concerns
05VI. The Sensors on the Floor (Observability as Factory Instrumentation)
06VII. The New Role of the Engineer

We are living through what should be a golden age of software delivery. AI coding assistants can scaffold a microservice in seconds. Agents write tests, draft pull requests, and suggest architectural patterns before a human engineer has finished their morning coffee. The raw act of generating working code, logic, structure, syntax has been largely automated. By every intuitive measure, the pace of software improvement should be compounding exponentially.

So why aren’t customer experiences getting dramatically better, faster?

This question deserves a serious answer, because the disconnect is real and it is growing. The gap between what AI can generate and what companies can actually ship to users is widening, not narrowing. Understanding why requires an honest breakdown of where software development time actually goes.

Breaking Down the 60%

Here is the uncomfortable truth that most engineering organizations have never had to confront directly: writing code has always been the cheapest part of building software.

Writing code accounts for roughly 20% of a typical feature’s journey from conception to customer. The remaining 80% is something else entirely, process. The bureaucracy of correctness. Every gate that exists not to produce software, but to verify it.

Consider what that 80% actually contains. There is the CI pipeline: the battery of automated checks that run after every commit, which at many mature organizations takes 20 to 45 minutes per run, sometimes longer. There is the security review, often a manual process where a human reads through changes looking for vulnerabilities, injection points, or unintended data exposure. There is the QA process, functional testing, regression testing, edge case validation all of which historically required human judgment to define scope and interpret results. There is infrastructure provisioning, which in cloud environments might be automated in theory but in practice involves configuration reviews, cost approvals, and environment parity checks. And then there is the deployment window itself: the scheduled, nerve-wracking moment when code finally moves to production, often gated by a change advisory board or a release manager coordinating across teams.

Each of these steps exists for a reason. They are not bureaucratic waste. They are the accumulated scar tissue of every outage, every data breach, every regression that deleted a customer’s data in production. The 80% is the craft of software engineering, expressed as process.

Now consider what AI has done to this picture. It has turbocharged the 20%. Agents can generate code faster, in greater volume, with fewer syntactic errors than most human engineers. What it has not done is touch the 80% that surrounds that code. The result isn’t faster software. It is a traffic jam, more raw material entering a factory that has not been redesigned to process it.

II. The Toyota Moat

To understand where software engineering is headed, it helps to look at an industry that crossed a similar inflection point fifty years ago: automotive manufacturing.

In the 1970s, Western automakers were obsessed with the product, the car. Design cycles, engine specifications, horsepower numbers. Toyota, operating with fewer resources and less market share, made a different bet. They became obsessed with the system that built the car.

The Toyota Production System was not a manufacturing tweak. It was a complete philosophical reorientation: the factory floor as the primary competitive artifact. Kanban boards to manage work-in-progress and prevent overproduction. Just-in-time inventory to eliminate waste and surface bottlenecks in real time.

Competitors could study Toyota’s car designs. They could copy the specifications. What they couldn’t replicate was decades of accumulated process intelligence, embedded in every stage of the assembly line, held in the tacit knowledge of every engineer who had spent years tuning the factory’s internals. The factory was the moat. Not the car.

Software is arriving at exactly this inflection point.

The companies that will define the next decade of software are not going to win because they have the most talented individual engineers writing the best code by hand. They are going to win because they invested in building the most rigorous, automated, evidence-demanding Software Factories and they treated the factory itself as the irreplaceable, compounding, defensible asset.

The software itself is no longer the primary engineering innovation. It is the system that builds the software that will define winners and losers.

III. Inside the Factory: Anatomy of an AI-Driven Assembly Line

Recasting software development as a factory problem changes everything you optimise for. The factory does not begin with code. It begins with intent the structured understanding of what needs to be built and why, which arrives from both human judgment and machine-generated insight about where the live system is failing its users. What emerges at the far end of the factory, if it is doing its job, is verified, secure, reliable software that addresses a known, evidence-grounded need. Between those two points lies the entire substance of modern engineering excellence.

Let us walk the factory floor.

Stage 1: The Intake (Requirements as the True Starting Point)

The most important conceptual correction to make before walking the factory floor is this: the factory does not begin with code. It begins with requirements and in the Software Factory model, requirements arrive from two fundamentally different origins that together form a closed loop.

The first origin is the human. Product managers, designers, and domain experts articulate functional intent: what the system should do, who it is for, what constraints it must respect. This is the deliberate, forward-looking half of the requirements pipeline. It is the human exercising judgment about what to build next.

The second origin is the production environment itself. An always-on Trace Analysis Agent continuously ingests the telemetry streaming from live systems user session traces, error rates, drop-off points in flows, latency spikes at specific interaction steps, support ticket patterns and reasons over them to surface where customers are actually struggling. Not where the product team thinks customers are struggling. Where the data shows they are, with specificity: checkout completion rate drops 34% at the address validation step on mobile; users are retrying the document upload endpoint an average of 2.3 times before success; session traces show 18% of users abandoning mid-onboarding after hitting a specific error state that resolves on reload.

This agent doesn’t just surface observations. It generates structured tickets pre-scoped, prioritized by impact signal, and tagged with the relevant trace evidence and posts them directly to the task board. These are not vague “improve the experience” asks. They are grounded, evidence-backed specifications of where the system is failing its users right now.

Both streams feed into a Requirements Intake stage with a Human-in-the-Loop gate. A human reviews the generated tickets alongside the manually authored requirements, validates prioritization, resolves conflicts, and approves the work queue. The HITL gate here is not a bottleneck it is the point at which human judgment about product strategy intersects with machine-generated intelligence about product reality. The human isn’t generating the backlog from scratch. They are governing a backlog that the factory is substantially writing itself.

Only once requirements have cleared this gate does the factory produce its first line of code. The code generation agent receives a scoped, approved, evidence-grounded specification, not a vague instruction and produces the raw artifact that will enter the assembly line. Code, in this model, is the output of understanding requirements, not the starting point of the factory’s work.

Stage 2: The Heavy Machinery (Massive CI Pipelines)

Once the code generation agent produces a raw artifact from an approved, evidence-grounded requirement, it enters the assembly line’s heaviest machinery. This is where the factory earns its name.

The modern Software Factory runs multi-stage, compute-intensive CI pipelines that are designed specifically to handle the volume and variance that AI-generated code introduces. Where a human engineer might contribute two hundred lines of code in a day, an agent working at scale might produce ten times that volume. The factory must absorb this without sacrificing the signal quality of any verification stage.

These pipelines are not just faster versions of old pipelines. They are fundamentally richer. They execute thousands of unit tests in parallel across multiple environment configurations. They run static analysis not just for syntax errors but for architectural violations, dependency risks, and complexity thresholds. They profile performance characteristics (memory allocation, query execution plans, latency distributions etc.) against a baseline and reject changes that degrade them beyond defined tolerances. They run security scanning tools that flag known vulnerability patterns, outdated dependency versions, and insecure cryptographic primitives.

This is not overhead. This is the factory floor. Every stage is a quality gate that exists because the cost of defects compounds the further downstream they travel.

Stage 3: The Autonomous Task Board

The Jira ticket is undergoing a quiet but significant transformation. In the factory model, task boards are not tools for human engineers to track their own work. They are the operational ledger of the factory itself, primarily written, updated, and resolved by agents.

When a feature enters the factory, an orchestrating agent doesn’t just create a single PR. It decomposes the work into a precise sequence of verifiable intermediate steps and logs each one as it executes. The task board becomes a live audit trail: branch published, PR created, unit tests executed, positive test suite passed, negative test suite passed, regression baseline confirmed, data profiling completed, security checks initiated, security checks passed, DevOps compatibility verified, review requested, review completed, integration test suite passed, QA sign-off recorded.

Each of these is not a status label. It is a claim, attached to evidence. The agent doesn’t mark “unit tests passed” it attaches the test run report, the coverage delta, and the execution log. The task board becomes a structured, query able record of everything the factory did to this artifact and what proof it produced.

Humans reading the board are not managing work. They are auditing a factory run.

Stage 4: The Micro-Agent Verification Layer (Blind by Design)

This is the design insight that separates a genuine Software Factory from a sophisticated autocomplete loop, and it is worth spending time here because it is counter-intuitive.

In a well-designed Software Factory, the agents responsible for verification are strictly segregated by function and by information access. The QA agent, the SecOps agent, and the DevOps agent do not share prompts, do not share chain-of-thought reasoning, and cannot see each other’s verdicts until they have independently committed to their own. They share only one thing: the artifact under review.

The isolation is not a paranoid quirk. It is a principled engineering decision, and the reasoning is worth making explicit.

Consider what happens in the alternative. If the QA agent knows that the SecOps agent has already approved a PR, it is subtly incentivized to rationalize away a borderline edge case. After all, if security is satisfied, the risk feels lower. This is confirmation bias in agent form, and it degrades the factory’s output quality in ways that are very difficult to detect until something breaks in production.

Blind verification eliminates this dynamic entirely. Each agent is an adversary, not a collaborator. The SecOps agent doesn’t know the feature is elegant, that the deadline is Friday, or that the QA agent already approved the PR. It knows only the code and its mandate: find security vulnerabilities. Full stop.

Human-in-the-loop gates appear at designated checkpoints: when the agent collective has reached unanimous consensus and the change crosses a defined risk threshold, or when the agents reach an unresolvable conflict. Humans are not in the flow of production. They are supervisors of the floor, intervening when the machinery surfaces something that genuinely requires human judgment.

IV. The Burden of Proof (Evidence as the Factory’s Currency)

When code generation is effectively free, the scarcest resource becomes verified correctness. The factory does not operate on trust, not in agents, not in prior approvals, not in the reputation of the engineer who wrote the original logic. It operates on evidence.

This is a philosophical shift as much as a technical one. In the artisanal model of software engineering, a senior engineer’s approval carried weight because you trusted their judgment. In the factory model, the factory trusts no judgment, only proof. This is not a downgrade. It is a more honest acknowledgment that judgment, however skilled, is fallible at scale.

But here is what nobody tells you when they sketch this model on a whiteboard: building these systems of evidence is extraordinarily hard. It is, arguably, the hardest engineering problem the industry has collectively underinvested in. The concept sounds clean. The execution is brutal.

Why Evidence Systems Are So Difficult to Build

The difficulty starts with a deceptively simple question: what does “proven correct” actually mean for this specific system, in this specific organization, at this specific point in its evolution? There is no universal answer. Every codebase carries years of implicit assumptions about data shapes, about timing guarantees, about which callers are trusted, about which invariants the team has tacitly agreed to preserve but never written down. Before you can build a system that proves these properties hold after every change, you first have to surface and formalise properties that have existed only in engineers’ heads. That archaeology project alone mapping the undocumented behavioural contract of a mature system is months of work before a single automated check is written.

Then there is the problem of test quality. Most engineering organizations discover, when they try to build rigorous evidence systems, that their existing test suites are far weaker than they believed. Coverage numbers are misleading a line of code being executed by a test is not the same as its behaviour being verified by a test. A test that asserts response.status == 200 and nothing else covers the line but proves almost nothing. Building a test suite that constitutes genuine evidence of behavioural correctness requires rethinking tests not as a developer’s sanity check but as a formal specification of expected behaviour and then systematically filling the gaps between what is claimed and what is actually asserted. At scale, with AI-generated code arriving continuously, keeping this test suite coherent and non-contradictory is itself a significant engineering discipline.

The non-regression problem compounds this. Proving that existing behaviour is unchanged sounds like running the existing test suite but that is only proof that the code behaves the same way on the paths the tests already cover. AI-generated changes can introduce subtle behavioural shifts in edge cases that no existing test exercises, in data access patterns that no profiling check monitors, in performance characteristics that only manifest under specific load distributions. Building true regression evidence requires not just running tests but continuously expanding the scope of what is tested as the system grows which means the evidence infrastructure itself must compound in sophistication alongside the system it is certifying.

Security and data integrity evidence is harder still. Proving that no unintended data is reachable through a new code path requires a complete, up-to-date model of every data access pathway in the system which in a microservices environment, with dozens of services calling each other through APIs and event streams, is itself a non-trivial graph to maintain. A single change in a data transformation layer can create an unintended exposure several service hops away from where the change was made. Detecting this automatically means having instrumentation not just in the changed service but across every service that might be affected and having a baseline model of permitted data flows precise enough that deviations can be detected with low false-positive rates. Too many false positives, and engineers start dismissing alerts; too many false negatives, and the gate provides false confidence. Calibrating this is genuinely difficult.

And then there is the organisational challenge, which is perhaps the hardest of all. Evidence systems require discipline from everyone who touches the factory. An agent that is allowed to mark a check as “not applicable” without justification, or a human reviewer who approves a PR with incomplete evidence because the deadline is Friday, or a team that carves out an exception for a legacy module “just this once” each of these erodes the integrity of the entire system. The value of evidence gates is that they are unconditional. The moment they become conditional on circumstance, they stop being evidence and start being theatre. Maintaining that unconditional discipline, across teams, across time, against the constant pressure of shipping velocity, is a leadership and culture challenge as much as an engineering one.

This is why the evidence infrastructure is the factory’s most valuable, most hard-won, and most defensible asset. A competitor can adopt the same CI tools. They can prompt similar agents. They cannot easily replicate the two years of iteration, calibration, and organisational discipline that produced a system of evidence gates that genuinely mean something when they pass.

Before a PR is merged, the agent collective must produce three categories of concrete, inspectable evidence.

Proof of Functionality means a passing, comprehensive behavioural test matrix not just the happy path. The agent must demonstrate that the feature does exactly what it claims, and that it doesn’t do what it doesn’t claim. Positive tests confirm the intended behaviour across normal inputs and expected edge cases. Negative tests confirm that malformed inputs, boundary violations, and invalid states are handled gracefully rather than catastrophically. The coverage threshold is not a soft target. It is a gate.

Proof of Non-Regression is the factory’s immune system. Software systems accumulate surface area over time thousands of features, API contracts, data transformations, and integration points, all of which need to keep working when new code lands. Before any merge, the factory must produce cryptographic or log-based confirmation that the existing feature surface is intact. Not “probably intact.” Not “we ran the usual suite.” Demonstrably intact, with an auditable record. This is particularly critical in the AI-generated code era, where an agent making a structural change might affect ten downstream behaviours it was never asked to consider.

Proof of Security and Data Integrity is the hardest gate and, increasingly, the most important one. The factory runs an automated profiling pass over data access patterns, API exposure surfaces, and PII handling pathways. Any change that introduces new data access must produce an attestation that no unintended data flows have been created, no personally identifiable information is reachable through new vectors, and no new attack surface has been opened. This is not a security team reviewing a PR. It is the factory producing machine-readable proof.

If evidence is lacking at any gate, the artifact is rejected and returned. The factory does not ship defects. It surfaces them early, when they are cheap to fix, rather than late, when they are expensive to explain.

V. Quality and Reliability as First-Order Concerns

There is a version of the future where AI-generated code creates a world of extremely fast, extremely brittle software systems that appear to work in demos and collapse under production load, edge cases, or adversarial inputs. This is the failure mode of a world that optimised for code generation speed without building the factory to match.

The Software Factory is the answer to this failure mode, but only if quality and reliability are treated as first-order design constraints of the factory itself, not as checklist items applied at the end.

Quality in the factory model is not a vibe. It is a set of measurable, enforceable, machine-checkable properties: test coverage above a threshold, performance regression within a tolerance, security posture unchanged or improved, data integrity provably maintained. These properties are binary at the gate. The code either demonstrates them or it doesn’t. There is no “mostly secure” or “probably won’t regress.”

Reliability deserves its own paragraph because it is the property most likely to be underweighted in the excitement around AI generation velocity. And here the story gets more interesting, because reliability in the factory model is not only a property of the code it is a property of the runtime environment that code is deployed onto.

This is a distinction that matters enormously. You can have a perfectly verified, fully evidence-backed artifact emerge from your factory and still have it fail in production because the distributed system it runs inside is inherently unreliable. Networks partition. Processes crash mid-execution. A payment workflow that was three steps in gets interrupted and nobody knows whether the charge went through. And then there is the failure mode that is entirely new to this era: a customer-facing agent hallucinated. It confidently told a user their refund was processed. It wasn’t. It summarised a policy that doesn’t exist. It took an action and booked a flight, cancelled a subscription, submitted a form based on a model output that was plausible but wrong. Unlike a network partition, which is deterministic and recoverable, a hallucination leaves no obvious error signal. The system didn’t crash. It just did the wrong thing with complete confidence, and the runtime had no mechanism to detect the divergence.

In the artisanal engineering world, handling the infrastructure failure modes was the job of the individual developer, who would write retry logic, idempotency keys, and compensating transactions by hand, usually after a production incident taught them they needed to. The hallucination problem didn’t exist at all. Now it sits alongside the rest, demanding the same rigour: the runtime must be capable of detecting when an agent’s output violates known constraints, flagging the divergence explicitly, and either triggering a compensating action or escalating to a human before the bad state propagates further downstream.

The factory model demands something better: runtimes that make reliability a first-class, structural guarantee rather than a developer responsibility. This is precisely what platforms like Temporal have demonstrated is possible for deterministic software. Temporal’s durable execution, deterministic workflows, automatic state persistence across failures means that when a process crashes, the runtime doesn’t lose the execution context. It resumes from exactly where it left off, with a complete, inspectable history of every step it took. If something failed, there is an explicit record of what failed, what the mitigation was, and how the workflow recovered. The factory ships code; the runtime holds it accountable after it ships.

But Temporal was designed for a world of deterministic code. Its guarantees are powerful precisely because the functions it orchestrates, given the same inputs, will always produce the same outputs. Agents break this assumption entirely. An agent invoked twice with identical context may reason differently, choose a different tool, or produce a subtly different output and Temporal has no primitive for expressing “this step was taken by a non-deterministic reasoning model; validate its output against these constraints before proceeding.” Its retry semantics, its replay model, its workflow history all of these were built for functions, not for inference and agents.

This is a wide open space. The reliability layer for agentic software largely does not yet exist as a mature, general-purpose platform. What it needs to look like is a runtime that combines Temporal’s durable execution guarantees with agent-native primitives: output constraint validation that fires after every model call, semantic divergence detection that flags when an agent’s response violates the known state of the world, automatic compensating actions when an agent step is found to be incorrect after the fact, and a human escalation path that triggers before bad state propagates rather than after a user reports it. Building this layer the reliability substrate that sits between the factory’s verified artifacts and the production environment’s unpredictable reality is one of the most consequential unsolved infrastructure problems in software engineering right now. The team that builds it well will own a platform primitive that every agentic application will eventually need.

This is the full picture of reliability in the Software Factory: verified code entering a runtime that is architected to be deterministic about failures, transparent about recovery, and auditable about every execution path. The factory builds reliability in. The runtime sustains it.

One more thing needs to be said clearly here, because it is easy to miss in all the discussion of pipelines and verification gates: building an automated factory is not an excuse for degrading the customer experience. If anything, it is the opposite commitment. The entire premise of the factory is that it removes the human bottlenecks and coordination friction that have historically slowed down the cycle from insight to customer impact. A factory that is rigorous, automated, and evidence-driven should be accelerating the pace at which customers experience improvement not giving engineering teams a new set of infrastructure concerns to hide behind.

If the factory is functioning correctly, customers should feel it. Features should ship more frequently. Bugs should be caught before they reach production, not discovered by users. Performance regressions should be stopped at the gate, not patched in the next sprint. The factory is not a slowdown with better tooling. It is the mechanism by which continuous, compounding improvement in customer experience becomes structurally possible because quality and reliability are enforced by the machinery, freeing every human in the loop to think harder about what to build, not whether what they built works.

VI. The Sensors on the Floor (Observability as Factory Instrumentation)

A factory running blind is not a factory. It is a gamble.

Deep observability, distributed tracing, structured logging, anomaly detection, real-time alerting is not a post-production concern in the factory model. It is wired into the factory floor as a first-class component, as essential as the CI pipeline or the micro-agent verification layer. And crucially, in the Software Factory model, observability does not terminate at an alert. It feeds directly back into the requirements intake at the start of the factory, thus, closing the loop entirely.

This is the architecture that separates a factory from a pipeline. A pipeline moves work in one direction: from intention to production. A factory is a closed system: production continuously generates signal that the factory uses to direct its own next actions. The Trace Analysis Agent sitting at the requirements intake is not a separate tool bolted onto the side. It is the mechanism by which the factory learns what to build next from what the live system is telling it. Every session trace, every error pattern, every customer struggle signal that production emits is potential raw material for the next work queue.

When a production anomaly fires, the observability loop does not just page an on-call engineer with an error code and a stack trace. It produces a complete, structured narrative automatically: this latency regression in the payments service traces to a merge that happened two hours ago; the DevOps agent’s evidence for that merge showed no performance regression; here is the discrepancy between what the agent measured in the test environment and what production is showing now. From that narrative, the factory does not wait for a human to take notes. A ticket is automatically created in the task board pre-populated with the anomaly signature, the traced origin, the implicated PR, the relevant agent verdicts, and a suggested remediation scope. The on-call engineer wakes up to a structured brief, not a blank incident.

Deeper still, the factory can dispatch a dedicated RCA (Root Cause Analysis) agent. This agent is not a chatbot answering questions about the outage. It is a reasoning process that ingests the full production trace, the factory lineage of the implicated change, the agent evidence from every gate that approved it, and the historical pattern of similar anomalies in this system and produces a structured hypothesis about what failed, why the factory did not catch it, and what gate or threshold needs to change to prevent the same class of failure from reaching production again. It is the factory auditing itself.

And then there is the slower, quieter loop that the Trace Analysis Agent runs in parallel not responding to acute failures, but continuously reading the signal embedded in normal production behaviour. Where are users retrying? Where are sessions ending without conversion? Where are error states silently recovering in ways that mask user frustration from the surface metrics? These patterns, accumulated over time, become the raw material for improvement tickets that flow back into the requirements intake, pass through the HITL review gate, and eventually re-enter the factory as a new approved work item. The factory is not just reacting to what breaks. It is watching what struggles, and quietly scheduling its own improvement in response. This is the closed loop that distinguishes a Software Factory from a deployment pipeline: production intelligence continuously directs the next round of production work, without waiting for a quarterly planning cycle or a frustrated customer to file a support ticket.

This is how the factory compounds over time not by generating more code, but by becoming more precise, through every incident, in its understanding of what “verified” actually means for this specific system, at this specific scale, in this specific organisation. Each outage makes the sensors more sensitive. Each RCA tightens a gate. The factory that has been running for three years is not just faster than the one that launched last quarter. It is smarter and that accumulated intelligence is not something a competitor can copy by switching to a better AI model.

VII. The New Role of the Engineer

None of this makes the engineer obsolete. It makes the artisan engineer obsolete the one whose primary output is lines of code, who context-switches between writing logic, debugging, responding to Slack messages, and updating Jira tickets by hand.

The engineer who builds and tunes the factory is something fundamentally different. They are a systems thinker. Their domain is pipeline architecture, agent prompt design, gate threshold calibration, observability instrumentation, evidence format definition, and feedback loop integrity. Their output is not a feature. It is factory throughput an invisible multiplier on everything the organization ships.

This is the Toyota insight applied to software. Toyota’s best engineers were not the ones who could assemble a car the fastest by hand. They were the ones who could look at an assembly line and see a bottleneck three stages upstream from where the defects were appearing, redesign the Kanban flow for that stage, and produce a measurable improvement in the quality of every car that line would ever produce.

The Software Factory is the core engineering innovation of the next decade. Not the AI model. Not the code it generates. The factory and the expertise required to build, maintain, and mature it for the specific needs of a specific organization is where defensible, compounding competitive advantage lives.

Companies that understand this early will build factories. Their competitors will still be arguing about which AI coding assistant writes cleaner React.