Andrej Karpathy’s “March of Nines” offers a blunt mathematical case for why enterprise AI teams building on demo-level reliability are, by his framing, nowhere near ready for production.
The core argument: a workflow that succeeds 90% of the time has merely reached the first nine. Each additional nine, moving from 90% to 99%, then 99.9%, then 99.99%, demands roughly the same engineering effort as the last. “Every single nine is the same amount of work,” Karpathy has said.
The math compounds fast. In a 10-step agentic workflow (covering intent parsing, context retrieval, planning, tool calls, validation, formatting, and audit logging), end-to-end success equals per-step reliability raised to the power of 10. A workflow where each step succeeds 90% of the time completes successfully just 34.87% of the time overall. At 10 workflow runs per day, that means roughly 6.5 interruptions daily. The analysis classifies that as “prototype territory.”
Push per-step reliability to 99% and the 10-step workflow still fails roughly once per day. At 99.9% per step, failures drop to approximately once every 10 days — still frequent enough to feel unreliable. Only at 99.99% per step does the workflow hit roughly one failure every 3.3 months, the threshold the analysis describes as “dependable enterprise-grade software.”
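The compounding described above is a two-line calculation. This sketch reproduces the article's figures; the function names are mine, not from the source.

```python
# End-to-end success of an n-step workflow is per-step reliability
# raised to the nth power; expected daily failures follow directly.

def end_to_end_success(per_step: float, steps: int = 10) -> float:
    """Probability that every step in the workflow succeeds."""
    return per_step ** steps

def failures_per_day(per_step: float, steps: int = 10, runs_per_day: int = 10) -> float:
    """Expected failed runs per day at a given run volume."""
    return runs_per_day * (1 - end_to_end_success(per_step, steps))

for per_step in (0.90, 0.99, 0.999, 0.9999):
    print(f"{per_step:.2%} per step -> "
          f"{end_to_end_success(per_step):.2%} end-to-end, "
          f"{failures_per_day(per_step):.3f} failures/day")
```

At 99.99% per step this yields about 0.01 failures per day, i.e. roughly one failure per 100 days, matching the "every 3.3 months" figure.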
Turning Reliability Into Measurable Targets
The framework recommends converting reliability into service-level objectives before choosing any technical fix. Suggested indicators include workflow completion rate, tool-call success within defined timeouts, schema-valid output rate, policy compliance rate covering PII and security constraints, p95 end-to-end latency, cost per workflow, and fallback rate to safer models or human review.
Error budgets should govern experimentation, with SLO targets tiered by workflow impact — low, medium, or high.
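Tiered SLOs translate directly into error budgets. A minimal sketch, assuming illustrative tier thresholds (the specific numbers below are not from the article):

```python
# Illustrative mapping from workflow-impact tier to a target completion
# rate, and the monthly error budget that target implies.

SLO_TARGETS = {      # impact tier -> target workflow completion rate (assumed)
    "low": 0.99,
    "medium": 0.999,
    "high": 0.9999,
}

def error_budget(tier: str, runs_per_month: int) -> float:
    """Allowed failed runs per month before experimentation must pause."""
    return runs_per_month * (1 - SLO_TARGETS[tier])
```

A low-impact workflow running 1,000 times a month could tolerate about 10 failures; a high-impact one at the same volume, only 0.1, effectively forcing escalation paths rather than retries.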
Nine Engineering Levers
The analysis identifies specific controls for adding reliability incrementally, including:
- Constrain autonomy with an explicit workflow graph — a state machine or DAG where each node defines allowed tools, retry limits, and success conditions, with idempotent keys enabling safe replays.
- Enforce contracts at every boundary using JSON Schema or protobuf validation before any tool executes, with canonical IDs and normalized time (ISO-8601) and units (SI).
- Layer validators across syntax, semantics, and business rules — schema checks catch formatting; semantic checks handle referential integrity and numeric bounds; business rules gate write actions and enforce data residency.
- Route by risk using confidence signals, classifiers, or second-model verifiers to direct high-impact actions toward stronger models or human approval.
- Treat tool calls as distributed systems — per-tool timeouts, backoff with jitter, circuit breakers, concurrency limits, and versioned schemas to catch silent API breakage.
- Make retrieval observable as a versioned data product, tracking empty-retrieval rates and coverage metrics to keep responses grounded.
The underlying point is structural. Correlated failures — authentication outages, rate limits, connector instability — tend to dominate in production unless shared dependencies are hardened independently. “When you get a demo and something works 90% of the time, that’s just the first nine,” Karpathy said. The gap between that and software users trust is not a prompt fix.
This article is a curated summary based on third-party sources.