Karpathy’s March of Nines: Why 90% AI Reliability Falls Short

By alex2404

Andrej Karpathy’s “March of Nines” offers a blunt mathematical case for why enterprise AI teams building on demo-level reliability are, by his framing, nowhere near ready for production.

The core argument: a workflow that succeeds 90% of the time has merely reached the first nine. Each additional nine, moving from 90% to 99%, then 99.9%, then 99.99%, demands roughly the same engineering effort as the last. “Every single nine is the same amount of work,” Karpathy has said.

The math compounds fast. In a 10-step agentic workflow — covering intent parsing, context retrieval, planning, tool calls, validation, formatting, and audit logging — end-to-end success equals per-step reliability raised to the power of 10. A workflow where each step succeeds 90% of the time completes successfully just 34.87% of the time overall (0.9^10 ≈ 0.3487). At 10 workflows per day, that means roughly 6.5 failed runs daily. The report classifies that as “prototype territory.”

Push per-step reliability to 99% and the 10-step workflow still fails roughly once per day. At 99.9% per step, failures drop to approximately once every 10 days — still frequent enough to feel unreliable. Only at 99.99% per step does the workflow hit roughly one failure every 3.3 months, the threshold the analysis describes as “dependable enterprise-grade software.”
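
The compounding arithmetic behind these figures is easy to verify directly. A minimal sketch (the function names and defaults are illustrative, not from the article):

```python
# Sketch of the "march of nines" arithmetic: end-to-end success of an
# n-step workflow is per-step reliability raised to the n-th power.

def workflow_success(per_step: float, steps: int = 10) -> float:
    """Probability that all steps in the workflow succeed."""
    return per_step ** steps

def failures_per_day(per_step: float, steps: int = 10, runs_per_day: int = 10) -> float:
    """Expected number of failed workflows per day at a given run volume."""
    return runs_per_day * (1 - workflow_success(per_step, steps))

for r in (0.90, 0.99, 0.999, 0.9999):
    print(f"{r:.2%} per step -> {workflow_success(r):.2%} end-to-end, "
          f"{failures_per_day(r):.3f} failures/day")
```

Running this reproduces the article’s numbers: 90% per step yields ~34.87% end-to-end success and ~6.5 failures per day, while 99.99% per step yields ~0.01 failures per day, roughly one every 3.3 months.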

Turning Reliability Into Measurable Targets

The framework recommends converting reliability into service-level objectives before choosing any technical fix. Suggested indicators include workflow completion rate, tool-call success within defined timeouts, schema-valid output rate, policy compliance rate covering PII and security constraints, p95 end-to-end latency, cost per workflow, and fallback rate to safer models or human review.

Error budgets should govern experimentation, with SLO targets tiered by workflow impact — low, medium, or high.
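
Such tiered targets might be expressed as a small lookup table with an error-budget calculation. A sketch, where the tier names follow the article but the specific thresholds are illustrative assumptions:

```python
# Hypothetical SLO tiers keyed by workflow impact. Thresholds are
# illustrative placeholders, not figures from the source article.
SLOS = {
    "low":    {"completion_rate": 0.99,   "p95_latency_s": 30.0},
    "medium": {"completion_rate": 0.999,  "p95_latency_s": 15.0},
    "high":   {"completion_rate": 0.9999, "p95_latency_s": 5.0},
}

def error_budget_remaining(tier: str, total_runs: int, failures: int) -> float:
    """Fraction of the tier's error budget still unspent (negative = blown).

    The budget is the number of failures the completion-rate SLO
    tolerates over `total_runs` workflow executions.
    """
    allowed = (1 - SLOS[tier]["completion_rate"]) * total_runs
    return (allowed - failures) / allowed
```

A team might freeze risky experimentation once `error_budget_remaining` for a high-impact workflow drops near zero, which is the governance role error budgets play here.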

Nine Engineering Levers

The analysis identifies nine specific controls for adding reliability incrementally, among them:

  • Constrain autonomy with an explicit workflow graph — a state machine or DAG where each node defines allowed tools, retry limits, and success conditions, with idempotent keys enabling safe replays.
  • Enforce contracts at every boundary using JSON Schema or protobuf validation before any tool executes, with canonical IDs and normalized time (ISO-8601) and units (SI).
  • Layer validators across syntax, semantics, and business rules — schema checks catch formatting; semantic checks handle referential integrity and numeric bounds; business rules gate write actions and enforce data residency.
  • Route by risk using confidence signals, classifiers, or second-model verifiers to direct high-impact actions toward stronger models or human approval.
  • Treat tool calls as distributed systems — per-tool timeouts, backoff with jitter, circuit breakers, concurrency limits, and versioned schemas to catch silent API breakage.
  • Make retrieval observable as a versioned data product, tracking empty-retrieval rates and coverage metrics to keep responses grounded.
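
The “treat tool calls as distributed systems” lever can be illustrated with a retry helper. A minimal sketch of capped exponential backoff with full jitter (names and defaults are my own; per-tool timeouts and circuit breakers would sit alongside this in a real client):

```python
import random
import time

def call_with_backoff(fn, *, attempts: int = 4, base: float = 0.5, cap: float = 8.0):
    """Invoke a tool call, retrying transient failures.

    `fn` is any zero-argument callable representing one tool invocation.
    Delays grow exponentially (base, 2*base, 4*base, ...) up to `cap`,
    with full jitter so concurrent retries don't synchronize.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # budget exhausted; surface the failure
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Full jitter (sleeping a random amount up to the backoff ceiling) is a common choice here because it spreads retry storms out in time instead of having every caller hammer a recovering dependency at the same instant.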

The underlying point is structural. Correlated failures — authentication outages, rate limits, connector instability — tend to dominate in production unless shared dependencies are hardened independently. “When you get a demo and something works 90% of the time, that’s just the first nine,” Karpathy said. The gap between that and software users trust is not a prompt fix.


This article is a curated summary based on third-party sources.
