The public narrative about AI in late 2023 is dominated by demos. GPT-4 Vision demos. Code generation demos. Multimodal reasoning demos. These are impressive, and they convey something real about the underlying capabilities. But they tell you almost nothing about what it is like to run AI systems in production at scale.

We spend a lot of time talking to engineering teams who run LLM-based systems in production, at scale, for paying customers. The picture that emerges from those conversations is more complicated, and more interesting, than demo culture suggests.

The Operational Reality

The first thing that teams running production AI report is that the majority of their engineering time is not spent on model integration. It is spent on the surrounding systems: the pipeline that assembles context, the infrastructure that manages rate limits and retries, the evaluation system that detects regressions, the monitoring that surfaces unexpected behavior, and the feedback loops that connect model output quality back to the prompt and retrieval systems that feed it.

The ratio varies by team and product, but a common pattern we hear is something like 20% of engineering time on model integration and 80% on everything else. That 80% is where the production engineering challenge lives, and it is where the infrastructure companies in our portfolio are doing their most important work.
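To make one piece of that surrounding work concrete, here is a minimal sketch of the retry logic that typically wraps a model call: exponential backoff with jitter. It is an illustration under our own assumptions, not any particular team's implementation. `RateLimitError`, `call_with_retries`, and every parameter name and default are hypothetical; a real client library raises its own exception types, which you would catch instead.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            # Out of attempts: surface the error to the caller.
            if attempt == max_attempts - 1:
                raise
            # The cap doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": wait a random time in [0, cap] so that many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

A caller would wrap its own request function, for example `call_with_retries(lambda: client_call(prompt))`, where `client_call` is whatever function issues the request. The `sleep` parameter is injectable only so the backoff can be tested without real waiting.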

The Reliability Gap

The second thing that production teams report consistently is the reliability gap between demo performance and production performance. A capability that works in 95% of demo cases may work in only 70% of the cases that arise organically in a real product used by real users. Closing this gap starts with measuring it: rigorous evaluation against a test set that reflects the real distribution of user inputs, which itself requires significant investment in data collection and annotation infrastructure.

Teams that are doing this well have built what we call an evaluation loop: a continuous process of collecting production inputs, annotating a representative sample, running their system against that annotated sample, and identifying where the system degrades. This is table-stakes infrastructure for any serious AI application. It is also infrastructure that almost nobody has out of the box; it is built, not bought, in the current ecosystem.
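The steps of that loop can be sketched in a few functions. This is a simplified illustration under our own assumptions, not a description of any specific team's system. All names here (`AnnotatedExample`, `sample_for_annotation`, `evaluate`, `find_regressions`), the per-category slicing, the exact-match scoring, and the `tolerance` default are hypothetical; the annotation step itself happens outside the code, and real systems usually need a richer correctness check than string equality.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class AnnotatedExample:
    input_text: str
    expected: str   # the annotator's reference answer
    category: str   # hypothetical slice label, e.g. a query type


def sample_for_annotation(
    production_inputs: Sequence[str], k: int, seed: int = 0
) -> List[str]:
    """Draw a uniform random sample of production inputs to send to annotators."""
    rng = random.Random(seed)
    return rng.sample(list(production_inputs), min(k, len(production_inputs)))


def evaluate(
    system: Callable[[str], str],
    examples: Sequence[AnnotatedExample],
    is_correct: Callable[[str, str], bool] = lambda out, exp: out == exp,
) -> Dict[str, float]:
    """Run the system on each annotated example; return accuracy per category."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        total[ex.category] += 1
        if is_correct(system(ex.input_text), ex.expected):
            passed[ex.category] += 1
    return {cat: passed[cat] / total[cat] for cat in total}


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List categories whose accuracy fell more than `tolerance` below baseline."""
    return sorted(
        cat
        for cat, acc in current.items()
        if cat in baseline and acc < baseline[cat] - tolerance
    )
```

In use, a team would run `evaluate` on every prompt or retrieval change, store the result as the new baseline when it ships, and treat a non-empty `find_regressions` result as a signal to investigate. Slicing by category is one design choice among several; its point is that an aggregate score can hide a sharp drop on one kind of input.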