Evaluating AI Agent Reliability — Hearthstone Ventures

In conventional software, reliability is usually defined in binary terms: the service is up or it is down, the function returns the correct output or it does not, the API call succeeds or it fails. This binary framing works because the system's behavior is, for the most part, deterministic. Given the same input and the same state, you get the same output.

LLM-based agents break this framing. Given the same input, you may get different outputs across runs: some correct, some partially correct, some confidently wrong in a different way each time. Reliability for agentic systems is therefore not a boolean property. It is a success rate that varies across inputs, contexts, model versions, and time, and it has to be measured as such.
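One way to make this concrete is to run the same task many times and report a pass rate with a confidence interval instead of a single pass or fail. The sketch below is illustrative only: `run_agent` and `is_correct` are hypothetical callables standing in for whatever invokes the agent and checks its output, and the Wilson score interval is one reasonable choice of interval, not the only one.

```python
import math
from typing import Callable, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def estimate_pass_rate(
    run_agent: Callable[[str], str],    # hypothetical: invokes the agent once
    is_correct: Callable[[str], bool],  # hypothetical: checks a single output
    task: str,
    n_runs: int = 50,
) -> Tuple[float, float, float]:
    """Run the same task n_runs times; return (pass_rate, ci_low, ci_high)."""
    successes = sum(1 for _ in range(n_runs) if is_correct(run_agent(task)))
    low, high = wilson_interval(successes, n_runs)
    return successes / n_runs, low, high
```

The width of the interval is the useful part: a pass rate from ten runs supports far weaker claims than the same pass rate from a thousand.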

What Evaluation Actually Requires

Evaluating an agent reliably requires four things, each of them difficult and each something most teams underinvest in. First: a diverse test set that covers the real distribution of inputs the agent will encounter in production, not just the easy cases the team thought of during development. Second: ground truth labels that reflect what "correct" actually means for the task, which is often ambiguous and contested even among human evaluators. Third: automated evaluation that runs continuously, for example on every prompt or model change, not just when someone remembers to run the test suite. Fourth: monitoring that surfaces distribution shift, meaning the production input distribution drifting away from the one the test set was built to represent.
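The fourth requirement is the easiest to leave abstract, so here is a minimal sketch of one way to flag distribution shift. It assumes each input has already been tagged with a coarse category (an intent label, a length bucket, or similar; the tagging itself is hypothetical and not shown) and compares the test set against recent production traffic using the Population Stability Index. Any alert threshold is an assumption a team would need to tune for its own data.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def population_stability_index(
    reference: Iterable[Hashable],   # category of each test-set input
    production: Iterable[Hashable],  # category of each recent production input
    epsilon: float = 1e-6,           # floor that avoids log(0) for unseen categories
) -> float:
    """PSI between two categorical samples: 0.0 means identical proportions."""
    ref_counts = Counter(reference)
    prod_counts = Counter(production)
    if not ref_counts or not prod_counts:
        raise ValueError("both samples must be non-empty")
    ref_total = sum(ref_counts.values())
    prod_total = sum(prod_counts.values())

    psi = 0.0
    for category in set(ref_counts) | set(prod_counts):
        p = max(ref_counts[category] / ref_total, epsilon)
        q = max(prod_counts[category] / prod_total, epsilon)
        psi += (q - p) * math.log(q / p)  # each term is non-negative
    return psi


if __name__ == "__main__":
    test_set = ["refund"] * 50 + ["shipping"] * 50
    last_week = ["refund"] * 90 + ["shipping"] * 10
    print(f"PSI = {population_stability_index(test_set, last_week):.3f}")
```

A category that appears in production but not in the test set contributes a large term, which is the case a team most wants surfaced: users are asking for something the evaluation never covered.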

Most teams have some version of the first two and very little of the third and fourth. This means they are flying blind in production, relying on user complaints as their primary signal that something has gone wrong.

LLM-as-Judge

One of the most interesting technical developments in AI evaluation over the past year is the use of a strong language model as the judge in automated evaluation pipelines. Instead of requiring human annotators to evaluate every output, the team defines evaluation criteria in natural language and has a capable model score outputs against those criteria at scale.
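In outline, the pattern looks something like the sketch below. Everything here is an assumption made for illustration: `judge` is a hypothetical callable that sends a prompt to whichever model a team uses and returns its text reply, and the prompt wording, the 1 to 5 scale, and the `SCORE:` reply format are arbitrary choices rather than any standard.

```python
import re
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical prompt template; the wording and 1-5 scale are illustrative.
JUDGE_PROMPT = """You are grading an AI agent's output.
Criteria: {criteria}
Task input: {task}
Agent output: {output}
Reply with a line of the form 'SCORE: <integer 1-5>' followed by a one-sentence reason."""


@dataclass
class Verdict:
    score: Optional[int]  # None when the judge's reply could not be parsed
    raw: str              # full judge reply, kept for auditing


def judge_output(judge: Callable[[str], str], criteria: str, task: str, output: str) -> Verdict:
    """Ask the judge model to score one output against natural-language criteria."""
    raw = judge(JUDGE_PROMPT.format(criteria=criteria, task=task, output=output))
    match = re.search(r"SCORE:\s*([1-5])\b", raw)
    return Verdict(score=int(match.group(1)) if match else None, raw=raw)


def summarize(verdicts: Sequence[Verdict]) -> Tuple[Optional[float], int]:
    """Return (mean score over parseable verdicts, count of unparseable replies)."""
    scores = [v.score for v in verdicts if v.score is not None]
    return (mean(scores) if scores else None), len(verdicts) - len(scores)
```

Keeping the raw reply and counting unparseable verdicts separately matters in practice: silently dropping replies the parser could not read would bias the average.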

This approach has real problems: the judge model has its own biases and failure modes, such as favoring longer answers or being sensitive to the order in which candidates are presented. But it dramatically lowers the cost of evaluation and makes continuous evaluation tractable for teams that cannot afford human annotation at scale. The companies building evaluation infrastructure around this pattern are working on a real and important problem. We expect this space to consolidate around two or three significant companies over the next few years.
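The caveat about judge bias is checkable. A common sanity check, sketched here under the assumption that a team has a small sample of outputs labeled by both humans and the judge model, is to measure chance-corrected agreement between the two with Cohen's kappa. The labels in the example are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's marginal label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    expected = sum(human_counts[k] * judge_counts[k] for k in labels) / (n * n)

    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up labels: "pass"/"fail" verdicts on the same eight outputs.
    human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
    judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
    print(f"kappa = {cohens_kappa(human_labels, judge_labels):.2f}")
```

A kappa near zero means the judge agrees with humans no more often than chance would predict, in which case scaling up the judge only scales up noise.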