The AI Quality Mantra
Benchmark First. Evaluate and Analyze Errors. Learn and Improve Continuously.
Evaluation is creation: hear it, you creators! Evaluating is itself the most valuable treasure of all that we value. It is only through evaluation that value exists: and without evaluation the nut of existence would be hollow.
Friedrich Nietzsche, Thus Spoke Zarathustra
AI quality has become one of the central challenges that any team adopting LLMs and LLM-powered agents must address.
Everyone agrees evaluation matters. Fewer agree on how to do it well. Even fewer have turned it into a disciplined, repeatable engineering practice. Why do teams so often struggle to embed effective AI quality practices into their workflows?
Part of the confusion comes from importing a classical software testing mindset into a probabilistic world. In traditional software engineering, quality gates are discrete: a test passes or it fails. LLM-powered systems do not behave that way. Their outputs are open-ended and their failures are contextual. “Fixing” one issue can introduce another elsewhere.
In information visualization, Ben Shneiderman’s “Visual Information Seeking Mantra” — Overview first, zoom and filter, then details-on-demand — influenced decades of design thinking. It is a cognitive scaffold that has provided designers with a principled orientation: a way to think before building.
We need a similar mantra that AI practitioners should internalize, practice, and repeat as they build, augment, and maintain AI systems:
Benchmark first.
Evaluate and analyze errors.
Learn and improve continuously.
At first glance, the mantra might seem obvious to seasoned practitioners. Yet time and again, I have seen how failing to appreciate its value leads teams building AI systems down a rocky road.
Benchmark First: Clarity Before Optimization
A benchmark is a formal articulation of what the system claims to do, under what constraints, and within which boundaries. It guarantees that progress is anchored to a defined surface area of responsibility.
In classical ML, this might be a labeled dataset. In modern agent systems — coding agents, multi-tool orchestration agents, enterprise assistants — the situation is more demanding. A meaningful benchmark is no longer just input-output pairs. It must include the task itself, the criteria for success, and the environment in which the task unfolds.
For an enterprise assistant, this means realistic user queries, ambiguous requests, boundary conditions, and scenarios where abstention is preferable. A benchmark, therefore, is a structured representation of reality.
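As a rough sketch, a single benchmark case for such an assistant could be captured as a structured record. The field names below (user_query, environment, success_criteria, should_abstain) are illustrative assumptions rather than a prescribed schema:

```python
from dataclasses import dataclass, field

# Hypothetical shape of one benchmark case: the task, the environment it
# unfolds in, and the criteria for success, not just an input-output pair.
@dataclass
class BenchmarkCase:
    case_id: str
    user_query: str                # realistic, possibly ambiguous user request
    environment: dict              # tools, documents, or state the task depends on
    success_criteria: list[str]    # what an acceptable response must satisfy
    should_abstain: bool = False   # cases where declining to answer is the right move
    tags: list[str] = field(default_factory=list)   # e.g. "boundary-condition"

ambiguous_case = BenchmarkCase(
    case_id="ent-assist-042",
    user_query="Cancel the campaign",  # which one? ambiguity must be handled
    environment={"campaigns": ["spring_sale", "q3_launch"]},
    success_criteria=["asks a clarifying question", "does not cancel anything yet"],
    tags=["ambiguity"],
)
```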
Without an explicit benchmark, improvement becomes reactive. The system evolves in response to the most recent complaint, the loudest stakeholder, or the most visible failure.
Evaluate and Analyze Errors
Once a benchmark is established, evaluation can begin. The metrics should be grounded in the AI product being built rather than generic measures that offer no actionable insight.

However, aggregate evaluation metrics alone are not enough. This is where error analysis becomes essential both on benchmarks and on production traffic.
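To make this concrete, here is a minimal sketch of an evaluation loop that scores each case against its own success criteria and keeps a per-case record for later error analysis. The callables run_assistant and meets_criterion stand in for whatever system under test and product-specific checks (exact match, rubric, judge) you actually use:

```python
def evaluate(benchmark, run_assistant, meets_criterion):
    """Return an aggregate pass rate plus per-case records for error analysis."""
    records = []
    for case in benchmark:
        output = run_assistant(case.user_query, case.environment)
        passed = all(meets_criterion(output, c) for c in case.success_criteria)
        records.append({
            "case_id": case.case_id,
            "output": output,
            "passed": passed,
            "tags": case.tags,
        })
    pass_rate = sum(r["passed"] for r in records) / len(records)
    return pass_rate, records
```

The pass rate is the headline number; the per-case records are the raw material for the next step.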
In probabilistic systems, not all failures are equal. A formatting inconsistency is categorically different from a confident hallucination. An answer that is obviously wrong but recoverable differs profoundly from one that silently misleads.
Error analysis restores structure to failure.
When failures are grouped into types — hallucination, retrieval failure, reasoning breakdown, ambiguity mishandling — patterns emerge. Those patterns reveal systemic weaknesses rather than isolated incidents. They illuminate which component is responsible: the retriever, the planner, the prompt structure, the tool interface, or the model itself.
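One lightweight way to make those patterns visible is to label each failing case with a category and tally the counts. The labels and case IDs below are made up for illustration:

```python
from collections import Counter

# Assumed labels from a manual or LLM-assisted review of failing cases.
failure_labels = {
    "ent-assist-042": "ambiguity_mishandling",
    "ent-assist-107": "retrieval_failure",
    "ent-assist-113": "retrieval_failure",
    "ent-assist-129": "hallucination",
}

by_category = Counter(failure_labels.values())
print(by_category.most_common())
# e.g. [('retrieval_failure', 2), ('ambiguity_mishandling', 1), ('hallucination', 1)]
# A cluster in one category points at a specific component, not the system at large.
```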
Learn and Improve: Recalibrate Continuously
In the world of AI quality, improvement is “always on”.
A benchmark provides a defined capability surface. Error analysis helps reveal where that surface is weak. But the improvement that follows is not limited to simply fixing behavior.
Every prompt change redistributes behavior. Every model upgrade alters reasoning patterns. Every new capability introduces new interactions. Distribution shifts are inevitable as user expectations evolve. What was once an edge case becomes common.
Consequently, each expansion introduces new degrees of freedom, and with them new failure modes. The old benchmark cannot fully measure the new system. Improvement therefore forces a recalibration: the benchmark must expand, the evaluation goals must evolve as existing metrics are redefined and new ones added, and the error taxonomy must absorb new categories. LLM judges are a concrete example: with any expansion, the judge instructions and rubrics may need to change, and a follow-up recalibration is required.
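As one illustration of that last point, a simple safeguard is to check a revised judge against a small set of human-labeled cases before adopting it. The judge_agreement helper below is hypothetical, not part of any particular framework:

```python
def judge_agreement(judge, calibration_set):
    """Fraction of human-labeled cases on which the (revised) judge agrees.

    `judge` is any callable returning a verdict; `calibration_set` is a list of
    (output, human_verdict) pairs. Both are placeholders for your own setup.
    """
    agree = sum(judge(output) == human for output, human in calibration_set)
    return agree / len(calibration_set)

# Before rolling out a new rubric, one might require, say, 90% agreement:
# assert judge_agreement(judge_v2, calibration_set) >= 0.9
```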
We should make peace with the fact that the cycle never ends!
Concluding Remarks
My first-hand experience driving AI quality initiatives at Adobe Experience Platform (supporting the GA of Adobe Agent Orchestrator and Agents), along with other engagements such as the Adobe Brand Concierge, has continued to remind me of the value of the AI quality mantra. Operationalizing it requires building frameworks, platformizing quality workflows, and, above all, a quality-first mindset. A few relevant resources on how to operationalize the mantra are shared below as recommended reading:



