Measuring AI Performance: My Guide from Testing to Quality Control

When a customer-support assistant I deployed began fielding 18,000 chats per week, two numbers decided whether we kept it online: a 2.4% hallucination rate that generated 160 escalations, and a p95 latency of 1.1 seconds that customers tolerated. Shrinking hallucinations to under 1% and keeping p95 below 1.0 seconds dropped escalations by 28% and saved roughly 40 agent-hours per day.

You want a practical path from testing to quality control. Below, I share how I measure AI performance end-to-end: the right metrics, a test harness that survives reality, concrete tactics to reduce hallucinations, and the monitoring and governance that keep models trustworthy over time.

Define Success: What To Measure

The simplest rule I use: choose one outcome metric per user promise. For classification or routing, balanced accuracy or macro-F1 captures minority classes; for ranking, mean reciprocal rank and top-k recall; for generation, task completion rate verified by rubric. My acceptance bars: macro-F1 above 0.85 for critical routing, top-3 recall above 0.9 for search-heavy tasks, and rubric-validated task success above 80% before a staged rollout.
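
As a concrete example, here is a minimal sketch of that gate for a routing task, assuming scikit-learn is installed; the intent labels and predictions are toy data, not a real eval set.

```python
# Minimal sketch: scoring a routing classifier against the acceptance bars above.
# Requires scikit-learn; the intent labels and predictions are illustrative.
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = ["billing", "refund", "tech", "refund", "billing", "tech"]
y_pred = ["billing", "refund", "tech", "billing", "billing", "tech"]

macro_f1 = f1_score(y_true, y_pred, average="macro")
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"macro-F1: {macro_f1:.3f}  balanced accuracy: {bal_acc:.3f}")
print("pass" if macro_f1 > 0.85 else "hold the rollout")
```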

For RAG, retrieval quality often dominates model quality. I require recall@5 above 0.85 and mean reciprocal rank above 0.6 on a curated set before tuning prompts. If recall@5 sits at 0.7, hallucinations usually spike because the model invents when context is thin. A quick diagnostic: compare answers with k=1 versus k=5; if k=5 reduces unsupported claims by >30%, retrieval is your bottleneck, not the LLM.
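
A minimal sketch of the retrieval-side check, with toy ranked lists and relevance judgments standing in for a curated set:

```python
# Minimal sketch: recall@k and mean reciprocal rank over a curated retrieval set.
# `results` maps each query to the retriever's ranked doc ids; `relevant` maps each
# query to the doc ids labeled relevant. Data here is illustrative.
def recall_at_k(ranked, relevant, k=5):
    return len(set(ranked[:k]) & relevant) / max(len(relevant), 1)

def reciprocal_rank(ranked, relevant):
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

results = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2", "d5"]}
relevant = {"q1": {"d1"}, "q2": {"d2", "d8"}}

recall5 = sum(recall_at_k(results[q], relevant[q], 5) for q in results) / len(results)
mrr = sum(reciprocal_rank(results[q], relevant[q]) for q in results) / len(results)
print(f"recall@5: {recall5:.2f}  MRR: {mrr:.2f}")
```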

Human judgment remains the ground truth for usefulness. I run pairwise preference tests (A vs B) with at least 300 prompts and compute win rates with bootstrap confidence intervals; a 5–7 percentage point advantage is typically worth a rollout. For Likert rubrics (accuracy, helpfulness, tone), I target an average above 4.2/5 and inter-annotator agreement (Cohen’s kappa) above 0.7. Anything lower and my “improvements” tend to evaporate in production.
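
Here is a minimal sketch of the win-rate computation with a bootstrap confidence interval; the 300 synthetic judgments are placeholders, not real data.

```python
# Minimal sketch: win rate with a bootstrap confidence interval for A-vs-B preference tests.
# `wins` is 1 when candidate B beat control A on a prompt (synthetic data for illustration).
import random

random.seed(0)
wins = [1] * 170 + [0] * 130  # 300 pairwise judgments, B wins 170

def bootstrap_ci(samples, n_boot=10_000, alpha=0.05):
    stats = []
    for _ in range(n_boot):
        resample = random.choices(samples, k=len(samples))
        stats.append(sum(resample) / len(resample))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2))]

win_rate = sum(wins) / len(wins)
lo, hi = bootstrap_ci(wins)
print(f"win rate {win_rate:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# Ship only if the lower bound clears 0.5 by a meaningful margin (e.g. 5-7 points).
```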

Operational metrics decide whether performance is affordable. I track p50/p95/p99 latency (budgets: chat p95 under 1.0 seconds, p99 under 2.5 seconds), cost per answer (hard cap per session), refusal rate (healthy floor around 1–3% for safety), and calibration error (ECE under 0.05 for high-stakes, under 0.1 for general use). I also watch tokens per answer and context utilization; a window filled with low-relevance text quietly increases cost and error.
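
For calibration, a minimal sketch of expected calibration error with equal-width bins; the confidence and correctness values are placeholders for what you would pull from eval logs.

```python
# Minimal sketch: expected calibration error (ECE) with equal-width confidence bins.
def expected_calibration_error(confidences, correct, n_bins=10):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

confidences = [0.95, 0.80, 0.65, 0.99, 0.70, 0.55]
correct = [1, 1, 0, 1, 1, 0]
print(f"ECE: {expected_calibration_error(confidences, correct):.3f}")
```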

Build A Test Harness That Survives Reality

Start with a golden dataset built from real queries. For intents or FAQ, I like 1,000–2,000 labeled examples covering tail cases; for generation, 500–1,000 prompts with expected outcomes or reference passages. Include “hard” slices: ambiguous wording, outdated facts, and adversarial phrasings. To detect a 5 percentage point improvement with 80% power in a binary success metric, you usually need around 1,000 samples per variant; for smaller lifts, scale accordingly.

Label quality can sink or save your roadmap. I double-label at least 20% of examples, calculate Cohen's kappa, and pause if kappa falls below 0.7. Ambiguity gets resolved by writing sharper rubrics, not by voting away disagreements. Where exact answers are impossible, I encode rubrics with specific checks: “Does the answer cite a provided source?” “Are all numbers traceable to context?” “Did the system abstain appropriately?”
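
A minimal sketch of that gate, assuming scikit-learn's cohen_kappa_score; the two annotators' labels are invented for illustration.

```python
# Minimal sketch: the double-labeling gate on a shared slice of examples.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["supported", "unsupported", "abstain", "supported", "supported"]
annotator_b = ["supported", "unsupported", "supported", "supported", "supported"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
if kappa < 0.7:
    print("Pause labeling: tighten the rubric before collecting more data.")
```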

Make tests deterministic. Freeze prompts, model version, and tool availability; set temperatures to 0–0.2 for evaluation. For generative models, run pairwise head-to-heads on 200–400 prompts; significance comes from bootstrap resampling of wins, not vibes. Use seeded sampling to keep regression suites stable week to week, and reserve a separate exploration set for new ideas so you do not overfit your tests.
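
A minimal sketch of what "deterministic" means in practice: a pinned evaluation config plus seeded sampling for the regression slice. The field names and identifiers are illustrative, not a real system's.

```python
# Minimal sketch: freeze the evaluation config and seed the regression slice so
# week-over-week comparisons are like-for-like. Identifiers are placeholders.
import json
import random

EVAL_CONFIG = {
    "model": "support-assistant-2024-06-01",  # pinned model version (example id)
    "prompt_version": "v14",
    "temperature": 0.0,
    "tools_enabled": ["kb_search"],
}

def regression_slice(prompts, n=300, seed=42):
    rng = random.Random(seed)  # same seed -> same slice every run
    return rng.sample(prompts, k=min(n, len(prompts)))

prompts = [f"prompt-{i}" for i in range(2000)]
print(json.dumps(EVAL_CONFIG, indent=2))
print(regression_slice(prompts, n=3))
```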

RAG needs its own harness. Annotate document relevance with graded labels (high, medium, none) and compute precision/recall at k before you even ask the model to answer. Test chunk sizes between 200 and 800 tokens with 10–20% overlap and measure the “answer token hit rate”: the fraction of answer sentences supported by retrieved spans. If retrieval latency pushes p95 above 700 ms, you will feel it in abandonment even when the model is fast.
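
Here is a minimal sketch of the answer-token-hit-rate check. The word-overlap heuristic and the 0.6 threshold are simplifying assumptions; in practice I align answer sentences to retrieved spans more carefully.

```python
# Minimal sketch: fraction of answer sentences supported by retrieved chunks,
# approximated with content-word overlap (heuristic and threshold are assumptions).
import re

def supported(sentence, chunks, min_overlap=0.6):
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    if not words:
        return True
    best = max(
        len(words & set(re.findall(r"[a-z0-9]+", c.lower()))) / len(words)
        for c in chunks
    )
    return best >= min_overlap

answer = "The warranty lasts 24 months. Refunds are issued within 5 days."
chunks = [
    "Our warranty period is 24 months from purchase.",
    "Shipping takes 3-7 business days.",
]

sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer) if s]
hit_rate = sum(supported(s, chunks) for s in sentences) / len(sentences)
print(f"answer hit rate: {hit_rate:.2f}")
```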

Reducing Hallucinations: Prevention, Detection, Response

Prevention beats detection. Two levers pay repeatedly: better retrieval and stricter prompts. In one deployment, lifting recall@5 from 0.72 to 0.91 by adding a cross-encoder reranker and re-chunking reduced hallucinations from 7.4% to 3.1% without touching the model. On prompts, instructions that force citations, limit scope (“answer only using the provided context”), and define abstention rules typically cut unsupported claims by 20–40% compared to generic chat prompts.
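
For the reranking step, a minimal sketch using the sentence-transformers CrossEncoder API; the checkpoint name is one common public reranker, and the query and passages are toy data.

```python
# Minimal sketch: rerank first-stage candidates with a cross-encoder.
# Assumes the sentence-transformers package; the model name is an example checkpoint.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How long is the warranty on the X200?"
candidates = [
    "The X200 ships with a 24-month limited warranty.",
    "The X200 is available in three colors.",
    "Warranty claims require a proof of purchase.",
]

scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), key=lambda t: t[0], reverse=True)]
print(reranked[0])
```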

Constrain generation where possible. Function calling for numbers, dates, and entity lookups replaced guesswork with API truth, removing entire classes of errors. Constrained decoding with regex-like schemas protects against format drift. Self-consistency (sample multiple chains and vote) sometimes improves factuality, but evidence is mixed and costs can double; I prefer selective self-checks on high-risk prompts. A simple win: require the model to provide a source line for each factual claim and refuse if none is available.
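
A minimal sketch of that last rule: validate the draft and fall back to an abstention when a sentence lacks a source tag. The [source: ...] marker format is an assumption for illustration, not a standard.

```python
# Minimal sketch: accept a draft only if every sentence carries a citation marker;
# otherwise return an abstention. The marker format is an illustrative assumption.
import re

def enforce_citations(draft: str) -> str:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", draft.strip()) if s]
    if all(re.search(r"\[source:\s*[^\]]+\]", s) for s in sentences):
        return draft
    return "I can't confirm that from the provided documents."

ok = "The warranty lasts 24 months [source: warranty.md]."
bad = "The warranty lasts 24 months and covers accidental damage."
print(enforce_citations(ok))
print(enforce_citations(bad))
```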

Detect what slips through. Entailment-based checkers compare each sentence to retrieved passages and mark unsupported spans; on legal summaries, I flag any sentence not entailed at a 0.8 threshold for review. Numeric guardrails catch reconciliation errors by re-parsing outputs and cross-checking with context. Confidence proxies like logit entropy or ensemble variance help routing: when uncertainty exceeds a tuned threshold, the system abstains or asks a clarifying question. Expect 5–10% latency overhead for these checks.
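
Here is a minimal sketch of the entailment check using an off-the-shelf NLI model via transformers; the model choice and the 0.8 threshold are assumptions to tune for your domain.

```python
# Minimal sketch: flag answer sentences not entailed by the retrieved context.
# Assumes transformers and torch; roberta-large-mnli is one public NLI checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # Read the entailment index from the config instead of hard-coding it.
    idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
    return probs[idx].item()

context = "The X200 ships with a 24-month limited warranty."
sentence = "The warranty covers accidental damage for five years."
if entailment_prob(context, sentence) < 0.8:
    print("Unsupported claim: route to human review.")
```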

Respond with escalation and learning loops. I route low-confidence or uncited answers to humans, targeting an auto-resolution rate above 90% while capping escalations under 5%. I sample 1–5% of production answers for labeling each week, tag hallucinations and near-misses, and feed them into a weekly fix cycle: update retrieval filters, adjust prompts, or add tools. I track mean time to remediate hallucination patterns, aiming for under 72 hours from detection to patch.
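
A minimal sketch of the routing rule; the 0.75 confidence threshold and 3% sampling rate are assumed values to be tuned against your escalation budget.

```python
# Minimal sketch: low-confidence or uncited answers go to a human queue,
# and a small share of auto-resolved answers is sampled for weekly labeling.
import random

def route(answer: str, confidence: float, has_citation: bool) -> str:
    if confidence < 0.75 or not has_citation:  # assumed, tuned threshold
        return "escalate_to_agent"
    if random.random() < 0.03:                 # ~3% weekly labeling sample
        return "auto_resolve_and_queue_for_labeling"
    return "auto_resolve"

print(route("The warranty lasts 24 months [source: warranty.md].", 0.91, True))
print(route("It probably lasts about two years.", 0.42, False))
```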

Production Quality Control: Monitoring, Alerts, And Governance

Quality in production starts with explicit SLOs. My defaults: task success rate above 80%, hallucination proxy rate below 1% using automatic checks, refusal rate between 1% and 5% depending on policy, p95 latency under 1.0 seconds, and cost per 100 sessions within budget. Alerts fire when any metric breaches for 10 consecutive minutes, when query distribution drifts (KL divergence beyond a tuned threshold), or when embedding centroids shift by more than two standard deviations week over week.
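
For the drift alert, a minimal sketch using KL divergence over intent buckets; the bucket shares and the threshold are illustrative and should be tuned on your own traffic.

```python
# Minimal sketch: alert on query-distribution drift with KL divergence over intent buckets.
import math

def kl_divergence(p, q, eps=1e-9):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

baseline = {"billing": 0.40, "refund": 0.25, "tech": 0.30, "other": 0.05}
this_week = {"billing": 0.28, "refund": 0.22, "tech": 0.38, "other": 0.12}

keys = sorted(baseline)
drift = kl_divergence([this_week[k] for k in keys], [baseline[k] for k in keys])
print(f"KL divergence vs. baseline: {drift:.3f}")
if drift > 0.05:  # assumed threshold; tune it on your traffic
    print("Drift alert: review new query slices before the eval set goes stale.")
```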

Deploy gradually. Shadow-test new models on 5–10% of traffic without user exposure, compare against control on matched prompts, then canary to 10–20% with auto-rollback if success drops by 3 percentage points or latency rises 20% at p95. For A/Bs, you usually need 5,000–20,000 sessions to detect a 2–3 percentage point improvement. Caching answers with semantic similarity (cosine > 0.9) often yields 40–70% hit rates, cutting cost and tail latency while stabilizing behavior.
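
Here is a minimal sketch of the semantic cache, assuming the sentence-transformers package; the model name is an example, and a production system would use a vector store rather than a Python list.

```python
# Minimal sketch: reuse a stored answer when the new query's embedding cosine
# similarity clears 0.9. Assumes sentence-transformers; model name is an example.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (query embedding, answer); a vector store in production

def cached_answer(query: str, threshold: float = 0.9):
    q_emb = encoder.encode(query, convert_to_tensor=True)
    for emb, answer in cache:
        if util.cos_sim(q_emb, emb).item() >= threshold:
            return answer
    return None

def store(query: str, answer: str):
    cache.append((encoder.encode(query, convert_to_tensor=True), answer))

store("How long is the warranty?", "The warranty lasts 24 months [source: warranty.md].")
print(cached_answer("What is the warranty period?") or "cache miss: call the model")
```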

Governance is plumbing for trust. Version everything: data, prompts, models, retrieval indices, and tools. Keep immutable audit logs of inputs, outputs, citations, and decisions; this enables incident reconstruction and compliance checks. Scan outputs for PII and secrets; redact and hash sensitive fields before storing, or avoid storage entirely where policy demands. Maintain a source-of-truth registry for documents allowed in RAG to reduce stale or unauthorized citations.
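
A minimal sketch of the redact-and-hash step before audit logging; the regexes cover only emails and simple phone formats, so treat this as a placeholder for a proper PII scanner.

```python
# Minimal sketch: redact obvious PII and store a short hash instead of the raw value.
import hashlib
import re

PII_PATTERNS = [r"[\w.+-]+@[\w-]+\.[\w.]+", r"\+?\d[\d\s().-]{7,}\d"]

def redact(text: str) -> str:
    for pattern in PII_PATTERNS:
        text = re.sub(
            pattern,
            lambda m: "<pii:" + hashlib.sha256(m.group().encode()).hexdigest()[:8] + ">",
            text,
        )
    return text

print(redact("Reach me at jane.doe@example.com or +1 (415) 555-0199."))
```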

Goodhart’s law (Charles Goodhart): when a measure becomes a target, it ceases to be a good measure.

To avoid metric gaming, track pairs that trade off: helpfulness and refusal, recall and precision, speed and cost, accuracy and coverage. Rotate a “hidden” audit set each quarter to catch overfitting to known tests. In weekly reviews, celebrate wins only when at least two independent indicators improved—say, user task success and human preference—while costs and safety remained within thresholds.

FAQ

Q: If I could track only one metric, which should it be?

No single metric covers everything, but “task success rate” tied to the user’s goal is the strongest anchor. Pair it with a hallucination control: unsupported-claim rate for knowledge tasks or abstention-on-unknown for safety-critical work. If forced to pick one number for a chatbot, I use goal completion rate verified by a rubric, because it correlates with retention and cost savings better than BLEU/ROUGE or latency alone.

Q: How many test samples do I need to detect improvements?

For a binary success metric, a rough rule is n ≈ 16·p·(1−p)/d² per variant, where p is baseline success and d is the minimum detectable difference. With p=0.8 and d=0.05, n ≈ 1,024, consistent with the "around 1,000 per variant" guidance above. For pairwise preferences, 300–500 prompts with bootstrap confidence intervals usually stabilize win-rate estimates within ±3–4 percentage points. When traffic is cheap, oversample; when labels are expensive, prioritize high-variance slices that matter most.
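
As a minimal sketch, the rule of thumb in code (it assumes a two-sided alpha of 0.05 and 80% power for a two-proportion comparison):

```python
# Minimal sketch: samples per variant from the n ≈ 16·p·(1−p)/d² rule of thumb.
import math

def samples_per_variant(p: float, d: float) -> int:
    return math.ceil(16 * p * (1 - p) / d**2)

print(samples_per_variant(0.80, 0.05))  # ~1,024 per variant
print(samples_per_variant(0.80, 0.02))  # smaller lifts need far more
```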

Q: Do public benchmarks predict my product’s performance?

They are necessary but not sufficient. Benchmarks like MMLU or GSM8K rank models and surface gross capability gaps, but domain transfer is uneven. I use public scores to narrow candidates, then run a private, domain-grounded evaluation; in my experience, the ordering among top models often changes once retrieval, tools, latency, and cost constraints are applied. Treat benchmarks as filters, not contracts.

Q: What is an acceptable hallucination rate?

It is context-dependent. For finance, legal, medical, or compliance use, target effectively zero unsupported claims and rely on abstention plus human oversight. For customer support with clear documentation, I set a threshold below 1% with automatic detection and escalation. For general-purpose chat, 2–5% may be tolerable if content is low-risk and flagged. If you cannot measure hallucinations, default to stricter abstention and citations.

Q: Should I retrain, reprompt, or fix retrieval first?

Follow the cheapest-first rule. If recall@k is below 0.85 on known questions, improve retrieval: chunking, indexing, reranking, and query rewriting. Next, tighten prompts and add tools or citations. Retraining or fine-tuning pays off when you see systematic style or domain errors that prompts cannot fix, and when you have thousands of high-quality examples. I only retrain after retrieval and prompts plateau, to avoid locking in upstream mistakes.

Conclusion

Start with one outcome metric per user promise, add a hallucination control and a latency/cost budget, and enforce them in a deterministic test harness before shipping. In production, monitor paired metrics to resist gaming, escalate uncertainty, and close the loop weekly with labeled samples. When retrieval is strong, prompts are strict, and governance is boring, measuring AI performance stops being mystical and starts looking like ordinary, reliable engineering.