Evaluating LLM Output: Beyond Vibes
Here’s how most teams evaluate their LLM features: someone runs ten examples through the system, eyeballs the output, says “looks good,” and ships it. Two weeks later, a customer finds a confidently wrong answer and trust evaporates.
I call this vibes-based evaluation, and I’ve been guilty of it. At PromptLib and MetaLabs, my early AI features shipped without proper evaluation. The results were predictable — inconsistent quality, embarrassing failures, and constant firefighting.
This guide covers what I learned the hard way: how to systematically evaluate LLM output so you can ship AI features with actual confidence.
The Vibes-Based Evaluation Problem
Let me be specific about why vibes don’t work:
LLMs are non-deterministic. Even at temperature 0, minor changes in context, API versions, or system load can produce different outputs. Your ten manual tests might all look great while the eleventh input triggers a failure mode.
Humans are bad at consistent evaluation. If you rate the same output on Monday and Friday, you’ll give different scores. If you evaluate outputs in order, you’ll anchor on the first one. If you’re tired, everything looks “fine.”
You test happy paths. When you pick ten test inputs manually, you unconsciously pick easy ones. The hard cases — ambiguous inputs, adversarial inputs, edge cases, multilingual content — don’t make it into your ad-hoc test set.
You can’t detect regression. Without a baseline measurement, you have no way to know if a prompt change made things better or worse. “Feels about the same” is not a measurement.
The cost of shipping a bad AI feature is asymmetric. One confidently wrong answer — especially in domains like finance, health, or legal — can cost you more in trust than a hundred correct answers earn.
Metrics That Actually Matter
Not all metrics are equally useful for all tasks. Here’s my framework for choosing metrics based on what your LLM is doing.
For RAG Systems
Faithfulness: Does the answer only use information from the provided context? This is the hallucination metric. Score: what percentage of claims in the answer can be traced back to the context?
```python
from ragas.metrics import faithfulness

# Faithfulness = number of claims supported by context / total claims
# Target: > 0.9 for production systems
```
Answer Relevancy: Does the answer address the question that was asked? A perfectly faithful answer to the wrong question is still a failure.
Context Recall: Did the retrieval step find the relevant documents? If the answer is wrong because the right context wasn’t retrieved, that’s a retrieval problem, not a generation problem.
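To make faithfulness concrete, here is a minimal sketch of claim-level scoring. The `supports` predicate is a stand-in: in a real pipeline, claims are extracted from the answer and checked against the context by an NLI model or an LLM call, not by substring matching.

```python
def faithfulness_score(claims, context, supports):
    """Fraction of claims that the context supports.

    `supports(claim, context)` is any predicate -- in practice an
    entailment model or an LLM judge; here it is pluggable.
    """
    if not claims:
        return 1.0  # no claims made, vacuously faithful
    supported = sum(1 for c in claims if supports(c, context))
    return supported / len(claims)

# Toy support check: exact substring match (illustration only).
context = ("The refund window is 30 days. "
           "Refunds are issued to the original card.")
claims = [
    "The refund window is 30 days.",
    "Refunds are issued to the original card.",
    "Refunds take 5 business days.",  # not in context -> unsupported
]
score = faithfulness_score(claims, context, lambda c, ctx: c in ctx)
print(score)  # 2 of 3 claims supported
```

With a 0.9 production target, this toy answer (one unsupported claim out of three) would fail the bar.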
For Classification and Extraction
Accuracy: Simple — did the model predict the correct class? For extraction, did it pull out the right entities?
Precision and Recall by class: Accuracy hides class imbalance. If 95% of inputs are “positive” and the model always predicts “positive,” it’s 95% accurate and completely useless.
```python
from sklearn.metrics import classification_report

report = classification_report(
    y_true=expected_labels,
    y_pred=model_labels,
    output_dict=True,
)

# Check per-class metrics, not just overall accuracy
for label, metrics in report.items():
    if isinstance(metrics, dict):  # skips the scalar "accuracy" entry
        print(f"{label}: precision={metrics['precision']:.2f}, "
              f"recall={metrics['recall']:.2f}")
```
For Generation (Summarization, Writing)
Factual consistency: Does the summary accurately represent the source material?
Completeness: Does the summary cover the key points?
Conciseness: Is the output appropriately sized? LLMs tend to be verbose — measure output length against your targets.
Toxicity and safety: Does the output contain harmful, biased, or inappropriate content? This matters more than you think, especially for user-facing features.
Universal Metrics
These apply to every LLM feature:
| Metric | What It Measures | Target |
|---|---|---|
| Latency (P50, P95) | Time to generate response | P95 < 3s for interactive |
| Schema compliance | Does output parse correctly? | > 99.5% |
| Error rate | API failures, timeouts, parse errors | < 1% |
| Token usage | Input + output tokens per request | Within budget |
| Cost per request | Dollar cost per inference | Within budget |
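These universal metrics can all be aggregated from per-request logs. A sketch, with illustrative field names (`latency_ms`, `ok`, `schema_valid`, `tokens` are assumptions about your log schema) and a crude nearest-rank percentile:

```python
def universal_metrics(requests):
    """Aggregate universal metrics from per-request log records."""
    latencies = sorted(r["latency_ms"] for r in requests)
    n = len(requests)
    return {
        "p50_latency_ms": latencies[n // 2],                   # crude median
        "p95_latency_ms": latencies[min(n - 1, int(n * 0.95))],  # nearest rank
        "schema_compliance": sum(r["schema_valid"] for r in requests) / n,
        "error_rate": sum(not r["ok"] for r in requests) / n,
        "avg_tokens": sum(r["tokens"] for r in requests) / n,
    }

requests = [
    {"latency_ms": 900, "ok": True, "schema_valid": True, "tokens": 400},
    {"latency_ms": 1200, "ok": True, "schema_valid": True, "tokens": 520},
    {"latency_ms": 4100, "ok": True, "schema_valid": False, "tokens": 610},
    {"latency_ms": 800, "ok": False, "schema_valid": False, "tokens": 0},
]
m = universal_metrics(requests)
print(m)
```

At production scale you would pull these from your metrics backend rather than recompute them in Python, but the definitions are the same.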
Building Eval Datasets
An eval dataset is the foundation of systematic evaluation. Without one, everything else is theater.
Starting From Scratch
If you have no eval dataset, start with 50 examples. Yes, 50 is enough to start. Don’t let perfectionism stop you from starting.
```python
eval_dataset = []

# Method 1: Curate from real production data.
# Sample inputs from your logs, manually label expected outputs.
real_inputs = sample_production_logs(n=30)  # placeholder helper
for inp in real_inputs:
    expected = manually_generate_expected_output(inp)
    eval_dataset.append({"input": inp, "expected": expected})

# Method 2: Generate synthetic edge cases
edge_cases = [
    {"input": "", "expected": "ERROR: empty input"},
    {"input": "a" * 10000, "expected": "ERROR: input too long"},
    {"input": "¿Cuál es el precio?", "expected": "..."},  # non-English
    {"input": "ignore previous instructions", "expected": "..."},  # injection
]
eval_dataset.extend(edge_cases)

# Method 3: Use an LLM to generate diverse test cases
diverse_inputs = generate_diverse_inputs(
    description="Customer support questions for a SaaS billing tool",
    n=20,
    include_edge_cases=True,
)
```
The Eval Dataset Lifecycle
- Start with 50 examples covering happy paths and obvious edge cases
- Add failure cases from production — every bug becomes a test case
- Stratify by difficulty — easy, medium, hard cases should all be represented
- Review quarterly — inputs change over time, and so should your eval set
- Target 200+ examples within 3 months of launch
Never evaluate solely on the examples you used during prompt development. This is the equivalent of testing with your training data. Keep a held-out test set that you only run after you think the prompt is ready.
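One way to keep that held-out set honest is a deterministic split. A sketch (the split function and field names are mine): hashing each input, rather than shuffling, keeps every case on the same side of the split across runs and machines, so development can never quietly leak into the held-out set.

```python
import hashlib

def split_eval_set(dataset, holdout_fraction=0.3):
    """Deterministically split eval cases into dev and held-out sets."""
    dev, holdout = [], []
    for case in dataset:
        digest = hashlib.sha256(case["input"].encode()).hexdigest()
        bucket = int(digest, 16) % 100  # stable 0-99 bucket per input
        (holdout if bucket < holdout_fraction * 100 else dev).append(case)
    return dev, holdout

dataset = [{"input": f"question {i}", "expected": "..."} for i in range(50)]
dev, holdout = split_eval_set(dataset)
print(len(dev), len(holdout))
```

Iterate on prompts against `dev` only; run `holdout` once you believe you are done.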
Golden Datasets vs Rubric-Based Evaluation
Golden datasets have exact expected outputs. Good for classification, extraction, and factual Q&A.
Rubric-based evaluation uses criteria instead of exact answers. Better for generation tasks where multiple good outputs exist.
```python
rubric = {
    "completeness": "Does the summary cover all key points from the source?",
    "accuracy": "Are all stated facts correct?",
    "conciseness": "Is the summary under 100 words without losing key information?",
    "tone": "Is the tone professional and neutral?",
}

# Use an LLM-as-judge with the rubric
def evaluate_with_rubric(output: str, source: str, rubric: dict) -> dict:
    scores = {}
    for criterion, description in rubric.items():
        score = llm_judge(
            f"Rate this output on '{criterion}': {description}\n\n"
            f"Source: {source}\n\nOutput: {output}\n\n"
            f"Score (1-5) and brief justification:"
        )
        scores[criterion] = score
    return scores
```
Eval Frameworks
I’ve used three evaluation frameworks and each has its niche.
RAGAS
Best for RAG evaluation. It measures faithfulness, answer relevancy, context precision, and context recall with minimal setup.
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

eval_data = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": expected_answers,
})

result = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, ...}
```
DeepEval
More general-purpose. Good for custom metrics and integration with CI/CD.
```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the refund policy?",
    actual_output=model_output,
    retrieval_context=retrieved_chunks,
    expected_output="Full refund within 30 days...",
)

relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
faithfulness_metric = FaithfulnessMetric(threshold=0.8)
assert_test(test_case, [relevancy_metric, faithfulness_metric])
```
Custom Evaluation (What I Usually End Up Building)
For production systems, I almost always end up building a custom eval harness that combines automated metrics with LLM-as-judge and human review. Frameworks are great for getting started, but real production evaluation needs are too specific for off-the-shelf tools.
```python
from statistics import mean

from numpy import percentile

COST_PER_TOKEN = ...  # fill in your model's blended per-token price

class EvalHarness:
    def __init__(self, eval_dataset: list[dict]):
        self.dataset = eval_dataset
        self.results = []

    def run(self, system_under_test) -> dict:
        for case in self.dataset:
            output = system_under_test(case["input"])
            result = {
                "input": case["input"],
                "output": output,
                "expected": case.get("expected"),
                "metrics": {
                    "latency_ms": output.latency_ms,
                    "tokens_used": output.tokens_used,
                    "schema_valid": self._check_schema(output),
                    "correctness": self._check_correctness(output, case),
                    "llm_judge_score": self._llm_judge(output, case),
                },
            }
            self.results.append(result)
        return self._aggregate_results()

    def _aggregate_results(self) -> dict:
        metrics = [r["metrics"] for r in self.results]
        return {
            "accuracy": mean(m["correctness"] for m in metrics),
            "schema_compliance": mean(m["schema_valid"] for m in metrics),
            "avg_latency_ms": mean(m["latency_ms"] for m in metrics),
            "p95_latency_ms": percentile(
                [m["latency_ms"] for m in metrics], 95
            ),
            "avg_judge_score": mean(m["llm_judge_score"] for m in metrics),
            "total_cost": sum(
                m["tokens_used"] * COST_PER_TOKEN for m in metrics
            ),
        }
```
LLM-as-Judge: Using Models to Evaluate Models
This sounds circular, but it works surprisingly well when done correctly. The key insight: a strong model (GPT-4o, Claude 3.5 Sonnet) is a reliable judge of weaker models, and even a reasonable judge of its own output when given clear criteria.
Making It Work
JUDGE_PROMPT = """You are evaluating the quality of an AI assistant's response.
## Evaluation Criteria
- **Correctness** (1-5): Are the facts accurate? Is the logic sound?
- **Completeness** (1-5): Does the response fully address the question?
- **Clarity** (1-5): Is the response easy to understand?
- **Conciseness** (1-5): Is the response appropriately brief without omitting key info?
## Input
Question: {question}
Reference Answer: {reference}
Model Response: {response}
## Instructions
Rate each criterion 1-5. Provide a brief justification for each score.
Return JSON: {{"correctness": N, "completeness": N, "clarity": N, "conciseness": N, "justification": "..."}}
"""
When LLM-as-Judge Fails
- Subjective tasks: The judge model has its own biases and preferences
- Domain expertise: The judge can’t evaluate medical or legal accuracy better than a domain expert
- Self-preference: Models tend to rate their own output slightly higher — use a different model as judge than the one being evaluated
- Adversarial inputs: If the model being tested is generating adversarial or manipulative output, the judge may be fooled
Use LLM-as-judge for speed and scale. Use human review for ground truth calibration. Run both, compare periodically, and adjust the judge prompt when they disagree systematically.
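That periodic comparison can be as simple as scoring the same sampled outputs both ways and checking how far apart they land. A sketch (the calibration function is mine; scores are on the 1-5 scale from the judge prompt):

```python
from statistics import mean

def judge_calibration(judge_scores, human_scores, tolerance=1):
    """Compare LLM-judge scores to human scores on the same outputs.

    Reports the mean absolute gap and the fraction of cases where the
    judge lands within `tolerance` points of the human rating.
    """
    gaps = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        "mean_abs_gap": mean(gaps),
        "agreement_rate": sum(g <= tolerance for g in gaps) / len(gaps),
    }

judge = [4, 5, 3, 2, 4, 5]
human = [4, 4, 3, 4, 3, 5]
report = judge_calibration(judge, human)
print(report)
```

If the agreement rate drifts down over successive calibration rounds, that is your signal to rewrite the judge prompt, not to distrust the humans.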
A/B Testing LLM Features
A/B testing AI features is harder than A/B testing traditional features because the output varies even within the same treatment.
What to Test
- Model versions: GPT-4o vs GPT-4o-mini — is the quality difference worth the cost?
- Prompt versions: Does the new prompt actually perform better?
- System architecture: Does re-ranking improve user satisfaction?
- Parameters: Temperature 0 vs 0.3 for your specific use case
How to Measure
For AI features, I focus on downstream user metrics, not AI quality metrics:
- Did the user accept the AI suggestion?
- Did the user edit the AI output before using it?
- Did the user retry or regenerate?
- Did the user accomplish their goal faster?
- Did the user report the output as wrong?
```python
def log_ai_interaction(user_id, variant, output, user_action):
    event = {
        "user_id": user_id,
        "variant": variant,  # "prompt_v2" or "prompt_v3"
        "output_length": len(output),
        "latency_ms": output.latency_ms,
        "user_accepted": user_action == "accept",
        "user_edited": user_action == "edit",
        "user_rejected": user_action == "reject",
        "user_regenerated": user_action == "regenerate",
    }
    analytics.track("ai_interaction", event)
```
Statistical Significance
LLM output variance makes it tempting to declare a winner too early. Resist this. I require:
- At least 1000 interactions per variant
- P-value < 0.05 on the primary metric
- Consistent directional effect across user segments
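For a binary metric like acceptance rate, a standard two-proportion z-test covers the significance check. A sketch using only the standard library (the counts are made up for illustration):

```python
from math import sqrt, erfc

def two_proportion_ztest(accepts_a, n_a, accepts_b, n_b):
    """Two-sided z-test for a difference in acceptance rates."""
    p_a, p_b = accepts_a / n_a, accepts_b / n_b
    pooled = (accepts_a + accepts_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p from the normal CDF
    return z, p_value

# 54.0% vs 59.0% acceptance over 1000 interactions each
z, p = two_proportion_ztest(accepts_a=540, n_a=1000, accepts_b=590, n_b=1000)
print(f"z={z:.2f}, p={p:.4f}")
```

Note that even this clearly visible 5-point difference needs the full 1000 interactions per variant to clear p < 0.05, which is why declaring winners early is so tempting and so wrong.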
Regression Testing for Prompts
Every prompt change can break things. Regression testing catches this before users do.
The Workflow
```shell
# 1. Change the prompt

# 2. Run the eval suite
python run_eval.py --prompt-version v2.4 --dataset eval_v3.json

# 3. Compare against baseline
python compare_evals.py --baseline v2.3 --candidate v2.4

# Output:
# Metric         | v2.3   | v2.4   | Change
# accuracy       | 0.923  | 0.931  | +0.8%  ✅
# faithfulness   | 0.891  | 0.872  | -1.9%  ⚠️
# schema_valid   | 0.998  | 0.999  | +0.1%  ✅
# avg_latency_ms | 1240   | 1180   | -4.8%  ✅
# cost_per_query | $0.012 | $0.014 | +16.7% ⚠️
```
Regression tests should run automatically on every prompt change. I trigger them in CI when prompt files are modified.
The “No Worse” Rule
My bar for shipping a prompt change: it must be no worse than the current version on any metric, and meaningfully better on at least one. If faithfulness drops even slightly, that’s a blocker.
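The rule is easy to encode as a gate in the comparison script. A sketch (function and metric names are illustrative, and `min_improvement` is an arbitrary threshold you would tune):

```python
def gate_prompt_change(baseline: dict, candidate: dict,
                       higher_is_better=("accuracy", "faithfulness",
                                         "schema_valid"),
                       min_improvement=0.005):
    """'No worse' rule: block on any regression, require one real gain."""
    regressions, improvements = [], []
    for metric in higher_is_better:
        delta = candidate[metric] - baseline[metric]
        if delta < 0:
            regressions.append(metric)
        elif delta >= min_improvement:
            improvements.append(metric)
    return {
        "ship": not regressions and bool(improvements),
        "regressions": regressions,
        "improvements": improvements,
    }

# The v2.3 -> v2.4 numbers from the comparison above: blocked,
# because faithfulness regressed even though accuracy improved.
decision = gate_prompt_change(
    baseline={"accuracy": 0.923, "faithfulness": 0.891, "schema_valid": 0.998},
    candidate={"accuracy": 0.931, "faithfulness": 0.872, "schema_valid": 0.999},
)
print(decision)
```

Wire this into CI as the pass/fail condition and the rule enforces itself.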
Continuous Monitoring in Production
Evaluation doesn’t stop at deployment. In production, I monitor:
Real-Time Signals
- Error rate: Parse failures, API timeouts, empty responses
- Latency distribution: P50, P95, P99 — sudden spikes indicate problems
- Token usage: Unexpected spikes mean something changed in inputs or prompts
- Schema compliance: Percentage of outputs that parse correctly
Sampled Quality Checks
You can’t human-review every output, but you can review a sample.
```python
import random

SAMPLE_RATE = 0.02  # Review 2% of production outputs

def should_sample(output) -> bool:
    if output.confidence < 0.7:
        return True  # Always review low-confidence outputs
    if output.latency_ms > 5000:
        return True  # Always review slow responses
    return random.random() < SAMPLE_RATE

def log_for_review(input_data, output):
    if should_sample(output):
        review_queue.add({
            "input": input_data,
            "output": output.text,
            "metadata": output.metadata,
            "auto_scores": run_auto_metrics(output),
        })
```
Drift Detection
Model behavior changes over time — both from model updates (if you’re on a managed API) and from input distribution shifts.
I run my full eval suite weekly against production and compare to the baseline. If accuracy drops more than 2 percentage points, an alert fires.
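That weekly check reduces to a comparison against a frozen baseline with a percentage-point threshold. A sketch (function name is mine; 2pp matches the alert rule above):

```python
def check_drift(baseline: dict, weekly: dict, threshold_pp=2.0):
    """Flag metrics that dropped more than threshold_pp percentage
    points versus the frozen baseline eval run."""
    alerts = []
    for metric, base_value in baseline.items():
        drop_pp = (base_value - weekly.get(metric, 0.0)) * 100
        if drop_pp > threshold_pp:
            alerts.append(f"{metric} dropped {drop_pp:.1f}pp vs baseline")
    return alerts

alerts = check_drift(
    baseline={"accuracy": 0.93, "faithfulness": 0.91},
    weekly={"accuracy": 0.90, "faithfulness": 0.905},
)
print(alerts)  # accuracy fell 3.0pp, faithfulness only 0.5pp
```

When an alert fires, check the model changelog and your input distribution before touching the prompt: drift usually starts upstream of you.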
Cost of Evaluation vs Cost of Failure
Teams skip evaluation because it’s expensive. Let me reframe that.
Running an eval suite of 200 examples on GPT-4o costs about $2-5. Running it weekly costs $100-250 per year.
One wrong answer shown to a customer at the wrong time can cost you:
- A churned enterprise contract ($50K+)
- A compliance violation ($10K+ in fines)
- A viral social media post ($priceless in negative brand impact)
- Engineering time to firefight ($5K+ in labor costs)
The math is obvious. Evaluation is the cheapest insurance you can buy for AI features.
If you take nothing else from this guide: build an eval dataset of 50 examples before you ship. Run it before every prompt change. It takes one afternoon to set up and will save you weeks of debugging over the lifetime of the feature.
The Evaluation Maturity Model
Where are you on this scale?
| Level | Description | Characteristics |
|---|---|---|
| 0 - Vibes | Manual spot-checking | “Looks good to me” |
| 1 - Basic | Eval dataset exists | 50+ examples, run manually |
| 2 - Automated | CI runs evals on prompt changes | Regression testing, automated metrics |
| 3 - Comprehensive | Multiple metric types, LLM judge + human review | A/B testing, sampled quality checks |
| 4 - Mature | Continuous monitoring, drift detection, eval dataset evolves | Weekly eval runs, automated alerting |
Most teams I’ve seen are at Level 0. Getting to Level 1 takes an afternoon. Getting to Level 2 takes a week. The jump from Level 0 to Level 2 is the highest-ROI investment you can make in an AI feature.
Do the work. Your users — and your future self at 2 AM debugging a production issue — will thank you.