Prompt Evaluation

Evaluation: How to Test If Your Prompt Is Actually Working

Most people “evaluate” prompts by vibes: This answer looks good… I think? That’s fine for casual use. But if you’re shipping prompts into a workflow—sales emails, support macros, RAG assistants, SQL agents—you need something sturdier.

Prompt evaluation is simply the practice of answering one question:

Does this prompt reliably produce the outcome I want across realistic inputs?

Let’s make that measurable.

The Goal

A prompt is “working” when it performs well across a small but representative set of inputs—not just the one example you tested.

Define Success First

Before you test anything, define what “good” means. Pick 3–6 criteria max.

Common criteria (mix and match):

Accuracy (facts match the input / retrieved sources)
Completeness (covers required fields)
Format (valid JSON/table/email structure)
Tone (professional, friendly, firm)
Safety/compliance (no sensitive data, no disallowed content)
Latency/cost (short enough to run at scale)

Engineers: treat this like an API contract. Non-technical teams: treat this like a quality checklist.

Build a Tiny Eval Set (10–30 items)

You don’t need 10,000 examples. You need the right 10–30 examples.

Include:

Happy paths (typical cases)
Edge cases (missing info, ambiguity, long inputs)
Adversarial cases (prompt injection-like content, hostile tone, weird formatting)

If your prompt is for customer support, grab 10 real tickets. If it’s for SQL generation, collect 10 real questions + schemas. If it’s for recruiting outreach, collect 10 candidate profiles and roles.

Choose a Scoring Method

Use one of these (or combine them):

Checklist scoring (human)
Quick: mark criteria as pass/fail.
Rubric scoring (1–5)
More nuance for tone or quality.
Programmatic checks (automated)
Validate JSON schema, word count, presence of required fields, banned phrases, etc.
LLM-as-a-judge (carefully)
Useful for tone and coherence, but still calibrate with human review.

Example 1: Non-Technical Eval (Sales Email Prompt)

You want a prompt that drafts outreach emails. Don’t test it once—test it across personas.

Text

Context: You are evaluating a sales outreach prompt for consistency.
Instruction: Create an eval set of 12 prospects (different industries) and score outputs.
Input Data:
Criteria checklist:
- Mentions the prospect’s role and company context (1 point)
- One clear value prop (1 point)
- CTA is a single question (1 point)
- Under 130 words (1 point)
- No fluff or hype ("revolutionary", "game-changing") (1 point)
Output Indicator:
Run the prompt on all 12 prospects and output a table:
Prospect | Score (0–5) | Fail Reasons | Revised Prompt Suggestion (if score <4)

Even if you don’t automate it, this gives you a repeatable process—and a clear threshold for “ship it.”

Example 2: Technical Eval (Structured Output + Guardrails)

You’re building a prompt that extracts entities into JSON. Great—now test validity.

Text

Context: You are evaluating an extraction prompt used in a production pipeline.
Instruction: Test the prompt on 15 inputs and measure format reliability.
Input Data:
Target JSON schema:
{ "customer": string, "issue": string, "priority": "low"|"medium"|"high", "next_steps": string[] }
Eval cases include:
- short ticket
- long ticket with signature blocks
- ambiguous priority
- injection-like text inside the ticket
Output Indicator:
For each input, return:
1) JSON_VALID (true/false)
2) SCHEMA_VALID (true/false)
3) If false, show the minimal fix to the prompt that would prevent the failure.
Also report overall pass rates.

This aligns with how engineers actually ship prompts: define schema, run a batch, compute pass rates, iterate.

Track Failures, Not Wins

The fastest way to improve prompts is to categorize failures: missing fields, wrong tone, hallucinations, invalid format, ignored constraints. Fix the top 1–2 failure modes first.

Takeaway

If you want to know whether a prompt “works,” stop judging single outputs and start running lightweight evaluations. Define success criteria, build a small eval set, score outputs systematically, and iterate based on failure patterns. Prompting becomes dramatically easier when you treat prompts like products: test, measure, refine.