12 Apr 2026/4 min

Evals before features.

An agent without evals is a liability. Here's the smallest possible eval harness I ship before letting any LLM touch a customer.

Most AI products I see in the wild ship with two prompts, a vibe check, and a prayer. It works in demos. It breaks in production. And by the time it breaks, nobody can explain why.

The fix is unglamorous: build the eval harness first. Not a fancy one — a CSV of inputs, the expected behavior, and a script that runs the agent across all of them and tells you the pass rate. Twenty rows is enough to start.

Once that exists, every prompt change, model swap, or tool addition has a number attached to it. Suddenly you're not arguing about whether the new prompt is better — you're looking at 78% vs 84%.

This is the single highest-leverage habit I've picked up. It turns AI development from a taste contest into engineering.

← All writing