Agents that do work, not demos

The gap between an impressive AI demo and an agent you'd trust with real operations.

Every week there's a new agent demo that looks like magic. Most of them would not survive a Tuesday in a real operation. The gap between the two isn't model quality — it's everything wrapped around the model.

A demo succeeds once, on a happy path, with someone watching. An operational agent has to succeed on the hundredth run, on inputs nobody anticipated, with no one watching, and it has to fail safely when it can't. Those are different engineering problems wearing the same costume.

What closes the gap, in my experience: tight scope (an agent that does one job well beats one that does ten jobs unreliably), real tools instead of free-text guessing, guardrails that make the expensive mistakes impossible rather than merely discouraged, and logs you can actually read when something goes sideways at 2am.
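To make that list concrete, here is a minimal sketch of what a guardrail enforced in code plus a readable log line might look like. Everything named here is a hypothetical stand-in for illustration: the `issue_refund` tool, the `REFUND_LIMIT_CENTS` cap, and the `run_tool` dispatcher are not from any particular framework. The point is only that the limit lives in the tool, not in the prompt, so the model cannot talk its way past it.

```python
import json
import logging

logger = logging.getLogger("agent")

# Hypothetical hard cap: the expensive mistake is blocked in code,
# not merely discouraged in a prompt.
REFUND_LIMIT_CENTS = 5_000


class GuardrailError(Exception):
    """Raised when a requested action is outside what the agent may do."""


def issue_refund(order_id: str, amount_cents: int) -> dict:
    """Hypothetical tool: one narrow job with typed, validated inputs."""
    if not isinstance(amount_cents, int) or amount_cents <= 0:
        raise GuardrailError("amount_cents must be a positive integer")
    if amount_cents > REFUND_LIMIT_CENTS:
        raise GuardrailError(
            f"refund of {amount_cents} exceeds limit of {REFUND_LIMIT_CENTS}"
        )
    # A real implementation would call the payment system here.
    return {"order_id": order_id, "refunded_cents": amount_cents}


# Tight scope: the agent can only call what is registered here.
TOOLS = {"issue_refund": issue_refund}


def run_tool(name: str, args: dict) -> dict:
    """Dispatch a tool call requested by the model and log one JSON line."""
    record = {"tool": name, "args": args}
    try:
        tool = TOOLS.get(name)
        if tool is None:
            raise GuardrailError(f"unknown tool: {name}")
        record.update(status="ok", result=tool(**args))
    except (GuardrailError, TypeError) as exc:
        # TypeError covers missing or unexpected arguments from the model.
        record.update(status="refused", reason=str(exc))
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

A call such as `run_tool("issue_refund", {"order_id": "A1", "amount_cents": 999_999})` comes back with `status` set to `"refused"` and a reason string, rather than an exception that takes the run down. That is one way to fail safely, and the single JSON line per call is the kind of log that stays readable at 2am.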

The unglamorous truth is that a reliable agent looks a lot like reliable software — bounded, observable, and boring in the best way. The intelligence is the cheap part now; the harness around it is the work.

I'd rather ship an agent that does three things I can trust than one that does thirty I have to babysit. The first one buys back time. The second one is a demo with a salary.
