New Blog! Evaluate Your Agents!
Most organizations treat AI agents like overpriced Roombas: flip the switch, cross your fingers, and pretend everything's under control. But unlike Roombas—which just bump into your couch until the battery dies—your AI agents are making business decisions, handling customer interactions, and potentially torching your reputation. Right now, as you read this, do you honestly know what your agents are up to? Are they improving, or just repeatedly bumping into metaphorical walls? It’s the same as asking, “Why did the Roomba miss the dirt in this corner?” If you never ask, and never use the answer to redesign the living room for optimal Roomba coverage, your floor is going to stay gross.
Lev Shestov, a philosopher who loved making people uncomfortable, wrote: “The business of philosophy is to teach man to live in uncertainty… not to reassure him, but to upset him.” He probably wasn't thinking about your chatbot’s existential crises, but his insight applies perfectly. If deploying AI doesn't give you at least mild anxiety, you're probably not paying attention.
Here's a little-known truth: your AI agents already behave like optimization algorithms. Every time they tackle a task, assess the outcome, and tweak their approach, they're running tiny optimization loops—much like the training cycles of a machine learning model. Yet most organizations still deploy these agents with about as much rigor as launching paper airplanes. Seriously, would you launch a classification algorithm without assessing its performance on a hold-out set? (On second thought, don’t answer that, I don’t want to know.)
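(If you need a refresher on what that baseline rigor looks like, here's a minimal sketch with scikit-learn; the dataset and model are stand-ins for whatever you'd actually ship.)

```python
# A minimal sketch of the hold-out evaluation any classifier gets before launch.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: swap in your own features and labels.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# The launch decision rests on hold-out performance, not training vibes.
print(classification_report(y_test, model.predict(X_test)))
```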
If you actually want results (and I'm assuming you're paying these cloud bills for a reason), it's time to treat your agents like the optimizers they secretly are. Here’s how:
Models have loss functions—your agents need clear, measurable goals, not vague suggestions like "be helpful" or "improve efficiency."
Instead, try:
"Resolve 90% of inquiries without bothering human employees."
"Identify revenue opportunities similar to the five most recent closed/won opps."
Clear objectives get your engineers, product people, and business stakeholders aligned. Vague goals just lead to vague results—and probably vague disappointment.
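To make this concrete, here's a rough sketch of what turning that 90% target into an automated check could look like; the transcript record format below is a made-up assumption, not a prescription.

```python
# A sketch of turning "resolve 90% of inquiries without bothering humans"
# into a measurable, automated check. The record fields are hypothetical;
# the only assumption is that each inquiry logs whether the agent escalated.
from dataclasses import dataclass

@dataclass
class InquiryRecord:
    inquiry_id: str
    escalated_to_human: bool
    resolved: bool

DEFLECTION_TARGET = 0.90  # the stated objective, not a vague "be helpful"

def meets_deflection_target(records: list[InquiryRecord]) -> bool:
    """True if the share of inquiries resolved without escalation hits the target."""
    if not records:
        return False
    deflected = sum(1 for r in records if r.resolved and not r.escalated_to_human)
    return deflected / len(records) >= DEFLECTION_TARGET

# Example: 3 of 4 inquiries handled end-to-end is 75%, below target, so the check fails.
sample = [
    InquiryRecord("a", escalated_to_human=False, resolved=True),
    InquiryRecord("b", escalated_to_human=False, resolved=True),
    InquiryRecord("c", escalated_to_human=True, resolved=True),
    InquiryRecord("d", escalated_to_human=False, resolved=True),
]
print(meets_deflection_target(sample))  # False
```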
When training models, gradients guide improvement. For your agents, those gradients come from APIs, databases, and knowledge bases: signals that dive deep into your data and sit well outside the regime of the foundation model's training data. But if your agents keep making the same mistakes, it's like they're blindly following broken GPS directions.
Get more nuanced than asking “did the bot do the thing?”:
"Are agents repeatedly asking the same dumb questions?"
"Is crucial data missing or embarrassingly outdated?"
"Are there expensive tools our agents completely ignore—and should we kill those subscriptions?"
Identify these patterns, fix the source, and watch performance go up.
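Here's a rough sketch of what that pattern-mining could look like over agent traces; the trace fields and tool names are assumptions, so adapt them to whatever your framework actually logs.

```python
# A sketch of mining agent traces for the patterns above: repeated questions
# and paid-for tools that never get called. The trace format is an assumption.
from collections import Counter

traces = [
    {"question": "What is the customer's plan tier?", "tools_called": ["crm_lookup"]},
    {"question": "What is the customer's plan tier?", "tools_called": ["crm_lookup"]},
    {"question": "What is the customer's plan tier?", "tools_called": []},
    {"question": "When does the contract renew?", "tools_called": ["crm_lookup"]},
]
available_tools = {"crm_lookup", "billing_api", "knowledge_base_search"}

# 1. Are agents repeatedly asking the same questions?
repeat_questions = {q: n for q, n in Counter(t["question"] for t in traces).items() if n > 1}

# 2. Are there expensive tools the agents completely ignore?
used_tools = {tool for t in traces for tool in t["tools_called"]}
ignored_tools = available_tools - used_tools

print("Asked more than once:", repeat_questions)
print("Paid for but never used:", ignored_tools)
```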
Endless loops might be fun in philosophy, but they’re disastrous in business. Clearly define when your agents should stop:
When to escalate to humans.
When to stop banging their digital heads against the wall.
When to declare victory.
Agents without boundaries are like interns trapped in endless meetings—burning resources without accomplishing anything useful.
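As a sketch, boundaries like these can live right in the agent loop; `run_step`, `is_resolved`, and `escalate_to_human` below are placeholders for your own agent call, success check, and human hand-off path.

```python
# A sketch of explicit stopping conditions wrapped around an agent step.
# `run_step`, `is_resolved`, and `escalate_to_human` are placeholders.
MAX_ITERATIONS = 8       # when to stop banging the digital head against the wall
CONFIDENCE_FLOOR = 0.4   # below this, a human should take over

def run_agent(task, run_step, is_resolved, escalate_to_human):
    result = None
    for _ in range(MAX_ITERATIONS):
        result = run_step(task)
        if is_resolved(result):                      # when to declare victory
            return result
        if result["confidence"] < CONFIDENCE_FLOOR:  # when to escalate to humans
            return escalate_to_human(task, result)
    return escalate_to_human(task, result)           # budget exhausted: stop looping

# Toy usage with stubbed-in callables, just to show the shape of the loop.
if __name__ == "__main__":
    outcome = run_agent(
        task="categorize this support ticket",
        run_step=lambda task: {"answer": "billing", "confidence": 0.55},
        is_resolved=lambda result: result["confidence"] >= 0.9,
        escalate_to_human=lambda task, result: {"answer": "handed off", "confidence": None},
    )
    print(outcome)
```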
Stop celebrating isolated victories. Evaluate your agents over multiple iterations. The real question isn't, "Did it succeed once?" It's, "Is it consistently improving, or just occasionally lucky?"
Track trends over time:
Success rates or agent effectiveness: Are the agents completing tasks as assigned and adding value over generic LLM results? Can you capture this with concrete metrics like precision and recall, or via human evaluation?
Business impact: Are you learning to ask better questions using the agents? What is the effect on revenue or customer satisfaction?
Efficiency: Is agent performance improving? Are tasks completing faster, or with less load on the various integration points (which are never free, by the way)?
This long-term view proves your agents are optimizing instead of just stumbling into random successes. Exercise for the reader: it’s also crucial to track the data environment the agents were operating in at the time.
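Here's a minimal sketch of that long-term view: success rate per evaluation run, compared across runs rather than celebrated once. The numbers are invented for illustration.

```python
# A sketch of tracking success rate across evaluation runs instead of
# celebrating one good outcome. The outcomes below are made up.
from statistics import mean

# One entry per evaluation run: a list of task outcomes (True = success).
runs = [
    [True, False, False, True, False],   # week 1
    [True, False, True, True, False],    # week 2
    [True, True, True, True, False],     # week 3
]

rates = [mean(outcomes) for outcomes in runs]
print("Success rate per run:", [f"{r:.0%}" for r in rates])

# Improving means later runs beat earlier ones, not that one run got lucky.
trending_up = all(later >= earlier for earlier, later in zip(rates, rates[1:]))
print("Consistently improving:", trending_up)
```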
Deploying AI without systematic evaluation isn't bold innovation—it's gambling with your company's reputation. Your agents have incredible potential, but only if you treat them like the optimizers they secretly are.
Next time someone asks what your agents are doing, you'll confidently reply:
"They're systematically improving, iteration by iteration, not just bumping into expensive furniture. And yes, I can prove it."
That's the difference between hoping your AI gets lucky and ensuring it actually delivers.