The Agent Stack #009 — Wednesday Stack
Building reliable AI agents means your tools can’t crash when the LLM does something unexpected. Which happens constantly.
I’ve been testing ToolGuard, the “pytest for AI agent tool calls” that launched this week. The premise is simple: fuzz your Python tool functions with edge cases before your agent calls them in production. Missing JSON keys, type mismatches, 10MB payloads, null values — all the creative ways LLMs break your assumptions.
Why This Matters
Most agent failures aren’t model quality issues. They’re system-level problems. Your agent calls get_weather(location="London") but the LLM passes {"city": "London", "country": "UK"} instead. Your function expects a string, gets a dict, crashes.
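Here's that failure mode in miniature. The weather string is a stub invented for illustration; the point is the crash, not the forecast:

```python
# A tool function written assuming a clean string argument:
def get_weather(location: str) -> str:
    # .title() exists on str but not on dict, so a dict argument crashes here
    return f"Weather for {location.title()}: 18C, cloudy (stubbed)"

print(get_weather(location="London"))  # the happy path the developer tested

try:
    # The shape the LLM actually produced:
    get_weather(location={"city": "London", "country": "UK"})
except AttributeError as exc:
    print(f"tool crashed: {exc}")
```

The happy path works in every manual test; the dict-shaped argument only shows up once a model is in the loop.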
Traditional unit tests don’t catch this because you write tests assuming correct input. ToolGuard generates adversarial inputs that mirror real LLM behaviour patterns. It found 23 edge cases in my email-sending function that I’d never considered.
The tool generates a reliability score out of 100. My “robust” file processor scored 34/100 before fixes. Humbling.
How It Works
Install with pip install toolguard. Decorate your agent functions with @toolguard.fuzz. Run toolguard test and watch it pummel your code with malformed inputs.
import toolguard

@toolguard.fuzz
def send_email(recipient: str, subject: str, body: str):
    # Your function here
    ...
The fuzzer generates hundreds of variations: empty strings, massive payloads, wrong types, missing fields if you’re expecting dicts. It logs failures and suggests fixes.
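ToolGuard's generator isn't open for inspection here, but the core idea — deriving type-confused variants from a function's signature — can be sketched in a few lines. Everything below is illustrative, not ToolGuard's actual internals:

```python
import inspect

# Illustrative malformed values: empty string, huge payload, wrong types
MALFORMED = ["", "x" * 1_000_000, None, 42, ["a", "b"], {"key": "value"}]

def fuzz_cases(func):
    """Yield kwargs dicts, each replacing one argument with a malformed value."""
    params = list(inspect.signature(func).parameters)
    baseline = {p: "ok" for p in params}
    for p in params:
        for bad in MALFORMED:
            case = dict(baseline)
            case[p] = bad
            yield case

def send_email(recipient: str, subject: str, body: str):
    return f"sent to {recipient.strip()}"

failures = []
for case in fuzz_cases(send_email):
    try:
        send_email(**case)
    except Exception as exc:
        failures.append((case, type(exc).__name__))

print(f"{len(failures)} of {len(MALFORMED) * 3} cases crashed")
```

Even this toy version finds the obvious hole: `recipient.strip()` blows up on anything that isn't a string, while `subject` and `body` get a free pass only because the stub never touches them.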
Better than finding out your agent crashed at 3am because Claude decided to pass a list instead of a string.
The Competition
Onprem launched a similar tool this week: sandboxed agent execution. It's more comprehensive but heavier: full Docker isolation versus ToolGuard's function-level testing.
For quick validation of individual tools, ToolGuard wins on simplicity. For production agent deployment, Onprem’s broader sandbox approach makes more sense.
Neither addresses the elephant in the room: most agent tools fail because of external API changes, not input validation. Your weather API suddenly requires API keys, or starts returning a different JSON structure. No amount of input fuzzing catches that.
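What does catch it is a contract check on the response before your tool logic touches it. A minimal sketch, assuming a hypothetical weather payload (the field names are invented for illustration):

```python
# Expected response shape for a hypothetical weather API
EXPECTED_FIELDS = {"location": str, "temp_c": (int, float), "conditions": str}

def check_weather_contract(payload: dict) -> list[str]:
    """Return a list of contract violations; empty means the shape still holds."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}: {type(payload[field]).__name__}")
    return problems

# Yesterday's response shape passes...
print(check_weather_contract(
    {"location": "London", "temp_c": 18.0, "conditions": "cloudy"}))
# ...today the provider renamed a field, and the check flags it before the crash.
print(check_weather_contract(
    {"location": "London", "temperature": {"value": 18}, "conditions": "cloudy"}))
```

Run a check like this on a schedule against the live API and you find out about the breakage before your agent does.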
Quick Hits
• DashClaw adds human-in-the-loop approval before agent actions execute — smart for high-stakes operations
• Agent protocol translator tackles the interoperability mess between different agent frameworks
• QCCBot provides browser-based Android environments for AI agents — clever alternative to phone farms
One Thing to Try
Pick your most critical agent tool function. Run ToolGuard against it. The failures will surprise you.
Because the best time to find edge cases is before your users do.