The Agent Stack #009 — Wednesday Stack


Building reliable AI agents means your tools can’t crash when the LLM does something unexpected. Which happens constantly.

I’ve been testing ToolGuard, the “pytest for AI agent tool calls” that launched this week. The premise is simple: fuzz your Python tool functions with edge cases before your agent calls them in production. Missing JSON keys, type mismatches, 10MB payloads, null values — all the creative ways LLMs break your assumptions.

Why This Matters

Most agent failures aren’t model quality issues. They’re system-level problems. Your tool signature is get_weather(location="London"), but the LLM sends {"city": "London", "country": "UK"} — so your function either gets unexpected keyword arguments or a dict where it wanted a string, and crashes.
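A cheap first line of defence is to normalize arguments before they reach your function. Here’s a minimal sketch — the alias list and coercions are illustrative guesses at common LLM mis-namings, not part of any library:

```python
def normalize_weather_args(raw: dict) -> dict:
    """Map common LLM mis-namings onto the expected schema."""
    # Hypothetical aliases an LLM might use instead of "location"
    aliases = ("location", "city", "place", "query")
    location = next((raw[k] for k in aliases if k in raw), None)
    # LLMs sometimes nest values: {"location": {"city": "London"}}
    if isinstance(location, dict):
        location = location.get("city") or location.get("name")
    if not isinstance(location, str) or not location.strip():
        raise ValueError(f"could not recover a location from {raw!r}")
    return {"location": location.strip()}

print(normalize_weather_args({"city": "London", "country": "UK"}))
# → {'location': 'London'}
```

Rejecting unrecoverable input with a clear error is the point: a loud ValueError in your logs beats a silent crash deep inside the tool.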

Traditional unit tests don’t catch this because you write tests assuming correct input. ToolGuard generates adversarial inputs that mirror real LLM behaviour patterns. It found 23 edge cases in my email-sending function that I’d never considered.
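You can approximate that idea by hand: take one known-good payload and mutate it the ways LLMs actually misbehave. A rough sketch — this mutation list is my own guess at common failure modes, not ToolGuard’s corpus:

```python
import copy

def llm_style_mutations(payload: dict) -> list[dict]:
    """Generate adversarial variants of a known-good tool-call payload."""
    variants = []
    for key in payload:
        v = copy.deepcopy(payload); v.pop(key); variants.append(v)        # missing field
        v = copy.deepcopy(payload); v[key] = None; variants.append(v)     # null value
        v = copy.deepcopy(payload); v[key] = [payload[key]]; variants.append(v)  # wrong type
        v = copy.deepcopy(payload); v[key] = str(payload[key]); variants.append(v)  # stringified value
    # One oversized payload: LLMs occasionally dump huge strings into a field
    v = copy.deepcopy(payload)
    v[next(iter(payload))] = "x" * 1_000_000
    variants.append(v)
    return variants

print(len(llm_style_mutations({"recipient": "a@b.com", "subject": "hi"})))
# → 9 variants from a two-field payload
```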

The tool generates a reliability score out of 100. My “robust” file processor scored 34/100 before fixes. Humbling.

How It Works

Install with pip install toolguard. Decorate your agent functions with @toolguard.fuzz. Run toolguard test and watch it pummel your code with malformed inputs.

@toolguard.fuzz
def send_email(recipient: str, subject: str, body: str):
    ...  # your function here

The fuzzer generates hundreds of variations: empty strings, massive payloads, wrong types, missing fields when your function expects a dict. It logs failures and suggests fixes.
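The loop-and-score idea is easy to prototype yourself. A toy harness, assuming nothing about ToolGuard’s internals — the score formula here is just the obvious pass ratio:

```python
def fuzz_harness(fn, cases):
    """Call fn(**case) for each case; collect crashes and report a 0-100 score."""
    failures = []
    for case in cases:
        try:
            fn(**case)
        except Exception as e:  # TypeError (bad/missing args) or crashes inside fn
            failures.append((case, e))
    score = round(100 * (len(cases) - len(failures)) / len(cases))
    return score, failures

def send_email(recipient: str, subject: str, body: str):
    if not isinstance(recipient, str) or "@" not in recipient:
        raise ValueError("bad recipient")

cases = [
    {"recipient": "a@b.com", "subject": "hi", "body": ""},        # well-formed
    {"recipient": None, "subject": "hi", "body": ""},             # null value
    {"subject": "hi", "body": ""},                                # missing field
    {"recipient": ["a@b.com"], "subject": "hi", "body": ""},      # wrong type
]
score, failures = fuzz_harness(send_email, cases)
print(score)  # → 25: only the well-formed case survives
```

A real fuzzer generates far more cases, but even four hand-written ones expose the gap between “has type hints” and “actually validates”.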

Better than finding out your agent crashed at 3am because Claude decided to pass a list instead of a string.

The Competition

Onprem launched a similar tool this week, built around sandboxed agent execution. It’s more comprehensive but heavier — full Docker isolation versus ToolGuard’s function-level testing.

For quick validation of individual tools, ToolGuard wins on simplicity. For production agent deployment, Onprem’s broader sandbox approach makes more sense.

Neither addresses the real elephant: most agent tools fail because of external API changes, not input validation. Your weather API suddenly requires API keys, or returns a different JSON structure. No amount of fuzzing catches that.
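What fuzzing can’t catch, a boundary check can at least surface loudly. A small sketch of validating an upstream response before your tool trusts it — the field names and types are hypothetical:

```python
def check_weather_response(data: dict) -> dict:
    """Fail fast if the upstream API's shape drifts from what we coded against."""
    expected = {"temp_c": (int, float), "condition": str}
    missing = [k for k in expected if k not in data]
    if missing:
        raise RuntimeError(f"weather API contract drift: missing {missing}")
    for key, types in expected.items():
        if not isinstance(data[key], types):
            raise RuntimeError(
                f"weather API contract drift: {key} is {type(data[key]).__name__}"
            )
    return data

check_weather_response({"temp_c": 12.5, "condition": "rain"})  # passes quietly
```

It won’t prevent the drift, but “contract drift: missing ['temp_c']” in your error tracker beats a KeyError three calls deep at 3am.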

Quick Hits

DashClaw adds human-in-the-loop approval before agent actions execute — smart for high-stakes operations

Agent protocol translator tackles the interoperability mess between different agent frameworks

QCCBot provides browser-based Android environments for AI agents — clever alternative to phone farms

One Thing to Try

Pick your most critical agent tool function. Run ToolGuard against it. The failures will surprise you.

Because the best time to find edge cases is before your users do.