The Agent Stack #041 — Wednesday Stack


Microsoft just dropped something that every agent builder needs: a testing framework that doesn’t suck. ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) lets you write AI behaviour tests in plain English, then automatically generates the evaluation logic.

I’ve been running ASSERT against our customer service agents for two weeks. The results are impressive, but there are sharp edges you need to know about.

ASSERT in Action

The concept is simple. Instead of writing complex evaluation scripts, you describe what good behaviour looks like:

"The agent should acknowledge the customer's frustration before offering solutions"
"Response time should be under 3 seconds for simple queries"  
"Never suggest cancelling when the customer mentions pricing concerns"

ASSERT converts these specs into executable tests using GPT-4 as the evaluator. The framework runs continuous regression testing as you update your agent prompts or switch models.
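
For a feel of what the regression step boils down to, here is a minimal Python sketch. It is not ASSERT's API: SpecResult and find_regressions are hypothetical names, and the verdicts are hard-coded where a real run would get them from the evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecResult:
    spec: str     # the plain-English spec text
    passed: bool  # verdict from whatever evaluator was used


def find_regressions(baseline: list[SpecResult], candidate: list[SpecResult]) -> list[str]:
    """Return specs that passed on the baseline run but fail on the candidate run."""
    passed_before = {r.spec for r in baseline if r.passed}
    return [r.spec for r in candidate if r.spec in passed_before and not r.passed]


# Hard-coded verdicts for illustration only.
baseline = [
    SpecResult("Acknowledge frustration before offering solutions", True),
    SpecResult("Never suggest cancelling on pricing concerns", True),
]
candidate = [
    SpecResult("Acknowledge frustration before offering solutions", False),
    SpecResult("Never suggest cancelling on pricing concerns", True),
]

regressions = find_regressions(baseline, candidate)
print(regressions)
```

Comparing against a stored baseline rather than just reading pass/fail keeps the signal focused on what changed after a prompt edit or model swap.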

The magic happens in the specification layer. ASSERT uses structured YAML files that business teams can actually read and modify. No more “trust me, the tests cover that scenario” conversations between engineering and product.
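
To make the spec layer concrete, here is a rough Python sketch of the shape of the idea. It does not reproduce ASSERT's YAML schema: the field names, run_specs, and stub_judge are all hypothetical, and the judge is a crude keyword check standing in for the LLM evaluator so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# A judge takes (spec description, agent reply) and returns True if the reply complies.
Judge = Callable[[str, str], bool]


@dataclass(frozen=True)
class BehaviourSpec:
    name: str         # hypothetical field names, not ASSERT's schema
    description: str  # the plain-English expectation


def run_specs(specs: list[BehaviourSpec], reply: str, judge: Judge) -> dict[str, bool]:
    """Evaluate one agent reply against every spec and collect verdicts by spec name."""
    return {spec.name: judge(spec.description, reply) for spec in specs}


def stub_judge(description: str, reply: str) -> bool:
    """Stand-in for an LLM evaluator: a keyword check, nothing more."""
    text = reply.lower()
    if "acknowledge" in description.lower():
        return any(phrase in text for phrase in ("sorry", "understand", "frustrat"))
    if "never suggest cancelling" in description.lower():
        return "cancel" not in text
    return True


specs = [
    BehaviourSpec(
        "empathy_first",
        "The agent should acknowledge the customer's frustration before offering solutions",
    ),
    BehaviourSpec(
        "no_cancel_on_pricing",
        "Never suggest cancelling when the customer mentions pricing concerns",
    ),
]

verdicts = run_specs(
    specs,
    "I understand how frustrating that is. Let's look at a cheaper plan.",
    stub_judge,
)
print(verdicts)
```

Keeping the judge as a plain callable means the evaluator model can be swapped without touching the specs, which is the property that makes the specs readable by non-engineers in the first place.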

Production Reality Check

The good: ASSERT caught three critical regressions that our manual testing missed. When we switched from GPT-4 to Claude 3.5 Sonnet, it immediately flagged that our appointment booking agent had become overly formal with returning customers.

The rough bits: Test execution is slow (30+ seconds per spec on complex scenarios) and expensive. Running our full test suite costs about £15 per run with GPT-4 as the evaluator. Microsoft suggests using GPT-4o-mini for cheaper evaluation, but accuracy drops noticeably.

The framework also struggles with edge cases involving context windows. Tests on agents that reference previous conversations or maintain state across multiple interactions often produce false positives.

Alternatives and Trade-offs

Weights & Biases’ Weave offers similar natural language testing but requires more setup. LangSmith has better observability but weaker test generation. Braintrust gives you more control over evaluation logic but demands traditional programming skills.

ASSERT hits the sweet spot for teams that want sophisticated testing without hiring ML engineers. The Azure AI Studio integration is seamless if you’re already in that ecosystem.

But here’s the key insight: ASSERT works best as part of a testing stack, not a replacement for everything. Use it for behaviour verification, but keep your unit tests for core functions and performance benchmarks for latency.
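
A latency target like "under 3 seconds for simple queries" is a good example of something better measured directly than judged by an LLM. A minimal sketch, assuming your agent can be called as a plain Python function; fake_agent and measure_latency are hypothetical names, and the sleep simply simulates work.

```python
import time
from typing import Callable


def measure_latency(agent: Callable[[str], str], query: str, runs: int = 5) -> float:
    """Return the slowest wall-clock time in seconds across several calls."""
    worst = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        agent(query)
        worst = max(worst, time.perf_counter() - start)
    return worst


def fake_agent(query: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    time.sleep(0.01)  # simulate a little work
    return f"Echo: {query}"


worst_case = measure_latency(fake_agent, "What are your opening hours?")
assert worst_case < 3.0, f"Simple query took {worst_case:.2f}s"
print(f"Worst case: {worst_case:.3f}s")
```

Taking the worst of several runs is a deliberately conservative choice; a percentile over a larger sample would be the more usual benchmark once you have real traffic to replay.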

Quick Hits

Scout Integration: Microsoft’s new Scout assistant (built on OpenClaw) includes ASSERT by default for testing custom workflows
Open Source: Full framework available on GitHub at microsoft/assert-eval with MIT licence
Cost Control: Set evaluation budgets in the config to avoid surprise bills like Uber’s £2M AI overspend
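
The exact config key isn't covered here, so as a framework-agnostic illustration, this is a generic budget guard in plain Python. EvaluationBudget and BudgetExceeded are hypothetical names, and the costs are made up.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next evaluation would push spend past the cap."""


class EvaluationBudget:
    """Tracks spend in whole pence to avoid floating-point drift."""

    def __init__(self, limit_pence: int) -> None:
        self.limit_pence = limit_pence
        self.spent_pence = 0

    def charge(self, cost_pence: int) -> None:
        """Record a cost, or raise if it would exceed the limit."""
        projected = self.spent_pence + cost_pence
        if projected > self.limit_pence:
            raise BudgetExceeded(
                f"{projected}p would exceed the {self.limit_pence}p cap"
            )
        self.spent_pence = projected


budget = EvaluationBudget(limit_pence=100)  # illustrative cap, not a recommendation
completed = 0
try:
    for cost in (40, 40, 40):  # made-up per-spec evaluation costs
        budget.charge(cost)
        completed += 1
except BudgetExceeded as exc:
    print(f"Stopped after {completed} evaluations: {exc}")
```

Checking before spending, rather than after, means the run stops short of the cap instead of overshooting it by one evaluation.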

One Thing to Try

Set up ASSERT with three basic specs for your main agent workflow this week. Start simple: response tone, factual accuracy, and safety guardrails. The 30-minute investment will save hours of manual testing later.

The best testing frameworks disappear into your workflow. ASSERT is getting there.