Tools

Anthropic's Fable 5 changes the game for agent builders

The Agent Stack #044: Wednesday Stack

Anthropic just dropped Claude Fable 5, and this isn’t another incremental model update. This is the first public Mythos-class model that actually works for building production agents. I’ve been testing Fable 5 against Claude 3.5 Sonnet for the past 48 hours across three different agent workflows. The results are striking. Fable 5 consistently handles multi-step reasoning tasks that would trip up previous models, particularly when dealing with ambiguous instructions or error recovery. ...

Microsoft's ASSERT testing framework is production-ready

The Agent Stack #041: Wednesday Stack

Microsoft just dropped something that every agent builder needs: a testing framework that doesn’t suck. ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) lets you write AI behaviour tests in plain English, then automatically generates the evaluation logic. I’ve been running ASSERT against our customer service agents for two weeks. The results are impressive, but there are sharp edges you need to know about. ...

BadHost vulnerability exposes agent infrastructure gaps

The Agent Stack #038: Wednesday Stack

The security incident everyone’s been waiting for just happened. A critical vulnerability called “BadHost” was discovered in Starlette, the Python web framework that powers millions of AI agents through FastAPI. The BadHost Reality Check: Starlette sees 325 million weekly downloads. That’s not a typo. When security researchers found CVE-2026-37284 (the formal designation for BadHost), they effectively identified a pathway into the majority of production agent deployments. ...

Capframe vs Enforra — agent security tooling lands

The Agent Stack #035: Wednesday Stack

Two competing agent security frameworks dropped on HN this week. Both tackle the same critical problem: how to safely grant AI agents permission to actually do things. Neither is production-ready yet, but they’re worth testing now. The timing isn’t coincidental. Google’s I/O showcased agents everywhere: Gemini Spark handling your calendar, AI agents monitoring your inbox, even Volvo’s EX60 using Gemini to read parking signs through external cameras. More capability means more attack surface. ...

Wix tested 250 agent evals. Skills don't always win.

The Agent Stack #032: Wednesday Stack

The agent tooling space loves to debate skills vs documentation. Wix Engineering just settled it with actual data. The Skills vs Docs Battle: Wix ran 250 evaluations comparing agent performance with two approaches: traditional documentation retrieval versus structured “skills” (function calls). The results? Skills won 67% of the time, but the gap wasn’t as wide as expected. Their test setup was proper production-grade. They used GPT-4 as the judge, tested across multiple domains (e-commerce, content management, user authentication), and measured both accuracy and execution time. Each eval ran against the same underlying API operations, just packaged differently. ...

OpenAI's GPT-5.5 Instant cuts hallucinations by half

The Agent Stack #029: Wednesday Stack

OpenAI just dropped GPT-5.5 Instant as ChatGPT’s new default model. The company claims it cuts hallucinations “significantly” across law, medicine, and finance whilst keeping the same low latency as GPT-4o. I’ve been testing it for 72 hours on production workloads. Here’s what actually changed. The Reality Check: OpenAI’s internal benchmarks show factuality improvements across their eval suites. But internal evals are like dating profiles: technically accurate but missing crucial context. ...

Red Hat's Tank OS makes AI agents actually safe

The Agent Stack #026: Wednesday Stack

Tank OS just landed and it’s the first enterprise-grade solution I’ve seen that actually addresses the “Claude deleted our database” problem. After five documented agent failures in 36 days (with zero self-detection), Red Hat’s new containerisation approach for OpenClaw deployments isn’t just timely; it’s essential. I’ve been testing Tank OS for three weeks in our staging environment. Here’s what actually works and what doesn’t. ...

Ravix agent runs on Claude subscriptions, no API keys

The Agent Stack #023: Wednesday Stack

The agent infrastructure game just shifted. While everyone’s building agents that burn through API credits faster than a Formula 1 car burns fuel, Ravix took a different approach. Subscription-Based Agent Infrastructure: Ravix runs on your existing Claude subscription instead of requiring separate API keys. Setup takes 60 seconds with a single command. The agent gets its own email address and starts listening for work from your Gmail immediately. ...

Chrome Skills turn prompts into production workflows

The Agent Stack #020: Wednesday Stack

Google just shipped Chrome Skills, and it’s the first browser-native agent tool that actually works in production. After testing it against 47 different workflows, I can tell you why this matters more than the flashier agent frameworks getting all the attention. Chrome Skills: The Agent Runtime We’ve Been Waiting For. Chrome Skills lets you save any Gemini prompt as a reusable “Skill” that runs across multiple tabs. Sounds simple. The implementation is brilliant. ...

Anthropic's Mythos finds bugs everywhere

The Agent Stack #017: Wednesday Stack

Anthropic just dropped their most aggressive AI model yet. Mythos isn’t for chatting about your weekend plans. It’s designed to break things. And it’s terrifyingly good at it. The Glasswing Project Reality Check: Anthropic partnered with Nvidia, Google, AWS, Apple, and Microsoft for Project Glasswing. The pitch? Use Mythos to find security vulnerabilities before the bad actors do. Early results are sobering. The model found exploitable bugs “in every major operating system and web browser” during initial testing. That’s Windows, macOS, Linux, Chrome, Safari, Firefox, the lot. ...