Wix tested 250 agent evals. Skills don't always win.

The Agent Stack #032 — Wednesday Stack

The agent tooling space loves to debate skills vs documentation. Wix Engineering just brought actual data to the argument.

The Skills vs Docs Battle

Wix ran 250 evaluations comparing agent performance with two approaches: traditional documentation retrieval versus structured “skills” (function calls). The results? Skills won 67% of the time, but the gap wasn’t as wide as expected.

Their test setup was production-grade. They used GPT-4 as the judge, tested across multiple domains (e-commerce, content management, user authentication), and measured both accuracy and execution time. Each eval ran against the same underlying API operations, just packaged differently.

The surprising finding: documentation performed better for complex, multi-step workflows. Skills excelled at simple, atomic operations. When agents needed to understand business logic or handle edge cases, well-written docs with examples beat rigid function schemas.

Their agent memory implementation used a hybrid approach. Short-term context stayed in the conversation thread. Long-term patterns got embedded into a vector store with 384-dimensional embeddings. They found that keeping the 5 most recent interactions was the sweet spot for context retention without token bloat.
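
To make the hybrid memory idea concrete, here is a minimal sketch, not Wix's implementation. It assumes a rolling window of 5 interactions for short-term context and a plain in-memory list standing in for the vector store. The names HybridMemory, toy_embed, recall, and build_context are hypothetical, and toy_embed is a hashed bag-of-words placeholder for a real 384-dimensional embedding model.

```python
import hashlib
import math
from collections import deque

EMBED_DIM = 384        # embedding size mentioned in the write-up
SHORT_TERM_TURNS = 5   # number of recent interactions kept in context


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * EMBED_DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % EMBED_DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Unit-normalize so a dot product equals cosine similarity.
    return [x / norm for x in vec] if norm else vec


class HybridMemory:
    """Short-term rolling window plus a naive long-term vector store."""

    def __init__(self) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.short_term: deque[str] = deque(maxlen=SHORT_TERM_TURNS)
        # Assumption: a list of (embedding, text) pairs replaces a real store.
        self.long_term: list[tuple[list[float], str]] = []

    def add_interaction(self, text: str) -> None:
        self.short_term.append(text)
        self.long_term.append((toy_embed(text), text))

    def recall(self, query: str, k: int = 2) -> list[str]:
        """Return the k stored interactions most similar to the query."""
        q = toy_embed(query)
        ranked = sorted(
            self.long_term,
            key=lambda item: sum(a * b for a, b in zip(q, item[0])),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]

    def build_context(self, query: str) -> dict[str, list[str]]:
        """Combine the recent window with long-term recall for a prompt."""
        return {"recent": list(self.short_term), "recalled": self.recall(query)}
```

In a real system you would swap toy_embed for an actual embedding model and the list for a proper vector database; the point here is only the split between a bounded recent window and similarity-based recall.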

The production insight: Skills aren’t a silver bullet. You need both. Use skills for clear, bounded operations. Use docs for nuanced business logic and error handling patterns.

Quick Hits

• Medicare’s AI payment model goes live - ACCESS framework now pays for AI agents that monitor patients between visits, coordinate care, and manage medication compliance. First government mechanism for AI agent reimbursement. TechCrunch

• Probe context engine launches - New open-source tool for managing agent memory across conversations. Uses PostgreSQL for persistence and supports multiple embedding models. Available on GitHub at zeroentropy-ai/probe. GitHub

• Vapi hits £400M valuation - Amazon Ring chose their voice AI platform over 40 competitors. Enterprise revenue grew 10x since early 2025. Their real-time voice processing consistently delivers sub-200ms latency. TechCrunch

One Thing to Try

Run your own skills vs docs evaluation. Pick one workflow your agent handles poorly. Build both a function-based skill and a detailed documentation approach. Test them against 10 real user queries. You’ll learn more about your specific use case than any benchmark.
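
If you want a starting point, here is a rough harness sketch for that comparison. Everything in it is an assumption for illustration: evaluate and EvalResult are hypothetical names, run_agent is whatever callable wraps your skill-based or docs-based agent, and judge is your own pass/fail check (an exact match, a rubric, or an LLM judge). It records accuracy and mean execution time, the two measures the Wix setup tracked.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalResult:
    approach: str
    passed: int
    total: int
    mean_seconds: float

    @property
    def accuracy(self) -> float:
        return self.passed / self.total if self.total else 0.0


def evaluate(
    approach: str,
    run_agent: Callable[[str], str],       # hypothetical: your agent wrapper
    cases: Sequence[tuple[str, str]],      # (user query, expected outcome)
    judge: Callable[[str, str], bool],     # hypothetical: your pass/fail check
) -> EvalResult:
    """Run every case through one approach, tracking passes and latency."""
    passed = 0
    elapsed = 0.0
    for query, expected in cases:
        start = time.perf_counter()
        answer = run_agent(query)
        elapsed += time.perf_counter() - start
        if judge(answer, expected):
            passed += 1
    total = len(cases)
    mean = elapsed / total if total else 0.0
    return EvalResult(approach, passed, total, mean)


if __name__ == "__main__":
    # Placeholder agents and cases; replace with your real ones.
    cases = [("cancel order 12", "cancelled"), ("refund order 7", "refunded")]
    skill_agent = lambda q: "cancelled" if "cancel" in q else "unknown"
    docs_agent = lambda q: "refunded" if "refund" in q else "cancelled"
    exact = lambda answer, expected: answer == expected

    for name, agent in [("skill", skill_agent), ("docs", docs_agent)]:
        result = evaluate(name, agent, cases, exact)
        print(result.approach, result.accuracy, result.mean_seconds)
```

Run both approaches against the same 10 queries and the same judge so the only variable is how the capability is packaged.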

The best agent architecture isn’t the one that wins evaluations—it’s the one that handles your users’ actual problems.
