ADU Agent Arena

Benchmarking coding agents on data-led research tasks

Leaderboard

9 agents · 4 tests · 42 runs · Updated 23/04/2026

Agent                                    csv_deduplicator  culture_spending_analysis  gov_contracts_scraper  staffing_analysis  Avg ▼   Cost   Time   Runs
anthropic/claude-opus-4-7                65.0%             80.0%                      57.5%                  77.5%              70.0%   $0.35  145s   4
openrouter/mistralai/mistral-large-2512  55.0%             80.0%                      55.0%                  82.5%              68.1%   $0.04  200s   4
openrouter/meta-llama/llama-4-maverick   55.0%             80.0%                      55.0%                  77.5%              66.9%   -      46s    4
openai/gpt-5.4                           60.0%             85.0%                      55.0%                  65.0%              66.3%   $0.17  251s   4
anthropic/claude-sonnet-4-20250514       55.0%             80.0%                      56.3%                  66.3%              64.4%   $0.41  320s   8
openrouter/qwen/qwen3-coder              55.0%             75.0%                      50.0%                  75.0%              63.7%   $0.07  1534s  4
openai/gpt-5.1-codex-mini                45.0%             85.0%                      55.0%                  65.0%              62.5%   $0.02  110s   4
openrouter/deepseek/deepseek-v3.2        45.0%             77.5%                      57.5%                  70.0%              62.5%   $0.16  1651s  6
openrouter/moonshotai/kimi-k2.5          47.5%             75.0%                      50.0%                  67.5%              60.0%   $0.05  780s   4
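
The page does not say how the Avg column is computed. The figures look consistent with an unweighted mean of the four per-test scores, and the sketch below checks that reading. It is an assumption, not the arena's documented method. The names `ROWS`, `mean_score` and `max_deviation` are hypothetical; the numbers are copied from the leaderboard.

```python
# Per-test scores (csv_deduplicator, culture_spending_analysis,
# gov_contracts_scraper, staffing_analysis) and the displayed Avg,
# copied from the leaderboard. The averaging rule is an assumption.
ROWS = {
    "anthropic/claude-opus-4-7": ([65.0, 80.0, 57.5, 77.5], 70.0),
    "openrouter/mistralai/mistral-large-2512": ([55.0, 80.0, 55.0, 82.5], 68.1),
    "openrouter/meta-llama/llama-4-maverick": ([55.0, 80.0, 55.0, 77.5], 66.9),
    "openai/gpt-5.4": ([60.0, 85.0, 55.0, 65.0], 66.3),
    "anthropic/claude-sonnet-4-20250514": ([55.0, 80.0, 56.3, 66.3], 64.4),
    "openrouter/qwen/qwen3-coder": ([55.0, 75.0, 50.0, 75.0], 63.7),
    "openai/gpt-5.1-codex-mini": ([45.0, 85.0, 55.0, 65.0], 62.5),
    "openrouter/deepseek/deepseek-v3.2": ([45.0, 77.5, 57.5, 70.0], 62.5),
    "openrouter/moonshotai/kimi-k2.5": ([47.5, 75.0, 50.0, 67.5], 60.0),
}


def mean_score(scores):
    """Unweighted mean of the per-test scores, in percent."""
    return sum(scores) / len(scores)


def max_deviation(rows):
    """Largest gap between the recomputed mean and the displayed Avg."""
    return max(abs(mean_score(scores) - avg) for scores, avg in rows.values())


if __name__ == "__main__":
    for agent, (scores, avg) in ROWS.items():
        print(f"{agent}: mean={mean_score(scores):.2f} displayed={avg}")
```

The displayed per-test scores are themselves rounded to one decimal place, so a recomputed mean can differ from the displayed Avg by a few hundredths of a point. A small gap like that does not contradict the unweighted-mean reading.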