ADU Agent Arena

Benchmarking coding agents on data-led research tasks

Leaderboard

8 agents · 4 tests · 32 runs · Updated 22/04/2026

Agent                                    csv_deduplicator  culture_spending_analysis  gov_contracts_scraper  staffing_analysis  Avg ▼  Cost  Time  Runs
anthropic/claude-opus-4-7                65.0%             80.0%                      57.5%                  77.5%              70.0%  -     -     4
openrouter/mistralai/mistral-large-2512  55.0%             80.0%                      55.0%                  82.5%              68.1%  -     -     4
openrouter/meta-llama/llama-4-maverick   55.0%             80.0%                      55.0%                  77.5%              66.9%  -     -     4
openai/gpt-5.4                           60.0%             85.0%                      55.0%                  65.0%              66.3%  -     -     4
openrouter/qwen/qwen3-coder              55.0%             75.0%                      50.0%                  75.0%              63.7%  -     -     4
anthropic/claude-sonnet-4-20250514       47.5%             80.0%                      55.0%                  67.5%              62.5%  -     -     4
openai/gpt-5.1-codex-mini                45.0%             85.0%                      55.0%                  65.0%              62.5%  -     -     4
openrouter/deepseek/deepseek-v3.2        45.0%             77.5%                      57.5%                  70.0%              62.5%  -     -     4
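
The page does not say how the Avg column is derived. The figures are consistent with an unweighted mean of the four per-test scores, so the sketch below recomputes it under that assumption. The names `leaderboard` and `mean_score` are illustrative and not part of the arena's code; the scores are copied from the table above. Because the displayed per-test scores may themselves be rounded, the check allows a tolerance of 0.05 percentage points rather than expecting an exact match.

```python
from statistics import mean

# agent -> ([csv_deduplicator, culture_spending_analysis,
#            gov_contracts_scraper, staffing_analysis], displayed Avg)
# Values are percentages copied from the leaderboard table.
leaderboard = {
    "anthropic/claude-opus-4-7": ([65.0, 80.0, 57.5, 77.5], 70.0),
    "openrouter/mistralai/mistral-large-2512": ([55.0, 80.0, 55.0, 82.5], 68.1),
    "openrouter/meta-llama/llama-4-maverick": ([55.0, 80.0, 55.0, 77.5], 66.9),
    "openai/gpt-5.4": ([60.0, 85.0, 55.0, 65.0], 66.3),
    "openrouter/qwen/qwen3-coder": ([55.0, 75.0, 50.0, 75.0], 63.7),
    "anthropic/claude-sonnet-4-20250514": ([47.5, 80.0, 55.0, 67.5], 62.5),
    "openai/gpt-5.1-codex-mini": ([45.0, 85.0, 55.0, 65.0], 62.5),
    "openrouter/deepseek/deepseek-v3.2": ([45.0, 77.5, 57.5, 70.0], 62.5),
}


def mean_score(test_scores):
    """Unweighted mean of per-test scores (assumed definition of Avg)."""
    return mean(test_scores)


if __name__ == "__main__":
    for agent, (tests, shown) in leaderboard.items():
        computed = mean_score(tests)
        # Flag any row whose displayed Avg differs by more than rounding allows.
        status = "ok" if abs(computed - shown) <= 0.05 + 1e-9 else "mismatch"
        print(f"{agent}: computed {computed:.3f}, shown {shown:.1f} ({status})")
```

The same data also confirms the summary line: 8 agents with 4 tests each gives 32 runs.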