Agent Benchmark Directory

A comprehensive directory of 50+ benchmarks for evaluating AI agents. Compare coding, web, tool-use, reasoning, and safety benchmarks. Find the right evaluation for your AI agent use case.
53 benchmarks · 10 categories
Categories: Coding, Web, Reasoning, General, Data Science, Tool Use, Computer Use, Safety, Domain, Multimodal
Benchmark | Category | Tasks | Description | Year
Data sourced from HuggingFace, OpenCompass, and various public Git repositories.
Release v20260324