Agent Benchmark Directory
A comprehensive directory of 50+ benchmarks for evaluating AI agents. Compare coding, web, tool-use, reasoning, and safety benchmarks. Find the right evaluation for your AI agent use case.
53 Benchmarks, 10 Categories
| Benchmark | Category | Tasks | Description | Year |
|---|---|---|---|---|
Data sourced from Hugging Face, OpenCompass, and various public Git repositories.
Release v20260324