Agent Benchmark Directory

A comprehensive directory of 50+ benchmarks for evaluating AI agents. Compare coding, web, tool-use, reasoning, and safety benchmarks. Find the right evaluation for your AI agent use case.

Categories: Coding, Web, Reasoning, General, Data Science, Tool Use, Computer Use, Safety, Domain, Multimodal.

Each entry lists the benchmark's name, category, task count, description, and year, along with details on what it measures, how it measures it, the metrics used, and links.

Original data from HuggingFace, OpenCompass and various public git repos.

Release v20260324