Evaluating a chatbot is straightforward: does the answer match? Evaluating an agent — a system that plans, calls tools, modifies state, coordinates with users, and operates over many turns — is a fundamentally harder problem.
I put together awesome-agentic-evaluation, a curated list of benchmarks, environments, papers, competitions, and open-source platforms for evaluating AI agents in interactive, tool-using, dynamic, and production-like settings. This post walks through the key ideas.
Why agentic evaluation is different
A standard LLM benchmark asks: did the model produce the right text?
An agentic benchmark needs to ask much more:
- Did the agent use the right tools in the right order?
- Did it follow business policies and safety constraints?
- Did it recover when something went wrong?
- Was it consistently reliable, not just occasionally correct?
- Did it coordinate properly with users or other agents?
These questions demand fundamentally different evaluation infrastructure: sandboxed environments, trajectory scoring, execution-based judging, and reliability metrics that go beyond pass@k.
Key concepts worth knowing
Before diving into benchmarks, a few terms that come up constantly:
| Term | What it means |
|---|---|
| Trajectory | The full sequence of actions, observations, and decisions an agent makes. Trajectory evaluation judges the process, not just the final outcome. |
| pass^k | A reliability metric from τ-bench. Unlike pass@k (did the agent succeed in any of k tries), pass^k asks: did it succeed on all k tries? This measures consistency. |
| Execution-based judging | Running the agent’s outputs (executing code, applying patches, checking database state) rather than text-matching or LLM-as-judge. Much stronger correctness guarantees. |
| Dual-control | Settings where both the agent and the user can change the shared environment, surfacing coordination failures that single-control setups miss entirely. |
| Contamination | When a model has seen benchmark data during training, inflating scores beyond true capability. A major validity concern for static benchmarks. |
The foundational benchmark families
Three lineages have shaped how the community thinks about agent evaluation:
τ-bench (Sierra Research)
The τ (tau) family models the triangle of tool, agent, and user. Tasks are realistic customer-service scenarios where the agent must follow business policies while helping a simulated user via tool calls.
- τ-bench introduced policy-aware conversational tasks and the
pass^kreliability metric. - τ²-bench extends this to dual-control: both the agent and the simulated user can act on the shared environment. This catches coordination failures — e.g., the user booking a flight while the agent simultaneously modifies the same reservation.
SWE-bench (Princeton NLP → Stanford)
The canonical benchmark for coding agents. Given a real GitHub repository and a real issue, the agent must generate a patch that resolves the problem. Tasks come from actual open-source projects (Django, scikit-learn, sympy) with real test suites for execution-based verification.
The family has grown to include SWE-bench Verified (human-verified subset), SWE-bench Multimodal (visual software domains), and Multi-SWE-bench (multi-language: Java, TypeScript, Rust, Go).
AgentBench (Tsinghua University)
A multi-dimensional benchmark across 8 distinct environments — operating system, database, knowledge graph, card game, web shopping, web browsing, and more. It identified that poor long-term reasoning, decision-making, and instruction following are the main bottlenecks for LLM-as-Agent systems.
Benchmarks by domain
The full list covers benchmarks across many domains:
Conversational & tool-using agents — APIGen-MT, FlowBench, Gorilla/BFCL, IntellAgent, ToolSandbox. These test multi-turn conversations, tool selection, policy compliance, and partial-progress scoring.
Web & computer-use agents — WebArena, VisualWebArena, OSWorld, BrowserGym, AgentLab. These put agents in real browser and desktop environments where they must perceive screen content, plan actions, and execute them correctly.
Software engineering & DevOps — SWE-bench family, DevOps-Gym, MLE-bench, MLGym, SWT-bench. These cover the full spectrum from bug-fixing to ML experiment management to regression-test generation.
Cybersecurity — Cybench, CyberGym, BountyBench, CVE-Bench, NYU CTF Bench. Security-oriented tasks requiring agents to reason about vulnerabilities, exploits, and adversarial scenarios.
Safety & dangerous capabilities — Google DeepMind’s dangerous capability evaluations (self-proliferation, self-reasoning), Anthropic’s model-written evaluations, Stanford’s HELM Safety Leaderboard, AIR-Bench. Critical for frontier model assessment and policy decisions.
Multi-agent — AutoGen Bench, Magentic-One. Testing how agents collaborate, coordinate, or compete with each other.
Dynamic & evolving environments — AUTOENV, ToolQA-D. These address the fundamental problem that static benchmarks eventually saturate and become gameable. Dynamic environments generate fresh tasks and change conditions over time.
Evaluation tooling and observability
Building a benchmark is only half the problem. You also need infrastructure to run evaluations, trace agent behavior, and analyze results:
- Eval frameworks: DeepEval (pytest-like LLM evaluation), agentevals (trajectory-focused), OpenAI Evals, Inspect Evals (UK AI Safety Institute).
- Tracing & observability: Langfuse, Phoenix, Braintrust. Production agent evaluation requires visibility into what agents are actually doing — every tool call, observation, and decision point.
- Task standards: The METR Task Standard provides a common format for defining agent evaluation tasks, already used for 200+ task families. Google’s A2A protocol and Anthropic’s MCP standardize agent communication and tool interfaces.
Seven best practices for agentic evaluation
Drawing from all the benchmarks and papers in the list, these emerging practices represent the current state of the art:
1. Use execution-based evaluation over text matching
Text comparison (exact match, BLEU, LLM-as-judge) is brittle for agentic tasks. An agent might fix a bug with a different but equally valid approach. Run the agent’s outputs in a real environment — execute the code, apply the patch, run the test suite.
2. Measure reliability, not just peak performance
An agent that succeeds 1 out of 10 times is useless in production. Use pass^k metrics that penalize inconsistency. Report variance across runs. Test with different seeds and prompt formulations.
3. Evaluate process, not just outcomes
A correct final answer can mask dangerous intermediate behavior — hallucinated tool calls that happened to work, or policy violations along the way. Score trajectories and intermediate steps. Check policy compliance at each turn.
4. Design for contamination resistance
Static benchmarks inevitably leak into training data. Use dynamic environments that generate fresh tasks. Maintain hidden test sets. Regularly refresh benchmark data.
5. Test dual-control and coordination
Most real deployments involve coordination with humans or other agents who independently change the environment. Single-agent benchmarks miss coordination failures.
6. Sandbox everything, log everything
Agents that execute code, call APIs, or modify files can cause real damage. Use containerized environments. Implement undo and damage confinement. Deploy comprehensive tracing for every evaluation run.
7. Validate your benchmark before publishing
Many published benchmarks contain ambiguous tasks, incorrect ground truth, or evaluation functions that don’t measure what they claim. Follow the ABC (Agentic Benchmark Checklist). Have human experts verify task solvability.
Design patterns to know
When choosing or designing a benchmark, these are the fundamental design axes:
| Axis | Options | Trade-off |
|---|---|---|
| Scoring target | Outcome vs. Process | Outcome is simpler; process reveals how the agent got there |
| Environment | Static vs. Dynamic | Static is reproducible but gameable; dynamic resists memorization |
| Control model | Single vs. Dual/Multi | Single is simpler; dual better reflects real coordination challenges |
| Judging method | Text matching vs. LLM judge vs. Execution | Execution is strongest but requires sandboxing |
| Task source | Hand-crafted vs. Programmatic vs. Real-world | Hand-crafted is high quality but expensive; real-world is authentic but noisy |
A good benchmark makes its stance on each axis explicit.
Who’s building what
The ecosystem spans research labs, big tech, and startups:
- Research labs: UC Berkeley (AgentBeats, CyberGym, Gorilla/BFCL), Stanford CRFM (HELM), Princeton NLP (SWE-bench), METR (Task Standard), Tsinghua (AgentBench), CMU (WebArena).
- Large companies: OpenAI (Evals, MLE-bench), Google DeepMind (Dangerous Capability Evaluations), Anthropic (MCP, model-written evals), Microsoft (AutoGen, Magentic-One), Meta (MLGym), Sierra (τ-bench family).
- Startups & OSS: Arize AI (Phoenix), Braintrust, Confident AI (DeepEval), LangChain (agentevals), Langfuse.
The repo
The full curated list — with links to every benchmark, paper, tool, competition, and conference tutorial mentioned above (and many more) — is on GitHub:
github.com/pauldebdeep9/awesome-agentic-evaluation
Contributions are welcome. The list is licensed under CC0-1.0.