AutoBench Leaderboard
Multi-run AutoBench leaderboard with historical navigation
Using AI to benchmark and route AI
AutoBench is the premier LLM evaluation and routing infrastructure for the Agentic Era. It is not just LLM benchmarking: it is real-time, AI-trained LLM routing for agents, delivering up to 90% inference cost savings.
We are solving the LLM evaluation crisis by moving the industry beyond static, domain-rigid, and easily gameable benchmarks. AutoBench uses massive pools of LLMs to dynamically generate tasks, execute multi-turn workflows, and evaluate LLM performance at a granular level. Our benchmarks correlate 80-90% with industry standards, yet remain strictly un-gameable, unbiased, granular, and flexible, at a fraction of the cost.
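To make the pool-based evaluation idea concrete, here is a minimal, hypothetical sketch of reciprocal peer assessment: every model in the pool answers a generated question, and every other model grades that answer. The function names (ask_model, judge_answer) are illustrative placeholders, not AutoBench's actual API.

```python
from statistics import mean

# Minimal sketch of reciprocal peer assessment (illustrative only).
# `ask_model` and `judge_answer` are placeholder callables, not AutoBench's API.
def peer_assess(models, question, ask_model, judge_answer):
    """Each model answers the question; every other model grades that answer.
    Returns the mean peer grade per model."""
    answers = {m: ask_model(m, question) for m in models}
    return {
        m: mean(judge_answer(judge, question, answer)
                for judge in models if judge != m)
        for m, answer in answers.items()
    }
```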
And that is just the beginning. We leverage the massive synthetic datasets generated by our benchmarks to train next-gen Agentic LLM Routers, helping agentic frameworks optimize for both quality and economics.
Our vision is for AutoBench to become the essential, universal layer that will soon intermediate between all AI agents and the underlying LLMs.
Current agentic benchmarks (like Gval-AA or Terminal-Bench) are static, allowing models to simply "train to the test." AutoBench Agentic fundamentally changes this by dropping LLMs into dynamically generated, multi-turn Agentic Virtual Environments, complete with tools[] arrays filled with randomly injected "distractor" tools.
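As a rough illustration of what such a dynamically generated environment might look like, the sketch below builds a shuffled tools[] array mixing task-relevant tools with randomly sampled distractors. The tool names and schema are invented for the example and do not reflect AutoBench Agentic's actual environment specification.

```python
import random

# Hypothetical task-relevant tools and a pool of distractors (names invented).
TASK_TOOLS = [
    {"name": "read_file", "description": "Read a file from the sandbox"},
    {"name": "run_shell", "description": "Execute a shell command"},
]
DISTRACTOR_POOL = [
    {"name": "get_weather", "description": "Fetch a weather report"},
    {"name": "send_email", "description": "Send an email"},
    {"name": "book_flight", "description": "Search and book a flight"},
    {"name": "translate_text", "description": "Translate text between languages"},
]

def build_tools(n_distractors=2, seed=None):
    """Return a shuffled tools[] array: real tools plus random distractors."""
    rng = random.Random(seed)
    tools = TASK_TOOLS + rng.sample(DISTRACTOR_POOL, n_distractors)
    rng.shuffle(tools)
    return tools
```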
Benchmarking is just the first step. AutoBench uses the millions of execution traces and the granular performance data generated by our runs to train dynamic Agentic LLM Routers. Instead of acting as a passive gateway or a superficial semantic-heuristic router, AutoBench performs active pipeline optimization, routing complex edge cases to frontier models and standard tasks to open-weight models, saving agent applications and enterprises up to 90% in API costs.
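A toy sketch of that routing idea, assuming a difficulty scorer trained on benchmark traces; the model names, threshold, and predict_difficulty function are placeholders, not the actual AutoBench router.

```python
# Toy difficulty-based router (placeholder names, not AutoBench's router).
FRONTIER_MODEL = "frontier-model"        # expensive, high-accuracy
OPEN_WEIGHT_MODEL = "open-weight-model"  # cheap, good enough for routine tasks

def route(task: str, predict_difficulty, threshold: float = 0.7) -> str:
    """Send predicted-hard tasks to the frontier model, the rest to the cheap one."""
    return FRONTIER_MODEL if predict_difficulty(task) >= threshold else OPEN_WEIGHT_MODEL
```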
The core engine powering our latest generalist and domain-specific runs (such as our Agronomy vertical). AutoBench 2.0 introduces three major technical breakthroughs to the Collective-LLM-as-a-Judge framework.
Powered by AutoBench's evaluation methodology, Bot Scanner is the "skyscanner for LLM responses": a live platform that routes a single prompt to multiple "responder" LLMs simultaneously, then uses AutoBench's "judge" LLMs to evaluate, rank, and deliver the best answer instantly, ending LLM guesswork.
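The fan-out-and-judge pattern behind this can be sketched as follows; complete and score stand in for responder and judge API calls and are not Bot Scanner's real interface.

```python
from statistics import mean

# Illustrative fan-out: query several responders, let judges grade each reply,
# and return the reply with the highest mean judge score.
def best_answer(prompt, responders, judges, complete, score):
    replies = {r: complete(r, prompt) for r in responders}
    return max(
        replies.items(),
        key=lambda item: mean(score(j, prompt, item[1]) for j in judges),
    )  # -> (responder_name, best_reply)
```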
The foundational open-source framework that proved the Collective-LLM-as-a-Judge concept. It remains free and available for researchers and developers to run local evaluations and explore the core architecture.
AutoBench resolves the traditional tradeoff between scalability, cost, and accuracy.
Our methodology is scientifically validated and continuously peer-reviewed. We extend our immense gratitude to our partners and supporters:
Whether you are an AI researcher, a prompt engineer, or an enterprise IT architect deploying autonomous agents, AutoBench has the data you need to stop flying blind.
Inference Support: Running a compute-intensive benchmark like AutoBench can be expensive. We welcome all inference API providers to support us with free inference credits to expand the scope of our evaluations.
If you use AutoBench in your research, please cite our validation paper:
@misc{autobench2025,
title={AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment},
author={AutoBench},
year={2025},
eprint={2510.22593},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.22593},
}