AI & ML interests

Using AI to benchmark and route AI

AutoBench

Organization Description

AutoBench is the premier LLM evaluation and routing infrastructure for the Agentic Era. It goes beyond LLM benchmarking to real-time, AI-trained LLM routing for agents, delivering up to 90% inference cost savings.

We are solving the LLM evaluation crisis by moving the industry beyond static, domain-rigid, and easily gameable benchmarks. AutoBench uses massive pools of LLMs to dynamically generate tasks, execute multi-turn workflows, and evaluate LLM performance at a granular level. Our benchmarks correlate 80-90% with industry standards while remaining strictly un-gameable, unbiased, granular, and flexible, at a fraction of the cost.

And that is just the beginning. We leverage the massive synthetic datasets generated by our benchmarks to train next-gen Agentic LLM Routers, helping agentic frameworks optimize for both quality and economics.

Our vision is for AutoBench to become the essential, universal layer that intermediates between all AI agents and the underlying LLMs.

The AutoBench Ecosystem

1. AutoBench Agentic (Latest Evolution)

Current agentic benchmarks (like GDPval-AA or Terminal-Bench) are static, allowing models to simply "train to the test." AutoBench Agentic fundamentally changes this by dropping LLMs into dynamically generated, multi-turn Agentic Virtual Environments.

  • Technical Complexity: Our infrastructure combines deterministic procedures and LLM generation to build complex, business-flavored agentic task payloads via a native Universal Intermediate Representation (UIR). We inject stateful "memory lines" of previous workflow failures and force models to navigate native JSON tools[] arrays filled with randomly injected "distractor" tools (see the sketch after this list).
  • 10 Granular Task Types: We evaluate true orchestration under pressure, measuring specific capabilities like Adaptive Replanning, Parameter Complexity, Single Tool Call, and Failure Recovery.
  • Cost vs. Performance Tracking: We track exact P99 latency and per-run USD costs to help developers define their efficiency frontier.
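
To make the distractor-tool idea concrete, here is a minimal sketch assuming a generic OpenAI-style function-calling schema; the tool names, schemas, and injection logic are illustrative placeholders, not AutoBench's actual UIR or payload format.

```python
import random

# Hypothetical tool schemas in OpenAI-style function-calling format.
REQUIRED_TOOL = {
    "type": "function",
    "function": {
        "name": "create_invoice",
        "description": "Create an invoice for a customer order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount_usd": {"type": "number"},
            },
            "required": ["order_id", "amount_usd"],
        },
    },
}

# Irrelevant but plausible-looking tools the model must learn to ignore.
DISTRACTOR_POOL = [
    {"type": "function", "function": {"name": name, "description": desc,
                                      "parameters": {"type": "object", "properties": {}}}}
    for name, desc in [
        ("refund_invoice", "Refund a previously issued invoice."),
        ("lookup_weather", "Get the current weather for a city."),
        ("archive_ticket", "Archive a closed support ticket."),
        ("send_newsletter", "Send the weekly marketing newsletter."),
    ]
]

def build_tools_array(n_distractors: int = 3, seed: int = 0) -> list[dict]:
    """Return a tools[] array with the required tool hidden among random distractors."""
    rng = random.Random(seed)
    tools = [REQUIRED_TOOL] + rng.sample(DISTRACTOR_POOL, n_distractors)
    rng.shuffle(tools)
    return tools

if __name__ == "__main__":
    for tool in build_tools_array(seed=42):
        print(tool["function"]["name"])
```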

2. Agentic LLM Routing (alpha)

Benchmarking is just the first step. AutoBench uses the millions of execution traces and granular performance data generated by our runs to train dynamic Agentic LLM Routers. Unlike passive gateways or shallow semantic-heuristic routers, AutoBench enables active pipeline optimization: routing complex edge cases to frontier models and standard tasks to open-weight models, saving agent applications and enterprises up to 90% in API costs.
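
A minimal sketch of the routing idea, assuming a learned difficulty predictor sits behind the decision; the model names, the toy heuristic, and the threshold below are placeholders rather than AutoBench's production router.

```python
from dataclasses import dataclass

# Placeholder model identifiers; the real routing policy is trained on
# benchmark execution traces, not hard-coded rules like the toy below.
FRONTIER_MODEL = "frontier-model-large"
OPEN_WEIGHT_MODEL = "open-weight-model-small"

@dataclass
class RoutingDecision:
    model: str
    predicted_difficulty: float

def predict_difficulty(prompt: str) -> float:
    """Stand-in for a learned difficulty predictor; returns a score in [0, 1]."""
    # Toy heuristic: longer, multi-step prompts are treated as harder.
    steps = prompt.count("then") + prompt.count("after that")
    return min(1.0, 0.2 + 0.25 * steps + len(prompt) / 4000)

def route(prompt: str, threshold: float = 0.6) -> RoutingDecision:
    """Send hard edge cases to a frontier model, routine tasks to an open-weight model."""
    difficulty = predict_difficulty(prompt)
    model = FRONTIER_MODEL if difficulty >= threshold else OPEN_WEIGHT_MODEL
    return RoutingDecision(model=model, predicted_difficulty=difficulty)

if __name__ == "__main__":
    print(route("Summarize this email in one sentence."))
    print(route("Plan the migration, then update billing, after that reconcile the ledger."))
```

In the actual system the difficulty signal comes from a router trained on benchmark execution traces, not from a hand-written heuristic like the one above.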

3. AutoBench 2.0 & Domain Benchmarks

The core engine powering our latest generalist and domain-specific runs (such as our Agronomy vertical). AutoBench 2.0 introduces three major technical breakthroughs to the Collective-LLM-as-a-Judge framework:

  • Random Score Pooling: Instead of a fixed set of judging models, we pool a random selection of models for every scoring session, expanding exploration of the "LLM performance space" while reducing required compute.
  • Nonlinear Weighting: Replaces simple linear averaging with advanced weighting functions (exponential, power-law, Boltzmann) to compensate for variance and improve convergence among highly capable frontier models; a toy illustration follows this list.
  • Parallel Iteration: Reduces evaluation cycles from days to mere hours.
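
As a toy illustration of the nonlinear-weighting idea (one plausible reading, not the exact functions or parameters used in AutoBench 2.0), the sketch below contrasts a plain mean of judge scores with a Boltzmann-style aggregate that weights each judge by its own leaderboard strength; the judge names, scores, and beta value are hypothetical.

```python
import math

def weighted_judge_score(scores: dict[str, float],
                         judge_rank: dict[str, float],
                         beta: float = 3.0) -> float:
    """Aggregate judge scores, weighting each judge by its own benchmark rank.

    weight_j is proportional to exp(beta * rank_j): a Boltzmann-style
    alternative to the plain mean. beta and the rank values are illustrative.
    """
    weights = {j: math.exp(beta * judge_rank[j]) for j in scores}
    total = sum(weights.values())
    return sum(weights[j] * s for j, s in scores.items()) / total

if __name__ == "__main__":
    # Scores a random pool of judges gave one answer (0-1 scale).
    scores = {"judge_a": 0.62, "judge_b": 0.70, "judge_c": 0.93}
    # How strong each judge itself is on the current leaderboard (0-1 scale).
    judge_rank = {"judge_a": 0.55, "judge_b": 0.80, "judge_c": 0.90}

    plain_mean = sum(scores.values()) / len(scores)
    print(f"linear mean:        {plain_mean:.3f}")
    print(f"boltzmann weighted: {weighted_judge_score(scores, judge_rank):.3f}")
```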

4. Bot Scanner (Consumer/Dev Platform)

Powered by AutoBench's evaluation methodology, Bot Scanner is the "Skyscanner for LLM responses": a live platform that routes a single prompt to multiple "responder" LLMs simultaneously, then uses AutoBench's "judge" LLMs to evaluate, rank, and instantly deliver the best answer, ending LLM guesswork.
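
In spirit, the flow looks something like the sketch below; the responder and judge identifiers and the client functions are placeholders, not Bot Scanner's actual API or scoring rubric.

```python
import asyncio

# Placeholder responder and judge identifiers; the real model pools differ.
RESPONDERS = ["responder-a", "responder-b", "responder-c"]
JUDGES = ["judge-x", "judge-y"]

async def ask(model: str, prompt: str) -> str:
    """Stand-in for an actual inference API call."""
    await asyncio.sleep(0)  # pretend network latency
    return f"[{model}] answer to: {prompt}"

async def judge(model: str, prompt: str, answer: str) -> float:
    """Stand-in for a judge call that returns a 0-1 quality score."""
    await asyncio.sleep(0)
    return (hash((model, answer)) % 100) / 100  # fake score for the sketch

async def best_answer(prompt: str) -> str:
    """Fan the prompt out to all responders, judge every answer, return the winner."""
    answers = await asyncio.gather(*(ask(m, prompt) for m in RESPONDERS))
    ranked = []
    for answer in answers:
        scores = await asyncio.gather(*(judge(j, prompt, answer) for j in JUDGES))
        ranked.append((sum(scores) / len(scores), answer))
    return max(ranked, key=lambda pair: pair[0])[1]

if __name__ == "__main__":
    print(asyncio.run(best_answer("Explain UIR in one paragraph.")))
```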

5. AutoBench 1.0 (Open Source)

The foundational open-source framework that proved the Collective-LLM-as-a-Judge concept. It remains free and available for researchers and developers to run local evaluations and explore the core architecture.

Key Differentiators & Industry Correlations

AutoBench solves the traditional tradeoff between scalability, cost, and accuracy:

  • Strictly Un-gameable: Because tasks and environments are dynamically generated at runtime, test-set contamination is impossible. Models cannot "memorize" the benchmark.
  • Highly Correlated (Scientific Validation): Despite its dynamic nature, AutoBench achieves strong correlation with rigid, human-verified industry standards:
    • Agentic Correlations: 85.15% with the Artificial Analysis Intelligence Index, 84.56% with GDPval-AA, and 83.00% with Terminal-Bench Hard.
    • Generalist Correlations: 89.38% with the Artificial Analysis Index, 82.21% with MMLU-Pro, and 71.84% with LMSYS Chatbot Arena (Human Preference).
  • High Granularity & Adaptability: Unlike one-size-fits-all tests, AutoBench's architecture easily adapts to highly specialized, domain-specific verticals. We provide granular, topic-specific performance insights, such as our recent Agronomic Benchmark, and allow enterprises to test models on their exact proprietary schemas and niche industry knowledge.
  • Highly Scalable & Cost-Effective: A comprehensive benchmark evaluating 30+ models costs a fraction of human-annotated alternatives (often under $100 in raw compute).

Scientific Validation & Acknowledgements

Our methodology is scientifically validated and continuously peer-reviewed. We extend our immense gratitude to our partners and supporters:

  • Translated: Global leader in professional AI-enabled translation and high-quality human training data, for their continued support with compute resources and strategic insight.
  • DIAG, Sapienza Università di Roma: The team led by Prof. Fabrizio Silvestri for providing the rigorous scientific validation that underpins our methodology.
  • eZecute: The venture builder for enabling the industrialization and scaling of this platform.
  • AWS Startups: For compute credits.

Explore, Connect, and Contribute

Whether you are an AI researcher, a prompt engineer, or an enterprise IT architect deploying autonomous agents, AutoBench has the data you need to stop flying blind.

Inference Support: Running a compute-intensive benchmark like AutoBench can be expensive. We welcome inference API providers willing to support us with free inference credits, helping us expand the scope of our evaluations.

Citation

If you use AutoBench in your research, please cite our validation paper:

@misc{autobench2025,
      title={AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment}, 
      author={AutoBench},
      year={2025},
      eprint={2510.22593},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.22593},
}