Figure 1: An overview of the DiscoBench system. A user can select a domain from a broad range of areas of machine learning. They can then select files from the list of available modules for an AI research agent to edit. We support selecting any combination of modules, from just one to all of them!

Evaluating Automated Algorithm Discovery

One long-term goal of automated algorithm discovery is to safely automate AI research itself. To do so, we need to be able to measure the capabilities of AI research agents. Benchmarks for this already exist, but they suffer from fundamental limitations, including data contamination, poor-quality evaluations, and difficulty in assessing whether the methods an agent discovers generalise to new tasks and domains.

We designed DiscoBench specifically with these issues in mind, so we hope it can remain relevant for a long time. Below, we explain what makes DiscoBench useful and describe some of the tasks currently implemented in it; expect this set to grow over the coming months!

Limitations of Current Benchmarks

Table: A comparison of DiscoBench against MLEBench, ASTABench, and MLGym Bench across four criteria: algorithm transfer, task diversity, unbiasedness to initialisation, and contamination resistance.

What is DiscoBench?

DiscoBench is a new task-generation framework and task suite for algorithm discovery and AI research agents. As well as already providing a number of problems for measuring an agent's performance at AI research, we believe it offers a useful foundation for efforts in automated, open-ended scientific research.

DiscoBench is set up in a modular way: you can specify which parts of a codebase an LLM is allowed to edit, which enables far greater task diversity than other benchmarks in this space. We also differ from other benchmarks in how we define evaluation, placing the emphasis on meta-test evaluation: we test the performance of discovered algorithms on held-out datasets or environments, without telling the LLM a priori what it will be evaluated on. Finally, DiscoBench is diverse, offering tasks from a range of applied and foundational disciplines and supporting a broad range of public datasets and RL environments.
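
To make the module-selection idea concrete, here is a minimal, hypothetical sketch of what a task specification could look like. The `TaskSpec` structure, the domain name, and the module file names are illustrative assumptions chosen for this example, not DiscoBench's actual interface.

```python
# A hypothetical sketch of a modular task specification. The class, field and
# module names here are illustrative assumptions, not DiscoBench's actual API.
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    domain: str                   # the ML domain the task is drawn from
    editable_modules: list[str]   # files the research agent may modify
    frozen_modules: list[str] = field(default_factory=list)  # everything else stays fixed


# Example: expose only the loss function and optimiser to the agent, while the
# data pipeline, model definition and training loop remain untouched.
task = TaskSpec(
    domain="image_classification",
    editable_modules=["loss.py", "optimizer.py"],
    frozen_modules=["data.py", "model.py", "train.py"],
)

total = len(task.editable_modules) + len(task.frozen_modules)
print(f"Agent may edit {len(task.editable_modules)} of {total} modules in '{task.domain}'")
```

Selecting more modules gives the agent a larger search space, up to rewriting the whole pipeline, while selecting a single module constrains it to a narrower problem, such as discovering a new loss function.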

Figure 2: A visualisation of a typical DiscoBench setup. An agent develops algorithms to train models on a set of meta-train datasets, making refinements based on the models' evaluation scores. Once a final algorithm is developed, it is used to train models on a held-out meta-test dataset, which is unknown to the agent. These models are evaluated to produce a final performance metric.

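To illustrate the information flow in this protocol, here is a minimal Python sketch. Every name in it (the agent's `propose`/`refine` interface and the `train`/`evaluate` helpers) is a placeholder chosen for the example, not part of DiscoBench itself.

```python
# A minimal sketch of the meta-train / meta-test protocol, written against
# assumed placeholder interfaces. Nothing here is DiscoBench's real API.

def train(algorithm, dataset):
    """Placeholder: train a model on `dataset` using the agent's algorithm."""
    return {"algorithm": algorithm, "dataset": dataset}  # stand-in for a trained model


def evaluate(model, dataset):
    """Placeholder: score a trained model on `dataset`."""
    return 0.0  # stand-in for a real metric


def run_task(agent, meta_train_datasets, meta_test_datasets, num_refinements=5):
    # Meta-train: the agent refines its algorithm using feedback from the
    # meta-train datasets only; it never observes the meta-test datasets.
    algorithm = agent.propose()
    for _ in range(num_refinements):
        scores = [evaluate(train(algorithm, d), d) for d in meta_train_datasets]
        algorithm = agent.refine(algorithm, scores)

    # Meta-test: the final algorithm is frozen and applied to held-out
    # datasets; the resulting scores give the final performance metric.
    return [evaluate(train(algorithm, d), d) for d in meta_test_datasets]
```

The key point the sketch captures is that the agent's refinement loop only ever sees scores from the meta-train datasets; the meta-test datasets enter only after the algorithm is frozen, which is what lets the benchmark measure generalisation rather than overfitting to the evaluation data.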

Here are some of the main features of DiscoBench: