
Figure 1: An overview of the DiscoBench system. A user selects a domain from a broad range of areas of machine learning, then selects files from the list of available modules for an AI research agent to edit. We support any combination of modules, from just one to all of them!
One long-term goal of automated algorithm discovery systems is to safely automate AI research itself. Doing so requires the ability to measure the capabilities of AI research agents. Current benchmarks exist, but they suffer from fundamental limitations, including data contamination, poor-quality evaluations, and difficulty in assessing whether the methods an agent discovers generalise to new tasks and domains.
We designed DiscoBench specifically with these issues in mind; as such, we hope it can remain pertinent for a long time. Below, we explain what makes DiscoBench useful and describe some of the tasks currently implemented in it; expect these to grow over the next few months!
| Criteria | MLEBench | ASTABench | MLGym Bench | DiscoBench |
|---|---|---|---|---|
| Algorithm Transfer | ❌ | ❌ | ❌ | ✅ |
| Task Diversity | 🟡 | ✅ | 🟡 | ✅ |
| Unbiased to Initialisation | 🟡 | ✅ | ❌ | ✅ |
| Contamination Resistant | ❌ | 🟡 | 🟡 | ✅ |
DiscoBench is a new task-generation framework, and task suite, for algorithm discovery and AI research agents. As well as providing a range of problems for measuring an agent's performance at AI research, we believe it provides a useful foundation for efforts in automated, open-ended scientific research.
DiscoBench is set up in a modular way, so you can specify exactly which parts of a codebase an LLM edits; this enables unparalleled task diversity compared to other benchmarks in this space. We also differ from other benchmarks in how we define evaluation, placing the emphasis on meta-test evaluation: we test the performance of algorithms on held-out datasets or environments, without telling the LLM a priori what it will be evaluated on. Finally, DiscoBench is highly diverse, offering tasks from different applied and foundational disciplines and supporting a broad range of public datasets and RL environments.

Figure 2: A visualisation of a typical DiscoBench setup. An agent will develop algorithms to train models on a set of meta-train datasets, making refinements based on the model's evaluation score. After a final algorithm is developed, it is used to train models on a held-out meta-test dataset, which is unknown to the agent. The models are evaluated to get a final performance metric.
Here are some of the main features of DiscoBench:
Modular File System: To massively expand task diversity, we implement each task in a modular fashion. For every ML codebase we use, we identify a series of small modules that we can mark as editable (the LLM is fed just the interface it must match) or fixed (the code uses a prewritten file which the LLM shouldn't edit). For example, in one task the modules could be the network architecture, the loss function and the optimiser. This means that for the cost of implementing a single codebase with n modules, we get 2^n − 1 possible tasks, one for each combination of editable and fixed modules (excluding the case where nothing is editable); a codebase with just 5 modules, for instance, yields 31 distinct tasks! Also, when a module is editable, we start the agent off with a near-empty file, so agents are not biased towards any particular initialisation. Lastly, thanks to this modular setup, DiscoBench is not limited to LLMs; other methods, such as symbolic evolution, may also prove effective, since the emphasis is on discovering algorithms rather than writing boilerplate code.
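To make the modular setup concrete, here is a minimal sketch of how such a task space might be enumerated. The module names and the `TaskSpec` class are hypothetical illustrations, not DiscoBench's actual API:

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical module names for one supervised-learning codebase;
# real DiscoBench tasks define their own module sets.
MODULES = ("architecture", "loss_function", "optimiser")

@dataclass(frozen=True)
class TaskSpec:
    """One task: the subset of modules the agent may edit.

    Fixed modules use prewritten reference files; editable modules
    start as near-empty stubs exposing only the required interface.
    """
    editable: frozenset

def enumerate_tasks(modules=MODULES):
    """Yield every combination of editable modules except 'none editable',
    giving 2^n - 1 tasks for a codebase with n modules."""
    for k in range(1, len(modules) + 1):
        for subset in combinations(modules, k):
            yield TaskSpec(editable=frozenset(subset))

tasks = list(enumerate_tasks())
assert len(tasks) == 2 ** len(MODULES) - 1  # 7 tasks from 3 modules
```

The same enumeration scales exponentially: a codebase with 10 modules would already give 1023 tasks.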
Open-Ended Task Space: In DiscoBench, we focus on up-to-date codebases with unsaturated datasets and environments. These represent active research problems for which we can measure the performance of a discovered algorithm but do not yet know a “perfect” solution. As such, they are inherently open-ended problems that rely on creativity and effective contextual reasoning to maximise performance.
Meta-Train/Meta-Test Split: DiscoBench is implemented with the ability to define a clear meta-train/meta-test split, because we focus on the transferability of the algorithms written by agents. In practice, the agent must write algorithms which train models on data. The algorithm is evaluated at meta-test time based on its score when training a new model on an unseen dataset or RL environment that is never revealed to the agent. For example, the LLM might receive feedback on the algorithms it develops to train classifiers on CIFAR10, while its evaluation score is the performance of classifiers trained with this algorithm on ImageNet. It turns out to be remarkably easy to overfit algorithms to specific problems, which limits the utility of the discovered algorithms. We support a broad range of datasets and RL environments, and meta-training/meta-testing can each take place on more than one at a time, meaning the LLM potentially has to balance multiple objectives. This only adds to the huge diversity of possible tasks described above!
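As a rough illustration of this protocol, the sketch below separates the agent's feedback loop (meta-train) from the final held-out evaluation (meta-test). Every name here (`propose_algorithm`, `train_model`, `evaluate`, the dataset splits) is a hypothetical placeholder rather than DiscoBench's actual interface:

```python
def train_model(algorithm, dataset):
    """Placeholder: train a model on `dataset` using the agent's algorithm."""
    ...

def evaluate(model, split):
    """Placeholder: return a scalar score for `model` on `split`."""
    ...

def meta_train(agent, meta_train_datasets, n_iterations=10):
    """The agent iteratively refines its algorithm using only meta-train scores."""
    algorithm, feedback = None, None
    for _ in range(n_iterations):
        # The agent rewrites the editable modules, guided by the scores
        # its previous attempt achieved on the meta-train datasets.
        algorithm = agent.propose_algorithm(feedback)
        feedback = [
            evaluate(train_model(algorithm, ds), ds.validation_split)
            for ds in meta_train_datasets  # e.g. CIFAR10
        ]
    return algorithm  # frozen after the final refinement

def meta_test(algorithm, meta_test_datasets):
    """Final metric: the frozen algorithm trains fresh models on held-out
    datasets the agent never saw (e.g. ImageNet)."""
    return [
        evaluate(train_model(algorithm, ds), ds.test_split)
        for ds in meta_test_datasets
    ]
```

The key design choice is that `meta_test` runs only once, on datasets the agent was never told about, which is precisely what penalises algorithms overfitted to the meta-train problems.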