# Built-in Benchmarks
A benchmark is a combination of a dataset and a metric. We provide several popular benchmarks out of the box.
## MMLU-Pro
The MMLU-Pro benchmark contains 12,032 difficult multiple-choice questions spanning math, chemistry, engineering, and other domains.
Here is how to run it with redlite:

```python
from redlite import run
from redlite.benchmark.mmlu_pro import dataset, metric

model = ...  # configure the model to be benchmarked
run(dataset=dataset, metric=metric, model=model)
```
## GPQA
The GPQA benchmark contains four separate datasets of multiple-choice questions:

- `main`: 448 rows
- `diamond`: 198 rows
- `experts`: 60 rows
- `extended`: 546 rows
Here is how to run the `diamond` version with redlite:

```python
from redlite import run
from redlite.benchmark.gpqa import get_dataset, metric

model = ...  # configure the model to be benchmarked
dataset = get_dataset('diamond')
run(dataset=dataset, metric=metric, model=model)
```
## Math 500
The Math 500 benchmark contains 500 math problems. Here is how to run this benchmark in redlite:

```python
from redlite import run
from redlite.benchmark.math500 import dataset, metric

model = ...  # configure the model to be benchmarked
run(dataset=dataset, metric=metric, model=model)
```
## LiveCodeBench code generation
The LiveCodeBench benchmark contains Python code-generation tasks. Here is how to run this benchmark in redlite.
First, you need to run the grader service with Docker. You may need to build the Docker image first; see the GitHub project redlite-livecodebench-grader for details. Once you have the Docker image, start the grading service:

```shell
docker run -it -p 8000:80 ilabs/redlite-livecodebench-grader:latest
```
Now you can run the benchmark like this:

```python
from redlite import run
from redlite.benchmark.livecodebench import get_dataset, get_metric

model = ...  # configure the model to be benchmarked
dataset = get_dataset()  # use default partition
metric = get_metric()    # use default grader endpoint http://localhost:8000
run(dataset=dataset, metric=metric, model=model)
```
There are four available test configs (a.k.a. partitions):
| Name | Start Date | End Date | Revision | Records |
|---|---|---|---|---|
| test_v5_2408_2502 | 2024-08 | 2025-02 | v5 | 279 |
| test_v5_2407_2412 | 2024-07 | 2024-12 | v5 | 315 |
| test_v5_2410_2502 | 2024-10 | 2025-02 | v5 | 166 |
| test_v6_2408_2505 | 2024-08 | 2025-05 | v6 | 454 |
The default config is `test_v5_2408_2502`. You can load other configs by passing the config name to the `get_dataset` function.
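For example, mirroring the snippet above, you could select the v6 partition by name (the config name is taken from the table; whether it is passed positionally or as a keyword is not shown here, so this sketch assumes a positional argument like the `diamond` example for GPQA):

```python
from redlite.benchmark.livecodebench import get_dataset

# Load the v6 partition (454 records) instead of the default test_v5_2408_2502
dataset = get_dataset('test_v6_2408_2505')
```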
## AIME 2024
The AIME 2024 benchmark contains 30 math problems from the 2024 AIME I and AIME II tests.

Here is how to run this benchmark in redlite:

```python
from redlite import run
from redlite.benchmark.aime24 import dataset, metric

model = ...  # configure the model to be benchmarked
run(dataset=dataset, metric=metric, model=model)
```
## AIME 2025
The AIME 2025 benchmark contains 30 math problems from the 2025 AIME I and AIME II tests.

Here is how to run this benchmark in redlite:

```python
from redlite import run
from redlite.benchmark.aime25 import dataset, metric

model = ...  # configure the model to be benchmarked
run(dataset=dataset, metric=metric, model=model)
```