# Built-in Benchmarks
A benchmark is a combination of a dataset and a metric. We provide several popular benchmarks out of the box.
## MMLU-Pro
The MMLU-Pro benchmark contains 12,032 difficult multiple-choice questions spanning math, chemistry, engineering, and other domains.
Here is how to run it with redlite:

```python
from redlite import run
from redlite.benchmark.mmlu_pro import dataset, metric

model = ...  # configure the model to be benchmarked
run(dataset=dataset, metric=metric, model=model)
```
## GPQA
The GPQA benchmark contains four separate datasets of multiple-choice questions:

- `main`: 448 rows
- `diamond`: 198 rows
- `experts`: 60 rows
- `extended`: 546 rows
Here is how to run the `diamond` version with redlite:

```python
from redlite import run
from redlite.benchmark.gpqa import get_dataset, metric

model = ...  # configure the model to be benchmarked
dataset = get_dataset('diamond')
run(dataset=dataset, metric=metric, model=model)
```
## Math 500
The Math 500 benchmark contains 500 math problems. Here is how to run this benchmark in redlite:

```python
from redlite import run
from redlite.benchmark.math500 import dataset, metric

model = ...  # configure the model to be benchmarked
run(dataset=dataset, metric=metric, model=model)
```
## LiveCodeBench code generation
The LiveCodeBench benchmark contains Python code-generation tasks. Here is how to run this benchmark in redlite.
First, you need to run the grader service with Docker. You may need to build the Docker image first; see the GitHub project redlite-livecodebench-grader for details. Once you have the Docker image, start the grading service:

```shell
docker run -it -p 8000:80 ilabs/redlite-livecodebench-grader:latest
```
Now you can run the benchmark like this:

```python
from redlite import run
from redlite.benchmark.livecodebench import get_dataset, get_metric

model = ...  # configure the model to be benchmarked
dataset = get_dataset()  # use default partition
metric = get_metric()    # use default grader endpoint http://localhost:8000
run(dataset=dataset, metric=metric, model=model)
```
There are four available test configs (a.k.a. partitions):
| Name | Start Date | End Date | Revision | Records |
|---|---|---|---|---|
| test_v5_2408_2502 | 2024-08 | 2025-02 | v5 | 279 |
| test_v5_2407_2412 | 2024-07 | 2024-12 | v5 | 315 |
| test_v5_2410_2502 | 2024-10 | 2025-02 | v5 | 166 |
| test_v6_2408_2505 | 2024-08 | 2025-05 | v6 | 454 |
The default config is `test_v5_2408_2502`. You can load other configs by passing the config name to the `get_dataset` function.
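For example, mirroring the snippet above, you could select the v6 partition by name (the config name is taken from the table; whether it is passed positionally or as a keyword is not shown here, so this sketch assumes a positional argument like the `diamond` example for GPQA):

```python
from redlite.benchmark.livecodebench import get_dataset

# Load the v6 partition (454 records) instead of the default test_v5_2408_2502
dataset = get_dataset('test_v6_2408_2505')
```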
## AIME 2024
The AIME 2024 benchmark contains 30 math problems from the 2024 AIME I and AIME II tests.

Here is how to run this benchmark in redlite:

```python
from redlite import run
from redlite.benchmark.aime24 import dataset, metric

model = ...  # configure the model to be benchmarked
run(dataset=dataset, metric=metric, model=model)
```
## AIME 2025
The AIME 2025 benchmark contains 30 math problems from the 2025 AIME I and AIME II tests.

Here is how to run this benchmark in redlite:

```python
from redlite import run
from redlite.benchmark.aime25 import dataset, metric

model = ...  # configure the model to be benchmarked
run(dataset=dataset, metric=metric, model=model)
```