Built-in Metrics
Match metric
Credits the LLM answer if and only if it matches the expected one.
One can optionally choose to ignore casing and/or strip punctuation.
Matching can be done using the following strategies:
exact (default): strings must be the same
prefix: actual response should start with the expected sequence of words
contains: actual response must contain the expected sequence of words
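The three strategies can be illustrated with a minimal, standalone sketch. This is not redlite's implementation: the function names and the `ignore_case` / `strip_punct` option names are hypothetical, chosen only to mirror the description above, and punctuation is assumed to mean ASCII punctuation.

```python
import string


def normalize(text: str, ignore_case: bool = False, strip_punct: bool = False) -> str:
    """Optionally lowercase the text and drop ASCII punctuation (hypothetical options)."""
    if ignore_case:
        text = text.lower()
    if strip_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text.strip()


def match_score(expected: str, actual: str, strategy: str = "exact", **options) -> float:
    """Return 1.0 when `actual` matches `expected` under the chosen strategy, else 0.0."""
    expected = normalize(expected, **options)
    actual = normalize(actual, **options)
    exp_words, act_words = expected.split(), actual.split()
    n = len(exp_words)
    if strategy == "exact":
        # Whole strings must be identical after normalization.
        matched = actual == expected
    elif strategy == "prefix":
        # The response must start with the expected sequence of words.
        matched = act_words[:n] == exp_words
    elif strategy == "contains":
        # The expected word sequence must appear somewhere in the response.
        matched = any(act_words[i:i + n] == exp_words for i in range(len(act_words) - n + 1))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return 1.0 if matched else 0.0
```

Comparing word sequences rather than raw substrings means that "cat" does not match inside "concatenate"; whether the library makes the same choice should be checked against the Reference documentation.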
from redlite.metric import MatchMetric
metric = MatchMetric(...)
Please see Reference documentation for more detail and available parameters.
BLEU metric
A sentence-level BLEU metric. One can choose one of BLEU-1, BLEU-2, BLEU-3, or BLEU-4.
from redlite.metric.bleu import BleuMetric
metric = BleuMetric(...)
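For intuition about what the BLEU-1 through BLEU-4 choice controls, here is a rough standalone sketch of an unsmoothed sentence-level BLEU: the geometric mean of clipped n-gram precisions up to the chosen order, multiplied by a brevity penalty. It assumes whitespace tokenization and a single reference; the helper names and the `max_order` parameter are hypothetical, and the library may tokenize or smooth differently.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(expected: str, actual: str, max_order: int = 4) -> float:
    """Unsmoothed sentence-level BLEU-`max_order` against a single reference."""
    ref, hyp = expected.split(), actual.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_order + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped overlap: a hypothesis n-gram is credited at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(overlap / total)
    # Penalize hypotheses that are shorter than the reference.
    brevity_penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum / max_order)
```

Without smoothing, any short answer that lacks a matching 4-gram scores zero under BLEU-4, which is one reason a lower order can be preferable for short expected answers.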
For benchmarking in CJK languages, consider using BleuCJKMetric. The only difference is
in the tokenization: the CJK version treats every character as a separate token.
from redlite.metric.bleu import BleuCJKMetric
metric = BleuCJKMetric(...)
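The effect of the tokenization difference can be seen in a small standalone sketch. Both helper names are hypothetical, and the handling of whitespace (dropped here) is an assumption rather than documented library behavior.

```python
def tokenize_words(text: str) -> list:
    """Whitespace tokenization, suitable for languages that separate words with spaces."""
    return text.split()


def tokenize_chars(text: str) -> list:
    """Character tokenization: every non-whitespace character becomes its own token."""
    return [ch for ch in text if not ch.isspace()]


sentence = "我喜欢猫"
print(tokenize_words(sentence))  # the whole sentence is a single token
print(tokenize_chars(sentence))  # one token per character
```

With whitespace tokenization, a CJK sentence collapses into a single token, so n-gram overlap metrics become all-or-nothing; splitting into characters restores partial credit.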
Please see Reference documentation for more detail and available parameters.
ROUGE metric
A standard ROUGE metric. One can choose one of rougeL, rouge1, or rouge2.
from redlite.metric.rouge import RougeMetric
metric = RougeMetric(...)
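To make the difference between the rouge1/rouge2 and rougeL variants concrete, here is a rough standalone sketch. It assumes whitespace tokenization and reports the F-measure; which of precision, recall, or F-measure the library reports is not stated above, so treat that choice (and the function names) as assumptions.

```python
from collections import Counter


def _f_measure(overlap: int, ref_total: int, hyp_total: int) -> float:
    """Harmonic mean of precision and recall computed from an overlap count."""
    if overlap == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(expected: str, actual: str, n: int = 1) -> float:
    """ROUGE-N: overlap of n-grams between reference and response."""
    ref, hyp = expected.split(), actual.split()
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    overlap = sum((ref_counts & hyp_counts).values())  # multiset intersection
    return _f_measure(overlap, sum(ref_counts.values()), sum(hyp_counts.values()))


def rouge_l(expected: str, actual: str) -> float:
    """ROUGE-L: based on the longest common subsequence (LCS) of tokens."""
    ref, hyp = expected.split(), actual.split()
    prev = [0] * (len(hyp) + 1)  # LCS lengths for the previous reference token
    for ref_token in ref:
        curr = [0]
        for j, hyp_token in enumerate(hyp, start=1):
            if ref_token == hyp_token:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return _f_measure(prev[-1], len(ref), len(hyp))
```

ROUGE-2 rewards contiguous two-word matches, while ROUGE-L tolerates gaps as long as word order is preserved, so the two can rank the same response differently.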
For benchmarking in CJK languages, consider using RougeCJKMetric. The only difference is
in the tokenization: the CJK version treats every character as a separate token.
from redlite.metric.rouge import RougeCJKMetric
metric = RougeCJKMetric(...)
Please see Reference documentation for more detail and available parameters.
F1 on word set metric
Computes the F1 score over the sets of tokens in the expected and actual strings.
from redlite.metric.f1 import F1Metric
metric = F1Metric(...)
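As an illustration of the idea, a set-based F1 can be sketched in a few lines. This is a standalone approximation that assumes whitespace tokenization with no normalization; the function name is hypothetical and the library's tokenizer may differ.

```python
def f1_word_set(expected: str, actual: str) -> float:
    """F1 over the sets of whitespace-separated tokens; duplicates and order are ignored."""
    expected_set, actual_set = set(expected.split()), set(actual.split())
    common = expected_set & actual_set
    if not common:
        return 0.0
    precision = len(common) / len(actual_set)   # share of response tokens that are expected
    recall = len(common) / len(expected_set)    # share of expected tokens that were produced
    return 2 * precision * recall / (precision + recall)
```

Because the comparison is on sets, word order and repetition do not affect the score: "a b" and "a a a b" are treated as identical.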
Please see Reference documentation for more detail and available parameters.
Boxed Math metrics
This metric is specific to the Math500 benchmark. It scores answers to math problems, where the final answer
is expected to be wrapped in a LaTeX \boxed{...} command.
from redlite.metric.math import BoxedMathMetric
metric = BoxedMathMetric()
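The core step of such a metric, pulling the final answer out of \boxed{...}, can be sketched as follows. This standalone example is an assumption-laden illustration: it takes the last \boxed{...} occurrence, matches braces by counting, and compares answers as whitespace-stripped strings. The function names are hypothetical, and the actual metric may normalize or compare mathematical expressions differently.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in `text`, or None if there is none."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # braces never balanced


def boxed_score(expected: str, actual: str) -> float:
    """1.0 if the boxed answer equals `expected` (ignoring whitespace), else 0.0."""
    answer = extract_boxed(actual)
    if answer is None:
        return 0.0
    strip_spaces = lambda s: "".join(s.split())
    return 1.0 if strip_spaces(answer) == strip_spaces(expected) else 0.0
```

Counting braces rather than using a simple regular expression keeps nested LaTeX such as \frac{1}{2} intact inside the box.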
Please see Reference documentation for more detail and available parameters.
Live Code Bench metric
This metric is specific to LiveCodeBench. It scores Python code generation and requires that
the server application redlite-livecodebench-grader is running as a Docker container:
docker run -it -p 8000:80 ilabs/redlite-livecodebench-grader
If you host the grader at a different endpoint, pass it as the endpoint parameter of the
LiveCodeBenchMetric constructor (the default endpoint is http://localhost:8000).
Example:
from redlite.metric.livecodebench import LiveCodeBenchMetric
metric = LiveCodeBenchMetric(endpoint="http://grader.example.com:9999")
Please see Reference documentation for more detail and available parameters.
Best of several metrics
We may want to score an LLM answer with several metrics and keep the best score. For example, for an item whose
expected answer is "I refuse to tell you this", one may want to use MatchMetric(strategy="prefix"), while for items that
expect the model to give a thoughtful answer, RougeMetric(rouge_type="rouge2") is a better fit.
A good metric for such a mixed dataset is BestOfMetric.
prefix_metric = MatchMetric(strategy="prefix")
rouge2_metric = RougeMetric(rouge_type="rouge2")
metric = BestOfMetric(prefix_metric, rouge2_metric)
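The combining logic itself is simple and can be sketched independently of the library. In the standalone example below, scorers are plain functions taking (expected, actual) and returning a float; `best_of`, `prefix_match`, and `word_recall` are hypothetical stand-ins, not redlite classes.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def best_of(*scorers: Scorer) -> Scorer:
    """Combine several scorers into one that returns the highest individual score."""
    if not scorers:
        raise ValueError("at least one scorer is required")

    def score(expected: str, actual: str) -> float:
        return max(scorer(expected, actual) for scorer in scorers)

    return score


def prefix_match(expected: str, actual: str) -> float:
    """Stand-in for a prefix match: 1.0 if the response starts with the expected text."""
    return 1.0 if actual.startswith(expected) else 0.0


def word_recall(expected: str, actual: str) -> float:
    """Stand-in for an overlap metric: share of expected words present in the response."""
    expected_words = set(expected.split())
    if not expected_words:
        return 0.0
    return len(expected_words & set(actual.split())) / len(expected_words)


combined = best_of(prefix_match, word_recall)
```

Taking the maximum means a refusal-style item can earn full credit through the prefix match, while free-form items fall back to the softer overlap score.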
Please see Reference documentation for more detail.
Custom metrics
Custom metrics can be easily integrated; see the Customization Guide.