OpenCompass provides an evaluation tutorial for the DeepSeek R1 series of reasoning models on mathematical datasets. The overall approach is as follows:
- At the model level, we recommend using the sampling approach to reduce repetitions caused by greedy decoding
- For datasets with limited samples, we employ multiple evaluation runs and take the average
- For answer validation, we utilize LLM-based verification to reduce misjudgments from rule-based evaluation
Please follow OpenCompass's installation guide. We provide an example configuration in `examples/eval_deepseek_r1.py`. The configuration is explained below:
```python
# Configuration supporting multiple runs (example)
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets

datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)
```
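The `locals()` idiom above simply flattens every variable whose name ends in `_datasets` into one list, so additional benchmark configs imported the same way are picked up automatically. A minimal, self-contained sketch of the pattern (the dataset dicts here are hypothetical stand-ins, not real OpenCompass configs):

```python
# Standalone illustration of the aggregation idiom; these *_datasets values are made up.
aime2024_datasets = [dict(abbr=f'aime2024-run{i}') for i in range(8)]
other_datasets = [dict(abbr='some-other-benchmark')]

datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)
print([d['abbr'] for d in datasets])
# ['aime2024-run0', ..., 'aime2024-run7', 'some-other-benchmark']
```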
```python
# LLM validator configuration. Users need to deploy an API service via
# LMDeploy/vLLM/SGLang or use an OpenAI-compatible endpoint
from opencompass.models import OpenAISDK

verifier_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',  # Replace with actual path
    key='YOUR_API_KEY',  # Use real API key
    openai_api_base=['http://your-api-endpoint'],  # Replace with API endpoint
    query_per_second=16,
    batch_size=1024,
    temperature=0.001,
    max_out_len=16384,
)
```
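Before launching the full evaluation, it is worth confirming that the judge endpoint actually responds. A minimal sanity-check sketch using the `openai` Python client; the base URL, API key, and model name are placeholders, and whether a `/v1` suffix is needed depends on how the service was deployed:

```python
# Optional sanity check for the verifier endpoint (all connection details are placeholders).
from openai import OpenAI

client = OpenAI(base_url='http://your-api-endpoint/v1', api_key='YOUR_API_KEY')
resp = client.chat.completions.create(
    model='Qwen/Qwen2.5-32B-Instruct',
    messages=[{'role': 'user', 'content': 'Reply with OK.'}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```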
```python
# Apply the validator to all datasets
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
```
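Optionally, a quick loop can confirm that the judge was attached as expected (this assumes each dataset config exposes an `abbr` key, which standard OpenCompass dataset configs do):

```python
# Optional check: list which dataset evaluators now carry the LLM judge.
for item in datasets:
    judge = item['eval_cfg']['evaluator'].get('judge_cfg')
    if judge:
        print(f"{item['abbr']}: judged by {judge['abbr']}")
```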
We provide an example that uses LMDeploy as the inference backend for the evaluated reasoning model; users can modify `path` (i.e., the Hugging Face model path) to evaluate other models.
```python
# LMDeploy model configuration example
from opencompass.models import TurboMindModelwithChatTemplate

models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-7b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
        ),
        max_seq_len=32768,
        batch_size=64,
        run_cfg=dict(num_gpus=1),
        # Strip the reasoning (think) content from predictions before answer extraction
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
    # Extendable 14B/32B configurations...
]
```
```python
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
```
```python
# Average results over multiple runs
summary_groups = [
    {
        'name': 'AIME2024-Average8',
        'subsets': [[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)],
    },
    # Other dataset average configurations...
]

summarizer = dict(
    dataset_abbrs=[
        ['AIME2024-Average8', 'naive_average'],
        # Other dataset metrics...
    ],
    summary_groups=summary_groups,
)

# Work directory configuration
work_dir = "outputs/deepseek_r1_reasoning"
```
Launch the evaluation with:

```bash
opencompass examples/eval_deepseek_r1.py --debug --dump-eval-details
```

Evaluation logs will be printed to the command line.
You need to modify the `infer` configuration in the configuration file and set `num_worker` to 8:

```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
At the same time, remove the `--debug` parameter from the evaluation command:

```bash
opencompass examples/eval_deepseek_r1.py --dump-eval-details
```

In this mode, OpenCompass uses multithreading to launch `num_worker` tasks. Logs are not displayed in the command line; instead, detailed evaluation logs are written under `work_dir`.
For larger models such as DeepSeek-R1-Distill-Qwen-14B, note that `num_gpus` in the model's `run_cfg` needs to be set to 2 (if an inference backend is used, the corresponding parameters, such as `tp` in LMDeploy, also need to be changed to 2), and `num_worker` in the `infer` configuration should be set to 4:
```python
models += [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-14b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
        ),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=128,
        run_cfg=dict(num_gpus=2),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
]

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
The evaluation results are displayed as follows:
```text
dataset              version    metric          mode    deepseek-r1-distill-qwen-7b-turbomind
-------------------  ---------  --------------  ------  ---------------------------------------
MATH                 -          -               -
AIME2024-Average8    -          naive_average   gen     56.25
```
Since the model uses sampling for decoding and the AIME dataset is small, a performance fluctuation of 1-3 points can remain even when averaging over 8 runs.
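As a rough illustration of why this spread persists: AIME2024 contains only 30 problems, so each problem is worth about 3.3 points of accuracy. The per-run scores in the sketch below are hypothetical, not measured values:

```python
# Hypothetical per-run accuracies (%) on a 30-problem set; each problem is ~3.33 points.
import statistics

run_scores = [53.3, 56.7, 60.0, 56.7, 53.3, 60.0, 56.7, 53.3]
mean = statistics.mean(run_scores)
stdev = statistics.stdev(run_scores)
stderr = stdev / len(run_scores) ** 0.5  # rough standard error of the 8-run average
print(f'average={mean:.2f}, per-run stdev={stdev:.2f}, stderr of average={stderr:.2f}')
```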
| Model | Dataset | Metric | Value |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |