Tutorial for Evaluating Reasoning Models

OpenCompass provides a tutorial for evaluating the DeepSeek-R1 series of reasoning models on mathematical datasets.

  • At the model level, we recommend sampling-based decoding to reduce the repetition that greedy decoding tends to produce
  • For datasets with few samples, we run the evaluation multiple times and report the average
  • For answer validation, we use an LLM-based verifier to reduce the misjudgments of rule-based checking

Installation and Preparation

Please follow OpenCompass's installation guide.

Evaluation Configuration Setup

We provide example configurations in examples/eval_deepseek_r1.py. Below is the configuration explanation:

Configuration Interpretation

1. Dataset and Validator Configuration

# Configuration supporting multiple evaluation runs (example: AIME2024 with 8 repeats)
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets

# Gather every variable ending in `_datasets` into one flat list
datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)

# LLM verifier configuration. Deploy an API service via LMDeploy/vLLM/SGLang,
# or use any OpenAI-compatible endpoint
from opencompass.models import OpenAISDK

verifier_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',  # Replace with the actual model path
    key='YOUR_API_KEY',  # Replace with a real API key
    openai_api_base=['http://your-api-endpoint'],  # Replace with your API endpoint
    query_per_second=16,
    batch_size=1024,
    temperature=0.001,
    max_out_len=16384,
)

# Attach the verifier to every dataset whose evaluator expects a judge model
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
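If you do not already have a verifier service running, one possible way to expose an OpenAI-compatible endpoint is LMDeploy's api_server, shown here purely as an illustration (any OpenAI-compatible deployment works, and the port is arbitrary):

lmdeploy serve api_server Qwen/Qwen2.5-32B-Instruct --server-port 23333

With this example deployment, openai_api_base in verifier_cfg would be ['http://0.0.0.0:23333/v1'].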

2. Model Configuration

We provide an example that uses LMDeploy as the inference backend; users can modify path (i.e., the HuggingFace model path) to evaluate other models.

# LMDeploy model configuration example
# (import paths may differ slightly across OpenCompass versions)
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.utils import extract_non_reasoning_content

models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-7b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
        ),
        max_seq_len=32768,
        batch_size=64,
        run_cfg=dict(num_gpus=1),
        # Strip the reasoning trace from predictions before scoring
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
    # Extendable 14B/32B configurations...
]
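Conceptually, the pred_postprocessor removes the model's reasoning trace so that only the final answer is scored. A minimal sketch of the idea (not the actual OpenCompass implementation, which handles more edge cases):

import re

def strip_reasoning(text: str) -> str:
    # Drop everything inside <think>...</think> and keep only the final answer
    return re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()

print(strip_reasoning('<think>Try small cases first...</think>The answer is 42.'))
# -> The answer is 42.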

3. Evaluation Process Configuration

from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
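Total GPU usage is the product of num_worker here and num_gpus in each model's run_cfg (see the execution scenarios below). As a quick sanity check, expressed as a hypothetical helper that is not part of OpenCompass:

def total_gpus(num_worker: int, num_gpus_per_model: int) -> int:
    # Each inference worker loads its own model replica
    return num_worker * num_gpus_per_model

assert total_gpus(1, 1) == 1  # Scenario 1
assert total_gpus(8, 1) == 8  # Scenario 2
assert total_gpus(4, 2) == 8  # Scenario 3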

4. Summary Configuration

# Averaging configuration for multiple runs
summary_groups = [
    {
        'name': 'AIME2024-Average8',
        'subsets': [[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)],
    },
    # Other dataset average configurations...
]

summarizer = dict(
    dataset_abbrs=[
        ['AIME2024-Average8', 'naive_average'],
        # Other dataset metrics...
    ],
    summary_groups=summary_groups,
)

# Work directory configuration
work_dir = "outputs/deepseek_r1_reasoning"
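The naive_average metric is simply the mean of the per-run accuracies. For example, with hypothetical per-run scores on the 30-problem AIME2024 set:

run_accuracies = [53.3, 56.7, 60.0, 56.7, 53.3, 56.7, 60.0, 53.3]  # hypothetical values
print(sum(run_accuracies) / len(run_accuracies))  # 56.25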

Evaluation Execution

Scenario 1: Model loaded on 1 GPU, data evaluated by 1 worker, using a total of 1 GPU

opencompass examples/eval_deepseek_r1.py --debug --dump-eval-details

Evaluation logs will be output in the command line.

Scenario 2: Model loaded on 1 GPU, data evaluated by 8 workers, using a total of 8 GPUs

You need to modify the infer configuration in the config file, setting num_worker to 8:

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

At the same time, remove the --debug flag from the evaluation command:

opencompass examples/eval_deepseek_r1.py --dump-eval-details

In this mode, OpenCompass uses multithreading to launch $num_worker tasks. Logs are not printed to the command line; instead, detailed evaluation logs are written under $work_dir.
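A typical layout under the work directory looks roughly like the following (exact contents vary across OpenCompass versions):

outputs/deepseek_r1_reasoning/
└── 20250208_120000/     # one timestamped folder per launch
    ├── configs/         # the resolved config used for this run
    ├── logs/            # per-task inference and evaluation logs
    ├── predictions/     # raw model outputs per dataset
    ├── results/         # per-dataset metric files
    └── summary/         # aggregated summary tables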

Scenario 3: Model loaded on 2 GPUs, data evaluated by 4 workers, using a total of 8 GPUs

Note that in the model configuration, num_gpus in run_cfg needs to be set to 2. If an inference backend is used, its parallelism parameter must be changed accordingly (e.g., tp=2 for LMDeploy). At the same time, set num_worker in the infer configuration to 4:

models += [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-14b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
        ),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=128,
        run_cfg=dict(num_gpus=2),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
]

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

Evaluation Results

The evaluation results are displayed as follows:

dataset            version    metric         mode    deepseek-r1-distill-qwen-7b-turbomind
-----------------  ---------  -------------  ------  -------------------------------------
MATH               -          -              -
AIME2024-Average8  -          naive_average  gen     56.25

Performance Baseline

Since the model uses sampling for decoding and the AIME dataset is small, scores may still fluctuate by 1-3 points even when averaged over 8 runs.
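This magnitude matches a rough binomial estimate (a back-of-the-envelope sketch that assumes the AIME2024 split has 30 problems and treats all 240 samples as independent):

import math

p, n_questions, n_runs = 0.56, 30, 8  # observed accuracy, problems per run, repeats
stderr = math.sqrt(p * (1 - p) / (n_questions * n_runs))
print(f'{100 * stderr:.1f} points')  # ~3.2 points for one standard error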

Model                         Dataset   Metric    Value
----------------------------  --------  --------  -----
DeepSeek-R1-Distill-Qwen-7B   AIME2024  Accuracy  56.3
DeepSeek-R1-Distill-Qwen-14B  AIME2024  Accuracy  74.2
DeepSeek-R1-Distill-Qwen-32B  AIME2024  Accuracy  74.2