Comparing changes

base repository: open-compass/opencompass
base: main@{1day}
head repository: open-compass/opencompass
compare: main
  • 4 commits
  • 20 files changed
  • 2 contributors

Commits on Mar 24, 2025

  1. [Update] Add dataset configurations of no max_out_len (#1967)

    * [Update] Add dataset configurations of no max_out_len
    
    * update test torch version
    
    * update test torch version
    
    * update test torch version
    
    * update test torch version
    MaiziXiao authored Mar 24, 2025

    aa05993
  2. [Update] Add SuperGPQA subset metrics (#1966)

    MaiziXiao authored Mar 24, 2025
    db96161
  3. [Update] Add QWQ32b model config (#1959)

    * feat qwq-32b
    
    * fix
    
    * feat phi_4
    
    ---------
    
    Co-authored-by: Linchen Xiao <[email protected]>
    Myhs-phz and MaiziXiao authored Mar 24, 2025

    37307fa
  4. [Update] Add Korbench config with no max_out_len (#1968)

    * Add Korbench no max_out_len
    
    * Add Korbench no max_out_len
    MaiziXiao authored Mar 24, 2025
    07930b8
Showing with 1,135 additions and 9 deletions.
  1. +3 −3 .github/workflows/pr-stage-check.yml
  2. +56 −0 opencompass/configs/datasets/ARC_Prize_Public_Evaluation/arc_prize_public_evaluation_gen_fedd04.py
  3. +45 −0 opencompass/configs/datasets/GaokaoBench/GaokaoBench_no_subjective_gen_d16acb.py
  4. +81 −0 opencompass/configs/datasets/MathBench/mathbench_2024_gen_4b8f28.py
  5. +189 −0 opencompass/configs/datasets/bbh/bbh_llmjudge_gen_b5bdf1.py
  6. +45 −0 opencompass/configs/datasets/bigcodebench/bigcodebench_hard_complete_gen_2888d3.py
  7. +45 −0 opencompass/configs/datasets/bigcodebench/bigcodebench_hard_instruct_gen_c3d5ad.py
  8. +39 −0 opencompass/configs/datasets/cmo_fib/cmo_fib_gen_2783e5.py
  9. +37 −0 opencompass/configs/datasets/gsm8k/gsm8k_0shot_v2_gen_17d799.py
  10. +117 −0 opencompass/configs/datasets/korbench/korbench_llmjudge_gen_17854d.py
  11. +117 −0 opencompass/configs/datasets/korbench/korbench_llmjudge_gen_56cf43.py
  12. +96 −0 opencompass/configs/datasets/math/math_500_llmjudge_gen_6ff468.py
  13. +1 −1 opencompass/configs/datasets/nq/nq_open_1shot_gen_2e45e5.py
  14. +29 −0 opencompass/configs/datasets/scicode/scicode_gen_62c139.py
  15. +2 −2 opencompass/configs/datasets/supergpqa/supergpqa_llmjudge_gen_12b8bc.py
  16. +62 −0 opencompass/configs/datasets/triviaqa/triviaqa_wiki_1shot_gen_c87d61.py
  17. +17 −0 opencompass/configs/models/qwq/lmdeploy_qwq_32b.py
  18. +132 −0 opencompass/datasets/supergpqa/supergpqa.py
  19. +14 −3 opencompass/evaluator/generic_llm_evaluator.py
  20. +8 −0 opencompass/openicl/icl_evaluator/icl_base_evaluator.py
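For orientation, the sketch below shows one way the new model and dataset files in the list above could be combined in a top-level OpenCompass run config. The import paths mirror entries 11 and 17; the exported variable names are assumptions, not confirmed by this diff.

from mmengine.config import read_base

with read_base():
    # Model config added in this compare view (file 17 above); exporting a
    # 'models' list is assumed, not confirmed by the diff.
    from opencompass.configs.models.qwq.lmdeploy_qwq_32b import models as qwq_32b_models
    # LLM-judge Korbench config (file 11 above); the exported variable name is assumed.
    from opencompass.configs.datasets.korbench.korbench_llmjudge_gen_56cf43 import \
        korbench_datasets

models = [*qwq_32b_models]
datasets = [*korbench_datasets]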
6 changes: 3 additions & 3 deletions .github/workflows/pr-stage-check.yml
@@ -20,7 +20,7 @@ jobs:
      matrix:
        python-version: ['3.10']
        include:
-         - torch: 2.0.0
+         - torch: 2.5.1
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
@@ -30,7 +30,7 @@ jobs:
    - name: Upgrade pip
      run: python -m pip install --upgrade pip
    - name: Install PyTorch
-     run: pip install torch==${{matrix.torch}}+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
+     run: pip install torch==${{matrix.torch}} -f https://download.pytorch.org/whl/cpu/torch_stable.html
    - name: Install system dependencies
      run: |
        sudo sed -i '$ a deb http://th.archive.ubuntu.com/ubuntu jammy main' /etc/apt/sources.list
@@ -106,7 +106,7 @@ jobs:
    - name: Upgrade pip
      run: python -m pip install pip --upgrade
    - name: Install PyTorch
-     run: pip install torch==2.0.0+${{matrix.platform}} -f https://download.pytorch.org/whl/${{matrix.platform}}/torch_stable.html
+     run: pip install torch==2.5.1 -f https://download.pytorch.org/whl/cpu/torch_stable.html
    - name: Install opencompass dependencies
      run: |
        pip install -r requirements.txt
56 changes: 56 additions & 0 deletions opencompass/configs/datasets/ARC_Prize_Public_Evaluation/arc_prize_public_evaluation_gen_fedd04.py
@@ -0,0 +1,56 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.arc_prize_public_evaluation import ARCPrizeDataset, ARCPrizeEvaluator


# The system_prompt defines the initial instructions for the model,
# setting the context for solving ARC tasks.
system_prompt = '''You are a puzzle solving wizard. You are given a puzzle from the abstraction and reasoning corpus developed by Francois Chollet.'''

# User message template is a template for creating user prompts. It includes placeholders for training data and test input data,
# guiding the model to learn the rule and apply it to solve the given puzzle.
user_message_template = '''Here are the example input and output pairs from which you should learn the underlying rule to later predict the output for the given test input:
----------------------------------------
{training_data}
----------------------------------------
Now, solve the following puzzle based on its input grid by applying the rules you have learned from the training data.:
----------------------------------------
[{{'input': {input_test_data}, 'output': [[]]}}]
----------------------------------------
What is the output grid? Only provide the output grid in the form as in the example input and output pairs. Do not provide any additional information:'''


arc_prize_public_evaluation_reader_cfg = dict(
    input_columns=['training_data', 'input_test_data'],
    output_column='output_test_data'
)

arc_prize_public_evaluation_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role='SYSTEM', fallback_role='HUMAN', prompt=system_prompt),
                dict(role='HUMAN', prompt=user_message_template),
            ],
        )
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer)
)

arc_prize_public_evaluation_eval_cfg = dict(
    evaluator=dict(type=ARCPrizeEvaluator)
)

arc_prize_public_evaluation_datasets = [
    dict(
        abbr='ARC_Prize_Public_Evaluation',
        type=ARCPrizeDataset,
        path='opencompass/arc_prize_public_evaluation',
        reader_cfg=arc_prize_public_evaluation_reader_cfg,
        infer_cfg=arc_prize_public_evaluation_infer_cfg,
        eval_cfg=arc_prize_public_evaluation_eval_cfg
    )
]
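To make the placeholders in user_message_template concrete, here is a made-up example of what one row's reader columns could look like; the grids are invented and much smaller than real ARC tasks.

# Invented toy example of the reader columns consumed by the template above;
# real ARC grids are larger and come from opencompass/arc_prize_public_evaluation.
example_row = dict(
    training_data=[
        {'input': [[0, 1], [1, 0]], 'output': [[1, 0], [0, 1]]},
        {'input': [[2, 0], [0, 2]], 'output': [[0, 2], [2, 0]]},
    ],
    input_test_data=[[3, 0], [0, 3]],
    output_test_data=[[0, 3], [3, 0]],
)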
45 changes: 45 additions & 0 deletions opencompass/configs/datasets/GaokaoBench/GaokaoBench_no_subjective_gen_d16acb.py
@@ -0,0 +1,45 @@
import os
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import GaokaoBenchDataset
from mmengine.config import read_base

with read_base():
    from .GaokaoBench_prompts import MCQ_prompts, FBQ_prompts

GaokaoBench_datasets = []
for folder, prompts in [
    ('Multiple-choice_Questions', MCQ_prompts),
    ('Fill-in-the-blank_Questions', FBQ_prompts),
]:
    for p in prompts:
        reader_cfg = {
            'input_columns': ['question'],
            'output_column': 'answer',
        }
        infer_cfg = {
            'ice_template': {
                'type': PromptTemplate,
                'template': {'round': [{'role': 'HUMAN', 'prompt': p['prefix_prompt'] + '{question}'}]},
                'ice_token': '</E>',
            },
            'retriever': {'type': ZeroRetriever},
            'inferencer': {'type': GenInferencer},
        }
        eval_cfg = {
            'evaluator': {'type': 'GaokaoBenchEvaluator' + '_' + p['type']},
            'pred_role': 'BOT',
        }
        _base_path = 'opencompass/GAOKAO-BENCH'
        dataset = {
            'type': GaokaoBenchDataset,
            'abbr': 'GaokaoBench_' + p['keyword'],
            'path': _base_path,
            'filename': '/' + folder + '/' + p['keyword'] + '.json',
            'name': p['keyword'],
            'reader_cfg': reader_cfg,
            'infer_cfg': infer_cfg,
            'eval_cfg': eval_cfg,
        }
        GaokaoBench_datasets.append(dataset)
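The loop above only assumes that each prompt entry exposes the keys type, keyword, and prefix_prompt. A hypothetical entry with placeholder values, just to show the shape that MCQ_prompts / FBQ_prompts items are expected to have:

# Placeholder entry illustrating the keys read by the loop above; the real
# values live in GaokaoBench_prompts.py and differ from these stand-ins.
example_mcq_prompt = {
    'type': 'single_choice',          # suffix for 'GaokaoBenchEvaluator_' + p['type']
    'keyword': 'Example_MCQs',        # drives abbr, name and the JSON filename
    'prefix_prompt': 'Answer the following multiple-choice question.\n',  # prepended to '{question}'
}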
81 changes: 81 additions & 0 deletions opencompass/configs/datasets/MathBench/mathbench_2024_gen_4b8f28.py
@@ -0,0 +1,81 @@
from mmengine.config import read_base
from copy import deepcopy
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer, PPLInferencer
from opencompass.openicl.icl_evaluator import CircularEvaluator, AccEvaluator
from opencompass.datasets import MathBenchDataset, math_postprocess_v2
from opencompass.utils.text_postprocessors import first_option_postprocess

with read_base():
    from .mathbench_prompt import zero_shot_prompts, few_shot_prompts, mathbench_sets

# Max for this dataset is 4
num_shot = 0
# Generate reasoning path or not, only for single choice
with_reasoning = True
# Use circular evaluation or not
with_circular_eval = True
# Use PPL mode in single choice test or not
use_ppl_single_choice = False

assert 0 <= num_shot <= 4
if num_shot == 0:
    prompts = zero_shot_prompts
else:
    prompts = {name: p[- 2 * num_shot - 2:] for name, p in few_shot_prompts.items()}

mathbench_datasets = []
for _split in mathbench_sets:
    for _name in mathbench_sets[_split]:
        if 'single_choice' in _name:
            if with_reasoning:
                template_round = prompts[_name + '_with_reasoning']
            else:
                template_round = prompts[_name]
        else:
            template_round = prompts[_name]

        if 'single_choice' in _name:
            pred_postprocessor = dict(type=first_option_postprocess, options='ABCD')
        else:
            pred_postprocessor = dict(type=math_postprocess_v2)

        if 'single_choice' in _name and with_circular_eval:
            evaluator = dict(type=CircularEvaluator)
        else:
            evaluator = dict(type=AccEvaluator)

        # assemble the final config
        mathbench_reader_cfg = dict(input_columns=['question'], output_column='answer')
        if use_ppl_single_choice and 'single_choice' in _name and not with_reasoning:
            template = {}
            for answer in ['A', 'B', 'C', 'D']:
                one_template_round = deepcopy(template_round)
                one_template_round['round'][-1]['prompt'] = one_template_round['round'][-1]['prompt'].format(answer=answer)
                template[answer] = dict(round=one_template_round)
            mathbench_infer_cfg = dict(
                prompt_template=dict(type=PromptTemplate, template=template),
                retriever=dict(type=ZeroRetriever),
                inferencer=dict(type=PPLInferencer),
            )
        else:
            mathbench_infer_cfg = dict(
                prompt_template=dict(type=PromptTemplate, template=dict(round=template_round)),
                retriever=dict(type=ZeroRetriever),
                inferencer=dict(type=GenInferencer),
            )
        mathbench_eval_cfg = dict(evaluator=evaluator, pred_postprocessor=pred_postprocessor)

        mathbench_datasets.append(
            dict(
                abbr='mathbench-' + _split + '-' + _name,
                type=MathBenchDataset,
                path=f'data/mathbench_v1/{_split}',
                name=_name,
                with_circular=with_circular_eval,
                reader_cfg=mathbench_reader_cfg,
                infer_cfg=mathbench_infer_cfg,
                eval_cfg=mathbench_eval_cfg,
            )
        )
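With the defaults in this file (use_ppl_single_choice = False, with_reasoning = True), the GenInferencer branch is used. If the PPL branch is enabled instead, it builds one fully filled-in prompt per option, roughly as illustrated below; the prompt wording is invented, the real text comes from mathbench_prompt.py.

# Rough illustration of the per-option template produced by the PPL branch above,
# assuming the single-choice prompt ends with a '{answer}' placeholder.
template = {
    'A': dict(round=[dict(role='HUMAN', prompt='Question: ... The answer is A')]),
    'B': dict(round=[dict(role='HUMAN', prompt='Question: ... The answer is B')]),
    'C': dict(round=[dict(role='HUMAN', prompt='Question: ... The answer is C')]),
    'D': dict(round=[dict(role='HUMAN', prompt='Question: ... The answer is D')]),
}
# PPLInferencer scores each filled-in candidate and picks the most likely option.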