Commit 07c96ac (parent 46cc789), authored Aug 1, 2024

Calm dataset (#1385)

* Add CALM Dataset

72 files changed: +9931 -0 lines

configs/datasets/calm/README.md (+117 lines)

# CaLM Lite

**CaLM Lite** is a lightweight version of CaLM.

**Ca**usal evaluation of **L**anguage **M**odels (CaLM), to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. The CaLM framework establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results).

<div align="center">

[🌐 Website](https://opencausalab.github.io/CaLM) |
[📃 Report](https://arxiv.org/abs/2405.00622) | [🎆 Github](https://github.com/OpenCausaLab/CaLM) | 📧 Welcome to join us by email at causalai@pjlab.org.cn

</div>

## Quick Start

### Data Preparation

Download the dataset to the `data/` folder:

```
wget https://github.com/OpenCausaLab/CaLM/releases/download/v1.0.0.lite/calm.zip
unzip calm.zip
```
### Run Model and Infer

To obtain a concise output with only the average information for all tasks, use:

```
python run.py --models YOUR_MODEL --datasets calm --summarizer calm
```

If you want detailed information for each task, use:

```
python run.py --models YOUR_MODEL --datasets calm
```

The `--summarizer calm` flag in the first command produces a summarized output; omitting it, as in the second command, yields task-specific details.
## Available Causal Tasks

We provide 92 tasks for causal evaluation, stored in the `data/calm` folder. For more information about our causal tasks, refer to [tasks](https://github.com/OpenCausaLab/CaLM/blob/main/documents/tasks.md).

The directory structure is:

```
├── calm
│   ├── association
│   ├── causal_discovery              # Rung of the causal ladder
│   │   ├── abstract_reasoning        # Causal scenario
│   │   │   ├── AR-B_CaLM-AR_CN.json  # Causal task
│   │   │   └── AR-B_CaLM-AR_EN.json  # Causal task
│   │   └── ...
│   └── ...
└── ...
```

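The file names follow a recognizable pattern: a scenario code plus a question-type code, then the subset, then a language suffix (e.g. `AR-B_CaLM-AR_EN.json` is the binary abstract-reasoning task in English). A small parser sketch under that assumed naming convention (the helper name `parse_calm_task` is ours, not part of CaLM):

```python
def parse_calm_task(filename: str) -> tuple[str, str, str, str]:
    """Split a CaLM task file name into (scenario, question_type, subset, language).

    Assumes the '<SCENARIO>-<TYPE>_<subset>_<LANG>.json' convention
    seen in the directory listing above.
    """
    stem = filename.removesuffix('.json')
    stem, lang = stem.rsplit('_', 1)        # e.g. 'AR-B_CaLM-AR', 'EN'
    head, subset = stem.split('_', 1)       # e.g. 'AR-B', 'CaLM-AR'
    scenario, question_type = head.split('-', 1)
    return scenario, question_type, subset, lang
```

Splitting the language with `rsplit` and the subset with a single `split('_', 1)` keeps subsets that themselves contain underscores (such as `causal_judgement`) intact.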
## Dataset

- **Dataset size**: CaLM Lite uses a light dataset of **9,200** samples, while CaLM uses a significantly larger dataset of 126,334. The table below details the English dataset composition; the Chinese version is structured identically.
- **Dataset configuration**: We prioritize balance in our dataset for **binary classification** and **choice selection** questions. By ensuring an equal number of each ground-truth (GT) label, we minimize the risk of introducing bias into the model's testing. For **probability calculation**, CaLM Lite pays extra attention to balancing the number of problems across different causal reasoning processes. (For more details on how the causal reasoning process is defined, please refer to Section 9.1.6 of the [paper](https://arxiv.org/abs/2405.00622).)
- **Efficient evaluation**: For enhanced evaluation efficiency, OpenCompass offers customizable methods. Refer to the [documentation](https://opencompass.org.cn/doc) for guidance on tailoring these methods to your needs.

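The label-balance claim for binary-classification subsets is easy to spot-check on a downloaded task file. A minimal sketch, assuming each task JSON is a list of records carrying a `gt_item` field (the schema the OpenCompass reader config in this commit suggests; the helper name is ours):

```python
import json
from collections import Counter

def label_distribution(path: str) -> Counter:
    """Count ground-truth labels in one CaLM task file.

    Assumed schema: a JSON array of records, each with a 'gt_item' field.
    """
    with open(path, encoding='utf-8') as f:
        records = json.load(f)
    return Counter(rec['gt_item'] for rec in records)

# e.g. label_distribution('data/calm/causal_discovery/abstract_reasoning/AR-B_CaLM-AR_EN.json')
# should show the labels of a balanced binary task in equal proportion.
```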
| Causal ladder | Causal scenario | Subset | Question type | Mode | CaLM Lite | CaLM |
|---------------|-----------------|--------|---------------|------|-----------|------|
| Causal discovery | PCD | E-CARE | Binary classification | Natural | 100 | 2000 |
| Causal discovery | PCD | E-CARE | Choice selection | Natural | 100 | 1000 |
| Causal discovery | PCD | COPA | Binary classification | Natural | 100 | 2000 |
| Causal discovery | PCD | COPA | Choice selection | Natural | 100 | 1000 |
| Causal discovery | ECI | CTB | Binary classification | Natural | 100 | 596 |
| Causal discovery | ECI | ESC | Binary classification | Natural | 100 | 1000 |
| Causal discovery | ECI | MAVEN-ERE | Binary classification | Natural | 100 | 1000 |
| Causal discovery | AR | CaLM-AR | Binary classification | Symbolic | 100 | 1600 |
| Causal discovery | CA | FP | Binary classification | Symbolic | 100 | 1600 |
| Causal discovery | CA | FA | Binary classification | Symbolic | 100 | 1600 |
| Association | CORR | correlation | Binary classification | Natural | 100 | 1476 |
| Association | EAE | exp-away | Binary classification | Natural | 100 | 168 |
| Intervention | CB | collider-bias | Binary classification | Natural | 100 | 163 |
| Intervention | ATE | ATE-natural | Binary classification | Natural | 100 | 1600 |
| Intervention | ATE | ATE-basic | Probability calculation | Mathematical | 100 | 1600 |
| Intervention | ATE | ATE-hard | Probability calculation | Mathematical | 100 | 1600 |
| Intervention | CDE | CDE-natural | Binary classification | Natural | 100 | 1600 |
| Intervention | CDE | CDE-basic | Probability calculation | Mathematical | 100 | 1600 |
| Intervention | CDE | CDE-hard | Probability calculation | Mathematical | 100 | 1600 |
| Intervention | BAS | backadj | Binary classification | Natural | 100 | 227 |
| Intervention | BAS | max-BAS | Choice selection | Symbolic | 100 | 1600 |
| Intervention | BAS | min-BAS | Choice selection | Symbolic | 100 | 1600 |
| Intervention | BAS | mix-BAS | Choice selection | Symbolic | 100 | 1600 |
| Intervention | FAS | FAS | Choice selection | Symbolic | 100 | 1600 |
| Intervention | IV | CaLM-IV | Choice selection | Symbolic | 100 | 1600 |
| Intervention | CEI | 0.2-UC | Binary classification | Symbolic | 100 | 1600 |
| Intervention | CEI | 0.4-UC | Binary classification | Symbolic | 100 | 1600 |
| Intervention | CEI | 0.6-UC | Binary classification | Symbolic | 100 | 1600 |
| Intervention | CEI | 0.8-UC | Binary classification | Symbolic | 100 | 1600 |
| Counterfactuals | ETT | ETT-natural | Binary classification | Natural | 100 | 1600 |
| Counterfactuals | ETT | ETT-basic | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | ETT | ETT-hard | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | NDE | NDE-natural | Binary classification | Natural | 100 | 1600 |
| Counterfactuals | NDE | NDE-basic | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | NDE | NDE-hard | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | NIE | NIE-natural | Binary classification | Natural | 100 | 1600 |
| Counterfactuals | NIE | NIE-basic | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | NIE | NIE-hard | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | PN | PN-basic | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | PN | PN-hard | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | PS | PS-basic | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | PS | PS-hard | Probability calculation | Mathematical | 100 | 1600 |
| Counterfactuals | AC | causal judgement | Binary classification | Natural | 100 | 187 |
| Counterfactuals | CR | CRASS | Choice selection | Natural | 100 | 274 |
| Counterfactuals | CR | det-counterfactual | Binary classification | Natural | 100 | 1476 |
| Counterfactuals | CEG | E-CARE | Open-ended generation | Natural | 100 | 1000 |
| **Total** | | | | | 4600 | 63167 |

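The totals in the table are internally consistent: 46 English subsets at 100 samples each give 4,600 for CaLM Lite (9,200 with the Chinese mirror), and the full-CaLM column sums to 63,167. A quick arithmetic check, with the per-subset counts copied from the CaLM column above in table order:

```python
# Per-subset sample counts from the CaLM (full) column of the table above.
calm_counts = [
    2000, 1000, 2000, 1000,        # PCD (E-CARE, COPA)
    596, 1000, 1000,               # ECI (CTB, ESC, MAVEN-ERE)
    1600, 1600, 1600,              # AR, CA (FP, FA)
    1476, 168,                     # CORR, EAE
    163,                           # CB
    1600, 1600, 1600,              # ATE
    1600, 1600, 1600,              # CDE
    227, 1600, 1600, 1600,         # BAS (backadj, max, min, mix)
    1600, 1600,                    # FAS, IV
    1600, 1600, 1600, 1600,        # CEI (0.2/0.4/0.6/0.8-UC)
    1600, 1600, 1600,              # ETT
    1600, 1600, 1600,              # NDE
    1600, 1600, 1600,              # NIE
    1600, 1600,                    # PN
    1600, 1600,                    # PS
    187, 274, 1476, 1000,          # AC, CR (CRASS, det-counterfactual), CEG
]
assert len(calm_counts) == 46          # one entry per table row
assert sum(calm_counts) == 63167       # matches the Total row
assert 46 * 100 == 4600                # CaLM Lite, English
assert 2 * 46 * 100 == 9200            # EN + CN, the size quoted above
```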
## Available Prompt Styles (Adaptation)

Basic Prompt is our default setting for efficient evaluation of CaLM Lite, but we provide flexibility for exploring additional prompts through CaLM. If you'd like to explore and compare a wider range of prompts, we encourage you to use CaLM; a comprehensive and easy-to-follow guide is available in our [repository](https://github.com/OpenCausaLab/CaLM).

## Citation

```
@misc{chen2024causal,
      title={Causal Evaluation of Language Models},
      author={Sirui Chen and Bo Peng and Meiqi Chen and Ruiqi Wang and Mengying Xu and Xingyu Zeng and Rui Zhao and Shengjie Zhao and Yu Qiao and Chaochao Lu},
      year={2024},
      eprint={2405.00622},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

configs/datasets/calm/calm.py (+160 lines)

```
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CaLMDataset, CaLMEvaluator

# Maps each task name to its directory under data/calm/.
task_hierarchy_dict = {
    # association/
    #   correlation/
    'CORR-B_correlation_CN': 'association/correlation/',
    'CORR-B_correlation_EN': 'association/correlation/',
    #   explaining_away_effect/
    'EAE-B_exp-away_CN': 'association/explaining_away_effect/',
    'EAE-B_exp-away_EN': 'association/explaining_away_effect/',
    # causal_discovery/
    #   abstract_reasoning/
    'AR-B_CaLM-AR_CN': 'causal_discovery/abstract_reasoning/',
    'AR-B_CaLM-AR_EN': 'causal_discovery/abstract_reasoning/',
    #   causal_attribution/
    'CA-B_FA_CN': 'causal_discovery/causal_attribution/',
    'CA-B_FA_EN': 'causal_discovery/causal_attribution/',
    'CA-B_FP_CN': 'causal_discovery/causal_attribution/',
    'CA-B_FP_EN': 'causal_discovery/causal_attribution/',
    #   event_causality_identification/
    'ECI-B_CTB_CN': 'causal_discovery/event_causality_identification/',
    'ECI-B_CTB_EN': 'causal_discovery/event_causality_identification/',
    'ECI-B_ESC_CN': 'causal_discovery/event_causality_identification/',
    'ECI-B_ESC_EN': 'causal_discovery/event_causality_identification/',
    'ECI-B_MAVEN-ERE_CN': 'causal_discovery/event_causality_identification/',
    'ECI-B_MAVEN-ERE_EN': 'causal_discovery/event_causality_identification/',
    #   pairwise_causal_discovery/
    'PCD-B_COPA_CN': 'causal_discovery/pairwise_causal_discovery/',
    'PCD-B_COPA_EN': 'causal_discovery/pairwise_causal_discovery/',
    'PCD-B_E-CARE_CN': 'causal_discovery/pairwise_causal_discovery/',
    'PCD-B_E-CARE_EN': 'causal_discovery/pairwise_causal_discovery/',
    'PCD-C_COPA_CN': 'causal_discovery/pairwise_causal_discovery/',
    'PCD-C_COPA_EN': 'causal_discovery/pairwise_causal_discovery/',
    'PCD-C_E-CARE_CN': 'causal_discovery/pairwise_causal_discovery/',
    'PCD-C_E-CARE_EN': 'causal_discovery/pairwise_causal_discovery/',
    # counterfactual/
    #   actual_causality/
    'AC-B_causal_judgement_CN': 'counterfactual/actual_causality/',
    'AC-B_causal_judgement_EN': 'counterfactual/actual_causality/',
    #   causal_explanation_generation/
    'CEG-O_E-CARE_CN': 'counterfactual/causal_explanation_generation/',
    'CEG-O_E-CARE_EN': 'counterfactual/causal_explanation_generation/',
    #   counterfactual_reasoning/
    'CR-B_det-counterfactual_CN': 'counterfactual/counterfactual_reasoning/',
    'CR-B_det-counterfactual_EN': 'counterfactual/counterfactual_reasoning/',
    'CR-C_CRASS_CN': 'counterfactual/counterfactual_reasoning/',
    'CR-C_CRASS_EN': 'counterfactual/counterfactual_reasoning/',
    #   effect_of_the_treatment_on_the_treated/
    'ETT-B_ETT-natural_CN': 'counterfactual/effect_of_the_treatment_on_the_treated/',
    'ETT-B_ETT-natural_EN': 'counterfactual/effect_of_the_treatment_on_the_treated/',
    'ETT-P_ETT-basic_CN': 'counterfactual/effect_of_the_treatment_on_the_treated/',
    'ETT-P_ETT-basic_EN': 'counterfactual/effect_of_the_treatment_on_the_treated/',
    'ETT-P_ETT-hard_CN': 'counterfactual/effect_of_the_treatment_on_the_treated/',
    'ETT-P_ETT-hard_EN': 'counterfactual/effect_of_the_treatment_on_the_treated/',
    #   natural_direct_effect/
    'NDE-B_NDE-natural_CN': 'counterfactual/natural_direct_effect/',
    'NDE-B_NDE-natural_EN': 'counterfactual/natural_direct_effect/',
    'NDE-P_NDE-basic_CN': 'counterfactual/natural_direct_effect/',
    'NDE-P_NDE-basic_EN': 'counterfactual/natural_direct_effect/',
    'NDE-P_NDE-hard_CN': 'counterfactual/natural_direct_effect/',
    'NDE-P_NDE-hard_EN': 'counterfactual/natural_direct_effect/',
    #   natural_indirect_effect/
    'NIE-B_NIE-natural_CN': 'counterfactual/natural_indirect_effect/',
    'NIE-B_NIE-natural_EN': 'counterfactual/natural_indirect_effect/',
    'NIE-P_NIE-basic_CN': 'counterfactual/natural_indirect_effect/',
    'NIE-P_NIE-basic_EN': 'counterfactual/natural_indirect_effect/',
    'NIE-P_NIE-hard_CN': 'counterfactual/natural_indirect_effect/',
    'NIE-P_NIE-hard_EN': 'counterfactual/natural_indirect_effect/',
    #   probability_of_necessity/
    'PN-P_PN-basic_CN': 'counterfactual/probability_of_necessity/',
    'PN-P_PN-basic_EN': 'counterfactual/probability_of_necessity/',
    'PN-P_PN-hard_CN': 'counterfactual/probability_of_necessity/',
    'PN-P_PN-hard_EN': 'counterfactual/probability_of_necessity/',
    #   probability_of_sufficiency/
    'PS-P_PS-basic_CN': 'counterfactual/probability_of_sufficiency/',
    'PS-P_PS-basic_EN': 'counterfactual/probability_of_sufficiency/',
    'PS-P_PS-hard_CN': 'counterfactual/probability_of_sufficiency/',
    'PS-P_PS-hard_EN': 'counterfactual/probability_of_sufficiency/',
    # intervention/
    #   average_treatment_effect/
    'ATE-B_ATE-natural_CN': 'intervention/average_treatment_effect/',
    'ATE-B_ATE-natural_EN': 'intervention/average_treatment_effect/',
    'ATE-P_ATE-basic_CN': 'intervention/average_treatment_effect/',
    'ATE-P_ATE-basic_EN': 'intervention/average_treatment_effect/',
    'ATE-P_ATE-hard_CN': 'intervention/average_treatment_effect/',
    'ATE-P_ATE-hard_EN': 'intervention/average_treatment_effect/',
    #   backdoor_adjustment_set/
    'BAS-B_backadj_CN': 'intervention/backdoor_adjustment_set/',
    'BAS-B_backadj_EN': 'intervention/backdoor_adjustment_set/',
    'BAS-C_max-BAS_CN': 'intervention/backdoor_adjustment_set/',
    'BAS-C_max-BAS_EN': 'intervention/backdoor_adjustment_set/',
    'BAS-C_min-BAS_CN': 'intervention/backdoor_adjustment_set/',
    'BAS-C_min-BAS_EN': 'intervention/backdoor_adjustment_set/',
    'BAS-C_mix-BAS_CN': 'intervention/backdoor_adjustment_set/',
    'BAS-C_mix-BAS_EN': 'intervention/backdoor_adjustment_set/',
    #   causal_effect_identification/
    'CEI-B_0.2-UC_CN': 'intervention/causal_effect_identification/',
    'CEI-B_0.2-UC_EN': 'intervention/causal_effect_identification/',
    'CEI-B_0.4-UC_CN': 'intervention/causal_effect_identification/',
    'CEI-B_0.4-UC_EN': 'intervention/causal_effect_identification/',
    'CEI-B_0.6-UC_CN': 'intervention/causal_effect_identification/',
    'CEI-B_0.6-UC_EN': 'intervention/causal_effect_identification/',
    'CEI-B_0.8-UC_CN': 'intervention/causal_effect_identification/',
    'CEI-B_0.8-UC_EN': 'intervention/causal_effect_identification/',
    #   collider_bias/
    'CB-B_collider-bias_CN': 'intervention/collider_bias/',
    'CB-B_collider-bias_EN': 'intervention/collider_bias/',
    #   controlled_direct_effect/
    'CDE-B_CDE-natural_CN': 'intervention/controlled_direct_effect/',
    'CDE-B_CDE-natural_EN': 'intervention/controlled_direct_effect/',
    'CDE-P_CDE-basic_CN': 'intervention/controlled_direct_effect/',
    'CDE-P_CDE-basic_EN': 'intervention/controlled_direct_effect/',
    'CDE-P_CDE-hard_CN': 'intervention/controlled_direct_effect/',
    'CDE-P_CDE-hard_EN': 'intervention/controlled_direct_effect/',
    #   frontdoor_adjustment_set/
    'FAS-C_FAS_CN': 'intervention/frontdoor_adjustment_set/',
    'FAS-C_FAS_EN': 'intervention/frontdoor_adjustment_set/',
    #   instrumental_variable/
    'IV-C_CaLM-IV_CN': 'intervention/instrumental_variable/',
    'IV-C_CaLM-IV_EN': 'intervention/instrumental_variable/',
}

calm_reader_cfg = dict(
    input_columns=['question'],
    output_column='gt_item')

# Strip the trailing '_CN'/'_EN' suffix to obtain the unique task sets.
calm_all_sets = list(set(key[:-3] for key in task_hierarchy_dict.keys()))

calm_datasets = []
for _name in calm_all_sets:
    for _prompt_style in ['basic', 'basic-CN']:
        _task_name = _name + ('_CN' if _prompt_style.endswith('-CN') else '_EN')
        _path = f'./data/calm/{task_hierarchy_dict[_task_name]}{_task_name}.json'

        calm_infer_cfg = dict(
            prompt_template=dict(
                type=PromptTemplate,
                template='{question}'),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer, max_out_len=500))

        calm_eval_cfg = dict(evaluator=dict(
            type=CaLMEvaluator,
            core_metrics=True,
            error_analysis=True,
            prompt_style=_prompt_style,
            task=_task_name))
        calm_datasets.append(
            dict(
                abbr=f'calm_{_task_name}',
                type=CaLMDataset,
                path=_path,
                prompt_style=_prompt_style,
                reader_cfg=calm_reader_cfg,
                infer_cfg=calm_infer_cfg,
                eval_cfg=calm_eval_cfg))
del _prompt_style, _task_name, _path, _name
```
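
The config strips the `_CN`/`_EN` suffix (`key[:-3]`) to reduce the 92 dictionary keys to 46 unique task sets, then regenerates both language variants per set, so `calm_datasets` ends up with 92 entries, matching the "92 tasks" in the README. A minimal sketch of that de-duplication step (the four task names here are just an illustrative subset of the full dictionary):

```python
# Illustrative subset of the config's task keys; every key ends in
# a three-character language tag, '_CN' or '_EN'.
task_keys = ['AR-B_CaLM-AR_CN', 'AR-B_CaLM-AR_EN',
             'CORR-B_correlation_CN', 'CORR-B_correlation_EN']

# Same suffix-stripping trick as the config (key[:-3] drops '_CN'/'_EN').
task_sets = sorted(set(key[:-3] for key in task_keys))
assert task_sets == ['AR-B_CaLM-AR', 'CORR-B_correlation']

# Each set then yields one '_EN' and one '_CN' dataset entry, so the
# full config's 46 sets produce 46 * 2 = 92 datasets.
assert len(task_sets) * 2 == len(task_keys)
```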
