
Commit ba82bbe

sbonner0 and cthoyt authored May 19, 2023
🧬 💾 Add the PharMeBINet dataset (pykeen#1257)
Closes pykeen#1256

Note that I had to add some functionality to the `TarFileSingleDataset` class that was already present in the `SingleTabbedDataset` class. This makes it possible to specify which dataframe columns to use as the head, edge, and target columns.

Co-authored-by: Charles Tapley Hoyt <[email protected]>
1 parent fd4fe6c commit ba82bbe

5 files changed (+94 -3 lines changed)

README.md (+3 -2)

@@ -45,7 +45,7 @@
 <p align="center">
 <a href="#installation">Installation</a> •
 <a href="#quickstart">Quickstart</a> •
-<a href="#datasets">Datasets (36)</a> •
+<a href="#datasets">Datasets (37)</a> •
 <a href="#inductive-datasets">Inductive Datasets (5)</a> •
 <a href="#models">Models (44)</a> •
 <a href="#supporters">Support</a> •
@@ -112,7 +112,7 @@ in ``pykeen``.

 ### Datasets

-The following 36 datasets are built in to PyKEEN. The citation for each dataset corresponds to either the paper
+The following 37 datasets are built in to PyKEEN. The citation for each dataset corresponds to either the paper
 describing the dataset, the first paper published using the dataset with knowledge graph embedding models,
 or the URL for the dataset if neither of the first two are available. If you want to use a custom dataset,
 see the [Bring Your Own Dataset](https://pykeen.readthedocs.io/en/latest/byo/data.html) tutorial. If you
@@ -146,6 +146,7 @@ have a suggestion for another dataset to include in PyKEEN, please let us know
 | OpenBioLink | [`pykeen.datasets.OpenBioLink`](https://pykeen.readthedocs.io/en/latest/api/pykeen.datasets.OpenBioLink.html) | [Breit *et al*., 2020](https://doi.org/10.1093/bioinformatics/btaa274) | 180992 | 28 | 4563407 |
 | OpenBioLink LQ | [`pykeen.datasets.OpenBioLinkLQ`](https://pykeen.readthedocs.io/en/latest/api/pykeen.datasets.OpenBioLinkLQ.html) | [Breit *et al*., 2020](https://doi.org/10.1093/bioinformatics/btaa274) | 480876 | 32 | 27320889 |
 | OpenEA Family | [`pykeen.datasets.OpenEA`](https://pykeen.readthedocs.io/en/latest/api/pykeen.datasets.OpenEA.html) | [Sun *et al*., 2020](http://www.vldb.org/pvldb/vol13/p2326-sun.pdf) | 15000 | 248 | 38265 |
+| PharMeBINet | [`pykeen.datasets.PharMeBINet`](https://pykeen.readthedocs.io/en/latest/api/pykeen.datasets.PharMeBINet.html) | [Königs *et al*., 2022](https://www.nature.com/articles/s41597-022-01510-3) | 2869407 | 208 | 15883653 |
 | PharmKG | [`pykeen.datasets.PharmKG`](https://pykeen.readthedocs.io/en/latest/api/pykeen.datasets.PharmKG.html) | [Zheng *et al*., 2020](https://doi.org/10.1093/bib/bbaa344) | 188296 | 39 | 1093236 |
 | PharmKG8k | [`pykeen.datasets.PharmKG8k`](https://pykeen.readthedocs.io/en/latest/api/pykeen.datasets.PharmKG8k.html) | [Zheng *et al*., 2020](https://doi.org/10.1093/bib/bbaa344) | 7247 | 28 | 485787 |
 | PrimeKG | [`pykeen.datasets.PrimeKG`](https://pykeen.readthedocs.io/en/latest/api/pykeen.datasets.PrimeKG.html) | [Chandak *et al*., 2022](https://doi.org/10.1101/2022.05.01.489928) | 129375 | 30 | 8100498 |

docs/source/references.rst (+3)

@@ -123,3 +123,6 @@ References
 .. [peng2020] Y. Peng and J. Zhang (2020) `LineaRE: Simple but Powerful Knowledge Graph Embedding for
    Link Prediction <https://arxiv.org/abs/2004.10037>`_, *2020 IEEE International Conference on Data Mining (ICDM)*,
    pp. 422-431, doi: 10.1109/ICDM50108.2020.00051.
+
+.. [koenigs2022] Königs, C., *et al* (2022) `The heterogeneous pharmacological medical biochemical
+   network PharMeBINet <https://doi.org/10.1038/s41597-022-01510-3>`_, *Scientific Data*, **9**, 393.

src/pykeen/datasets/__init__.py (+2)

@@ -45,6 +45,7 @@
 from .nations import Nations
 from .ogb import OGBBioKG, OGBLoader, OGBWikiKG2
 from .openbiolink import OpenBioLink, OpenBioLinkLQ
+from .pharmebinet import PharMeBINet
 from .pharmkg import PharmKG, PharmKG8k
 from .primekg import PrimeKG
 from .umls import UMLS
@@ -97,6 +98,7 @@
     "PharmKG",
     "PrimeKG",
     "Globi",
+    "PharMeBINet",
 ]

 logger = logging.getLogger(__name__)

src/pykeen/datasets/base.py (+11 -1)

@@ -717,6 +717,7 @@ def __init__(
         create_inverse_triples: bool = False,
         delimiter: Optional[str] = None,
         random_state: TorchRandomHint = None,
+        read_csv_kwargs: Optional[Dict[str, Any]] = None,
     ):
         """Initialize dataset.

@@ -734,6 +735,7 @@ def __init__(
         :param random_state: An optional random state to make the training/testing/validation split reproducible.
         :param delimiter:
             The delimiter for the contained dataset.
+        :param read_csv_kwargs: Keyword arguments to pass through to :func:`pandas.read_csv`.
         """
         self.cache_root = self._help_cache(cache_root)

@@ -743,6 +745,8 @@ def __init__(
         self.url = url
         self._create_inverse_triples = create_inverse_triples
         self._relative_path = pathlib.PurePosixPath(relative_path)
+        self.read_csv_kwargs = read_csv_kwargs or {}
+        self.read_csv_kwargs.setdefault("sep", self.delimiter)

         if eager:
             self._load()
@@ -808,7 +812,13 @@ def _get_df(self) -> pd.DataFrame:
                 # tarfile does not like pathlib
                 tar_file.extract(str(self._relative_path), self.cache_root)

-        df = pd.read_csv(_actual_path, sep=self.delimiter)
+        df = pd.read_csv(_actual_path, **self.read_csv_kwargs)
+
+        usecols = self.read_csv_kwargs.get("usecols")
+        if usecols is not None:
+            logger.info("reordering columns: %s", usecols)
+            df = df[usecols]
+
         return df

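The defaulting logic added to `__init__` above can be sketched in isolation. This is a minimal illustration, not the PyKEEN implementation: the helper name `resolve_read_csv_kwargs` and the sample arguments are hypothetical, and only the `or {}` / `setdefault` pattern is taken from the diff.

```python
from typing import Any, Dict, Optional


def resolve_read_csv_kwargs(
    read_csv_kwargs: Optional[Dict[str, Any]],
    delimiter: str,
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the two lines added to ``__init__``."""
    resolved = read_csv_kwargs or {}
    # setdefault only writes "sep" when the caller did not supply one.
    resolved.setdefault("sep", delimiter)
    return resolved


# No kwargs given: the dataset delimiter becomes the separator.
default_case = resolve_read_csv_kwargs(None, "\t")

# An explicit "sep" from the caller takes precedence over the delimiter.
override_case = resolve_read_csv_kwargs({"sep": ",", "usecols": ["h", "r", "t"]}, "\t")

# The caller's dictionary is reused rather than copied, so it gains a "sep" key.
passed = {"usecols": ["h", "r", "t"]}
resolve_read_csv_kwargs(passed, "\t")

print(default_case)
print(override_case["sep"])
print(passed)
```

One consequence of this pattern is that a dictionary passed in by the caller is mutated in place; copying it first (for example with `dict(read_csv_kwargs or {})`) would avoid that, should it matter.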

src/pykeen/datasets/pharmebinet.py (+75)

@@ -0,0 +1,75 @@
+# -*- coding: utf-8 -*-
+
+"""The `PharMeBINet <https://github.com/ckoenigs/PharMeBINet/>`_ dataset.
+
+Get a summary with ``python -m pykeen.datasets.pharmebinet``.
+"""
+
+import click
+from docdata import parse_docdata
+from more_click import verbose_option
+
+from .base import TarFileSingleDataset
+from ..typing import TorchRandomHint
+
+__all__ = [
+    "PharMeBINet",
+]
+
+RAW_URL = "https://zenodo.org/record/7011027/files/pharmebinet_tsv_2022_08_19_v2.tar.gz"
+
+
+@parse_docdata
+class PharMeBINet(TarFileSingleDataset):
+    """The PharMeBINet dataset from [koenigs2022]_.
+
+    ---
+    name: PharMeBINet
+    citation:
+        github: ckoenigs/PharMeBINet
+        author: Königs
+        year: 2022
+        link: https://www.nature.com/articles/s41597-022-01510-3
+    single: true
+    statistics:
+        entities: 2869407
+        relations: 208
+        triples: 15883653
+        training: 12702210
+        testing: 1587776
+        validation: 1587777
+    """
+
+    def __init__(
+        self,
+        random_state: TorchRandomHint = 0,
+        **kwargs,
+    ):
+        """Initialize the PharMeBINet dataset from [koenigs2022]_.
+
+        :param random_state: An optional random state to make the training/testing/validation split reproducible.
+        :param kwargs: keyword arguments passed to :class:`pykeen.datasets.base.TarFileSingleDataset`.
+        """
+        super().__init__(
+            url=RAW_URL,
+            relative_path="edges.tsv",
+            random_state=random_state,
+            read_csv_kwargs=dict(
+                usecols=["start_id", "type", "end_id"],
+                sep="\t",
+                dtype={"start_id": str, "end_id": str},
+            ),
+            **kwargs,
+        )
+
+
+@click.command()
+@verbose_option
+def _main():
+    from pykeen.datasets import get_dataset
+
+    get_dataset(dataset=PharMeBINet).summarize()
+
+
+if __name__ == "__main__":
+    _main()
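To see why `_get_df` re-indexes the dataframe after reading it, the `read_csv_kwargs` used by `PharMeBINet` can be tried on a tiny in-memory file. This is a sketch under assumptions: the two sample rows, the extra `resource` column, and the column order of the sample are made up for illustration and may not match the real `edges.tsv`. It relies on the documented pandas behavior that `usecols` ignores element order.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt; the real edges.tsv may order its columns differently.
SAMPLE_TSV = (
    "type\tstart_id\tend_id\tresource\n"
    "TREATS\t0012\t345\tdemo\n"
    "BINDS\t7\t0089\tdemo\n"
)

# The keyword arguments that PharMeBINet passes as read_csv_kwargs in this commit.
read_csv_kwargs = dict(
    usecols=["start_id", "type", "end_id"],
    sep="\t",
    dtype={"start_id": str, "end_id": str},
)

raw = pd.read_csv(io.StringIO(SAMPLE_TSV), **read_csv_kwargs)
# usecols filters the columns but keeps the order found in the file.
print(list(raw.columns))

# Indexing with the same list puts the columns in head, relation, tail order.
ordered = raw[read_csv_kwargs["usecols"]]
print(list(ordered.columns))

# dtype=str keeps identifiers such as "0012" from being parsed as integers.
print(ordered["start_id"].tolist())
```

Without the explicit re-indexing step, a loader that treats the first, second, and third columns as head, relation, and tail would depend on the column order of the file rather than on the order given in `usecols`.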
