Bring Your Own Data
===================
As an alternative to using a pre-packaged dataset, the training and testing can be set explicitly
- by file path or with instances of :class:`pykeen.triples.TriplesFactory`.
+ by file path or with instances of :class:`pykeen.triples.TriplesFactory`. Throughout this
+ tutorial, the paths to the training, testing, and validation sets for built-in
+ :class:`pykeen.datasets.Nations` will be used as examples.

Pre-stratified Dataset
----------------------
You've got a training and testing file as 3-column TSV files, all ready to go. You're sure that there aren't
any entities or relations appearing in the testing set that don't appear in the training set. Load them in the
pipeline like this:

- .. code-block:: python
-
-     from pykeen.triples import TriplesFactory
-     from pykeen.pipeline import pipeline
-
-     training_path: str = ...
-     testing_path: str = ...
-
-     result = pipeline(
-         training_triples_factory=training_path,
-         testing_triples_factory=testing_path,
-         model='TransE',
-     )
-     result.save_to_directory('test_pre_stratified_transe')
+ >>> from pykeen.triples import TriplesFactory
+ >>> from pykeen.pipeline import pipeline
+ >>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
+ >>> result = pipeline(
+ ...     training=NATIONS_TRAIN_PATH,
+ ...     testing=NATIONS_TEST_PATH,
+ ...     model='TransE',
+ ...     training_kwargs=dict(num_epochs=5),  # short epochs for testing - you should go higher
+ ... )
+ >>> result.save_to_directory('doctests/test_pre_stratified_transe')

PyKEEN will take care of making sure that the entities are mapped from their labels to appropriate integer
(technically, 0-dimensional :class:`torch.LongTensor`) indexes and that the different sets of triples
@@ -31,68 +29,54 @@ share the same mapping.
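The shared-mapping idea can be illustrated without PyKEEN at all. The following is a minimal sketch, assuming toy triples and two hypothetical helpers (``build_mapping`` and ``map_triples``) that are not part of PyKEEN; the indexes PyKEEN actually assigns may differ. It builds the label-to-index mappings from the training triples only and reuses them for the testing triples, which is why a testing entity or relation that is missing from training would fail with a ``KeyError`` here.

```python
from typing import Dict, Iterable, List, Tuple

LabeledTriple = Tuple[str, str, str]
MappedTriple = Tuple[int, int, int]


def build_mapping(labels: Iterable[str]) -> Dict[str, int]:
    """Assign consecutive integer indexes to the sorted unique labels (hypothetical helper)."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def map_triples(
    triples: Iterable[LabeledTriple],
    entity_to_id: Dict[str, int],
    relation_to_id: Dict[str, int],
) -> List[MappedTriple]:
    """Translate labeled triples to index triples; raises KeyError on unseen labels."""
    return [
        (entity_to_id[head], relation_to_id[relation], entity_to_id[tail])
        for head, relation, tail in triples
    ]


# Toy data, invented for illustration only
training = [
    ("brazil", "exports", "uk"),
    ("uk", "treaties", "usa"),
    ("usa", "exports", "brazil"),
]
testing = [("brazil", "treaties", "usa")]

# The mappings come from the training triples only ...
entity_to_id = build_mapping(e for head, _, tail in training for e in (head, tail))
relation_to_id = build_mapping(relation for _, relation, _ in training)

# ... and are shared by both sets of triples
training_ids = map_triples(training, entity_to_id, relation_to_id)
testing_ids = map_triples(testing, entity_to_id, relation_to_id)
```

Because both calls to ``map_triples`` receive the same dictionaries, an index means the same entity or relation in either set.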
This is equally applicable for the :func:`pykeen.hpo.hpo_pipeline`, which has a similar interface to
the :func:`pykeen.pipeline.pipeline` as in:

- .. code-block:: python
-
-     from pykeen.triples import TriplesFactory
-     from pykeen.hpo import hpo_pipeline
-
-     training_path: str = ...
-     testing_path: str = ...
-
-     result = hpo_pipeline(
-         n_trials=30,
-         training_triples_factory=training_path,
-         testing_triples_factory=testing_path,
-         model='TransE',
-     )
-     result.save_to_directory('test_hpo_pre_stratified_transe')
+ >>> from pykeen.hpo import hpo_pipeline
+ >>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH, NATIONS_VALIDATE_PATH
+ >>> result = hpo_pipeline(
+ ...     n_trials=3,  # you probably want more than this
+ ...     training=NATIONS_TRAIN_PATH,
+ ...     testing=NATIONS_TEST_PATH,
+ ...     validation=NATIONS_VALIDATE_PATH,
+ ...     model='TransE',
+ ...     training_kwargs=dict(num_epochs=5),  # short epochs for testing - you should go higher
+ ... )
+ >>> result.save_to_directory('doctests/test_hpo_pre_stratified_transe')

The remainder of the examples will be for :func:`pykeen.pipeline.pipeline`, but all work exactly the same
for :func:`pykeen.hpo.hpo_pipeline`.

If you want to add dataset-wide arguments, you can use the ``dataset_kwargs`` argument
to the :class:`pykeen.pipeline.pipeline` to enable options like ``create_inverse_triples=True``.

- .. code-block:: python
-
-     from pykeen.triples import TriplesFactory
-     from pykeen.pipeline import pipeline
-
-     training_path: str = ...
-     testing_path: str = ...
-
-     result = pipeline(
-         training_triples_factory=training_path,
-         testing_triples_factory=testing_path,
-         dataset_kwargs={'create_inverse_triples': True},
-         model='TransE',
-     )
-     result.save_to_directory('test_pre_stratified_transe')
+ >>> from pykeen.pipeline import pipeline
+ >>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
+ >>> result = pipeline(
+ ...     training=NATIONS_TRAIN_PATH,
+ ...     testing=NATIONS_TEST_PATH,
+ ...     dataset_kwargs={'create_inverse_triples': True},
+ ...     model='TransE',
+ ...     training_kwargs=dict(num_epochs=5),  # short epochs for testing - you should go higher
+ ... )
+ >>> result.save_to_directory('doctests/test_pre_stratified_transe')

If you want finer control over how the triples are created, for example, if they are not all coming from
TSV files, you can use the :class:`pykeen.triples.TriplesFactory` interface.

- .. code-block:: python
-
-     from pykeen.triples import TriplesFactory
-     from pykeen.pipeline import pipeline
-
-     training_path: str = ...
-     testing_path: str = ...
-
-     training = TriplesFactory(path=training_path)
-     testing = TriplesFactory(
-         path=testing_path,
-         entity_to_id=training.entity_to_id,
-         relation_to_id=training.relation_to_id,
-     )
-
-     result = pipeline(
-         training_triples_factory=training,
-         testing_triples_factory=testing,
-         model='TransE',
-     )
-     pipeline_result.save_to_directory('test_pre_stratified_transe')
+ >>> from pykeen.triples import TriplesFactory
+ >>> from pykeen.pipeline import pipeline
+ >>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
+ >>> training = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
+ >>> testing = TriplesFactory.from_path(
+ ...     NATIONS_TEST_PATH,
+ ...     entity_to_id=training.entity_to_id,
+ ...     relation_to_id=training.relation_to_id,
+ ... )
+ >>> result = pipeline(
+ ...     training=training,
+ ...     testing=testing,
+ ...     model='TransE',
+ ...     training_kwargs=dict(num_epochs=5),  # short epochs for testing - you should go higher
+ ... )
+ >>> result.save_to_directory('doctests/test_pre_stratified_transe')

.. warning::

@@ -106,31 +90,26 @@ The ``dataset_kwargs`` argument is ignored when passing your own :class:`pykeen.
sure to include the ``create_inverse_triples=True`` in the instantiation of those classes if that's your
desired behavior as in:

- .. code-block:: python
-
-     from pykeen.triples import TriplesFactory
-     from pykeen.pipeline import pipeline
-
-     training_path: str = ...
-     testing_path: str = ...
-
-     training = TriplesFactory(
-         path=training_path,
-         create_inverse_triples=True,
-     )
-     testing = TriplesFactory(
-         path=testing_path,
-         entity_to_id=training.entity_to_id,
-         relation_to_id=training.relation_to_id,
-         create_inverse_triples=True,
-     )
-
-     result = pipeline(
-         training_triples_factory=training,
-         testing_triples_factory=testing,
-         model='TransE',
-     )
-     result.save_to_directory('test_pre_stratified_transe')
+ >>> from pykeen.triples import TriplesFactory
+ >>> from pykeen.pipeline import pipeline
+ >>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
+ >>> training = TriplesFactory.from_path(
+ ...     NATIONS_TRAIN_PATH,
+ ...     create_inverse_triples=True,
+ ... )
+ >>> testing = TriplesFactory.from_path(
+ ...     NATIONS_TEST_PATH,
+ ...     entity_to_id=training.entity_to_id,
+ ...     relation_to_id=training.relation_to_id,
+ ...     create_inverse_triples=True,
+ ... )
+ >>> result = pipeline(
+ ...     training=training,
+ ...     testing=testing,
+ ...     model='TransE',
+ ...     training_kwargs=dict(num_epochs=5),  # short epochs for testing - you should go higher
+ ... )
+ >>> result.save_to_directory('doctests/test_pre_stratified_transe')

Triples factories can also be instantiated using the ``triples`` keyword argument instead of the ``path`` argument
if you already have triples loaded in a :class:`numpy.ndarray`.
@@ -141,37 +120,34 @@ It's more realistic your real-world dataset is not already stratified into train
PyKEEN has you covered with :func:`pykeen.triples.TriplesFactory.split`, which will allow you to create
a stratified dataset.

- .. code-block:: python
-
-     from pykeen.triples import TriplesFactory
-     from pykeen.pipeline import pipeline
-
-     tf = TriplesFactory(path=...)
-     training, testing = tf.split()
-
-     result = pipeline(
-         training_triples_factory=training,
-         testing_triples_factory=testing,
-         model='TransE',
-     )
-     pipeline_result.save_to_directory('test_unstratified_transe')
+ >>> from pykeen.triples import TriplesFactory
+ >>> from pykeen.pipeline import pipeline
+ >>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
+ >>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
+ >>> training, testing = tf.split()
+ >>> result = pipeline(
+ ...     training=training,
+ ...     testing=testing,
+ ...     model='TransE',
+ ...     training_kwargs=dict(num_epochs=5),  # short epochs for testing - you should go higher
+ ... )
+ >>> result.save_to_directory('doctests/test_unstratified_transe')

By default, this is an 80/20 split. If you want to use early stopping, you'll also need a validation set, so
you should specify the splits:

- .. code-block:: python
-
-     from pykeen.triples import TriplesFactory
-     from pykeen.pipeline import pipeline
-
-     tf = TriplesFactory(path=...)
-     training, testing, validation = tf.split([.8, .1, .1])
-
-     result = pipeline(
-         training_triples_factory=training,
-         testing_triples_factory=testing,
-         validation_triples_factory=validation,
-         model='TransE',
-         stopper='early',
-     )
-     pipeline_result.save_to_directory('test_unstratified_stopped_transe')
+ >>> from pykeen.triples import TriplesFactory
+ >>> from pykeen.pipeline import pipeline
+ >>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
+ >>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
+ >>> training, testing, validation = tf.split([.8, .1, .1])
+ >>> result = pipeline(
+ ...     training=training,
+ ...     testing=testing,
+ ...     validation=validation,
+ ...     model='TransE',
+ ...     stopper='early',
+ ...     training_kwargs=dict(num_epochs=5),  # short epochs for testing - you should go
+ ...                                          # higher, especially with early stopper enabled
+ ... )
+ >>> result.save_to_directory('doctests/test_unstratified_stopped_transe')
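
To make the ratios passed to the split more concrete, here is a minimal sketch of a ratio-based random split in plain Python. The helper ``split_by_ratios`` and the toy triples are hypothetical, and this is not PyKEEN's implementation; in particular, it makes no attempt to keep every entity and relation in the training portion.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def split_by_ratios(
    triples: Sequence[Triple],
    ratios: Sequence[float],
    seed: int = 0,
) -> List[List[Triple]]:
    """Shuffle the triples and cut them into consecutive parts sized by ``ratios``."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n = len(shuffled)

    # Turn cumulative ratios into cut positions; the last part takes the remainder
    cuts = []
    cumulative = 0.0
    for ratio in ratios[:-1]:
        cumulative += ratio
        cuts.append(round(cumulative * n))

    parts = []
    start = 0
    for end in cuts + [n]:
        parts.append(shuffled[start:end])
        start = end
    return parts


# Toy triples, invented for illustration only
triples = [(f"e{i}", "r", f"e{i + 1}") for i in range(100)]

# Same order as in the tutorial: training, testing, validation
training, testing, validation = split_by_ratios(triples, [0.8, 0.1, 0.1])
```

Every triple lands in exactly one part, so the testing and validation triples are never seen during training.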