S1M0N38/master-thesis-datasets

Datasets

  • datasets/
    • CIFAR100/ toy-dataset, 100 classes, 6-level hierarchy
      • classes/ contains classes.txt (each line is a class name)
      • descriptions/ contains descriptions of classes generated by writers
      • embeddings/ contains embeddings of descriptions generated by embedders
      • encodings/ contains encodings (i.e. targets) generated from hierarchy or from embeddings by encoders
      • hierarchy/ contains various representations of the hierarchy
      • inputs/ empty directory to fill with the actual dataset (e.g. soft link to a directory containing images)
    • iNaturalist19/ 1010 classes, 8-level hierarchy
      • ...
    • tieredImageNet/ 608 classes, 13-level hierarchy
      • ...
  • embedders/ take descriptions and produce embeddings
  • encoders/ take embeddings or the hierarchy and produce encodings
  • writers/ generate descriptions of classes from classes.txt
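
Since classes.txt stores one class name per line, it can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the helper names `load_classes` and `class_to_index` are hypothetical, and using the line position as an integer label is an assumption.

```python
from pathlib import Path


def load_classes(path):
    """Read a classes.txt file and return the class names in file order."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Skip blank lines (e.g. a trailing newline) and strip stray whitespace.
    return [line.strip() for line in lines if line.strip()]


def class_to_index(classes):
    """Map each class name to its line position (assumed usable as a label)."""
    return {name: index for index, name in enumerate(classes)}
```

For example, `load_classes("datasets/CIFAR100/classes/classes.txt")` should return the class names of the toy dataset, assuming the file has been generated.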

Usage

  • python utils/hierarchy.py generates lca.npy and hierarchy.npy for all datasets.

  • python utils/classes.py generates classes.txt for all datasets.

  • python writers/[writer].py --dataset [DATASET] generates descriptions of the class names in the given dataset.

  • python embedders/[embedder].py --dataset [DATASET] --writer [WRITER] generates embeddings from the descriptions produced by the given writer for the given dataset.

  • python encoders/[encoder].py --dataset [DATASET] {--writer [WRITER]} generates encodings for the given dataset, either from description embeddings or from the hierarchy.

  • bash encode.sh [DATASET] is a wrapper that encodes a dataset using all encoders with fixed hyperparameters.
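
The writer, embedder, and encoder steps above run in sequence, so they can be chained from a small script. The sketch below only builds the command lines and prints them as a dry run. The function `pipeline_commands` and the names `my_writer`, `my_embedder`, and `my_encoder` are placeholders, and it assumes that the value passed to `--writer` matches the writer script's file name.

```python
import shlex
import sys


def pipeline_commands(dataset, writer, embedder, encoder):
    """Return the writer, embedder, and encoder invocations as argv lists."""
    python = sys.executable or "python"
    return [
        [python, f"writers/{writer}.py", "--dataset", dataset],
        [python, f"embedders/{embedder}.py", "--dataset", dataset, "--writer", writer],
        [python, f"encoders/{encoder}.py", "--dataset", dataset, "--writer", writer],
    ]


if __name__ == "__main__":
    # Placeholder script names: replace them with files that exist in
    # writers/, embedders/ and encoders/.
    for argv in pipeline_commands("CIFAR100", "my_writer", "my_embedder", "my_encoder"):
        # Dry run: print each command instead of executing it.
        print(shlex.join(argv))
```

To execute the steps instead of printing them, each argv list can be passed to `subprocess.run(argv, check=True)` from the repository root, so that a failing step stops the chain.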

Softlinking datasets

Suppose you have the CIFAR100 dataset, downloaded by torchvision, at /data/user/dataset-inputs/cifar-100-python. To create the softlink, run:

ln -s /data/user/dataset-inputs/cifar-100-python datasets/CIFAR100/inputs/cifar-100-python

Using softlinks is a convenient way to have all the datasets in one location, allowing you to access them from different locations without the need to copy them.

Softlinks can also be defined for folder-datasets, e.g. iNaturalist19:

ln -s /data/user/datasets-inputs/iNaturalist19/train datasets/iNaturalist19/inputs/train
ln -s /data/user/datasets-inputs/iNaturalist19/val datasets/iNaturalist19/inputs/val
ln -s /data/user/datasets-inputs/iNaturalist19/test datasets/iNaturalist19/inputs/test
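
The same links can be created from Python, which may be convenient in a setup script. This is a minimal sketch using `pathlib`. The helper `link_inputs` is hypothetical and not part of this repository, and unlike the `ln -s` commands above it always resolves the source to an absolute path.

```python
from pathlib import Path


def link_inputs(source, inputs_dir, name=None):
    """Create inputs_dir/<name> as a symbolic link pointing at source."""
    source = Path(source).resolve()
    if not source.exists():
        raise FileNotFoundError(f"source does not exist: {source}")
    inputs_dir = Path(inputs_dir)
    inputs_dir.mkdir(parents=True, exist_ok=True)
    link = inputs_dir / (name or source.name)
    if link.is_symlink():
        # Replace an existing link so the call can be repeated safely.
        link.unlink()
    # If a real file or directory already sits at this path, symlink_to
    # raises FileExistsError rather than overwriting it.
    link.symlink_to(source, target_is_directory=source.is_dir())
    return link
```

For instance, `link_inputs("/data/user/datasets-inputs/iNaturalist19/train", "datasets/iNaturalist19/inputs")` would create the `train` link from the first command above.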

Visualizing encodings

Encodings can be visualized by projecting them onto a 2D space with a dimensionality-reduction algorithm. Here we provide projections computed with UMAP, which can be explored using:

python projectors/explore.py --dataset CIFAR100

This will spawn a web interface at http://localhost:8050/ where you can select an encoding and color its points by hierarchy level by moving a slider.
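
To illustrate what coloring by hierarchy level means, the sketch below assigns each class the ancestor found at a chosen depth, so that classes sharing an ancestor get the same color. It is not how projectors/explore.py is implemented: the function `labels_at_level`, the root-to-leaf path representation, and the toy hierarchy are all assumptions made for illustration.

```python
def labels_at_level(paths, level):
    """Return, for each class, its ancestor at the given depth (0 = root).

    Each path must be non-empty. A class whose path is shorter than
    level + 1 keeps its deepest node.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    return {cls: path[min(level, len(path) - 1)] for cls, path in paths.items()}


# Hypothetical toy hierarchy: each class maps to its root-to-leaf path.
paths = {
    "oak": ["entity", "plant", "tree", "oak"],
    "pine": ["entity", "plant", "tree", "pine"],
    "wolf": ["entity", "animal", "mammal", "wolf"],
}
```

With this toy hierarchy, level 1 groups "oak" and "pine" under "plant" and puts "wolf" under "animal", while level 3 gives every class its own label.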