

This tool can download, generate, convert, and cache datasets. It uses a dataset metadata file that contains URLs to download known existing datasets. It is able to generate some datasets such as TPC-H by calling their (external) generator program (e.g. dbgen).


datalogistik <get|cache>

datalogistik get [-h] \
    -d DATASET \
    -f FORMAT \
    [-s SCALE_FACTOR] \
    [-c COMPRESSION] \
    [-r | --remote]

datalogistik cache [-h] \
    [--clean] \
    [--prune-entry ENTRY] \
    [--prune-invalid] \
-d DATASET: Name of the dataset as specified in the repository, or one of the supported generators (tpc-h, tpc-ds).
-f FORMAT: File format to instantiate the dataset in. If the original dataset specified in the repository has a different format, it will be converted. Supported formats: parquet, csv, arrow.
-s SCALE_FACTOR: Scale factor for generating TPC data. Default 1.
-c COMPRESSION: Compression to be used for the dataset. For Parquet datasets, this value will be passed to the parquet writer. For CSV datasets, supported values are gz (for GZip) or none.
-r, --remote: When set, the requested dataset will not be downloaded to the local filesystem. Instead, datalogistik will return the URL(s) to access the files directly via the remote filesystem supported by Arrow. Conversions cannot be performed on remote datasets; the user needs to upload the desired variant manually and add a corresponding entry to their repo file.
--clean: Perform a clean-up of the cache, checking whether all of the subdirectories are part of a dataset that contains a valid metadata file. Subdirectories that are not will be removed. This option is helpful after manually removing directories from the cache.
--prune-entry ENTRY: Remove a given subdirectory from the cache. The user can specify a particular dataset (e.g. tpc-h/1/parquet/0), or a directory higher in the hierarchy (e.g. tpc-h/100).
--prune-invalid: Validate all entries in the cache for file integrity and remove entries that fail.
Validate all entries in the cache for file integrity and report entries that fail.

Installing using pipx (recommended)

pipx is a CLI tool installer that keeps each tool's dependencies isolated from your working Python session and from other tools. This means you won't have to deal with any dependency version conflicts with datalogistik, and if you change one of datalogistik's dependencies (like pyarrow) in your working Python session, the tool will still work.

Install pipx:

pip install pipx
pipx ensurepath

Note: after this, you need to restart your terminal session!

Install datalogistik:

pipx install \
    --pip-args '--extra-index-url' \

Run datalogistik:

datalogistik get -d type_floats -f csv

Installing using pip

If you are okay with dealing with potential dependency problems, you may install the package with pip:

pip install \
    --extra-index-url \

Run datalogistik:

datalogistik get -d type_floats -f csv

Installing from source

For local development of the package, you may install from source.

Clone the repo:

git clone
cd datalogistik

Install datalogistik and its dependencies:

pip install \
    --extra-index-url \
    -e '.[dev]'
pre-commit install

Run the checks that will be run in CI:

# Lint the repo
pre-commit run --all-files
# Run unit tests
# Run integration test
datalogistik get -d tpc-h -f parquet

TPC Generators

The location of dbgen (the generator for TPC-H data) and dsdgen (the generator for TPC-DS data) can be specified by setting the environment variable DATALOGISTIK_GEN. If it is not set, datalogistik will clone them from a publicly available repo on GitHub and build them from source.


By default, datalogistik caches datasets in the local directory ./datalogistik_cache, which is created if it does not exist yet. The cache is placed in the current working directory unless the DATALOGISTIK_CACHE environment variable specifies another location. It stores every instance of a dataset that the user has requested, including each file format the dataset was instantiated in. There is no manifest that lists the entries in the cache; datalogistik searches the cache by using its directory structure:

TPC datasets
Other datasets

Each entry in the cache has a metadata file called datalogistik_metadata.ini.
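
As a rough illustration of a lookup without a manifest, the sketch below walks a cache directory and treats every directory that holds a datalogistik_metadata.ini file as an entry. It is not datalogistik's own code: the function names are hypothetical, and it assumes that DATALOGISTIK_CACHE names the directory in which datalogistik_cache is placed.

```python
import os
from pathlib import Path

METADATA_FILENAME = "datalogistik_metadata.ini"


def cache_root():
    """Return the cache directory.

    Assumption: DATALOGISTIK_CACHE replaces the current working directory
    as the parent of the datalogistik_cache directory.
    """
    base = os.environ.get("DATALOGISTIK_CACHE") or os.getcwd()
    return Path(base) / "datalogistik_cache"


def list_cache_entries(root):
    """Return the relative paths of all directories containing a metadata file."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.parent.relative_to(root) for p in root.rglob(METADATA_FILENAME))
```

Calling list_cache_entries(cache_root()) would then yield paths such as tpc-h/1/parquet/0 for each entry found.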


datalogistik uses pyarrow to convert between formats. It is able to convert datasets that are too large to fit in memory by using the pyarrow Datasets API.


datalogistik uses a metadata repository file for finding downloadable datasets. By default, it downloads the repo file from the datalogistik GitHub repository, but you can override this by setting the DATALOGISTIK_REPO environment variable. You can also point it to a JSON file on your local filesystem.

The default repo.json file included is based on sources taken from the arrowbench repo.
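
The two ways of locating the repo file could be handled along the following lines. This is a sketch, not datalogistik's implementation: load_repo and load_repo_from_env are hypothetical helpers, and the code assumes that DATALOGISTIK_REPO holds either an http(s) URL or a local path.

```python
import json
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def load_repo(location):
    """Load a repository JSON file from an http(s) URL or a local path."""
    if urlparse(location).scheme in ("http", "https"):
        with urlopen(location) as response:
            return json.load(response)
    with open(location, encoding="utf-8") as f:
        return json.load(f)


def load_repo_from_env():
    """Load the repo file named by DATALOGISTIK_REPO (must be set here)."""
    return load_repo(os.environ["DATALOGISTIK_REPO"])
```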

A repository JSON file contains a list of entries, where each entry has the following properties:

A string to identify the dataset.
Location where this dataset can be downloaded (for now, http(s). Support for GCS may follow later).
File format (e.g. csv, parquet).

In addition, entries can have the following optional properties:

The character used as field delimiter (e.g. ",").
Dimensions ([rows, columns]).
File-level compression (e.g. gz for GZip) that must be decoded before an application can use the file. Some formats, like parquet, use internal compression, but that is not what is meant here.
The schema of the tabular data in the file. The structure of a schema is a JSON string with key:value pairs for each column. The key is the column name, and the value is either the name of an Arrow datatype without any parameters, or a dictionary with the following properties:
  - type_name: Name of an Arrow datatype.
  - arguments: Either a dictionary of argument_name:value items, a list of values, or a single value.
Example:
{
    "a": "string",
    "b": {"type_name": "timestamp", "arguments": {"unit": "ms"}},
    "c": {"type_name": "decimal", "arguments": [7, 3]}
}
Boolean denoting whether the first line of a CSV file contains the column names (default: false)
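
The three forms that a schema value's arguments can take are easiest to see when normalized into one shape. The sketch below does that with the standard library only; normalize_column_type is a hypothetical helper, and treating a missing arguments property as "no arguments" is an assumption.

```python
import json


def normalize_column_type(spec):
    """Return (type_name, positional_args, keyword_args) for one schema value."""
    if isinstance(spec, str):
        # Plain type name without parameters, e.g. "string".
        return spec, [], {}
    type_name = spec["type_name"]
    arguments = spec.get("arguments")  # assumption: may be absent
    if arguments is None:
        return type_name, [], {}
    if isinstance(arguments, dict):
        # argument_name:value items become keyword arguments.
        return type_name, [], dict(arguments)
    if isinstance(arguments, list):
        # A list of values becomes positional arguments.
        return type_name, list(arguments), {}
    # A single value becomes one positional argument.
    return type_name, [arguments], {}


schema_json = """{
    "a": "string",
    "b": {"type_name": "timestamp", "arguments": {"unit": "ms"}},
    "c": {"type_name": "decimal", "arguments": [7, 3]}
}"""
schema = {
    name: normalize_column_type(spec)
    for name, spec in json.loads(schema_json).items()
}
```

A consumer could then look up the Arrow type constructor by type_name and call it with the positional and keyword arguments.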


Upon success, a JSON string is output on stdout. It points to the dataset created in the cache. It contains the following properties:

String to identify the dataset.
File format (e.g. csv, parquet) - note that this may differ from the information in the repo, because datalogistik might have performed a format conversion.
(optional) In case of a TPC dataset, the scale factor.
The character used as field delimiter (e.g. ",").
Dimensions ([rows, columns]).
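
A script that calls datalogistik can capture this output and parse it. The sketch below builds the command from the options listed under datalogistik get and decodes stdout as JSON. The helper names are hypothetical, and the code assumes that datalogistik is on the PATH and that stdout contains nothing but the JSON string.

```python
import json
import subprocess


def build_get_command(dataset, file_format, scale_factor=None, compression=None):
    """Build the argument list for a `datalogistik get` call."""
    cmd = ["datalogistik", "get", "-d", dataset, "-f", file_format]
    if scale_factor is not None:
        cmd += ["-s", str(scale_factor)]
    if compression is not None:
        cmd += ["-c", compression]
    return cmd


def get_dataset(dataset, file_format, **options):
    """Run datalogistik and return the parsed JSON it prints on success."""
    result = subprocess.run(
        build_get_command(dataset, file_format, **options),
        check=True,  # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```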

The dataset itself contains a metadata file with the following additional properties:


Date and time when this dataset was downloaded or generated to the cache.
The location where this dataset was downloaded.
Location where more information about the origins of the dataset can be found.

A list of tables in the dataset, each with its own (set of) files. Each entry in the list has the following properties:

Name of the table.
Schema of the table.

Download URL for the table. This can be:
  • A URL specifying the file to be downloaded for that table (which could be a single file, or a directory that contains many files to be downloaded)
  • A base URL that is concatenated with the rel_url_path values in the files attribute, if the table is a multi-file table and it is preferable to list out the files

A list of files in this table. Each entry in the list has the following properties:

Path to the file(s), relative to the directory of this table. This is the location on disk in the cache.
URL path to the file(s), relative to the directory of this table where it is stored remotely. This is used only when downloading the file, and is only necessary when a multi-file table has the files that make up the table listed out individually.
Size of the file.
MD5 checksum of the file.
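
The size and checksum make it possible to check a cached file for integrity. One way to do so is sketched below with the standard library; the function names are hypothetical, and whether datalogistik validates files in exactly this way is an assumption.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_is_valid(path, expected_size, expected_md5):
    """Check that a file exists and matches the recorded size and checksum."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size != expected_size:
        # The cheap size check avoids hashing files that are clearly wrong.
        return False
    return md5_of_file(p) == expected_md5.lower()
```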

Filesystem permissions

By default, datalogistik sets the files in its cache to read-only. If this is not desired (e.g. when running datalogistik in CI, where cleaning up the cache afterwards is helpful), set the environment variable DATALOGISTIK_NO_PERMISSIONS_CHANGE to a true value.
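
The behaviour described here can be pictured with the following sketch. It is illustrative only: make_read_only is a hypothetical helper, and the set of strings accepted as a true value is an assumption, since it is not specified here.

```python
import os
import stat

# Assumption: which strings count as true is not specified in this README.
TRUE_VALUES = {"1", "true", "yes", "on"}


def permissions_change_disabled():
    """Return True if DATALOGISTIK_NO_PERMISSIONS_CHANGE is set to a true value."""
    value = os.environ.get("DATALOGISTIK_NO_PERMISSIONS_CHANGE", "")
    return value.strip().lower() in TRUE_VALUES


def make_read_only(path):
    """Remove all write bits from a file unless the opt-out variable is set.

    Returns True if the permissions were changed, False if the change was skipped.
    """
    if permissions_change_disabled():
        return False
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    return True
```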

License info

Copyright (c) 2022, Voltron Data.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.



