Merge branch 'main' into paul-blobs
yoid2000 authored Nov 6, 2024
2 parents 4490f41 + 7b47c9e commit 5422534
Showing 2 changed files with 20 additions and 2 deletions.
3 changes: 1 addition & 2 deletions README.md
@@ -173,8 +173,7 @@ A step-by-step description of the algorithm can be found [here](docs/algorithm.m

There is an API to the stitching function. It is primarily for testing and development purposes. A description can be found [here](docs/stitching-api.md).

-A paper describing the design of SynDiffix, its performance, and its anonymity properties can be found
-[here on ArXiv](https://arxiv.org/abs/2311.09628).
+A paper describing the design of SynDiffix, its performance, and its anonymity properties can be found [here on ArXiv](https://arxiv.org/abs/2311.09628).

A per-dimension range is internally called an interval (and handled by the `Interval` class), in order to avoid
potential name clashes with the native Python `range` API.
19 changes: 19 additions & 0 deletions docs/stitching-api.md
@@ -0,0 +1,19 @@
## Stitching API

When a table has too many columns to be handled well by a single cluster (tree), SynDiffix builds multiple clusters and stitches them together on one or more common columns.

The API for stitching is now exposed. This is primarily for testing and development purposes: the clustering decisions made by SynDiffix are generally good and do not need to be overridden.

The interface for the stitching API is:

```python
from syndiffix.stitcher import stitch

df_stitched = stitch(df_left=df_left, df_right=df_right, shared=False)
```

`df_left` and `df_right` are dataframes. They must have at least one column in common, and stitching takes place on the common columns. `df_stitched` contains the common columns as well as the non-common columns from both `df_left` and `df_right`. `df_left` and `df_right` do not need to have the same number of rows, but in practice they should not differ by more than a few rows. Otherwise, the quality of `df_stitched` will be poor, because many rows from `df_left` and `df_right` will be dropped or replicated.
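
The input requirements above can be checked before calling `stitch`. The following is a minimal sketch of such a pre-check, not part of SynDiffix: the helper name `check_stitch_inputs` and the `max_row_diff` threshold of 5 are assumptions made for illustration, since the text only says "a few rows".

```python
def check_stitch_inputs(left_cols, right_cols, n_left, n_right, max_row_diff=5):
    """Hypothetical pre-check for stitch() inputs (not a SynDiffix function).

    left_cols / right_cols: column names of the two tables.
    n_left / n_right: row counts of the two tables.
    max_row_diff: assumed tolerance for "a few rows"; tune as needed.
    """
    right_set = set(right_cols)
    # Common columns, kept in the order they appear in the left table.
    common = [col for col in left_cols if col in right_set]
    if not common:
        raise ValueError("df_left and df_right must share at least one column")

    row_diff = abs(n_left - n_right)
    return {
        "common": common,
        "row_diff": row_diff,
        # False signals that the stitched result is likely to be poor.
        "rows_ok": row_diff <= max_row_diff,
    }


report = check_stitch_inputs(
    ["age", "zip", "income"], ["zip", "age", "diagnosis"], n_left=1000, n_right=1003
)
print(report)
```

With pandas dataframes, the arguments would be `list(df_left.columns)`, `list(df_right.columns)`, `len(df_left)`, and `len(df_right)`.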

`shared` is `True` by default. If `shared` is `False`, then the common columns in `df_left` are preserved in `df_stitched`: they are not modified by the stitching procedure. If `shared` is `True`, then the common columns in both `df_left` and `df_right` may be modified.

Examples of stitching can be found at `tests/test_stitcher.py`.
