Merge branch 'main' into paul-blobs

diffix · Nov 6, 2024 · 5422534
2 parents 4490f41 + 7b47c9e

Showing 2 changed files with 20 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -173,8 +173,7 @@ A step-by-step description of the algorithm can be found [here](docs/algorithm.m
 
 There is an API to the stitching function. It is primarily for testing and development purposes. A description can be found [here](docs/stitching-api.md).
 
-A paper describing the design of SynDiffix, its performance, and its anonymity properties can be found
-[here on ArXiv](https://arxiv.org/abs/2311.09628).
+A paper describing the design of SynDiffix, its performance, and its anonymity properties can be found [here on ArXiv](https://arxiv.org/abs/2311.09628).
 
 A per-dimension range is internally called an interval (and handled by the `Interval` class), in order to avoid
 potential name clashes with the native Python `range` API.

diff --git a/docs/stitching-api.md b/docs/stitching-api.md
@@ -0,0 +1,19 @@
+## Stitching API
+
+When building tables that have too many columns to scale to a single cluster (tree), SynDiffix builds multiple clusters and stitches them together on one or more common columns.
+
+The API for stitching is now exposed. This is primarily for testing and development purposes: the clustering decisions made by SynDiffix are generally good and do not need to be overridden.
+
+The interface for the stitching API is:
+
+```python
+from syndiffix.stitcher import stitch
+
+df_stitched = stitch(df_left=df_left, df_right=df_right, shared=False)
+```
+
+`df_left` and `df_right` are dataframes. They must have at least one column in common. Stitching will take place on the common columns. `df_stitched` will contain the common columns as well as the non-common columns from both `df_left` and `df_right`. `df_left` and `df_right` do not need to have the same number of rows, but in practice they should not differ by more than a few rows. Otherwise, the quality of `df_stitched` will be poor (many dropped or replicated rows from `df_left` and `df_right`).
+
+`shared` is `True` by default. If `shared==False`, then the common columns in `df_left` will be preserved in `df_stitched`: they will not be modified by the stitching procedure. If `shared==True`, then the common columns in both `df_left` and `df_right` will be modified.
+
+Examples of stitching can be found in `tests/test_stitcher.py`.
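
The input requirements stated in `docs/stitching-api.md` (at least one common column, and row counts that differ by no more than a few rows) could be checked before calling `stitch`. The following is a minimal sketch in plain Python and is not part of SynDiffix or of this commit: `stitch_preflight`, its parameters, and the `max_row_diff=5` threshold are hypothetical, since the doc only says "a few rows". The column names in the usage line are made up for illustration.

```python
def stitch_preflight(left_columns, right_columns, n_left, n_right, max_row_diff=5):
    """Check the documented preconditions for stitching two tables.

    Hypothetical helper: takes column names and row counts rather than
    dataframes so that it runs without any third-party dependency.
    """
    left_set = set(left_columns)
    right_set = set(right_columns)

    # Columns present in both inputs; the doc says stitching happens on these.
    common = [c for c in left_columns if c in right_set]
    if not common:
        raise ValueError("df_left and df_right must have at least one column in common")

    # Non-common columns from each side; the doc says both sets appear in df_stitched.
    left_only = [c for c in left_columns if c not in right_set]
    right_only = [c for c in right_columns if c not in left_set]

    # max_row_diff is an assumed threshold, not a value defined by SynDiffix.
    row_diff = abs(n_left - n_right)
    return {
        "common": common,
        "left_only": left_only,
        "right_only": right_only,
        "row_diff": row_diff,
        "row_counts_close": row_diff <= max_row_diff,
    }


report = stitch_preflight(
    ["age", "zip", "income"], ["age", "zip", "diagnosis"], n_left=1000, n_right=1003
)
```

With pandas dataframes, one would pass `list(df_left.columns)`, `list(df_right.columns)`, `len(df_left)`, and `len(df_right)`. The column order of the real `df_stitched` is not specified in the doc, so the sketch reports the three groups separately instead of guessing an order.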