From d04c87caec8746321940348a0269af5d5766ff53 Mon Sep 17 00:00:00 2001
From: yoid2000
Date: Fri, 18 Oct 2024 12:32:37 +0200
Subject: [PATCH 1/2] Added stitching API description

---
 README.md             |  2 ++
 docs/stitching-api.md | 19 +++++++++++++++++++
 2 files changed, 21 insertions(+)
 create mode 100644 docs/stitching-api.md

diff --git a/README.md b/README.md
index 2a5cac4..e79ecf0 100644
--- a/README.md
+++ b/README.md
@@ -103,6 +103,8 @@ The [time-series notebook](docs/time-series.ipynb) gives examples of how to obta
 
 A step-by-step description of the algorithm can be found [here](docs/algorithm.md).
 
+There is an API to the stitching function. It is primarily for testing and development purposes. A description can be found [here](docs/stitching-api.md).
+
 A paper describing the design of **SynDiffix**, its performance, and its anonymity properties can be found [here on ArXiv](https://arxiv.org/abs/2311.09628).
 
 
diff --git a/docs/stitching-api.md b/docs/stitching-api.md
new file mode 100644
index 0000000..7748256
--- /dev/null
+++ b/docs/stitching-api.md
@@ -0,0 +1,19 @@
+## Stitching API
+
+When building tables that have too many columns to scale to a single cluster (tree), SynDiffix builds multiple clusters and stitches them together on one or more common columns.
+
+The API for stitching is now exposed. This is primarily for testing and development purposes: the clustering decisions made by SynDiffix are generally good and do not need to be over-ridden.
+
+The interface for the stitching API is:
+
+```python
+from syndiffix.stitcher import stitch
+
+df_stitched = stitch(df_left=df_left, df_right=df_right, shared=False)
+```
+
+`df_left` and `df_right` are dataframes. They must have at least one column in common. Stitching will take place on the common columns. `df_stitched` will contain the common columns as well as the non-common columns from both `df_left` and `df_right`. `df_left` and `df_right` do not need to have the same number of rows, but in practice they should not differ by more than a few rows. Otherwise, the quality of `df_stitched` will be poor (many dropped or replicated rows from `df_left` and `df_right`).
+
+`shared` is `True` be default. If `shared==False`, then the common columns in `df_left` will be preserved in `df_stitched`: they will not be modified by the stitching procedure. If `shared==True`, then the common columns in both `df_left` and `df_right` will be modified.
+
+Examples of stitching can be found at `tests/test_stitcher.py`.
\ No newline at end of file

From dc5eb36b6bad4625d48c0611fa1cdaae8ccc083b Mon Sep 17 00:00:00 2001
From: yoid2000
Date: Fri, 18 Oct 2024 12:35:39 +0200
Subject: [PATCH 2/2] Fixed typo in stitching-api.md

---
 docs/stitching-api.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/stitching-api.md b/docs/stitching-api.md
index 7748256..aad09ab 100644
--- a/docs/stitching-api.md
+++ b/docs/stitching-api.md
@@ -14,6 +14,6 @@ df_stitched = stitch(df_left=df_left, df_right=df_right, shared=False)
 
 `df_left` and `df_right` are dataframes. They must have at least one column in common. Stitching will take place on the common columns. `df_stitched` will contain the common columns as well as the non-common columns from both `df_left` and `df_right`. `df_left` and `df_right` do not need to have the same number of rows, but in practice they should not differ by more than a few rows. Otherwise, the quality of `df_stitched` will be poor (many dropped or replicated rows from `df_left` and `df_right`).
 
-`shared` is `True` be default. If `shared==False`, then the common columns in `df_left` will be preserved in `df_stitched`: they will not be modified by the stitching procedure. If `shared==True`, then the common columns in both `df_left` and `df_right` will be modified.
+`shared` is `True` by default. If `shared==False`, then the common columns in `df_left` will be preserved in `df_stitched`: they will not be modified by the stitching procedure. If `shared==True`, then the common columns in both `df_left` and `df_right` will be modified.
 
 Examples of stitching can be found at `tests/test_stitcher.py`.
\ No newline at end of file
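
Note on the patched docs (not part of either commit): the input requirements described in `docs/stitching-api.md` could be checked before calling `stitch`, roughly as sketched below. This is a minimal illustration, not SynDiffix code. `plan_stitch` and `max_row_diff` are hypothetical names, the threshold of 5 is an assumption (the docs only say "a few rows"), and the column order of the predicted result is arbitrary.

```python
def plan_stitch(left_columns, right_columns, n_left_rows, n_right_rows, max_row_diff=5):
    """Check the documented stitching preconditions and predict the output columns.

    Hypothetical helper: max_row_diff is an assumed stand-in for "a few rows".
    """
    # Stitching takes place on the columns present in both inputs.
    common = [c for c in left_columns if c in right_columns]
    if not common:
        raise ValueError("df_left and df_right must have at least one column in common")

    left_only = [c for c in left_columns if c not in common]
    right_only = [c for c in right_columns if c not in common]

    return {
        "common": common,
        # The docs say the result holds the common columns plus the
        # non-common columns of both inputs; the order here is illustrative.
        "stitched_columns": common + left_only + right_only,
        # Large row-count differences are documented to degrade quality.
        "row_counts_close": abs(n_left_rows - n_right_rows) <= max_row_diff,
    }


plan = plan_stitch(["age", "zip", "income"], ["age", "zip", "diagnosis"], 1000, 1003)
```

With real dataframes the call would presumably look like `plan_stitch(list(df_left.columns), list(df_right.columns), len(df_left), len(df_right))`, followed by `stitch(...)` only if `plan["row_counts_close"]` holds.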