Added uniqueness check, added column expression support for limits in not less / greater than checks, and updated docs #200

Merged
merged 40 commits into main from uniqueness
Mar 7, 2025
Commits
15a74b4
added uniqueness check
mwojtyczka Feb 28, 2025
15e58d2
Updated docs
mwojtyczka Feb 28, 2025
337a6ea
fmt
mwojtyczka Feb 28, 2025
afea0dc
updated link in the docs
mwojtyczka Feb 28, 2025
b05fec7
added unit tests
mwojtyczka Feb 28, 2025
f01ef58
updated comments
mwojtyczka Feb 28, 2025
ffd5330
optimized input params
mwojtyczka Mar 1, 2025
66bd25b
updated docs
mwojtyczka Mar 1, 2025
b8c081c
updated docs
mwojtyczka Mar 1, 2025
900c94a
updated logic
mwojtyczka Mar 1, 2025
7a4f7f8
update name of resulting check
mwojtyczka Mar 2, 2025
7437641
update name of resulting check
mwojtyczka Mar 2, 2025
27d6e97
fmt
mwojtyczka Mar 2, 2025
fd3c186
updated
mwojtyczka Mar 2, 2025
04182d1
updated tests
mwojtyczka Mar 2, 2025
6f13b12
updated tests
mwojtyczka Mar 2, 2025
e6cd2d9
updated error message
mwojtyczka Mar 2, 2025
b7b01dd
refactored checks
mwojtyczka Mar 2, 2025
7968069
fmt
mwojtyczka Mar 2, 2025
e6ede20
fmt
mwojtyczka Mar 2, 2025
257a233
updated tests
mwojtyczka Mar 2, 2025
393caf3
updated docs
mwojtyczka Mar 2, 2025
a79ff82
Merge branch 'main' into uniqueness
mwojtyczka Mar 3, 2025
b2c2c40
refactor
mwojtyczka Mar 3, 2025
3c19c5b
refactored documentation
mwojtyczka Mar 3, 2025
a5f89d2
fixed links
mwojtyczka Mar 3, 2025
fcb6bdf
updated docs
mwojtyczka Mar 3, 2025
ba69849
updated docs
mwojtyczka Mar 5, 2025
d745003
updated doc
mwojtyczka Mar 5, 2025
8ad4117
updated doc
mwojtyczka Mar 6, 2025
114473a
added checks validation
mwojtyczka Mar 6, 2025
bd5eea4
refactor docs
mwojtyczka Mar 6, 2025
20d758a
refactor docs
mwojtyczka Mar 6, 2025
3a644df
implemented code review feedback
mwojtyczka Mar 7, 2025
3e48b32
allow to define window spec for uniqueness check
mwojtyczka Mar 7, 2025
f4fa8e7
updated test
mwojtyczka Mar 7, 2025
da68c58
updated test
mwojtyczka Mar 7, 2025
e5d1489
updated test
mwojtyczka Mar 7, 2025
b68771b
updated test
mwojtyczka Mar 7, 2025
9210bba
updated tests
mwojtyczka Mar 7, 2025
6 changes: 3 additions & 3 deletions README.md
@@ -17,17 +17,17 @@ Simplified Data Quality checking at Scale for PySpark Workloads on streaming and

# Documentation

The full documentation is available at: [https://databrickslabs.github.io/dqx/](https://databrickslabs.github.io/dqx/)
The complete documentation is available at: [https://databrickslabs.github.io/dqx/](https://databrickslabs.github.io/dqx/)

# Contribution

See contribution guidance [here](https://databrickslabs.github.io/dqx/docs/dev/contributing/) on how to contribute to the project (build, test, and submit a PR).
Please see the contribution guidance [here](https://databrickslabs.github.io/dqx/docs/dev/contributing/) on how to contribute to the project (build, test, and submit a PR).

# Project Support

Please note that this project is provided for your exploration only and is not
formally supported by Databricks with Service Level Agreements (SLAs). It is
provided AS-IS, and we do not make any guarantees of any kind. Please do not
provided AS-IS, and we do not make any guarantees. Please do not
submit a support ticket relating to any issues arising from the use of this project.

Any issues discovered through the use of this project should be filed as GitHub
16 changes: 9 additions & 7 deletions demos/dqx_demo_library.py
@@ -41,7 +41,8 @@
print(yaml.safe_dump(summary_stats))
print(profiles)

# generate DQX quality rules/checks
# generate candidate DQX quality rules/checks
# they should be manually reviewed before being applied to the data
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles) # with default level "error"
print(yaml.safe_dump(checks))
@@ -152,7 +153,7 @@

- criticality: error
check:
function: value_is_in_list
function: is_in_list
arguments:
col_name: col1
allowed:
@@ -185,7 +186,7 @@

# COMMAND ----------

from databricks.labs.dqx.col_functions import is_not_null, is_not_null_and_not_empty, value_is_in_list
from databricks.labs.dqx.col_functions import is_not_null, is_not_null_and_not_empty, is_in_list
from databricks.labs.dqx.engine import DQEngine, DQRule, DQRuleColSet
from databricks.sdk import WorkspaceClient

@@ -201,7 +202,7 @@
check=is_not_null_and_not_empty("col4")),
DQRule( # name for the check auto-generated if not provided
criticality="error",
check=value_is_in_list("col1", ["1", "2"]))
check=is_in_list("col1", ["1", "2"]))
] + DQRuleColSet( # define rule for multiple columns at once
columns=["col1", "col2"],
criticality="error",
@@ -254,7 +255,7 @@
- dropoff_latitude
criticality: warn
- check:
function: not_less_than
function: is_not_less_than
arguments:
col_name: trip_distance
limit: 1
@@ -267,7 +268,7 @@
name: pickup_datetime_greater_than_dropoff_datetime
criticality: error
- check:
function: not_in_future
function: is_not_in_future
arguments:
col_name: pickup_datetime
name: pickup_datetime_not_in_future
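Per the PR title, limits in the not-less-than / not-greater-than checks can now be column expressions rather than only literals. A minimal sketch of both forms — the `is_not_greater_than` spelling and the expression-valued `limit` are assumptions inferred from the PR title and the renames above, not confirmed signatures:

```python
import yaml

# Hedged sketch: a literal limit vs. a hypothetical column-based limit.
checks = yaml.safe_load("""
- criticality: warn
  check:
    function: is_not_less_than
    arguments:
      col_name: trip_distance
      limit: 1                  # literal limit, as in the demo above
- criticality: warn
  check:
    function: is_not_greater_than
    arguments:
      col_name: pickup_datetime
      limit: dropoff_datetime   # assumed column-expression limit
""")
```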
@@ -357,7 +358,8 @@ def ends_with_foo(col_name: str) -> Column:
dq_engine = DQEngine(WorkspaceClient())

custom_check_functions = {"ends_with_foo": ends_with_foo}
#custom_check_functions=globals() # include all functions for simplicity
# or include all functions with globals() for simplicity
#custom_check_functions=globals()

valid_and_quarantined_df = dq_engine.apply_checks_by_metadata(input_df, checks, custom_check_functions)
display(valid_and_quarantined_df)
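A plausible implementation of a custom check like `ends_with_foo`, assuming the `make_condition` helper that DQX column functions are typically built on (the helper name and argument order are assumptions):

```python
import pyspark.sql.functions as F
from pyspark.sql import Column
from databricks.labs.dqx.col_functions import make_condition  # assumed helper

def ends_with_foo(col_name: str) -> Column:
    # Flag rows whose value ends with "foo".
    column = F.col(col_name)
    return make_condition(
        column.endswith("foo"),              # condition marking a row as failed
        f"Column {col_name} ends with foo",  # message recorded for failed rows
        f"{col_name}_ends_with_foo",         # alias of the resulting check column
    )
```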
10 changes: 5 additions & 5 deletions demos/dqx_demo_tool.py
@@ -65,7 +65,7 @@
# MAGIC
# MAGIC You can also start the profiler by navigating to the Databricks Workflows UI.
# MAGIC
# MAGIC Note that using the profiler is optional. It is usually one-time operation and not a scheduled activity.
# MAGIC Note that using the profiler is optional. It is usually a one-time operation and not a scheduled activity. The generated check candidates should be manually reviewed before being applied to the data.

# COMMAND ----------

@@ -135,7 +135,7 @@
- dropoff_latitude
criticality: error
- check:
function: not_less_than
function: is_not_less_than
arguments:
col_name: trip_distance
limit: 1
@@ -148,7 +148,7 @@
name: pickup_datetime_greater_than_dropoff_datetime
criticality: error
- check:
function: not_in_future
function: is_not_in_future
arguments:
col_name: pickup_datetime
name: pickup_datetime_not_in_future
@@ -206,7 +206,7 @@
# MAGIC %md
# MAGIC ### Save quarantined data to Unity Catalog table
# MAGIC
# MAGIC Note: In this demo, we only save the quarantined data and omit the output. This is because the dashboards use only quarantined data as their input. Therefore, saving the output data is unnecessary in this demo. If you apply checks to flag invalid records without quarantining them (e.g. using the apply check methods without the split), ensure that the `quarantine_table` field in your run config is set to the same value as the `output_table` field.
# MAGIC Note: In this demo, we only save the quarantined data and omit the output. This is because the dashboard uses only quarantined data as its input. Therefore, saving the output data is unnecessary in this demo. If you apply checks to flag invalid records without quarantining them (e.g. using the apply check methods without the split), ensure that the `quarantine_table` field in your run config is set to the same value as the `output_table` field.
# MAGIC

# COMMAND ----------
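A minimal sketch of the two modes described in the note above, assuming `input_df` and `checks` as defined earlier in the demo; the `apply_checks_by_metadata_and_split` name mirrors the metadata methods used elsewhere in these demos and is an assumption:

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# Quarantine mode: invalid rows are split into a separate DataFrame.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)

# Flag-only mode: a single DataFrame with result columns added; here the
# run config's quarantine_table should equal output_table, as noted above.
flagged_df = dq_engine.apply_checks_by_metadata(input_df, checks)
```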
@@ -222,7 +222,7 @@
# COMMAND ----------

# MAGIC %md
# MAGIC ### View data quality in DQX Dashboards
# MAGIC ### View data quality in DQX Dashboard

# COMMAND ----------

6 changes: 3 additions & 3 deletions demos/dqx_dlt_demo.py
@@ -64,21 +64,21 @@ def bronze():
criticality: "error"
- check:
function: "not_in_future"
function: "is_not_in_future"
arguments:
col_name: "pickup_datetime"
name: "pickup_datetime_isnt_in_range"
criticality: "warn"
- check:
function: "not_in_future"
function: "is_not_in_future"
arguments:
col_name: "pickup_datetime"
name: "pickup_datetime_not_in_future"
criticality: "warn"
- check:
function: "not_in_future"
function: "is_not_in_future"
arguments:
col_name: "dropoff_datetime"
name: "dropoff_datetime_not_in_future"
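Beyond the renames shown above, the headline addition of this PR is the uniqueness check itself (commits 15a74b4 and 3e48b32). A minimal sketch of how it might be declared in the same metadata style — the `is_unique` name, the `col_names` argument, and the `window_spec` option are assumptions inferred from the commit messages:

```python
import yaml

# Hypothetical uniqueness check; the names below are assumed from the
# commit messages, not confirmed signatures.
checks = yaml.safe_load("""
- criticality: "error"
  check:
    function: "is_unique"
    arguments:
      col_names:
        - "vendor_id"
        - "pickup_datetime"
      window_spec: "window(pickup_datetime, '30 minutes')"  # optional, per commit 3e48b32
""")
```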
11 changes: 8 additions & 3 deletions docs/dqx/docs/demos.mdx
@@ -2,12 +2,17 @@
sidebar_position: 4
---

import Admonition from '@theme/Admonition';

# Demos

Install the [installation](/docs/installation) framework, and import the following notebooks in the Databricks workspace to try it out:
Import the following notebooks in the Databricks workspace to try DQX out:
* [DQX Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
* [DQX Demo Notebook (tool)](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
* [DQX DLT Demo Notebook](https://github.com/databrickslabs/dqx/blob/main/demos/dqx_dlt_demo.py) - demonstrates how to use DQX with Delta Live Tables (DLT).

Note that DQX don't have to be run from a Notebook. You can run it from any Python script as long as it runs on Databricks.
For example, you can add DQX as a library to your job or cluster.
<Admonition type="tip" title="Execution Environment">
You don't have to run DQX from a Notebook. DQX can be run from any Python script as long as it runs on Databricks.
For example, you can run it from a Databricks job by adding DQX as a dependent library.
DQX also comes with a set of command-line tools for running DQX jobs (see the [User Guide](/docs/guide)).
</Admonition>
42 changes: 24 additions & 18 deletions docs/dqx/docs/dev/contributing.mdx
@@ -1,3 +1,5 @@
import Admonition from '@theme/Admonition';

# Contributing

## First Principles
@@ -7,7 +9,7 @@ development.

There are several reasons why this approach is encouraged:
- Standard libraries are typically well-vetted, thoroughly tested, and maintained by the official maintainers of the programming language or platform. This ensures a higher level of stability and reliability.
- External dependencies, especially lesser-known or unmaintained ones, can introduce bugs, security vulnerabilities, or compatibility issues that can be challenging to resolve. Adding external dependencies increases the complexity of your codebase.
- External dependencies, especially lesser-known or unmaintained ones, can introduce bugs, security vulnerabilities, or compatibility issues that can be challenging to resolve. Adding external dependencies increases the complexity of your codebase.
- Each dependency may have its own set of dependencies, potentially leading to a complex web of dependencies that can be difficult to manage. This complexity can lead to maintenance challenges, increased risk, and longer build times.
- External dependencies can pose security risks. If a library or package has known security vulnerabilities and is widely used, it becomes an attractive target for attackers. Minimizing external dependencies reduces the potential attack surface and makes it easier to keep your code secure.
- Relying on standard libraries enhances code portability. It ensures your code can run on different platforms and environments without being tightly coupled to specific external dependencies. This is particularly important in settings like Databricks, where you may need to run your code on different clusters or setups.
@@ -21,26 +23,26 @@ or specialized functionality unavailable in standard libraries.

## First contribution

If you're interested in contributing, please create a PR, reach out to us or open an issue to discuss your ideas.
If you're interested in contributing, please create a PR, contact us, or open an issue to discuss your ideas.

Here are the example steps to submit your first contribution:

1. Fork the repo. You can also create a branch if you are added as writer to the repo.
2. The locally: `git clone`
1. Fork the [DQX](https://github.com/databrickslabs/dqx) repo. You can also create a branch if you are added as a writer to the repo.
2. Clone the repo locally: `git clone`
3. `git checkout main` (or `gcm` if you're using [ohmyzsh](https://ohmyz.sh/)).
4. `git pull` (or `gl` if you're using [ohmyzsh](https://ohmyz.sh/)).
5. `git checkout -b FEATURENAME` (or `gcb FEATURENAME` if you're using [ohmyzsh](https://ohmyz.sh/)).
6. .. do the work
7. `make fmt`
8. `make lint`
9. .. fix if any issues reported
9. .. fix if any issues are reported
10. `make test` and `make integration`, and optionally `make coverage` (generate coverage report)
11. .. fix if any issues are reported
12. `git commit -S -a -m "message"`

Make sure to enter a meaningful commit message title.
You need to sign commits with your GPG key (hence the -S option).
To setup GPG key in your Github account follow [these instructions](https://docs.github.com/en/github/authenticating-to-github/managing-commit-signature-verification).
To set up a GPG key in your GitHub account, follow [these instructions](https://docs.github.com/en/github/authenticating-to-github/managing-commit-signature-verification).
You can configure Git to sign all commits with your GPG key by default: `git config --global commit.gpgsign true`

If you have not signed your commits initially, you can re-apply all of them and sign as follows:
@@ -51,15 +53,15 @@ Here are the example steps to submit your first contribution:
```
13. `git push origin FEATURENAME`

To access the repository, you must use the HTTPS remote with a personal access token or SSH with an SSH key and passphrase that has been authorized for `databrickslabs` organization.
To access the repository, you must use the HTTPS remote with a personal access token or SSH with an SSH key and passphrase that has been authorized for the `databrickslabs` organization.
14. Go to GitHub UI and create PR. Alternatively, `gh pr create` (if you have [GitHub CLI](https://cli.github.com/) installed).
Use a meaningful pull request title because it'll appear in the release notes. Use `Resolves #NUMBER` in pull
request description to [automatically link it](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue)
to an existing issue.

## Local Setup

This section provides a step-by-step guide to set up and start working on the project. These steps will help you set up your project environment and dependencies for efficient development.
This section provides a step-by-step guide for setting up and starting work on the project. These steps will help you set up your project environment and dependencies for efficient development.

{/* Go through the [prerequisites](./README.md#prerequisites) and clone the [dqx github repo](https://github.com/databrickslabs/dqx). */}

@@ -81,7 +83,7 @@ make fmt
```

Before every commit, run automated bug detector and unit tests to ensure that automated
pull request checks do pass, before your code is reviewed by others:
pull request checks do pass before your code is reviewed by others:
```shell
make lint
make test
@@ -91,7 +93,7 @@ make test

Integration tests and code coverage are run automatically when you create a Pull Request in GitHub.
You can also trigger the tests from a local machine by configuring authentication to a Databricks workspace.
You can use any Unity Catalog enabled Databricks workspace.
You can use any Unity Catalog-enabled Databricks workspace.

#### Using terminal

@@ -117,13 +119,13 @@ Run integration tests with the following command:
make integration
```

Calculate test coverage and display report in html:
Calculate test coverage and display report in HTML:
```shell
make coverage
```
#### Using IDE

If you want to run integration tests from your IDE, you must setup `.env` or `~/.databricks/debug-env.json` file
If you want to run integration tests from your IDE, you must set up the `.env` or `~/.databricks/debug-env.json` file
(see [instructions](https://github.com/databrickslabs/pytester?tab=readme-ov-file#debug_env_name-fixture)).
The name of the debug environment that you must define is `ws` (see `debug_env_name` fixture in the `conftest.py`).

@@ -140,7 +142,7 @@ Create the `~/.databricks/debug-env.json` with the following content, replacing
}
}
```
You must provide an existing cluster which will be auto-started for you as part of the tests.
You must provide an existing cluster that will auto-start for you as part of the tests.

We recommend using [OAuth access token](https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html) generated for a service principal to authenticate with Databricks as presented above.
Alternatively, you can authenticate using [PAT token](https://docs.databricks.com/en/dev-tools/auth/pat.html) by providing the `DATABRICKS_TOKEN` field. However, we do not recommend this method, as it is less secure than OAuth.
@@ -160,11 +162,11 @@ To run integration tests on serverless compute, add the `DATABRICKS_SERVERLESS_C
}
}
```
When `DATABRICKS_SERVERLESS_COMPUTE_ID` is set the `DATABRICKS_CLUSTER_ID` is ignored, and tests run on serverless compute.
When `DATABRICKS_SERVERLESS_COMPUTE_ID` is set, the `DATABRICKS_CLUSTER_ID` is ignored, and tests run on serverless compute.
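For orientation, a hypothetical complete `~/.databricks/debug-env.json` combining the fields referenced above; the `DATABRICKS_HOST`, `DATABRICKS_CLIENT_ID`, and `DATABRICKS_CLIENT_SECRET` names are assumptions for the OAuth service-principal setup, and all values are placeholders:

```json
{
  "ws": {
    "DATABRICKS_HOST": "https://<workspace-host>",
    "DATABRICKS_CLIENT_ID": "<service-principal-client-id>",
    "DATABRICKS_CLIENT_SECRET": "<service-principal-secret>",
    "DATABRICKS_CLUSTER_ID": "<existing-cluster-id>"
  }
}
```

For serverless runs, add the `DATABRICKS_SERVERLESS_COMPUTE_ID` field as shown above; the cluster ID is then ignored.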

## Manual testing of the framework

We require that all changes be covered by unit tests and integration tests. A pull request (PR) will be blocked if the code coverage is negatively impacted by the proposed change.
We require that all changes be covered by unit tests and integration tests. A pull request (PR) will be blocked if the proposed change negatively impacts the code coverage.
However, manual testing may still be useful before creating or merging a PR.

To test DQX from your feature branch, you can install it directly as follows:
@@ -177,7 +179,7 @@ Replace `feature_branch_name` with the name of your branch.
## Manual testing of the CLI commands from the current codebase

Once you clone the repo locally and install the Databricks CLI, you can run labs CLI commands from the root of the repository.
Similar to other databricks cli commands we can specify Databricks profile to use with `--profile`.
Similar to other Databricks CLI commands, we can specify the Databricks profile to use with `--profile`.

Build the project:
```commandline
@@ -216,7 +218,9 @@ In most cases, installing DQX directly from the current codebase is sufficient t
When DQX is installed from a released version, it creates a fresh and isolated Python virtual environment locally and installs all the required packages, ensuring a clean setup.
If you need to perform end-to-end testing of the CLI before an official release, follow the process outlined below.

Note: This is only available for GitHub accounts that have write access to the repository. If you contribute from a fork this method is not available.
<Admonition type="warning" title="Usage tips">
This method is only available for GitHub accounts with write access to the repository. It is not available if you contribute from a fork.
</Admonition>

```commandline
# create new tag
Expand All @@ -229,8 +233,10 @@ git push origin v0.1.12-alpha
databricks labs install [email protected]
```

<Admonition type="tip" title="Release">
The release pipeline only triggers when a valid semantic version is provided (e.g. v0.1.12).
Pre-release versions (e.g. v0.1.12-alpha) do not trigger the release pipeline, allowing you to test changes safely before making an official release.
</Admonition>

## Troubleshooting

@@ -240,7 +246,7 @@ If you encounter any package dependency errors after `git pull`, run `make clean

See https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html for more details

**..., expression has type "None", variable has type "str"**

* Add `assert ... is not None` if it's a body of a method. Example:
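For instance, a minimal sketch with a hypothetical class:

```python
class Greeter:
    def __init__(self, name: str | None = None) -> None:
        self.name: str | None = name

    def greeting(self) -> str:
        # Narrows `str | None` to `str`; without the assert, mypy reports
        # the incompatible-type error quoted above.
        assert self.name is not None
        return "Hello, " + self.name
```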

7 changes: 3 additions & 4 deletions docs/dqx/docs/dev/docs_authoring.mdx
@@ -4,12 +4,11 @@ import Admonition from '@theme/Admonition';

This document provides guidelines for writing documentation for the DQX project.


## Tech Stack

The DQX documentation is built using [Docusaurus](https://docusaurus.io/), a modern static site generator.

Docusaurus is a project of Facebook Open Source and is used by many open-source projects to build their documentation websites.
Docusaurus is a Facebook open source project used by many open source projects to build their documentation websites.

We also use [MDX](https://mdxjs.com/) to write markdown files that include JSX components. This allows us to write markdown files with embedded React components.

@@ -56,7 +55,7 @@
## Checking search functionality

<Admonition type="tip" title="Tip" emoji="💡">
We're using local search, and it won't be available in the development server.
We are using local search, which won't be available in the development server.
</Admonition>

To check the search functionality, run the following command:
@@ -129,7 +128,7 @@ The rule of thumb is:
Do not put any technical details in the main documentation.
</div>
<div className="text-lg font-mono">
All technical details should be kept in <code>/docs/dev/</code> section.
All technical details should be kept in the <code>/docs/dev/</code> section.
</div>
</div>
</div>
2 changes: 1 addition & 1 deletion docs/dqx/docs/dev/index.mdx
@@ -4,4 +4,4 @@ sidebar_position: 7

# Contributing to DQX

This section is for contributors to the DQX project. It contains information on how to contribute, including how to submit issues, pull requests, and how to contribute to the documentation.
This section is for contributors to the DQX project. It contains information on how to contribute, including submitting issues, opening pull requests, and contributing to the documentation.