
Added uniqueness check, added column expression support for limits in not less / greater than checks, and updated docs #200

Merged
merged 40 commits into main from uniqueness
Mar 7, 2025

Conversation

mwojtyczka
Contributor

@mwojtyczka mwojtyczka commented Feb 28, 2025

Changes

  • Added uniqueness check to verify that values in a column are unique, reporting an issue for each row that contains a duplicate value. A custom window spec can be specified (see the sketch below this list).
  • Renamed rule functions to unify the naming conventions across all checks.
  • Extended is_not_less_than and is_not_greater_than to accept a column name or column expression as the limit (also illustrated in the sketch below).
  • Unified input parameters to use a single field each for the min and max limits in the is_in_range and is_not_in_range checks.
  • Updated the logic of is_not_in_range to be inclusive of the boundaries, for consistency with the is_in_range check.
  • Updated the quality checks API descriptions.
  • Improved documentation and provided comprehensive examples of checks.
  • Added info on using a private PyPI package and installing the latest Databricks CLI to avoid installation issues.

This change unifies the naming convention across all checks and introduces a breaking change!
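
To make the uniqueness and limit changes concrete, here is a minimal, conceptual PySpark sketch of the behaviors described above. It is not the project's actual implementation: the column names, issue-message texts, and example limits are made up for illustration.

```python
# Conceptual sketch only, in plain PySpark -- not the project's actual
# implementation. Illustrates window-based duplicate flagging, a limit
# given as a column expression, and inclusive range boundaries.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 100), (1, 250), (2, 300)], ["customer_id", "amount"]
)

# Uniqueness: count each value's occurrences over a window spec and
# report an issue for every row whose value appears more than once.
w = Window.partitionBy("customer_id")
df = df.withColumn(
    "customer_id_uniqueness_issue",
    F.when(F.count("customer_id").over(w) > 1,
           F.lit("value in customer_id is not unique")),
)

# is_not_greater_than-style check with a column expression as the
# limit: the limit is evaluated per row instead of being a constant.
limit = F.expr("customer_id * 200")
df = df.withColumn(
    "amount_limit_issue",
    F.when(F.col("amount") > limit,
           F.lit("value in amount is greater than limit")),
)

# is_not_in_range-style check with inclusive boundaries: an issue is
# reported when min_limit <= value <= max_limit.
min_limit, max_limit = 200, 400
df = df.withColumn(
    "amount_range_issue",
    F.when(F.col("amount").between(min_limit, max_limit),
           F.lit("value in amount falls within the forbidden range")),
)
df.show(truncate=False)
```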

Linked issues

Resolves #154 #131 #197 #175 #205

Tests

  • manually tested
  • added unit tests
  • added integration tests

extended not_less_than and not_greater_than checks to use limit col expr
Updated check descriptions
Updated docs

github-actions bot commented Feb 28, 2025

✅ 134/134 passed, 1 skipped, 59m17s total

Running from acceptance #552

refactored code
@mwojtyczka mwojtyczka changed the title Uniqueness check and column expression for not less and not greater than checks Added uniqueness check and added column expression support for limits in not less / greater than checks Feb 28, 2025
@mwojtyczka mwojtyczka linked an issue Mar 6, 2025 that may be closed by this pull request
1 task
@mwojtyczka mwojtyczka changed the title Added uniqueness check and added column expression support for limits in not less / greater than checks Added uniqueness check, added column expression support for limits in not less / greater than checks, and update docs Mar 6, 2025
@mwojtyczka mwojtyczka changed the title Added uniqueness check, added column expression support for limits in not less / greater than checks, and update docs Added uniqueness check, added column expression support for limits in not less / greater than checks, and updated docs Mar 6, 2025
Contributor

@alexott alexott left a comment


in general it's good, just a few comments

@mwojtyczka mwojtyczka merged commit d2df543 into main Mar 7, 2025
9 checks passed
@mwojtyczka mwojtyczka deleted the uniqueness branch March 7, 2025 20:54
mwojtyczka added a commit that referenced this pull request Mar 10, 2025
* Added uniqueness check ([#200](#200)). A uniqueness check has been added, which reports an issue for each row containing a duplicate value in a specified column. This resolves issue [154](#154).
* Added column expression support for limits in not less and not greater than checks, and updated docs ([#200](#200)). This commit introduces several changes to simplify and enhance data quality checking in PySpark workloads for both streaming and batch data. The naming conventions of rule functions have been unified, and the `is_not_less_than` and `is_not_greater_than` functions now accept column names or expressions as limits. The input parameters for range checks have been unified, and the logic of `is_not_in_range` has been updated to be inclusive of the boundaries. The project's documentation has been improved, with the addition of comprehensive examples, and the contribution guidelines have been clarified. This change includes a breaking change for some of the checks. Users are advised to review and test the changes before implementation to ensure compatibility and avoid any disruptions. Resolves issues: [131](#131), [197](#197), [175](#175), [205](#205)
* Include predefined check functions by default when applying custom checks by metadata ([#203](#203)). The data quality engine has been updated to include predefined check functions by default when applying custom checks using metadata in the form of YAML or JSON. This change simplifies the process of defining custom checks, as users no longer need to manually import predefined functions, which were previously required and could be cumbersome. The default behavior now is to import all predefined checks. The `validate_checks` method has been updated to accept a dictionary of custom check functions instead of global variables. This improvement resolves issue [#48](#48).
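
To illustrate the metadata-driven flow described above, here is a small, self-contained sketch. It is conceptual only: the metadata structure and the helper function mirror the description in this changelog but are not taken from the project's actual API.

```python
# Conceptual sketch (not the project's actual API): resolving check
# function names from metadata. Predefined checks are available by
# default; custom checks are supplied as a dictionary, as described
# above for validate_checks.
PREDEFINED_CHECKS = {
    "is_not_null": lambda col_name: f"{col_name} IS NOT NULL",
}

def resolve_check_functions(checks, custom_check_functions=None):
    """Map each metadata entry to a callable, failing on unknown names."""
    available = {**PREDEFINED_CHECKS, **(custom_check_functions or {})}
    resolved = []
    for check in checks:
        name = check["check"]["function"]
        if name not in available:
            raise ValueError(f"unknown check function: {name}")
        resolved.append(available[name])
    return resolved

# Checks as they might be parsed from YAML/JSON metadata (structure
# assumed for illustration).
checks = [
    {"criticality": "error",
     "check": {"function": "is_not_null",
               "arguments": {"col_name": "customer_id"}}},
    {"criticality": "warn",
     "check": {"function": "my_custom_check",
               "arguments": {"col_name": "amount"}}},
]

# is_not_null resolves without any manual import; my_custom_check is
# found via the explicitly passed dictionary.
resolved = resolve_check_functions(
    checks,
    custom_check_functions={"my_custom_check": lambda col_name: None},
)
```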
@mwojtyczka mwojtyczka mentioned this pull request Mar 10, 2025
mwojtyczka added a commit that referenced this pull request Mar 10, 2025