
Added uniqueness check, added column expression support for limits in not less / greater than checks, and updated docs #200

Merged
merged 40 commits into main from uniqueness
Mar 7, 2025

Conversation

mwojtyczka
Contributor

@mwojtyczka mwojtyczka commented Feb 28, 2025

Changes

  • Added uniqueness check to verify that values in a column are unique, reporting an issue for each row that contains a duplicate value. A custom window spec can be specified (see the sketch below this list).
  • Renamed rule functions to unify the naming conventions across all checks.
  • Extended is_not_less_than and is_not_greater_than to accept a column name or column expression as the limit (also illustrated in the sketch below).
  • Unified input parameters to use a single field each for the min and max limits in the is_in_range and is_not_in_range checks.
  • Updated the logic of is_not_in_range to be inclusive of the boundaries, for consistency with the is_in_range check.
  • Updated the quality checks API descriptions.
  • Improved documentation and provided comprehensive examples of checks.
  • Added info on using a private PyPI package and installing the latest Databricks CLI to avoid installation issues.

This change unifies the naming convention across all checks and introduces a breaking change!
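
To make the uniqueness and limit changes concrete, here is a minimal, conceptual PySpark sketch of the behaviors described above. It is not the project's actual implementation: the column names, issue-message texts, and example limits are made up for illustration.

```python
# Conceptual sketch only, in plain PySpark -- not the project's actual
# implementation. Illustrates window-based duplicate flagging, a limit
# given as a column expression, and inclusive range boundaries.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 100), (1, 250), (2, 300)], ["customer_id", "amount"]
)

# Uniqueness: count each value's occurrences over a window spec and
# report an issue for every row whose value appears more than once.
w = Window.partitionBy("customer_id")
df = df.withColumn(
    "customer_id_uniqueness_issue",
    F.when(F.count("customer_id").over(w) > 1,
           F.lit("value in customer_id is not unique")),
)

# is_not_greater_than-style check with a column expression as the
# limit: the limit is evaluated per row instead of being a constant.
limit = F.expr("customer_id * 200")
df = df.withColumn(
    "amount_limit_issue",
    F.when(F.col("amount") > limit,
           F.lit("value in amount is greater than limit")),
)

# is_not_in_range-style check with inclusive boundaries: an issue is
# reported when min_limit <= value <= max_limit.
min_limit, max_limit = 200, 400
df = df.withColumn(
    "amount_range_issue",
    F.when(F.col("amount").between(min_limit, max_limit),
           F.lit("value in amount falls within the forbidden range")),
)
df.show(truncate=False)
```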

Linked issues

Resolves #154 #131 #197 #175 #205

Tests

  • manually tested
  • added unit tests
  • added integration tests

extended not_less_than and not_greater_than checks to use limit col expr
Updated check descriptions
Updated docs

github-actions bot commented Feb 28, 2025

✅ 134/134 passed, 1 skipped, 59m17s total

Running from acceptance #552

refactored code
@mwojtyczka mwojtyczka changed the title Uniqueness check and column expression for not less and not greater than checks Added uniqueness check and added column expression support for limits in not less / greater than checks Feb 28, 2025
@mwojtyczka mwojtyczka linked an issue Mar 6, 2025 that may be closed by this pull request
1 task
@mwojtyczka mwojtyczka changed the title Added uniqueness check and added column expression support for limits in not less / greater than checks Added uniqueness check, added column expression support for limits in not less / greater than checks, and update docs Mar 6, 2025
@mwojtyczka mwojtyczka changed the title Added uniqueness check, added column expression support for limits in not less / greater than checks, and update docs Added uniqueness check, added column expression support for limits in not less / greater than checks, and updated docs Mar 6, 2025
Contributor

@alexott alexott left a comment


in general it's good, just a few comments

@mwojtyczka mwojtyczka merged commit d2df543 into main Mar 7, 2025
9 checks passed
@mwojtyczka mwojtyczka deleted the uniqueness branch March 7, 2025 20:54
mwojtyczka added a commit that referenced this pull request Mar 10, 2025
* Added uniqueness check ([#200](#200)). A uniqueness check has been added, which reports an issue for each row containing a duplicate value in a specified column. This resolves issue [154](#154).
* Added column expression support for limits in not less and not greater than checks, and updated docs ([#200](#200)). This commit introduces several changes to simplify and enhance data quality checking in PySpark workloads for both streaming and batch data. The naming conventions of rule functions have been unified, and the `is_not_less_than` and `is_not_greater_than` functions now accept column names or expressions as limits. The input parameters for range checks have been unified, and the logic of `is_not_in_range` has been updated to be inclusive of the boundaries. The project's documentation has been improved, with the addition of comprehensive examples, and the contribution guidelines have been clarified. This change includes a breaking change for some of the checks. Users are advised to review and test the changes before implementation to ensure compatibility and avoid any disruptions. Resolves issues: [131](#131), [197](#197), [175](#175), [205](#205)
* Include predefined check functions by default when applying custom checks by metadata ([#203](#203)). The data quality engine has been updated to include predefined check functions by default when applying custom checks using metadata in the form of YAML or JSON. This change simplifies the process of defining custom checks, as users no longer need to manually import predefined functions, which were previously required and could be cumbersome. The default behavior now is to import all predefined checks. The `validate_checks` method has been updated to accept a dictionary of custom check functions instead of global variables. This improvement resolves issue [#48](#48).
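
To illustrate the metadata-driven flow described above, here is a small, self-contained sketch. It is conceptual only: the metadata structure and the helper function mirror the description in this changelog but are not taken from the project's actual API.

```python
# Conceptual sketch (not the project's actual API): resolving check
# function names from metadata. Predefined checks are available by
# default; custom checks are supplied as a dictionary, as described
# above for validate_checks.
PREDEFINED_CHECKS = {
    "is_not_null": lambda col_name: f"{col_name} IS NOT NULL",
}

def resolve_check_functions(checks, custom_check_functions=None):
    """Map each metadata entry to a callable, failing on unknown names."""
    available = {**PREDEFINED_CHECKS, **(custom_check_functions or {})}
    resolved = []
    for check in checks:
        name = check["check"]["function"]
        if name not in available:
            raise ValueError(f"unknown check function: {name}")
        resolved.append(available[name])
    return resolved

# Checks as they might be parsed from YAML/JSON metadata (structure
# assumed for illustration).
checks = [
    {"criticality": "error",
     "check": {"function": "is_not_null",
               "arguments": {"col_name": "customer_id"}}},
    {"criticality": "warn",
     "check": {"function": "my_custom_check",
               "arguments": {"col_name": "amount"}}},
]

# is_not_null resolves without any manual import; my_custom_check is
# found via the explicitly passed dictionary.
resolved = resolve_check_functions(
    checks,
    custom_check_functions={"my_custom_check": lambda col_name: None},
)
```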
@mwojtyczka mwojtyczka mentioned this pull request Mar 10, 2025
mwojtyczka added a commit that referenced this pull request Mar 10, 2025