[BUG]: SQL Expression Parsing Error, when expression includes quotes #162

Closed
bullah12 opened this issue Feb 7, 2025 · 1 comment · Fixed by #165
bullah12 commented Feb 7, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When defining a SQL expression rule within a YAML file, if the expression contains any quotation marks, the parser fails to validate the checks when they are loaded with the dq_engine.load_checks_from_workspace_file(file) function.

For example, when applying the following YAML rule to dq_engine.load_checks_from_workspace_file():

 - name: example_test
   criticality: error
   check:
     function: sql_expression
     arguments:
       expression: col1 like "Team %"

It will return the following error:

ParserError: while parsing a flow mapping
  in "<unicode string>", line 1, column 45:
     ...  "sql_expression", "arguments": {"expression": "col1 like "Team %""}}
                                         ^
expected ',' or '}', but got '<scalar>'
  in "<unicode string>", line 1, column 70:
     ... ents": {"expression": "col1 like "Team %""}}
                                         ^
File <command-7538297306428071>, line 1
----> 1 checks = dq_engine.load_checks_from_workspace_file(workspace_path="/Workspace/Users/<user-name>/<rules_file.yml>")
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/yaml/parser.py:549, in Parser.parse_flow_mapping_key(self, first)
    547     else:
    548         token = self.peek_token()
--> 549         raise ParserError("while parsing a flow mapping", self.marks[-1],
    550                 "expected ',' or '}', but got %r" % token.id, token.start_mark)
    551 if self.check_token(KeyToken):
    552     token = self.get_token()

The same problem occurs if you use SQL expressions like:

  • col1 = 'Team A'
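
For reference, a minimal sketch that reproduces the underlying error with plain PyYAML, assuming the engine re-serializes each check into a flow-style mapping with double quotes before parsing it (as the traceback above suggests):

import yaml

# This string mirrors the one shown in the traceback: the unescaped inner
# quotes terminate the scalar early, so the flow mapping cannot be parsed.
raw = '{"function": "sql_expression", "arguments": {"expression": "col1 like "Team %""}}'
yaml.safe_load(raw)  # raises yaml.parser.ParserError: while parsing a flow mapping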

Expected Behavior

The function should accept quotation marks as part of the SQL expression argument and ensure they are carried over during JSON parsing.

Since it does not, a workaround is needed so that dq_engine.load_checks_from_workspace_file can parse the checks despite the quotation marks (the workaround is described in the Steps To Reproduce section).
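
For comparison, a minimal sketch of the expected behavior, assuming the checks are serialized to a string before being re-parsed: json.dumps escapes the inner quotes, so the round trip is lossless.

import json

# Escaping the inner quotes makes the round trip lossless.
check = {"function": "sql_expression", "arguments": {"expression": 'col1 like "Team %"'}}
serialized = json.dumps(check)          # the inner quotes become \"
assert json.loads(serialized) == check  # the expression survives intact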

Steps To Reproduce

As a temporary workaround, do the following:

  1. Replace any quotation marks you would typically use in your SQL expression with the backtick (`) [see Notes]. e.g.: col1 like `Team %`
  2. Run your DQX load function. e.g.: checks = dq_engine.load_checks_from_workspace_file(workspace_path="")
  3. Define the following function in your notebook/workspace:
import copy

def replace_quotes(checks, symbol='`'):
    # Restore double quotes in sql_expression rules: replace every
    # occurrence of the placeholder symbol with a double quote.
    cleaned_checks = copy.deepcopy(checks)  # leave the caller's checks untouched
    for check in cleaned_checks:
        if check['check']['function'] == 'sql_expression':
            expression = check['check']['arguments']['expression']
            if symbol in expression:
                check['check']['arguments']['expression'] = expression.replace(symbol, '"')
    return cleaned_checks
  4. Pass your checks variable through the function. e.g.: checks = replace_quotes(checks)
  5. Check the JSON form of your checks to ensure the rule has been carried over correctly (a verification sketch follows below)
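
A minimal verification sketch for step 5, assuming the checks are plain dictionaries:

import json

checks = replace_quotes(checks)
# Inspect the JSON form; the inner quotes should now appear escaped as \".
print(json.dumps(checks, indent=2))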

Notes:
The backtick can be replaced with any other symbol (or combination of symbols) you'd like; you just need to pass the same value to the symbol argument of your replace_quotes function, as in the sketch below.
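
For instance, a hedged one-liner with a custom placeholder (the placeholder string here is an arbitrary choice):

# Any placeholder works, as long as the YAML file and the function agree on it.
checks = replace_quotes(checks, symbol='~~')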

Cloud

Azure

Operating System

Windows

@bullah12 bullah12 added the bug Something isn't working label Feb 7, 2025
@mwojtyczka mwojtyczka self-assigned this Feb 10, 2025
mwojtyczka commented Feb 11, 2025

@bullah12 thank you for reporting this.
Fix should be merged and released this week: #165

mwojtyczka added a commit that referenced this issue Feb 12, 2025
* Provided option to customize reporting column names ([#127](#127)). In this release, the DQEngine library has been enhanced to allow for customizable reporting column names. A new constructor has been added to DQEngine, which accepts an optional ExtraParams object for extra configurations. A new Enum class, DefaultColumnNames, has been added to represent the columns used for error and warning reporting. New tests have been added to verify the application of checks with custom column naming. These changes aim to improve the customizability, flexibility, and user experience of DQEngine by providing more control over the reporting columns and resolving issue [#46](#46).
* Fixed parsing error when loading checks from a file ([#165](#165)). In this release, we have addressed a parsing error that occurred when loading checks (data quality rules) from a file, fixing issue [#162](#162). The specific issue being resolved is a SQL expression parsing error. The changes include refactoring tests to eliminate code duplication and improve maintainability, as well as updating method and variable names to use `filepath` instead of "path". Additionally, new unit and integration tests have been added and manually tested to ensure the correct functionality of the updated code.
* Removed usage of try_cast spark function from the checks to make sure DQX can be run on more runtimes ([#163](#163)). In this release, we have refactored the code to remove the usage of the `try_cast` Spark function and replace it with `cast` and `isNull` checks to improve code compatibility, particularly for runtimes where `try_cast` is not available. The affected functionality includes null and empty column checks, checking if a column value is in a list, and checking if a column value is a valid date or timestamp. We have added unit and integration tests to ensure functionality is working as intended.
* Added filter to rules so that you can make conditional checks ([#141](#141)). The filter serves as a condition that data must meet to be evaluated by the check function. The filters restrict the evaluation of checks to only apply to rows that meet the specified conditions. This feature enhances the flexibility and customizability of data quality checks in the DQEngine.
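
As an illustration of the try_cast replacement described above, a minimal PySpark sketch (not the library's exact code; the column name and target type are assumptions):

from pyspark.sql import functions as F

# try_cast returns null on failure; the same signal can be rebuilt portably:
# cast to the target type and flag rows where a non-null input became null.
casted = F.col("col1").cast("date")
is_invalid_date = F.col("col1").isNotNull() & casted.isNull()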