[Spark] Avoid non-deterministic UDF to filter deleted rows #2576
Which Delta project/connector is this regarding?
Spark
Description
Currently, rows that are marked as deleted by deletion vectors (DVs) are filtered out by a non-deterministic UDF that is added to the plan by the PrepareDeltaScan rule. The problem is that the non-deterministic UDF prevents any filters from being pushed down to the scan, resulting in poor performance. In addition, the non-deterministic UDF blocks a number of optimizations, e.g. reusing subqueries.

To avoid these issues, this commit replaces the non-deterministic UDF with a standard filter expression that is injected by the new PreprocessTableWithDVsStrategy before the logical plan is converted to a physical one. The DV filter is the bottom-most filter in the logical plan, so it is placed at the beginning of the filters that are pushed to the FileSourceScanExec node. Note that the DV filter will not be pushed further down into the Parquet reader, because filter pushdown is disabled when DVs are enabled.
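The snippet below is a minimal, self-contained sketch of the pushdown difference this change targets. It is not Delta's actual code: the __is_deleted column, the temp Parquet table, and the DvFilterPushdownDemo object are made up for illustration, and a plain Parquet table stands in for a Delta table with DVs. Comparing the two explain() outputs shows that a filter above a non-deterministic UDF stays out of the scan's PushedFilters, while plain deterministic predicates are pushed down.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object DvFilterPushdownDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("dv-filter-pushdown-demo")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical layout: __is_deleted stands in for the per-row DV flag.
  val path = Files.createTempDirectory("dv-demo").toString
  Seq((1, false), (2, true), (3, false))
    .toDF("id", "__is_deleted")
    .write.parquet(path)
  val table = spark.read.parquet(path)

  // Non-deterministic UDF (the old approach, conceptually): Spark cannot
  // reorder `id > 1` below it, so the scan's PushedFilters stays empty.
  val keepRow = udf((deleted: Boolean) => !deleted).asNondeterministic()
  table.filter(keepRow(col("__is_deleted"))).filter(col("id") > 1).explain()

  // Plain deterministic predicate (conceptually what the new strategy
  // injects): both predicates appear in the scan's PushedFilters.
  table.filter(!col("__is_deleted")).filter(col("id") > 1).explain()

  spark.stop()
}
```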
How was this patch tested?
Existing tests.
Does this PR introduce any user-facing changes?
No