
[Spark] Avoid non-deterministic UDF to filter deleted rows #2576

Closed

Conversation

@cstavr (Contributor) commented Jan 26, 2024

Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

Description

Currently, filtering out rows that are marked as deleted in Deletion Vectors (DVs) is performed with a non-deterministic UDF that the PrepareDeltaScan rule adds to the plan. The problem is that the non-deterministic UDF prevents any filters from being pushed down to the scan, resulting in poor performance. In addition, it blocks a number of other optimizations, e.g. reusing subqueries.
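To illustrate the problem, here is a minimal, self-contained sketch (not Delta code; the UDF and data are stand-ins): once a UDF is marked non-deterministic, Catalyst will not push a neighboring deterministic predicate through it toward the scan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object NonDeterministicFilterDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    // Stand-in for the DV row filter: always true, but marked non-deterministic,
    // like the UDF that PrepareDeltaScan adds to the plan.
    val isRowLive = udf(() => true).asNondeterministic()

    val df = Seq((1, "a"), (10, "b")).toDF("id", "value")

    // Because `isRowLive()` is non-deterministic, the optimizer will not push
    // `col("id") > 5` below it; the two filters stay separate in the plan.
    df.filter(isRowLive()).filter(col("id") > 5).explain(true)

    spark.stop()
  }
}
```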

To avoid these issues, this commit replaces the non-deterministic UDF with a standard filter expression that the new PreprocessTableWithDVsStrategy injects before the logical plan is converted to a physical one. The DV filter becomes the bottom-most filter in the logical plan, so it is placed at the beginning of the filters that are pushed to the FileSourceScanExec node.
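As a rough illustration of the idea, here is a simplified sketch, written as a logical rule for brevity rather than the actual PreprocessTableWithDVsStrategy, and assuming the scan exposes a boolean `__delta_internal_is_row_deleted` column:

```scala
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LeafNode, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object InjectDvFilterSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Wrap each scan that exposes the internal column in a plain, deterministic
    // Filter. Being directly above the scan, it is the bottom-most filter and
    // leads the predicates handed to FileSourceScanExec.
    case scan: LeafNode if scan.output.exists(_.name == "__delta_internal_is_row_deleted") =>
      val isDeleted = scan.output.find(_.name == "__delta_internal_is_row_deleted").get
      Filter(EqualTo(isDeleted, Literal(false)), scan)
  }
}
```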

Note that the DV filter will not be further pushed down to the Parquet reader because filter pushdown is disabled when DVs are enabled.

How was this patch tested?

Existing tests.

Does this PR introduce any user-facing changes?

No

vkorukanti pushed a commit that referenced this pull request Apr 26, 2024
## Description

Currently, when Deletion Vectors are enabled we disable predicate pushdown and splitting in scans. This is because we rely on a custom row index column which is constructed in the executors and cannot handle splits and predicates. These restrictions can now be lifted by relying instead on `_metadata.row_index`, which was recently exposed after the relevant [work](https://issues.apache.org/jira/browse/SPARK-37980) concluded.

Overall, this PR adds support for predicate pushdown and splits as follows:

1. Replaces `__delta_internal_is_row_deleted` with `_metadata.row_index`.
2. Adds a new implementation of `__delta_internal_is_row_deleted` that is based on `_metadata.row_index` (see the sketch below).
3. The `IsRowDeleted` filter is now non-deterministic to allow predicate pushdown.

Furthermore, it includes the earlier related [work](#2576) that removes the UDF from the `IsRowDeleted` filter.
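For readers unfamiliar with `_metadata.row_index`, here is a hedged sketch of the approach (the path and the decoded deletion vector are hypothetical stand-ins, not Delta internals):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RowIndexFilterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

    // Hypothetical decoded deletion vector: row indexes 3 and 17 are deleted.
    val deletedRowIndexes = Seq(3L, 17L)

    spark.read.parquet("/tmp/some_table") // hypothetical table path
      // _metadata.row_index is produced by the Parquet reader itself
      // (SPARK-37980), so it stays correct under splits and predicate pushdown.
      .filter(!col("_metadata.row_index").isin(deletedRowIndexes: _*))
      .show()

    spark.stop()
  }
}
```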

## How was this patch tested?
Added new suites.
scottsand-db pushed a commit that referenced this pull request Apr 26, 2024
@cstavr closed this Jun 19, 2024