
[Spark] Support predicate pushdown in scans with DVs #2933

Merged

Conversation

andreaschat-db
Contributor

@andreaschat-db andreaschat-db commented Apr 21, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Currently, when Deletion Vectors are enabled, we disable predicate pushdown and splitting in scans. This is because we rely on a custom row index column that is constructed in the executors and cannot handle splits and predicates. These restrictions can now be lifted by relying instead on _metadata.row_index, which was recently exposed after the relevant upstream work concluded.

Overall, this PR adds predicate pushdown and splits support as follows:

  1. Replaces __delta_internal_is_row_deleted with _metadata.row_index.
  2. Adds a new implementation of __delta_internal_is_row_deleted that is based on _metadata.row_index.
  3. Marks the IsRowDeleted filter as non-deterministic to allow predicate pushdown.

Furthermore, it includes previous relevant work to remove the UDF from the IsRowDeleted filter.
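Conceptually, a deletion vector marks deleted rows by their absolute position in the data file, so once each row carries a file-level row index (the role `_metadata.row_index` plays), the IsRowDeleted check reduces to a bitmap membership test. A minimal Python sketch of this idea (purely illustrative, not Delta's actual implementation; all names here are made up):

```python
def filter_deleted_rows(rows_with_index, deletion_vector):
    """Keep rows whose file-level row index is not in the deletion vector.

    rows_with_index: iterable of (row_index, value) pairs, as a reader
    that exposes a metadata row index would produce them.
    deletion_vector: a set of deleted row indexes (standing in for the
    roaring bitmap Delta actually stores).
    """
    return [value for idx, value in rows_with_index if idx not in deletion_vector]

# Because each row carries its absolute index within the file, the check
# still works when predicate pushdown has already dropped rows upstream:
surviving_batch = [(0, "a"), (2, "c"), (5, "f")]  # rows left after pushdown
dv = {2, 4}                                       # row indexes marked deleted
print(filter_deleted_rows(surviving_batch, dv))   # ['a', 'f']
```

A per-file counter maintained by the reader cannot make this guarantee, which is why the old custom column blocked pushdown and splitting in the first place.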

How was this patch tested?

Added new suites.

Does this PR introduce any user-facing changes?

No.

@andreaschat-db andreaschat-db changed the title [TEST] Support predicate pushday in scans with DVs [TEST][DO_NOT_MERGE] Support predicate pushday in scans with DVs Apr 21, 2024
@andreaschat-db andreaschat-db changed the title [TEST][DO_NOT_MERGE] Support predicate pushday in scans with DVs [TEST][DO_NOT_MERGE] Support predicate pushdown in scans with DVs Apr 21, 2024
@andreaschat-db andreaschat-db force-pushed the supportPredicatePushdayInScansWithDVs branch from b8b7ea8 to 6657d63 Compare April 23, 2024 15:12
The force-pushed branch carried the following commit messages (deduplicated; several were placeholder "flush" commits):

flush
First sane version without isRowDeleted
Hack RowIndexMarkingFilters
Add support for non-vectorized readers
Metadata column fix
Avoid non-deterministic UDF to filter deleted rows
metadata with Expression ID
Fix complex views issue
Tests
cleaning
More tests and fixes
Partial cleaning
cleaning and improvements
Clean RowIndexFilter
Clean DeltaParquetFileFormat
Improve DeletionVectorsSuite
Disable DeltaParquetFileFormatSuite for predicate pushdown.
@andreaschat-db andreaschat-db force-pushed the supportPredicatePushdayInScansWithDVs branch from 6657d63 to 8bdfd8c Compare April 23, 2024 18:14
@andreaschat-db andreaschat-db changed the title [TEST][DO_NOT_MERGE] Support predicate pushdown in scans with DVs [Spark] Support predicate pushdown in scans with DVs Apr 24, 2024
Collaborator

@vkorukanti vkorukanti left a comment


LGTM, looks like some of the changes in the DeltaParquetFileFormat can be reverted/simplified.

Collaborator

@xupefei xupefei left a comment


Minor comments.
The logic of DeltaParquetFileFormat is getting out of hand, but I don't see a feasible way to improve it in the short term.

@andreaschat-db
Contributor Author

Minor comments. The logic of DeltaParquetFileFormat is getting out of hand, but I don't see a feasible way to improve it in the short term.

I did some cleaning. It looks better now.

Collaborator

@vkorukanti vkorukanti left a comment


lgtm

@vkorukanti vkorukanti merged commit f4a4944 into delta-io:master Apr 26, 2024
7 of 8 checks passed
@felipepessoto
Contributor

felipepessoto commented May 7, 2024

@andreaschat-db do you know why we were using __delta_internal_is_row_deleted in the previous version (2.4), given that row_index is available in Spark 3.4? Any known issue in Spark 3.4?

I wonder if this is the reason: https://issues.apache.org/jira/browse/SPARK-39634

Do you think it is possible to backport this to Delta 2.4? Splits wouldn't work because Spark 3.4 doesn't support them, but at least predicate pushdown could work?

@andreaschat-db
Contributor Author

andreaschat-db commented May 8, 2024

@andreaschat-db do you know why we were using __delta_internal_is_row_deleted in the previous version (2.4), given that row_index is available in Spark 3.4? Any known issue in Spark 3.4?

I wonder if this is the reason: https://issues.apache.org/jira/browse/SPARK-39634

Do you think it is possible to backport this to Delta 2.4? Splits wouldn't work because Spark 3.4 doesn't support them, but at least predicate pushdown could work?

Hi Felipe,

There was an issue in parquet-mr that prevented the correct construction of row indexes. To backport this, Spark needs to be paired with a Parquet version that contains the fix.

@felipepessoto
Contributor

felipepessoto commented May 8, 2024

The fixed version, 1.12.3, is in Spark 3.4: https://github.com/apache/spark/blob/da0c7cc81bb3d69d381dd0683e910eae4c80e9ae/pom.xml#L143

I think splits would not be possible yet because of https://issues.apache.org/jira/browse/PARQUET-2161, which is only fixed in 1.13.0 (Spark 3.5).

Spark 3.4 ParquetScan.isSplitable, where this is mentioned: https://github.com/apache/spark/blob/da0c7cc81bb3d69d381dd0683e910eae4c80e9ae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScan.scala#L50C1-L57C4

  override def isSplitable(path: Path): Boolean = {
    // If aggregate is pushed down, only the file footer will be read once,
    // so file should not be split across multiple tasks.
    pushedAggregate.isEmpty &&
      // SPARK-39634: Allow file splitting in combination with row index generation once
      // the fix for PARQUET-2161 is available.
      !RowIndexUtil.isNeededForSchema(readSchema)
  }

But filter pushdown might work.
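The split problem the comment in ParquetScan.isSplitable alludes to can be seen with a toy model (purely illustrative, not Spark's code): a reader-side counter, which is effectively what a custom row index column is, restarts in every task, while an index derived from file metadata stays absolute per file.

```python
def counter_based_indexes(num_rows, split_start=0):
    # A per-task counter restarts at 0, so a split that begins at file
    # row `split_start` mislabels every row it reads.
    return list(range(num_rows))

def metadata_based_indexes(num_rows, split_start=0):
    # An index computed from file metadata stays absolute, so
    # deletion-vector lookups remain correct within any split.
    return list(range(split_start, split_start + num_rows))

# A split covering file rows 100..102:
print(counter_based_indexes(3, split_start=100))   # [0, 1, 2]  (wrong for DV lookups)
print(metadata_based_indexes(3, split_start=100))  # [100, 101, 102]
```

Under this model, filter pushdown alone (no splitting) keeps the counter correct only when the task still reads the file from row 0 and the counter is advanced for skipped rows, which is exactly the fragility the metadata-based index removes.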
