Add Parquet bloom filter write support to Iceberg connector #21602
Changes from all commits: cd69afd, ac3b281
A few lines above you have the original `fileColumnNames` - please correlate those with what is specified in the table properties (case-insensitive name matching) in `getParquetBloomFilterColumns`.
Also, a new test to add: schema evolution - create a table with a bunch of bloom filter columns, drop one of the columns that was specified as a bloom filter column, and make sure that you don't get any errors. I'm guessing we'd have to filter out in `getParquetBloomFilterColumns` the column names which don't exist anymore.
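A minimal sketch of such a test, assuming the table property is exposed as `parquet_bloom_filter_columns` and using the usual `AbstractTestQueryFramework` helpers (class, method, and property names are illustrative, not taken from this PR):

```java
import org.junit.jupiter.api.Test;

import static io.trino.testing.TestingNames.randomNameSuffix;

// Would live in one of the Iceberg connector test classes extending AbstractTestQueryFramework
@Test
public void testInsertAfterDroppingBloomFilterColumn()
{
    String tableName = "test_bloom_filter_schema_evolution_" + randomNameSuffix();
    assertUpdate("CREATE TABLE " + tableName + " (a BIGINT, b BIGINT) " +
            "WITH (parquet_bloom_filter_columns = ARRAY['a', 'b'])");
    // Drop a column that was listed as a bloom filter column
    assertUpdate("ALTER TABLE " + tableName + " DROP COLUMN b");
    // The write path should skip the dropped column instead of failing
    assertUpdate("INSERT INTO " + tableName + " VALUES (1), (2)", 2);
    assertUpdate("DROP TABLE " + tableName);
}
```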
The writing logic ignores non-existent columns for which the Bloom filter property is set.
Shouldn't we see the bloom filter columns in `SHOW CREATE TABLE` now that we're dealing with a supported table property?
Modify `io.trino.plugin.iceberg.IcebergUtil#getIcebergTableProperties`.
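Roughly, the property would need to be added to the map built there. A sketch, assuming the bloom filter columns are stored as Iceberg properties of the form `write.parquet.bloom-filter-enabled.column.<name> = true`; the constant name is an assumption, not the PR's actual code:

```java
// Inside io.trino.plugin.iceberg.IcebergUtil#getIcebergTableProperties (sketch only)
ImmutableMap.Builder<String, Object> properties = ImmutableMap.builder();
// ... existing entries: format, format_version, partitioning, location, ...

// Hypothetical: surface the Parquet bloom filter columns so SHOW CREATE TABLE prints them
List<String> parquetBloomFilterColumns = icebergTable.properties().entrySet().stream()
        .filter(entry -> entry.getKey().startsWith("write.parquet.bloom-filter-enabled.column."))
        .filter(entry -> Boolean.parseBoolean(entry.getValue()))
        .map(entry -> entry.getKey().substring("write.parquet.bloom-filter-enabled.column.".length()))
        .collect(toImmutableList());
if (!parquetBloomFilterColumns.isEmpty()) {
    properties.put(PARQUET_BLOOM_FILTER_COLUMNS_PROPERTY, parquetBloomFilterColumns);
}

return properties.buildOrThrow();
```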
I tried on the above scaffolding and see the following. This does not seem to match the expectations of `io.trino.testing.BaseTestParquetWithBloomFilters#testBloomFilterRowGroupPruning(io.trino.spi.connector.CatalogSchemaTableName, java.lang.String)`.
We could add a `toLowerCase` to `getParquetBloomFilterColumns` to handle this? It looks like we have the same issue for the Iceberg ORC bloom filters. Should we handle case sensitivity in this PR, or in a follow-up?
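A sketch of what that could look like, assuming the writer's `fileColumnNames` are passed in and the columns are read from the `write.parquet.bloom-filter-enabled.column.*` table properties (the signature and property prefix are assumptions):

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

import static com.google.common.collect.ImmutableList.toImmutableList;
import static com.google.common.collect.ImmutableMap.toImmutableMap;
import static java.util.Locale.ENGLISH;
import static java.util.function.Function.identity;

private static final String BLOOM_FILTER_COLUMN_PREFIX = "write.parquet.bloom-filter-enabled.column.";

private static List<String> getParquetBloomFilterColumns(Map<String, String> tableProperties, List<String> fileColumnNames)
{
    // Map lower-cased name -> actual file column name for case-insensitive matching
    Map<String, String> fileColumnsByLowerCaseName = fileColumnNames.stream()
            .collect(toImmutableMap(name -> name.toLowerCase(ENGLISH), identity()));
    return tableProperties.entrySet().stream()
            .filter(entry -> entry.getKey().startsWith(BLOOM_FILTER_COLUMN_PREFIX))
            .filter(entry -> Boolean.parseBoolean(entry.getValue()))
            .map(entry -> entry.getKey().substring(BLOOM_FILTER_COLUMN_PREFIX.length()))
            // Match case-insensitively and silently drop columns that no longer exist in the schema
            .map(name -> fileColumnsByLowerCaseName.get(name.toLowerCase(ENGLISH)))
            .filter(Objects::nonNull)
            .collect(toImmutableList());
}
```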
Let's rather fix the functionality in this PR instead of delivering half-baked functionality that may backfire with bugs.
An alternative with fewer headaches would be to register a pre-created resource table and check the query stats on it, similar to what has been done in https://github.com/trinodb/trino/blob/ca209630136eabda2449594ef2b6a4d82fb9c2e5/plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergReadVersionedTableByTemporal.java
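A rough sketch of that approach, assuming a pre-generated Iceberg table checked into test resources (the resource path, row count, and the directory-copy helper are illustrative, and the `register_table` procedure may need to be enabled on the test catalog):

```java
@Test
public void testBloomFilterOnPreCreatedTable()
        throws Exception
{
    // Copy a table that was generated once (e.g. by Spark with bloom filters enabled)
    // from src/test/resources into a temporary warehouse location
    Path tableLocation = Files.createTempDirectory("parquet_bloom_filter_table");
    copyDirectoryContents(Path.of(getResource("iceberg/parquet_bloom_filter_table").toURI()), tableLocation);

    assertQuerySucceeds(format(
            "CALL system.register_table('%s', 'parquet_bloom_filter_table', '%s')",
            getSession().getSchema().orElseThrow(),
            tableLocation.toUri()));

    // With the table registered, query stats can be asserted on, as in the sketch further below
    assertQuery("SELECT count(*) FROM parquet_bloom_filter_table", "VALUES 100");
}
```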
Easy access to query stats would be useful to have in the product tests. It would allow the product tests in this PR to give more coverage. Unfortunately, product tests are not my cup of tea for Friday hacking 😅
We need a mechanism to get the query stats in the product tests to ensure that the bloom filter is actually effective and that we don't introduce regressions while refactoring.
Would someone be able to help add this logic? I don't have much experience with the product tests and unfortunately don't have much capacity to follow up on this at the moment. It would be much appreciated!
@findinpath aren't we already testing the effectiveness of the bloom filter in query runner tests? I'm not sure we should block this PR over checking this in product tests as well; we don't do that for bloom filters in the Hive connector either.
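For reference, this is roughly how the query runner tests can assert effectiveness, assuming the `assertQueryStats` helper from `AbstractTestQueryFramework` (table and column names are illustrative):

```java
@Test
public void testBloomFilterPrunesRowGroups()
{
    // A point lookup on a bloom filter column for a value that is absent from the data
    // should prune row groups, so the scan should read little to nothing
    assertQueryStats(
            getSession(),
            "SELECT * FROM parquet_bloom_filter_table WHERE a = -1",
            queryStats -> assertThat(queryStats.getPhysicalInputPositions()).isEqualTo(0),
            results -> assertThat(results.getRowCount()).isEqualTo(0));
}
```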
Do we need to lowercase the column names? We'd probably need a Spark compatibility test using case-sensitive column names to check this. I see there is already `testSparkReadingTrinoBloomFilters`.
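A sketch of what such a product test could look like, modeled on `testSparkReadingTrinoBloomFilters` (the `trinoTableName`/`sparkTableName` helpers and test groups are assumed from the existing Spark compatibility tests, and the property name is also an assumption):

```java
@Test(groups = {ICEBERG, PROFILE_SPECIFIC_TESTS})
public void testSparkReadingTrinoBloomFiltersWithMixedCaseColumns()
{
    String baseTableName = "test_trino_bloom_filter_mixed_case_" + randomNameSuffix();
    String trinoTableName = trinoTableName(baseTableName);
    String sparkTableName = sparkTableName(baseTableName);

    // Bloom filter column specified in lower case to exercise case-insensitive matching
    onTrino().executeQuery("CREATE TABLE " + trinoTableName + " (\"MixedCase\" BIGINT) " +
            "WITH (parquet_bloom_filter_columns = ARRAY['mixedcase'])");
    onTrino().executeQuery("INSERT INTO " + trinoTableName + " VALUES 1, 2, 3");

    // Spark should read the Trino-written file (including its bloom filters) without issues
    assertThat(onSpark().executeQuery("SELECT MixedCase FROM " + sparkTableName + " WHERE MixedCase = 2"))
            .containsOnly(row(2L));

    onTrino().executeQuery("DROP TABLE " + trinoTableName);
}
```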