Change Block getSizeInBytes to estimate fully expanded data size #25256

Open
wants to merge 10 commits into
base: master
from

dain
Member

@dain dain commented Mar 7, 2025

Description

The Block getSizeInBytes method is currently defined to return the "compacted" size of a block. For an RLE block, only the size of the single value is returned; for a dictionary block, the size of each position is computed from the dictionary. This computation is quite expensive and requires Blocks to have getPositionsSizeInBytes and fixedSizeInBytesPerPosition methods.

This PR redefines the method to return an estimate of the full (expanded) data size of a block. For an RLE block the size is value.getSizeInBytes() * positionCount, and for a dictionary block it is (dictionary.getSizeInBytes() / dictionary.getPositionCount()) * positionCount. Both are simpler and much faster to compute.
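As a rough illustration of the two formulas above, here is a minimal standalone sketch. It is not the code in this PR: the class `ExpandedSizeEstimates` and its method names are hypothetical, the blocks are reduced to plain numbers, and the use of a floating-point average and the empty-dictionary guard are assumptions made for the example.

```java
// Hypothetical helper, not part of the Trino SPI: it only mirrors the
// arithmetic described in the PR text using plain numbers.
final class ExpandedSizeEstimates
{
    private ExpandedSizeEstimates() {}

    // RLE block: one value repeated positionCount times, so the expanded
    // estimate is the single value's size multiplied by the position count.
    static long rleSizeInBytes(long valueSizeInBytes, int positionCount)
    {
        return valueSizeInBytes * positionCount;
    }

    // Dictionary block: average dictionary entry size times the number of
    // positions. The empty-dictionary guard is an assumption of this sketch.
    static long dictionarySizeInBytes(long dictionarySizeInBytes, int dictionaryPositionCount, int positionCount)
    {
        if (dictionaryPositionCount == 0) {
            return 0;
        }
        // Assumption: average in floating point to avoid truncating the per-entry size.
        double averageEntrySize = (double) dictionarySizeInBytes / dictionaryPositionCount;
        return (long) (averageEntrySize * positionCount);
    }

    public static void main(String[] args)
    {
        // An 8-byte value repeated over 1000 positions
        System.out.println(rleSizeInBytes(8, 1000));
        // A 40-byte dictionary with 4 entries referenced by 1000 positions
        System.out.println(dictionarySizeInBytes(40, 4, 1000));
    }
}
```

Neither estimate needs to visit individual positions, which is where the cost of the compacted computation came from.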

Current Usage

The getSizeInBytes method is typically used to estimate the size needed for output buffers when copying data in operators, and these usages generally assume that the returned value is a fully expanded size. The other main usage is in stats, where the change will have visible effects: the method is often used to calculate the input/output size of operators, sources, and sinks. It is not clear whether these uses intended the compacted or the fully expanded size.
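To make the output-buffer point concrete, the sketch below compares sizing a buffer from a compacted RLE size with sizing it from the expanded estimate. Everything here is hypothetical: `BufferSizingSketch`, its method, and the doubling growth policy are assumptions for illustration, not how any particular operator allocates memory.

```java
// Hypothetical illustration only: counts how many times a buffer that grows by
// doubling must be reallocated before it can hold the fully expanded data.
final class BufferSizingSketch
{
    private BufferSizingSketch() {}

    static int reallocationsNeeded(long initialCapacity, long requiredBytes)
    {
        long capacity = Math.max(1, initialCapacity);
        int reallocations = 0;
        while (capacity < requiredBytes) {
            capacity *= 2; // assumed growth policy
            reallocations++;
        }
        return reallocations;
    }

    public static void main(String[] args)
    {
        long valueSize = 8;       // one 8-byte value in an RLE block
        int positionCount = 1000; // repeated over 1000 positions
        long bytesCopied = valueSize * positionCount;

        // Sized from the compacted size (just the single value)
        System.out.println(reallocationsNeeded(valueSize, bytesCopied));
        // Sized from the fully expanded estimate
        System.out.println(reallocationsNeeded(bytesCopied, bytesCopied));
    }
}
```

Under these assumptions the compacted size leads to repeated regrowth while the expanded estimate fits on the first allocation, which matches the expectation that copy sites want the expanded number.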

Release notes

(X) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## SPI
* Block `getSizeInBytes` now returns an estimate of the full data size of the block. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Mar 7, 2025
@github-actions github-actions bot added delta-lake Delta Lake connector hive Hive connector labels Mar 7, 2025
@dain dain force-pushed the getSizeInBytes branch from 072f370 to 177100d Compare March 8, 2025 01:42
@dain dain force-pushed the getSizeInBytes branch from 177100d to c4ab584 Compare March 8, 2025 22:29
@dain dain force-pushed the getSizeInBytes branch from c4ab584 to f3325f1 Compare March 8, 2025 22:47