Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add support for building statistics array for nested types #45474

Open
kou opened this issue Feb 10, 2025 · 7 comments
Open

[C++] Add support for building statistics array for nested types #45474

kou opened this issue Feb 10, 2025 · 7 comments

Comments

@kou
Copy link
Member

kou commented Feb 10, 2025

Describe the enhancement requested

arrow::RecordBatch::MakeStatisticsArray() can build a statistics array of a record batch. But it doesn't support nested types yet. We should use the rule that is used by RecordBatch message https://arrow.apache.org/docs/format/Columnar.html#ipc-recordbatch-message to compute column indexes for nested types.

See also: https://arrow.apache.org/docs/format/StatisticsSchema.html#schema

Component(s)

C++

@arashandishgar
Copy link
Contributor

as follow up our privious conversation and the implementation of arrow::RecordBatch::MakeStatisticsArray(), two things seems does not exist.

1.apporpiate type for nested array
2. lack of the implementation of Depth-first search on column array
3. Missing some array attributes #45639

I think we should address the first case to make it possible the second one

@arashandishgar
Copy link
Contributor

I've submitted my first pull request which is about to implement extract_regex_span and it is currently under review. I'd like to start working on this issue, However I'm still gaining experience in C++ and it maybe take some time to implement it. Nonetheless, I approciate if you could assign the issue to me.

@kou
Copy link
Member Author

kou commented Feb 27, 2025

You can do it by yourself by just adding "take" only comment!

Document: https://arrow.apache.org/docs/developers/bug_reports.html#issue-assignment
Example: #45619 (comment)

@arashandishgar
Copy link
Contributor

take

@arashandishgar
Copy link
Contributor

arashandishgar commented Mar 3, 2025

I found that it seems the StringView and BinaryVeiw types has not supported in ValueToArrowType in ArrayStatistics.

switch (array_type->id()) {
case Type::STRING:
case Type::BINARY:
case Type::FIXED_SIZE_BINARY:
case Type::LARGE_STRING:
case Type::LARGE_BINARY:
return array_type;
default:
return utf8();
}

should I add support for these types?

@kou
Copy link
Member Author

kou commented Mar 3, 2025

Yes but let's work on it as a separated task. Because StringView/BinaryView aren't nested types.

@arashandishgar
Copy link
Contributor

Yes but let's work on it as a separated task. Because StringView/BinaryView aren't nested types.

I've created the separate issue for this #45664 , However I think it's better to send a pull request for it after the current issue is solved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants