Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go] Allow prepending dictionary #47

Open
brancz opened this issue Sep 4, 2023 · 6 comments · May be fixed by apache/arrow#37634
Open

[Go] Allow prepending dictionary #47

brancz opened this issue Sep 4, 2023 · 6 comments · May be fixed by apache/arrow#37634
Assignees
Labels
Type: enhancement New feature or request

Comments

@brancz
Copy link
Contributor

brancz commented Sep 4, 2023

Describe the enhancement requested

The dictionary builders already have methods to insert whole arrays, but unfortunately they cause a lot of potentially unnecessary CPU time.

Take the following scenario: I have two sources of data, one of them is already dictionary encoded, the other is not, so I would like to initialize the dictionary builder with the existing dictionary, and only insert new items for the non-dictionary encodede items. Now comes the important part: I'm ok with inserts potentially creating duplicates in the dictionary.

I would like to propose a new API PrependInitialDict, that takes an array and must be called before inserting into the indices array, otherwise it errors, and then any new dictionary item inserted start at len(initialDict)+i.

Theoretically it could even be designed to insert dicts multiple times, but I would suggest to start the API like this and only extend when we have the use cases.


Alternative I have considered: Prepending the dictionary after building the "new" dictionary and have any indices start at the length. I've found this to not really be workable, for two reasons:

  1. There would still have to be an API to set the initial index.
  2. It would rely on the user actually prepending the dictionary afterward (easy to misuse).
  3. It would be quite awkward to use in scenarios where there are deeply nested lists and structs, where building the final record is primarily done using a record builder, but only this array would be the exception.

cc @zeroshade

Component(s)

Go

@brancz brancz added the Type: enhancement New feature or request label Sep 4, 2023
@zeroshade
Copy link
Member

How would this be different from using NewDictionaryBuilderWithDict which creates a new dictionary builder and inserts the provided array of data as the initial values in the dictionary but not as values?

@brancz
Copy link
Contributor Author

brancz commented Sep 7, 2023

The issue is that that also just calls InsertDictValues, which causes a number of CPU cycles that are not truly necessary: https://pprof.me/4a451fa

@zeroshade
Copy link
Member

So the goal here is that the dictionary values being prepended wouldn't actually be "in" the memotable? I'm not sure what you mean by the CPU cycles not being necessary if the cycles are being used to insert the values into the memotable.

In theory you could dictionary encode the non-encoded data and then just use the Concat to unify the dictionaries and transpose the indices?

@brancz
Copy link
Contributor Author

brancz commented Sep 7, 2023

I'm ok with inserts potentially creating duplicates in the dictionary.

That's why I wrote this. I don't actually want the "initial dictionary" values to be tracked by the memotable. I'm ok with potential duplicates as the two sources are so distinct they almost never will contain duplicates, and even if, then I'm ok with that happening occasionally.

The problem is that I want to only iterate once over my data to insert everything in a reasonably deeply nested struct, sometimes inserting just indices into the initial dictionary and sometimes adding new values.

@zeroshade
Copy link
Member

I'm still having difficulty visualizing the use case here. How about you put together a draft PR for this and some tests? i'll take a look and see if it helps me understand the use case better.

Personally it feels like there's gotta be an easier / faster way, like building the dictionary and then just appending the other values to the result dictionary at the end (that way they are still in the dictionary but don't break the indices), or something else that I'm just not seeing at the moment.

@brancz
Copy link
Contributor Author

brancz commented Sep 8, 2023

Sounds good, the change shouldn't be very large.

@assignUser assignUser transferred this issue from apache/arrow Aug 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants