-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Go] Allow prepending dictionary #47
Comments
How would this be different from using |
The issue is that that also just calls |
So the goal here is that the dictionary values being prepended wouldn't actually be "in" the memotable? I'm not sure what you mean by the CPU cycles not being necessary if the cycles are being used to insert the values into the memotable. In theory you could dictionary encode the non-encoded data and then just use the Concat to unify the dictionaries and transpose the indices? |
That's why I wrote this. I don't actually want the "initial dictionary" values to be tracked by the memotable. I'm ok with potential duplicates as the two sources are so distinct they almost never will contain duplicates, and even if, then I'm ok with that happening occasionally. The problem is that I want to only iterate once over my data to insert everything in a reasonably deeply nested struct, sometimes inserting just indices into the initial dictionary and sometimes adding new values. |
I'm still having difficulty visualizing the use case here. How about you put together a draft PR for this and some tests? i'll take a look and see if it helps me understand the use case better. Personally it feels like there's gotta be an easier / faster way, like building the dictionary and then just appending the other values to the result dictionary at the end (that way they are still in the dictionary but don't break the indices), or something else that I'm just not seeing at the moment. |
Sounds good, the change shouldn't be very large. |
Describe the enhancement requested
The dictionary builders already have methods to insert whole arrays, but unfortunately they cause a lot of potentially unnecessary CPU time.
Take the following scenario: I have two sources of data, one of them is already dictionary encoded, the other is not, so I would like to initialize the dictionary builder with the existing dictionary, and only insert new items for the non-dictionary encodede items. Now comes the important part: I'm ok with inserts potentially creating duplicates in the dictionary.
I would like to propose a new API
PrependInitialDict
, that takes an array and must be called before inserting into the indices array, otherwise it errors, and then any new dictionary item inserted start atlen(initialDict)+i
.Theoretically it could even be designed to insert dicts multiple times, but I would suggest to start the API like this and only extend when we have the use cases.
Alternative I have considered: Prepending the dictionary after building the "new" dictionary and have any indices start at the length. I've found this to not really be workable, for two reasons:
cc @zeroshade
Component(s)
Go
The text was updated successfully, but these errors were encountered: