-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make token deduplication optional for the repo indexer #7826
Conversation
Pondering about why I didn't see the kind of improvements in size that @ethantkoenig (the author of #3452) found, one thought came to mind: could it be possible that gains he found were caused by him rebuilding the index, and not thanks to adding the |
Codecov Report
@@ Coverage Diff @@
## master #7826 +/- ##
=========================================
+ Coverage 41.39% 41.4% +<.01%
=========================================
Files 474 474
Lines 63780 63785 +5
=========================================
+ Hits 26404 26412 +8
+ Misses 33940 33935 -5
- Partials 3436 3438 +2
Continue to review full report at Codecov.
|
Or perhaps the gains came mostly from the fact that the |
@guillep2k I think the size improvement is because removed |
@lunny I think so too, but I couldn't find anyone with a large index to try this change. If that's the case, then making the "unique" token optional makes no sense: it just should be removed. What you think? |
I also think that there is not much point in adding such things as configurable. We already have too many things in app.ini :) |
This PR solves #7825 by optionally removing the
unique
filter from the repository indexer analyzer.The thing is: the current behaviour directly tracks to #3452, where the
unique
filter was added for the sake of space saving. So, I guess there is a case where this is an acceptable trade-off. That's why I introduced an app.ini option (REPO_INDEXER_DEDUPLICATE
) to make the change optional to avoid introducing problems to users that count on this.However, I didn't notice much of a setback in the index size with this change in my tests (the resulting index was even a little smaller in my case):
unique
filter active (no patch): 26,935,296unique
filter removed (with this patch): 26,136,576I guess it really depends on the actual repository.
If the app.ini option is not desirable, I'll just remove it.