Repository indexer clogs with file with multi-byte character sets #7809
Comments
I think some kind of encoding fallback could be used, perhaps pre-set in app.ini.
Sounds like a bug in Bleve to me.
I think that, just as we currently detect the encoding and convert to UTF-8 for display, we need to do the same before giving content to Bleve.
@silverwind, it's more about the way it's being used, but yes, it's not very robust when invalid data is presented to it. This is the current set of filters instantiated in Gitea for the repositories:
And then the queue is filled with:
The indexer is passed the original data nonchalantly, even if it's binary. This code was probably copied from the issue indexer, and issue texts are always UTF-8 encoded. I agree with @lafriks: detecting the encoding is the way to go, but that only goes so far. I'd add a filter to deal with invalid cases, because if one invalid code point gets through, the index gets filled with weird data. I'll try to look into this in a couple of days. I'm very glad I've finally found the reason my indexes were only partially useful.
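To make the "detect, convert, and filter invalid cases" idea concrete, here is a minimal sketch using only the Go standard library. It is not the code in Gitea or in the eventual fix: `toIndexableUTF8` is a hypothetical helper, the NUL-byte check is an assumed heuristic for binary content, and the Latin1 fallback is an assumption standing in for real charset detection.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// toIndexableUTF8 is a hypothetical helper (not Gitea's actual code).
// It returns text that is valid UTF-8 and therefore safe to hand to a
// full-text indexer, and false when the content looks binary.
func toIndexableUTF8(content []byte) (string, bool) {
	// Assumption: a NUL byte means binary data that should be skipped.
	if bytes.IndexByte(content, 0) >= 0 {
		return "", false
	}
	// Already valid UTF-8: pass it through unchanged.
	if utf8.Valid(content) {
		return string(content), true
	}
	// Assumed fallback: treat every byte as Latin1 (ISO-8859-1), where
	// the byte value equals the Unicode code point.
	runes := make([]rune, len(content))
	for i, b := range content {
		runes[i] = rune(b)
	}
	return string(runes), true
}

func main() {
	// Sample Latin1 content; 0xE9 is "é" in Latin1 and invalid in UTF-8 here.
	latin1 := []byte("sailorvenus caf\xe9 sailormoon")
	text, ok := toIndexableUTF8(latin1)
	fmt.Println(ok, utf8.ValidString(text), text)
}
```

A real implementation would probably use proper charset detection rather than assuming Latin1, but the shape is the same: nothing reaches the indexer unless it is valid UTF-8.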
Fixed by #7814
Description
When using the repository indexer, files with multi-byte character sets don't get correctly indexed. This happens when characters look like valid UTF-8 code points but are not. Once a bad sequence is encountered, the rest of the file is indexed as a single token; e.g. if the file is 100KB and the bad sequence is in the middle of it, the indexer gets the first half of the file OK, and the rest as one "word" which is 50KB long (and certainly not searchable).
To reproduce this issue, files with the following content can be tested using UTF-8 and Latin1 character sets:
Note: to test properly, the files must be committed through git, not Gitea's web interface.
Searching for

sailorvenus
brings results, as it is the first word. In the Latin1 encoded file the rest of the context is garbled.
Searching for

sailormoon
doesn't bring results from the Latin1 encoded file, as the indexing for the rest of the file is garbled.