-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Scrubbing of UTF-16 strings in minidumps #742
Conversation
This doens't yet do anything with the strings, they are also mostly unicode garbage but we don't care as they also contain the readable parts which is what we want to match on.
This is a rough first scetch.
This now compiles without warnings.
This is starting to look nicer.
This gets rid of another vector doing things inline in an iterator instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
preliminary review of unsafe code
Makes the tests easier to read. Most of these tests were written before the safe version existed...
// Off-by-one is devastating, nearly everything is valid unicode. | ||
let segment = iter.next().unwrap(); | ||
assert_eq!(segment.decoded, "棘攀氀氀漀"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is here purely to demonstrate the limitations btw.
They already are.
Changing these methods to have this restriction is makes them easier to be correct, and it is all we need. This also avoid using undefined behaviour in WStr.
Also, break when we're done.
Well, yes that was my intention when I first mentioned this. But then I realised this could also work fairly elegantly. It avoids the large allocations of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @flub! This is excellent, especially how the string search and UTF-16 handling is separated out into reviewable and maintainable units. Also, I agree with Iterator
as a choice for string search, even though it requires unsafe internally.
Reviewed everything but the transmutes in MutSegmentIter
. See comments below.
@@ -0,0 +1,512 @@ | |||
//! A UTF-16 little-endian string type. | |||
//! | |||
//! The main type in this crate is [WStr] which is a type similar to [str] but with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a const fn so comes for free and it's more readable this way.
And hard to avoid.
And a lot of doc comments.
This inlines a lot more, doing so where the stdlib also does this for str.
We now have a beautifully failing tests. It's related but unrelated. It's interesting.
Now we decided to stay with the BytesRegex matching we won't be parametrising this on the encoding (at least not until we add support for other encodings). So it's cleaner for the iterator to handle the unsafe WStr stuff rather than push it to the user.
This does the same magic "please interpret these bits over here as the same type but with a different lifetime" as the transmute. But with the same type being forced by using more words and a custom function instead of a turbofish. Thus avoiding the word which shall not be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
G2G. Two final nit-picks below, aside from the doc comment link situation. Excited about shipping this!
If UTF-8 modified any region, exclude that from UTF-16 scrubbing, since it might now more likely match UTF-16.
oops
This first scrubs UTF-8 from the binary blobs, than scrubs UTF-16 from the remainder.
The functionality is there and mostly in the right order and shape.
StringMods
for UTF-8/str
apply_regex_to_*_slices