Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Scrubbing of UTF-16 strings in minidumps #742

Merged
merged 47 commits into from
Sep 21, 2020
Merged

Conversation

flub
Copy link
Contributor

@flub flub commented Sep 1, 2020

This first scrubs UTF-8 from the binary blobs, than scrubs UTF-16 from the remainder.

The functionality is there and mostly in the right order and shape.

  • Refactor to avoid matches vector
  • Maybe implement StringMods for UTF-8/str
  • Maybe merge apply_regex_to_*_slices
  • Merge master
  • Add changelog

This doens't yet do anything with the strings, they are also mostly
unicode garbage but we don't care as they also contain the readable
parts which is what we want to match on.
@flub flub changed the title Basic UTF-16 string extraction from random memory Scrubbing of UTF-16 strings in minidumps Sep 17, 2020
@flub flub changed the title Scrubbing of UTF-16 strings in minidumps feat: Scrubbing of UTF-16 strings in minidumps Sep 17, 2020
@flub flub requested a review from jan-auer September 17, 2020 07:30
This gets rid of another vector doing things inline in an iterator instead.
Copy link
Member

@untitaker untitaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preliminary review of unsafe code

Floris Bruynooghe added 2 commits September 17, 2020 15:13
Makes the tests easier to read.  Most of these tests were written
before the safe version existed...
Comment on lines +767 to +769
// Off-by-one is devastating, nearly everything is valid unicode.
let segment = iter.next().unwrap();
assert_eq!(segment.decoded, "棘攀氀氀漀");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is here purely to demonstrate the limitations btw.

Floris Bruynooghe added 5 commits September 17, 2020 16:09
Changing these methods to have this restriction is makes them easier
to be correct, and it is all we need.  This also avoid using undefined
behaviour in WStr.
Also, break when we're done.
@flub
Copy link
Contributor Author

flub commented Sep 18, 2020

ah now I get how you wanted to share code. I thought you wanted to always match on a String at some point

Well, yes that was my intention when I first mentioned this. But then I realised this could also work fairly elegantly. It avoids the large allocations of the Strings in return for having to allocate the SmallVec of matches. I can't really decide which approach is better.

Copy link
Member

@jan-auer jan-auer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @flub! This is excellent, especially how the string search and UTF-16 handling is separated out into reviewable and maintainable units. Also, I agree with Iterator as a choice for string search, even though it requires unsafe internally.

Reviewed everything but the transmutes in MutSegmentIter. See comments below.

@@ -0,0 +1,512 @@
//! A UTF-16 little-endian string type.
//!
//! The main type in this crate is [WStr] which is a type similar to [str] but with the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we still need to generate docs with stable, can you please replace all these with proper links and quoting? Currently, it will render like this on our docs page:

image

Floris Bruynooghe added 12 commits September 18, 2020 16:32
It's a const fn so comes for free and it's more readable this way.
And hard to avoid.
This inlines a lot more, doing so where the stdlib also does this for str.
We now have a beautifully failing tests.  It's related but unrelated.
It's interesting.
Now we decided to stay with the BytesRegex matching we won't be
parametrising this on the encoding (at least not until we add support
for other encodings).  So it's cleaner for the iterator to handle the
unsafe WStr stuff rather than push it to the user.
This does the same magic "please interpret these bits over here as the
same type but with a different lifetime" as the transmute.  But with
the same type being forced by using more words and a custom function
instead of a turbofish.  Thus avoiding the word which shall not be
used.
Copy link
Member

@jan-auer jan-auer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

G2G. Two final nit-picks below, aside from the doc comment link situation. Excited about shipping this!

Floris Bruynooghe added 5 commits September 21, 2020 08:50
If UTF-8 modified any region, exclude that from UTF-16 scrubbing,
since it might now more likely match UTF-16.
@flub flub requested review from jan-auer and untitaker September 21, 2020 08:57
@flub flub merged commit 1f1eec9 into master Sep 21, 2020
@flub flub deleted the feat/utf16-scrub branch September 21, 2020 12:02
jan-auer added a commit that referenced this pull request Sep 21, 2020
* master:
  feat: Scrubbing of UTF-16 strings in minidumps (#742)
  meta: Update CI badges (#782)
  fix(pii): Fix issue where `$span` would not be recognized in Advanced Data Scrubbing (#781)
  ci: Port CI to GitHub Actions (#780)
  fix(setup): Log when reporting to Sentry is disabled (#779)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants