feat: Scrubbing of UTF-16 strings in minidumps #742

flub · 2020-09-01T12:19:27Z

This first scrubs UTF-8 from the binary blobs, then scrubs UTF-16 from the remainder.

The functionality is there and mostly in the right order and shape.

This doesn't yet do anything with the strings. They are mostly Unicode garbage, but we don't care, as they also contain the readable parts, which is what we want to match on.
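A minimal sketch of what extracting UTF-16 strings from arbitrary memory can look like. This is not the code from this PR: the `Segment` struct, the `utf16le_segments` function, the `min_chars` threshold and the choice to end a run at a NUL are all assumptions made for illustration. It decodes aligned 2-byte little-endian units and keeps every run of characters that decoded without error.

```rust
/// A run of UTF-16LE code units that decoded without errors (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Segment {
    /// Byte offset of the first code unit of the run in the input.
    pub offset: usize,
    /// The decoded text of the run.
    pub decoded: String,
}

/// Pushes `current` as a segment if it is long enough, then resets it.
fn flush(current: &mut String, start: usize, min_chars: usize, out: &mut Vec<Segment>) {
    if !current.is_empty() && current.chars().count() >= min_chars {
        out.push(Segment { offset: start, decoded: std::mem::take(current) });
    } else {
        current.clear();
    }
}

/// Collects every run of at least `min_chars` validly decoded characters.
///
/// Decoding starts at byte 0 and reads aligned 2-byte units; a trailing odd
/// byte is ignored. A NUL or an unpaired surrogate ends the current run
/// (treating NUL as a terminator is an assumption of this sketch).
pub fn utf16le_segments(data: &[u8], min_chars: usize) -> Vec<Segment> {
    let units = data
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));

    let mut segments = Vec::new();
    let mut current = String::new();
    let mut start = 0; // byte offset where `current` began
    let mut pos = 0; // byte offset of the next undecoded unit

    for result in std::char::decode_utf16(units) {
        match result {
            Ok(c) if c != '\0' => {
                if current.is_empty() {
                    start = pos;
                }
                current.push(c);
                // A character outside the BMP occupies two units, i.e. four bytes.
                pos += c.len_utf16() * 2;
            }
            _ => {
                // A NUL or a decoding error consumes exactly one unit.
                flush(&mut current, start, min_chars, &mut segments);
                pos += 2;
            }
        }
    }
    flush(&mut current, start, min_chars, &mut segments);
    segments
}
```

Because almost any pair of bytes is a valid code unit, the segments returned for real memory will mostly be noise; the point is only that readable text survives inside them so a pattern can match it.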

This is a rough first sketch.

This now compiles without warnings.

This is starting to look nicer.

This gets rid of another vector by doing things inline in an iterator instead.

untitaker

preliminary review of unsafe code

relay-general/src/pii/attachments.rs

relay-wstring/src/lib.rs

Makes the tests easier to read. Most of these tests were written before the safe version existed...

flub · 2020-09-17T13:56:38Z

relay-general/src/pii/attachments.rs

+        // Off-by-one is devastating, nearly everything is valid unicode.
+        let segment = iter.next().unwrap();
+        assert_eq!(segment.decoded, "棘攀氀氀漀");


This test is here purely to demonstrate the limitations btw.
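To illustrate the limitation outside the test suite, here is a small sketch of why a one-byte misalignment is so damaging. The helper names and the sample buffer are assumptions for illustration, not code from this PR; the stray leading byte `0xd8` was picked arbitrarily. Read from the correct offset the buffer decodes to `hello`; read from one byte earlier, every code unit is still a valid (CJK) character, so nothing signals the error.

```rust
/// Decodes `data` as UTF-16LE starting at byte 0.
///
/// Invalid code units are replaced with U+FFFD; a trailing odd byte is ignored.
pub fn decode_utf16le_lossy(data: &[u8]) -> String {
    let units = data
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    std::char::decode_utf16(units)
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect()
}

/// Builds a sample buffer: one stray byte followed by "hello" in UTF-16LE.
pub fn sample_buffer() -> Vec<u8> {
    let mut data = vec![0xd8];
    for unit in "hello".encode_utf16() {
        data.extend_from_slice(&unit.to_le_bytes());
    }
    data
}
```

Decoding `&sample_buffer()[1..]` gives the intended text, while decoding `&sample_buffer()` pairs each ASCII byte with its neighbour's zero byte and produces the code units `0x68d8 0x6500 0x6c00 0x6c00 0x6f00`, none of which is a surrogate.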

They already are.

Changing these methods to have this restriction makes them easier to get correct, and it is all we need. This also avoids using undefined behaviour in WStr.
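A rough sketch of the restriction, assuming it refers to the padding character having to fit in a single UTF-16 code unit (as the commit title "Force padding characters to be encoded to a single code" suggests). The function `fill_utf16le` is hypothetical and not the PR's API. With a single-unit padding character, a region of any even length can be overwritten in place without changing its size; a character that needs a surrogate pair is rejected.

```rust
/// Overwrites a UTF-16LE region in place with `padding`.
///
/// Returns `false` and leaves the region untouched if `padding` needs more
/// than one UTF-16 code unit or if the region is not a whole number of units.
pub fn fill_utf16le(region: &mut [u8], padding: char) -> bool {
    if padding.len_utf16() != 1 || region.len() % 2 != 0 {
        return false;
    }
    let mut buf = [0u16; 1];
    padding.encode_utf16(&mut buf);
    let bytes = buf[0].to_le_bytes();
    for pair in region.chunks_exact_mut(2) {
        pair.copy_from_slice(&bytes);
    }
    true
}
```

Returning a `bool` is just one way to surface the restriction; a constructor that validates the padding character once would work equally well.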

Also, break when we're done.

flub · 2020-09-18T13:33:54Z

ah now I get how you wanted to share code. I thought you wanted to always match on a String at some point

Well, yes that was my intention when I first mentioned this. But then I realised this could also work fairly elegantly. It avoids the large allocations of the Strings in return for having to allocate the SmallVec of matches. I can't really decide which approach is better.

jan-auer

Thanks, @flub! This is excellent, especially how the string search and UTF-16 handling is separated out into reviewable and maintainable units. Also, I agree with Iterator as a choice for string search, even though it requires unsafe internally.

Reviewed everything but the transmutes in MutSegmentIter. See comments below.

relay-wstring/Cargo.toml

jan-auer · 2020-09-18T13:05:36Z

relay-wstring/src/lib.rs

@@ -0,0 +1,512 @@
+//! A UTF-16 little-endian string type.
+//!
+//! The main type in this crate is [WStr] which is a type similar to [str] but with the


Since we still need to generate docs with stable, can you please replace all these with proper links and quoting? Currently, it will render like this on our docs page:

relay-wstring/src/lib.rs

relay-general/src/pii/attachments.rs

It's a const fn so comes for free and it's more readable this way.
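For context, a small sketch of using `std::mem::size_of::<u16>()` in place of a literal `2` when converting between code-unit and byte offsets. The helper names are made up for this example and are not taken from the PR.

```rust
use std::mem::size_of;
use std::ops::Range;

/// Converts an offset counted in UTF-16 code units into a byte offset.
///
/// `size_of` is a `const fn`, so this is evaluated at compile time when the
/// argument is a constant.
pub const fn unit_to_byte_offset(units: usize) -> usize {
    units * size_of::<u16>()
}

/// Converts a range counted in code units into the matching byte range.
pub fn units_to_bytes(units: Range<usize>) -> Range<usize> {
    unit_to_byte_offset(units.start)..unit_to_byte_offset(units.end)
}
```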

And hard to avoid.

And a lot of doc comments.

This inlines a lot more, doing so where the stdlib also does this for str.

We now have a beautifully failing test. It's related but unrelated. It's interesting.

Now that we decided to stay with the BytesRegex matching, we won't be parametrising this on the encoding (at least not until we add support for other encodings). So it's cleaner for the iterator to handle the unsafe WStr stuff rather than push it to the user.

This does the same magic "please interpret these bits over here as the same type but with a different lifetime" as the transmute. But with the same type being forced by using more words and a custom function instead of a turbofish. Thus avoiding the word which shall not be used.
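A sketch of the two spellings being contrasted, reduced to a byte slice. Both function names are hypothetical and neither is copied from the PR. The first version changes only the lifetime by going through a raw pointer, and its signature pins the type to `&mut [u8]` on both sides; the second does the same thing with `transmute` and a turbofish.

```rust
/// Re-borrows a mutable slice with a caller-chosen lifetime (hypothetical helper).
///
/// # Safety
///
/// The caller must guarantee that the underlying data outlives `'long` and
/// that no other reference to it is used while the returned one is alive.
pub unsafe fn extend_lifetime<'short, 'long>(slice: &'short mut [u8]) -> &'long mut [u8] {
    // The signature fixes the type; only the lifetime can differ.
    unsafe { &mut *(slice as *mut [u8]) }
}

/// The same operation spelled with `transmute`, shown for comparison.
///
/// # Safety
///
/// Same requirements as `extend_lifetime`.
pub unsafe fn extend_lifetime_transmute<'short, 'long>(
    slice: &'short mut [u8],
) -> &'long mut [u8] {
    unsafe { std::mem::transmute::<&'short mut [u8], &'long mut [u8]>(slice) }
}
```

The usual reason to need this in a mutable iterator is that `next` only has a short `&mut self` borrow, while the items it yields should be tied to the underlying buffer. Neither spelling is safer than the other; the helper merely makes it harder to change the type by accident.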

jan-auer

G2G. Two final nit-picks below, aside from the doc comment link situation. Excited about shipping this!

relay-general/src/pii/attachments.rs

If UTF-8 scrubbing modified any region, exclude that region from UTF-16 scrubbing, since the modified bytes might now be more likely to match as UTF-16.
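A simplified sketch of the two-pass idea with the exclusion. It is not the PR's implementation: the real code matches with regexes, whereas this stand-in searches for a literal needle, and all function names are assumptions. Pass one reports the byte ranges it modified; pass two skips any UTF-16LE match that overlaps one of them.

```rust
use std::ops::Range;

/// Finds all non-overlapping occurrences of `pattern` in `data`.
fn find_all(data: &[u8], pattern: &[u8]) -> Vec<Range<usize>> {
    let mut found = Vec::new();
    let mut pos = 0;
    while !pattern.is_empty() && pos + pattern.len() <= data.len() {
        if &data[pos..pos + pattern.len()] == pattern {
            found.push(pos..pos + pattern.len());
            pos += pattern.len();
        } else {
            pos += 1;
        }
    }
    found
}

/// Pass 1: overwrites UTF-8 occurrences of `needle` with `*` and returns
/// the byte ranges that were modified.
pub fn scrub_utf8(data: &mut [u8], needle: &str) -> Vec<Range<usize>> {
    let ranges = find_all(data, needle.as_bytes());
    for range in &ranges {
        data[range.clone()].fill(b'*');
    }
    ranges
}

/// Pass 2: overwrites UTF-16LE occurrences of `needle` with `*`, skipping
/// matches that overlap a range in `skip`. Returns the number of matches scrubbed.
pub fn scrub_utf16le(data: &mut [u8], needle: &str, skip: &[Range<usize>]) -> usize {
    let pattern: Vec<u8> = needle
        .encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect();
    let mut scrubbed = 0;
    for range in find_all(data, &pattern) {
        let overlaps = skip.iter().any(|s| s.start < range.end && range.start < s.end);
        if overlaps {
            continue;
        }
        for pair in data[range].chunks_exact_mut(2) {
            pair.copy_from_slice(&[b'*', 0]);
        }
        scrubbed += 1;
    }
    scrubbed
}
```

The exclusion matters because the output of the first pass is itself just bytes: a run of `*` (`0x2a 0x2a`) reads as U+2A2A when interpreted as UTF-16LE, so a second pass without the skip list could match text that the first pass wrote.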

CHANGELOG.md

oops

* master:
  feat: Scrubbing of UTF-16 strings in minidumps (#742)
  meta: Update CI badges (#782)
  fix(pii): Fix issue where `$span` would not be recognized in Advanced Data Scrubbing (#781)
  ci: Port CI to GitHub Actions (#780)
  fix(setup): Log when reporting to Sentry is disabled (#779)

Basic UTF-16 string extraction from random memory

d3afd9c

This doesn't yet do anything with the strings. They are mostly Unicode garbage, but we don't care, as they also contain the readable parts, which is what we want to match on.

untitaker mentioned this pull request Sep 9, 2020

ref: Remove unused redaction options #760

Merged

Floris Bruynooghe added 14 commits September 9, 2020 14:31

implement scrubbing of utf16 bytes

80c943b

This is a rough first sketch.

Add unittests, fix bug to count u16 as 2 bytes

ef15805

Merge branch 'master' into feat/utf16-scrub

6f4a18e

Fixup for simplified configs in master

dddaaa2

Add a WStr UTF-16LE type

a7a4bf3

Add a safe constructor for WStr

5961d9c

Move wstring to its own crate

23a010f

Clean up the public interface

79a5112

This now compiles without warnings.

Make clippy happy

b5621df

Fixup docs

5117d58

Rename to relay-wstring for convenience

f5fe61a

Split off SliceIndex to its own module

dea4279

Back to an iterator, this is much nicer

9c58cb1

Make this work again, using iteration and WStr

cc62be2

This is starting to look nicer.

flub changed the title ~~Basic UTF-16 string extraction from random memory~~ Scrubbing of UTF-16 strings in minidumps Sep 17, 2020

flub changed the title ~~Scrubbing of UTF-16 strings in minidumps~~ feat: Scrubbing of UTF-16 strings in minidumps Sep 17, 2020

Some clippy fixes

4299949

flub requested a review from jan-auer September 17, 2020 07:30

Break out scrubbing of matches to a function

db47b23

This gets rid of another vector by doing things inline in an iterator instead.

untitaker requested changes Sep 17, 2020


Floris Bruynooghe added 2 commits September 17, 2020 15:13

Do not needlessly use unsafe in tests

29212d9

Makes the tests easier to read. Most of these tests were written before the safe version existed...

Explicit lifetimes transmute

ee548ff

flub commented Sep 17, 2020


Floris Bruynooghe added 5 commits September 17, 2020 16:09

Trade off unsafe for more if statements and unwrap

ad1180f

Implement FusedIterator for char iters

a3344ce

They already are.
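A sketch of what marking a char iterator as fused can look like, assuming "They already are" means the iterators already kept returning `None` once exhausted. `Utf16LeChars` is a hypothetical type for this example, not the iterator from `relay-wstring`. The `FusedIterator` impl is an empty marker; the guarantee comes from `next` clearing the remaining input whenever it returns `None`.

```rust
use std::iter::FusedIterator;

/// Iterates over the characters of a UTF-16LE byte slice (hypothetical type).
///
/// Iteration stops at the first invalid or incomplete code unit.
pub struct Utf16LeChars<'a> {
    rest: &'a [u8],
}

impl<'a> Utf16LeChars<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Utf16LeChars { rest: data }
    }

    /// Takes the next code unit, or exhausts the iterator if fewer than two bytes remain.
    fn take_unit(&mut self) -> Option<u16> {
        if self.rest.len() < 2 {
            self.rest = &[];
            return None;
        }
        let (head, tail) = self.rest.split_at(2);
        self.rest = tail;
        Some(u16::from_le_bytes([head[0], head[1]]))
    }
}

impl<'a> Iterator for Utf16LeChars<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let first = self.take_unit()?;
        let decoded = match first {
            // A leading surrogate needs a second unit to form a character.
            0xD800..=0xDBFF => {
                let second = self.take_unit()?;
                std::char::decode_utf16([first, second].iter().copied()).next()
            }
            _ => std::char::decode_utf16(std::iter::once(first)).next(),
        };
        match decoded {
            Some(Ok(c)) => Some(c),
            _ => {
                // Drop the remaining input so every later call also returns None.
                self.rest = &[];
                None
            }
        }
    }
}

// Marker only: `next` above already returns `None` forever once it has done so once.
impl FusedIterator for Utf16LeChars<'_> {}
```

With the marker in place, `Iterator::fuse` on this type can skip its own bookkeeping, and generic code that requires `FusedIterator` accepts it.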

Force padding characters to be encoded to a single code

0efd108

Changing these methods to have this restriction makes them easier to get correct, and it is all we need. This also avoids using undefined behaviour in WStr.

Use zip to reduce iterations some more

68ec570

Also, break when we're done.

Remove obsolete comment

8c6546f

jan-auer requested changes Sep 18, 2020


Floris Bruynooghe added 12 commits September 18, 2020 16:32

Wire up docs and logging for new crate

0d624e2

Use std::mem::size_of::<u16>() directly

a68d2a1

It's a const fn so comes for free and it's more readable this way.

Inits are fine apparently

9d2a48d

And hard to avoid.

Back to fast unsafe code

3e30947

Replace some magic numbers with helper functions

3c05277

And a lot of doc comments.

Doc comments in third person

aa103da

Inline a lot more, taking cues from the stdlib

47917e1

This inlines a lot more, doing so where the stdlib also does this for str.

More elegant iterator usage

b2e5bf5

Correct import blocks

66d3b2b

Fix full matches

786707a

We now have a beautifully failing test. It's related but unrelated. It's interesting.

Merge WStr into the SegmentIter

ceb8d76

Now that we decided to stay with the BytesRegex matching, we won't be parametrising this on the encoding (at least not until we add support for other encodings). So it's cleaner for the iterator to handle the unsafe WStr stuff rather than push it to the user.

jan-auer approved these changes Sep 18, 2020


Floris Bruynooghe added 5 commits September 21, 2020 08:50

I really do want a regex dear clippy

ddfca69

Use const instead of static

8600065

Do not scrub regions twice

802a487

If UTF-8 scrubbing modified any region, exclude that region from UTF-16 scrubbing, since the modified bytes might now be more likely to match as UTF-16.

Merge branch 'master' into feat/utf16-scrub

a0d6512

Changelog entry

802f767

flub requested review from jan-auer and untitaker September 21, 2020 08:57

untitaker approved these changes Sep 21, 2020


untitaker reviewed Sep 21, 2020


Fix changelog

7d5942f

oops

jan-auer approved these changes Sep 21, 2020


Merge branch 'master' into feat/utf16-scrub

76f40d6

flub merged commit 1f1eec9 into master Sep 21, 2020

flub deleted the feat/utf16-scrub branch September 21, 2020 12:02
