
UnicodeDecodeError for speedscope #256

Open
njgrisafi opened this issue Feb 20, 2025 · 3 comments

Comments

@njgrisafi

I generated an Austin output file with austin -g -o austin-latest.out pytest.

Then, when running austin2speedscope austin-latest.out austin.speedscope, I get the following error:

Traceback (most recent call last):
  File "/home/admin/workspace/app/.venv/bin/austin2speedscope", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/admin/workspace/app/.venv/lib/python3.11/site-packages/austin/format/speedscope.py", line 214, in main
    for line in fin:
  File "/home/admin/workspace/app/.venv/lib/python3.11/site-packages/austin/stats.py", line 434, in _
    for line in self._stream_iter:
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 2875: invalid start byte
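For context, the failure mode is easy to reproduce in isolation: a stray 0xa0 byte (for example a Latin-1 non-breaking space, or plain binary garbage) is not a valid UTF-8 start byte, which is exactly what the codec reports above. A minimal sketch:

```python
# A byte sequence containing a lone 0xa0 cannot be decoded as UTF-8.
data = b"some frame \xa0 name"
try:
    data.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(reason)  # invalid start byte
```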

Our Austin output is fairly large (~7 GB); I'm not sure if that has anything to do with it.

Running this on a Debian machine:

Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

Python version is 3.11.8

@P403n1x87
Owner

@njgrisafi it looks like the test suite is quite large and Austin is collecting a lot of samples at the default interval (100 microseconds between samples). I also see that you're using the -g option to get insight into the garbage collector. The problem is probably due to Austin collecting an invalid string from the frames (this is usually rare, but more likely in long-running jobs). I would suggest one of two things:

  1. Increase the sampling interval to reduce the number of samples collected
  2. Run only the more relevant tests for the profile analysis (e.g. those that are more likely to interact with the GC, if that's what's being investigated).
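For the first option, and assuming Austin's -i/--interval flag (sampling interval in microseconds), the invocation might look something like this; treat the exact flag spelling as an assumption to check against austin --help:

```shell
# Sample every 1000 us instead of the default 100 us,
# roughly a 10x reduction in the number of samples collected.
austin -g -i 1000 -o austin-latest.out pytest
```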

@dooferlad

I have experienced the same issue. The only way around it was to modify the stats reader to ignore the bad samples and continue.
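A rough sketch of that workaround, assuming the line-based (non-MOJO) output format: wrap the binary stream and silently drop any line that fails to decode, so one corrupted sample does not abort the whole conversion. The helper name is hypothetical and this is not the actual austin-python patch:

```python
import io
import typing as t


def tolerant_lines(stream: t.IO[bytes]) -> t.Iterator[str]:
    """Yield UTF-8 lines from a binary stream, skipping any line
    that is not valid UTF-8 (i.e. a corrupted sample)."""
    for raw in stream:
        try:
            yield raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # drop the corrupted sample and keep going


# Usage sketch: the middle line carries an invalid 0xa0 byte and is skipped.
sample = io.BytesIO(b"good line\n\xa0bad line\nanother good line\n")
print(list(tolerant_lines(sample)))  # ['good line\n', 'another good line\n']
```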

@mrexodia

mrexodia commented Mar 4, 2025

Ran into the same issue today as well. I added an event history buffer to track what is going on. The relevant data snippet, as hex:

02 B4 F2 05 00 31 36 62 66 35 37 30 30 30 00 05
94 80 80 9D AD CF 02 05 A7 83 80 9E A0 F8 02 05
9E 80 80 8B AD CF 02 05 B9 82 80 B2 C7 FC 02 09
B1 01 02 B4 F2 05 00 31 66 38 33 37 30 38 34 30
00 05 A1 83 80 C7 8B F8 02 05 BD 80 80 E0 D0 C0
03 05 94 84 80 80 D5 A0 03 05 A9 80 80 FF F0 D1
02 05 9E 86 80 C0 BE 80 03 05 96 80 80 93 FB D1
02 09 B1 01 02 B5 F2 05 00 31 66 38 33 37 30 38
34 30 00 0B B0 8E BA 33 30 86 50 1A 03 00 04 09
B1 01 02 B4 F2 05 00 31 36 64 66 36 66 30 30 30
00 05 94 80 80 9D AD CF 02 05 A7 83 80 9E A0 F8
02 05 9E 80 80 8B AD CF 02 05 84 80 80 DF A1 C0
02 09 B3 01 02 B4 F2 05 00 31 36 63 66 36 33 30
30 30 00 05 94 80 80 9D AD CF 02 05 A7 83 80 9E
A0 F8 02 05 9E 80 80 8B AD CF 02 05 84 80 80 DF
A1 C0 02 09 B3 01 02 B4 F2 05 00 31 36 62 66 35
37 30 30 30 00

My event history buffer (I log the offset of every event id encountered in a ring buffer):

  [0xcaa8cc] 2
  [0xcaa8db] 5
  [0xcaa8e3] 5
  [0xcaa8eb] 5
  [0xcaa8f3] 5
  [0xcaa8fb] 9
  [0xcaa8fe] 2
  [0xcaa90d] 5
  [0xcaa915] 5
  [0xcaa91d] 5
  [0xcaa925] 5
  [0xcaa92d] 5
  [0xcaa935] 5
  [0xcaa93d] 9
  [0xcaa940] 2
  [0xcaa94f] 11 <-- pretty confident this is not actually a string

My code changes:

    def read_string(self) -> str:
        """Read a string from the MOJO file."""
        encoded = self.read_until()
        try:
            return encoded.decode()
        except UnicodeDecodeError:
            encoded_offset = self._offset - len(encoded)
            print(f"[invalid string] offset: {hex(encoded_offset)}, data: {encoded.hex(' ')}")
            print("event history:")
            for offset, event_id in self._event_history:
                print(f"  [{hex(offset)}] {event_id}")
            raise

    def parse_event(self) -> t.Generator[t.Optional[MojoEvent], None, None]:
        """Parse a single event."""
        try:
            (event_id,) = self.read(1)
            self._event_history.append((self._offset, event_id))
        except ValueError:
            yield None
            return
        ...

I will just use the non-binary format for now, but hopefully this helps with debugging. My guess is that this is related to varint handling, but 🤷‍♂
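If the varint guess is right, a decoder along these lines is what would be consuming those multi-byte sequences. This is a generic unsigned little-endian base-128 (LEB128-style) sketch, not necessarily the exact MOJO wire format, which may handle sign or width bits differently:

```python
def read_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128-style varint starting at `offset`.

    Each byte contributes its low 7 bits, least significant group
    first; the high bit (0x80) signals a continuation byte.
    Returns (value, offset_after_varint).
    """
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


# e.g. the sequence 9D AD CF 02 from the dump above decodes as a
# single 4-byte varint under this scheme:
value, _ = read_varint(bytes([0x9D, 0xAD, 0xCF, 0x02]))
print(value)  # 5494429
```

If an off-by-one in the event loop lands the reader one byte into a varint like this, the remaining bytes get misinterpreted as a new event id followed by string data, which would produce exactly the kind of "not actually a string" read flagged in the event history.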
