
UnicodeDecodeError for speedscope #256

Open
njgrisafi opened this issue Feb 20, 2025 · 3 comments

Comments

@njgrisafi

I generated an Austin output file with austin -g -o austin-latest.out pytest.

Then, when running austin2speedscope austin-latest.out austin.speedscope, I get the following error:

Traceback (most recent call last):
  File "/home/admin/workspace/app/.venv/bin/austin2speedscope", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/admin/workspace/app/.venv/lib/python3.11/site-packages/austin/format/speedscope.py", line 214, in main
    for line in fin:
  File "/home/admin/workspace/app/.venv/lib/python3.11/site-packages/austin/stats.py", line 434, in _
    for line in self._stream_iter:
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 2875: invalid start byte
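For context, the failure mode is easy to reproduce in isolation: a stray 0xa0 byte (for example a Latin-1 non-breaking space, or plain binary garbage) is not a valid UTF-8 start byte, which is exactly what the codec reports above. A minimal sketch:

```python
# A byte sequence containing a lone 0xa0 cannot be decoded as UTF-8.
data = b"some frame \xa0 name"
try:
    data.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(reason)  # invalid start byte
```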

Our Austin output is fairly large (~7 GB); I'm not sure if that has anything to do with it.

Running this on a Debian machine:

Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

Python version is 3.11.8

@P403n1x87
Owner

@njgrisafi it looks like the test suite is quite large and Austin is collecting a lot of samples at the default interval (100 microseconds between samples). I also see that you're using the -g option to get insight into the garbage collector. The problem is probably due to Austin collecting an invalid string from the frames (this is usually rare, but more likely in long-running jobs). I would suggest one of two things:

  1. Increase the sampling interval to reduce the number of samples collected
  2. Run only the more relevant tests for the profile analysis (e.g. those that are more likely to interact with the GC, if that's what's being investigated).
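For the first option, and assuming Austin's -i/--interval flag (sampling interval in microseconds), the invocation might look something like this; treat the exact flag spelling as an assumption to check against austin --help:

```shell
# Sample every 1000 us instead of the default 100 us,
# roughly a 10x reduction in the number of samples collected.
austin -g -i 1000 -o austin-latest.out pytest
```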

@dooferlad

I have experienced the same issue. The only way around it was to modify the stats reader to ignore the bad samples and continue.
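A rough sketch of that workaround, assuming the line-based (non-MOJO) output format: wrap the binary stream and silently drop any line that fails to decode, so one corrupted sample does not abort the whole conversion. The helper name is hypothetical and this is not the actual austin-python patch:

```python
import io
import typing as t


def tolerant_lines(stream: t.IO[bytes]) -> t.Iterator[str]:
    """Yield UTF-8 lines from a binary stream, skipping any line
    that is not valid UTF-8 (i.e. a corrupted sample)."""
    for raw in stream:
        try:
            yield raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # drop the corrupted sample and keep going


# Usage sketch: the middle line carries an invalid 0xa0 byte and is skipped.
sample = io.BytesIO(b"good line\n\xa0bad line\nanother good line\n")
print(list(tolerant_lines(sample)))  # ['good line\n', 'another good line\n']
```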

@mrexodia

mrexodia commented Mar 4, 2025

Ran into the same issue today as well. I added an event history buffer to track what is going on. The relevant data snippet, as hex:

02 B4 F2 05 00 31 36 62 66 35 37 30 30 30 00 05
94 80 80 9D AD CF 02 05 A7 83 80 9E A0 F8 02 05
9E 80 80 8B AD CF 02 05 B9 82 80 B2 C7 FC 02 09
B1 01 02 B4 F2 05 00 31 66 38 33 37 30 38 34 30
00 05 A1 83 80 C7 8B F8 02 05 BD 80 80 E0 D0 C0
03 05 94 84 80 80 D5 A0 03 05 A9 80 80 FF F0 D1
02 05 9E 86 80 C0 BE 80 03 05 96 80 80 93 FB D1
02 09 B1 01 02 B5 F2 05 00 31 66 38 33 37 30 38
34 30 00 0B B0 8E BA 33 30 86 50 1A 03 00 04 09
B1 01 02 B4 F2 05 00 31 36 64 66 36 66 30 30 30
00 05 94 80 80 9D AD CF 02 05 A7 83 80 9E A0 F8
02 05 9E 80 80 8B AD CF 02 05 84 80 80 DF A1 C0
02 09 B3 01 02 B4 F2 05 00 31 36 63 66 36 33 30
30 30 00 05 94 80 80 9D AD CF 02 05 A7 83 80 9E
A0 F8 02 05 9E 80 80 8B AD CF 02 05 84 80 80 DF
A1 C0 02 09 B3 01 02 B4 F2 05 00 31 36 62 66 35
37 30 30 30 00

My event history buffer (I log the offset of every event id encountered in a ring buffer):

  [0xcaa8cc] 2
  [0xcaa8db] 5
  [0xcaa8e3] 5
  [0xcaa8eb] 5
  [0xcaa8f3] 5
  [0xcaa8fb] 9
  [0xcaa8fe] 2
  [0xcaa90d] 5
  [0xcaa915] 5
  [0xcaa91d] 5
  [0xcaa925] 5
  [0xcaa92d] 5
  [0xcaa935] 5
  [0xcaa93d] 9
  [0xcaa940] 2
  [0xcaa94f] 11 <-- pretty confident this is not actually a string

My code changes:

    def read_string(self) -> str:
        """Read a string from the MOJO file."""
        encoded = self.read_until()
        try:
            return encoded.decode()
        except UnicodeDecodeError:
            encoded_offset = self._offset - len(encoded)
            print(f"[invalid string] offset: {hex(encoded_offset)}, data: {encoded.hex(' ')}")
            print("event history:")
            for offset, event_id in self._event_history:
                print(f"  [{hex(offset)}] {event_id}")
            raise

    def parse_event(self) -> t.Generator[t.Optional[MojoEvent], None, None]:
        """Parse a single event."""
        try:
            (event_id,) = self.read(1)
            self._event_history.append((self._offset, event_id))
        except ValueError:
            yield None
            return
        ...

I will just use the non-binary format for now, but hopefully this helps with debugging. My guess is that this is related to varint handling, but 🤷‍♂
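If the varint guess is right, a decoder along these lines is what would be consuming those multi-byte sequences. This is a generic unsigned little-endian base-128 (LEB128-style) sketch, not necessarily the exact MOJO wire format, which may handle sign or width bits differently:

```python
def read_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128-style varint starting at `offset`.

    Each byte contributes its low 7 bits, least significant group
    first; the high bit (0x80) signals a continuation byte.
    Returns (value, offset_after_varint).
    """
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


# e.g. the sequence 9D AD CF 02 from the dump above decodes as a
# single 4-byte varint under this scheme:
value, _ = read_varint(bytes([0x9D, 0xAD, 0xCF, 0x02]))
print(value)  # 5494429
```

If an off-by-one in the event loop lands the reader one byte into a varint like this, the remaining bytes get misinterpreted as a new event id followed by string data, which would produce exactly the kind of "not actually a string" read flagged in the event history.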
