Ensure the `read_to_end` buffer always has enough room to fit a single UTF-8 code point #142872

ChrisDenton · 2025-06-22T14:16:45Z

This is a quick fix to resolve #142847. So long as the buffer we read into has space for at least one UTF-8 code point, we can avoid any issues caused by splitting a code point across reads (thus avoiding a lot of complexity). I think this is also good in general, as it avoids some unnecessarily short reads.
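
To make the constraint concrete, here is a minimal sketch (not the std implementation) of transcoding UTF-16 into a bounded amount of UTF-8 without ever splitting a code point. The helper name `transcode_whole_code_points` and the local `MAX_LEN_UTF8` constant are assumptions made for illustration only.

```rust
/// Local stand-in (assumption) for the maximum UTF-8 length of one code point.
const MAX_LEN_UTF8: usize = 4;

/// Encodes `units` (UTF-16) as UTF-8, writing at most `spare` bytes and
/// stopping before the first code point that would not fit whole.
/// Unpaired surrogates are replaced with U+FFFD.
fn transcode_whole_code_points(units: &[u16], spare: usize) -> Vec<u8> {
    let mut out = Vec::new();
    let chars = char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER));
    for c in chars {
        if out.len() + c.len_utf8() > spare {
            // The next code point would be split, so stop here.
            break;
        }
        let mut tmp = [0u8; MAX_LEN_UTF8];
        out.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
    }
    out
}
```

With fewer than four spare bytes, a code point outside the Basic Multilingual Plane (two UTF-16 units, four UTF-8 bytes) cannot be written at all, so the call makes no progress. Guaranteeing at least four spare bytes rules that case out.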

rustbot · 2025-06-22T14:16:50Z

r? @ibraheemdev

rustbot has assigned @ibraheemdev.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

a1phyr · 2025-06-22T20:30:28Z

library/std/src/io/mod.rs

+        // Ensure there's at least enough space for one UTF-8 encoded code point.
+        if buf.spare_capacity_mut().len() < char::MAX_LEN_UTF8 {
            // buf is full, need more space
            buf.try_reserve(PROBE_SIZE)?;
        }


I think that this condition could even be replaced by an unconditional call to try_reserve(PROBE_SIZE).

That might lead to some unnecessary reallocations around some tipping-points, see #89165 (also mentioned in the function's top comment). It's tricky to get these tradeoffs right.

Maybe we can limit this to cfg(windows)? Specialization could work too but feels a bit overkill here.

Also, please add the issue to the list of masters this function is serving. Anyone who wants to touch it will have to keep track of all these things.

the8472 · 2025-06-23T08:19:23Z

Since default_read_to_end is crate-private, we could also add a "min read capacity" option, which can then be passed down from the stdin impl.

ChrisDenton · 2025-06-23T08:41:11Z

My reason for thinking this is more generally useful is that currently, if a read is just short of the available buffer, then the next read will have a buffer size of one or two bytes. If we've not reached the end, then this is wasteful.

Another alternative is to use small_probe_read, which at least is better than using a one byte buffer or so.
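
As an illustration of the short-read problem described above, the following sketch records the buffer sizes a reader is offered when a loop only ever reads into the Vec's existing spare capacity. `ChunkedReader` and `fill_spare_capacity` are hypothetical names, and the loop is a deliberately naive stand-in, not the std code.

```rust
use std::io::{self, Read};

/// Hypothetical reader: returns at most `chunk` bytes per call and records
/// the length of every buffer it is offered.
struct ChunkedReader {
    remaining: usize,
    chunk: usize,
    offered: Vec<usize>,
}

impl Read for ChunkedReader {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        self.offered.push(out.len());
        let n = out.len().min(self.chunk).min(self.remaining);
        out[..n].fill(0xAB);
        self.remaining -= n;
        Ok(n)
    }
}

/// Naive loop: read only into the existing spare capacity, never growing.
fn fill_spare_capacity<R: Read>(r: &mut R, buf: &mut Vec<u8>) -> io::Result<()> {
    while buf.len() < buf.capacity() {
        let len = buf.len();
        let cap = buf.capacity();
        buf.resize(cap, 0);
        let n = match r.read(&mut buf[len..]) {
            Ok(n) => n,
            Err(e) => {
                buf.truncate(len);
                return Err(e);
            }
        };
        buf.truncate(len + n);
        if n == 0 {
            break;
        }
    }
    Ok(())
}
```

If the first read comes back two bytes short of the capacity, the second call is offered a two-byte buffer, which is the wasteful case being discussed.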

the8472 · 2025-06-23T09:40:47Z

The user may have passed a buffer or size hint with exactly the right size and block-sizes just happen to align so that the last read will be small. In that case it might lead to a huge reallocation just to read the last few bytes.
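
A small sketch of that concern, assuming a `PROBE_SIZE` of 32 (the constant's value here is an assumption) and a hypothetical helper `capacity_after_reserve`: an unconditional `try_reserve(PROBE_SIZE)` is a no-op when enough spare capacity exists, but forces a reallocation of the whole buffer when only a few bytes are left.

```rust
/// Assumed probe size for this sketch.
const PROBE_SIZE: usize = 32;

/// Builds a Vec with roughly `cap` bytes of capacity, fills it so that
/// `spare` bytes remain, then calls `try_reserve(PROBE_SIZE)`.
/// Returns the capacity before and after the call.
fn capacity_after_reserve(cap: usize, spare: usize) -> (usize, usize) {
    let mut buf: Vec<u8> = Vec::with_capacity(cap);
    let before = buf.capacity();
    buf.resize(before - spare, 0);
    buf.try_reserve(PROBE_SIZE).expect("allocation failed");
    (before, buf.capacity())
}
```

For a buffer the caller sized exactly, the growth triggered by the last few bytes is typically amortized (much larger than `PROBE_SIZE`), which is why the tradeoff is delicate.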

ChrisDenton · 2025-06-23T09:49:24Z

Right but reading into a stack buffer would side-step that issue. We could also unconditionally realloc if we've already reallocated at least once.

the8472 · 2025-06-23T09:55:23Z

yeah, that could work.

ChrisDenton · 2025-06-23T10:51:23Z

Ok, I'd appreciate someone checking my logic here, but I've changed it so that if we've not reallocated yet and there are fewer than PROBE_SIZE bytes in the spare capacity, then we keep using the stack buffer until either we reach the end or we need to reallocate. Otherwise we just reserve the space directly in the Vec.
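
The logic described above could be sketched roughly as follows. This is a simplified approximation written for this thread, not the actual `default_read_to_end`: it omits the adaptive read sizing and the `read_buf` machinery, and the names `read_retrying` and `read_to_end_sketch` as well as the `PROBE_SIZE` value are assumptions.

```rust
use std::io::{self, Read};

/// Assumed probe size for this sketch.
const PROBE_SIZE: usize = 32;

/// Calls `read`, retrying on `Interrupted`.
fn read_retrying<R: Read + ?Sized>(r: &mut R, out: &mut [u8]) -> io::Result<usize> {
    loop {
        match r.read(out) {
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            other => return other,
        }
    }
}

/// Reads into a stack buffer and appends the result to `buf`.
/// The append only reallocates if the bytes do not fit the spare capacity.
fn small_probe_read<R: Read + ?Sized>(r: &mut R, buf: &mut Vec<u8>) -> io::Result<usize> {
    let mut probe = [0u8; PROBE_SIZE];
    let n = read_retrying(r, &mut probe)?;
    buf.extend_from_slice(&probe[..n]);
    Ok(n)
}

fn read_to_end_sketch<R: Read + ?Sized>(r: &mut R, buf: &mut Vec<u8>) -> io::Result<usize> {
    let start_len = buf.len();
    let start_cap = buf.capacity();
    loop {
        let spare = buf.capacity() - buf.len();
        if spare < PROBE_SIZE {
            if buf.capacity() == start_cap {
                // Not reallocated yet: keep probing through the stack buffer
                // until EOF or until appending forces a reallocation.
                if small_probe_read(r, buf)? == 0 {
                    return Ok(buf.len() - start_len);
                }
                continue;
            }
            // Already reallocated once: reserve directly in the Vec.
            buf.try_reserve(PROBE_SIZE)
                .map_err(|_| io::Error::from(io::ErrorKind::OutOfMemory))?;
        }
        // Read into the (zero-initialized) spare capacity.
        let len = buf.len();
        let cap = buf.capacity();
        buf.resize(cap, 0);
        let n = match read_retrying(r, &mut buf[len..]) {
            Ok(n) => n,
            Err(e) => {
                buf.truncate(len);
                return Err(e);
            }
        };
        buf.truncate(len + n);
        if n == 0 {
            return Ok(buf.len() - start_len);
        }
    }
}
```

In this sketch a Vec preallocated to exactly the input size is never reallocated: once it is full, the stack probe sees EOF and the loop returns.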

the8472 · 2025-06-23T11:26:23Z

I can't think of any case where this would be problematic. The use of the stack buffer can trigger OOMs (while the main loop will turn those into errors), but the extra looping just happens in a window where it also could have happened before anyway.

For a fix this is fine, though I think this impl is growing in ugliness and could use some cleanup and more tests but that can be done separately.

the8472 · 2025-06-23T11:33:28Z

Hrrmm, actually... isn't Stdin buffered? Shouldn't that buffer prevent that tearing? I don't see how read_to_end bypasses that.

ChrisDenton · 2025-06-23T11:45:35Z

StdinRaw has an implementation of read_to_end and BufReader will defer to that once it's drained its own buffer.

rust/library/std/src/io/buffered/bufreader.rs

Lines 408 to 410 in 22be76b

    // The inner reader might have an optimized `read_to_end`. Drain our buffer and then
    // delegate to the inner implementation.
    fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
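
The delegation can be observed from outside std with a small sketch. `Tracking` and `drain_then_delegate` are hypothetical names; the wrapper overrides `read_to_end` only to record that it was called.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical inner reader that notes when its own `read_to_end` is used.
struct Tracking<'a> {
    data: &'a [u8],
    delegated: bool,
}

impl Read for Tracking<'_> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        Read::read(&mut self.data, out)
    }

    fn read_to_end(&mut self, out: &mut Vec<u8>) -> io::Result<usize> {
        self.delegated = true;
        Read::read_to_end(&mut self.data, out)
    }
}

/// Reads two bytes through a small `BufReader` (which buffers a few more),
/// then calls `read_to_end` and reports what came back and whether the
/// inner reader's `read_to_end` ran.
fn drain_then_delegate(data: &[u8]) -> io::Result<(Vec<u8>, bool)> {
    let mut reader = BufReader::with_capacity(4, Tracking { data, delegated: false });
    let mut head = [0u8; 2];
    reader.read_exact(&mut head)?;
    let mut rest = Vec::new();
    reader.read_to_end(&mut rest)?;
    Ok((rest, reader.get_ref().delegated))
}
```

The bytes still sitting in the `BufReader` come first in the output, followed by whatever the inner `read_to_end` produces, so the buffer in front of stdin does not shield the inner implementation from this code path.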

a1phyr · 2025-06-23T11:41:17Z

library/std/src/io/mod.rs

@@ -452,7 +452,7 @@ pub(crate) fn default_read_to_end<R: Read + ?Sized>(
    let mut consecutive_short_reads = 0;

    loop {
-        if buf.len() == buf.capacity() && buf.capacity() == start_cap {
+        if buf.spare_capacity_mut().len() < PROBE_SIZE && buf.capacity() == start_cap {


I thought that it would be a good idea too, but small_probe_read is designed to check for EOF so taking this branch if we still have space in the vector does not seem right.

It will be a tight loop until either EOF or it exceeds the spare space. It'll only ever get past that point once a reallocation is required.

Having thought a bit about that, your solution avoids an issue with small reads if the vector was preallocated.

library/std/src/io/mod.rs

On Windows, the UTF-16 to UTF-8 translation is made simpler by ensuring we don't split code points.

a1phyr

This version looks great! Thanks!

Side note: I think that these lines are no longer relevant, now that default_read_to_end uses the unstable read_buf API.

rust/library/std/src/io/mod.rs

Lines 404 to 405 in ae2fc97

    // - avoid passing large buffers to readers that always initialize the free capacity if they perform short reads (#23815, #23820)
    // - pass large buffers to readers that do not initialize the spare capacity. this can amortize per-call overheads

the8472 · 2025-06-23T14:05:49Z

Not all readers implement read_buf; the default impl will still initialize the spare capacity.

rustbot assigned ibraheemdev Jun 22, 2025

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 22, 2025

a1phyr reviewed Jun 22, 2025

ChrisDenton force-pushed the fixup branch 2 times, most recently from d8f8162 to 108dc3e Compare June 23, 2025 10:48

a1phyr reviewed Jun 23, 2025

Avoid short reads in read_to_end

5473402

On Windows, the UTF-16 to UTF-8 translation is made simpler by ensuring we don't split code points.

ChrisDenton force-pushed the fixup branch from 108dc3e to 5473402 Compare June 23, 2025 12:20

a1phyr approved these changes Jun 23, 2025
