Skip to content

Ensure the read_to_end buffer always has enough room to fit a single UTF-8 code point #142872

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ChrisDenton
Copy link
Member

@ChrisDenton ChrisDenton commented Jun 22, 2025

This is a quick fix to resolve #142847. So long as the buffer we read into has space for at least one UTF-8 code point then we can avoid any issues caused by splitting between code point boundaries (thus avoiding a lot of complexity). I think this is also good in general as it avoids some unnecessarily short reads.

@rustbot
Copy link
Collaborator

rustbot commented Jun 22, 2025

r? @ibraheemdev

rustbot has assigned @ibraheemdev.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 22, 2025
Comment on lines 467 to 475
// Ensure there's at least enough space for one UTF-8 encoded code point.
if buf.spare_capacity_mut().len() < char::MAX_LEN_UTF8 {
// buf is full, need more space
buf.try_reserve(PROBE_SIZE)?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this the condition could even be replaced by an unconditional call to try_reserve(PROBE_SIZE).

Copy link
Member

@the8472 the8472 Jun 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might lead to some unnecessary reallocations around some tipping-points, see #89165 (also mentioned in the function's top comment). It's tricky to get these tradeoffs right.

Maybe we can limit this to cfg(windows)? Specialization could work too but feels a bit overkill here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please add the issue to the list of masters this function is serving. Anyone who wants to touch it will have to keep track of all these things.

@the8472
Copy link
Member

the8472 commented Jun 23, 2025

Since default_read_to_end is a crate-private we could also add a "min read capacity" option which can then be passed down from the stdin impl

@ChrisDenton
Copy link
Member Author

My reason for thinking this is more general useful is that currently if a read is just short of the available buffer then the next read will have a buffer size of one or two bytes. If we've not reached the end then this is wasteful.

Another alternative is to use small_read_probe, which at least is better than using a one byte buffer or so.

@the8472
Copy link
Member

the8472 commented Jun 23, 2025

The user may have passed a buffer or size hint with exactly the right size and block-sizes just happen to align so that the last read will be small. In that case it might lead to a huge reallocation just to read the last few bytes.

@ChrisDenton
Copy link
Member Author

Right but reading into a stack buffer would side-step that issue. We could also unconditionally realloc if we've already reallocated at least once.

@the8472
Copy link
Member

the8472 commented Jun 23, 2025

yeah, that could work.

@ChrisDenton ChrisDenton force-pushed the fixup branch 2 times, most recently from d8f8162 to 108dc3e Compare June 23, 2025 10:48
@ChrisDenton
Copy link
Member Author

Ok I'd appreciate someone checking my logic here but I've changed it so that if we've not reallocated yet and there's less than PROBE_SIZE bytes in the spare capacity then we keep using the stack buffer until either we reach the end or we need to reallocate. Otherwise we just reserve the space directly in the Vec.

@the8472
Copy link
Member

the8472 commented Jun 23, 2025

I can't think of any case where this would be problematic. The use of the stack buffer can trigger OOMs (while the main loop will turn those into errors), but the extra looping just happens in a window where it also could have happened before anyway.

For a fix this is fine, though I think this impl is growing in ugliness and could use some cleanup and more tests but that can be done separately.

@the8472
Copy link
Member

the8472 commented Jun 23, 2025

Hrrmm, actually... isn't Stdin buffered? Shouldn't that buffer prevent that tearing? I don't see how read_to_end bypasses that.

@ChrisDenton
Copy link
Member Author

StdinRaw has an implementation of read_to_end and BufReader will defer to that once its drained its own buffer.

// The inner reader might have an optimized `read_to_end`. Drain our buffer and then
// delegate to the inner implementation.
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {

@@ -452,7 +452,7 @@ pub(crate) fn default_read_to_end<R: Read + ?Sized>(
let mut consecutive_short_reads = 0;

loop {
if buf.len() == buf.capacity() && buf.capacity() == start_cap {
if buf.spare_capacity_mut().len() < PROBE_SIZE && buf.capacity() == start_cap {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that it would be a good idea too, but small_probe_read is designed to check for EOF so taking this branch if we still have space in the vector does not seem right.

Copy link
Member Author

@ChrisDenton ChrisDenton Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be a tight loop until either EOF or it exceeds the spare space. It'll only ever get past that point once a reallocation is required.

Copy link
Contributor

@a1phyr a1phyr Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having thought a bit about that, your solution avoids an issue with small reads if the vector was preallocated.

On Windows, the UTF-16 to UTF-8 translation is made simpler by ensuring we don't split code points.
Copy link
Contributor

@a1phyr a1phyr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version looks great! Thanks!

Side note: I think that these lines are no longer relevant, now that default_read_to_end uses the unstable read_buf API.

// - avoid passing large buffers to readers that always initialize the free capacity if they perform short reads (#23815, #23820)
// - pass large buffers to readers that do not initialize the spare capacity. this can amortize per-call overheads

@the8472
Copy link
Member

the8472 commented Jun 23, 2025

Not all readers implement readbuf, the default impl will still initialize the spare capacity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

io::stdin().read_to_end() drops a byte on certain Unicode input (Windows only)
5 participants