-
Notifications
You must be signed in to change notification settings - Fork 13.5k
Ensure the read_to_end
buffer always has enough room to fit a single UTF-8 code point
#142872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
r? @ibraheemdev rustbot has assigned @ibraheemdev. Use |
library/std/src/io/mod.rs
Outdated
// Ensure there's at least enough space for one UTF-8 encoded code point. | ||
if buf.spare_capacity_mut().len() < char::MAX_LEN_UTF8 { | ||
// buf is full, need more space | ||
buf.try_reserve(PROBE_SIZE)?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this the condition could even be replaced by an unconditional call to try_reserve(PROBE_SIZE)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might lead to some unnecessary reallocations around some tipping-points, see #89165 (also mentioned in the function's top comment). It's tricky to get these tradeoffs right.
Maybe we can limit this to cfg(windows)? Specialization could work too but feels a bit overkill here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please add the issue to the list of masters this function is serving. Anyone who wants to touch it will have to keep track of all these things.
Since |
My reason for thinking this is more general useful is that currently if a read is just short of the available buffer then the next read will have a buffer size of one or two bytes. If we've not reached the end then this is wasteful. Another alternative is to use |
The user may have passed a buffer or size hint with exactly the right size and block-sizes just happen to align so that the last read will be small. In that case it might lead to a huge reallocation just to read the last few bytes. |
Right but reading into a stack buffer would side-step that issue. We could also unconditionally realloc if we've already reallocated at least once. |
yeah, that could work. |
d8f8162
to
108dc3e
Compare
Ok I'd appreciate someone checking my logic here but I've changed it so that if we've not reallocated yet and there's less than |
I can't think of any case where this would be problematic. The use of the stack buffer can trigger OOMs (while the main loop will turn those into errors), but the extra looping just happens in a window where it also could have happened before anyway. For a fix this is fine, though I think this impl is growing in ugliness and could use some cleanup and more tests but that can be done separately. |
Hrrmm, actually... isn't Stdin buffered? Shouldn't that buffer prevent that tearing? I don't see how read_to_end bypasses that. |
rust/library/std/src/io/buffered/bufreader.rs Lines 408 to 410 in 22be76b
|
@@ -452,7 +452,7 @@ pub(crate) fn default_read_to_end<R: Read + ?Sized>( | |||
let mut consecutive_short_reads = 0; | |||
|
|||
loop { | |||
if buf.len() == buf.capacity() && buf.capacity() == start_cap { | |||
if buf.spare_capacity_mut().len() < PROBE_SIZE && buf.capacity() == start_cap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that it would be a good idea too, but small_probe_read
is designed to check for EOF so taking this branch if we still have space in the vector does not seem right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be a tight loop until either EOF or it exceeds the spare space. It'll only ever get past that point once a reallocation is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having thought a bit about that, your solution avoids an issue with small reads if the vector was preallocated.
On Windows, the UTF-16 to UTF-8 translation is made simpler by ensuring we don't split code points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version looks great! Thanks!
Side note: I think that these lines are no longer relevant, now that default_read_to_end
uses the unstable read_buf
API.
rust/library/std/src/io/mod.rs
Lines 404 to 405 in ae2fc97
// - avoid passing large buffers to readers that always initialize the free capacity if they perform short reads (#23815, #23820) | |
// - pass large buffers to readers that do not initialize the spare capacity. this can amortize per-call overheads |
Not all readers implement readbuf, the default impl will still initialize the spare capacity. |
This is a quick fix to resolve #142847. So long as the buffer we read into has space for at least one UTF-8 code point then we can avoid any issues caused by splitting between code point boundaries (thus avoiding a lot of complexity). I think this is also good in general as it avoids some unnecessarily short reads.