Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing valid UTF-8 to the Windows terminal using stdout().write() can incorrectly return an error #83258

Closed
Count-Count opened this issue Mar 18, 2021 · 4 comments · Fixed by #83342
Labels
A-io Area: `std::io`, `std::fs`, `std::net` and `std::path` A-Unicode Area: Unicode C-bug Category: This is a bug. O-windows Operating system: Windows T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@Count-Count
Copy link
Contributor

Count-Count commented Mar 18, 2021

I tried this code:

use std::io::Write;

fn standard_write_all<W: std::io::Write>(mut w: W, mut buf: &[u8]) -> Result<(), std::io::Error> {
    while !buf.is_empty() {
        match w.write(buf) {
            Ok(0) => {
                return Err(std::io::Error::new(
                    std::io::ErrorKind::WriteZero,
                    "failed to write whole buffer",
                ));
            }
            Ok(n) => buf = &buf[n..],
            Err(ref e) if e.kind() == std::io::ErrorKind::Interrupted => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

fn main() {
    let poison = "Text here does not matter.\n".to_owned() + "x".repeat(1024 - 1).as_str() + "😀";

    // The special stdout().write_all() impl works.
    std::io::stdout().write_all(poison.as_bytes()).unwrap();
    println!();
    println!();

    // A standard write_all() implementation fails.
    //
    // 1. First write call
    // The bytes up to the newline are passed through to the console in one write.
    //
    // The next 1023 b'x' bytes and the first byte of "😀" are buffered leading to
    // an incomplete codepoint at the end of the buffer. The buffer is now poisoned
    // meaning that the next write() call will fail even if the missing UTF-8 bytes
    // are supplied.
    //
    // 2. Second write call
    // The buffer is flushed completely on the next call, the console writes 1023 b'x' bytes
    // and is then forced to write the remaining byte from the incomplete UTF-8 sequence.
    standard_write_all(std::io::stdout(), poison.as_bytes()).unwrap();
}

I expected to see this happen: Program does not panic.

Instead, this happened: Program panics with the error Windows stdio in console mode does not support writing non-UTF-8 byte sequences.

Meta

rustc --version --verbose:

rustc 1.50.0 (cb75ad5db 2021-02-10)
binary: rustc
commit-hash: cb75ad5db02783e8b0222fee363c5f63f7e2cf5b
commit-date: 2021-02-10
host: x86_64-pc-windows-msvc
release: 1.50.0
Backtrace

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: InvalidData, error: "Windows stdio in console mode does not support writing non-UTF-8 byte sequences" }', src\main.rs:35:65
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b\/library\std\src\panicking.rs:493
   1: core::panicking::panic_fmt
             at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b\/library\core\src\panicking.rs:92
   2: core::option::expect_none_failed
             at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b\/library\core\src\option.rs:1268
   3: core::result::Result<tuple<>, std::io::error::Error>::unwrap<tuple<>,std::io::error::Error>
             at C:\Users\hans\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\core\src\result.rs:973
   4: termcolorbug::main
             at .\src\main.rs:35
   5: core::ops::function::FnOnce::call_once<fn(),tuple<>>
             at C:\Users\hans\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\core\src\ops\function.rs:227
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Update

To clarify: This problem occurs with all Write trait implementations that don't override the default Write::write_all() implementation, e.g. with this simple DelegatingWrite:

use std::io::Write;

struct DelegatingWrite<W> {
    delegate: W,
}

impl<W: Write> DelegatingWrite<W> {
    fn new(delegate: W) -> DelegatingWrite<W> {
        DelegatingWrite { delegate }
    }
}

impl<W: Write> Write for DelegatingWrite<W> {
    fn write(&mut self, b: &[u8]) -> std::io::Result<usize> {
        self.delegate.write(b)
    }

    fn flush(&mut self) -> std::io::Result<()> {
        self.delegate.flush()
    }
}

fn main() {
    let poison = "Text here does not matter.\n".to_owned() + "x".repeat(1024 - 1).as_str() + "😀";
    let mut w = DelegatingWrite::new(std::io::stdout());
    write!(w, "{}", poison).unwrap(); // <-- panics
}
@Count-Count
Copy link
Contributor Author

Count-Count commented Mar 18, 2021

I see two distinct possibilites to resolve this:

  1. Treat incomplete write()s not ending on a codepoint boundary as a bug and fix that in the stdlib. Also clarify in the docs that indivual write() calls must be passed complete UTF-8 sequences if a Windows terminal is attached.

  2. In the vain of "Be liberal in what you accept" handle incomplete UTF-8 passed in via write() calls appropriately, as long as the sequence is completed properly in the next write() call(s).

@ChrisDenton
Copy link
Member

Hm, so this is ultimately a bug in the buffer std::io::stdout() uses? It attempts to write 1024 bytes but only 1023 bytes can written in the first OS call. It then has a leftover byte, which I guess it tries to write by itself thus causing this error?

@Count-Count
Copy link
Contributor Author

@ChrisDenton The logic which errors if incomplete UTF-8 is passed in at index 0 is in the std lib at sys/windows/stdio.rs. That is where the error message comes from.

As far as I can see the stdio buffer implementation (in stable, it is in linewritershim.rs, apparently currently being revamped) can not know of the requirement that only complete UTF-8 may be written to the underlying Windows terminal so I think that it should be encapsulated in sys/windows/stdio.rs by just accepting and buffering incomplete UTF-8 when necessary.

@ChrisDenton
Copy link
Member

I mean, a generic buffer can attempt to avoid small writes (say, less than 16 bytes) by copying those left over bytes to the front of the buffer and waiting for either another write to refill the buffer or a flush to empty it.

But I see what you mean. Introducing a small buffer to windows/stdio would fix the problem.

@Count-Count Count-Count changed the title Writing valid UTF-8 to the Windows terminal using stdout().write() can panic. Writing valid UTF-8 to the Windows terminal using stdout().write() can incorrectly return an error Mar 28, 2021
@JohnTitor JohnTitor added A-io Area: `std::io`, `std::fs`, `std::net` and `std::path` A-Unicode Area: Unicode O-windows Operating system: Windows T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jul 13, 2021
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Sep 1, 2021
…utf8, r=m-ou-se

Allow writing of incomplete UTF-8 sequences to the Windows console via stdout/stderr

# Problem
Writes of just an incomplete UTF-8 byte sequence (e.g. `b"\xC3"` or `b"\xF0\x9F"`)  to stdout/stderr with a Windows console attached error with `io::ErrorKind::InvalidData, "Windows stdio in console mode does not support writing non-UTF-8 byte sequences"` even though further writes could complete the codepoint. This is currently a rare occurence since the [linewritershim](https://github.com/rust-lang/rust/blob/2c56ea38b045624dc8b42ec948fc169eaff1206a/library/std/src/io/buffered/linewritershim.rs) implementation flushes complete lines immediately and buffers up to 1024 bytes for incomplete lines. It can still happen as described in rust-lang#83258.

The problem will become more pronounced once the developer can switch stdout/stderr from line-buffered to block-buffered or immediate when the changes in the "Switchable buffering for Stdout" pull request (rust-lang#78515) get merged.

# Patch description
If there is at least one valid UTF-8 codepoint all valid UTF-8 is passed through to the extracted `write_valid_utf8_to_console()` fn. The new code only comes into play if `write()` is being passed a short byte slice comprising an incomplete UTF-8 codepoint. In this case up to three bytes are buffered in the `IncompleteUtf8` struct associated with `Stdout` / `Stderr`. The bytes are accepted one at a time. As soon as an error can be detected `io::ErrorKind::InvalidData, "Windows stdio in console mode does not support writing non-UTF-8 byte sequences"` is returned. Once a complete UTF-8 codepoint is received it is passed to the `write_valid_utf8_to_console()` and the buffer length is set to zero.

Calling `flush()` will neither error nor write anything if an incomplete codepoint is present in the buffer.

# Tests
Currently there are no Windows-specific tests for console writing code at all. Writing (regression) tests for this problem is a bit challenging since unit tests and UI tests don't run in a console and suddenly popping up another console window might be surprising to developers running the testsuite and it might not work at all in CI builds. To just test the new functionality in unit tests the code would need to be refactored. Some guidance on how to proceed would be appreciated.

# Public API changes
* `std::str::verifications::utf8_char_width()` would be exposed as `std::str::utf8_char_width()` behind the "str_internals" feature gate.

# Related issues
* Fixes rust-lang#83258.
* PR rust-lang#78515 will exacerbate the problem.

# Open questions
* Add tests?
* Squash into one commit with better commit message?
@bors bors closed this as completed in cc9bb15 Sep 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io Area: `std::io`, `std::fs`, `std::net` and `std::path` A-Unicode Area: Unicode C-bug Category: This is a bug. O-windows Operating system: Windows T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
3 participants