Confusing Debug of chars #62947

Closed

Description

@max-sixty
Contributor

Currently, the Debug of Chars prints the underlying bytes, rather than the chars:

#![allow(unused)]
fn main() {
    let s = String::from(" é 😀 ");
    let c = s.chars();
    dbg!("Debug of Chars: ", &c);
    dbg!("Debug of each char: ");
    for x in c {
        dbg!(x);
    }
}

Returns:

[src/main.rs:5] "Debug of Chars: " = "Debug of Chars: "
[src/main.rs:5] &c = Chars {
    iter: Iter(
        [
            32,
            195,
            169,
            32,
            240,
            159,
            152,
            128,
            32,
        ],
    ),
}
[src/main.rs:6] "Debug of each char: " = "Debug of each char: "
[src/main.rs:8] x = ' '
[src/main.rs:8] x = 'é'
[src/main.rs:8] x = ' '
[src/main.rs:8] x = '😀'
[src/main.rs:8] x = ' '

As I was trying to work out what chars was (whether it was Unicode code points, bytes, or something else), the first output was very confusing. Is there a reason we don't print something like the second case?

Would you take a PR to change this?

I couldn't find any previous discussion on this - #49283 was the closest I could find.

Activity

Labels added on Jul 24, 2019:
C-enhancement (Category: An issue proposing an enhancement or a PR with one.)
T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)

ExpHP commented on Jul 25, 2019

@ExpHP
Contributor

The first one is UTF-8 bytes. You see this because the Debug impl for Chars is auto-generated:

#[derive(Clone, Debug)]
pub struct Chars<'a> {
    iter: slice::Iter<'a, u8>
}

Strictly speaking, since the bytes inside of Chars should always be valid UTF-8, this could have a custom Debug impl that makes it pretend to contain a string by formatting the member as str::from_utf8(self.iter.as_slice()).unwrap().
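
A minimal sketch of that idea, for illustration only. std's Chars cannot be given a new impl from outside the standard library, so this uses a hypothetical stand-in type, CharsLike, with the same single field. The name and the choice of debug_tuple are assumptions, not what std does.

```rust
use std::fmt;
use std::slice;
use std::str;

// Hypothetical stand-in for std's `Chars`: an iterator over UTF-8 bytes.
struct CharsLike<'a> {
    iter: slice::Iter<'a, u8>,
}

impl fmt::Debug for CharsLike<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The remaining bytes are expected to be valid UTF-8, so this
        // re-validates them and formats them as a string slice.
        let remaining = str::from_utf8(self.iter.as_slice()).unwrap();
        f.debug_tuple("CharsLike").field(&remaining).finish()
    }
}

fn main() {
    let s = " é 😀 ";
    let c = CharsLike { iter: s.as_bytes().iter() };
    // Prints the remaining text as a quoted string instead of a byte list.
    println!("{:?}", c);
}
```

Note that from_utf8 scans the whole remaining slice on every call, which is the validation cost discussed next.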


The reason it doesn't display as individual chars is that it doesn't hold individual chars; determining their boundaries is the entire point of the Chars iterator. I suppose the same argument could be used against the call to str::from_utf8, which needs to scan the whole string to validate it.

(But then the solution seems to be str::from_utf8_unchecked, which seems awfully heavy-handed for a Debug impl. And perhaps it doesn't even matter, because the cost of most io::Write impls probably outweighs the cost of this validation.)

I guess that, questionable concerns of efficiency aside, my main concern is simply that showing a list of individual chars would be... dishonest, I guess.

max-sixty commented on Jul 25, 2019

@max-sixty
Contributor, Author

I guess that, questionable concerns of efficiency aside, my main concern is simply that showing a list of individual chars would be... dishonest, I guess.

Do str and String contain more data than Chars? I had thought they both contained the underlying bytes, which were then decoded as needed, including as part of the Display and Debug implementations.

ExpHP commented on Jul 25, 2019

@ExpHP
Contributor

No, they contain the same data. They're all just UTF-8 bytes. And the Display implementation of str simply writes the bytes contained in the str directly to the io::Write instance.
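
To make "the same data" concrete, here is a small sketch using the string from the issue. The helper name remaining_bytes is made up for this example; Chars::as_str and str::as_bytes are the real std methods it relies on.

```rust
use std::str::Chars;

// Hypothetical helper: view the bytes a `Chars` iterator has not yet decoded.
fn remaining_bytes<'a>(c: &Chars<'a>) -> &'a [u8] {
    c.as_str().as_bytes()
}

fn main() {
    let s = " é 😀 ";
    let bytes = s.as_bytes();

    // The same nine UTF-8 bytes that the derived Debug output listed.
    assert_eq!(bytes, [32, 195, 169, 32, 240, 159, 152, 128, 32]);

    let mut chars = s.chars();
    // Before iteration, `Chars` views exactly the bytes of the string.
    assert_eq!(remaining_bytes(&chars), bytes);

    // Decoding happens lazily: consuming one char advances past its bytes.
    assert_eq!(chars.next(), Some(' '));
    assert_eq!(remaining_bytes(&chars), &bytes[1..]);

    // Nine bytes, but only five chars once decoded.
    assert_eq!(s.len(), 9);
    assert_eq!(s.chars().count(), 5);
}
```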


Suppose we are writing to STDOUT. On UNIX, the io::Write impl for Stdout writes these bytes directly to the underlying file descriptor with no processing:

impl io::Write for Stdout {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        ManuallyDrop::new(FileDesc::new(libc::STDOUT_FILENO)).write(buf)
    }
}

pub fn write(&self, buf: &[u8]) -> io::Result<usize> {
    let ret = cvt(unsafe {
        libc::write(self.fd,
                    buf.as_ptr() as *const c_void,
                    cmp::min(buf.len(), max_len()))
    })?;
    Ok(ret as usize)
}

I would imagine this is because the console on any UNIX platform almost certainly uses UTF-8 [1]. It is your terminal application that is then responsible for decoding these bytes and producing glyphs. Considering that UTF-8 dominates much of the web space as well, it's quite possible that even on the playground, these bytes are ultimately sent over the wire to your PC with minimal processing, where your browser is responsible for decoding and displaying them.
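
As a rough user-level illustration of the same point (not the std internals quoted above), the bytes of a str can be handed unchanged to any io::Write sink. The function name write_raw is an assumption made for this sketch.

```rust
use std::io::{self, Write};

// Hypothetical helper: pass the UTF-8 bytes of `s` to the sink untouched.
// No decoding into chars happens anywhere on this path.
fn write_raw<W: Write>(mut out: W, s: &str) -> io::Result<()> {
    out.write_all(s.as_bytes())
}

fn main() -> io::Result<()> {
    // Writing to stdout: the terminal is what turns these bytes into glyphs.
    write_raw(io::stdout().lock(), " é 😀 \n")?;

    // Writing to an in-memory buffer shows the bytes arrive as-is.
    let mut buf: Vec<u8> = Vec::new();
    write_raw(&mut buf, " é ")?;
    assert_eq!(buf, vec![32u8, 195, 169, 32]);
    Ok(())
}
```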

On Windows, io::Write for Stdout transcodes the UTF-8 into the UTF-16 format expected by the Windows APIs:

let mut utf16 = [0u16; MAX_BUFFER_SIZE / 2];
let mut len_utf16 = 0;
for (chr, dest) in utf8.encode_utf16().zip(utf16.iter_mut()) {
    *dest = chr;
    len_utf16 += 1;
}
let utf16 = &utf16[..len_utf16];
let mut written = write_u16s(handle, &utf16)?;

and then Windows does whatever it does with those UTF-16 code units. (Quite likely, it hands them directly to the console, which is then responsible for decoding and displaying them.)
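
For a sense of what that transcoding produces, here is a standalone sketch using str::encode_utf16 on the string from the issue. The helper to_utf16 is a made-up name, and collecting into a Vec is a simplification of the fixed-size buffer in the excerpt above.

```rust
// Hypothetical helper: transcode a UTF-8 string into UTF-16 code units.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    let units = to_utf16(" é 😀 ");

    // 'é' (U+00E9) fits in one code unit; '😀' (U+1F600) needs a
    // surrogate pair, so five chars become six code units.
    assert_eq!(units, vec![0x0020, 0x00E9, 0x0020, 0xD83D, 0xDE00, 0x0020]);

    // The conversion is lossless: decoding gives back the original text.
    assert_eq!(String::from_utf16(&units).unwrap(), " é 😀 ");
}
```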


Footnotes

  1. (I think in actuality UNIX accepts arbitrary strings of bytes, and then the portions of these strings which are valid UTF-8 are rendered appropriately by the console. I don't know; doesn't really matter)

max-sixty commented on Jul 25, 2019

@max-sixty
Contributor, Author

OK, so given that, is there still an objection to displaying the Unicode characters for Chars, when there is none for String?
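
For comparison, this is roughly what a Debug output made of decoded chars could look like. It is a sketch on a hypothetical wrapper type, DecodedChars, and is not presented as what any merged change actually does.

```rust
use std::fmt;
use std::str::Chars;

// Hypothetical wrapper around std's `Chars`, used only to attach a
// different Debug impl for illustration.
struct DecodedChars<'a>(Chars<'a>);

impl fmt::Debug for DecodedChars<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Clone the iterator so that formatting does not consume it,
        // then list each remaining char using char's own Debug impl.
        f.debug_list().entries(self.0.clone()).finish()
    }
}

fn main() {
    let s = " é 😀 ";
    // Prints a list of chars rather than a list of UTF-8 bytes.
    println!("{:?}", DecodedChars(s.chars()));
}
```

Cloning Chars is cheap because it only copies the underlying slice iterator. The decoding work happens while formatting.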

added 4 commits that reference this issue on Jul 29, 2019

Rollup merge of rust-lang#63000 - max-sixty:chars-display, r=alexcric…

Commits: 2575e53, c91e647, 4838953, 51e50ed
Labels

A-iterators (Area: Iterators)
C-enhancement (Category: An issue proposing an enhancement or a PR with one.)
T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)

Participants

@ExpHP, @jonas-schievink, @max-sixty
