Debug printing of combining characters is wrong #41922

Closed
Description

clarfonthey (Contributor)

Minimal example:

fn main() {
    let s = "e\u{301}";
    println!("str: {:?}", s);
    println!("bytes: {:?}", s.chars().collect::<Vec<_>>());
}

(playground link)

Expected output is either:

str: "é"
bytes: ['e', '\u{301}']

Or:

str: "é"
bytes: ['e', '◌́']

Actual output:

str: "é"
bytes: ['e', '́']

Note that the combining accent prints over the single quote. This is confusing and shouldn't happen.

Activity

clarfonthey (Contributor, Author) commented on May 11, 2017

cc @tbu- who made the change to debug printing and @alexcrichton who approved it

tbu- (Contributor) commented on May 11, 2017

Python seems to do the same thing.

>>> '\u0301'
'́'

tbu- (Contributor) commented on May 11, 2017

@clarcharr That is, do you know some implementation we could copy?

clarfonthey (Contributor, Author) commented on May 11, 2017

@tbu- not that I can think of; the current way seems wrong, though. Perhaps we could just check if a character is within the combining character range?

I found this and it probably could help: http://stackoverflow.com/a/17052803

Perhaps we could make a similar script?
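
A minimal sketch of the range check suggested here, assuming "the combining character range" means only the Combining Diacritical Marks block (U+0300 to U+036F). The function name `is_combining_diacritical` is hypothetical and is not part of the standard library.

```rust
/// Hypothetical helper: returns true if `c` lies in the Combining
/// Diacritical Marks block (U+0300..=U+036F). Combining characters in
/// other blocks are deliberately not covered by this sketch.
fn is_combining_diacritical(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

fn main() {
    // Walk the string from the minimal example and classify each char.
    for c in "e\u{301}".chars() {
        println!("U+{:04X}: combining = {}", c as u32, is_combining_diacritical(c));
    }
}
```

For the string in the minimal example, this reports `e` as non-combining and U+0301 as combining.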

clarfonthey (Contributor, Author) commented on May 11, 2017

Also got some help on Twitter for this:

https://twitter.com/FakeUnicode/status/862798986238873601

added the T-libs-api label (Relevant to the library API team, which will review and decide on the PR/issue.) on Jun 22, 2017

behnam (Contributor) commented on Aug 11, 2017

I would not consider this a bug, as it's common behavior to leave Unicode characters untouched when printing to stdout or a file, especially in debug mode.

One reason not to substitute a placeholder such as a dotted circle is that it can easily mislead the user. Say I take the output and copy-paste it into a Unicode decoder to see what character is in that spot. I will see two codepoints in the decoder, one of which did not exist in the original string.

So, IMHO, there are pros to doing so, especially nicer-looking output, but the main con is that the Debug output would not tell you the truth, which is very unfortunate, especially since there would be almost no way to work around it! I think it's better to keep these fancy features for the high-level parts of a stack, like Display, instead of Debug.

If Rust wants to do anything special about these characters, the filter would be GC=Mn (Nonspacing_Mark). But, it should be noted that this would mean the result would depend on the Unicode version of the compiler, and newly assigned characters won't get the special treatment until the internal Unicode data of Rust gets updated.

That said, I think we also need to take a look at what other modern Unicode-savvy languages, like Swift, are doing in this area, before making a decision.

varkor (Member) commented on Mar 22, 2018

For reference, in Swift:

let str = "e\u{301}";
// Array of unicode scalars, equivalent to Rust's chars
print("\(Array(str.unicodeScalars))"); // ["e", "\u{0301}"]
// Array of unicode scalars converted into strings
print("\(Array(str.unicodeScalars).map({ String.init($0) }))"); // ["e", "́"]

Swift opts to print code points for unicode scalars (but when converted to strings they display as in Rust).
This seems like reasonable behaviour (@clarcharr's first suggestion).

varkor (Member) commented on Mar 22, 2018

This seems to have been deliberately changed to the current output as a result of #24588.

clarfonthey (Contributor, Author) commented on Mar 22, 2018

I still just think that checking if the character is combining and then escaping if it's by itself is the best option.
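
A rough sketch of that option, not the implementation that was eventually merged. It assumes a hypothetical `is_combining` check limited to U+0300..=U+036F, and the helpers `debug_char` and `debug_str` are illustrative names only. They skip the other escaping a real Debug impl performs (quotes, newlines, and so on).

```rust
/// Hypothetical check; covers only the block U+0300..=U+036F.
fn is_combining(c: char) -> bool {
    matches!(c, '\u{0300}'..='\u{036F}')
}

/// Formats one char in single quotes, escaping it as \u{...} when it is
/// a combining mark, since a lone mark has no base character to attach to.
fn debug_char(c: char) -> String {
    if is_combining(c) {
        format!("'{}'", c.escape_unicode())
    } else {
        format!("'{}'", c)
    }
}

/// Formats a string in double quotes, escaping a combining mark only when
/// it is the first char (where it would otherwise attach to the quote).
fn debug_str(s: &str) -> String {
    let mut out = String::from("\"");
    for (i, c) in s.chars().enumerate() {
        if i == 0 && is_combining(c) {
            out.extend(c.escape_unicode());
        } else {
            out.push(c);
        }
    }
    out.push('"');
    out
}

fn main() {
    let s = "e\u{301}";
    println!("str: {}", debug_str(s));
    let chars: Vec<String> = s.chars().map(debug_char).collect();
    println!("chars: [{}]", chars.join(", "));
}
```

With this sketch the string keeps its combined form, while the lone mark in the char list is printed as `'\u{301}'`, matching the first expected output in the description.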

varkor (Member) commented on Mar 22, 2018

Oh, I see: you already mentioned the earlier change! I agree: this would make sense for combining characters. The range described on Wikipedia should probably be sufficient?
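
As an illustration of a block-based check, here is a sketch that assumes the range in question is the set of five Unicode blocks whose names begin with "Combining". The table and the function name `in_combining_block` are assumptions made for this example. A fixed block list is narrower than the GC=Mn filter mentioned earlier in the thread: U+0483 (COMBINING CYRILLIC TITLO) is a nonspacing mark but falls outside these blocks.

```rust
/// Inclusive code point ranges of the Unicode blocks named "Combining ...".
const COMBINING_BLOCKS: [(u32, u32); 5] = [
    (0x0300, 0x036F), // Combining Diacritical Marks
    (0x1AB0, 0x1AFF), // Combining Diacritical Marks Extended
    (0x1DC0, 0x1DFF), // Combining Diacritical Marks Supplement
    (0x20D0, 0x20FF), // Combining Diacritical Marks for Symbols
    (0xFE20, 0xFE2F), // Combining Half Marks
];

/// Hypothetical helper: true if `c` falls inside any block listed above.
fn in_combining_block(c: char) -> bool {
    let cp = c as u32;
    COMBINING_BLOCKS.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
}

fn main() {
    // U+0483 is a nonspacing mark that the block list does not cover.
    for c in ['e', '\u{301}', '\u{20D7}', '\u{0483}'] {
        println!("U+{:04X}: in combining block = {}", c as u32, in_combining_block(c));
    }
}
```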

added a commit that references this issue on May 21, 2018

Auto merge of #49283 - varkor:combining-chars-escape_debug, r=SimonSapin

65a16c0
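
To check how a given toolchain formats a lone combining mark, something like the following can be run. It assumes a toolchain that includes the change referenced above. The wrapper `debug_of` is a hypothetical name used only to make the result easy to inspect.

```rust
/// Hypothetical wrapper around the standard Debug formatting for `char`.
fn debug_of(c: char) -> String {
    format!("{:?}", c)
}

fn main() {
    // Debug output for each char of the string from the minimal example.
    for c in "e\u{301}".chars() {
        println!("{}", debug_of(c));
    }
}
```

On recent toolchains the second line should be the escaped form `'\u{301}'` rather than an accent drawn over the quote.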
Labels: A-Unicode (Area: Unicode), C-bug (Category: This is a bug.), T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)

Participants: @behnam, @varkor, @Mark-Simulacrum, @tbu-, @clarfonthey