Closed
Description
Minimal example:
fn main() {
    let s = "e\u{301}";
    println!("str: {:?}", s);
    println!("bytes: {:?}", s.chars().collect::<Vec<_>>());
}
Expected output is either:
str: "é"
bytes: ['e', '\u{301}']
Or:
str: "é"
bytes: ['e', '◌́']
Actual output:
str: "é"
bytes: ['e', '́']
Note that the combining accent prints over the single quote. This is confusing and shouldn't happen.
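Until the Debug output is unambiguous, one way to see exactly which code points a string contains is to escape each char explicitly. The following is a minimal sketch, not part of the original report; `escaped_chars` is a hypothetical helper name, and it relies only on `char::escape_default` from the standard library, which leaves printable ASCII alone and escapes everything else.

```rust
/// Hypothetical helper: render each char of `s` with `char::escape_default`,
/// so printable ASCII stays as-is and every other char becomes a `\u{...}` escape.
fn escaped_chars(s: &str) -> Vec<String> {
    s.chars().map(|c| c.escape_default().to_string()).collect()
}

fn main() {
    let s = "e\u{301}";
    // Join with Display rather than Debug so the backslash is not doubled.
    println!("chars: [{}]", escaped_chars(s).join(", "));
    // prints: chars: [e, \u{301}]
}
```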
Activity
clarfonthey commented on May 11, 2017
cc @tbu- who made the change to debug printing and @alexcrichton who approved it
tbu- commented on May 11, 2017
Python seems to do the same thing.
tbu- commented on May 11, 2017
@clarcharr That is, do you know some implementation we could copy?
clarfonthey commented on May 11, 2017
@tbu- not that I can think of; the current way seems wrong, though. Perhaps we could just check if a character is within the combining character range?
I found this and it probably could help: http://stackoverflow.com/a/17052803
Perhaps we could make a similar script?
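A range check of the kind suggested here could look roughly like the sketch below. It is an approximation for illustration only: `is_combining` is a hypothetical name, and the ranges are the Unicode blocks dedicated to combining marks (Combining Diacritical Marks and its Extended, Supplement, for Symbols, and Half Marks companions), which is an assumption about what "the combining character range" should cover. Combining marks that live in script-specific blocks are not detected.

```rust
/// Hypothetical helper: true if `c` falls in one of the Unicode blocks
/// reserved for combining marks. This is a block-based approximation,
/// not a lookup of the character's actual Unicode properties.
fn is_combining(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'..='\u{036F}'     // Combining Diacritical Marks
        | '\u{1AB0}'..='\u{1AFF}'   // Combining Diacritical Marks Extended
        | '\u{1DC0}'..='\u{1DFF}'   // Combining Diacritical Marks Supplement
        | '\u{20D0}'..='\u{20FF}'   // Combining Diacritical Marks for Symbols
        | '\u{FE20}'..='\u{FE2F}'   // Combining Half Marks
    )
}

fn main() {
    for c in "e\u{301}".chars() {
        println!("U+{:04X} combining: {}", c as u32, is_combining(c));
    }
}
```

A fixed list of blocks does not change unless new combining blocks are added to Unicode, but it trades accuracy for that simplicity.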
clarfonthey commented on May 11, 2017
Also got some help on Twitter for this:
https://twitter.com/FakeUnicode/status/862798986238873601
behnam commented on Aug 11, 2017
I would not consider this a bug, as it's common behavior to not touch or change Unicode characters when they are printed to stdout or a file, especially in debug mode.
One reason not to do this is that it can easily mislead the user. Say I copy-paste the output into a Unicode decoder to see which character is in that spot. I will see two code points in the decoder, one of which did not exist in the original string.
So, IMHO, there are pros to doing so, especially nicer-looking output, but the main con is that the Debug output would not tell you the truth, which is very unfortunate, especially since there would be almost no way to work around it! I think it's better to keep these fancy features for the high-level parts of a stack, like Display, instead of Debug.
If Rust wants to do anything special about these characters, the filter would be GC=Mn (Nonspacing_Mark). But it should be noted that this would mean the result would depend on the Unicode version of the compiler, and newly assigned characters won't get the special treatment until the internal Unicode data of Rust gets updated.
That said, I think we also need to take a look at what other modern Unicode-savvy languages, like Swift, are doing in this area before making a decision.
varkor commented on Mar 22, 2018
For reference, in Swift:
Swift opts to print code points for Unicode scalars (but when converted to strings they display as in Rust).
This seems like reasonable behaviour (@clarcharr's first suggestion).
varkor commented on Mar 22, 2018
This seems to have been deliberately changed to the current output as a result of #24588.
clarfonthey commented on Mar 22, 2018
I still just think that checking if the character is combining and then escaping if it's by itself is the best option.
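As a rough illustration of this proposal (and not necessarily what any merged change does), the sketch below escapes a combining character only when it has no base character to attach to: always for a lone char, and only in the leading position for a string. The names `is_combining`, `debug_char`, and `debug_str` are hypothetical, and the block-based range check is an assumed approximation of "is combining".

```rust
/// Hypothetical helper: block-based approximation of "is a combining mark".
fn is_combining(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'..='\u{036F}'
        | '\u{1AB0}'..='\u{1AFF}'
        | '\u{1DC0}'..='\u{1DFF}'
        | '\u{20D0}'..='\u{20FF}'
        | '\u{FE20}'..='\u{FE2F}'
    )
}

/// Hypothetical stand-in for char's Debug output: a lone combining mark
/// is always "by itself", so it is written as a `\u{...}` escape.
fn debug_char(c: char) -> String {
    if is_combining(c) {
        format!("'{}'", c.escape_unicode())
    } else {
        format!("{:?}", c)
    }
}

/// Hypothetical stand-in for str's Debug output: a combining mark is
/// escaped only at the start of the string, where it has no base character.
fn debug_str(s: &str) -> String {
    let mut out = String::from("\"");
    for (i, c) in s.chars().enumerate() {
        if is_combining(c) {
            if i == 0 {
                out.extend(c.escape_unicode());
            } else {
                out.push(c); // keep the mark attached to its base character
            }
        } else {
            out.extend(c.escape_debug());
        }
    }
    out.push('"');
    out
}

fn main() {
    let s = "e\u{301}";
    println!("str: {}", debug_str(s));
    let chars: Vec<String> = s.chars().map(debug_char).collect();
    println!("chars: [{}]", chars.join(", "));
    // second line prints: chars: ['e', '\u{301}']
}
```

With this approach the string still renders as a single accented letter, while the per-char output matches the first expected output in the description.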
varkor commented on Mar 22, 2018
Oh, I see: you already mentioned the earlier change! I agree: this would make sense for combining characters. The range described on Wikipedia should probably be sufficient?
Auto merge of #49283 - varkor:combining-chars-escape_debug, r=SimonSapin