Various Unicode BCP 47 locale identifiers issues #330

anba · 2019-03-13T16:00:49Z

6.2 Language Tags

The link to UTS 35 should use https instead of http.
"[...] identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors)" should be changed to refer to Unicode BCP 47 locale identifiers exclusively.
"Unicode BCP 47 Locale Identifiers that meet those validity criteria of Unicode Technical Standard 35, section 3.2 [....]" needs to be reworded, because "validity" can now be misunderstood to mean "validity" as specified in UTS 35 (cf. the "Validity / Comments" column in UTS 35).
- IIUC "structurally valid" in ECMA-402 maps to "syntactically well-formed" in UTS 35.
"[...] without reference to the IANA Language Subtag Registry" may no longer be needed resp. it should be said, that ECMA-402 considers those languages tags as valid which match the syntax of Unicode BCP 47 locale identifiers, but that it is not required to validate them according to the Unicode validation data. (For example "aaj" is a valid language tag in ECMA-402 even though "aaj" is not included in https://unicode.org/repos/cldr/tags/latest/common/validity/language.xml.)

6.2.1 Unicode Locale Extension Sequences

The definition should be changed to refer to unicode_locale_extensions from https://unicode.org/reports/tr35/#Unicode_locale_identifier.
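
As an illustration only, a Unicode locale extension sequence can be pulled out of a tag with a regular expression. This Python sketch assumes the tag is already well-formed, and it skips the private-use part after "-x-"; the function name is hypothetical.

```python
import re

# "-u" followed by one or more non-singleton subtags (2 to 8 alphanumerics)
UNICODE_EXTENSION = re.compile(r"-u(?:-[0-9a-z]{2,8})+")


def unicode_extension_sequence(tag):
    """Return the "-u-..." sequence of a well-formed tag, or None."""
    lowered = tag.lower()
    # Subtags after "-x-" are private use and must not be interpreted.
    public_part = lowered.split("-x-", 1)[0]
    match = UNICODE_EXTENSION.search(public_part)
    return match.group(0) if match else None
```

The match stops at the next singleton, so a following "-t-" extension is not swallowed.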

6.2.2 IsStructurallyValidLanguageTag

The next revision of UTS 35 will remove the ABNF grammar, so IsStructurallyValidLanguageTag will need to refer to the EBNF grammar.
Ref: http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#Unicode_language_identifier
It should be clarified whether or not "does not include duplicate variant subtags" also applies to variant subtags in tlang. For example is "en-t-en-emodeng-emodeng" valid or not?
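
The tlang question can be made concrete with a small Python sketch that collects the variant subtags inside a "-t-" extension. It assumes a well-formed tag with no private-use part before the "t" singleton; the regular expressions follow the unicode_variant_subtag and tkey productions of UTS 35, and the function names are hypothetical.

```python
import re

# unicode_variant_subtag: alphanum{5,8} | digit alphanum{3}
VARIANT = re.compile(r"[0-9a-z]{5,8}|[0-9][0-9a-z]{3}")
# tkey: alpha digit (starts a tfield, which ends the tlang)
TKEY = re.compile(r"[a-z][0-9]")


def tlang_variants(tag):
    """Return the variant subtags found in the tlang of a -t- extension."""
    subtags = tag.lower().split("-")
    try:
        start = subtags.index("t", 1)  # the "t" singleton
    except ValueError:
        return []  # no transformed extension
    rest = subtags[start + 1:]
    if not rest or TKEY.fullmatch(rest[0]):
        return []  # extension has tfields only, no tlang
    variants = []
    # rest[0] is the tlang language subtag; skip it by position because a
    # 5 to 8 letter language subtag would also match the VARIANT pattern.
    for subtag in rest[1:]:
        if len(subtag) == 1 or TKEY.fullmatch(subtag):
            break  # next singleton or first tfield ends the tlang
        if VARIANT.fullmatch(subtag):
            variants.append(subtag)
    return variants


def has_duplicate_tlang_variants(tag):
    variants = tlang_variants(tag)
    return len(variants) != len(set(variants))
```

With this sketch "en-t-en-emodeng-emodeng" is flagged, whereas duplicates outside the tlang (such as in "de-DE-1901-1901") are deliberately not its concern.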

6.2.3 CanonicalizeLanguageTag

The next revision of UTS 35 adds definitions for "canonical syntax" and "canonical form" of Unicode locale identifiers. Does it make sense to switch to using them? (Note: rev34 also has "canonical form", which is only a subset of "canonical syntax" from rev35.)
- "BCP 47 Language Tag to Unicode BCP 47 Locale Identifier" in rev53 and rev54 does not replace variant subtags, which were replaced before the switch to Unicode BCP 47 locale ids. For example IETF BCP 47 language tags canonicalises "hy-arevmda" to "hyw", whereas "BCP 47 Language Tag to Unicode BCP 47 Locale Identifier" doesn't touch any variant subtags. Canonical Unicode locale identifiers in rev35 will support this canonicalisation. (But "ja-Latn-hepburn-heploc" is still not canonicalised to "ja-Latn-alalc97", instead "ja-Latn-hepburn-alalc97" is used. Not sure if this a bug or unsupported canonicalisation mode in CLDR?)
- "canonical syntax" reorders variant subtags in alphabetical order, which is not allowed per RFC 5646. For example "sl-rozaj-biske" is reordered to "sl-biske-rozaj" in UTS 35, but this actually invalidates the language tag per IANA, because the required prefix for "biske" is "sl-rozaj".
- Unfortunately the "canonical form" in UTS 35 rev54 also adds many more canonicalisation requirements and I'm not sure these make sense for ECMA-402 (at least for the moment).
Do we require to normalise the case in the tlang extension? For example should "en-t-en-us" be case regularised to "en-t-en-US"?
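
The conflict between alphabetical variant ordering and the registry's prefix requirements can be sketched in Python as follows. The VARIANT_PREFIXES table is a hypothetical two-entry excerpt of the IANA registry, and the prefix test is simplified to "the preceding subtags start with the prefix"; neither is how any real implementation is known to work.

```python
# Hypothetical excerpt of the IANA Language Subtag Registry:
# variant subtag -> prefixes it is registered for.
VARIANT_PREFIXES = {
    "rozaj": ["sl"],
    "biske": ["sl-rozaj"],
}


def sort_variants(language, variants):
    """Order variants alphabetically, as "canonical syntax" would."""
    return "-".join([language, *sorted(variants)])


def satisfies_prefixes(tag):
    """Check each known variant against its registered prefixes."""
    subtags = tag.lower().split("-")
    for index, subtag in enumerate(subtags):
        prefixes = VARIANT_PREFIXES.get(subtag)
        if prefixes is None:
            continue  # not a variant this excerpt knows about
        preceding = "-".join(subtags[:index])
        if not any(
            preceding == prefix or preceding.startswith(prefix + "-")
            for prefix in prefixes
        ):
            return False
    return True
```

Sorting turns "sl-rozaj-biske" into "sl-biske-rozaj", and the second form no longer places "biske" after its "sl-rozaj" prefix.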

sffc · 2019-03-14T00:37:16Z

@FrankYFTang

FrankYFTang · 2019-03-14T01:03:31Z

@anba In the big picture, all the issues you mentioned seem reasonable to me. I suggest you create a PR based on what you stated above and we can review the wording of the changes together.

leobalter · 2019-03-14T20:05:47Z

It should be clarified whether or not "does not include duplicate variant subtags" also applies to variant subtags in tlang. For example is "en-t-en-emodeng-emodeng" valid or not?
The next revision of UTS 35 adds definitions for "canonical syntax" and "canonical form" of Unicode locale identifiers. Does it make sense to switch to using them?
Do we require to normalise the case in the tlang extension? For example should "en-t-en-us" be case regularised to "en-t-en-US"?

While I can provide changes for most of the other parts, these are questions we should work through in a discussion next Thursday. I don't have an immediate answer for any of them, at least.

I'll have a PR with the other parts, as I already did with the https parts (see #331).

leobalter · 2019-03-21T16:47:41Z

cc @FrankYFTang @zbraniecki to follow up and reference the public spec once it's published

sffc · 2019-04-29T17:07:32Z

@FrankYFTang Have you followed up with Mark about this in the CLDR spec?

jswalden · 2020-02-07T19:50:27Z

The duplicated-variants restriction, and whether it ought to apply to tlang, is rearing its head in SpiderMonkey patch work and review at this point. Allowing tlang to contain duplicate variants, while unicode_language_id cannot contain them, is forcing the addition of an enum class DuplicateVariants { Allow, Reject } to our canonicalize-language-id operation, with corresponding complexity to only reject duplicates when DuplicateVariants::Reject is passed. This seems undesirable.

Either duplicates should be allowed in both productions (but canonicalization should remove all but one of each duplicate variant), or they should be allowed in neither. I don't remember why IsStructurallyValidLanguageTag includes a no-duplicate-variants restriction. Revision history on Github doesn't reveal a rationale for the choice.

If the reason for the restriction is sensible and good, I think we ought to apply it everywhere. But if it is questionable in any way, being slightly more liberal about allowing harmlessly-duplicate variants (but removing the duplication during canonicalizing) seems like the right approach.
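
For illustration, the more liberal alternative (accept duplicates, then drop them while canonicalizing) could look like the Python sketch below. The function name is hypothetical, and this is not what SpiderMonkey or ECMA-402 actually does.

```python
def remove_duplicate_variants(variants):
    """Keep the first occurrence of each variant, compared case-insensitively."""
    seen = set()
    unique = []
    for variant in variants:
        key = variant.lower()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Applied uniformly to both unicode_language_id and tlang, a step like this would need no Allow/Reject flag, at the cost of accepting tags that BCP 47 itself calls invalid.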

anba · 2020-02-07T23:55:13Z

The duplicate variant restriction may come from BCP 47, §2.2.5, item 5:

The same variant subtag MUST NOT be used more than once within a language tag.

For example, the tag "de-DE-1901-1901" is not valid.

jswalden · 2020-02-08T00:42:52Z

Hmm, okay. That seems pretty clear and direct about invalidity. I can't think of a serious case for not applying that to tlang as well -- anything that actually wanted to interpret transform extensions, well, it's going to have to apply that restriction internally, right?

anba · 2020-02-10T10:07:54Z

BCP 47, § 2.2.9 is probably a better reference point, because it also contains the other restrictions present in IsStructurallyValidLanguageTag.

6.2.2 IsStructurallyValidLanguageTag:

The IsStructurallyValidLanguageTag abstract operation verifies that the locale argument (which must be a String value)

- represents a well-formed "Unicode BCP 47 Locale Identifier" as specified in Unicode Technical Standard 35 section 3.2, or successor,
- does not include duplicate variant subtags, and
- does not include duplicate singleton subtags.

and BCP 47, § 2.2.9:

A tag is considered "valid" if it satisfies these conditions:

- The tag is well-formed.
- Either the tag is in the list of grandfathered tags or all of its primary language, extended language, script, region, and variant subtags appear in the IANA Language Subtag Registry as of the particular registry date.
- There are no duplicate variant subtags.
- There are no duplicate singleton (extension) subtags.

(The second bullet point isn't present in ECMA-402, because it'd require shipping an up-to-date language tag registry.)
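
The two duplicate checks that ECMA-402 shares with BCP 47 need no registry data, as this minimal Python sketch suggests. It assumes the tag has already passed the well-formedness check and only looks at variants of the leading language identifier; the function names are hypothetical.

```python
import re

# unicode_variant_subtag: alphanum{5,8} | digit alphanum{3}
VARIANT = re.compile(r"[0-9a-z]{5,8}|[0-9][0-9a-z]{3}")


def has_duplicates(items):
    return len(items) != len(set(items))


def passes_duplicate_checks(tag):
    """Reject duplicate variants and duplicate singletons in a well-formed tag."""
    subtags = tag.lower().split("-")
    variants = []
    singletons = []
    in_extensions = False
    # Skip the language subtag: a 5 to 8 letter language would look like a variant.
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            if subtag == "x":
                break  # private use: nothing after this is interpreted
            in_extensions = True
            singletons.append(subtag)
        elif not in_extensions and VARIANT.fullmatch(subtag):
            variants.append(subtag)
    return not has_duplicates(variants) and not has_duplicates(singletons)
```

Note that this sketch does not look inside tlang, so "en-t-en-emodeng-emodeng" passes; covering that case is the separate question raised earlier in the thread.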

jswalden · 2020-02-27T20:02:13Z

We discussed this today and concluded that the duplicate-variant concern does not have to be resolved immediately, and that if a published ECMA-402 edition ends up lagging the "living standard" spec, that's okay.

I'll look into creating a PR to additionally forbid duplicate variants in tlang.

sffc · 2023-09-18T23:45:57Z

@ben-allen to evaluate which, if any, of the items in the OP still need to be addressed.

ben-allen · 2024-05-02T13:46:40Z

All of the above appear to be resolved by the following commits:

commit 1e5df59e7b6ee6fe549dec2429dcb71e19b0e368
Author: Leo Balter <[email protected]>
Date:   Thu Mar 14 16:06:55 2019 -0400

    Normative: Apply recommended updates for BCP 47 Locale Identifiers

    Ref #330

and

commit 378ba6f03aa36e2d4fa70c8e087bdb99e6ed1b20
Author: Jeff Walden <[email protected]>
Date:   Wed Feb 17 17:06:22 2021 -0800

    Do not allow duplicate variants within the tlang component of a transformed content extension. (#429)

I believe this one should be closed.

ben-allen · 2024-05-09T21:07:44Z

Closed because all but one of the bullet points have been addressed in PRs from 2019 and 2021. The remaining bullet point, on "sl-rozaj-biske" being reordered to "sl-biske-rozaj" against RFC 5646 rules, was resolved by the removal of RFC 5646 from the normative references. See:

commit 90bd833eda51047ce9b40c73ee753a2a1a08f971 (HEAD)
Author: André Bargull <[email protected]>
Date:   Mon Mar 16 02:28:15 2020 -0700

    Editorial: Replace more BCP 47 language tag with Unicode BCP 47 locale identifier

    Also remove the reference to BCP 47 RFCs in the normative references section.

leobalter mentioned this issue Mar 14, 2019

Editorial: Prefer ssl links over regular http #331

Merged

leobalter added a commit to leobalter/ecma402 that referenced this issue Mar 14, 2019

Normative: Apply recommended updates for BCP 47 Locale Identifiers

e2f7cc7

Ref tc39#330

leobalter mentioned this issue Mar 14, 2019

Normative: Apply recommended updates for BCP 47 Locale Identifiers #333

Merged

littledan pushed a commit that referenced this issue Mar 18, 2019

Normative: Apply recommended updates for BCP 47 Locale Identifiers

1e5df59

Ref #330

sffc added the labels "c: locale" (Component: locale identifiers) and "s: discuss" (Status: TG2 must discuss to move forward) Mar 19, 2019

jswalden mentioned this issue Apr 24, 2020

Normative: Do not allow duplicate variants within the tlang component of a transformed_extensions #429

Merged

sffc assigned jswalden Jun 5, 2020

sffc added the labels "s: in progress" (Status: the issue has an active proposal) and "Small" (Smaller change solvable in a Pull Request), and removed the label "s: discuss" (Status: TG2 must discuss to move forward) Jun 5, 2020

sffc added this to the ES 2021 milestone Jun 5, 2020

sffc modified the milestones: ES 2021, ES 2022 Mar 22, 2021

sffc modified the milestones: ES 2022, ES 2023 Jun 1, 2022

sffc assigned ben-allen and unassigned jswalden Sep 18, 2023

ben-allen closed this as completed May 9, 2024

nnmrts mentioned this issue Dec 5, 2024

Why is there no Intl.Locale.prototype.variants? #900

Open
