Various Unicode BCP 47 locale identifiers issues #330

Closed
anba opened this issue Mar 13, 2019 · 13 comments
Labels: c: locale (Component: locale identifiers); s: in progress (Status: the issue has an active proposal); Small (Smaller change solvable in a Pull Request)

@anba
Contributor

anba commented Mar 13, 2019

6.2 Language Tags

  • The link to UTS 35 should use https instead of http.
  • "[...] identifies locales using language tags as defined by IETF BCP 47 (RFCs 5646 and 4647 or their successors)" should be changed to refer to Unicode BCP 47 locale identifiers exclusively.
  • "Unicode BCP 47 Locale Identifiers that meet those validity criteria of Unicode Technical Standard 35, section 3.2 [...]" needs to be reworded, because "validity" can now be misunderstood to mean "validity" as specified in UTS 35 (cf. the "Validity / Comments" column in UTS 35).
    • IIUC "structurally valid" in ECMA-402 maps to "syntactically well-formed" in UTS 35.
  • "[...] without reference to the IANA Language Subtag Registry" may no longer be needed. Alternatively, it should say that ECMA-402 considers as valid those language tags that match the syntax of Unicode BCP 47 locale identifiers, but that validating them against the Unicode validation data is not required. (For example, "aaj" is a valid language tag in ECMA-402 even though "aaj" is not included in https://unicode.org/repos/cldr/tags/latest/common/validity/language.xml.)

6.2.1 Unicode Locale Extension Sequences


6.2.2 IsStructurallyValidLanguageTag


6.2.3 CanonicalizeLanguageTag

  • The next revision of UTS 35 adds definitions for "canonical syntax" and "canonical form" of Unicode locale identifiers. Does it make sense to switch to using them? (Note: rev34 also has "canonical form", which is only a subset of "canonical syntax" from rev35.)
    • "BCP 47 Language Tag to Unicode BCP 47 Locale Identifier" in rev53 and rev54 does not replace variant subtags, which were replaced before the switch to Unicode BCP 47 locale ids. For example, IETF BCP 47 canonicalises "hy-arevmda" to "hyw", whereas "BCP 47 Language Tag to Unicode BCP 47 Locale Identifier" doesn't touch any variant subtags. Canonical Unicode locale identifiers in rev35 will support this canonicalisation. (But "ja-Latn-hepburn-heploc" is still not canonicalised to "ja-Latn-alalc97"; instead "ja-Latn-hepburn-alalc97" is used. Not sure if this is a bug or an unsupported canonicalisation mode in CLDR?)
    • "canonical syntax" reorders variant subtags in alphabetical order, which is not allowed per RFC 5646. For example "sl-rozaj-biske" is reordered to "sl-biske-rozaj" in UTS 35, but this actually invalidates the language tag per IANA, because the required prefix for "biske" is "sl-rozaj".
    • Unfortunately, the "canonical form" in UTS 35 rev54 also adds many more canonicalisation requirements, and I'm not sure these make sense for ECMA-402 (at least for the moment).
  • Do we require case normalisation in the tlang extension? For example, should "en-t-en-us" be case-regularised to "en-t-en-US"?
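
The variant reordering described above can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the function name `sort_variants` is hypothetical, the input is assumed to be a well-formed tag, and the code does nothing beyond sorting the variant subtags of the language identifier part (no alias replacement, no case changes).

```python
import re

# Shape of a variant subtag in UTS 35: 5-8 alphanumerics, or a digit plus 3 alphanumerics.
VARIANT = re.compile(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}")


def sort_variants(tag: str) -> str:
    """Sort the variant subtags of the language id part alphabetically (hypothetical helper)."""
    subtags = tag.split("-")
    # The language id ends at the first singleton (a one-character subtag), if any.
    end = next((i for i, s in enumerate(subtags) if len(s) == 1), len(subtags))
    # Variants are the trailing subtags of the language id; index 0 is always the language.
    start = end
    while start > 1 and VARIANT.fullmatch(subtags[start - 1].lower()):
        start -= 1
    variants = sorted(subtags[start:end], key=str.lower)
    return "-".join(subtags[:start] + variants + subtags[end:])


print(sort_variants("sl-rozaj-biske"))  # sl-biske-rozaj
```

The output places "biske" before "rozaj", which is exactly the ordering that conflicts with the "sl-rozaj" prefix registered for "biske" in the IANA registry.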
@sffc
Contributor

sffc commented Mar 14, 2019

@FrankYFTang

@FrankYFTang
Contributor

@anba In the big picture, all the issues you mentioned seem reasonable to me. I suggest you create a PR based on what you stated above, and we can review the wording of the changes together.

@leobalter
Member

  • It should be clarified whether or not "does not include duplicate variant subtags" also applies to variant subtags in tlang. For example is "en-t-en-emodeng-emodeng" valid or not?
  • The next revision of UTS 35 adds definitions for "canonical syntax" and "canonical form" of Unicode locale identifiers. Does it make sense to switch to using them?
  • Do we require case normalisation in the tlang extension? For example, should "en-t-en-us" be case-regularised to "en-t-en-US"?

While I can provide changes for most of the other parts, these are questions we should work through in a discussion next Thursday. I don't have an immediate answer for any of them.

I'll open a PR for the other parts, as I already did for the https parts (see #331).

leobalter added a commit to leobalter/ecma402 that referenced this issue Mar 14, 2019
sffc added the "c: locale" (Component: locale identifiers) and "s: discuss" (Status: TG2 must discuss to move forward) labels on Mar 19, 2019
@leobalter
Member

cc @FrankYFTang @zbraniecki to follow up and reference the public spec once it's published

@sffc
Contributor

sffc commented Apr 29, 2019

@FrankYFTang Have you followed up with Mark about this in the CLDR spec?

@jswalden
Collaborator

jswalden commented Feb 7, 2020

The duplicated-variants restriction, and whether it ought to apply to tlang, is rearing its head in SpiderMonkey patchwork and review at this point. Allowing tlang to contain duplicate variants, while unicode_language_id cannot contain them, is forcing the addition of an enum class DuplicateVariants { Allow, Reject } to our canonicalize-language-id operation, with corresponding complexity to reject duplicates only when DuplicateVariants::Reject is passed. This seems undesirable.

Either duplicates should be allowed in both productions (but canonicalization should remove all but one of each duplicate variant), or they should be allowed in neither. I don't remember why IsStructurallyValidLanguageTag includes a no-duplicate-variants restriction. Revision history on Github doesn't reveal a rationale for the choice.

If the reason for the restriction is sensible and good, I think we ought to apply it everywhere. But if it is questionable in any way, being slightly more liberal about allowing harmlessly duplicated variants (but removing the duplication during canonicalization) seems like the right approach.
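
To make the two options concrete, here is a rough Python sketch of both behaviours for the variants inside tlang. It is not SpiderMonkey's code: the helper names `tlang_variants`, `has_duplicates`, and `remove_duplicates` are hypothetical, and the parsing assumes an already well-formed tag.

```python
import re

# Shape of a variant subtag: 5-8 alphanumerics, or a digit plus 3 alphanumerics.
VARIANT = re.compile(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}")
# Shape of a tkey in the -t- extension: one letter followed by one digit.
TKEY = re.compile(r"[a-z][0-9]")


def tlang_variants(tag: str) -> list[str]:
    """Return the lower-cased variant subtags of the tlang part of a -t- extension."""
    subtags = tag.lower().split("-")
    if "x" in subtags:  # everything after the private-use singleton is opaque
        subtags = subtags[: subtags.index("x")]
    if "t" not in subtags[1:]:
        return []
    start = subtags.index("t", 1) + 1
    tlang = []
    for subtag in subtags[start:]:
        if len(subtag) == 1 or TKEY.fullmatch(subtag):
            break  # next extension or first tfield: tlang is over
        tlang.append(subtag)
    # Skip tlang's own language subtag; script and region never match VARIANT.
    return [s for s in tlang[1:] if VARIANT.fullmatch(s)]


def has_duplicates(variants: list[str]) -> bool:
    """Option 'allowed in neither': the caller rejects the tag when this is True."""
    return len(set(variants)) != len(variants)


def remove_duplicates(variants: list[str]) -> list[str]:
    """Option 'allowed in both': keep the first occurrence of each variant."""
    return list(dict.fromkeys(variants))


variants = tlang_variants("en-t-en-emodeng-emodeng")
print(has_duplicates(variants), remove_duplicates(variants))
```

Either helper could be applied identically to the variants of unicode_language_id, which is the point of the argument: one policy for both productions avoids a per-call-site switch.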

@anba
Contributor Author

anba commented Feb 7, 2020

The duplicate variant restriction may come from BCP 47, §2.2.5, item 5:

The same variant subtag MUST NOT be used more than once within a language tag.

  • For example, the tag "de-DE-1901-1901" is not valid.

@jswalden
Collaborator

jswalden commented Feb 8, 2020

Hmm, okay. That seems pretty clear and direct about invalidity. I can't think of a serious case for not applying that to tlang as well -- anything that actually wanted to interpret transform extensions, well, it's going to have to apply that restriction internally, right?

@anba
Contributor Author

anba commented Feb 10, 2020

BCP 47, § 2.2.9 is probably a better reference point, because it also contains the other restrictions present in IsStructurallyValidLanguageTag.

6.2.2 IsStructurallyValidLanguageTag:

The IsStructurallyValidLanguageTag abstract operation verifies that the locale argument (which must be a String value)

  • represents a well-formed "Unicode BCP 47 Locale Identifier" as specified in Unicode Technical Standard 35 section 3.2, or successor,
  • does not include duplicate variant subtags, and
  • does not include duplicate singleton subtags.

and BCP 47, § 2.2.9:

A tag is considered "valid" if it satisfies these conditions:

  • The tag is well-formed.
  • Either the tag is in the list of grandfathered tags or all of its primary language, extended language, script, region, and variant subtags appear in the IANA Language Subtag Registry as of the particular registry date.
  • There are no duplicate variant subtags.
  • There are no duplicate singleton (extension) subtags.

(The second bullet point isn't present in ECMA-402, because it'd require shipping an up-to-date language tag registry.)
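
The two duplicate checks that both lists share can be sketched in Python as follows. This is a simplified illustration, not the ECMA-402 algorithm: `has_no_duplicates` is a hypothetical name, the tag is assumed to be well-formed already, and the check only looks at variants in the language identifier part and at extension singletons.

```python
import re

# Shape of a variant subtag: 5-8 alphanumerics, or a digit plus 3 alphanumerics.
VARIANT = re.compile(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}")


def has_no_duplicates(tag: str) -> bool:
    """True if the tag has no duplicate variant and no duplicate singleton subtags."""
    subtags = tag.lower().split("-")
    if "x" in subtags:  # private-use subtags are not inspected
        subtags = subtags[: subtags.index("x")]
    # The language id ends at the first singleton (a one-character subtag), if any.
    first_singleton = next(
        (i for i, s in enumerate(subtags) if len(s) == 1), len(subtags)
    )
    variants = [s for s in subtags[1:first_singleton] if VARIANT.fullmatch(s)]
    singletons = [s for s in subtags[first_singleton:] if len(s) == 1]
    return len(set(variants)) == len(variants) and len(set(singletons)) == len(singletons)


print(has_no_duplicates("de-DE-1901-1901"))            # False
print(has_no_duplicates("en-u-ca-gregory-u-nu-latn"))  # False
print(has_no_duplicates("de-DE-1901-1996"))            # True
```

No registry lookup is involved, which matches the note above that ECMA-402 omits the second bullet point.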

@jswalden
Copy link
Collaborator

We discussed this today and concluded that the duplicate-variant concern does not have to be resolved immediately, and that it's okay if a published ECMA-402 edition ends up lagging the "living standard" spec.

I'll look into creating a PR to additionally forbid duplicate variants in tlang.

sffc added the "s: in progress" (Status: the issue has an active proposal) and "Small" (Smaller change solvable in a Pull Request) labels and removed the "s: discuss" (Status: TG2 must discuss to move forward) label on Jun 5, 2020
sffc added this to the ES 2021 milestone on Jun 5, 2020
sffc changed the milestone from ES 2021 to ES 2022 on Mar 22, 2021
sffc changed the milestone from ES 2022 to ES 2023 on Jun 1, 2022
sffc assigned ben-allen and unassigned jswalden on Sep 18, 2023
@sffc
Contributor

sffc commented Sep 18, 2023

@ben-allen to evaluate which, if any, of the items in the OP still need to be addressed.

@ben-allen
Contributor

ben-allen commented May 2, 2024

All of the above appear to be resolved by the following commits:

commit 1e5df59e7b6ee6fe549dec2429dcb71e19b0e368
Author: Leo Balter <[email protected]>
Date:   Thu Mar 14 16:06:55 2019 -0400

    Normative: Apply recommended updates for BCP 47 Locale Identifiers

    Ref #330

and

commit 378ba6f03aa36e2d4fa70c8e087bdb99e6ed1b20
Author: Jeff Walden <[email protected]>
Date:   Wed Feb 17 17:06:22 2021 -0800

    Do not allow duplicate variants within the tlang component of a transformed content extension. (#429)

I believe this one should be closed.

@ben-allen
Contributor

Closed because all but one of the bullet points have been addressed in PRs from 2019 and 2021. The remaining bullet point, on "sl-rozaj-biske" being reordered to "sl-biske-rozaj" against RFC 5646 rules, was resolved by the removal of RFC 5646 from the normative references. See:

commit 90bd833eda51047ce9b40c73ee753a2a1a08f971 (HEAD)
Author: André Bargull <[email protected]>
Date:   Mon Mar 16 02:28:15 2020 -0700

    Editorial: Replace more BCP 47 language tag with Unicode BCP 47 locale identifier

    Also remove the reference to BCP 47 RFCs in the normative references section.
