[8.2.0] Fix Unicode encoding issues in Bazel's use of Starlark #25451

Open
wants to merge 1 commit into
base: release-8.2.0

@bazel-io bazel-io commented Mar 4, 2025

Bazel internally uses `String` as a container for raw bytes assumed to be UTF-8, which differs from ordinary usage of `String` as a container for UTF-16 characters. This requires special implementations of certain Starlark functions that care about the notion of a "character":

* `{l,r,}strip` must not strip non-ASCII whitespace as it may be part of a UTF-8-encoded non-whitespace character.
* `json.decode` has to emit UTF-8 bytes rather than UTF-16 characters.
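
To make the two bullet points concrete, here is a minimal Python sketch of the failure mode. It is an illustration only, not Bazel's Java implementation. It assumes the "raw bytes in a string" representation can be modeled by decoding UTF-8 bytes as Latin-1 (one character in U+0000..U+00FF per byte); the helper names `to_byte_string`, `from_byte_string`, `byte_safe_strip`, and `json_decode_string` are hypothetical, and the exact whitespace set is an assumption.

```python
import json

# Assumption: only these ASCII characters count as whitespace in byte-string mode.
ASCII_WHITESPACE = " \t\n\r\x0b\x0c"


def to_byte_string(text: str) -> str:
    """Model a byte string: one character (U+0000..U+00FF) per UTF-8 byte."""
    return text.encode("utf-8").decode("latin-1")


def from_byte_string(raw: str) -> str:
    """Recover the real text from the byte-string model."""
    return raw.encode("latin-1").decode("utf-8")


def byte_safe_strip(raw: str) -> str:
    """Strip ASCII whitespace only, so UTF-8 continuation bytes survive."""
    return raw.strip(ASCII_WHITESPACE)


def json_decode_string(doc: str) -> str:
    """Decode a JSON string literal and return it in the byte-string model."""
    value = json.loads(doc)
    if not isinstance(value, str):
        raise TypeError("this sketch only handles JSON string literals")
    return to_byte_string(value)


# "à" is U+00E0, encoded in UTF-8 as the bytes C3 A0.
raw = to_byte_string("à")
# A Unicode-aware strip treats the trailing 0xA0 as a no-break space and
# removes it, leaving a lone 0xC3 lead byte, which is invalid UTF-8.
corrupted = raw.strip()
# The ASCII-only strip leaves the two-byte sequence intact.
intact = byte_safe_strip(" " + raw + "\n")
```

In this model, `from_byte_string(intact)` round-trips back to `"à"`, while `from_byte_string(corrupted)` raises `UnicodeDecodeError`. Likewise, `json_decode_string('"\\u00e9"')` yields two byte-characters (C3 A9) instead of the single character U+00E9.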

To avoid affecting other users of the Starlark interpreter, a new `StarlarkSemantics.INTERNAL_BAZEL_ONLY_UTF_8_BYTE_STRINGS` setting, defaulting to false but overridden to true for Bazel, determines which of the two behaviors to adopt.
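
The gating described here can be pictured as a single boolean that selects between two implementations. The Python sketch below is hypothetical: the real setting is a Java `StarlarkSemantics` key, and the function name `starlark_strip`, the parameter `utf8_byte_strings`, and the ASCII whitespace set are assumptions made for illustration.

```python
# Assumption: the whitespace set used when strings hold raw UTF-8 bytes.
ASCII_WHITESPACE = " \t\n\r\x0b\x0c"


def starlark_strip(s: str, utf8_byte_strings: bool = False) -> str:
    """Pick the strip behavior based on a semantics-style flag.

    utf8_byte_strings=False: ordinary Unicode-aware strip (the default).
    utf8_byte_strings=True:  ASCII-only strip, safe for raw UTF-8 bytes.
    """
    if utf8_byte_strings:
        return s.strip(ASCII_WHITESPACE)
    return s.strip()


# Default mode: a no-break space (U+00A0) is whitespace and is removed.
default_result = starlark_strip("\u00a0x\u00a0")
# Byte-string mode: 0xA0 may be a UTF-8 continuation byte, so it is kept.
bazel_result = starlark_strip(" \xc3\xa0 ", utf8_byte_strings=True)
```

Keeping the default at `False` mirrors the stated goal of leaving other users of the interpreter unaffected: only a caller that opts in gets the byte-oriented behavior.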

Compatibility is verified by running all script-based tests under both values of the setting.

Closes #24417.

PiperOrigin-RevId: 733329982
Change-Id: I3e1605be28a844ab52a3239a2f753a29d4eb217a

Commit fcd3d19

@bazel-io bazel-io requested a review from a team as a code owner March 4, 2025 16:04
@bazel-io bazel-io added the team-Starlark-Integration and awaiting-review labels Mar 4, 2025
@bazel-io bazel-io requested a review from tjgq March 4, 2025 16:04
@iancha1992 iancha1992 enabled auto-merge March 4, 2025 18:54