[8.2.0] Fix Unicode encoding issues in Bazel's use of Starlark #25451

bazel-io · 2025-03-04T16:04:00Z

Bazel internally uses `String` as a container for raw bytes assumed to be UTF-8, one byte per `char`, which differs from the ordinary use of `String` as a sequence of UTF-16 code units. This requires special implementations of certain Starlark functions that care about the notion of a "character":

* `{l,r,}strip` must not strip non-ASCII whitespace as it may be part of a UTF-8-encoded non-whitespace character.
* `json.decode` has to emit UTF-8 bytes rather than UTF-16 characters.
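
To illustrate the `strip` hazard, here is a minimal self-contained Java sketch. It is not Bazel's implementation: the class name, the helpers `toByteString`, `fromByteString`, and `rstrip`, and the two whitespace sets are hypothetical. It assumes the byte-per-`char` storage can be modeled by decoding UTF-8 bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteStringStripDemo {
  // Hypothetical whitespace sets, for illustration only.
  static final String ASCII_WS = " \t\n\r\f";
  static final String WITH_NBSP = ASCII_WS + '\u00A0'; // also treats U+00A0 as whitespace

  // Models the assumed internal form: each UTF-8 byte becomes one char (0x00-0xFF).
  static String toByteString(String unicode) {
    return new String(unicode.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
  }

  // Inverse of toByteString: reinterpret the chars as UTF-8 bytes.
  static String fromByteString(String byteString) {
    return new String(byteString.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
  }

  // Removes trailing chars that appear in the given whitespace set.
  static String rstrip(String s, String whitespace) {
    int end = s.length();
    while (end > 0 && whitespace.indexOf(s.charAt(end - 1)) >= 0) {
      end--;
    }
    return s.substring(0, end);
  }

  public static void main(String[] args) {
    // U+00E0 (a with grave) is encoded in UTF-8 as the bytes C3 A0.
    String stored = toByteString("\u00E0");
    System.out.println(stored.length()); // 2

    // Treating 0xA0 as whitespace strips the second byte of the character.
    String damaged = rstrip(stored, WITH_NBSP);
    System.out.println(damaged.length()); // 1, a lone 0xC3 byte remains

    // Stripping only ASCII whitespace leaves the character intact.
    String intact = rstrip(stored, ASCII_WS);
    System.out.println(fromByteString(intact).equals("\u00E0")); // true
  }
}
```

In the damaged case the remaining byte 0xC3 is no longer valid UTF-8, so decoding it cannot recover the original character.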

To avoid affecting other users of the Starlark interpreter, a new `StarlarkSemantics.INTERNAL_BAZEL_ONLY_UTF_8_BYTE_STRINGS` setting, defaulting to false but overridden to true for Bazel, determines which of the two behaviors to adopt.
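
As a rough sketch of how a boolean setting could select between the two behaviors for `json.decode`, the following Java example decodes a single `\uXXXX` escape either to ordinary UTF-16 or to one `char` per UTF-8 byte. The class, method, and parameter names are hypothetical and do not reflect the actual change or the `StarlarkSemantics` API. For brevity it handles only non-surrogate BMP code points.

```java
import java.nio.charset.StandardCharsets;

final class JsonEscapeDecodeSketch {
  /**
   * Decodes the four hex digits of a JSON unicode escape.
   *
   * @param hex4 the four hex digits following the backslash and 'u'
   * @param utf8ByteStrings hypothetical stand-in for the semantics setting
   */
  static String decodeUnicodeEscape(String hex4, boolean utf8ByteStrings) {
    int codePoint = Integer.parseInt(hex4, 16);
    String utf16 = new String(Character.toChars(codePoint));
    if (!utf8ByteStrings) {
      // Ordinary behavior: the result holds UTF-16 code units.
      return utf16;
    }
    // Byte-string behavior: emit one char per UTF-8 byte.
    StringBuilder out = new StringBuilder();
    for (byte b : utf16.getBytes(StandardCharsets.UTF_8)) {
      out.append((char) (b & 0xFF));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // U+00E9 (e with acute) is one UTF-16 code unit but two UTF-8 bytes (C3 A9).
    System.out.println(decodeUnicodeEscape("00e9", false).length()); // 1
    System.out.println(decodeUnicodeEscape("00e9", true).length()); // 2
  }
}
```

ASCII escapes decode identically under both values of the flag, so only non-ASCII content is affected by the choice.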

Compatibility is verified by running all script-based tests under both values of the setting.

Closes #24417.

PiperOrigin-RevId: 733329982
Change-Id: I3e1605be28a844ab52a3239a2f753a29d4eb217a

Commit fcd3d19

bazel-io requested a review from a team as a code owner March 4, 2025 16:04

bazel-io added the team-Starlark-Integration (Issues involving Bazel's integration with Starlark, excluding builtin symbols) and awaiting-review (PR is awaiting review from an assigned reviewer) labels Mar 4, 2025

bazel-io requested a review from tjgq March 4, 2025 16:04

bazel-io mentioned this pull request Mar 4, 2025

[8.2.0] Fix Unicode encoding issues in Bazel's use of Starlark #25357

Open

iancha1992 enabled auto-merge March 4, 2025 18:54
