Bazel test doesn't work on Windows x64 machine #485

Open
steedmicro opened this issue Dec 14, 2023 · 8 comments

@steedmicro
Contributor

I tried running bazel test on a Windows x64 machine.
I started the server with cargo and it came up correctly:
cargo run --bin cas -- ./nativelink-config/examples/basic_cas.json

However, the bazel test command fails with the error message below.
Is this a known issue? I think we never added a Windows x64 configuration to the local execution build file.

Capture

@steedmicro steedmicro changed the title Bazel test doesn't work on Windows x64 machine. Bazel test doesn't work on Windows x64 machine Dec 14, 2023
@aaronmondal aaronmondal self-assigned this Dec 14, 2023
@aaronmondal
Member

Oh, I think this is caused by #471 and just got lost in the logs. We're now importing rules_cc in the WORKSPACE file and registering additional toolchains. I'm surprised that this affects builds that don't use --config=lre.

@aaronmondal aaronmondal added the bug Something isn't working label Dec 14, 2023
@steedmicro
Contributor Author

Hello, @aaronmondal .
Thanks for considering this issue.
I'd like to work on a Windows x64 machine from now on, since none of us are using Windows for development.
Previously, when I worked on Ubuntu, these errors didn't occur, so they are clearly platform-specific.
I'd like to work on this issue with you. ❤️

@steedmicro
Contributor Author

steedmicro commented Dec 15, 2023

Also, I found that running rustfmt via Bazel doesn't work on a Windows x64 machine either.
I'll work on that as well while setting up my development environment on Windows x64.

Capture

@aaronmondal
Member

aaronmondal commented Dec 15, 2023

I'd like to work on this issue with you. ❤️

I'd love to as well ❤️

Yesterday we ran into the same thing on macOS, which is currently not enabled in CI but was kinda working for ~4 days before #471 broke it 😅

For more context to what I think is happening:

#471 introduced rules_cc and some autogenerated nix-backed C++ toolchains that require --incompatible_enable_cc_toolchain_resolution to be set. I assumed that this wouldn't affect anything that doesn't use --config=lre. However, these lines seem to "override" the default resolution mechanism even in the default config:

register_execution_platforms(
    "@nativelink//local-remote-execution/generated/config:platform",
)
register_toolchains(
    "@nativelink//local-remote-execution/generated/config:cc-toolchain",
    "@nativelink//local-remote-execution/generated/java:all",
)

On Linux this isn't noticeable because the toolchain suite that triggers this is valid on Linux:

# This is the entry point for --crosstool_top. Toolchains are found
# by lopping off the name of --crosstool_top and searching for
# the "${CPU}" entry in the toolchains attribute.
cc_toolchain_suite(
    name = "toolchain",
    toolchains = {
        "k8|clang": ":cc-compiler-k8",
        "k8": ":cc-compiler-k8",
        "armeabi-v7a|compiler": ":cc-compiler-armeabi-v7a",
        "armeabi-v7a": ":cc-compiler-armeabi-v7a",
    },
)

On non-nix this toolchain doesn't get selected though and so the native-Bazel CI workflows keep working as expected. So far this is intended behavior.

The issue seems to be that the above cc_toolchain_suite is not valid on Windows and macOS. Usually I'd expect this not to be an issue because I'd assume that Bazel just "skips" this suite entirely. Also, the platform/toolchain associated with this toolchain suite is explicitly marked as compatible only with x86_64-linux:

package(default_visibility = ["//visibility:public"])

toolchain(
    name = "cc-toolchain",
    exec_compatible_with = [
        "@platforms//os:linux",
        "@platforms//cpu:x86_64",
        "@bazel_tools//tools/cpp:clang",
    ],
    target_compatible_with = [
        "@platforms//os:linux",
        "@platforms//cpu:x86_64",
    ],
    toolchain = "//local-remote-execution/generated/cc:cc-compiler-k8",
    toolchain_type = "@bazel_tools//tools/cpp:toolchain_type",
)

platform(
    name = "platform",
    constraint_values = [
        "@platforms//os:linux",
        "@platforms//cpu:x86_64",
        "@bazel_tools//tools/cpp:clang",
    ],
    exec_properties = {
        "container-image": "docker://nativelink-toolchain:h8andgczmivnnz4blabfqfya4bdqrnhg",
        "OSFamily": "Linux",
    },
    parents = ["@local_config_platform//:host"],
)
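
To make the "I'd expect Bazel to skip this" assumption concrete, here is a rough Python model of constraint-based toolchain selection. It is only a sketch of the expected behavior, not Bazel's actual implementation: it assumes the first registered toolchain whose constraints are all satisfied wins, the constraint strings are shortened, and the autodetected-host entry is a hypothetical stand-in for whatever Bazel autodetects on the host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Toolchain:
    name: str
    exec_compatible_with: frozenset
    target_compatible_with: frozenset


def resolve(toolchains, exec_constraints, target_constraints):
    """Return the name of the first toolchain whose constraints all hold.

    Simplified model: registration order decides priority, and a toolchain
    matches only if every constraint it lists is present on the platform.
    """
    for tc in toolchains:
        if (tc.exec_compatible_with <= exec_constraints
                and tc.target_compatible_with <= target_constraints):
            return tc.name
    return None


LINUX_X86 = frozenset({"os:linux", "cpu:x86_64"})

# Mirrors the constraints of the cc-toolchain target above.
lre = Toolchain(
    name="lre-cc-toolchain",
    exec_compatible_with=LINUX_X86 | {"cpp:clang"},
    target_compatible_with=LINUX_X86,
)
# Hypothetical stand-in for an autodetected host toolchain (no constraints).
host = Toolchain("autodetected-host", frozenset(), frozenset())

registered = [lre, host]
lre_platform = LINUX_X86 | {"cpp:clang"}
windows_host = frozenset({"os:windows", "cpu:x86_64"})

print(resolve(registered, lre_platform, LINUX_X86))     # lre-cc-toolchain
print(resolve(registered, windows_host, windows_host))  # autodetected-host
```

Under this model the LRE toolchain is never picked on a Windows host, which is why the observed failure is surprising.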


There was a Bazel issue, which I can't find anymore, suggesting that this might be fixed in Bazel 7.0.0. I think it's worth investigating that. Otherwise it might work if we register additional toolchains for Windows and macOS which wrap regular toolchain autodetection.

Also, one note for testing: the workspace is read serially. Moving things like the register_* and rules_cc calls around does influence how this issue manifests, but so far I haven't found an import order that satisfies all use cases 😆

@steedmicro
Contributor Author

steedmicro commented Dec 15, 2023

Thanks for your reply, @aaronmondal.
Just FYI, I did some testing on my machine after rolling back the Local Remote Execution changes to see what happens.
This time bazel test didn't raise the error, but it failed during execution with the following messages.
I'd like to know whether we've done any Windows testing before. Thanks.

Capture
Capture

@aaronmondal
Member

@steed924 Yes, the previous runs on main where the native Windows Bazel build worked are here: https://github.com/TraceMachina/nativelink/actions/workflows/native-bazel.yaml?query=branch%3Amain

However, we did not run integration tests on Windows. Here's what's happening:

  • Bazel rules internally construct actions which are more or less shell commands that create a sandbox with some environment variables and an action command.
  • When you use remote execution you construct the "action string" locally and then Nativelink forwards that to some remote worker.
  • In your case that remote worker is your local machine.
  • The command failing means that the build environment is most likely not hermetic, or at least that the local environment and the remote execution environment don't perfectly align. Locally you have the files available, and so the action string contains local paths to them. But remotely these files don't exist, and so the command fails when invoked in this sandbox.
  • This issue was kind of hard to see because the current integration tests (on linux) brute-force mount local directories into the remote environment. That is, the tests don't actually create a "true" remote execution environment but a hybrid that has visibility into the local filesystem.
  • Introduce Local Remote Execution #471 and Add Kubernetes example #479 address this by creating truly encapsulated remote execution environments where the remote execution container is completely airgapped from the local environment. This also exposed some issues in rules_rust (Local remote toolchains don't work with rules_rust #477) and building nlink in an airgapped remote execution setup is currently not possible. The LRE setup only works for C++.
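
As a rough illustration of the "action string contains local paths" point, here is a small Python sketch that flags host-absolute paths in an action's command line and environment. The function name and the example commands are hypothetical, and this is not how Bazel or Nativelink represent actions internally; it only shows the kind of reference that breaks once the action runs somewhere else.

```python
from pathlib import PurePosixPath, PureWindowsPath


def host_path_references(argv, env):
    """Return argv/env values that are absolute on POSIX or on Windows.

    An absolute path points at the machine that constructed the action, so
    it is a hint that the action depends on files outside its declared inputs.
    """
    leaks = []
    for value in list(argv) + list(env.values()):
        if (PurePosixPath(value).is_absolute()
                or PureWindowsPath(value).is_absolute()):
            leaks.append(value)
    return leaks


# Non-hermetic: the compiler and PATH both refer to the host filesystem.
print(host_path_references(
    ["/usr/bin/clang", "-c", "main.cc", "-o", "main.o"],
    {"PATH": "/usr/bin:/bin"},
))

# Hermetic-looking: everything is relative to the sandbox root.
print(host_path_references(
    ["external/toolchain/bin/clang", "-c", "main.cc", "-o", "main.o"],
    {},
))
```

The first call reports both the compiler path and the PATH value; the second reports nothing.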

The only environment that accurately constructs a remote execution setup is the K8s example. That doesn't work on Windows and we can't make that exact setup compatible with Windows because Nix only works in WSL2 but not on native Windows.

However, what we can do is distribute a C++ toolchain for Windows as part of the Bazel build which overrides any local compilers. This way we get this "airgapped" setup on Windows as well. Once we ship the toolchain we're no longer susceptible to differing host toolchain configurations, and I think that's the way to fix this issue.

FYI, the workaround currently used in the (Linux) production deployments is to run nativelink in container images that "just happen" to have all relevant dependencies in the right places. For instance, with an image built from FROM ubuntu plus apt install clang you'd get "some" compiler at /usr/bin/clang that just happens to work. That compiler isn't binary-identical to the one on your hypothetical local Ubuntu machine, but the paths are the same and so things still work.

Note that this is actually quite bad because it means that even in the cases that currently work the same action command will create different artifacts on your local machine and a remote machine. In terms of reproducibility/hermeticity I'd consider this a "critical silent bug" for all existing deployments except the LRE setup.
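
One way to surface this kind of silent divergence is to compare content digests of the outputs that the same action produces in each environment. The sketch below is only an illustration: the helper names are hypothetical and the byte strings are placeholders standing in for real object files.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of an output's contents."""
    return hashlib.sha256(data).hexdigest()


def differing_outputs(local, remote):
    """Return sorted output paths whose digests differ or exist on one side only."""
    paths = set(local) | set(remote)
    return sorted(p for p in paths if local.get(p) != remote.get(p))


# Placeholder contents; real inputs would be the bytes of the built artifacts.
local_outputs = {"main.o": digest(b"built with the host clang")}
remote_outputs = {"main.o": digest(b"built with the container clang")}

print(differing_outputs(local_outputs, remote_outputs))  # ['main.o']
```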

@aaronmondal
Member

Oh, and regarding your initial question: remote execution on Windows was not tested and I don't think it ever worked 😅 We'll have to implement a Windows toolchain for remote execution. From some initial googling I couldn't find any existing remote-exec compatible Windows toolchains.

@p00f

p00f commented Aug 24, 2024

@aaronmondal
Hi, I have a similar error (note that it fails to find the params file):

Error { code: NotFound, messages: ["The system cannot find the path specified. (os error 3)", "Could not execute command [\"external/MINGW_TOOLCHAIN/bin/g++.exe\", \"@bazel-out/windows_platform-fastbuild/bin/xxxxx/_objs/TARGET_NAME/OBJECT_NAME.o.params\"]"] }

The command failing means that the build environment is most likely not hermetic, or at least that the local environment and the remote execution environment don't perfectly align. Locally you have the files available, and so the action string contains these strings to local paths. But remotely these files don't exist and so the command fails when invoked in this sandbox.

I don't think so, at least in my case - there is such a file in the directory specified in the config file's workers.local.work_directory/SOME_HASH/work, except without the @ in the beginning.

Also, the hash in workers.local.work_directory/SOME_HASH/work changes periodically. Is that causing the error?
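
On the missing @: GCC-style drivers treat an argument of the form @file as a response file and read further arguments from file, so a params file stored on disk without the leading @ is what one would expect. The Python sketch below models that expansion. It is a simplification with a hypothetical file name, and it says nothing about which path the NotFound error actually refers to.

```python
import shlex
import tempfile
from pathlib import Path


def expand_params(argv, cwd):
    """Replace each "@file" argument with the arguments stored in that file.

    Simplified: real drivers have their own quoting rules, and shlex's POSIX
    mode would mangle backslash-separated Windows paths.
    """
    expanded = []
    for arg in argv:
        if arg.startswith("@"):
            # The on-disk name is the argument without the leading "@".
            params_file = Path(cwd) / arg[1:]
            expanded.extend(shlex.split(params_file.read_text()))
        else:
            expanded.append(arg)
    return expanded


with tempfile.TemporaryDirectory() as work_dir:
    # Hypothetical params file; the real one lives in the worker's work directory.
    (Path(work_dir) / "foo.o.params").write_text("-c\nfoo.cc\n-o\nfoo.o\n")
    print(expand_params(["g++", "@foo.o.params"], work_dir))
    # ['g++', '-c', 'foo.cc', '-o', 'foo.o']
```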
