Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiled dag #17

Closed
wants to merge 91 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
91 commits
Select commit Hold shift + click to select a range
3044247
ip
Nov 16, 2023
8c5efd8
basic working.
Nov 17, 2023
664b07a
enhancement
Nov 17, 2023
8f6f8d2
working now.
Nov 17, 2023
14c3a44
Merge remote-tracking branch 'sang/dag-api' into compiled-dag
stephanie-wang Dec 2, 2023
888950a
Merge remote-tracking branch 'upstream/master' into compiled-dag
stephanie-wang Dec 2, 2023
bdfbb8a
tmp
stephanie-wang Dec 2, 2023
b6a66f2
Merge branch 'mutable-objects-2' into compiled-dag
stephanie-wang Dec 2, 2023
b6150a3
scatter-gather DAG works
stephanie-wang Dec 5, 2023
5336262
fix
stephanie-wang Dec 5, 2023
e88c40f
fix
stephanie-wang Dec 6, 2023
95e871b
compile?
stephanie-wang Dec 6, 2023
13b1d53
Merge branch 'mutable-objects-2' into compiled-dag
stephanie-wang Dec 6, 2023
e54972b
unit test
stephanie-wang Dec 6, 2023
5fbfac5
Merge branch 'mutable-objects-2' into compiled-dag
stephanie-wang Dec 6, 2023
9396810
tmp
stephanie-wang Dec 6, 2023
950bbb4
Merge branch 'mutable-objects-2' into compiled-dag
stephanie-wang Dec 7, 2023
494cb53
Revert "tmp"
stephanie-wang Dec 7, 2023
05b002f
cleanup
stephanie-wang Dec 7, 2023
521c73b
Support no-OutputNode DAGs
stephanie-wang Dec 7, 2023
5b58250
Support non-DAG args
stephanie-wang Dec 7, 2023
b5beca4
errors
stephanie-wang Dec 7, 2023
cc2e795
lint
stephanie-wang Dec 7, 2023
c17c367
doc
stephanie-wang Dec 7, 2023
b1f3f34
Merge branch 'mutable-objects-2' into compiled-dag
stephanie-wang Dec 8, 2023
563f7d8
[docs] [data] remove Data Processing tag from sidenav and examples (#…
angelinalg Dec 8, 2023
7001982
[RLlib] Add and enhance fault-tolerance tests for APPO. (#40743)
sven1977 Dec 8, 2023
aa86ef6
[docs][serve] Fix subsection title hierarchy (#41746)
angelinalg Dec 8, 2023
2487553
[Serve] allow gRPC deployment to use grpc context (#41667)
GeneDer Dec 8, 2023
f783234
[serve] Update replica ID format (and other random strings) to better…
edoakes Dec 8, 2023
d6d2689
[Serve] Update system_logging_config to logging_config (#41749)
sihanwang41 Dec 8, 2023
567e574
[Serve] Patching ActorProxyWrapper to properly handle `is_drained` RP…
alexeykudinkin Dec 8, 2023
208e452
[core] retryable exceptions for method (#41194)
rynewang Dec 8, 2023
1dffb4d
[Core] Fix core worker client pool leak (#41535)
jjyao Dec 9, 2023
2f29500
[Core] Fix check failure when sync and async tasks are mixed up (#41724)
rkooo567 Dec 9, 2023
ab36f94
[Core] Speed up single_client_tasks_sync (#41743)
jjyao Dec 9, 2023
6b9934e
[core][task-backend] Add unit testing for gcs task manager on job fin…
rickyyx Dec 9, 2023
284617d
[ci][wins/3] add --operating-system flag to rayci (#41636)
can-anyscale Dec 9, 2023
7dde158
cleanup
stephanie-wang Dec 9, 2023
63cc16d
cleanup
stephanie-wang Dec 9, 2023
dca1239
perf
stephanie-wang Dec 9, 2023
bacb0a8
[Doc] Fix the linkcheck CI job (#41598)
peytondmurray Dec 9, 2023
956317a
[ci] turn off verbose mode when zipping logs (#41760)
can-anyscale Dec 9, 2023
84ba971
[ci] always update apt in dashboard tests (#41754)
can-anyscale Dec 9, 2023
b71e1c3
[CI] Clean the dev env after build wheels for macOS (#41694)
jiwq Dec 9, 2023
2307fd1
[RLlib] Fix `policy_to_train` logic for new API stack generically for…
sven1977 Dec 9, 2023
bfa35fd
[Core][CI] Mark test_runtime_env_working_dir_3 flaky
rkooo567 Dec 9, 2023
cb5bb4e
[core] Support mutable plasma objects (#41515)
stephanie-wang Dec 9, 2023
7b8472b
add normal DAG
stephanie-wang Dec 9, 2023
d94c485
Merge remote-tracking branch 'upstream/master' into compiled-dag
stephanie-wang Dec 9, 2023
740169b
x
stephanie-wang Dec 9, 2023
603d1a1
[core] Add an internal system concurrency group for executing compil…
ericl Dec 10, 2023
5edabc7
[lint] format build files (#41766)
aslonnie Dec 10, 2023
23972da
[core] Small redis context cleanup (#41776)
vitsai Dec 10, 2023
6fe398f
[lint] build file format for src dir (#41767)
aslonnie Dec 10, 2023
d7926fa
Revert "[core] Support mutable plasma objects (#41515)" (#41784)
jjyao Dec 11, 2023
23d3bfc
[ci] unbreak master (#41793)
can-anyscale Dec 11, 2023
3d5ffca
Merge branch 'mutable-objects-2' into compiled-dag
stephanie-wang Dec 11, 2023
62fbf48
[Data] Fix unused `columns` parameter in `read_parquet_bulk` (#41806)
bveeramani Dec 11, 2023
171b828
[core] Refactor Redis code (#41771)
vitsai Dec 12, 2023
5958771
[data] Optimize OpState.outqueue_num_blocks (#41748)
raulchen Dec 12, 2023
f7a2b33
[serve] Remove `controller_name` across all components (#41733)
edoakes Dec 12, 2023
1716b79
Remove unnecessary test file in wheel (#41775)
fishbone Dec 12, 2023
b7ed648
[ci] upload build data to go/flaky (#41796)
can-anyscale Dec 12, 2023
ee10ea6
[Data] Add example of how to read and write custom file types (#41785)
bveeramani Dec 12, 2023
2839644
[Core] Improve DAG API to support tensor parallel DAG (#41231)
rkooo567 Dec 12, 2023
1a090a0
Re-merge mutable objects (#41515) (#41789)
stephanie-wang Dec 12, 2023
8dde781
Merge remote-tracking branch 'upstream/master' into compiled-dag
stephanie-wang Dec 12, 2023
905a5bc
merge
stephanie-wang Dec 12, 2023
4436b1f
revert
stephanie-wang Dec 12, 2023
f105ed5
revert
stephanie-wang Dec 12, 2023
00f3f1c
x
stephanie-wang Dec 12, 2023
257457d
buffer size bytes
stephanie-wang Dec 12, 2023
e8273ea
[docs][serve][tune] Remove serve/tutorial/rllib and tune-sklearn exam…
angelinalg Dec 12, 2023
75b12f0
[lint] buildifier format train build files (#41833)
aslonnie Dec 12, 2023
7ea9912
[lint] buildifier format tune docs build files (#41834)
aslonnie Dec 12, 2023
907985a
remove duplicate of huggingface_vit_batch_prediction (#41738)
angelinalg Dec 12, 2023
fb521a9
[serve] Fix bug in deployment state machine (#41799)
zcin Dec 12, 2023
bf67cb1
[doc][train] Clarify error message when trying to use local storage f…
justinvyu Dec 12, 2023
6b65b56
[doc][train] Clarify `prepare_data_loader` shuffle behavior and inclu…
justinvyu Dec 12, 2023
b3b79d2
[Serve] Add gRPC context related docs (#41783)
GeneDer Dec 13, 2023
71c32ae
optional
stephanie-wang Dec 13, 2023
2ba93f0
x
stephanie-wang Dec 13, 2023
3a4b2f4
Merge remote-tracking branch 'upstream/master' into compiled-dag
stephanie-wang Dec 13, 2023
0b0431c
fix (#41854)
stephanie-wang Dec 13, 2023
973ba68
Merge remote-tracking branch 'upstream/master' into compiled-dag
stephanie-wang Dec 13, 2023
ac5fa55
x
stephanie-wang Dec 13, 2023
35a37fd
lint?
stephanie-wang Dec 13, 2023
1326331
test
stephanie-wang Dec 13, 2023
fadec07
API
stephanie-wang Dec 13, 2023
ff19557
x
stephanie-wang Dec 13, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 3 additions & 4 deletions .buildkite/build.rayci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -77,8 +77,7 @@ steps:
- label: ":tapioca: build: doc"
instance_type: medium
commands:
- cd doc
- FAST=True make html
- FAST=True make -C doc/ html
depends_on: docbuild
job_env: docbuild

Expand All @@ -90,7 +89,7 @@ steps:
instance_type: medium
commands:
- bazel run //ci/ray_ci:build_in_docker -- docker --python-version {{matrix}}
--platform cu11.5.2 --platform cu11.6.2 --platform cu11.7.1
--platform cu11.5.2 --platform cu11.6.2 --platform cu11.7.1
--platform cu11.8.0 --platform cu12.1.1 --platform cpu
--image-type ray
--upload
Expand All @@ -114,7 +113,7 @@ steps:
instance_type: medium-arm64
commands:
- bazel run //ci/ray_ci:build_in_docker -- docker --python-version {{matrix}}
--platform cu11.5.2 --platform cu11.6.2 --platform cu11.7.1
--platform cu11.5.2 --platform cu11.6.2 --platform cu11.7.1
--platform cu11.8.0 --platform cu12.1.1 --platform cpu
--image-type ray
--architecture aarch64
Expand Down
2 changes: 2 additions & 0 deletions .buildkite/core.rayci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -45,8 +45,10 @@ steps:
tags:
- python
- oss
- skip-on-premerge
instance_type: medium
commands:
- cleanup() { if [ "${BUILDKITE_PULL_REQUEST}" = "false" ]; then ./ci/build/upload_build_info.sh; fi }; trap cleanup EXIT
- (cd dashboard/client && npm ci && npm run build)
- pip install -e python[client]
- bazel test --config=ci --jobs=1 $(./ci/run/bazel_export_options)
Expand Down
35 changes: 33 additions & 2 deletions .buildkite/lint.rayci.yml
Original file line number Diff line number Diff line change
@@ -1,63 +1,94 @@
group: lint
depends_on:
- forge
steps:
- label: ":lint-roller: lint: clang format"
depends_on:
- forge
commands:
- pip install -c python/requirements_compiled.txt clang-format
- ./ci/lint/check-git-clang-format-output.sh

- label: ":lint-roller: lint: code format"
depends_on:
- forge
commands:
- pip install -c python/requirements_compiled.txt -r python/requirements/lint-requirements.txt
- FORMAT_SH_PRINT_DIFF=1 ./ci/lint/format.sh --all-scripts

- label: ":lint-roller: lint: untested code snippet"
depends_on:
- forge
commands:
- pip install -c python/requirements_compiled.txt semgrep
- semgrep ci --config semgrep.yml

- label: ":lint-roller: lint: banned words"
depends_on:
- forge
commands:
- ./ci/lint/check-banned-words.sh

- label: ":lint-roller: lint: doc readme"
depends_on:
- forge
commands:
- pip install -c python/requirements_compiled.txt docutils
- cd python && python setup.py check --restructuredtext --strict --metadata

- label: ":lint-roller: lint: dashboard format"
depends_on:
- forge
commands:
- ./ci/lint/check-dashboard-format.sh

- label: ":lint-roller: lint: copyright format"
depends_on:
- forge
commands:
- ./ci/lint/copyright-format.sh -c

- label: ":lint-roller: lint: bazel team"
depends_on:
- forge
commands:
- bazel query 'kind("cc_test", //...)' --output=xml | python ./ci/lint/check-bazel-team-owner.py
- bazel query 'kind("py_test", //...)' --output=xml | python ./ci/lint/check-bazel-team-owner.py

- label: ":lint-roller: lint: bazel buildifier"
depends_on:
- forge
commands:
- ./ci/lint/check-bazel-buildifier.sh

- label: ":lint-roller: lint: pytest format"
depends_on:
- forge
commands:
- pip install -c python/requirements_compiled.txt yq
- ./ci/lint/check-pytest-format.sh

- label: ":lint-roller: lint: test coverage"
depends_on:
- forge
commands:
- python ci/pipeline/check-test-run.py

- label: ":lint-roller: lint: api annotations"
depends_on:
- forge
instance_type: medium
commands:
- RAY_DISABLE_EXTRA_CPP=1 pip install -e python/[all]
- ./ci/lint/check_api_annotations.py

- label: ":lint-roller: lint: documentation style"
depends_on:
- forge
commands:
- ./ci/lint/check-documentation-style.sh

- label: ":lint-roller: lint: linkcheck"
instance_type: medium
command: make -C doc/ linkcheck
depends_on: docbuild
job_env: docbuild
tags: skip-on-premerge
2 changes: 1 addition & 1 deletion .buildkite/pipeline.macos.yml
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ prelude_commands: &prelude_commands |-
epilogue_commands: &epilogue_commands |-
# Persist ray logs
mkdir -p /tmp/artifacts/.ray/
tar -cvzf /tmp/artifacts/.ray/logs.tgz /tmp/ray
tar -czf /tmp/artifacts/.ray/logs.tgz /tmp/ray
# Cleanup runtime environment to save storage
rm -rf /tmp/ray || true
# Cleanup local caches - this should not clean up global disk cache
Expand Down
7 changes: 0 additions & 7 deletions .buildkite/pipeline.test.yml

This file was deleted.

13 changes: 13 additions & 0 deletions BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -1537,6 +1537,19 @@ ray_cc_test(
],
)

ray_cc_test(
name = "core_worker_client_pool_test",
size = "small",
srcs = [
"src/ray/rpc/worker/test/core_worker_client_pool_test.cc",
],
tags = ["team:core"],
deps = [
":worker_rpc",
"@com_google_googletest//:gtest_main",
],
)

ray_cc_test(
name = "gcs_server_rpc_test",
size = "small",
Expand Down
1 change: 0 additions & 1 deletion ci/ray_ci/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,6 @@ py_test(
],
)


py_test(
name = "test_utils",
size = "small",
Expand Down
1 change: 1 addition & 0 deletions ci/ray_ci/rllib.tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ flaky_tests:
- //rllib:learning_tests_two_step_game_qmix
- //rllib:learning_tests_pendulum_cql
- //rllib:learning_tests_two_step_game_qmix_no_mixer
- //rllib:learning_tests_cartpole_crashing_and_stalling_appo
# algorithm, model and other tests
- //rllib:test_bc
- //rllib:test_a3c
Expand Down
24 changes: 23 additions & 1 deletion ci/ray_ci/test_tester.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
import pytest

from ci.ray_ci.linux_tester_container import LinuxTesterContainer
from ci.ray_ci.windows_tester_container import WindowsTesterContainer
from ci.ray_ci.tester import (
_add_default_except_tags,
_get_container,
Expand Down Expand Up @@ -34,12 +35,33 @@ def test_get_container() -> None:
with mock.patch(
"ci.ray_ci.linux_tester_container.LinuxTesterContainer.install_ray",
return_value=None,
), mock.patch(
"ci.ray_ci.windows_tester_container.WindowsTesterContainer.install_ray",
return_value=None,
):
container = _get_container("core", 3, 1, 2, 0)
container = _get_container(
team="core",
operating_system="linux",
workers=3,
worker_id=1,
parallelism_per_worker=2,
gpus=0,
)
assert isinstance(container, LinuxTesterContainer)
assert container.docker_tag == "corebuild"
assert container.shard_count == 6
assert container.shard_ids == [2, 3]

container = _get_container(
team="serve",
operating_system="windows",
workers=3,
worker_id=1,
parallelism_per_worker=2,
gpus=0,
)
assert isinstance(container, WindowsTesterContainer)


def test_get_test_targets() -> None:
_TEST_YAML = "flaky_tests: [//python/ray/tests:flaky_test_01]"
Expand Down
45 changes: 35 additions & 10 deletions ci/ray_ci/tester.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
DEFAULT_ARCHITECTURE,
)
from ci.ray_ci.linux_tester_container import LinuxTesterContainer
from ci.ray_ci.windows_tester_container import WindowsTesterContainer
from ci.ray_ci.tester_container import TesterContainer
from ci.ray_ci.utils import docker_login

Expand Down Expand Up @@ -135,12 +136,19 @@
),
default="optimized",
)
@click.option(
"--operating-system",
default="linux",
type=click.Choice(["linux", "windows"]),
help=("Operating system to run tests on"),
)
def main(
targets: List[str],
team: str,
workers: int,
worker_id: int,
parallelism_per_worker: int,
operating_system: str,
except_tags: str,
only_tags: str,
run_flaky_tests: bool,
Expand All @@ -155,14 +163,18 @@ def main(
if not bazel_workspace_dir:
raise Exception("Please use `bazelisk run //ci/ray_ci`")
os.chdir(bazel_workspace_dir)
docker_login(_DOCKER_ECR_REPO.split("/")[0])
# TODO(can): only linux uses intermediate dockers publised to ECR; remove this when
# we can build intermediate dockers for Windows
if operating_system == "linux":
docker_login(_DOCKER_ECR_REPO.split("/")[0])

if build_type == "wheel" or build_type == "wheel-aarch64":
# for wheel testing, we first build the wheel and then use it for running tests
architecture = DEFAULT_ARCHITECTURE if build_type == "wheel" else "aarch64"
BuilderContainer(DEFAULT_PYTHON_VERSION, DEFAULT_BUILD_TYPE, architecture).run()
container = _get_container(
team,
operating_system,
workers,
worker_id,
parallelism_per_worker,
Expand Down Expand Up @@ -195,6 +207,7 @@ def _add_default_except_tags(except_tags: str) -> str:

def _get_container(
team: str,
operating_system: str,
workers: int,
worker_id: int,
parallelism_per_worker: int,
Expand All @@ -208,15 +221,27 @@ def _get_container(
shard_start = worker_id * parallelism_per_worker
shard_end = (worker_id + 1) * parallelism_per_worker

return LinuxTesterContainer(
build_name or f"{team}build",
test_envs=test_env,
shard_count=shard_count,
shard_ids=list(range(shard_start, shard_end)),
gpus=gpus,
skip_ray_installation=skip_ray_installation,
build_type=build_type,
)
if operating_system == "linux":
return LinuxTesterContainer(
build_name or f"{team}build",
test_envs=test_env,
shard_count=shard_count,
shard_ids=list(range(shard_start, shard_end)),
gpus=gpus,
skip_ray_installation=skip_ray_installation,
build_type=build_type,
)

if operating_system == "windows":
return WindowsTesterContainer(
build_name or f"{team}build",
test_envs=test_env,
shard_count=shard_count,
shard_ids=list(range(shard_start, shard_end)),
skip_ray_installation=skip_ray_installation,
)

assert False, f"Unsupported operating system: {operating_system}"


def _get_tag_matcher(tag: str) -> str:
Expand Down
9 changes: 7 additions & 2 deletions cpp/src/ray/runtime/task/local_mode_task_submitter.cc
Original file line number Diff line number Diff line change
Expand Up @@ -85,8 +85,13 @@ ObjectID LocalModeTaskSubmitter::Submit(InvocationSpec &invocation,
TaskID::ForActorCreationTask(invocation.actor_id);
const ObjectID actor_creation_dummy_object_id =
ObjectID::FromIndex(actor_creation_task_id, 1);
builder.SetActorTaskSpec(
invocation.actor_id, actor_creation_dummy_object_id, invocation.actor_counter);
// NOTE: Ray CPP doesn't support retries and retry_exceptions.
builder.SetActorTaskSpec(invocation.actor_id,
actor_creation_dummy_object_id,
/*max_retries=*/0,
/*retry_exceptions=*/false,
/*serialized_retry_exception_allowlist=*/"",
invocation.actor_counter);
} else {
throw RayException("unknown task type");
}
Expand Down
7 changes: 7 additions & 0 deletions cpp/src/ray/runtime/task/native_task_submitter.cc
Original file line number Diff line number Diff line change
Expand Up @@ -69,10 +69,17 @@ ObjectID NativeTaskSubmitter::Submit(InvocationSpec &invocation,
options.generator_backpressure_num_objects = -1;
std::vector<rpc::ObjectReference> return_refs;
if (invocation.task_type == TaskType::ACTOR_TASK) {
// NOTE: Ray CPP doesn't support per-method max_retries and retry_exceptions
const auto native_actor_handle = core_worker.GetActorHandle(invocation.actor_id);
int max_retries = native_actor_handle->MaxTaskRetries();

auto status = core_worker.SubmitActorTask(invocation.actor_id,
BuildRayFunction(invocation),
invocation.args,
options,
max_retries,
/*retry_exceptions=*/false,
/*serialized_retry_exception_allowlist=*/"",
return_refs);
if (!status.ok()) {
return ObjectID::Nil();
Expand Down
4 changes: 3 additions & 1 deletion dashboard/BUILD
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,6 @@ py_test_run_all_subdirectory(
include = ["**/test*.py"],
exclude = [
"client/node_modules/**",
"modules/test/**",
"modules/job/tests/test_cli_integration.py",
"modules/job/tests/test_http_job_server.py",
"modules/node/tests/test_node.py",
Expand Down Expand Up @@ -87,18 +86,21 @@ py_test(
size = "small",
srcs = ["tests/test_state_head.py"],
tags = ["team:core"],
deps = [":conftest"],
)

py_test(
name = "test_serve_dashboard",
size = "large",
srcs = ["modules/serve/tests/test_serve_dashboard.py"],
tags = ["team:serve"],
deps = [":conftest"],
)

py_test(
name = "test_data_head",
size = "small",
srcs = ["modules/data/tests/test_data_head.py"],
tags = ["team:data"],
deps = [":conftest"],
)
Loading