Compiled dag #17
Conversation
Signed-off-by: Stephanie Wang <[email protected]>
python/ray/dag/dag_node.py (outdated)

    *args,
    _ray_cache_refs: bool = False,
    _ray_cache_actors: bool = True,
    compiled: bool = False,
Shall we make an experimental_compile() method that returns a compiled node type instead of adding an arg here? I think the arg is a bit confusing since it's not clear that we cache the compiled DAG on the first call to execute.
Yeah agree, I was thinking that too actually.
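For illustration, a minimal sketch of the API shape being discussed, assuming the method lands on the DAG node as experimental_compile(); the Echo actor and the exact execute signature are assumptions here, not the PR's final API:

import ray
from ray.dag import InputNode

@ray.remote
class Echo:
    def echo(self, x):
        return x

actor = Echo.remote()
with InputNode() as inp:
    dag = actor.echo.bind(inp)

compiled_dag = dag.experimental_compile()  # compile explicitly, once
ref = compiled_dag.execute(1)              # then reuse for repeated executions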
python/ray/dag/dag_node.py (outdated)

        self.cache_from_last_execute = executor.cache
        return result

    def destroy_compiled_dag(self):
        _, _, _, monitor = self.compiled()
        monitor.destroy()
Drop the fault tolerance/cancellation stuff from this PR?
Oops, missed this...
This reverts commit 9396810.
Signed-off-by: Stephanie Wang <[email protected]>
Q: is the perf benchmark result similar to or the same as before?
python/ray/dag/compiled_dag_node.py (outdated)

        # Find the (multi-)output node to the DAG.
        for idx, task in self.idx_to_task.items():
            if len(task.dependent_node_idxs) == 0:
NIT: this could be included in the previous loop?

for idx, task in self.idx_to_task.items():
    if isinstance(task.dag_node, InputNode):
        assert self.input_task_idx is None, "more than one InputNode found"
        self.input_task_idx = idx
    if len(task.dependent_node_idxs) == 0:
        assert self.output_task_idx is None, "More than one output node found"
        self.output_task_idx = idx
I just did this for readability. The performance here doesn't matter much since we only do it once.
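For contrast, a sketch of the equivalent two-pass version being referred to; the names follow the diff above, and the structure is illustrative only:

# Pass 1: find the single InputNode.
for idx, task in self.idx_to_task.items():
    if isinstance(task.dag_node, InputNode):
        assert self.input_task_idx is None, "more than one InputNode found"
        self.input_task_idx = idx

# Pass 2: find the (multi-)output node, i.e. the task nothing depends on.
for idx, task in self.idx_to_task.items():
    if len(task.dependent_node_idxs) == 0:
        assert self.output_task_idx is None, "More than one output node found"
        self.output_task_idx = idx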
python/ray/dag/compiled_dag_node.py (outdated)

        # Find the (multi-)output node to the DAG.
        for idx, task in self.idx_to_task.items():
            if len(task.dependent_node_idxs) == 0:
This part seems a bit confusing to me. Isn't dependent_node == arguments? Why does the output node have 0 dependent node idxs?
Ah, it's a reverse index for arguments: dependent_node refers to downstream tasks. I'll rename it to "downstream_node_idxs" to make that clearer.
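As a rough sketch of what that reverse index means (illustrative fragment only; downstream_node_idxs, dag_node_to_idx, and the argument iteration are assumptions beyond what the diff shows):

# Walking each task's arguments and recording the consumer on the producer
# builds the "downstream" index: output tasks end up with an empty list.
for idx, task in self.idx_to_task.items():
    for arg in task.dag_node.get_args():
        if isinstance(arg, DAGNode):
            upstream_idx = self.dag_node_to_idx[arg]
            self.idx_to_task[upstream_idx].downstream_node_idxs.append(idx)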
        )

        # Assign the task with the correct input and output buffers.
        worker_fn = task.dag_node._get_remote_method("__ray_call__")
Maybe we can make a wrapper?

def invoke_remote_method_from_task(task, func, *args, **kwargs):
    worker_fn = task.dag_node._get_remote_method("__ray_call__")
    return worker_fn.remote(func, args, kwargs)

something like this?
I think it's fine not to do this.
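Had the wrapper gone in, usage at this call site might have looked roughly like the following; do_exec_compiled_task and resolved_args are hypothetical placeholders, not the PR's actual names:

worker_task_refs.append(
    invoke_remote_method_from_task(
        task, do_exec_compiled_task, resolved_args, task.dag_node
    )
)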
""" | ||
|
||
|
||
class CompiledDAG: |
Should we add tests that validate the DAG itself (without running execute)?
- Convert a regular DAG and make sure the returned compiled DAG contains all of the regular DAG nodes + the correct worker_task_refs
- Test cases where exceptions are raised
Makes sense, I'll do 1.
2 is already done (see test_dag_errors).
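A hedged sketch of what test (1) could look like; the fixture, actor, and attribute names (idx_to_task, worker_task_refs, experimental_compile) are taken from this thread and the diffs above, but the exact assertions are assumptions:

import ray
from ray.dag import InputNode

@ray.remote
class Echo:
    def echo(self, x):
        return x

def test_compile_preserves_structure(ray_start_regular):
    a = Echo.remote()
    with InputNode() as inp:
        dag = a.echo.bind(inp)

    compiled_dag = dag.experimental_compile()  # assumed entry point from this thread
    # One task per DAG node: the InputNode plus the actor task.
    assert len(compiled_dag.idx_to_task) == 2
    # One long-running worker task per actor task.
    assert len(compiled_dag.worker_task_refs) == 1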
…ay-project#41686) After much discussion, the Data Processing tag is too general. Removing it from the Example Gallery. cc: @amogkam @scottjlee @pcmoritz Signed-off-by: angelinalg <[email protected]>
Subsections are appearing in the Serve Examples page as separate docs vs sections of the Streaming doc. Signed-off-by: angelinalg <[email protected]>
For a few builds that don't use rayci, we need to add code to upload build data to go/flaky Signed-off-by: can <[email protected]>
…oject#41785) ray-project#40127 removed the "Implementing a Custom Datasource" example because it used deprecated APIs. This PR introduces a new example that uses up-to-date APIs. --------- Signed-off-by: Balaji Veeramani <[email protected]>
) Support OutputNode. Allow creating a bind from a regular actor; this is needed because the actor must be reusable. Currently, you can have only 1 DAG per actor because the actor is the starting point of the DAG. This change makes a task, rather than the actor, the starting node of the DAG. It also allows more than one InputNode per actor. This PR also removes unused code.
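A minimal illustration of the usage this enables, assuming the multi-output node is exposed as MultiOutputNode and that actor method calls (rather than the actor itself) start the DAG; the Worker actor here is made up:

import ray
from ray.dag import InputNode, MultiOutputNode

@ray.remote
class Worker:
    def echo(self, x):
        return x

a = Worker.remote()
b = Worker.remote()

# Each actor can participate in more than one DAG, and a single InputNode
# can fan out to multiple outputs.
with InputNode() as inp:
    dag = MultiOutputNode([a.echo.bind(inp), b.echo.bind(inp)])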
See ray-project#41515. This updates to only compile new code on linux. OSX does not support shared memory semaphores, only named semaphores. --------- Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
…ples (ray-project#41758) Purging outdated or low-value examples from the Example Gallery. @pcmoritz @richardliaw --------- Signed-off-by: angelinalg <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
formatting only Signed-off-by: Lonnie Liu <[email protected]>
Part of a larger effort to better curate examples. This example appears twice, once as a Code Example, and a second time as a Tutorial. I thought that was redundant so I removed the duplication. Signed-off-by: angelinalg <[email protected]>
If a deployment is autoscaling and replicas take a long time to start, there is a bug that makes the state transition to (UPDATING, AUTOSCALING) which is a combination that should never occur. Instead, we should just update the message but not the status. --------- Signed-off-by: Cindy Zhang <[email protected]>
…or multi-node distributed training and checkpointing (ray-project#41832) Ray 2.7 removed support for using the head node as the persistent storage for checkpoints and artifacts in a multi-node distributed training. The alternative recommendation is to use cloud storage or a shared filesystem instead via `RunConfig(storage_path)`. Ray Train/Tune will error if the user attempts to checkpoint `ray.train.report(..., checkpoint=...)` from a worker that's on a remote node. This is because the new assumption is that all worker nodes have access to read/write from the same persistent storage, and the "head node local storage" is not accessible by all nodes. However, the error message that shows up is confusing. All nodes can technically access the local path in the message -- the problem is that not all nodes can access the SAME local path. This PR improves the error message to make this clearer and to suggest an actionable fix. This PR also updates most of the getting started user guides to mention the multi-node storage requirement and links to the storage user guide. --------- Signed-off-by: Justin Yu <[email protected]>
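A minimal sketch of the recommended setup, assuming cloud storage is available to all nodes; the bucket path, worker count, and trainer choice are placeholders:

from ray import train
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # ... training step ...
    # Metrics and checkpoints reported here go to storage_path, which every
    # worker node must be able to read and write.
    train.report({"loss": 0.0})

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(storage_path="s3://my-bucket/experiments"),
)
result = trainer.fit()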
…de `set_epoch` usage in all examples (ray-project#41807) `prepare_data_loader` adds a `DistributedSampler` to an existing pytorch `DataLoader` object. To do this, it recreates a `DataLoader` object and passes most arguments through from the original object, but also makes some implicit assumptions that are not configurable/visible to the caller. For example, if using just vanilla pytorch by itself, it's possible to do: `train_dataloader = DataLoader(..., shuffle=False, sampler=DistributedSampler(shuffle=True))`. Here, the `DataLoader` sets `shuffle=False`, but the `DistributedSampler` will still do a shuffle on every epoch so that the training data order is not always the same. The `shuffle=False` argument of the `DataLoader` is pretty much ignored because a custom sampler is supplied. **However, with Ray Train, since this `prepare_data_loader` utility injects the `DistributedSampler` for the user, there's no visibility on the `shuffle` parameter.** Ray Train will detect the `shuffle` parameter set on the *original* dataloader, then pass that along to the `DistributedSampler`. So, it's not possible to have this `False+True` situation. **Additionally, if `shuffle=True`, `DistributedSampler.set_epoch` must be called at the start of each epoch in order for the dataset ordering to be different for all workers *on every epoch.*** This is because the seed of the sampler is determined at the epoch start (`epoch seed = base random seed + epoch number`). Shuffling can be very important for training a model successfully -- if the data order remains the same every epoch, it's possible that training never converges (ex: we ran into this issue training resnet18 on imagenet). Signed-off-by: Justin Yu <[email protected]>
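For reference, a vanilla PyTorch illustration of why set_epoch matters (not Ray-specific); the tiny TensorDataset and explicit num_replicas/rank are only there so the sketch runs without init_process_group:

import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(100, dtype=torch.float32))
# num_replicas/rank are passed explicitly here; in real DDP they come from
# the initialized process group.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

for epoch in range(3):
    # Reseed the shuffle with (base seed + epoch); without this call, every
    # epoch iterates the data in exactly the same order.
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass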
Adding example code and descriptions for using gRPC context in Serve. --------- Signed-off-by: Gene Su <[email protected]> Signed-off-by: Gene Der Su <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Un-break CI. Signed-off-by: Stephanie Wang <[email protected]>
Why are these changes needed?
Related issue number
Checks
- I've run scripts/format.sh to lint the changes in this PR.