
RFC: Accelerated DAGs #48

Merged
merged 11 commits into main from accelerated-dag on Mar 21, 2024
Conversation

@stephanie-wang (Contributor)

No description provided.

@stephanie-wang changed the title from "wip" to "RFC: Accelerated DAGs" on Dec 4, 2023
@stephanie-wang marked this pull request as ready for review on December 4, 2023
@ericl (Contributor) commented Dec 7, 2023 via email

@stephanie-wang (Contributor, Author) commented:

Another TODO: add more detail on the interaction with load balancing and autoscaling, ideally with some pseudocode.

@edoakes (Contributor) left a comment:

Overall the motivation makes a lot of sense. Right now, using the Ray API as the dataplane for ML workloads is a performance loss compared to other libraries, and we should close that gap.

From a serving perspective, there are basically three categories of communication that happen (excluding the pattern used by vLLM where you use multiple GPUs for a single logical replica):

(1) Point-to-point communication passing small messages between two groups of actors (often colocated on the same node). This is the pattern for HTTP proxy -> replica communication and many user-defined applications doing basic model composition.

Requirements:

  • Backpressure, load balancing (including adding/removing actors), pipelining.

Non-requirements:

  • Dynamic control flow
  • Passing output to a downstream actor.

(2) An extension of (1) where there is composition across multiple groups of actors (for example A calls B and passes the output to C). In some cases the output of (B) might be large or on a GPU, so passing it by reference may be helpful.

Requirements:

  • Backpressure, load balancing (including adding/removing actors), pipelining.
  • Passing outputs to a downstream actor.

Non-requirements:

  • Dynamic control flow

(3) An extension of (1) and (2) where there is also dynamic control flow (i.e., users choose the downstream group of actors based on some input or intermediate output).

Requirements:

  • Backpressure, load balancing (including adding/removing actors), pipelining.
  • Passing outputs to a downstream actor.
  • Dynamic control flow.

To me, it's clear that the faster transport layer could be beneficial to all of the above, particularly (1), where we mostly want fast message passing between two actors on the same node.

It's unclear how the optimized DAG as proposed could satisfy the requirements around load balancing and autoscaling across groups of actors. I'd be interested to hear if you have any ideas there.
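As a point of reference, here is a minimal sketch of pattern (2) with today's Ray actor API, passing the intermediate output by reference so it never round-trips through the driver; the actor and method names here are illustrative only, not from the RFC:

```python
import ray

ray.init()

@ray.remote
class Preprocessor:        # "A": receives the request
    def run(self, request):
        return request.upper()

@ray.remote
class Model:               # "B": produces a potentially large output
    def infer(self, features):
        return [features] * 1000

@ray.remote
class Postprocessor:       # "C": consumes B's output
    def run(self, prediction):
        return len(prediction)

a, b, c = Preprocessor.remote(), Model.remote(), Postprocessor.remote()

# Passing ObjectRefs between actor calls keeps the large intermediate out of
# the driver; Ray resolves each ref on the receiving actor's node (via shared
# memory when colocated). This is the dataplane path the RFC aims to speed up.
features_ref = a.run.remote("req")
prediction_ref = b.infer.remote(features_ref)
print(ray.get(c.run.remote(prediction_ref)))
```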


<img style="background-color:white" src="2023-12-04-accelerated-dag-figures/image2.png">

*Left: Instantiation. "Compile" a DAG before the first execution, by allocating buffers for task inputs/outputs and sending the task descriptions to the actor executors. Right: Execute the compiled dataplane. Communication edges can take place over shared memory when local.*
A reviewer (Contributor) commented:

Suggested change
*Left: Instantiation. "Compile" a DAG before the first execution, by allocating buffers for task inputs/outputs and sending the task descriptions to the actor executors. Right: Execute the compiled dataplane. Communication edges can take place over shared memory when local.*
*Left: Instantiation. "Compile" a DAG before the first execution, by allocating buffers for task inputs/outputs and sending the task descriptions to the actor executors. Right: Execute the compiled DAG over an optimized dataplane. Communication edges can take place over shared memory when local.*
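For illustration, a rough sketch of the instantiate-then-execute flow the caption describes, using the `InputNode`/`bind` API shown elsewhere in this RFC; the `experimental_compile` method name and the ref returned by `execute` are assumptions about how the compiled path is surfaced, not something this RFC pins down:

```python
import ray
from ray.dag import InputNode

@ray.remote
class Worker:
    def fwd(self, x):
        return x + 1

worker = Worker.remote()

# Instantiation: build the DAG once, then "compile" it so that input/output
# buffers are allocated up front and the task description is pushed to the
# actor executor.
with InputNode() as inp:
    dag = worker.fwd.bind(inp)
compiled_dag = dag.experimental_compile()  # compile-step name is an assumption

# Execution: repeated calls reuse the pre-allocated channels (shared memory
# when sender and receiver are on the same node).
for i in range(3):
    print(ray.get(compiled_dag.execute(i)))
```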

Comment on lines +224 to +225
sender/receiver are colocated, and shared memory/Ray Core for
cross-node.
A reviewer (Contributor) commented:

Any plan to experiment with faster transports such as RDMA for cross-node communication?

@stephanie-wang (Contributor, Author) replied:

Yes, but probably not for a while. The initial focus is GPU communication.

| | Key properties / requirements | Goals |
|---|---|---|
| [Pipedream](https://arxiv.org/pdf/1806.03377.pdf) style pipeline-parallel distributed training (PP) | Iterative P2P GPU communication | Performance parity; Increase flexibility of partitioning scheme |
| vLLM pipeline parallelism on heterogeneous GPUs | Asymmetric compute | Reduce implementation burden |
| Fault-tolerant distributed serving | Resume execution w/o restarting everyone | Reduce downtime via greater recovery flexibility |
A reviewer (Contributor) commented:

As I understand the fault tolerance section above, failure handling will be done at the application layer by propagating the error back to the driver. In this case it seems that the entire DAG would need to be re-run. So how would the proposal help address this requirement?

It's also a little unclear to me how important this requirement is for most serving applications. Most often latencies are low (~seconds at most) and failures are rare, so needing to re-run the full inference is not a big problem.

@stephanie-wang (Contributor, Author) replied:

Restarting here means the worker processes, not individual tasks.
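A minimal sketch of the recovery flow implied by this reply, assuming that a worker failure surfaces on the driver as a `RayTaskError` and that the same compiled DAG object (here a hypothetical `compiled_dag`) can simply be executed again without restarting the worker processes:

```python
import ray
from ray.exceptions import RayTaskError

def execute_with_retry(compiled_dag, request, max_attempts=3):
    """Re-run a failed DAG execution without restarting the worker actors."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return ray.get(compiled_dag.execute(request))
        except RayTaskError as e:
            last_error = e  # workers stay up; only this execution is retried
    raise last_error
```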

| | Key properties / requirements | Goals |
|-----------------------------------------------------------------|----------------------------------------------------------------------|------------------------------------------------------------------|
| vLLM tensor parallelism | | Reduce Ray overheads |
| vLLM pipeline parallelism | P2P/cross-node GPU communication | Reduce (expected) Ray overheads; Validate cross-node performance |
A reviewer (Contributor) commented:

My understanding from the vLLM folks' benchmarks was that cross-node GPU communication overhead is too high (even when using optimized transports), so for now they don't have a desire to use it. I could be wrong, though; this is hearsay.

@stephanie-wang (Contributor, Author) replied:

For tensor parallelism, yes, but I don't think that's the case for pipeline parallelism.

@ericl (Contributor) commented Dec 22, 2023:

> Point-to-point communication passing small messages between two groups of actors

Would the current proposal be viable for even this? As I understand it, this would require dynamic M:N communication, whereas the compiled DAG assumes a fully static and exclusive communication topology between actors.

For example, if group 1 is a single actor dispatching work to group 2 actors, that would work (we could have multiple DAGs capturing "1->2" communication), but if group 1 had multiple dispatching actors, that would break the assumption that each actor doing work is part of at most one DAG at a time.
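For concreteness, a sketch of the 1:N arrangement described above using the existing `ray.dag` binding API (no compiled path): one small DAG per replica, so each replica participates in at most one DAG, and the dispatching side simply picks which DAG to execute per request. The class and method names are illustrative:

```python
import ray
from ray.dag import InputNode

@ray.remote
class Replica:
    def handle(self, request):
        return f"handled {request}"

replicas = [Replica.remote() for _ in range(2)]

# One "1->2" DAG per replica; each replica appears in exactly one DAG, which
# satisfies the static, exclusive topology assumption discussed above.
dags = []
for replica in replicas:
    with InputNode() as inp:
        dags.append(replica.handle.bind(inp))

# The dispatcher (driver or a single dispatching actor) round-robins across
# the per-replica DAGs.
for i in range(4):
    dag = dags[i % len(dags)]
    print(ray.get(dag.execute(f"req-{i}")))
```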

@rkooo567 (Contributor) commented:

I have the impression that the custom/faster transport has to be decoupled from the DAG. It is a requirement for the DAG, but the DAG shouldn't be a requirement for using the custom/faster transport. I think it could be implemented fairly easily with the existing Ray API if we just establish the channel when the first communication is initiated. And if it works with the existing Ray API, we wouldn't have to support this from core?
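To illustrate the decoupling being suggested, here is a rough sketch of a channel whose transport is established only when the first message is sent, with no DAG involved; every class here is hypothetical and just stands in for the real shared-memory or object-store paths:

```python
class SharedMemoryTransport:      # stand-in for a local shared-memory channel
    def write(self, value):
        print("shared-memory write:", value)

class ObjectStoreTransport:       # stand-in for the existing Ray object path
    def write(self, value):
        print("object-store write:", value)

class LazyChannel:
    """Hypothetical channel: the transport is chosen on first use, not at creation."""

    def __init__(self, colocated: bool):
        self._colocated = colocated
        self._transport = None    # nothing is established yet

    def write(self, value):
        if self._transport is None:
            # Channel is established when the first communication is initiated.
            self._transport = (
                SharedMemoryTransport() if self._colocated else ObjectStoreTransport()
            )
        self._transport.write(value)

LazyChannel(colocated=True).write(b"payload")
```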

@edoakes (Contributor) commented Dec 28, 2023:

Yes, you're correct @ericl. To optimize the basic case of M:N communication, we essentially just want the fast transport layer. Whether we get that through the DAG abstraction (a single-node DAG) or use the lower-level primitive directly doesn't matter too much.


with InputNode() as inp:
    with InputGPUChannel() as gpu_channel:
        _, tensors = producer.produce_all_sync.bind(inp.input)
@ericl (Contributor) commented Jan 8, 2024:

This syntax is pretty strange; I'd expect something more like tensors.set_device("gpu") or tensors.set_channel_impl(GPUChannel).

Ideally, this would be done automatically, but I'm not sure we can easily detect that the return type is a CUDA array.

@stephanie-wang (Contributor, Author) replied:

Hmm, tensors.set_channel_impl seems good. Mainly I want there to be a way to pass a specific Channel or Channel type, so tensors.set_device("gpu") seems less ideal.
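For concreteness, the alternative being discussed might read roughly as follows; `set_channel_impl` and `GPUChannel` are hypothetical names from this thread, not an existing Ray API, so this fragment is illustrative only and not runnable as-is:

```python
# Hypothetical syntax, not an existing API: pin the channel implementation
# for one DAG edge instead of wrapping the input in InputGPUChannel.
with InputNode() as inp:
    _, tensors = producer.produce_all_sync.bind(inp.input)
    tensors.set_channel_impl(GPUChannel)   # route this edge over a GPU channel
    out = consumer.consume.bind(tensors)
```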

@zhe-thoughts (Collaborator) left a comment:

Vote on ray-committers just passed

@zhe-thoughts merged commit a8bac2d into main on Mar 21, 2024
@hongchaodeng deleted the accelerated-dag branch on January 3, 2025