Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Improve DAG API to support tensor parallel DAG #41231

Merged
merged 26 commits into from
Dec 12, 2023

Conversation

rkooo567
Copy link
Contributor

Why are these changes needed?

  1. Support OutputNode
  2. Allow to create bind from regular actor. It is needed because
    • Actor needs to be reused
    • Currently, you can have only 1 DAG per actor because the actor is a starting point of the DAG. This allows us to
      make a task as a starting node of the DAG instead of the actor
    • This also allows to have more than one InputNode per each actor

This PR also removes the unused code

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

IN_CONTEXT_MANAGER = "__in_context_manager__"


class OutputNode(DAGNode):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it only useful when there are multiple outputs? If so, how about calling it something like MultiOutputNode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I thought we want this even for a single output (since we need to allocate the buffer ahead of time). cc @stephanie-wang can we allocate the output buffer without this Output node? for an single output case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For single-output, it's kind of a nuisance to require users to specify it. We don't really need it for compiled DAG; it's just useful to know which node is the sink. My plan for compiled DAG was to add it implicitly if the user didn't specify it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yes I agree that we can just call it MultiOutputNode, since users only need to use if there are multiple outputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

the final semantic is

MultiOutputNode([output_1, output_2, ...]) -> [output_1, output2, ...]

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 30, 2023
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 4, 2023
@rkooo567 rkooo567 requested a review from a team as a code owner December 5, 2023 14:52
@rkooo567
Copy link
Contributor Author

rkooo567 commented Dec 5, 2023

cc @ericl @stephanie-wang. I addressed comments!

@rkooo567
Copy link
Contributor Author

rkooo567 commented Dec 5, 2023

Working on CI failures...

Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 5, 2023
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 7, 2023
@rkooo567
Copy link
Contributor Author

rkooo567 commented Dec 7, 2023

@ericl can you take a look at doc changes?

Copy link
Contributor

@angelinalg angelinalg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits.

Comment on lines 32 to 36
from ray.dag.class_node import (
PARENT_CLASS_NODE_KEY,
PREV_CLASS_METHOD_CALL_KEY,
ClassMethodNode,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the test failure, I'm wondering if this is causing an increase in startup time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this just adds 1ms overhead when ray.init() is called. but let me try. nothing to lose...

Copy link
Contributor Author

@rkooo567 rkooo567 Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this fixes the issue... I think it is really a bug from that test, and it should be fixed, but I will just move import here for now

@rkooo567 rkooo567 merged commit 2839644 into ray-project:master Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants