Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: Introduce new tag FI_TAG_RPC #10792

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions include/rdma/fabric.h
Original file line number Diff line number Diff line change
Expand Up @@ -346,6 +346,7 @@ enum {
FI_TAG_BITS,
FI_TAG_MPI,
FI_TAG_CCL,
FI_TAG_RPC,
FI_TAG_MAX_FORMAT = (1ULL << 16),
};

Expand Down
10 changes: 10 additions & 0 deletions man/fi_endpoint.3.md
Original file line number Diff line number Diff line change
Expand Up @@ -919,6 +919,16 @@ wire protocols. The following tag formats are defined:
Applications that use the CCL format pass in the payload identifier
directly as the tag and set ignore bits to 0.

*FI_TAG_RPC*

: The FI_TAG_RPC flag is used to indicate that tags are being utilized to match RPC
replies with requests. When specified via fi_getinfo, the caller ensures that
a receive buffer with a corresponding tag to receive an RPC reply has been posted
prior to sending an RPC request. As a result, unexpected message handling may be
Comment on lines +925 to +927
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can callers ensure that a receive buffer is posted? The data transfer APIs are asynchronous. In theory, there is a race here which could result in fi_tsend being issued before provider/HW has processed fi_trecv. In addition, should this race be hit with FI_ORDER_SAS, API contract may be broken.

Copy link
Author

@jxiong jxiong Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know the trecv APIs are async. As far as I can see, it's not async for the TCP provider.

This sounds like a serious issue to me. Because if there exists providers who require this for zerocopy or DMA, this behavior will make zerocopy/DMA impossible since the buffer is not ready while data comes. @shefty @iziemba can you share some more insights? It's unlikely for me to work on that but it is still interesting to know how you would approach it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption is that once fi_trecv() or fi_recv() return, that the buffers are available. This is needed by apps trying to implement flow control. If fi_trecv() or fi_recv() just mean that the buffer may someday in the future be made available, we're almost certainly making life impossible for apps. No credit-based flow control will ever work

The completion of a posted receive buffer is asynchronous. The posting should not be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think unexpected messaging closes the race between a buffer receive post and an incoming message.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern is that the unexpected buffers would be unlimited and nobody knows how long they would stay in that buffer.

Coming back to this issue, if the fi_addr is not configured when the unexpected messages comes, it will disable polling the socket fd, which will choke the tcp connection.

Copy link
Author

@jxiong jxiong Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand there might be other complications but this is certainly helping us. I wonder if we could do it the current way and we can always come back to improve it when new information is discovered. @shefty @iziemba as long as there are something to debate, it wouldn't be terribly wrong or an obvious mistake, lol

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If mem_tag_format is to be used, part of the FI_TAG_RPC needs to address the format. I think Mercury today uses FI_TAG_BITS. FI_TAG_RPC could be defined to be the following:

FI_TAG_RPC extends FI_TAG_BITS by indicating that tags are being utilized to match RPC replies with requests. When specified, the caller ensures that a receive buffer with a corresponding tag to receive an RPC reply has been posted prior to sending an RPC request. Unmatched tagged received messages must be discarded.

The number of supported tag bits follows the FI_TAG_BITS definition.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been out for about a week. IMO, it would be nice if tag_rpc could also define more semantic meaning around the tag format. For example, ignore bits aren't being used.

I also think it's worth considering this from the angle of resource management, but that's harder to fit a solution into without redefining resource_mgmt (it's currently an enum that's a boolean for practical purposes).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iziemba I think FI_TAG_RPC is closer to FI_TAG_CCL in the sense that exact match is required. Note that both FI_TAG_MPI and FI_TAG_CCL are derivatives of FI_TAG_BITS. They all use leading 0s to indicate number of reserved bits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree that RPC is better aligned exact match than wildcard.

optimized.

When FI_TAG_RPC is specified, unmatched received messages must be discarded.

*FI_TAG_MAX_FORMAT*
: If the value of mem_tag_format is >= FI_TAG_MAX_FORMAT, the tag format
is treated as a set of bit fields. The behavior is functionally the same
Expand Down