Improve libfabric RDM EP for storage #10798

Open · iziemba opened this issue Feb 19, 2025 · 2 comments

iziemba commented Feb 19, 2025

The following issue stemmed from #10792 and discussions among @soumagne, @swelch, and myself.

Storage over libfabric RDM EPs imposes a different set of requirements on libfabric than HPC/AI jobs do. For example, many HPC/AI jobs are not resilient: a single RDMA or node failure is enough to terminate the job. This is not the case with storage.

The following are examples of storage requirements. These requirements came from discussing what DAOS/Mercury needs from libfabric RDM EPs.

1.) libfabric operations must complete within a bounded time window. Looking at provider implementations, one issue I see is that providers may internally switch to rendezvous when the message size exceeds some provider-defined threshold. Once that threshold is exceeded, it is the responsibility of the target to drive the libfabric operation forward. If the EP is truly connectionless and the target is unable to issue the rendezvous operation (e.g., software bug, process crash, etc.), the source will be left waiting indefinitely for the operation to complete.
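
Today the only portable way to bound this is for the application to manage a deadline itself. The sketch below is illustrative only (the function name and loop structure are assumptions, not provider behavior): poll the CQ against a deadline and fall back to fi_cancel(), keeping in mind that cancellation is best-effort and a canceled operation still reports a completion (typically an FI_ECANCELED error entry).

```c
#include <time.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Illustrative application-side bound on a send: poll the CQ until a
 * deadline, then cancel. Cancellation is best-effort; the provider may
 * still complete the operation, so the app must also handle a late
 * completion or FI_ECANCELED error entry for this context. */
static int send_with_deadline(struct fid_ep *ep, struct fid_cq *cq,
			      const void *buf, size_t len, void *desc,
			      fi_addr_t dest, struct fi_context *ctx,
			      uint64_t timeout_ms)
{
	struct fi_cq_entry comp;
	struct timespec start, now;
	ssize_t ret;

	ret = fi_send(ep, buf, len, desc, dest, ctx);
	if (ret)
		return (int)ret;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (;;) {
		ret = fi_cq_read(cq, &comp, 1);
		if (ret == 1) {
			if (comp.op_context == ctx)
				return 0;	/* completed within the bound */
			/* A real event loop would dispatch completions
			 * belonging to other operations here. */
		} else if (ret < 0 && ret != -FI_EAGAIN) {
			return (int)ret;	/* inspect via fi_cq_readerr() */
		}

		clock_gettime(CLOCK_MONOTONIC, &now);
		int64_t elapsed_ms =
			(int64_t)(now.tv_sec - start.tv_sec) * 1000 +
			(now.tv_nsec - start.tv_nsec) / 1000000;
		if (elapsed_ms >= (int64_t)timeout_ms) {
			fi_cancel(&ep->fid, ctx);
			return -FI_ETIMEDOUT;
		}
	}
}
```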

2.) Storage may use tagged buffers to identify unique RPCs. Due to the async nature of posting tagged receive buffers, there is a potential race where a source issues a tagged send before the tagged receive buffer has been processed by the hardware/provider. Providers need to implement some mechanism to handle this unexpected message case.

Because these operations are async, immediately failing the unexpected message at the libfabric level is the wrong answer. If that were the only option a provider could support, it seems like the provider would need a synchronous tagged send/recv API, and that would come at a performance cost (e.g., no more pipelining of tagged receive operations).

In addition, buffering unexpected tagged sends at a target may be problematic due to resource leaks. For example, a client that unexpectedly reboots will lose all context (tag bits) of in-flight RPCs. But the server is still aware of the RPC and issues a tagged send as an RPC response back to the client. Since the client knows nothing about this tagged operation, the stale tagged send would be buffered "indefinitely".

What is worse, if this stale operation eventually matches a posted tagged receive, unexpected behavior may occur. If the libfabric user carefully manages tag bits, tagged receives should not match a stale unexpected tagged send; one such scheme is sketched below.
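
One way a user can manage tag bits to guarantee this, sketched under assumed conventions (the 16/48 bit split and every name below are illustrative, not an existing DAOS/Mercury or libfabric scheme), is to reserve part of the tag for a per-boot incarnation number:

```c
#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Illustrative tag layout: upper 16 bits carry a per-boot incarnation
 * number, lower 48 bits carry the RPC id. After a client reboot, the
 * incarnation changes, so stale responses tagged with the old value can
 * never match newly posted receives. */
#define INCARNATION_SHIFT	48
#define RPC_ID_MASK		((UINT64_C(1) << INCARNATION_SHIFT) - 1)

static inline uint64_t make_tag(uint16_t incarnation, uint64_t rpc_id)
{
	return ((uint64_t)incarnation << INCARNATION_SHIFT) |
	       (rpc_id & RPC_ID_MASK);
}

/* Exact match (ignore mask = 0): a send tagged with an older incarnation
 * stays unexpected and can be reclaimed by whatever timeout/discard
 * mechanism the provider offers. */
static ssize_t post_response_recv(struct fid_ep *ep, void *buf, size_t len,
				  void *desc, fi_addr_t src,
				  uint16_t incarnation, uint64_t rpc_id,
				  struct fi_context *ctx)
{
	return fi_trecv(ep, buf, len, desc, src,
			make_tag(incarnation, rpc_id), 0 /* ignore */, ctx);
}
```

Note that a provider may not match on all 64 tag bits (see mem_tag_format in fi_ep_attr), so a real layout has to fit within the provider's usable bits.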

3.) Storage protocols are set up to handle libfabric operation failures. The behavior changes based on EP type. For example, connected EPs will require a listen/connect/accept phase before being used again. For connectionless EPs, since a single EP may be targeting N peers, a libfabric operation failure to one peer must not disable the EP and thereby impact the other N-1 peers as well. In other words, a libfabric RDMA operation failure must not affect the connectionless EP's enabled/disabled state. Only local EP errors (e.g., a HW failure of some sort) should affect connectionless EP state.
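
For what it's worth, the API already reports per-operation failures through CQ error entries; the open question is whether providers keep the EP usable afterwards. A minimal sketch of the error-drain path under the semantics requested above (mark_peer_failed is a hypothetical application hook):

```c
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Drain CQ error entries after fi_cq_read() returns -FI_EAVAIL. Under the
 * semantics requested above, each entry fails only the RPC/peer tied to
 * its op_context, and the connectionless EP itself stays enabled. */
static void drain_cq_errors(struct fid_cq *cq)
{
	struct fi_cq_err_entry err = { 0 };

	while (fi_cq_readerr(cq, &err, 0) > 0) {
		fprintf(stderr, "op %p failed: %s (prov_errno %d)\n",
			err.op_context, fi_strerror(err.err),
			err.prov_errno);
		/* mark_peer_failed(err.op_context); -- hypothetical hook */
	}
}
```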

iziemba changed the title from "Improve libfabric RDMA for storage" to "Improve libfabric RDM EP for storage" on Feb 19, 2025

iziemba commented Mar 3, 2025

Update to resource management: #10837


shefty commented Mar 5, 2025

It may be possible to solve 1 & 2 using the same mechanism. 3 should mostly or entirely be an implementation issue of recovering from lost connections / failed communications.

The API could be extended to associate a timeout or deadline with an operation (probably through more abuse of the context parameter). A timeout would be relative to the current time (e.g., 2 seconds), while a deadline would be based on the system clock (e.g., now() + 2s). A timeout is suitable if we wanted to configure an endpoint with a value applied to all posted operations; a deadline is only suited to a per-operation value.
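
Nothing below exists in libfabric today; it is only a sketch of what that context abuse could look like: a deadline-aware context structure that an extended provider would recognize, with all names invented.

```c
#include <stdint.h>
#include <time.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical extension: the operation context points to a structure
 * that carries an absolute deadline the provider may honor by failing
 * the operation (e.g., with -FI_ETIMEDOUT) once the deadline passes. */
struct fi_deadline_context {
	struct fi_context ctx;	/* must remain the first member */
	uint64_t deadline_ns;	/* absolute CLOCK_MONOTONIC time */
};

static uint64_t clock_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Usage sketch: "now() + 2s", passed through the normal context slot. */
static ssize_t send_with_hypothetical_deadline(struct fid_ep *ep,
					       const void *buf, size_t len,
					       void *desc, fi_addr_t dest,
					       struct fi_deadline_context *dctx)
{
	dctx->deadline_ns = clock_now_ns() + 2ULL * 1000000000ULL;
	return fi_send(ep, buf, len, desc, dest, &dctx->ctx);
}
```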

I prefer a per-operation timeout/deadline from the perspective of the API. It makes the implementation more challenging but also enables flexibility: it can enable prioritizing transfers based on different deadlines, or even failing an operation quickly if the provider knows it is highly unlikely to meet a given deadline given current queue depths.

There's no technical reason why a timeout / deadline couldn't be used with receive operations. If a message isn't matched within a given window, the receive just times out. It should be possible to extend this definition to unmatched receives as an endpoint configuration, i.e., a timeout = 0 indicates unmatched receives should be discarded immediately. (A provider can adjust this value internally to account for any slowness in posting receive buffers to HW.)
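
As a shape-of-the-API sketch only: fi_setopt() with FI_OPT_ENDPOINT is the existing endpoint configuration mechanism, but the option below is invented for illustration and defined by no provider.

```c
#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical option: default timeout applied to unmatched receives.
 * FI_OPT_ENDPOINT and fi_setopt() are real; this optname is not. */
#define FI_OPT_UNMATCHED_RECV_TIMEOUT_MS	0x1000	/* invented */

static int set_unmatched_timeout(struct fid_ep *ep, uint32_t timeout_ms)
{
	/* timeout_ms == 0: discard unmatched receives immediately (the
	 * provider may pad this internally to cover posting latency). */
	return fi_setopt(&ep->fid, FI_OPT_ENDPOINT,
			 FI_OPT_UNMATCHED_RECV_TIMEOUT_MS,
			 &timeout_ms, sizeof(timeout_ms));
}
```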

Some of this may be difficult for a provider to implement. There's also a counter argument that the app could manage the deadlines, calling cancel once the operation times out. However, a timeout could even be carried through to the peer via the transport. That would allow a peer to abort an operation which may have already timed out at the initiator.
