Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(abci): ignore comet ctx and use app ctx #2568

Merged
merged 71 commits into from
Mar 6, 2025
Merged

fix(abci): ignore comet ctx and use app ctx #2568

merged 71 commits into from
Mar 6, 2025

Conversation

shotes
Copy link
Contributor

@shotes shotes commented Mar 4, 2025

Currently, beacond hangs on SIGTERM if its inside a retry timeout. This is because the context used in the retry is not cancelled in our signal hander.

To address this, this PR makes the context available to the cometbft service which it gets from the context passed in to Start(). Now, when signal handler catches a signal, the context cancellation will be checked by the ABCI++ endpoints properly. This causes us to properly handle signal cancellation in the retry timeout.

Unfortunately, CometBFT does not yet support context cancellation, or context usage at all. It currently passes in context.TODO() into all of the ABCI methods, so the ctx we see in the ABCI function implementations is not connected to the ctx from our node. I also can't seem to find anything in cometBFT RFC or docs that indicate supporting this.

Since CometBFT does not support context cancellation, this means that there is no clean way out -- we MUST return error in ProcessProposal and FinalizeBlock (which causes the CONSENSUS FAILURE!!! panic). If we don't then our node will be attempting to participate in consensus with potentially corrupt, invalid, or missing data. I have tested this behavior by not returning an error from FinalizeBlock during context cancellation, and indeed my node still proceeded to call Commit.

@shotes shotes changed the title fix(abci): combine ABCI context with parent fix(abci): ignore comet ctx and use app ctx Mar 5, 2025
@shotes
Copy link
Contributor Author

shotes commented Mar 5, 2025

Opted to entirely remove the combineContexts functionality. The original reason for implementing the combineContexts functionality was so that we could additionally capture the ctx provided by comet, IF they ever decided to implement it.

This original reason was not a very good reason - IF they implement the ctx, then we'd want to remove combineContexts anyways, since ctx would have become a child context of s.ctx.

Removing this functionality entirely streamlines the whole thing, and it was unnecessary in the first place. Now, if comet ever implements ctx properly, we can simply replace the s.ctx with ctx.

@@ -54,6 +54,9 @@ func (s *Service) finalizeBlockInternal(
// CometBFT.
if s.finalizeBlockState == nil {
s.finalizeBlockState = s.resetState(ctx)
} else {
// Preserve the CosmosSDK context while using the correct base ctx.
s.finalizeBlockState.SetContext(s.finalizeBlockState.Context().WithContext(ctx))
Copy link
Contributor

@rezbera rezbera Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit unsure on this. Why was introducing this needed?

Copy link
Contributor Author

@shotes shotes Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the cosmos SDK context, which I will refer to as cosmosCtx, contains a reference to the baseCtx, which is the regular golang context.Context. This cosmosCtx also stores a bunch of data in it for our state and commit multi store and whatnot, but we don't want to affect any of that.

When we resetState(ctx), we are creating a new cosmosCtx, where baseCtx = ctx.

If we do not resetState(ctx), then the already existing finalizeBlockState is used. This state has the old cosmosCtx, which has the old ctx from whichever other function is was set in, likely the previous ProcessProposal. We don't mind using the old cosmosCtx, but we want to make sure we are using the new ctx that is passed in for this function. So we can simply set baseCtx = ctx here.

Indeed, this is not as big of a problem now, since all of the contexts are the same and the context isn't getting canceled inbetween each ABCI++ call like it was in a previous iteration of this PR. However, purely for correctness, we want to make sure our s.Blockchain.FinalizeBlock(s.finalizeBlockState.Context()) is using the ctx passed in via finalizeBlockInteral(ctx, ...) and NOT the ctx from some other function call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, it is still not accurately using the correct ctx unless we make this change.

I had assumed it was always going to be using this context.

For my understanding, is it fair to say that we don't need to change the context here given we have this universal base context which currently never changes?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rezbera if you check the call hierarchy you will see that FinalizeBlock does use s.ctx

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually @rezbera we can solve the issue by just not passing ctx as parameter, but reusing s.ctx everywhere, which this PR basically does. So no need to look up the call hierarchy

Copy link
Contributor

@rezbera rezbera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 non-blocking query , but rest lgtm

@@ -54,6 +54,9 @@ func (s *Service) finalizeBlockInternal(
// CometBFT.
if s.finalizeBlockState == nil {
s.finalizeBlockState = s.resetState(ctx)
} else {
// Preserve the CosmosSDK context while using the correct base ctx.
s.finalizeBlockState.SetContext(s.finalizeBlockState.Context().WithContext(ctx))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a helper function on state to reset/update the base ctx

@@ -95,7 +95,7 @@ func (s *SimulatedSuite) SetupTest() {
}()

// Allow a short period for services to fully initialize.
time.Sleep(2 * time.Second)
time.Sleep(5 * time.Second)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noting here that this is a hacky way to make sure that goroutine running s.TestNode.Start(s.Ctx) above completes before we move on (otherwise consensus service may crash due to ctx being nil).
We will have to rework this to use a different way to check stuff

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed to use a reliable method in #2572

It's not ideal, because ideally we'd have a readiness probe API but this is better than sleeping arbitrary time.

Base automatically changed from engine-api-retry to main March 6, 2025 12:51
@abi87 abi87 merged commit 6699ea9 into main Mar 6, 2025
19 checks passed
@abi87 abi87 deleted the abci_context_cancel branch March 6, 2025 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants