
chore(testing): modified start-reth to enforce persistence #2536

Merged 1 commit into main from start-reth-unlock-mainnet-sync on Feb 25, 2025

Conversation

abi87 (Collaborator) commented Feb 24, 2025

By default, reth does not persist every finalized block. Instead, it keeps blocks in memory up to a certain threshold (by default, two heights below the tip), which in our case means a finalized block may not be persisted. This may cause an issue while syncing Beacond, since we have SSF and assume finalized blocks are duly persisted.

This PR sets up the configs so that all the blocks we need are duly persisted.
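
For context, the full `start-reth` invocation after this change roughly assembles into the following command (a sketch based on the flags in this PR's diff; `JWT_PATH`, `ETH_DATA_DIR`, and `IPC_PATH` are Makefile variables defined elsewhere in the repo):

```shell
# Sketch of the reth invocation after this PR; variable values come from
# the repository Makefile, not from this snippet.
reth node \
    --authrpc.addr "0.0.0.0" \
    --authrpc.jwtsecret "$JWT_PATH" \
    --datadir "$ETH_DATA_DIR" \
    --ipcpath "$IPC_PATH" \
    -vvvvv \
    --engine.persistence-threshold 0 \
    --engine.memory-block-buffer-target 0
```

With both engine flags set to 0, reth persists every canonical block immediately instead of buffering the most recent blocks in memory.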

@abi87 abi87 requested a review from a team as a code owner February 24, 2025 17:29
@abi87 abi87 self-assigned this Feb 24, 2025
codecov bot commented Feb 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 32.77%. Comparing base (a23a74a) to head (9cc3658).
Report is 1 commit behind head on main.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2536   +/-   ##
=======================================
  Coverage   32.77%   32.77%           
=======================================
  Files         351      351           
  Lines       15857    15857           
  Branches       20       20           
=======================================
  Hits         5197     5197           
  Misses      10276    10276           
  Partials      384      384           

@@ -86,7 +86,10 @@ start-reth: ## start an ephemeral `reth` node
 	--authrpc.addr "0.0.0.0" \
 	--authrpc.jwtsecret $(JWT_PATH) \
 	--datadir ${ETH_DATA_DIR} \
-	--ipcpath ${IPC_PATH}
+	--ipcpath ${IPC_PATH} \
+	-vvvvv \
Contributor commented:

I think we should not make this the default verbosity; -vvvvv should be specified only when debugging. Logs will get crowded and very large very quickly.

Contributor commented:

Yep definitely not for real nodes

abi87 (Collaborator, Author) commented:

I don't think people use make start-reth to run their production nodes. If they do, they should not. I like increasing output in our tests, and I think we can sustain it as is.

@rezbera rezbera self-requested a review February 24, 2025 18:31
+	--ipcpath ${IPC_PATH} \
+	-vvvvv \
+	--engine.persistence-threshold 0 \
+	--engine.memory-block-buffer-target 0
rezbera (Contributor) commented Feb 24, 2025:

    /// Maximum number of blocks to be kept only in memory without triggering
    /// persistence.
    persistence_threshold: u64,
    /// How close to the canonical head we persist blocks. Represents the ideal
    /// number of most recent blocks to keep in memory for quick access and reorgs.
    ///
    /// Note: this should be less than or equal to `persistence_threshold`.
    memory_block_buffer_target: u64,

Noting the above code in reth (slightly more verbose than the CLI info)
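
The invariant in those doc comments (memory_block_buffer_target <= persistence_threshold) can also be sanity-checked before launching the node. A minimal sketch, assuming the two flag values are available as plain integers; check_persistence_flags is a hypothetical helper, not part of reth:

```shell
# Hypothetical helper (not part of reth): enforce the invariant documented
# in reth's config comments, i.e. memory_block_buffer_target must be
# less than or equal to persistence_threshold.
check_persistence_flags() {
  local persistence_threshold=$1
  local memory_block_buffer_target=$2
  if [ "$memory_block_buffer_target" -gt "$persistence_threshold" ]; then
    echo "invalid: --engine.memory-block-buffer-target must be <= --engine.persistence-threshold" >&2
    return 1
  fi
  echo "ok"
}

# The values used by this PR (0 and 0) satisfy the invariant trivially.
check_persistence_flags 0 0
```
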

abi87 (Collaborator, Author) commented:

@rezbera do you want me to include this somehow in the Makefile?
I think noting it here as you did is enough (and appreciated!)

Contributor commented:

Here is enough, as long as we're aware @abi87

fridrik01 (Contributor) commented:

Awesome, there had to be some configuration that we could tune to change this behaviour! Great great

rezbera (Contributor) left a review:

Did a quick sanity check in the reth code for usage of these vars and this makes sense to me.

Noting that persistence_threshold must be >= memory_block_buffer_target.

This likely has performance implications on reth, so perf testing this before release would be appropriate.

abi87 (Collaborator, Author) commented Feb 24, 2025:

> Did a quick sanity check in the reth code for usage of these vars and this makes sense to me.
>
> Noting that persistence_threshold must be >= memory_block_buffer_target.
>
> This likely has performance implications on Reth so perf testing this before release would be appropriate

@rezbera I agree about the performance hit, but I don't think we need to require people to use this in production yet.
At the very least, I would suggest using these configs while syncing and removing them once the node is close to tip.
If anything happens and blocks are lost, scripts/rollback.sh will help.
Thoughts?

@abi87 abi87 requested a review from shotes February 24, 2025 22:22
rezbera (Contributor) commented Feb 24, 2025:

> Did a quick sanity check in the reth code for usage of these vars and this makes sense to me.
> Noting that persistence_threshold must be >= memory_block_buffer_target.
> This likely has performance implications on Reth so perf testing this before release would be appropriate
>
> @rezbera I agree with the performance hit, but I don't think we need to require people using this in production yet? At the very least, I would suggest to use the configs while syncing and remove them once node is ready to close to tip. If anything happens and blocks are lost, scripts/rollback.sh will help. Thoughts?

So the problem this solves is reth nodes crashing and then losing blocks. Such a problem could also occur once at the tip of the chain, so we would want reth users to run it even then, right?

Agree that it's not an urgent rollout, but my understanding is that longer term, users should always run with these flags

abi87 (Collaborator, Author) commented Feb 24, 2025:

> So the problem this solves is Reth nodes crashing and then losing blocks. Such a problem could also occur once at the tip of the chain, so we would want Reth users to run it even then right?

That would be ideal, and maybe even recommended given our relatively small validator set. However, as long as we can count on a sufficient majority having stored the finalized blocks, all should be fine. Nodes can roll back via the script and download the missing payloads/reverted blocks from peers.

calbera (Contributor) left a review:

tACK. Running on my local mainnet fullnode that had the EL behind the CL by ~1k blocks; the EL catches up via the new startup procedure that we trigger in the CL.

However, I have started seeing NewPayload calls time out far more often (even with an increased timeout of 1s or 2s):

2025-02-24T15:57:59-08:00 ERRR Received undefined error during new payload call service=execution-engine payload_block_hash=0x47b3bc54ef46cdc00d07e21e0a4bd5e481c4a72f32144529a82971d51e280d49 parent_hash=0x47b3bc54ef46cdc00d07e21e0a4bd5e481c4a72f32144529a82971d51e280d49 error=Post "http://localhost:8551": engine API call timed out
2025-02-24T15:57:59-08:00 ERRR Failed to process verified beacon block service=blockchain error=Post "http://localhost:8551": engine API call timed out
2025-02-24T15:57:59-08:00 ERRR Error in proxyAppConn.FinalizeBlock module=state err=Post "http://localhost:8551": engine API call timed out
panic: Failed to process committed block (1068277:6634401D9928F5B23094EC258A6E64D1C9260D4932D46EFE74673CE71FE50D07): Post "http://localhost:8551": engine API call timed out

abi87 (Collaborator, Author) commented Feb 25, 2025:

> tACK. running on my local mainnet fullnode that had EL behind CL by ~1k blocks. EL catches up from the new startup procedure that we trigger in the CL.

Thanks @calbera, we'll have to monitor this.

@abi87 abi87 merged commit 902f432 into main Feb 25, 2025
19 checks passed
@abi87 abi87 deleted the start-reth-unlock-mainnet-sync branch February 25, 2025 08:25