v2.61.0 slow to sync on sepolia #13457
8+ hours later, it seems to be stuck here:
Note that when it started, it only needed to catch up from block 7496738 to 7497695 (957 blocks), i.e. a very small range:
Downgrading to v2.60.10, all the steps above completed in about 20mins (and the rest fairly quickly in 7mins):
@keithchew can you provide all the flags you used?
@antonis19 here are the flags I am using:
Hmm, I have 2 other nodes on mainnet, for which the upgrade went fine last week. But today, one of them is lagging behind and very slow to catch up:
It is interesting that it looks like each block is being handled twice. A restart did not help, and the same logs as above appear. Reverting back to v2.60.10 works: it catches up fast and without issues. It might not be related to this sepolia bug, but I will keep monitoring...
Can you try setting
@antonis19 I have updated to 5G, and it is still painfully slow. It is similar to the mainnet logs above, i.e. each block is processed twice. And there are these strange
I have seen some recent issues regarding erigon3 being broken with externalcl; I wonder if v2.61.0 pulled some work from e3 which is now causing issues with e2?
Are you running with external CL? If so, which one is it? I am getting around 400 mgas/s running with internal CL, btw.
Yes, running prysm v5.2.0.
I am running on SSD, and v2.60.10 is working fine. v2.61.0 is consistently slow/broken.
@antonis19 do you have what you need to investigate this issue? Let me know if you need me to try anything else from my end.
@keithchew yes, we are investigating this currently. I will reach out with more updates or if we need more info.
@keithchew I wasn't able to replicate the issue; I managed to catch up with the latest block just fine using Prysm v5.1.2, following the guide here: https://docs.prylabs.network/docs/install/install-with-script. I will also try with v5.2.0 to see if the version is the cause of the problem. In the meantime, could you provide me with the command you used to run prysm?
@keithchew actually the
@antonis19 do you mind doing a restart on erigon and posting the logs here, so that I can compare with mine above?
@keithchew Sure. I'll stop erigon for 1 hour and let it catch up to the tip, and will paste the logs here.
@keithchew See logs below:
hmm, for some reason, I don't see this in my logs:
before Let me investigate further...
This could be related to this issue: #12722
@antonis19 I think I found something. In my instance, I have this log:
I have tried:
and then started erigon (v2.60.10) and waited for it to rebuild. Once at the chain tip, I restarted, but still seeing the Do you know how I can get the node to sync up snapshots and the DB to remove the
Another hint I noticed when I shut down v2.61.0:
The RPC time for validateAndStorePayload above is 3.8s! I wonder if the slowness in v2.61.0 is due to the communication between EL and CL? I will compare the codebases between v2.60.10 and v2.61.0 and see if I can spot anything...
I have finally tracked it down! Using strace/gdb, I can see a lot of DB locks, so I had a look at the diff between v2.60.10 and v2.61.0 (because of the rename from ledgerwatch to erigontech, it had 1825 file changes!!). I tracked it down to this PR: Reverting this made it go fast again. So it looks like p2p batch updates could be choking the DB, causing congestion and slowdown. @AskAlexSharov coincidentally, that is the PR that you included to fix the other issue I found, i.e.:
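To make the congestion idea concrete, here is a minimal, self-contained Go sketch (not erigon's actual code; the `batch` type, counts, and delays are made up for illustration): several producers push write batches at a single DB writer through a bounded queue, so producers feel backpressure instead of batches piling up behind the writer.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// batch stands in for a group of p2p-derived updates destined for the DB.
type batch struct{ id int }

func main() {
	// A small buffer means producers block once the single writer falls
	// behind: that is the backpressure effect. An unbounded queue would
	// instead let batches pile up and congest the writer.
	queue := make(chan batch, 4)

	var wg sync.WaitGroup
	for p := 0; p < 8; p++ { // several peers producing batches concurrently
		wg.Add(1)
		go func(p int) {
			defer wg.Done()
			for i := 0; i < 3; i++ {
				queue <- batch{id: p*100 + i} // blocks while the writer is busy
			}
		}(p)
	}

	// Close the queue once all producers are done so the writer loop ends.
	go func() {
		wg.Wait()
		close(queue)
	}()

	// The single DB writer drains batches one at a time.
	for b := range queue {
		time.Sleep(10 * time.Millisecond) // simulated commit latency
		fmt.Println("committed batch", b.id)
	}
}
```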
Instead of reverting the whole PR 841733c, I am testing bringing back SafeNoSync from v2.60.10 on my local instances. The PR for this is here: It is looking good. Testing on 1 x sepolia and 2 x mainnet nodes, I am not experiencing any slowness. Let me know if you want me to alter the PR to target the next release branch when that becomes ready.
@keithchew thanks for sharing the results of your investigation. Just to be sure, does the slowness issue occur only on Sepolia and not on mainnet? On my side, I've stopped my Sepolia node for a day, and upon restarting it I notice it is stuck on this log:
Did you also observe the same behavior? Thanks for raising the PR; I will check if that indeed improves the speed.
@keithchew Never mind the comment I made about the log; it was a false alarm. The node managed to catch up to the tip in a matter of a few minutes, as you can see in the logs:
I am seeing this on both sepolia and mainnet (logs above). I have my public IP published for NAT, which increases the p2p activity on my nodes, even to the point of crashing erigon by exceeding the 10000 threads limit: The PR retains the previous flags from v2.60.10, i.e. the DB flags for no sync, so the 2s sync will allow for write backpressure. You should not see any speed improvements; instead, when there is a spike of write activity, this will allow the writes to be flattened/smoothed out. For reference, here are the flags from v2.60.10:
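As background on the 10000 figure mentioned above: Go's runtime aborts the whole process once it needs more OS threads than its configured cap, and that cap defaults to 10000. A tiny standard-library sketch showing where the limit lives (context only, not a proposed fix):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// SetMaxThreads returns the previous limit; the runtime default is 10000.
	// Goroutines blocked in syscalls or cgo each hold an OS thread, and once
	// the cap is exceeded the runtime crashes the whole process, which matches
	// the thread-exhaustion crash described above.
	prev := debug.SetMaxThreads(20000)
	fmt.Println("previous OS-thread limit:", prev)
}
```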
Discussion in: #13457 This PR brings back SafeNoSync from v2.60.10 to prevent the DB from becoming congested and slowing down.
But anyway: if it helps in your case - maybe there is another explanation - let's accept your PR (because it was introduced only in a recent release) and we will investigate why it impacts performance. Thank you for finding this. I also ported it to E3: #13545
Thanks @AskAlexSharov for the approval and merge. Here is a page with the flags explanation: In particular:
The default is to flush to disk immediately, which can be costly under high write activity; SafeNoSync instead lets the OS perform the writes in an async flush fashion.
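To illustrate that difference in a self-contained way (plain files and made-up sizes rather than MDBX, so the absolute numbers mean nothing): fsync after every write versus letting writes accumulate and flushing once, the latter standing in for the periodic 2s sync described above.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// writeAll writes n 4 KiB chunks to a temp file, calling fsync after every
// write when syncEach is true, or once at the end otherwise.
func writeAll(n int, syncEach bool) time.Duration {
	f, err := os.CreateTemp("", "flush-demo")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	chunk := make([]byte, 4096)
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := f.Write(chunk); err != nil {
			panic(err)
		}
		if syncEach {
			f.Sync() // durable mode: force data to disk on every write
		}
	}
	if !syncEach {
		f.Sync() // SafeNoSync-style: one deferred flush smooths out the burst
	}
	return time.Since(start)
}

func main() {
	fmt.Println("fsync on every write: ", writeAll(200, true))
	fmt.Println("single deferred flush:", writeAll(200, false))
}
```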
Trying to upgrade from v2.60.10 to v2.61.0. It is trying to catch up with the latest block, but it is very, very slow:
Rolling back to v2.60.10, it catches up quickly. Is there a regression issue?