Archive node with latest snapshot cannot reach live #11306
I'm also trying without the "--l2.enginekind=reth" option, and the node can only recover one block every 2-3 seconds. That's extremely slow considering how fast the Base chain is moving at the moment: for a node one day behind, getting back to live takes about a week.
Hi @MrFrogoz, what hardware are you running this on? Specifically, what kind of disk, CPU, and total RAM?
AMD EPYC 7J13, 16 cores, 64 GB RAM, NVMe at 50K IOPS / 680 Mbps. Usage while running: CPU 3%, RAM 30%. I run many nodes for different blockchains on the same hardware; only Base has block-sync slowness issues.
Any news? In the meantime I noticed that when the node is under many debug trace calls, synchronization gets even slower and the node almost always stays behind, while its resource usage remains low. Having read the other thread, I'm trying the "--engine.experimental" option to see whether the node can stay live.
I managed to get the node live with that option, then removed it because of the following problem: #11570. However, after restarting the node without that option, recovering 10 minutes of downtime took 2 hours to reach live again. I hope you will be able to improve the performance of the binary; I don't know whether op-node has anything to do with this continuous slowdown. I'll wait for feedback.
Also, given that sync is quite slow as described above, if I make trace-block calls only on live blocks, the node is again unable to stay live and starts accumulating continuous delay.
This is likely the reason you're experiencing the tracing issues: tracing is CPU bound, so concurrent tracing requests are by default limited to the number of cores minus some margin, depending on how many cores are available.

The sync slowness sounds more like a disk issue @Rjected ?
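Not from the thread, but if tracing load is starving sync, the concurrency cap described above is tunable on recent reth/op-reth builds; a sketch, assuming the flag is still named this way (verify with `op-reth node --help`):

```sh
# Lower the cap on concurrent tracing requests so debug/trace RPC load
# leaves CPU headroom for live sync; the default is derived from core count.
op-reth node \
  --chain base \
  --rpc.max-tracing-requests 4
```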
I understand, but even if I keep the node's RPC API closed to everyone, it is still very slow to resync to live. For disk I am using an M.2 SSD with these settings:

Current utilization per second:

I have a cluster of 5 nodes and they all have the same problem; each node is a few hours behind.
Is it possible to get some feedback? Are you working on it? Are you checking whether the problem is real? From what I've seen online, other developers have the same slowdown, even on different infrastructures.
@MrFrogoz sorry for the lack of an update; we still don't know what the issue could be. We've been running Base nodes on our own infrastructure without issue, although Base may just require lower disk latency (given that the IOPS / bandwidth don't seem saturated). An exact disk model would help, since some disks perform better than others.
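As an aside (not from the thread): the single-threaded random-read latency the maintainers are alluding to can be measured directly; a sketch using fio, with a placeholder file path:

```sh
# 4 KiB random reads at queue depth 1 approximate a synchronous access
# pattern during block validation; look at the reported completion latency,
# not the IOPS ceiling. The file path is a placeholder for your data disk.
fio --name=qd1-randread --filename=/mnt/data/fio-testfile \
    --rw=randread --bs=4k --ioengine=psync --direct=1 \
    --size=1G --runtime=30s --time_based --group_reporting
```

On local NVMe this typically reports well under 100 µs per read; network-attached volumes are often an order of magnitude higher, which would match the symptom of slow sync with low IOPS utilization.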
Unfortunately the Oracle data center does not explicitly state which disk model they use, but they definitely use NVMe SSD units; theoretically it is equivalent to the AWS gp3 disk type. The fact that the node doesn't stay live even with RPC calls closed, and that the disk metrics show so little utilization, suggests some missing optimization once a certain TPS value is reached.
@MrFrogoz is the storage analogous to, e.g., GCP "local SSDs" or AWS "instance storage" (for example on their r5d machines)? Or is the storage a network-attached block storage system (AWS EBS, GCP Hyperdisk, etc.)? Is there an instance type we can take a look at? reth performs poorly on network-attached block storage because those systems have much higher latency, making it much more difficult to utilize the metered IOPS. This is because our I/O access patterns during block validation are synchronous, not parallel.
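To make the arithmetic concrete (a toy model, not reth code; all numbers are assumptions): with synchronous access, validation time is roughly reads × per-read latency, so a high-latency volume's rated IOPS go unused no matter how generous they are.

```rust
// Toy model: why per-read latency dominates when state reads are issued
// one at a time, even if the device has plenty of IOPS headroom.
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let latency = Duration::from_micros(500); // assumed per-read latency of a network volume
    let reads_per_block = 2_048; // assumed dependent state reads to validate one block
    let workers = 32;

    // Serial: each read waits for the previous one, like synchronous validation.
    let t = Instant::now();
    for _ in 0..reads_per_block {
        thread::sleep(latency); // stand-in for one dependent state read
    }
    println!("serial:   {:?}", t.elapsed()); // ~1 s per block at 500 µs/read

    // Concurrent: the same reads spread over 32 workers hide most of the latency.
    let t = Instant::now();
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let per_worker = reads_per_block / workers;
            thread::spawn(move || {
                for _ in 0..per_worker {
                    thread::sleep(latency);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("parallel: {:?}", t.elapsed()); // roughly the serial time / 32
}
```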
As written above, you can take AWS EBS as a reference: gp3 is identical to what I'm using.
I guess this task will not be handled. You will probably handle it when some chain that uses reth sees Base-like traffic at 2x, where not even AWS io2 will be enough. In the meantime I'll try these values to improve the situation:
Solved by using bare-metal servers with a SAMSUNG MZVL2512HCJQ-00B00. I hope that in the future it can also work on common cloud providers.
Just as an update: we're aware of this and will get to it eventually, but there aren't any easy fixes that we have time to work on right now. Going to leave this open, and we'll let you know when we start a more thorough effort to improve performance on cloud disks.
I'm in the same situation: syncing a Base mainnet archive with op-reth 1.1.4. It does stages 1-12, then again, again, again; the log looks like:

and it never catches the tip. I'm on quite fast hardware,

and CPU:

Storage is ZFS on 6 NVMe disks (SAMSUNG MZQLB3T8HALS). We're running three more nodes and they're fine, with the same configuration and from the same snapshot. Command line (Docker, ghcr.io/paradigmxyz/op-reth:v1.1.4):
@tmeinlschmidt very strange that it doesn't work with that type of disk; are you running on a bare-metal server? Anyway, to manage the problem temporarily, even though you run reth, add this to the op-node config: --l2.enginekind=geth. op-reth will try again to run its same stages; once done, you will see that it stops and starts receiving blocks as if it were an op-geth.
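For reference, a minimal sketch of where that flag goes in an op-node invocation; only `--l2.enginekind=geth` is the workaround from this thread, and every other flag value and path below is a placeholder for your own setup:

```sh
# Hypothetical op-node invocation; endpoints and the JWT path are placeholders.
op-node \
  --network=base-mainnet \
  --l1=http://localhost:8545 \
  --l2=http://localhost:8551 \
  --l2.jwt-secret=/path/to/jwt.hex \
  --l2.enginekind=geth
```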
Same issue as @tmeinlschmidt with Base and reth 1.2.0.
Describe the bug
When the node is set with "--l2.enginekind=reth" and starts downloading from a checkpoint to a target, it is very slow at this stage:

and when it finishes a set of blocks, it is still always a couple of hours behind and starts again with a new set of blocks. It does this for days without ever catching up to live.
Steps to reproduce
Node logs
No response
Platform(s)
Linux (x86)
What version/commit are you on?
latest
What database version are you on?
latest
Which chain / network are you on?
base
What type of node are you running?
Archive (default)
What prune config do you use, if any?
No response
If you've built Reth from source, provide the full command you used
No response
Code of Conduct