Base op-reth Archival Node: Can't sync #11512
Comments
for the record, without
Very odd, same as #11306. We haven't tried to reproduce this from the snapshot yet, but we resynced Base entirely on infrastructure similar to yours without any issues. Resyncing a Base archive takes ~48 hrs, so for now I'd recommend this
I tried to sync over the weekend "from scratch" by using the

Should doing a

My working theory is that something is using a non-deterministic hash as some kind of "foreign key" relation, and it's re-inserting or updating the same records over and over without ever making a true checkpoint. Perhaps a table is using wall-clock time as an index and diverges because an old record with another wall-clock time but the same hash exists?

Other suspicion: does MDBX have a concept of iterators such that an index/key isn't being reset (or contains gaps in records)?

Another thing of note: the "safe db_hash" reported by op-node is always 0x0000000000000 (or whatever an all-zero record would be). Is something broken, since it never sets a "safe" hash for a checkpoint?
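One way to probe the "never reaches a true checkpoint" theory is to inspect the database offline between restarts and see whether table sizes and stage checkpoints actually advance. Upstream reth ships `reth db` subcommands such as `db stats`; whether op-reth exposes the identical surface is an assumption here, so this sketch prints the command rather than executing it:

```shell
# Hypothetical sketch: print the offline DB-inspection command to run
# while the node is stopped. If repeated runs show the same checkpoint
# numbers and table entry counts, records are being rewritten in place
# rather than progressing. The datadir path mirrors the reproduction
# command below and is an assumption about your layout.
datadir="${HOME}/oprethdata"
echo "op-reth db stats --datadir ${datadir}"
```

Comparing two snapshots of that output taken a few hours apart would distinguish "re-inserting the same records" from genuinely slow forward progress.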
This is definitely not it, but I wonder if this is because of how op-node chooses ranges to sync, combined with a poor / high-latency disk.
Thanks for the clarification on that. I think the disk latency may be symptomatic rather than a root cause: I have 20k provisioned IOPS on an AWS io2 volume, and it's saturating the IOPS while reading/writing less than 10 MB total. This log from op-node seems wonky, but I can't tell if it's just a bad log message:
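The "saturated IOPS but almost no data moved" observation can be sanity-checked with quick arithmetic. A sketch, assuming both figures are per-second rates (they are read off the comments above and are rough):

```shell
# Back-of-envelope: 20k IOPS moving only ~8 MiB/s implies a tiny average
# request size -- consistent with random MDBX B-tree page lookups rather
# than sequential I/O, which would explain latency-bound behavior even
# on a fast volume.
iops=20000
bytes_per_sec=$(( 8 * 1024 * 1024 ))
avg_request=$(( bytes_per_sec / iops ))
echo "average bytes per I/O: ${avg_request}"   # ~419 bytes per operation
```

An average request of a few hundred bytes (well under the 4 KiB page size) is the signature of scattered point reads, not bulk throughput.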
The fact that finalized/safe/pending_safe never changes would support the theory that it's resyncing the same ranges every time, but it could also just be bad logging.
So I think I have narrowed it down a bit. I noticed that when I use the

I tried using hildr to see if that would work, and if I use

When I use hildr, however, it complains about "batches" (which I assume were previously inserted by op-node) being invalid because the wall-clock time of the batches is too far skewed from an "expected" value. I also noticed the performance is pretty bad, which I assume is also caused by the batches op-node previously inserted. I am not running with

So I think this has something to do with attempting "execution layer" sync modes with reth: either reth is not processing the batches, or op-node is not properly handing the batches off to op-reth, so they never reach a "finalized/safe" state.

Unfortunately I ran into other bugs with hildr where it would freak out about null values in safe/finalized, as well as the batch-time validation mismatches, so I can't use hildr as a replacement for op-node.

At the moment I am doing a sync from scratch with op-reth and op-node on Base with "consensus-layer" sync, which at least seems to be moving up in blocks at a good speed, but it will take a while until it's finished.

It does appear that doing a
Also curious if safe/finalized should ever actually be
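Whether safe/finalized are genuinely unset can be checked against the node directly: the `safe` and `finalized` block tags are part of the standard Ethereum JSON-RPC `eth_getBlockByNumber` method, and a node that has never marked a safe head returns `null` for them. This sketch only builds and prints the payloads (the endpoint assumes the `--http.port 8545` from the reproduction command):

```shell
# Build JSON-RPC payloads asking whether the "safe" and "finalized"
# block tags resolve to real blocks or to null. Pipe each printed
# payload through curl against the node, e.g.:
#   curl -s -H 'Content-Type: application/json' -d "$payload" http://127.0.0.1:8545
for tag in safe finalized; do
  payload="{\"jsonrpc\":\"2.0\",\"method\":\"eth_getBlockByNumber\",\"params\":[\"${tag}\",false],\"id\":1}"
  echo "${payload}"
done
```

A `"result": null` response for both tags would confirm that op-node has never handed a safe/finalized head to op-reth via the engine API, matching the all-zero "safe db_hash" observation above.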
I encountered the same issue when I tried to do the initial sync from the
Disk IOPS / CPU / memory usage are all at relatively low levels.
This issue is stale because it has been open for 21 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale. |
Describe the bug
I have attempted this on a few different setups, but it does not appear I am able to sync an archival node (without --full) and keep it in sync on AWS.
I am using io2 storage (20k IOPS) with an r7a.2xlarge (64 GiB of RAM, 8 AMD EPYC 9R14 cores), and it seems to keep looping through the pipeline stages but never catching up. The culprit appears to be MerkleExecute, and I can see from the performance counters that it is not a CPU-bound problem: the single core (I assume this is a serialized, single-threaded step) is not maxed out, but my disk IOPS and utilization are always at 100%. The amount of data being transferred is also small, so even with 20k IOPS I am only reading/writing around 8 MB of data.
My suspicion is that the MDBX file is too "sparse" and needs some kind of online compaction or "defrag", but I don't know how to debug this. Running mdbx_copy is not really a solution, since it takes 5 hours to run (and is not an online operation), and I am not able to sync from the available reth-base archive snapshot.
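For reference, the compaction pass mentioned above is mdbx_copy's `-c` flag, which writes a densified copy of the environment rather than a byte-for-byte one. It shares the drawbacks already noted (offline, slow, needs space for a second copy). A sketch with assumed paths, printed rather than executed:

```shell
# Sketch of an offline compacting copy with mdbx_copy -c. The node must
# be stopped first. Both paths are assumptions: the source mirrors the
# --datadir flag in the reproduction command, and the destination is a
# hypothetical scratch volume with enough free space for a full copy.
src="${HOME}/oprethdata/db"
dst="/mnt/scratch/db-compacted"
echo "mdbx_copy -c ${src} ${dst}"
```

If the compacted copy is dramatically smaller and syncs faster when swapped in, that would support the sparse-file theory; if it behaves identically, the bottleneck is access pattern rather than file layout.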
Steps to reproduce
exec op-reth node --chain=base \
  --rollup.sequencer-http https://mainnet-sequencer.base.org \
  --http --http.port 8545 --ws --ws.port 8546 \
  --http.api=web3,debug,eth,net,txpool \
  --ws.api=web3,debug,eth,net,txpool \
  --metrics=127.0.0.1:9001 \
  --ws.origins="*" \
  --http.corsdomain="*" \
  --rollup.discovery.v4 \
  --engine.experimental \
  --authrpc.jwtsecret ${HOME}/jwt.hex \
  --datadir ${HOME}/oprethdata
Node logs
No response
Platform(s)
Linux (x86)
What version/commit are you on?
v1.0.8
What database version are you on?
2
Which chain / network are you on?
base mainnet
What type of node are you running?
Archive (default)
What prune config do you use, if any?
n/a
If you've built Reth from source, provide the full command you used
make maxperf-op
Code of Conduct