
16GB of RAM needed to process the first block #670

Closed
4 tasks
faddat opened this issue Feb 20, 2021 · 14 comments
Comments

@faddat
Contributor

faddat commented Feb 20, 2021

Summary of Bug

https://github.com/cosmos/gaia/blob/a96d7f50b875c557f6c5fa98ac9db50b9fee68b5/docs/migration/cosmoshub-3.md#preliminary

I think that we should treat the RAM requirement as a bug. Other Cosmos networks will face this issue and it rules out running nodes on smaller machines. I figure that this is somehow related to #669.

Version

4.0.4

Steps to Reproduce

Run gaia on a Raspberry Pi, or any machine with less than 16GB of RAM. All nodes have to process genesis when they start, so this affects not only validators but anyone who wants to run a gaia node.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@tac0turtle
Member

I wouldn't call this a bug in Gaia. It most likely stems from the SDK and how it reads the genesis file.

@faddat
Contributor Author

faddat commented Feb 20, 2021

I'd say it's a bug with Gaia until we fix it in the SDK :).

Right now only gaia is being significantly impacted, so it is users of gaia who would feel this. Is it possible to put the same issue on two repositories?

@dualsystems

The docs have the correct info: https://github.com/cosmos/gaia/pull/621/files

@faddat
Contributor Author

faddat commented Feb 20, 2021

@dualsystems

I guess I am calling the level of resource consumption a bug. It's of course good to have the system requirements documented, but we're dealing with a not-terribly-large JSON file; I figure the import should not require 16GB of RAM.

@shahankhatch
Contributor

I don't believe this to be a bug for the following reasons:

  • 16GB is a reasonable server requirement, and documentation states the requirement
  • it's normal for 100+MB semi-structured files to require large amounts of memory to build indexes for fast lookups; the tradeoff here is likely performance vs. memory.
  • the resolution of this issue isn't clear to me. Is the target 8GB? Why spend that much time optimizing for a less-than-2x improvement, which might be undone as more state is brought in with new modules?
  • I saw a Prometheus graph from a user for the startup that showed not much memory usage. My own top-based tests showed the same.

I think @faddat has highlighted an important point, and instead of closing this we could turn it into a root-cause investigation to verify that the JSON load actually requires 16GB. I'm not sure if it should live here or on the tendermint repo (since gaia has no control over the underlying data management/memory system).

@faddat
Contributor Author

faddat commented Feb 21, 2021

Comparison: Another chain I work with (graphene framework) imports this file:

https://gateway.pinata.cloud/ipfs/QmPrwVpwe4Ya46CN9LXNnrUdWvaDLMwFetMUdpcdpjFbyu

(678MB of json, 1.35 million accounts with balances and multiple public keys per account)

It does this on a Raspberry Pi 4 with 4GB of RAM in a few minutes.

When I get back I will link to what I think is the relevant code in Tendermint; I haven't been able to figure out why it's slow just yet. 16GB is fine, I guess, but we only ever need that much once per node, and it rules out a wide range of devices.

@faddat
Contributor Author

faddat commented Feb 22, 2021

So we ingest the genesis JSON here:

https://github.com/tendermint/tendermint/blob/b6be889b97598ed0bdd662e654705a411cab3940/types/genesis.go#L111-L137

But my feeling is that the slowdown occurs when dealing with tm.db:

https://github.com/tendermint/tendermint/blob/b6be889b97598ed0bdd662e654705a411cab3940/node/node.go#L834

I am going to do a silly test: unsafe-reset-all on my 12-core / 128GB machine and start with a RAM store. The strange thing is that we don't consume much CPU while importing genesis.json, but the bottleneck probably isn't disk either; that machine is where I started the node, and it has 2TB NVMe disks in RAID 0.

@faddat
Contributor Author

faddat commented Feb 22, 2021

[screenshot: Screen Shot 2021-02-22 at 10.51.52 PM]

It's no different this way, in fact. It doesn't use much CPU, and, maybe crucially, there doesn't appear to be a "single core maxed out" condition.

@shahankhatch
Contributor

Gaia v4.0.5 resolves the startup time by bumping to Cosmos SDK 0.41.4. It may be related to the 16GB memory requirement; startup is down to 10 minutes. What do you think about running your test again?

@faddat
Contributor Author

faddat commented Mar 4, 2021

Absolutely. I'll try it on a Raspberry Pi now.

@faddat
Contributor Author

faddat commented Mar 4, 2021

@shahankhatch With what setup did you get a 10 minute start?

It took 34 minutes for me on my giant machine at Hetzner; I used the quickstart snippet exactly.

gaiad version
HEAD-042b7ef3bf07c4dc3d57eb733cd905b5bac22706
gaiad start --p2p.seeds bf8328b66dceb4987e5cd94430af66045e59899f@public-seed.cosmos.vitwit.com:26656,[email protected]:26656,[email protected]:26656,ba3bacc714817218562f743178228f23678b2873@public-seed-node.cosmoshub.certus.one:26656,[email protected]:26656
10:33PM INF starting ABCI with Tendermint
10:33PM INF Starting multiAppConn service impl=multiAppConn module=proxy
10:33PM INF Starting localClient service connection=query impl=localClient module=abci-client
10:33PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
10:33PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
10:33PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
10:33PM INF Starting EventBus service impl=EventBus module=events
10:33PM INF Starting PubSub service impl=PubSub module=pubsub
10:33PM INF Starting IndexerService service impl=IndexerService module=txindex
10:33PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=
10:33PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
10:37PM INF asserting crisis invariants inv=0/11 module=x/crisis
10:37PM INF asserting crisis invariants inv=1/11 module=x/crisis
10:37PM INF asserting crisis invariants inv=2/11 module=x/crisis
10:37PM INF asserting crisis invariants inv=3/11 module=x/crisis
10:37PM INF asserting crisis invariants inv=4/11 module=x/crisis
10:37PM INF asserting crisis invariants inv=5/11 module=x/crisis
10:38PM INF asserting crisis invariants inv=6/11 module=x/crisis
10:38PM INF asserting crisis invariants inv=7/11 module=x/crisis


11:07PM INF asserting crisis invariants inv=8/11 module=x/crisis
11:07PM INF asserting crisis invariants inv=9/11 module=x/crisis
11:07PM INF asserting crisis invariants inv=10/11 module=x/crisis
11:07PM INF asserted all invariants duration=1803438.12619 height=5200791 module=x/crisis
11:07PM INF created new capability module=ibc name=ports/transfer
11:07PM INF port binded module=x/ibc/port port=transfer
11:07PM INF claimed capability capability=1 module=transfer name=ports/transfer
11:07PM INF Completed ABCI Handshake - Tendermint and App are synced appHash= appHeight=0 module=consensus
11:07PM INF Version info block=11 p2p=8 software=v0.34.8
11:07PM INF This node is not a validator addr=80945A898B6D3E577AA148BF127F89EBE9020DC4 module=consensus pubKey=9zwf2S11nhuNMzripexW1kBTalBUUMIt2yie4WTT/QA=
11:07PM INF P2P Node ID ID=52af6bd5493d7811cd4c811143da80aa18370d76 file=/root/.gaia/config/node_key.json module=p2p
11:07PM INF Adding persistent peers addrs=[] module=p2p
11:07PM INF Adding unconditional peer ids ids=[] module=p2p
11:07PM INF Add our address to book addr={"id":"52af6bd5493d7811cd4c811143da80aa18370d76","ip":"0.0.0.0","port":26656} book=/root/.gaia/config/addrbook.json module=p2p
11:07PM INF Starting Node service impl=Node
11:07PM INF Starting pprof server laddr=localhost:6060
11:07PM INF Starting P2P Switch service impl="P2P Switch" module=p2p
11:07PM INF Starting Evidence service impl=Evidence module=evidence
11:07PM INF Starting StateSync service impl=StateSync module=statesync
11:07PM INF Starting PEX service impl=PEX module=pex
11:07PM INF Starting AddrBook service book=/root/.gaia/config/addrbook.json impl=AddrBook module=p2p
11:07PM INF Starting RPC HTTP server on 127.0.0.1:26657 module=rpc-server
11:07PM INF Starting Mempool service impl=Mempool module=mempool
11:07PM INF Starting BlockchainReactor service impl=BlockchainReactor module=blockchain
11:07PM INF Starting BlockPool service impl=BlockPool module=blockchain
11:07PM INF Starting Consensus service impl=ConsensusReactor module=consensus
11:07PM INF Reactor  module=consensus waitSync=true
11:07PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=0 numToDial=10
11:07PM INF Saving AddrBook to file book=/root/.gaia/config/addrbook.json module=p2p size=0
11:07PM INF No addresses to dial. Falling back to seeds module=pex
11:07PM INF Dialing peer address={"id":"cfd785a4224c7940e9a10f6c1ab24c343e923bec","ip":"164.68.107.188","port":26656} module=p2p
11:07PM ERR Error dialing seed err="auth failure: secret conn failed: read tcp *************** ->164.68.107.188:26656: i/o timeout" module=p2p seed={"id":"cfd785a4224c7940e9a10f6c1ab24c343e923bec","ip":"164.68.107.188","port":26656}
11:07PM INF Dialing peer address={"id":"ba3bacc714817218562f743178228f23678b2873","ip":"5.83.160.108","port":26656} module=p2p
11:07PM ERR Error dialing seed err="dial tcp 5.83.160.108:26656: connect: connection refused" module=p2p seed={"id":"ba3bacc714817218562f743178228f23678b2873","ip":"5.83.160.108","port":26656}
11:07PM INF Dialing peer address={"id":"3c7cad4154967a294b3ba1cc752e40e8779640ad","ip":"84.201.128.115","port":26656} module=p2p
11:07PM ERR Error dialing seed err="auth failure: secret conn failed: read tcp ************->84.201.128.115:26656: i/o timeout" module=p2p seed={"id":"3c7cad4154967a294b3ba1cc752e40e8779640ad","ip":"84.201.128.115","port":26656}

@shahankhatch
Contributor

I started it with invariant checks skipped. From your logs, it looks like the invariants took an amount of time consistent with a <10-minute startup once the invariant checks are removed. Do you agree?

Would you mind running it again with memory usage metrics? Just to determine whether this issue is on point with the discussion.

@faddat
Contributor Author

faddat commented Mar 16, 2021

I don't mind at all. Is there a flag for this?

Right now, I am starting it up on rpi, and timing it. Will be interesting.

[screenshot: Screen Shot 2021-03-16 at 9.47.08 PM]

Fully-automatic rpi4 image builds for Gaia are also near completion.

https://github.com/faddat/sos/actions/workflows/gaia.yml

I would like to make a PR for that here, what do you think?

@faddat
Contributor Author

faddat commented Mar 16, 2021

Gaia starting on an rpi:

Mar 16 14:44:20 alarm gaiad[5715]: 2:44PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0


Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF created new capability module=ibc name=ports/transfer
Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF port binded module=x/ibc/port port=transfer
Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF claimed capability capability=1 module=transfer name=ports/transfer
Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF Completed ABCI Handshake - Tendermint and App are synced appHash= appHeight=0 module=consensus
Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF Version info block=11 p2p=8 software=v0.34.8
Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF This node is not a validator addr=EF51097CE2F1680D13465DC0A517D9AD9429B6F6 module=consensus pubKey=ZVWv7ZehR8jgIE/LGFYGoCSH7iJMkk7JKqHSAADrI0o=
Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF P2P Node ID ID=02d23e2767e7f2abb2cb7d69b5c5d27d8ee6bfd0 file=/home/gaia/.gaia/config/node_key.json module=p2p
Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF Adding persistent peers addrs=[] module=p2p
Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF Adding unconditional peer ids ids=[] module=p2p
Mar 16 15:21:32 alarm gaiad[5715]: 3:21PM INF Add our address to book addr={"id":"02d23e2767e7f2abb2cb7d69b5c5d27d8ee6bfd0","ip":"0.0.0.0","port":26656} book=/home/gaia/.gaia/config/addrbook.json module=p2p

About 36 minutes, and there was no out-of-memory issue.

I could repeat this with a tool like htop, but if you've got another way to measure memory consumption you'd like me to use, just let me know :).

I kind of reckon that it is now safe to close this issue. It's a 4GB rpi.

@faddat faddat closed this as completed Mar 17, 2021