Service restarts due to upgrades can destroy testnet deployment #1952

Closed
conorsch opened this issue Feb 6, 2023 · 5 comments

conorsch commented Feb 6, 2023

Over the weekend we saw a failure of Testnet 42 Adraste (#1877). After investigation, it appears that an automatic node pool upgrade destroyed the deployment at around 2023-02-05T05:45+00:00:

[screenshot: GKE node pool auto-upgrade event, gke-auto-upgrade]

Ostensibly this happened because we've set auto_upgrade=true in the cluster config's node pool options, but the root cause is that our initContainer logic for Tendermint keygen isn't idempotent. Let's update the latter so that we can safely restart a service from the same persistent volume and have things work just fine.


conorsch commented Feb 6, 2023

On a clean tendermint init, we see the keyfiles and config:

tendermint init --home /tmp/sandbox-tm
I[2023-02-06|09:28:10.009] Generated private validator                  module=main keyFile=/tmp/sandbox-tm/config/priv_validator_key.json stateFile=/tmp/sandbox-tm/data/priv_validator_state.json
I[2023-02-06|09:28:10.009] Generated node key                           module=main path=/tmp/sandbox-tm/config/node_key.json
I[2023-02-06|09:28:10.009] Generated genesis file                       module=main path=/tmp/sandbox-tm/config/genesis.json

❯ tree /tmp/sandbox-tm
/tmp/sandbox-tm
├── config
│   ├── config.toml
│   ├── genesis.json
│   ├── node_key.json
│   └── priv_validator_key.json
└── data
    └── priv_validator_state.json

3 directories, 5 files

As soon as tendermint starts, it will create config/addrbook.json. Let's check for the existence of that file and skip initialization if we find it: its presence means Tendermint has started at least once before, so we should not touch its state. It would be just as reasonable to gate on data/state.db instead.
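For illustration, a minimal sketch of that guard as it might appear in the initContainer script (the paths and the validator init mode match the trace logs later in this thread; the exact script may differ):

#!/bin/sh
set -eux
CHAIN_DIR=/home/.tendermint
# If the address book exists, tendermint has already run at least once
# from this volume; leave its keys and config untouched.
if [ -e "$CHAIN_DIR/config/addrbook.json" ]; then
  echo 'Address book already exists, not initializing...'
  exit 0
fi
# First boot on this volume: safe to generate keys and genesis.
tendermint init validator --home "$CHAIN_DIR"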

conorsch added a commit that referenced this issue Feb 6, 2023
A service may restart in k8s; if that happens, we must not regenerate
the tendermint config, as doing so will break the currently running
testnet. Instead, for the initContainer logic, let's gate on whether
the `config/addrbook.json` file exists: it won't as a result of
`tendermint init`, but will exist as soon as tendermint starts.

Refs #1952.

conorsch commented Feb 6, 2023

That seems to work well. To test, I deployed f0775c3 to preview, then pulled logs from the first validator in the preview deployment:

$ kubectl logs  --since=1h penumbra-testnet-preview-val-0-xg2ts --all-containers > val-0-before.log
$ head -n 15 val-0-before.log 
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-val-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-val-0
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ '[' '!' -d /home/.tendermint ]
+ tendermint init validator --home /home/.tendermint
I[2023-02-06|18:06:45.016] Generated private validator                  module=main keyFile=/home/.tendermint/config/priv_validator_key.json stateFile=/home/.tendermint/data/priv_validator_state.json
I[2023-02-06|18:06:45.017] Generated node key                           module=main path=/home/.tendermint/config/node_key.json
I[2023-02-06|18:06:45.017] Generated genesis file                       module=main path=/home/.tendermint/config/genesis.json
+ CONFIG_DIR=/home/.tendermint/config
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ '[' -d /home/tmpConfig/config ]

There we can see the key-init logic running. Then I killed the pod for the first validator via kubectl delete pod penumbra-testnet-preview-val-0-xg2ts. The ReplicationController automatically created a replacement, visible as the youngest of the set here:

$ kubectl get pods -l app.kubernetes.io/instance=penumbra-testnet-preview                
NAME                                   READY   STATUS    RESTARTS   AGE
penumbra-testnet-preview-fn-0-x9sw2    3/3     Running   0          12m
penumbra-testnet-preview-fn-1-5qxx8    3/3     Running   0          12m                  
penumbra-testnet-preview-val-0-dltcz   2/2     Running   0          61s                                                                                                                                                         
penumbra-testnet-preview-val-1-8cs9g   2/2     Running   0          12m

Let's grab those logs and inspect:

$ kubectl logs  --since=1h penumbra-testnet-preview-val-0-dltcz --all-containers > val-0-after.log
$ head -n 15 val-0-after.log
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-val-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-val-0
Address book already exists, not initializing...
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
+ exit 0
+ CONFIG_DIR=/home/.tendermint/config
Address book already exists, not merging configs...
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not merging configs...'
+ exit 0

Just what we want: the new validator instance comes up using the config previously written to the persistent volume.
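As a further check (a sketch; the container name "tm" and the presence of sha256sum in the image are assumptions), one could compare key checksums across a restart to confirm the keys survived:

# Before deleting the pod:
kubectl exec penumbra-testnet-preview-val-0-xg2ts -c tm -- \
  sha256sum /home/.tendermint/config/node_key.json /home/.tendermint/config/priv_validator_key.json
# After the replacement pod is Running, the same command against it
# should print identical hashes:
kubectl exec penumbra-testnet-preview-val-0-dltcz -c tm -- \
  sha256sum /home/.tendermint/config/node_key.json /home/.tendermint/config/priv_validator_key.json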


conorsch commented Feb 6, 2023

And here's the same, but for a fullnode in the deployment:

$ head -n 15 fn-0-after.log
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-fn-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-fn-0
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
Address book already exists, not initializing...
+ exit 0
+ CONFIG_DIR=/home/.tendermint/config
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not merging configs...'
+ exit 0
Address book already exists, not merging configs...

It was worth checking separately, since technically the fullnode and validator configs use different init logic.
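For context, the two init modes in Tendermint 0.35's CLI (the validator invocation is visible in the trace earlier; the fullnode one is an assumption, since the fullnode trace above skips init entirely):

# Validator nodes also generate a consensus (private validator) key:
tendermint init validator --home "$CHAIN_DIR"
# Fullnodes only need a node key and genesis file:
tendermint init full --home "$CHAIN_DIR"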

conorsch closed this as completed Feb 6, 2023
conorsch commented

This happened again on testnet 044-ananke. We can see that the pods were destroyed and recreated ~11h ago:

❯ kubectl get pods -l app.kubernetes.io/instance=penumbra-testnet
NAME                           READY   STATUS    RESTARTS   AGE
penumbra-testnet-fn-0-ct9pb    3/3     Running   0          11h
penumbra-testnet-fn-1-7gvm5    3/3     Running   0          11h
penumbra-testnet-val-0-rf24g   2/2     Running   0          11h
penumbra-testnet-val-1-rrzp4   2/2     Running   0          11h

And this matches the lifetime of the nodes on which those pods are running:

❯ kubectl get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
gke-testnet-chain-node-pool-582c2542-q3rn   Ready    <none>   11h   v1.25.5-gke.2000
gke-testnet-chain-node-pool-8f5ab500-ubvc   Ready    <none>   11h   v1.25.5-gke.2000

The don't-reinitialize logic described above was triggered:

❯ kubectl logs penumbra-testnet-fn-0-ct9pb  -c config-init
Address book already exists, not initializing...
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
+ exit 0

Which is good, but clearly not enough to keep the testnet functioning. From a node on the testnet:

Feb 18 17:56:31 shadow tendermint[715337]: E[2023-02-18|17:56:31.175] prevote step: ProposalBlock is invalid       module=consensus height=63272 round=281 err="wrong Block.Header.AppHash.  Expected DFA44D9E49CB9A07B8A6AC1A227B7212A5BF94A48E4CBA271518E3FE56E026CE, got 9232347076BF2BCF833688502A16220A97608D40BCEFEA5C94BC2201F84A4C9D"  
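To inspect a divergence like this (a sketch, assuming Tendermint's RPC is reachable on the default port 26657), one can pull the header at the disputed height from a healthy node and compare app_hash values:

# The AppHash in the header at height H commits the app state after
# height H-1, so a healthy node's header at the disputed height should
# show the "Expected" value from the error above:
curl -s 'http://localhost:26657/block?height=63272' | jq -r '.result.block.header.app_hash'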

conorsch commented

We disabled automatic upgrades to the node pool in f2d98df to minimize surprises, and filed #2011 to increase headroom on storage requests.
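For reference, the equivalent toggle via the gcloud CLI looks roughly like this (the cluster and pool names are inferred from the node names above; the zone is a placeholder, and the actual change landed in the cluster config in f2d98df):

gcloud container node-pools update chain-node-pool \
  --cluster=testnet \
  --zone=us-central1-a \
  --no-enable-autoupgrade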
