Service restarts due to upgrades can destroy testnet deployment #1952

Closed
conorsch opened this issue Feb 6, 2023 · 5 comments

conorsch commented Feb 6, 2023

Over the weekend we saw a failure of Testnet 42 Adraste (#1877). After investigation, it appears that an automatic node pool upgrade destroyed the deployment at around 2023-02-05T05:45+00:00:

[screenshot: GKE node pool auto-upgrade event, gke-auto-upgrade]

Ostensibly this happened because we've set auto_upgrade=true in the cluster config's node pool options, but the root cause is that our initContainer logic for Tendermint keygen isn't idempotent. Let's update the latter so that we can safely restart a service from the same persistent volume and have things work just fine.


conorsch commented Feb 6, 2023

On a clean tendermint init, we see the keyfiles and config:

tendermint init --home /tmp/sandbox-tm
I[2023-02-06|09:28:10.009] Generated private validator                  module=main keyFile=/tmp/sandbox-tm/config/priv_validator_key.json stateFile=/tmp/sandbox-tm/data/priv_validator_state.json
I[2023-02-06|09:28:10.009] Generated node key                           module=main path=/tmp/sandbox-tm/config/node_key.json
I[2023-02-06|09:28:10.009] Generated genesis file                       module=main path=/tmp/sandbox-tm/config/genesis.json

❯ tree /tmp/sandbox-tm
/tmp/sandbox-tm
├── config
│   ├── config.toml
│   ├── genesis.json
│   ├── node_key.json
│   └── priv_validator_key.json
└── data
    └── priv_validator_state.json

3 directories, 5 files

As soon as tendermint starts, it will create config/addrbook.json. Let's check for the existence of that file and skip initialization if we find it: its presence means Tendermint has started at least once before, so we should not touch its state. It would be just as reasonable to gate on data/state.db instead.
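For illustration, a minimal sketch of that guard as it might appear in the initContainer script (the paths and the validator init mode match the trace logs later in this thread; the exact script may differ):

#!/bin/sh
set -eux
CHAIN_DIR=/home/.tendermint
# If the address book exists, tendermint has already run at least once
# from this volume; leave its keys and config untouched.
if [ -e "$CHAIN_DIR/config/addrbook.json" ]; then
  echo 'Address book already exists, not initializing...'
  exit 0
fi
# First boot on this volume: safe to generate keys and genesis.
tendermint init validator --home "$CHAIN_DIR"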

conorsch added a commit that referenced this issue Feb 6, 2023
A service may restart in k8s; if that happens, we must not regenerate
the tendermint config, as doing so will break the currently running
testnet. Instead, for the initContainer logic, let's gate on whether
the `config/addrbook.json` file exists: it won't as a result of
`tendermint init`, but will exist as soon as tendermint starts.

Refs #1952.

conorsch commented Feb 6, 2023

That seems to work well. To test, I deployed f0775c3 to preview, then pulled logs from the first validator in the preview deployment:

$ kubectl logs  --since=1h penumbra-testnet-preview-val-0-xg2ts --all-containers > val-0-before.log
$ head -n 15 val-0-before.log 
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-val-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-val-0
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ '[' '!' -d /home/.tendermint ]
+ tendermint init validator --home /home/.tendermint
I[2023-02-06|18:06:45.016] Generated private validator                  module=main keyFile=/home/.tendermint/config/priv_validator_key.json stateFile=/home/.tendermint/data/priv_validator_state.json
I[2023-02-06|18:06:45.017] Generated node key                           module=main path=/home/.tendermint/config/node_key.json
I[2023-02-06|18:06:45.017] Generated genesis file                       module=main path=/home/.tendermint/config/genesis.json
+ CONFIG_DIR=/home/.tendermint/config
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ '[' -d /home/tmpConfig/config ]

There we can see the key-init logic running. Then I killed the pod for the first validator via kubectl delete pod penumbra-testnet-preview-val-0-xg2ts. The ReplicationController automatically created a replacement, visible as the youngest of the set here:

$ kubectl get pods -l app.kubernetes.io/instance=penumbra-testnet-preview                
NAME                                   READY   STATUS    RESTARTS   AGE
penumbra-testnet-preview-fn-0-x9sw2    3/3     Running   0          12m
penumbra-testnet-preview-fn-1-5qxx8    3/3     Running   0          12m                  
penumbra-testnet-preview-val-0-dltcz   2/2     Running   0          61s                                                                                                                                                         
penumbra-testnet-preview-val-1-8cs9g   2/2     Running   0          12m

Let's grab those logs and inspect:

$ kubectl logs  --since=1h penumbra-testnet-preview-val-0-dltcz --all-containers > val-0-after.log
$ head -n 15 val-0-after.log
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-val-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-val-0
Address book already exists, not initializing...
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
+ exit 0
+ CONFIG_DIR=/home/.tendermint/config
Address book already exists, not merging configs...
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not merging configs...'
+ exit 0

Just what we want: the new validator instance comes up using the config previously written to the persistent volume.
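As a further check (a sketch; the container name "tm" and the presence of sha256sum in the image are assumptions), one could compare key checksums across a restart to confirm the keys survived:

# Before deleting the pod:
kubectl exec penumbra-testnet-preview-val-0-xg2ts -c tm -- \
  sha256sum /home/.tendermint/config/node_key.json /home/.tendermint/config/priv_validator_key.json
# After the replacement pod is Running, the same command against it
# should print identical hashes:
kubectl exec penumbra-testnet-preview-val-0-dltcz -c tm -- \
  sha256sum /home/.tendermint/config/node_key.json /home/.tendermint/config/priv_validator_key.json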


conorsch commented Feb 6, 2023

And here's the same, but for a fullnode in the deployment:

$ head -n 15 fn-0-after.log
+ chown -R 1025:1025 /home/pv-penumbra-testnet-preview-tm-fn-0
+ chown -R 1000:1000 /home/pv-penumbra-testnet-preview-pd-fn-0
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
Address book already exists, not initializing...
+ exit 0
+ CONFIG_DIR=/home/.tendermint/config
+ MERGE_DIR=/tmp/configMerge
+ OVERLAY_DIR=/config
+ TMP_DIR=/home/tmpConfig
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not merging configs...'
+ exit 0
Address book already exists, not merging configs...

It was worth checking separately, since technically the fullnode and validator configs use different init logic.
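For context, the two init modes in Tendermint 0.35's CLI (the validator invocation is visible in the trace earlier; the fullnode one is an assumption, since the fullnode trace above skips init entirely):

# Validator nodes also generate a consensus (private validator) key:
tendermint init validator --home "$CHAIN_DIR"
# Fullnodes only need a node key and genesis file:
tendermint init full --home "$CHAIN_DIR"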

conorsch closed this as completed Feb 6, 2023
conorsch commented

This happened again on testnet 044-ananke. We can see that the pods were destroyed and recreated ~11h ago:

❯ kubectl get pods -l app.kubernetes.io/instance=penumbra-testnet
NAME                           READY   STATUS    RESTARTS   AGE
penumbra-testnet-fn-0-ct9pb    3/3     Running   0          11h
penumbra-testnet-fn-1-7gvm5    3/3     Running   0          11h
penumbra-testnet-val-0-rf24g   2/2     Running   0          11h
penumbra-testnet-val-1-rrzp4   2/2     Running   0          11h

And this matches the lifetime of the nodes on which those pods are running:

❯ kubectl get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
gke-testnet-chain-node-pool-582c2542-q3rn   Ready    <none>   11h   v1.25.5-gke.2000
gke-testnet-chain-node-pool-8f5ab500-ubvc   Ready    <none>   11h   v1.25.5-gke.2000

The don't-reinitialize logic described above was triggered:

❯ kubectl logs penumbra-testnet-fn-0-ct9pb  -c config-init
Address book already exists, not initializing...
+ CHAIN_DIR=/home/.tendermint
+ '[' -e /home/.tendermint/config/addrbook.json ]
+ echo 'Address book already exists, not initializing...'
+ exit 0

Which is good, but clearly not enough to keep the testnet functioning. From a node on the testnet:

Feb 18 17:56:31 shadow tendermint[715337]: E[2023-02-18|17:56:31.175] prevote step: ProposalBlock is invalid       module=consensus height=63272 round=281 err="wrong Block.Header.AppHash.  Expected DFA44D9E49CB9A07B8A6AC1A227B7212A5BF94A48E4CBA271518E3FE56E026CE, got 9232347076BF2BCF833688502A16220A97608D40BCEFEA5C94BC2201F84A4C9D"  
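To inspect a divergence like this (a sketch, assuming Tendermint's RPC is reachable on the default port 26657), one can pull the header at the disputed height from a healthy node and compare app_hash values:

# The AppHash in the header at height H commits the app state after
# height H-1, so a healthy node's header at the disputed height should
# show the "Expected" value from the error above:
curl -s 'http://localhost:26657/block?height=63272' | jq -r '.result.block.header.app_hash'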

conorsch commented

We disabled automatic upgrades to the node pool in f2d98df to minimize surprises, and filed #2011 to increase headroom on storage requests.
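For reference, the equivalent toggle via the gcloud CLI looks roughly like this (the cluster and pool names are inferred from the node names above; the zone is a placeholder, and the actual change landed in the cluster config in f2d98df):

gcloud container node-pools update chain-node-pool \
  --cluster=testnet \
  --zone=us-central1-a \
  --no-enable-autoupgrade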
