Service restarts due to upgrades can destroy testnet deployment #1952
Comments
On a clean
As soon as tendermint starts, it will create
A service may restart in k8s; if that happens, we must not regenerate the tendermint config, as doing so will break the currently running testnet. Instead, for the initContainer logic, let's gate on whether the `config/addrbook.json` file exists: it won't as a result of `tendermint init`, but will exist as soon as tendermint starts. Refs #1952.
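The gate described above can be sketched roughly as follows. This is only an illustration in Python, not the actual initContainer script from the deployment: the function names `should_initialize` and `run_init_container` and the `generate_config` callback are hypothetical, and the only detail taken from the thread is that `config/addrbook.json` is absent after `tendermint init` but present once tendermint has started.

```python
import tempfile
from pathlib import Path
from typing import Callable


def should_initialize(tendermint_home: Path) -> bool:
    """Return True only if tendermint has never started from this home dir.

    Assumption from the thread: config/addrbook.json is not created by
    `tendermint init`, but exists as soon as tendermint has started.
    """
    return not (tendermint_home / "config" / "addrbook.json").exists()


def run_init_container(
    tendermint_home: Path,
    generate_config: Callable[[Path], None],
) -> bool:
    """Generate config only on first boot; return whether init actually ran.

    `generate_config` is a hypothetical stand-in for whatever the real
    initContainer does to write the tendermint config.
    """
    if should_initialize(tendermint_home):
        generate_config(tendermint_home)
        return True
    # A previous run already started tendermint here: leave the
    # persistent volume untouched so the running testnet is not broken.
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        home = Path(tmp)
        (home / "config").mkdir()

        calls = []
        print(run_init_container(home, calls.append))  # first boot: init runs

        # Simulate tendermint having started and written its address book.
        (home / "config" / "addrbook.json").write_text("{}")
        print(run_init_container(home, calls.append))  # restart: init skipped
```

The point of keying on a file that only a running node writes is that a pod restart against the same persistent volume becomes a no-op for the init step.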
That seems to work well. To test, I deployed f0775c3 to preview, then pulled logs from the first validator in the preview deployment:
There we can see the key-init logic running. Then I killed the pod for the first validator, via
Let's grab those logs and inspect:
Just what we want: the new validator instance comes up, using the same config in the persistent volume that was previously created.
And here's the same, but for a fullnode in the deployment:
It was worth checking separately, since technically the fullnode and validator configs use different init logic.
This happened again on testnet 044-ananke. We can see that the pods were destroyed and recreated ~11h ago:
And this matches the lifetime of the nodes on which those pods are running:
The don't-reinitialize logic described above was triggered:
Which is good, but clearly not enough to keep the testnet functioning. From a node on the testnet:
Over the weekend we saw a failure of Testnet 42 Adraste (#1877). After investigation, it appears that an automatic node pool upgrade destroyed the deployment at around `2023-02-05T05:45+00:00`:
Ostensibly this happened because we've set the cluster config node pool options to `auto_upgrade=true`, here: `penumbra/deployments/terraform/modules/node/v1/gke.tf`, line 47 in a0a6a5c.