ZFS container mounting hangs indefinitely and can block daemon #1667
Comments
The underlying issue here may have a broader impact. Yesterday I staged my instances for patching, which happens in the middle of the night and includes reboots, and I woke up to all instances stopped. I had left the system with one container instance in the stuck state, as mentioned above. Here's a VM that failed to start back up on reboot:
Can you show `dmesg` and `ps fauxww`?
And I woke up again to all instances stopped. I spent some time last night trying to create a reproducer, without success. I have three nodes, but only two of them have this issue. There are hardware differences documented below, but considering these nodes used to be stable, it's likely a kernel/ZFS bug as you suggested, possibly related to AMD? The Incus behavior is troublesome, though, as it's leaving me in a completely down state.

Note that I recreated this last night on one of the nodes on LTS 6.0.3, so it's not limited to non-LTS. I also spent some time trying to reproduce it in a VM, because of how disruptive it is on bare metal, but failed. I can't find a reproducer, which reinforces for me that this is an issue with the specific hardware.

Node 1 (unaffected): Intel i5-9600T, 64GB non-ECC, 1x NVMe ZFS, Kernel 6.6.x, ZFS 2.2, Incus 6.0.3

While node 3 is currently on ZFS 2.3, it's the node I've mostly been using to try different configurations. I've tried combinations of Kernel 6.12.x and ZFS 2.2, but couldn't downgrade Incus to LTS due to a schema change.

dmesg
ps fauxww
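(For reference, a minimal sketch of how these two outputs can be captured right after a hang; the output file names below are just placeholders.)

```sh
# Kernel ring buffer with human-readable timestamps; contains the hung-task stack traces
dmesg -T > dmesg-$(date +%F-%H%M).log

# Full process tree, wide output, to spot defunct incusd processes and stuck mount tasks
ps fauxww > ps-fauxww-$(date +%F-%H%M).log
```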
Yep, your kernel is fucked up, nothing Incus can do about it sadly :(
The kernel traces show that Incus is asking for a ZFS mount to happen, then something hangs in ZFS and causes a complete (and likely permanent) hang on that task. Things then keep piling up as other mounts get stuck behind it, until you run out of CPU threads and your entire system dies. On the Incus front, its threads are now locked up in the kernel, so you can't kill the process in that situation; it just goes defunct and will stay in that state forever. The only way out of this is a full system reboot. But obviously the issue will occur again unless you figure out the root cause. The usual recommendation is to do a full ZFS scrub to make sure there's nothing wrong there.
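(A minimal sketch of how to confirm that picture and follow the scrub suggestion; the pool name `tank` below is a placeholder.)

```sh
# Tasks stuck in uninterruptible sleep (state D), typically the hung ZFS mount(s)
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# Ask the kernel to dump stack traces of all blocked tasks into dmesg
echo w > /proc/sysrq-trigger

# Full scrub, then check for any reported errors
zpool scrub tank
zpool status -v tank
```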
I have a monthly scrub and one ran on the 1st, but this has been going on for much longer. I'll run another anyway, but I doubt that'll help. Good tip on zpool events; I'll check that and see if it points to anything. Unfortunately, I'm running out of ideas, which is why I ended up opening this. Thanks for looking.
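(Assuming a pool named `tank`, the event and history logs can be pulled like this.)

```sh
# Verbose ZFS event log (errors, delays, state changes)
zpool events -v tank

# Administrative history (scrubs, imports, property changes) for cross-checking dates
zpool history tank
```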
You might have some luck opening an issue against the ZFS repo with that kernel stack trace to see if they have some ideas on how to debug it.
Thanks, I actually just went there and found some potentially related issues. I'm downgrading ZFS to 2.2.5 to see if that helps.
I can reliably enough hang the Incus daemon by restarting NixOS containers. The daemon seems to get stuck waiting for the ZFS volume to mount, and will never complete. Killing the daemon yields a defunct daemon.
That was three hours ago at the time of writing, with only client messages since. Attempts to restart, stop, or start the container all either error out due to instance state or hang indefinitely.
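(Roughly the sequence involved; the container name and dataset below are placeholders, not the actual names.)

```sh
# Restart hangs indefinitely or fails with an instance-state error
incus restart mycontainer

# Stop/start attempts behave the same way
incus stop mycontainer
incus start mycontainer

# Checking whether the instance's ZFS volume ever finished mounting
zfs get -r mounted,mountpoint tank/incus
```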
NixOS 24.11
Incus 6.9.0; the issue has existed since 6.8.0 and maybe earlier. I have not experienced it on LTS 6.0.x, but my sample size there is much smaller.
I've verified this across multiple kernels, and with both ZFS 2.2 and 2.3.
Sorry I haven't reported this sooner, but I can't figure out a reliable reproducer to share, even though it happens reliably enough for me that I expect the behavior. I was also trying to narrow things down by testing different configurations, but finally reached the point of just creating the ticket; the daemon lockup alone makes it worth reporting. Anecdotally, another NixOS user reported similar ZFS-related issues in their automation environment. In their case, they could retry and continue their workflow; in my case, a reboot seems to be the only reliable way to fix the host.
What else can I try, or what other information can I provide? Thanks, as always.