I have a storage server with 9 vdevs, each with 10 drives in a raidz2 layout. Six of the vdevs have 12 TB drives; the other three have 22 TB drives.
Last week I updated all packages on the Ubuntu 22.04 server, shut off smbd, upgraded from Ubuntu 22.04 -> Ubuntu 24.04, and rebooted.
Everything looked normal: the pool was healthy (as reported by zpool status -v) and sudo dmesg was clean.
I upgraded the pool with zpool upgrade. Everything still looked good.
Upon enabling smbd, the system entered a deadlock within a few minutes. Using zpool iostat -v 1 I could see there was no I/O happening on any drive.
ps aux:
pste@storage:~$ ps aux | grep "D"
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2901 4.7 0.0 0 0 ? D 08:39 17:03 [z_upgrade]
root 4484 0.0 0.0 0 0 ? D 08:39 0:00 [txg_quiesce]
root 6863 0.0 0.0 12196 8000 ? Ss 08:48 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
kr 7436 0.0 0.0 132332 23756 ? D 08:56 0:00 smbd: client [10.216.3.128]
as 7464 0.0 0.0 131708 28032 ? D 08:56 0:00 smbd: client [10.216.0.37]
zd 7529 0.0 0.0 123512 17152 ? D 08:56 0:00 smbd: client [10.216.3.160]
qa 7541 0.0 0.0 120468 39088 ? D 08:56 0:06 smbd: client [10.216.3.49]
qu 7549 3.1 0.0 155548 48848 ? D 08:56 10:48 smbd: client [10.216.0.20]
kr 12565 0.0 0.0 119112 21272 ? D 11:18 0:00 smbd: client [10.216.2.157]
pste+ 17343 0.0 0.0 9144 1920 pts/0 S+ 14:37 0:00 grep --color=auto D
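(For a cleaner listing of only the D-state tasks, since the grep above also matches its own command line, something like this works too:)

ps -eo pid,user,stat,wchan:32,cmd | awk '$3 ~ /^D/'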
I figured something transient was happening, so I rebooted the server, but the issue immediately came back. I also wondered whether z_upgrade needed to complete first, but I have read online that it is safe to let it run in the background.
My running theory is that this is triggered by writes rather than by large amounts of reads. I made the SMB share with the most I/O read-only to reduce the write load, and after starting smbd the server stayed responsive for almost an hour, until a user tried to delete a large folder and it deadlocked again. Rebooting always fixes it.
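For reference, the read-only change was just a flip of the share flag in smb.conf, roughly like this (share name and path here are placeholders, not the real ones):

# /etc/samba/smb.conf -- illustrative share stanza only
[bulk-data]
    path = /tank/bulk-data
    read only = yes        # was "no"; flipped to cut the write load
# then: sudo systemctl restart smbd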
Describe how to reproduce the problem
Upgrade server from Ubuntu 22.04 -> Ubuntu 24.04.
Upgrade the pool (commands sketched after this list)
Have one or more Samba shares, possibly performing a large amount of writes
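Roughly, the command sequence was as follows ("tank" stands in for the real pool name):

sudo do-release-upgrade        # Ubuntu 22.04 -> 24.04, then reboot
sudo zpool status -v           # pool reported healthy
sudo zpool upgrade tank        # enable the new feature flags
sudo systemctl start smbd      # resume Samba; the deadlock follows within minutes under write load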
Include any warning/errors/backtraces from the system logs
sudo dmesg output showing the deadlock:
Feb 24 08:40:00.406515 storage-01 kernel: RPC: Registered tcp-with-tls transport module.
Feb 24 08:40:00.406551 storage-01 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Feb 24 08:40:00.575078 storage-01 kernel: Process accounting resumed
Feb 24 08:40:00.928108 storage-01 systemd-journald[903]: /var/log/journal/4d9214db14634e1a802b475d9bfe0478/user-1038.journal: Journal file uses a different sequence number ID, rotating.
Feb 24 10:44:09.484041 storage-01 kernel: INFO: task txg_quiesce:4484 blocked for more than 122 seconds.
Feb 24 10:44:09.491210 storage-01 kernel: Tainted: P O 6.8.0-53-generic #55-Ubuntu
Feb 24 10:44:09.491334 storage-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 24 10:44:09.491417 storage-01 kernel: task:txg_quiesce state:D stack:0 pid:4484 tgid:4484 ppid:2 flags:0x00004000
Feb 24 10:44:09.491517 storage-01 kernel: Call Trace:
Feb 24 10:44:09.491592 storage-01 kernel: <TASK>
Feb 24 10:44:09.491655 storage-01 kernel: __schedule+0x27c/0x6b0
Feb 24 10:44:09.491784 storage-01 kernel: schedule+0x33/0x110
Feb 24 10:44:09.491896 storage-01 kernel: cv_wait_common+0x102/0x140 [spl]
Feb 24 10:44:09.491973 storage-01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Feb 24 10:44:09.492069 storage-01 kernel: __cv_wait+0x15/0x30 [spl]
Feb 24 10:44:09.492130 storage-01 kernel: txg_quiesce+0x181/0x1f0 [zfs]
Feb 24 10:44:09.492205 storage-01 kernel: txg_quiesce_thread+0xd2/0x120 [zfs]
Feb 24 10:44:09.492281 storage-01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Feb 24 10:44:09.492371 storage-01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Feb 24 10:44:09.492438 storage-01 kernel: thread_generic_wrapper+0x5c/0x70 [spl]
Feb 24 10:44:09.492505 storage-01 kernel: kthread+0xef/0x120
Feb 24 10:44:09.492591 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:44:09.492677 storage-01 kernel: ret_from_fork+0x44/0x70
Feb 24 10:44:09.492771 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:44:09.492839 storage-01 kernel: ret_from_fork_asm+0x1b/0x30
Feb 24 10:44:09.492900 storage-01 kernel: </TASK>
Feb 24 10:46:12.364009 storage-01 kernel: INFO: task txg_quiesce:4484 blocked for more than 245 seconds.
Feb 24 10:46:12.364322 storage-01 kernel: Tainted: P O 6.8.0-53-generic #55-Ubuntu
Feb 24 10:46:12.364404 storage-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 24 10:46:12.364475 storage-01 kernel: task:txg_quiesce state:D stack:0 pid:4484 tgid:4484 ppid:2 flags:0x00004000
Feb 24 10:46:12.364539 storage-01 kernel: Call Trace:
Feb 24 10:46:12.364604 storage-01 kernel: <TASK>
Feb 24 10:46:12.364705 storage-01 kernel: __schedule+0x27c/0x6b0
Feb 24 10:46:12.364781 storage-01 kernel: schedule+0x33/0x110
Feb 24 10:46:12.364834 storage-01 kernel: cv_wait_common+0x102/0x140 [spl]
Feb 24 10:46:12.364901 storage-01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Feb 24 10:46:12.364953 storage-01 kernel: __cv_wait+0x15/0x30 [spl]
Feb 24 10:46:12.365018 storage-01 kernel: txg_quiesce+0x181/0x1f0 [zfs]
Feb 24 10:46:12.365085 storage-01 kernel: txg_quiesce_thread+0xd2/0x120 [zfs]
Feb 24 10:46:12.365692 storage-01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Feb 24 10:46:12.365756 storage-01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Feb 24 10:46:12.365824 storage-01 kernel: thread_generic_wrapper+0x5c/0x70 [spl]
Feb 24 10:46:12.365890 storage-01 kernel: kthread+0xef/0x120
Feb 24 10:46:12.365957 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:46:12.366025 storage-01 kernel: ret_from_fork+0x44/0x70
Feb 24 10:46:12.366112 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:46:12.366163 storage-01 kernel: ret_from_fork_asm+0x1b/0x30
Feb 24 10:46:12.366214 storage-01 kernel: </TASK>
Feb 24 10:48:15.244023 storage-01 kernel: INFO: task txg_quiesce:4484 blocked for more than 368 seconds.
Feb 24 10:48:15.251117 storage-01 kernel: Tainted: P O 6.8.0-53-generic #55-Ubuntu
Feb 24 10:48:15.251202 storage-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 24 10:48:15.251293 storage-01 kernel: task:txg_quiesce state:D stack:0 pid:4484 tgid:4484 ppid:2 flags:0x00004000
Feb 24 10:48:15.251358 storage-01 kernel: Call Trace:
Feb 24 10:48:15.251409 storage-01 kernel: <TASK>
Feb 24 10:48:15.251460 storage-01 kernel: __schedule+0x27c/0x6b0
Feb 24 10:48:15.251527 storage-01 kernel: schedule+0x33/0x110
Feb 24 10:48:15.251594 storage-01 kernel: cv_wait_common+0x102/0x140 [spl]
Feb 24 10:48:15.251702 storage-01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Feb 24 10:48:15.251766 storage-01 kernel: __cv_wait+0x15/0x30 [spl]
Feb 24 10:48:15.251833 storage-01 kernel: txg_quiesce+0x181/0x1f0 [zfs]
Feb 24 10:48:15.251885 storage-01 kernel: txg_quiesce_thread+0xd2/0x120 [zfs]
Feb 24 10:48:15.251951 storage-01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Feb 24 10:48:15.252025 storage-01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Feb 24 10:48:15.252091 storage-01 kernel: thread_generic_wrapper+0x5c/0x70 [spl]
Feb 24 10:48:15.252144 storage-01 kernel: kthread+0xef/0x120
Feb 24 10:48:15.252194 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:48:15.252260 storage-01 kernel: ret_from_fork+0x44/0x70
Feb 24 10:48:15.252325 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:48:15.252393 storage-01 kernel: ret_from_fork_asm+0x1b/0x30
Feb 24 10:48:15.252445 storage-01 kernel: </TASK>
Feb 24 10:48:15.252496 storage-01 kernel: INFO: task smbd[10.216.0.2:7549 blocked for more than 122 seconds.
Feb 24 10:48:15.252572 storage-01 kernel: Tainted: P O 6.8.0-53-generic #55-Ubuntu
Feb 24 10:48:15.252624 storage-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sudo systemctl status smbd shows no errors on Samba's side.
What info can I provide to help debug this? I'm also curious if there's a way to get an ETA for z_upgrade. While I've read it's fine to let that continue in the background, maybe I need to wait for it to finish before any real writes/reads?
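If it's useful, next time it hangs I can capture something along these lines (a sketch, assuming the usual OpenZFS kstat and debug interfaces; "tank" again stands in for the real pool name):

# Stacks of all blocked (D-state) tasks, dumped into the kernel log
echo w | sudo tee /proc/sysrq-trigger
sudo dmesg -T > hung-task-stacks.txt

# OpenZFS internal debug log
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_dbgmsg_enable
sudo cat /proc/spl/kstat/zfs/dbgmsg > zfs-dbgmsg.txt

# Pool and TXG state while hung
sudo zpool status -v
sudo zpool get all tank
cat /proc/spl/kstat/zfs/tank/txgs   # per-TXG timings, if the kstat is present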
It would be good to upgrade to 2.2.7, actually. I personally don't have much interest in reviewing again what we fixed there during that year of development that could possibly help here.
He's running Ubuntu 24.04, and ZFS 2.2.2 is what ships with it, last changed in April 2024 when they patched it to support kernel 6.8. Ubuntu doesn't believe in tracking ZFS development; their explanation is that not enough testing is done to meet their standards, and they prefer to take a few patches rather than update to the current version. As a result, their ZFS is always pretty questionable.
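(For completeness, the shipped version can be confirmed with either of these:)

zfs version                      # userland and kernel module versions
modinfo zfs | grep -i '^version'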