I have a storage server with 9 vdevs, each with 10 drives in a raidz2 layout. Six of the vdevs have 12 TB drives; the other three have 22 TB drives.
Last week I updated all packages on the Ubuntu 22.04 server, shut off smbd, upgraded from Ubuntu 22.04 -> Ubuntu 24.04, and rebooted.
Everything looked normal: the pool was healthy (as reported by zpool status -v) and sudo dmesg was clean.
I upgraded the pool with zpool upgrade. Everything still looked good.
Upon enabling smbd, the system entered a deadlock within a few minutes. Using zpool iostat -v 1 I could see there was no I/O happening on any drive.
ps aux:
pste@storage:~$ ps aux | grep "D"
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2901 4.7 0.0 0 0 ? D 08:39 17:03 [z_upgrade]
root 4484 0.0 0.0 0 0 ? D 08:39 0:00 [txg_quiesce]
root 6863 0.0 0.0 12196 8000 ? Ss 08:48 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
kr 7436 0.0 0.0 132332 23756 ? D 08:56 0:00 smbd: client [10.216.3.128]
as 7464 0.0 0.0 131708 28032 ? D 08:56 0:00 smbd: client [10.216.0.37]
zd 7529 0.0 0.0 123512 17152 ? D 08:56 0:00 smbd: client [10.216.3.160]
qa 7541 0.0 0.0 120468 39088 ? D 08:56 0:06 smbd: client [10.216.3.49]
qu 7549 3.1 0.0 155548 48848 ? D 08:56 10:48 smbd: client [10.216.0.20]
kr 12565 0.0 0.0 119112 21272 ? D 11:18 0:00 smbd: client [10.216.2.157]
pste+ 17343 0.0 0.0 9144 1920 pts/0 S+ 14:37 0:00 grep --color=auto D
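(For a cleaner listing of only the D-state tasks, since the grep above also matches its own command line, something like this works too:)

ps -eo pid,user,stat,wchan:32,cmd | awk '$3 ~ /^D/'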
I figured something transient was happening, so I rebooted the server, but the issue immediately came back. I also wondered whether z_upgrade needed to complete first, but I have read online that it is safe to let it run in the background.
My running theory is that this is triggered by writes rather than by large amounts of reads. I made the SMB share with the most I/O read-only to reduce the write load, and after starting smbd the server stayed responsive for almost an hour, until a user tried to delete a large folder and it deadlocked again. Rebooting always fixes it.
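For reference, the read-only change was just a flip of the share flag in smb.conf, roughly like this (share name and path here are placeholders, not the real ones):

# /etc/samba/smb.conf -- illustrative share stanza only
[bulk-data]
    path = /tank/bulk-data
    read only = yes        # was "no"; flipped to cut the write load
# then: sudo systemctl restart smbd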
Describe how to reproduce the problem
Upgrade server from Ubuntu 22.04 -> Ubuntu 24.04.
Upgrade the pool (commands sketched after this list)
Have one or more Samba shares, possibly performing a large amount of writes
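Roughly, the command sequence was as follows ("tank" stands in for the real pool name):

sudo do-release-upgrade        # Ubuntu 22.04 -> 24.04, then reboot
sudo zpool status -v           # pool reported healthy
sudo zpool upgrade tank        # enable the new feature flags
sudo systemctl start smbd      # resume Samba; the deadlock follows within minutes under write load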
Include any warning/errors/backtraces from the system logs
sudo dmesg output showing the deadlock:
Feb 24 08:40:00.406515 storage-01 kernel: RPC: Registered tcp-with-tls transport module.
Feb 24 08:40:00.406551 storage-01 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Feb 24 08:40:00.575078 storage-01 kernel: Process accounting resumed
Feb 24 08:40:00.928108 storage-01 systemd-journald[903]: /var/log/journal/4d9214db14634e1a802b475d9bfe0478/user-1038.journal: Journal file uses a different sequence number ID, rotating.
Feb 24 10:44:09.484041 storage-01 kernel: INFO: task txg_quiesce:4484 blocked for more than 122 seconds.
Feb 24 10:44:09.491210 storage-01 kernel: Tainted: P O 6.8.0-53-generic #55-Ubuntu
Feb 24 10:44:09.491334 storage-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 24 10:44:09.491417 storage-01 kernel: task:txg_quiesce state:D stack:0 pid:4484 tgid:4484 ppid:2 flags:0x00004000
Feb 24 10:44:09.491517 storage-01 kernel: Call Trace:
Feb 24 10:44:09.491592 storage-01 kernel: <TASK>
Feb 24 10:44:09.491655 storage-01 kernel: __schedule+0x27c/0x6b0
Feb 24 10:44:09.491784 storage-01 kernel: schedule+0x33/0x110
Feb 24 10:44:09.491896 storage-01 kernel: cv_wait_common+0x102/0x140 [spl]
Feb 24 10:44:09.491973 storage-01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Feb 24 10:44:09.492069 storage-01 kernel: __cv_wait+0x15/0x30 [spl]
Feb 24 10:44:09.492130 storage-01 kernel: txg_quiesce+0x181/0x1f0 [zfs]
Feb 24 10:44:09.492205 storage-01 kernel: txg_quiesce_thread+0xd2/0x120 [zfs]
Feb 24 10:44:09.492281 storage-01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Feb 24 10:44:09.492371 storage-01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Feb 24 10:44:09.492438 storage-01 kernel: thread_generic_wrapper+0x5c/0x70 [spl]
Feb 24 10:44:09.492505 storage-01 kernel: kthread+0xef/0x120
Feb 24 10:44:09.492591 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:44:09.492677 storage-01 kernel: ret_from_fork+0x44/0x70
Feb 24 10:44:09.492771 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:44:09.492839 storage-01 kernel: ret_from_fork_asm+0x1b/0x30
Feb 24 10:44:09.492900 storage-01 kernel: </TASK>
Feb 24 10:46:12.364009 storage-01 kernel: INFO: task txg_quiesce:4484 blocked for more than 245 seconds.
Feb 24 10:46:12.364322 storage-01 kernel: Tainted: P O 6.8.0-53-generic #55-Ubuntu
Feb 24 10:46:12.364404 storage-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 24 10:46:12.364475 storage-01 kernel: task:txg_quiesce state:D stack:0 pid:4484 tgid:4484 ppid:2 flags:0x00004000
Feb 24 10:46:12.364539 storage-01 kernel: Call Trace:
Feb 24 10:46:12.364604 storage-01 kernel: <TASK>
Feb 24 10:46:12.364705 storage-01 kernel: __schedule+0x27c/0x6b0
Feb 24 10:46:12.364781 storage-01 kernel: schedule+0x33/0x110
Feb 24 10:46:12.364834 storage-01 kernel: cv_wait_common+0x102/0x140 [spl]
Feb 24 10:46:12.364901 storage-01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Feb 24 10:46:12.364953 storage-01 kernel: __cv_wait+0x15/0x30 [spl]
Feb 24 10:46:12.365018 storage-01 kernel: txg_quiesce+0x181/0x1f0 [zfs]
Feb 24 10:46:12.365085 storage-01 kernel: txg_quiesce_thread+0xd2/0x120 [zfs]
Feb 24 10:46:12.365692 storage-01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Feb 24 10:46:12.365756 storage-01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Feb 24 10:46:12.365824 storage-01 kernel: thread_generic_wrapper+0x5c/0x70 [spl]
Feb 24 10:46:12.365890 storage-01 kernel: kthread+0xef/0x120
Feb 24 10:46:12.365957 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:46:12.366025 storage-01 kernel: ret_from_fork+0x44/0x70
Feb 24 10:46:12.366112 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:46:12.366163 storage-01 kernel: ret_from_fork_asm+0x1b/0x30
Feb 24 10:46:12.366214 storage-01 kernel: </TASK>
Feb 24 10:48:15.244023 storage-01 kernel: INFO: task txg_quiesce:4484 blocked for more than 368 seconds.
Feb 24 10:48:15.251117 storage-01 kernel: Tainted: P O 6.8.0-53-generic #55-Ubuntu
Feb 24 10:48:15.251202 storage-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 24 10:48:15.251293 storage-01 kernel: task:txg_quiesce state:D stack:0 pid:4484 tgid:4484 ppid:2 flags:0x00004000
Feb 24 10:48:15.251358 storage-01 kernel: Call Trace:
Feb 24 10:48:15.251409 storage-01 kernel: <TASK>
Feb 24 10:48:15.251460 storage-01 kernel: __schedule+0x27c/0x6b0
Feb 24 10:48:15.251527 storage-01 kernel: schedule+0x33/0x110
Feb 24 10:48:15.251594 storage-01 kernel: cv_wait_common+0x102/0x140 [spl]
Feb 24 10:48:15.251702 storage-01 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Feb 24 10:48:15.251766 storage-01 kernel: __cv_wait+0x15/0x30 [spl]
Feb 24 10:48:15.251833 storage-01 kernel: txg_quiesce+0x181/0x1f0 [zfs]
Feb 24 10:48:15.251885 storage-01 kernel: txg_quiesce_thread+0xd2/0x120 [zfs]
Feb 24 10:48:15.251951 storage-01 kernel: ? __pfx_txg_quiesce_thread+0x10/0x10 [zfs]
Feb 24 10:48:15.252025 storage-01 kernel: ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Feb 24 10:48:15.252091 storage-01 kernel: thread_generic_wrapper+0x5c/0x70 [spl]
Feb 24 10:48:15.252144 storage-01 kernel: kthread+0xef/0x120
Feb 24 10:48:15.252194 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:48:15.252260 storage-01 kernel: ret_from_fork+0x44/0x70
Feb 24 10:48:15.252325 storage-01 kernel: ? __pfx_kthread+0x10/0x10
Feb 24 10:48:15.252393 storage-01 kernel: ret_from_fork_asm+0x1b/0x30
Feb 24 10:48:15.252445 storage-01 kernel: </TASK>
Feb 24 10:48:15.252496 storage-01 kernel: INFO: task smbd[10.216.0.2:7549 blocked for more than 122 seconds.
Feb 24 10:48:15.252572 storage-01 kernel: Tainted: P O 6.8.0-53-generic #55-Ubuntu
Feb 24 10:48:15.252624 storage-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sudo systemctl status smbd shows no errors on Samba's side.
What info can I provide to help debug this? I'm also curious if there's a way to get an ETA for z_upgrade. While I've read it's fine to let that continue in the background, maybe I need to wait for it to finish before any real writes/reads?
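If it's useful, next time it hangs I can capture something along these lines (a sketch, assuming the usual OpenZFS kstat and debug interfaces; "tank" again stands in for the real pool name):

# Stacks of all blocked (D-state) tasks, dumped into the kernel log
echo w | sudo tee /proc/sysrq-trigger
sudo dmesg -T > hung-task-stacks.txt

# OpenZFS internal debug log
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_dbgmsg_enable
sudo cat /proc/spl/kstat/zfs/dbgmsg > zfs-dbgmsg.txt

# Pool and TXG state while hung
sudo zpool status -v
sudo zpool get all tank
cat /proc/spl/kstat/zfs/tank/txgs   # per-TXG timings, if the kstat is present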
It would be good to upgrade to 2.2.7, actually. I personally don't have much interest in reviewing again what we fixed there during that year of development that could possibly help here.
He's running Ubuntu 24.04, and ZFS 2.2.2 is what ships with it, last changed in April 2024 when they patched it to support kernel 6.8. Ubuntu doesn't believe in tracking ZFS development; their explanation is that not enough testing is done to meet their standards, and they prefer to take a few patches rather than update to the current version. As a result, their ZFS is always pretty questionable.
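(For completeness, the shipped version can be confirmed with either of these:)

zfs version                      # userland and kernel module versions
modinfo zfs | grep -i '^version'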