Permanent errors in metadata following 2.3.0 upgrade #17090
I ran a scrub and came back to find the pool degraded. I've never seen this particular situation before. Not only is the pool degraded, dmesg is filling indefinitely with CPU hang messages (they're still going right now).
Looking through the kernel log, I can see that the event that precipitated these indefinite kernel hangs was a set of disk resets. Usually that's recoverable, but that doesn't appear to be the case with 2.3.0 and kernel 6.12.
zpool status:
I wouldn't attribute this to ZFS changes right away. The "too many slow I/Os" tag on one of your disks only triggers after multiple I/Os were each delayed for at least 30 seconds. So, with that, you've got 3 damaged/dying disks in a RAIDZ2. Have you tried reseating the disks (or moving them to different slots)? With that many disks I assume they're on a backplane, but it might be worth moving things around a bit to rule out the data or power cables, the backplane, etc.
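The 30-second figure mentioned above corresponds to the `zio_slow_io_ms` module tunable (default 30000 ms). A hedged sketch for checking it and the per-vdev slow-I/O counts; the pool name `tank` is a placeholder:

```shell
# I/Os delayed longer than this threshold (in ms) are counted as "slow"
# and can eventually fault a vdev with "too many slow I/Os".
cat /sys/module/zfs/parameters/zio_slow_io_ms

# -s adds a SLOW column showing per-vdev slow-I/O counts.
zpool status -s tank
```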
I have the same situation on Ubuntu 22.04.5 LTS. I replaced those 2 disks and resilvered, then the errors came back. That's not normal; something is not right with 2.3.0. I tried `zpool clear pool1` and the checksum counts instantly started climbing. That did not happen with ZFS 2.2.x.
If you haven't upgraded your pool features, you can move back to 2.2.x. This does sound like a hardware issue, though.
Ah, got it working: `zfs set direct=disabled pool1`.
Now the checksum counts no longer rise and the error counts stay clean. When I use the default (`standard`) I get errors; with `disabled` it works exactly as it did with 2.2.x, regardless of how the disks are configured.
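For reference, `direct` is a per-dataset property whose valid values are `standard`, `always`, and `disabled`. A sketch of the workaround described above, using the `pool1` name from that comment:

```shell
# Show the current Direct I/O policy (2.3.0 default is "standard":
# honor O_DIRECT requests from applications).
zfs get direct pool1

# Disable Direct I/O entirely so all I/O goes through the ARC,
# matching the 2.2.x behavior.
zfs set direct=disabled pool1
```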
What applications are you using that utilize O_DIRECT?
Only KVM (OpenNebula) running on top, hosting VMs on virtio, with the disk driver configured as name='qemu' type='qcow2' cache='none' discard='unmap'. We used the PPA downstream a while ago to jump onto 2.2.2 because of the silent-corruption issue, then stayed there. Now that 2.3.0 has caused some surprises, we're switching off the downstream again; the official repos seem to be at 2.2.2, so it's all fine for now ;). I just wanted to share that there are circumstances where switching straight to 2.3.0 causes immediate issues with the other two `direct` options; this may be what the reporter is facing as well. Direct I/O without caching is often very slow, even with SSDs, and with HDDs it can be an absolute nightmare ;). I'm just thinking that `standard` as the default may not be the best approach for every setup, as many will suffer, especially setups where caching is important. For me, all good now.
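The cache='none' setting in that libvirt disk definition is what makes QEMU open the backing file with O_DIRECT, which ZFS 2.3.0's `direct=standard` now honors. A minimal disk element sketch for context; the file path and target name are placeholders:

```xml
<disk type='file' device='disk'>
  <!-- cache='none' opens the image with O_DIRECT; a cached mode such as
       cache='writeback' would avoid Direct I/O on the ZFS side entirely. -->
  <driver name='qemu' type='qcow2' cache='none' discard='unmap'/>
  <source file='/var/lib/libvirt/images/vm.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```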
Did you update your kernel as well when upgrading the ZFS version? I got errors and data corruption because of #16873, which was caused by a power-management defaults change in the kernel.
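Assuming the power-management change referred to above is the SATA link power management (LPM) policy default, the active policy can be inspected through sysfs; `host0` below is a hypothetical example, not taken from this thread:

```shell
# Print the active LPM policy for each SATA host; aggressive policies
# such as med_power_with_dipm have been reported to cause link resets
# on some drives.
grep . /sys/class/scsi_host/host*/link_power_management_policy

# Try the most conservative policy on one host (assumption: host0 is
# the affected controller; this setting does not persist across reboots).
echo max_performance > /sys/class/scsi_host/host0/link_power_management_policy
```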
System information
Describe the problem you're observing
Now that 2.3.0 was merged into trixie, I just upgraded my system to 2.3.0 (from the 2.2.x series). Before rebooting, the pool was clean, having finished a scrub a couple of weeks ago. After reboot, I got an email from zed letting me know a resilver had occurred at boot and multiple permanent errors were present.
The pool is comprised of 6 vdevs, each 11 disks in raidz2. Compression and encryption are enabled. I am using a dedicated L2ARC device, a single SSD, with secondarycache set to `metadata`.
Inspecting the "corrupt" metadata nodes:
Exactly how concerned should I be here? I have been waiting a long time for the 2.3.0 release, and I'm somewhat worried I may have rushed into it today.