ZFS Corrective Resilvering Functions #15917
Comments
I'm not a ZFS expert (I'm just another user in a somewhat similar situation, currently trying to understand my options), but I've seen devs making statements like "ZFS should already automatically try to recover the data using available redundancy and, if that succeeds, issue recovery writes to the corrupted blocks." I'm not 100% sure whether that applies to the replacing/resilvering situation, though; maybe it was only meant to be about normal reads from a normally functioning vdev. This makes me think that, if the data is actually there, ZFS may already recover it on its own; however, I'm not certain that holds during a replace.
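For what it's worth, the usual way to exercise that self-healing path on a redundant pool is a scrub (the pool name here is illustrative):

```
# A scrub reads every allocated block; on a redundant vdev, blocks that fail
# their checksum are rebuilt from another copy or from parity and rewritten.
zpool scrub tank
zpool status -v tank   # shows "scrub repaired ..." and per-device error counts
```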
From mirrors, ZFS will basically try to read one leg randomly [1], and if that fails, try the other legs and write a replacement if it finds another copy or reconstructs it from parity. The probable exception is replace commands: if the "old" disk throws a read error, I don't expect ZFS to try writing a replacement to the old disk, only to the new one, but I've not checked or tested that. (After all, when you're replacing a failing disk, trying to write to bad sectors might cause it to error out or hang and fall off the bus...)

A "replace" is basically making a temporary mirror vdev out of the old and the new disk and resilvering onto the new one. So you shouldn't be losing any redundancy by the replace. And it detaches the old disk at the end because the end goal of a replace is to remove it from the pool.

[1] - not actually randomly, but don't worry about it
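For reference, a minimal sketch of that flow using real zpool commands (the pool and device names are made up for illustration):

```
# Start replacing a failing disk; ZFS builds a temporary "replacing" mirror
# vdev out of the old and new devices and resilvers onto the new one.
zpool replace tank /dev/sdb /dev/sdf

# While the resilver runs, both disks appear under a replacing-N vdev;
# the old disk is detached automatically once the resilver completes.
zpool status tank
```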
Describe the feature you would like to see added to OpenZFS
Find useful data in faulted or offline devices in order to repair "permanent errors".
For example:
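A hypothetical sketch of the requested workflow follows; neither of these invocations exists in zpool today, they only illustrate the behavior being asked for:

```
# Hypothetical: re-admit a FAULTED disk in a read-only, corrective-only state
zpool online --corrective tank sdc

# Hypothetical: a scrub/resilver pass that may read (but never write) the
# ex-FAULTED disk to repair blocks currently listed as permanent errors
zpool scrub --use-faulted tank
```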
I realize there is some necessary threshold for triggering a device FAULT. However, the user should be able to manually put a FAULTED or OFFLINE device into a state like "ex-FAULTED," designated for corrective functions only. Resilvering takes place in the absence of the devices being replaced, from ZFS's point of view, right? This opens up unnecessary exposure to permanent errors, which could be either prevented or, as I'm suggesting here, later corrected using the remaining good sectors of the FAULTED device(s).
If ZFS's default behavior is to exclude (FAULT) known-bad devices as quickly as possible, don't we want the ability to invert that tendency precisely when we need to reduce the chances of permanent errors?
How will this feature improve OpenZFS?
Reduces and/or repairs permanent errors during/before some resilvering processes. Helps form a complete system of tools and processes for those recovering from broken hardware.
Additional context
This was inspired by a small number of permanent errors I experienced while resilvering a raidz2 pool. I could only see three ways out:
Providing a decision tree (or flow chart) would be invaluable for guiding users through optimal recovery paths. ddrescue isn't officially covered in the ZFS documentation (though it's very useful in some cases); however, I see no way to use a ddrescue'd drive to recover permanent data errors after its replacement in the pool has already commenced. The above ZFS feature could help a user find and use available data to recover from "permanent errors" in mirrors and all kinds of raidz configurations!
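For anyone weighing that path, a typical ddrescue clone of a failing disk looks roughly like this (the device paths and mapfile name are illustrative); the open question above remains how to feed the cloned data back into the pool once the replacement has already started:

```
# Copy everything readable first, then retry the bad areas up to 3 times;
# the mapfile records unreadable regions and lets the copy be resumed.
ddrescue -f -r3 /dev/sdOLD /dev/sdNEW rescue.map
```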