Smarter handling for unrecoverable read errors #13665
Comments
When a read error is detected, ZFS should already automatically try to recover the data using the available redundancy and, if it succeeds, issue recovery writes to the corrupted blocks. For mirrors and ditto blocks (which are also handled as mirrors) reads are retried on the other mirror components. For RAIDZ and DRAID, ZFS should try all combinations of the available data and parity disks, looking for a combination that produces a correct block checksum. If the block has no redundancy, then there is not much to do and the error is just reported, requiring administrator intervention, such as the (IIRC still WIP) corrective receive that allows replicating specifically the corrupted blocks from another system. Do you observe some of those scenarios not working?
That's not exactly what I'm seeing. What I see is:
@amotin do you know why else ZFS might report dozens of read errors even if the underlying device only had one?
I am not sure. From my recent look at the mirror and raidz/draid code I don't believe it should retry the same block reads multiple times; it should read each data copy only once. But in the context of I/O aggregation, I guess a single aggregated I/O error may produce a burst of errors for the small parent I/Os, which indeed seem to be retried in zio_vdev_io_assess() with the ZIO_FLAG_DONT_AGGREGATE flag added in order to read as much data as possible. If (I don't know) the original errors are counted despite the retry, that may be the issue you are looking for. It needs some investigation.
I think the question is about the self-healing of ZFS. AFAIK, the bad sector is remapped to a spare block by the HDD itself, and the logical address can then be read/written again. I'm not sure how long that action takes, whether the device ends up being reset (it depends on the vendor's firmware), or whether something goes wrong because the device hangs. That means OpenZFS does not know when the logical address will become readable/writable again. If the bad block is still being remapped, the read will still return an error, and then it is ZFS self-healing's turn. In my opinion, if the original logical block can be rewritten (no timeout), ZFS will self-heal the bad data and the read/write error counts in zpool status should not increase. If rewriting the original logical block fails (zpool status increases the read/write error count), self-healing would have to rewrite the data to a new area of the pool (vdev_raidz_io_done; maybe I'm describing a flawed idea).
So both OpenZFS and the SCSI device itself can do self-healing.
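As an aside, whether the drive itself has already remapped a sector can be checked from the host. A minimal sketch, assuming a SCSI/SAS disk named da3 on FreeBSD (the device name is hypothetical):

```sh
# Grown defect list: sectors the drive has remapped to spares since leaving the factory.
camcontrol defects da3 -f phys -G

# SMART view of disk health; on SCSI disks this typically includes the
# grown defect list count. Works on Linux as well (e.g. /dev/sdX).
smartctl -A /dev/da3
```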
I'd second that. I don't see the point of ZFS throwing a disk out of service when there are only a few defective sectors. On Synology, for example, you won't have (by far) such great data integrity as with ZFS, yet Synology (quite a popular NAS platform) considers a disk that is a member of an ordinary md RAID "healthy" even when there are TONS of read errors. Have a look below: would YOU consider this Synology box to be healthy?

I would leave it up to the admin to decide when it's time to remove a disk from a ZFS raid, but I also wish the handling of defective blocks were more sophisticated. At the moment, a few medium errors put a disk out of service. That's a little unfortunate if that disk is 16 TB large and the raidz2 only contains little data. Why not simply skip over the bad blocks and use all the good ones for redundancy instead? I understand that you may want to handle this strictly in a serious production environment, but on an ordinary home box, backup storage, or a testing environment, I don't see why we should have to replace a disk with a few defective sectors.
On my ZFS box:
OK, this time it's only a single error message, and after clearing, no resilver started. Last time there were a few more errors, the disk was thrown out, and there was a whole resilver after clearing the error. I think it would be useful if this behaviour were tunable/adjustable and if it were more transparent how ZFS decides whether to mark a disk defective or not. Where can I read about that, i.e. how does the ZFS defect detection logic work?
For read errors ZFS automatically issues recovery writes if it is finally able to get valid data from any redundancy, so a full scrub/resilver is not needed. For write errors, though, ZFS can't do much, since the disk is expected to do everything possible to recover by itself, and if it fails to do so it has to be replaced.
It is OS-specific. It is controlled by zfsd on FreeBSD, zed on Linux, some other daemon on Solaris, etc.
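A quick sketch of where to look on each platform; the service names and paths below are the usual defaults and may differ on your system:

```sh
# FreeBSD: zfsd handles fault management; confirm it is enabled and running.
sysrc zfsd_enable
service zfsd status

# Linux: the ZFS Event Daemon (zed) plays the same role; its zedlets and
# configuration normally live under /etc/zfs/zed.d/.
systemctl status zfs-zed
ls /etc/zfs/zed.d/
```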
Background
Modern HDDs have very sophisticated internal error correcting codes. They can usually read even from a slightly damaged sector. Yet, bitrot sometimes results in a sector that cannot be read. Manufacturers spec an upper limit for these errors, for example 1 sector for every 10^15 bits read. So dumb luck can result in unrecoverable sectors. Errors such as this should not be considered a sign of faulty hardware, as long as the error rate is within the manufacturer's spec. Upon receiving such an error, the client ought to rewrite the sector in question, if possible. Then future reads will likely succeed.
But ZFS does not rewrite the sector. Its current behavior (observed on FreeBSD stable/13 amd64) is to retry the read repeatedly. Eventually, zfsd faults the disk after too many reads fail. Notably, just a single unrecoverable sector can result in the disk being faulted.
At present, the operator's only options are:
- `zpool labelclear`, then replace the faulted disk by itself. But that could trigger a lengthy resilver process.
- `dd` or a similar tool to write junk to the troubled sector, then `zpool clear` the faulted disk, then start a scrub. ZFS will detect the junk as a checksum error and repair it. But that's labor-intensive (see the sketch after this list).
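For reference, a minimal sketch of the second workaround, assuming a pool named `tank`, a FreeBSD disk `da3`, and a bad sector at LBA 123456789 taken from the dmesg output; all names and numbers here are hypothetical:

```sh
# Overwrite the unreadable sector with zeros so the drive can remap it internally.
# 512 is the logical sector size; use 4096 for 4Kn drives.
dd if=/dev/zero of=/dev/da3 bs=512 count=1 seek=123456789 conv=notrunc

# Clear the disk's error counters / FAULTED state in the pool.
zpool clear tank da3

# Scrub so ZFS finds the zeroed sector as a checksum error and repairs it from redundancy.
zpool scrub tank
```

Writing directly to a pool member bypasses ZFS, so this is only tolerable because the scrub immediately afterwards repairs the deliberately corrupted block.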
Describe the feature you would like to see added to OpenZFS
If an HDD read fails as unrecoverable, ZFS should treat it like a checksum error: log the error and rewrite the sector if possible. Arguably it should start a scrub too.
How will this feature improve OpenZFS?
This will reduce downtime by avoiding lengthy resilvers, save money by avoiding unnecessary disk replacements, and save labor by handling these types of errors automatically.
Additional context
Here's an example of dmesg output generated by one such HDD error:
Note that ZFS doesn't yet have enough information to know which reads were unrecoverable, at least on FreeBSD. On FreeBSD, geom merely reports "EIO". It would need to be enhanced to report a different error code for unrecoverable reads. EINTEGRITY, perhaps?
SCSI log pages contain enough information to calculate the unrecoverable error rate. The easiest way to view them is with `smartctl`; a hedged example command is sketched below. From this drive's error counter log we see that there have been 6 uncorrected errors out of 831245 GB read. That works out to 1.1 unrecoverable sectors per 10^15 bits read, or just slightly over spec.
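A minimal sketch of pulling that log page; the device name is hypothetical and the exact counters shown depend on the drive:

```sh
# Print the SCSI error counter log (total corrected/uncorrected errors and
# gigabytes processed for read, write and verify operations).
smartctl -l error /dev/da3

# Or dump everything, including the error counter log and other log pages.
smartctl -x /dev/da3
```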