Smarter handling for unrecoverable read errors #13665

Open
asomers opened this issue Jul 18, 2022 · 8 comments
Labels
Type: Feature (Feature request or new feature)

Comments

@asomers
Contributor

asomers commented Jul 18, 2022

Background

Modern HDDs have very sophisticated internal error correcting codes. They can usually read even from a slightly damaged sector. Yet, bitrot sometimes results in a sector that cannot be read. Manufacturers spec an upper limit for these errors, for example 1 sector for every 10^15 bits read. So dumb luck can result in unrecoverable sectors. Errors such as this should not be considered a sign of faulty hardware, as long as the error rate is within the manufacturer's spec. Upon receiving such an error, the client ought to rewrite the sector in question, if possible. Then future reads will likely succeed.

But ZFS does not rewrite the sector. Its current behavior (observed on FreeBSD stable/13 amd64) is to retry the read repeatedly. Eventually, zfsd faults the disk after too many reads fail. Notably, just a single unrecoverable sector can result in the disk being faulted.

At present, the operator's only options are:

  • Replace the HDD, but that's expensive.
  • Clear the HDD's label with zpool labelclear, then zpool replace the faulted disk with itself. But that can trigger a lengthy resilver.
  • Use dd or a similar tool to write junk over the troubled sector, then zpool clear the faulted disk and start a scrub (sketched below). ZFS will detect the junk as a checksum error and repair it. But that's labor-intensive.
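
For that last option, here is a minimal sketch of the manual procedure. It is only illustrative: the disk name da113 and the LBA 0x7929d80 are taken from the CAM output below, the pool name "tank" is a placeholder, and a 512-byte sector size is assumed; adapt all of them before writing to a raw device.

# Overwrite the unreadable sector so the drive can remap it internally.
sudo dd if=/dev/zero of=/dev/da113 bs=512 seek=$((0x7929d80)) count=1
# Clear the fault, then let a scrub find the now-bogus sector as a checksum
# error and rewrite it from redundancy.
sudo zpool clear tank da113
sudo zpool scrub tank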

Describe the feature you would like to see added to OpenZFS

If an HDD read fails as unrecoverable, ZFS should treat it like a checksum error: log the error and rewrite the sector if possible. Arguably it should start a scrub too.

How will this feature improve OpenZFS?

This will reduce downtime by avoiding lengthy resilvers, save money by avoiding unnecessary disk replacements, and save labor by handling these types of errors automatically.

Additional context

Here's an example of dmesg output generated by one such HDD error:

(da113:mpr1:0:236:0): READ(10). CDB: 28 00 07 92 9d 58 00 07 50 00 
(da113:mpr1:0:236:0): CAM status: SCSI Status Error
(da113:mpr1:0:236:0): SCSI status: Check Condition
(da113:mpr1:0:236:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da113:mpr1:0:236:0): Info: 0x7929d80
(da113:mpr1:0:236:0): Field Replaceable Unit: 134
(da113:mpr1:0:236:0): Command Specific Info: 0x8103e6ff
(da113:mpr1:0:236:0): Actual Retry Count: 234
(da113:mpr1:0:236:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 
(da113:mpr1:0:236:0): Error 5, Unretryable error

Note that ZFS doesn't yet have enough information to know which reads were unrecoverable, at least on FreeBSD: geom merely reports EIO. It would need to be enhanced to report a different error code for unrecoverable reads. EINTEGRITY, perhaps?

SCSI log pages contain enough information to calculate the unrecoverable error rate. The easiest way to view them is with smartctl. For example:

$ sudo smartctl -a /dev/da113
...
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   903046750      521         0  903047271        531     831248.114           6
write:         0        0         2         2          2     413664.259           0

From this we see that there have been 6 uncorrected errors over 831248 GB read. That works out to roughly 0.9 unrecoverable sectors per 10^15 bits read, essentially right at the manufacturer's spec.
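
For reference, the back-of-the-envelope arithmetic behind that figure, using the numbers from the table above:

    831248.114 GB read × 10^9 B/GB × 8 bit/B ≈ 6.65 × 10^15 bits read
    6 uncorrected errors / 6.65 × 10^15 bits ≈ 0.9 errors per 10^15 bits
                                             ≈ 1 error per 1.1 × 10^15 bits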

asomers added the Type: Feature label on Jul 18, 2022
@amotin
Member

amotin commented Jul 19, 2022

When a read error is detected, ZFS already tries to recover the data automatically using any available redundancy and, if that succeeds, issues recovery writes to the corrupted blocks. For mirrors and ditto blocks (which are also handled as mirrors), reads are retried against the other mirror components. For RAIDZ and DRAID, ZFS tries all combinations of available data and parity disks, looking for a combination that yields a correct block checksum. If the block has no redundancy, then there is not much to do and the error is just reported, requiring administrator intervention, such as (IIRC) the work-in-progress recovery receive, which allows replicating just the corrupted blocks from another system. Do you observe some of these scenarios not working?
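
A quick way to see that recovery-write path in action is a throwaway, file-backed mirror. The sketch below assumes a GNU/Linux userland; the pool name, file paths, sizes, and offsets are arbitrary.

# Create a small two-way mirror on file vdevs and put some data on it.
truncate -s 256M /tmp/d0 /tmp/d1
sudo zpool create testpool mirror /tmp/d0 /tmp/d1
sudo dd if=/dev/urandom of=/testpool/junk bs=1M count=64
# Corrupt one leg well past the front labels, then scrub: reads of damaged
# blocks fail their checksum, the data is taken from the good copy, and
# repair writes heal the corrupted leg (assuming the corrupted range
# overlaps allocated blocks).
sudo zpool export testpool
sudo dd if=/dev/urandom of=/tmp/d1 bs=1M seek=16 count=32 conv=notrunc
sudo zpool import -d /tmp testpool
sudo zpool scrub testpool
sudo zpool status testpool   # expect CKSUM errors on /tmp/d1 but "No known data errors"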

@asomers
Contributor Author

asomers commented Jul 19, 2022

That's not exactly what I'm seeing. What I see is:

  • CAM reports one read error on each leg of the multipath device, so two read errors total, but only one for the gmultipath device.
  • zpool status reports a few dozen read errors
  • zfsd faulted the disk because it experienced too many read errors in a short amount of time.

@amotin do you know why else ZFS might report dozens of read errors even if the underlying device only had one?

@amotin
Member

amotin commented Jul 19, 2022

I am not sure. From my recent look at the mirror and raidz/draid code, I don't believe it should retry reads of the same block multiple times; it should read each data copy only once. But in the context of I/O aggregation, I guess a single aggregated I/O error may produce a burst of errors for the small parent I/Os, and those indeed seem to be retried in zio_vdev_io_assess() with the ZIO_FLAG_DONT_AGGREGATE flag added, to read as much data as possible. If (I don't know) the original errors are counted despite the retry, that may be the issue you are looking for. It needs some investigation.

@homerl

homerl commented Sep 9, 2022

I think the question is about ZFS's self-healing.
This problem has been bothering me for a long time...

AFAIK, the bad sector will be remapped to a spare block by the HDD itself, and the logical address can then be read/written again. I'm not sure how long that takes, whether the device first has to be reset (it depends on the vendor's firmware), or whether something goes wrong because the device hangs.

That means OpenZFS doesn't know when the logical address becomes readable/writable again. While the bad block is being remapped, the read will return an error, and then it is ZFS self-healing's turn.

In my opinion, if the original logical block can be rewritten (no timeout), the bad data will be self-healed by ZFS and the zpool status read/write error counts will not increase. If rewriting the original logical block fails (zpool status increases the read/write error count), self-healing would rewrite the data to a new area of the zpool (vdev_raidz_io_done; maybe I'm describing a flawed idea).
If the zpool status read/write error count increases, should I scrub the zpool? Right now I think the bad block has already been rewritten to a new area of the zpool (maybe I'm wrong, please let me know), so there is no need to scrub.

"Replace the HDD, but that's expensive." I couldn't agree more.
In a large-scale production environment, if you replace SCSI devices based on zpool status read/write errors, the SCSI device fault rate ends up about 2~4X (or more) that of hardware RAID.

OpenZFS can self-heal, and the SCSI device can too.
Some vendors' devices may not support self-healing, or may not have enough spare blocks left in the device.
We need a replacement guide for OpenZFS:
Does the device need to be replaced?
In what situations should I scrub?

@devZer0

devZer0 commented Mar 2, 2023

I'd second that.

I don't see why ZFS throws a disk out of service when there are only a few defective sectors.

On Synology, for example, you won't get anywhere near the data integrity you have with ZFS, yet Synology (quite a popular NAS platform) considers a disk that is a member of an ordinary md RAID "healthy" even when there are TONS of read errors.

Have a look below: would YOU consider this Synology box to be healthy?

I would leave it up to the admin to decide when it's time to remove a disk from a ZFS RAID, but I also wish the handling of defective blocks were more sophisticated.

At the moment, a few medium errors put a disk out of service. That's a bit unfortunate if the disk is 16 TB and the raidz2 contains little data. Why not simply skip the bad blocks and keep using all the good ones for redundancy?

I understand that you may want to handle it that strictly in a serious environment, but on an ordinary home box, backup storage, or a testing env, I don't see why we should replace a disk that has only a few defective sectors.

(screenshots of the Synology storage manager's disk health status)

dmesg:
https://pastebin.com/xFePqdmt

@devZer0

devZer0 commented Mar 2, 2023

On my ZFS box:

root@backup-filer:~# zpool status
  pool: nfsbackuppool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 181G in 00:20:15 with 0 errors on Wed Mar  1 13:06:29 2023
config:

	NAME                                       STATE     READ WRITE CKSUM
	nfsbackuppool                              ONLINE       0     0     0
	  raidz2-0                                 ONLINE       0     0     0
	    scsi-0QEMU_QEMU_HARDDISK_drive-scsi10  ONLINE       0     0     0
	    scsi-0QEMU_QEMU_HARDDISK_drive-scsi11  ONLINE       0     0     0
	    scsi-0QEMU_QEMU_HARDDISK_drive-scsi12  ONLINE       0     0     0
	    scsi-0QEMU_QEMU_HARDDISK_drive-scsi13  ONLINE       1     0     0
	    scsi-0QEMU_QEMU_HARDDISK_drive-scsi14  ONLINE       0     0     0
	    scsi-0QEMU_QEMU_HARDDISK_drive-scsi15  ONLINE       0     0     0

errors: No known data errors

root@backup-filer:~# dmesg -T|grep Do
[Do Mär  2 00:53:12 2023] sd 6:0:0:13: [sde] tag#198 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Do Mär  2 00:53:12 2023] sd 6:0:0:13: [sde] tag#198 Sense Key : Medium Error [current]
[Do Mär  2 00:53:12 2023] sd 6:0:0:13: [sde] tag#198 Add. Sense: Unrecovered read error
[Do Mär  2 00:53:12 2023] sd 6:0:0:13: [sde] tag#198 CDB: Read(16) 88 00 00 00 00 02 7f 96 6d e8 00 00 00 40 00 00
[Do Mär  2 00:53:12 2023] blk_update_request: critical medium error, dev sde, sector 10730499560 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 0
[Do Mär  2 00:53:12 2023] zio pool=nfsbackuppool vdev=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi13-part1 error=61 type=1 offset=5494014726144 size=32768 flags=180880

@devZer0

devZer0 commented Mar 2, 2023

OK, this time it's only a single error message, and after clearing, no resilver started.

Last time there were a few more errors, the disk was thrown out, and there was a whole resilver after clearing the error.

I think it would be useful if this behaviour were tunable/adjustable, and if it were more transparent how ZFS decides whether to mark a disk defective.

Where can I read about that, i.e. how does the ZFS defect detection logic work?

@amotin
Member

amotin commented Mar 2, 2023

OK, this time it's only a single error message, and after clearing, no resilver started.

For read errors, ZFS automatically issues recovery writes if it is eventually able to get valid data from some redundancy, so a full scrub/resilver is not needed. For write errors, though, ZFS can't do much, since the disk is expected to do everything it can by itself to recover, and if it fails to do so it has to be replaced.
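
That repaired-read case is what the zpool status output above already shows: a nonzero READ count on the leaf vdev but "No known data errors". Once you have decided the drive is fine, the counters can be reset; the pool and device names below are simply taken from that output:

sudo zpool status -x nfsbackuppool    # -x only reports pools that still have problems
sudo zpool clear nfsbackuppool scsi-0QEMU_QEMU_HARDDISK_drive-scsi13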

Last time there were a few more errors and the disk was thrown out.

I think it would be useful if this behaviour were tunable/adjustable, and if it were more transparent how ZFS decides whether to mark a disk defective.

Where can I read about that, i.e. how does the ZFS defect detection logic work?

It is OS-specific. It is controlled by zfsd on FreeBSD, zed on Linux, some other daemon on Solaris, etc.
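
Both daemons act on the kernel's ZFS event stream, so regardless of OS a good starting point is the event log itself plus the respective daemon's man page:

sudo zpool events -v    # recent ereport.fs.zfs.io / ereport.fs.zfs.checksum events
man zfsd                # FreeBSD: zfsd(8) describes its fault/degrade behaviour
man zed                 # Linux: zed(8) and its ZEDLETs; see also zpool-events(8)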
