The better way to clear SMART disk complaints, with safety provided by ZFS

May 4, 2016

A couple of months ago I wrote about clearing SMART complaints about one of my disks by very carefully overwriting sectors on it, and how ZFS made this kind of safe. In a comment, Christian Neukirchen recommended using hdparm --write-sector to overwrite sectors with read errors instead of the complicated dance with dd that I used in my entry. As it happens, that disk coughed up a hairball of smartd complaints today, so I got a chance to go through my procedures again and the advice is spot on. Using hdparm makes things much simpler.

So my revised steps are:

  1. Scrub my ZFS pool in the hopes that this will make the problem go away. It didn't, which means that any read errors in the partition for the ZFS pool are in space that ZFS isn't currently using (a scrub only reads allocated blocks).

  2. Use dd to read all of the ZFS partition. I did this with 'dd if=/dev/sdc7 of=/dev/null bs=512k conv=noerror iflag=direct'. This hit several bad spots, each of which produced kernel errors that included a line like this:
    blk_update_request: I/O error, dev sdc, sector 1748083315

  3. Use hdparm --read-sector to verify that this is indeed the bad sector:
    hdparm --read-sector 1748083315 /dev/sdc

    If this is the correct sector, hdparm will report a read error and the kernel will log a failed SATA command. Note that this is not a normal disk read, as hdparm is issuing a low-level read, so you don't get the normal kernel message; instead you get something like this:

    ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    ata3.00: irq_stat 0x40000001
    ata3.00: failed command: READ SECTOR(S) EXT
    ata3.00: cmd 24/00:01:73:a2:31/00:00:68:00:00/e0 tag 3 pio 512 in
             res 51/40:00:73:a2:31/00:00:68:00:00/00 Emask 0x9 (media error)
    [...]
    

    The important thing to notice here is that you don't get the sector reported (at least not in decoded form), so you have to rely on getting the sector number correct in the hdparm command instead of being able to cross check it against earlier kernel logs.

    (Sector 1748083315 is 0x6831a273 in hex. All the bytes are there in the cmd part of the message, but clearly shuffled around.)

  4. Use hdparm --write-sector to overwrite the sector, forcing it to be spared out:
    hdparm --write-sector 1748083315 <magic option> /dev/sdc

    (If you use --write-sector without the hidden magic option, hdparm will tell you what it is.)

  5. Scrub my ZFS pool again and then re-run the dd to make sure that I got all of the problems.
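
The hex aside in step 3 can be checked mechanically. What follows is a minimal sketch, not anything from the kernel itself: it assumes the libata 'cmd' field is laid out as command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, which is consistent with the bytes shown above but is not stated in this entry. The function name is made up for illustration.

```python
import re

# Assumed layout of the libata "cmd" field (an assumption, see the lead-in):
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
FIVE_BYTES = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
CMD_RE = re.compile(r"cmd\s+[0-9a-f]{2}/" + FIVE_BYTES + "/" + FIVE_BYTES + r"/[0-9a-f]{2}")

def lba_from_cmd_line(line):
    """Return the 48-bit LBA encoded in a libata 'cmd' log line."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no libata cmd field found")
    low = [int(b, 16) for b in m.group(1).split(":")]   # feature, nsect, lbal, lbam, lbah
    high = [int(b, 16) for b in m.group(2).split(":")]  # hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah
    # Reassemble the LBA, most significant byte first.
    lba_bytes = [high[4], high[3], high[2], low[4], low[3], low[2]]
    return int.from_bytes(bytes(lba_bytes), "big")

line = "ata3.00: cmd 24/00:01:73:a2:31/00:00:68:00:00/e0 tag 3 pio 512 in"
print(lba_from_cmd_line(line))  # 1748083315, i.e. 0x6831a273
```

If that layout holds, this gives back a way to cross check the hdparm sector number against the failed SATA command after all.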

I was pretty sure I'd gotten everything even before the re-scrub and the re-dd scan, because smartd reported that there were no more currently unreadable (pending) sectors or offline uncorrectable sectors, both of which it had been complaining about before.
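
The same two counters can be read on demand instead of waiting for smartd. This is a rough sketch that assumes smartmontools' 'smartctl -A' attribute table, where attribute 197 is the current pending sector count and 198 is the offline uncorrectable count, with the raw value in the tenth column. The function names and the sample rows are made up for illustration; the sample is not output from the disk in this entry.

```python
import subprocess

# SMART attribute IDs; the names smartctl prints vary a little between vendors.
WATCHED = {197: "current pending", 198: "offline uncorrectable"}

def parse_smart_counts(attr_text):
    """Pull the raw values of attributes 197 and 198 out of 'smartctl -A' output."""
    counts = {}
    for line in attr_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns,
        # with RAW_VALUE in the tenth.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED and fields[9].isdigit():
                counts[WATCHED[attr_id]] = int(fields[9])
    return counts

def read_smart_counts(device):
    """Run smartctl (needs root) and parse its attribute table."""
    # smartctl's exit status is a bitmask, so non-zero is not treated as fatal here.
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return parse_smart_counts(result.stdout)

# Made-up sample rows in smartctl's table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
print(parse_smart_counts(SAMPLE))
```

On a real machine this would be called as read_smart_counts("/dev/sdc"); both counts back at zero matches what smartd reported here.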

This was a lot easier and more straightforward to go through than my previous procedure, partly because I can directly reuse the sector numbers from the kernel error messages without problems and partly because hdparm does exactly what I want.
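
Since the kernel's sector numbers can be fed straight to hdparm, collecting them from the logs is easy to script. A small sketch, assuming the kernel messages have been saved as text and follow the 'I/O error, dev sdc, sector N' shape quoted in step 2; the function names are hypothetical. It only builds the read-check commands from step 3 and never runs anything or writes to the disk.

```python
import re

# Matches lines such as the blk_update_request one quoted in step 2.
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def bad_sectors(log_text, device):
    """Return the sorted, de-duplicated sector numbers logged for one device."""
    found = set()
    for m in IO_ERROR_RE.finditer(log_text):
        if m.group(1) == device:
            found.add(int(m.group(2)))
    return sorted(found)

def read_check_commands(log_text, device):
    """Build (but do not run) the hdparm --read-sector check for each sector."""
    return [f"hdparm --read-sector {s} /dev/{device}"
            for s in bad_sectors(log_text, device)]

log = "blk_update_request: I/O error, dev sdc, sector 1748083315\n"
for cmd in read_check_commands(log, "sdc"):
    print(cmd)  # hdparm --read-sector 1748083315 /dev/sdc
```

Leaving the --write-sector step manual is deliberate, since that is the one command where a wrong sector number does damage.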

There's probably a better way to scan the hard drive for read errors than dd. I'm a little bit nervous about my 512 KiB block size here potentially hiding a second bad sector that's sufficiently close to the first, but especially with direct IO I think it's a tradeoff between speed and thoroughness. Possibly I should explore how well the badblocks program works here, since it's the obvious candidate.
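
To put numbers on that worry: a 512 KiB dd block covers 1024 sectors of 512 bytes, so in the worst case a read error could leave up to 1023 neighbouring sectors unexamined. The sketch below assumes that dd skips the rest of a block once a read in it fails, that the disk has 512-byte logical sectors, and that dd's blocks are counted from the start of the partition. The function names and the partition start of 2048 are hypothetical. It works out which block a bad sector falls in and builds (without running) a sector-at-a-time dd over just that block.

```python
SECTOR_BYTES = 512
DD_BLOCK_BYTES = 512 * 1024                          # dd's bs=512k
SECTORS_PER_BLOCK = DD_BLOCK_BYTES // SECTOR_BYTES   # 1024

def dd_block_range(bad_sector, partition_start):
    """Return (first, last) whole-disk sector numbers of the dd block holding bad_sector.

    partition_start is the partition's first sector on the disk; on Linux it
    can be read from /sys/block/<disk>/<partition>/start.
    """
    offset = bad_sector - partition_start
    if offset < 0:
        raise ValueError("sector lies before the partition")
    first = partition_start + (offset // SECTORS_PER_BLOCK) * SECTORS_PER_BLOCK
    return first, first + SECTORS_PER_BLOCK - 1

def rescan_command(bad_sector, partition_start, partition_dev):
    """Build (but do not run) a one-sector-at-a-time dd over that block."""
    first, _ = dd_block_range(bad_sector, partition_start)
    skip = first - partition_start   # skip= counts bs-sized units from the partition start
    return (f"dd if={partition_dev} of=/dev/null bs={SECTOR_BYTES} "
            f"skip={skip} count={SECTORS_PER_BLOCK} conv=noerror iflag=direct")

# 2048 is a hypothetical partition start, chosen only for illustration.
first, last = dd_block_range(1748083315, 2048)
print(first, last)
print(rescan_command(1748083315, 2048, "/dev/sdc7"))
```

Rescanning only the blocks that already failed keeps the fast 512 KiB pass for the bulk of the partition and spends the slow single-sector reads where they matter.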

(These days I force dd to use direct IO when talking to disks because that way dd does much less damage to the machine's overall performance.)

(This is the kind of entry that I write because I just looked up my first entry for how to do it again, so clearly I'm pretty likely to wind up doing this a third time. I could just replace the drive, but at this point I don't have enough drive bay slots in my work machine's case to do this easily. Also, I'm a peculiar combination of stubborn and lazy when it comes to hardware.)

Written on 04 May 2016.

Last modified: Wed May 4 00:19:21 2016