Wandering Thoughts archives

2016-02-08

Old Unix filesystems and byte order

It all started with a tweet by @JeffSipek:

illumos/solaris UFS don't use a fixed byte order. SPARC produces structs in BE, x86 writes them out in LE. I was happier before I knew this.

As they say, welcome to old time Unix filesystems. Solaris UFS is far from the only filesystem defined this way; in fact, most old time Unix filesystems are probably defined in host byte order.

Today this strikes us as crazy, but that's because we now exist in a quite different hardware environment from the one the old days had. Put simply, we now exist in a world where storage devices both can be moved between dissimilar systems and routinely are. In fact, it's an even more radical world than that; it's a world where almost everyone uses the same few storage interconnect technologies and those interconnects are common across all sorts of systems. Today we take it for granted that storage is connected to systems through some defined, vendor-neutral specification that many people implement, but this was not at all the case originally.

(There are all sorts of storage standards: SATA, SAS, NVMe, USB, SD cards, and so on.)

In the beginning, storage was close to 100% system specific. Not only did you not think of moving a disk from a Vax to a Sun, you probably couldn't; the entire peripheral interconnect system was almost always different, from the disk-to-host cabling to the kind of backplane that the controller boards plugged into. Even as some common disk interfaces emerged, larger servers often stayed with faster proprietary interfaces and proprietary disks.

(SCSI is fairly old as a standard, but it was also a slow interface for a long time so it didn't get used on many servers. As late as the early 1990s it still wasn't clear that SCSI was the right choice.)

In this environment of system-specific disks, it was no wonder that Unix kernel programmers didn't think about byte order issues in their on-disk data structures. Just saying 'everything is in host byte order' was clearly the simplest approach, so that's what people by and large did. When vendors started facing potential bi-endian issues, they tried very hard to duck them (I think that this was one reason endian-switchable RISCs were popular designs).

In theory, vendors could have decided to define their filesystems as being in their current endianness before they introduced another architecture with a different endianness (here Sun, with SPARC, would have defined UFS as BE). In practice I suspect that no vendor wanted to go through filesystem code to make it genuinely fixed endian. It was just simpler to say 'UFS is in host byte order and you can't swap disks between SPARC Solaris and x86 Solaris'.
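
To make this concrete, here's a minimal sketch of what 'host byte order' means for bytes on disk. The 32-bit value 4096 is just an example I picked, the commands are run on a little-endian x86 Linux machine, and the second one needs a GNU od new enough to have --endian. An x86 host that writes 4096 in host byte order puts the bytes 00 10 00 00 on disk; the same four bytes read back through big-endian eyes, as a SPARC host would read them, are a very different number:

    $ printf '\000\020\000\000' | od -An -td4
           4096
    $ printf '\000\020\000\000' | od -An --endian=big -td4
        1048576

Every multi-byte field in a superblock or inode gets scrambled this way, which is why you can't just swap the disk over between hosts of different endianness.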

(Since vendors did learn, genuinely new filesystems were much more likely to be specified as having a fixed and host-independent byte order. But filesystems like UFS trace their roots back a very long way.)

unix/OldFilesystemByteOrder written at 23:04:43

Clearing SMART disk complaints, with safety provided by ZFS

Recently, my office machine's smartd began complaining about problems on one of my drives (again):

Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Device: /dev/sdc [SAT], 5 Offline uncorrectable sectors

As it happens, I was eventually able to make all of these complaints go away (I won't say I fixed the problem, because the disk is undoubtedly still slowly failing). This took a number of steps and some of them were significantly helped by ZFS on Linux.

(For background, this disk is one half of a mirrored pair. Most of it is in a ZFS pool; the rest is in various software RAID mirrors.)

My steps:

  1. Scrub my ZFS pool, in the hopes that this would make the problem go away the way the first iteration of smartd complaints did. Unfortunately I wasn't so lucky this time around, but the scrub did verify that all of my data was intact.

  2. Use dd to read all of the partitions of the disk (one after another) in order to try to find where the bad spots were. This wound up making four of the five problem sectors just quietly go away and did turn up a hard read error in one partition. Fortunately or unfortunately it was my ZFS partition.

    The resulting kernel complaints looked like:

    blk_update_request: I/O error, dev sdc, sector 1362171035
    Buffer I/O error on dev sdc, logical block 170271379, async page read
    

    The reason that a ZFS scrub did not turn up a problem was that ZFS scrubs only check allocated space. Presumably the read error is in unallocated space.

  3. Use the kernel error messages and carefully iterated experiments with dd's skip= argument to make sure I had the right block offset into /dev/sdc, ie the block offset that would make dd immediately read that sector.

  4. Then I tried to write zeroes over just that sector with 'dd if=/dev/zero of=/dev/sdc seek=... count=1'. Unfortunately this ran into a problem; for some reason the kernel felt that this was a 4k sector drive, or at least that it had to do 4k IO to /dev/sdc. This caused it to attempt to do a read-modify-write cycle, which immediately failed when it tried to read the 4k block that contained the bad sector.

    (The goal here was to force the disk to reallocate the bad sector into one of its spare sectors. If this reallocation failed, I'd have replaced the disk right away.)

  5. This meant that I needed to do 4K writes, not 512-byte writes, which meant that I needed the right offset for dd in 4K units. This was handily the 'logical block' from the kernel error message (see the arithmetic sketch after this list), which I verified by running:

    dd if=/dev/sdc of=/dev/null bs=4k skip=170271379 count=1
    

    This immediately errored out with a read error, which is what I expected.

  6. Now that I had the right 4K offset, I could write 4K of /dev/zero to the right spot. To really verify that I was doing (only) 4K of IO and to the right spot, I ran dd under strace:

    strace dd if=/dev/zero of=/dev/sdc bs=4k seek=170271379 count=1
    

  7. To verify that this dd had taken care of the problem, I redid the dd read. This time it succeeded.

  8. Finally, to verify that writing zeroes over a bit of one side of my ZFS pool had only gone to unallocated space and hadn't damaged anything, I re-scrubbed the ZFS pool.
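
As a cross-check on steps 3 and 5, the two numbers in the kernel error messages are related by simple arithmetic; a 4K logical block covers eight 512-byte sectors, so dividing the failing sector number by eight (and discarding the remainder) gives the 4K block number that the later dd commands use:

    $ echo $((1362171035 / 8))
    170271379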

ZFS was important here because its checksums meant that writing zeroes over bits of one pool disk was 'safe' in a way it wouldn't have been with software RAID; if I hit any in-use data, ZFS would know that the chunk of 0 bytes was incorrect and fix it up from the other side of the mirror. With software RAID I guess I'd have had to carefully copy the data from the other side of the mirror instead of just using /dev/zero.
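
For the record, a rough and entirely hypothetical sketch of that software RAID version (with made-up partition names, /dev/sdc7 as the failing half and /dev/sdd7 as the good half, and assuming both halves of the md mirror put the mirrored data at the same offset within their partitions, so the dd offsets here are within the partitions rather than the whole disk) would be something like:

    # read the 4K block from the good mirror half ...
    dd if=/dev/sdd7 of=/tmp/block.sav bs=4k skip=... count=1
    # ... and write it back over the bad spot on the failing half
    dd if=/tmp/block.sav of=/dev/sdc7 bs=4k seek=... count=1

The important difference from what I actually did is that you have to get the offsets exactly right twice, with no checksums to catch a mistake for you.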

By the way, I don't necessarily recommend this long series of somewhat hackish steps. In an environment with plentiful spare drives, the right answer is probably 'replace the questionable disk entirely'. It happens that we don't have lots of spare drives at this moment, plus I don't have enough drive bays in my machine to make this at all convenient right now.

(Also, in theory I didn't need to clear the SMART warnings at all. In practice the Fedora 23 smartd whines incessantly about this to syslog at a very high priority, which causes one of my windows to get notifications every half hour or so, and I just couldn't stand it any more. It was either shut up smartd somehow or replace the disk. Believe it or not, all these steps seemed to be the easiest way to shut up smartd. It worked, too.)

linux/ClearingSMARTComplaints written at 00:51:13

