2017-01-30
How you can abruptly lose your filesystem on a software RAID mirror
We almost certainly just completely lost a software RAID mirror with no advance warning (we'll know for sure when we get a chance tomorrow to power-cycle the machine in the hopes that this revives a drive). This comes as very much of a surprise to us, as we thought that this was not supposed to be possible short of simultaneous two drive failure out of the blue, which should be an extremely rare event. So here is what happened, as best we can reconstruct right now.
In December, both sides of the software RAID mirror were operating normally (at least as far as we know; unfortunately the filesystem we've lost here is /var). Starting around January 4th, one of the two disks began sporadically returning read errors to the software RAID code, which caused the software RAID to redirect reads to the other side of the mirror but not otherwise complain to us about the read errors beyond logging some kernel messages. Since nothing showed up about these read errors in /proc/mdstat, mdadm's monitoring never sent us email about it.
(It's possible that SMART errors were also reported on the drive, but we don't know; smartd monitoring turns out not to be installed by default on CentOS 7 and we never noticed that it was missing until it was too late.)
In the morning of January 27th, the other disk failed outright in a way that caused Linux to mark it as dead. The kernel software RAID code noticed this, of course, and duly marked it as failed. This transferred all IO load to the first disk, the one that had been seeing periodic errors since January 4th. It immediately fell over too; although the kernel has not marked it as explicitly dead, it now fails all IO. Our mirrored filesystem is dead unless we can somehow get one or the other of the drives to talk to us.
The fatal failure here is that nothing told us about the software RAID code having to redirect reads from one side of the mirror to the other due to IO errors. Sure, this information shows up in kernel messages, but so does a ton of other unstructured crap; the kernel message log is the unstructured dumping ground for all sorts of things and as a result, almost nothing attempts to parse it for information (at least not in a standard, regular installation).
Well, let me amend that. It appears that this information is actually available through sysfs, but nothing actually monitors it (in particular, mdadm doesn't). There is an errors file in /sys/block/mdNN/md/dev-sdXX/ that contains a persistent counter of corrected read errors (this information is apparently stored in the device's software RAID superblock), so things like mdadm's monitoring could track it and tell you when there were problems. It just doesn't.
(So if you have software RAID arrays, I suggest that you put together something that monitors all of your errors files for increases and alerts you prominently.)
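Such a monitor doesn't have to be fancy. Here's a minimal sketch of one (the function name, state directory, and alert wording are all my invention; it simply compares each errors counter against a saved copy from the previous run):

```shell
#!/bin/sh
# Sketch: report increases in the per-device corrected-error counters
# that Linux software RAID exposes in sysfs. The sysfs root and state
# directory are parameters so this can be tested against a fake tree.
check_md_errors() {
    sysblock="$1" statedir="$2"
    mkdir -p "$statedir"
    for f in "$sysblock"/md*/md/dev-*/errors; do
        [ -e "$f" ] || continue
        # Build a flat state-file name like md0_dev-sda1 from the path.
        dev=$(echo "$f" | awk -F/ '{print $(NF-3) "_" $(NF-1)}')
        new=$(cat "$f")
        old=0
        [ -f "$statedir/$dev" ] && old=$(cat "$statedir/$dev")
        if [ "$new" -gt "$old" ]; then
            # Replace this echo with mail, syslog, or your alerting system.
            echo "ALERT: $dev corrected read errors went from $old to $new"
        fi
        echo "$new" > "$statedir/$dev"
    done
}

# In real use this would run from cron, something like:
# check_md_errors /sys/block /var/lib/md-errors | mail -s "md errors" root
```

The counter is persistent (it lives in the RAID superblock), so tracking increases rather than raw values is what tells you about new problems.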
2017-01-23
Linux desktops and pre-packaged machines from big vendors
Today, I tweeted:
I wonder if Dell or any of the big vendors makes reasonably-priced (mid-)tower systems with 4 3.5" drive bays and 32 GB of RAM.
There is a story behind this tweet, and it's in part a story of how far Linux has come over the time I've been using it.
For a long time, the only way to get a decent Linux (desktop) machine was to specify it from parts yourself. If you were so foolish as to buy a pre-packaged desktop from a major desktop vendor like Dell, HP, or IBM, you might well wind up with hardware that Linux didn't support or only supported badly. So this is what we routinely did at work and what I did for my home machine. My previous generation work machine here back in 2006 was a specified-from-scratch machine, and so is my current office workstation. But, well, Linux has come a long way since 2006, or even 2011. These days it's pretty mainstream and widely supported, at least on most of the kind of plain vanilla hardware that you find on ordinary desktops. Certainly my co-workers have gotten Dell desktops as their new desktop machines at least once and had no problems running modern Linuxes on them.
Which gets me around to the subject of my current office workstation, which has the same hardware as my current home machine and is roughly as old as that machine, which means that it is a bit over five years old. It still runs perfectly fine and performs well enough for the work that I do, but I need to face reality; just as hard disks wear out eventually, so do things like CPU and case fans, power supplies, and eventually motherboards and CPUs and so on. My office machine is going to die from hardware failure at some point; the only question is when. I would like to replace it with fresh hardware before then, and five years of life seems like a good time to at least start thinking about it (and to get the gears turning, because around here they move slowly for this kind of thing).
When I started thinking about this, my first instinct was to once again specify a machine from scratch (even though I'm not sure who we'd get to build it for us any more). But, well, do I still have to do that these days? In fact, does it even make sense to do that? I don't necessarily have demanding needs, Linux is likely to run on everyone's pre-packaged desktops, and it really should be the case that Dell et al can build machines cheaper than we can, since Dell is doing it in bulk (although with less volume than in the past, given the general decline of the PC market). And buying a Dell is probably much easier to get through the university purchasing process than a custom-built machine from some small place.
(Or we could buy parts and I could have the fun adventure of assembling my new work machine myself. I'm sure it would be educational and people assure me it's not too hard, but it's probably at least an inefficient use of my work time. Not that universities necessarily care about that.)
Having looked at this a bit, I suspect that my needs are sufficiently esoteric that they push me into the area where Dell and company start selling us excessively expensive 'workstation' machines. This may well make 'specify from parts' the least expensive option, but this time around, unlike in the past, it seems worthwhile to at least check. And I can imagine being perfectly happy with a Dell or the like assuming that it has the basic features I need.
It makes me quietly happy for Linux that what once was an esoteric option that required careful hardware curation has moved to being something where I can generally assume it just works, on both servers and even desktops.
(I'm sure there's some hardware that doesn't quite work great, especially if you're right at the edge of newly released stuff, and of course graphics cards are their own sad story of closed source drivers. But my impression is that running into such hardware is now either uncommon or outright rare.)
Sidebar: What I need in an office workstation
My needs almost fit in a tweet: four or more 3.5" drive bays (five would be great), 32 GB of RAM, a processor at least as modern as the i5-2500 I currently have, an onboard Ethernet port and onboard sound (almost sure to be on anything), and either onboard graphics that can drive two displays at 1920x1200 at 60 Hz or a slot for a graphics card so I can drop my current card in (I'd prefer to use onboard graphics). It would be ideal if there was either a second Ethernet port or a PCI(E) card slot I could put an additional network card into.
(And of course a bunch of USB ports, including at least a few USB 3.0 ports. But everything has those.)
I wouldn't mind an optical drive, but I'm not going to turn down a vendor pre-packaged desktop if it lacks one. I simply don't burn or read discs at work very much these days, and we're looking to move even further away from them if possible. USB memory sticks are (or would be) just so much more convenient for installing machines and so on.
(This isn't what I'd like in a theoretical new machine, but work is unlikely to buy me things like the latest top-end i7 CPU even if I try to make rational arguments about how it has an expected lifetime of at least five years so it totally makes sense.)
2017-01-22
An Ubuntu default Bash setup that irritates me, especially for root
Bash itself has a number of option settings to limit what it puts into the (interactive) command history list, such as $HISTCONTROL and $HISTIGNORE. A stock, no-startup-file Bash shell does not set any of them and so saves everything into the history list. However, many systems give you a default .bashrc file that sets some options here. In particular, Ubuntu has a very irritating default: it enables the ignorespace option in both /etc/skel/.bashrc and /root/.bashrc.
What ignorespace does, for people who have never encountered this, is that if your bash command line starts with a space character (and perhaps any whitespace), it will not be saved into the history list. No access through cursor up, reverse search, anything; it's as if you didn't run the command. Now, I'm sure that there are people and situations where this makes sense, but I believe strongly that it's a bad default and in particular it's very annoying to have as a default for root.
You might ask why this is the case. Well, suppose you have a recipe with steps that look like this:
- install Exim4 and related packages:

      apt-get install gnutls-bin swaks unrar
      apt-get install exim4-daemon-heavy exim4-doc-info exim4-doc-html exim4
When you're going through this recipe, the natural way to execute things is to cut and paste the entire line from the recipe into a (root) window on the machine you're installing or doing something to. This means that the lines you're pasting in will start with whitespace and will not be saved in history. Did you get interrupted and want to quickly cursor-up to see what the last command you ran was? You can't. Do you want to look in .bash_history later to see which version of the install instructions you used, as reflected in the commands? You can't.
(This is the format our build instructions are in, so this particularly gets my goat.)
There are other situations where you can be cutting and pasting things into sessions and include even a single space at the start, which will have the same effects. For that matter, you can be editing previous commands in a way that leaves a space at the start and again, the same thing happens. In my opinion, it ought to be a lot harder than this to exclude things from history.
(Having written this entry, I should go change our standard install stuff so it sets up a /root/.bashrc that has this removed. I don't think any of us will miss it; rather the reverse, probably.)
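The change itself is a one-liner at the end of /root/.bashrc; a sketch (keeping ignoredups is my assumption, on the theory that duplicate suppression is harmless while ignorespace is the problem):

```shell
# Don't let commands that start with whitespace escape the history;
# keep only duplicate suppression. Ubuntu's stock .bashrc files set
# something that includes ignorespace (eg 'ignoreboth'), so this
# override needs to come after that.
HISTCONTROL=ignoredups
```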
PS: I don't have a Debian system handy to check, but it's possible that this is a default that Ubuntu inherited from Debian instead of something that they decided on their own.
(Some searching turned up this bug-bash thread on why HISTCONTROL=ignorespace exists as an option (via). Debian or Ubuntu may have decided that this is an important enough use case to make it a default. If so, I disagree.)
2017-01-19
Thinking about how to add some SSDs on my home machine
It all started when I upgraded from Fedora 24 to Fedora 25 on my office workstation and then my home machine in close succession, and the work upgrade went much faster because my root filesystem was on SSDs. This finally pushed me over the edge to get a pair of SSDs for my home machine, as I've known I should do for a while. I now actually have the SSDs, but, well, I haven't put them into my home machine yet. You might wonder why, so let me put it this way: the next case I get will have at least six drive bays.
My current case has four drive bays (well, four conveniently usable 3.5" drive bays), and all four drive bays are used; two for the mirrored pair of system HDs, and two for the mirrored pair of data HDs. The SSDs will be replacing the system HDs (and pulling in things like my home directory filesystem from the data HDs), but I can't exactly unplug the HDs and put in the SSDs; I need to shift over, and to do that I need to temporarily have the SSDs in the system too. So I've been mulling over how best to do that, and in the mean time my SSDs have just been sitting there.
(If I had six drive bays it would be easy and I would have shoved in the SSDs almost immediately. And the delay is not just because I've been thinking; it's also because shuffling everything around is going to be kind of a hassle however I do it, and so I keep putting it off in favor of more interesting and pleasant things.)
I have a 3.5" to 2.5" dual SSD adaptor for the SSDs (I'm also using one at work), so a single open 3.5" drive bay will allow me to put both into the machine. A number of potential approaches have occurred to me:
- My case has some 5.25" drive bays, which I'm not using. Maybe I could
just temporarily rest the dual adaptor on the bottom of that area, run
cables to it, and have that work. (The deluxe version would be to put
the 3.5" to 2.5" adaptor in a 5.25" to 3.5" adaptor, but I don't have
one of the latter sitting around and that feels like a lot of work.)
- I could just temporarily run with the side of the case open and cables
running to the SSDs. Don't laugh, one co-worker has been running with
his machine opened up like this for years. It'd be awkward for me,
though, because of where everything is physically (my co-worker has his
open machine on his desk).
- I could deliberately break the mirror of my system disks, remove one, and put the two SSDs in the drive slot freed up by that. It's not very likely that the remaining system disk will fail while I'm shifting over, and if it does I have the other system disk to swap back in.
Breaking the system disk mirror and removing one of the disks strikes me as the least crazy plan. However, it means I get to find out if my Fedora system is set up so that it will actually boot when one of the system disks goes away, or if it will throw up its hands because the shape of the RAID array is not exactly what it wants (this has been known to happen under some circumstances, although that wasn't a disk going missing). Certainly I'd hope that my Fedora 25 system will boot without problems there, but between general issues and systemd I don't have complete confidence here, and I can imagine scenarios that end up with me having to boot a rescue environment and try to glue my system back together again by hand.
(My system disk mirror doesn't just have the root filesystem; it also has /boot and swap, each as mirrored things. So systemd needs to be willing to bring up several RAID arrays in degraded mode in order to be able to get everything in /etc/fstab up.)
I expect that the easiest way to test this is to open the case up, shut the system down, pull the power connector for one of my system disks, and then try to boot the system. If it fails, I can shut everything down, plug the power connector back in, and hopefully everything will be back to being happy with the world. It would probably be more proper to take the disk offline in mdadm, but that may be less easily reversed if things then explode.
(My plan for the SSDs is about a 100 GB ext4 root filesystem (which will also get /boot), a bit of swap space, and then the rest of the space in a ZFS pool. The pool will get my home directory and various other things that fit where I care either about speed or about having ZFS's checksums for the data.)
2017-01-17
Making my machine stay responsive when writing to USB drives
Yesterday I talked about how writing things to USB drives made my machine not very responsive, and in a comment Nolan pointed me to LWN's The pernicious USB-stick stall problem. According to LWN's article, the core problem is an excess accumulation of dirty write buffers, and they give some VM system sysctls that you can use to control this.
I was dubious that this was my problem, for two reasons. First, I have a 16 GB machine and I barely use all that memory, so I thought that allowing a process to grab a bit over 3 GB of it for dirty buffers wouldn't make much of a difference. Second, I had actually been running sync frequently (in a shell loop) during the entire process, because I have sometimes had it make a difference in these situations; I figured frequent syncs should limit the amount of dirty buffers accumulating in general. But I figured it couldn't hurt to try, so I used the dirty_background_bytes and dirty_bytes settings to limit this to 256 MB and 512 MB respectively and tested things again.
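In sysctl.d form, those settings would look something like this (the file name is arbitrary, and the byte values are just what I picked for my machine, not general recommendations):

```
# /etc/sysctl.d/99-dirty-limits.conf
# Start background writeback at 256 MB of dirty data and block
# writers outright at 512 MB, instead of the ratio-based defaults.
# Setting the *_bytes forms automatically disables the corresponding
# *_ratio settings.
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 536870912
```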
It turns out that I was wrong. With these sysctls turned down, my machine stayed quite responsive for once, despite me doing various things to the USB flash drive (including things that had had a terrible effect just yesterday). I don't entirely understand why, though, which makes me feel as if I'm doing fragile magic instead of system tuning. I also don't know if setting these down is going to have a performance impact on other things that I do with my machine; intuitively I'd generally expect not, but clearly my intuition is suspect here.
(Per this Bob Plankers article, you can monitor the live state of your system with egrep 'dirty|writeback' /proc/vmstat. This will tell you the number of currently dirty pages and the thresholds (in pages, not bytes). I believe that nr_writeback is the number of pages actively being flushed out at the moment, so you can also monitor that.)
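A sketch of keeping an eye on this while a big copy is running (show_dirty is just my wrapper name; the two-second interval is arbitrary):

```shell
# Show the current dirty/writeback page counts and the writeback
# thresholds. All values here are in pages (usually 4 KB each).
show_dirty() {
    grep -E 'dirty|writeback' /proc/vmstat
}

# To watch it live:  while sleep 2; do show_dirty; echo ---; done
show_dirty
```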
PS: In a system with drives (and filesystems) of vastly different speeds, a global dirty limit or ratio is a crude tool. But it's the best we seem to have on Linux today, as far as I know.
(In theory, modern cgroups support the ability to have per-cgroup dirty_bytes settings, which would let you add extra limits to processes that you knew were going to do IO to slow devices. In practice this is only supported on a few filesystems and isn't exposed (as far as I know) through systemd's cgroups mechanisms.)
2017-01-16
Linux is terrible at handling IO to USB drives on my machine
Normally I don't do much with USB disks on my machine, either flash drives or regular hard drives. When I do, it's mostly to do bulk read or write things such as blanking a disk or writing an installer image to a flash drive, and I've learned the hard way to force direct IO through dd when I'm doing this kind of thing.
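A sketch of what forcing direct IO looks like (burn_image is my name for it, and the image and device names in the usage comment are placeholders):

```shell
# Copy an image to a target device with direct IO, bypassing the page
# cache so dirty buffers for the slow device don't pile up. conv=fsync
# makes dd flush and wait at the end instead of exiting early.
burn_image() {
    img="$1" dev="$2"
    dd if="$img" of="$dev" bs=4M oflag=direct conv=fsync
}

# Usage (triple-check the device name before running anything like this):
# burn_image installer.img /dev/sdX
```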
Today, for reasons beyond the scope of this entry, I was copying a
directory of files to a USB flash drive, using USB 3.0 for once.
This simple operation absolutely murdered the responsiveness of my machine. Even things as simple as moving windows around could stutter (and fvwm doesn't exactly do elaborate things for that), never mind doing anything like navigating somewhere in a browser or scrolling the window of my Twitter client. It wasn't CPU load, because ssh sessions to remote machines were perfectly responsive; instead it seemed that anything that might vaguely come near doing filesystem IO was extensively delayed.
(As usual, ionice was ineffective. I'm not really surprised, since the last time I looked it didn't do anything for software RAID arrays.)
While hitting my local filesystems with a heavy IO load will slow other things down, it doesn't do it to this extent, and I wasn't doing anything particularly IO-heavy in the first place (especially since the USB flash drive was not going particularly fast). I also tried out copying a few (big) files by hand with dd so I could force oflag=direct, and that was significantly better, so I'm pretty confident that it was the USB IO specifically that was the problem.
I don't know what the Linux kernel is doing here to gum up its works so much, and I don't know if it's general or specific to my hardware, but it's been like this for years and I wish it would get better. Right now I'm not feeling very optimistic about the prospects of a USB 3.0 external drive helping solve things like my home backup headaches.
(I took a look with vmstat to see if I could spot something like a high amount of CPU time in interrupt handlers, but as far as I could see the kernel was just sitting around waiting for IO all the time.)
PS: We have more modern Linux machines with USB 3.0 ports at work, so I suppose I should do some tests with one just to see. If this Linux failure is specific to my hardware, it adds some more momentum for a hardware upgrade (cf).
(This elaborates on some tweets of mine.)
2017-01-10
Picking FreeType CJK fonts for xterm on a modern Linux system
Once I worked out how to make xterm show Chinese, Japanese, and Korean characters, I had to figure out what font to use. I discussed the general details of using FontConfig to hunt for CJK fonts in that entry, so now let's get down to details.
The Arch Linux xterm example uses 'WenQuanYi Bitmap Song' as its example CJK font. This is from the Wen Quan Yi font collection, and they're available for Fedora in a collection of wqy-*-fonts packages. So I started out with 'WenQuanYi Zen Hei Mono' as the closest thing that I already had installed on my system.
(Descriptions of Chinese fonts often talk about them being an 'X style' font. It turns out that Chinese has different styles of typography, analogous to how Latin fonts have serif and sans-serif styles; see here or here or here for three somewhat random links that talk about eg Heiti vs Mingti. Japanese apparently has a similar but simpler split, per here, with the major divisions being called 'gothic' and 'Mincho'. Learning this has suddenly made some Japanese font names make a lot more sense.)
Fedora itself has a Localization fonts requirements wiki page. The important and useful bit of this page is a matrix of languages and the default and additional fonts Fedora apparently prefers for each. Note that Chinese, Japanese, and Korean each pick different fonts here; there isn't one CJK font that's the first or even second preference for all of them. Since you have to pick only one font for xterm's CJK font, you may want to think about which language you care most about.
(This is probably where Han unification sticks its head up, too. Fedora talks about maybe influencing font rendering choices here on its Identifying fonts page.)
In Ubuntu, apparently some CJK default fonts have changed to Google's Noto CJK family. A discussion in that bug suggests that Fedora may also have changed its defaults to the Noto CJK fonts, contrary to what its wiki sort of implies. The Arch Wiki has its usual comprehensive list of CJK font options, and there's also Wikipedia's general list. Neither particularly mentions monospaced fonts, though, assuming that this is even something that one has to consider in CJK fonts for xterm.
All of this led me to peer into the depths of /etc/fonts/conf.d on my Fedora machines to look for mentions of monospace. Here I found interesting configuration file snippets that said things like:
  <match>
    <test name="lang">
      <string>ja</string>
    </test>
    <test name="family">
      <string>monospace</string>
    </test>
    <edit name="family" mode="prepend">
      <string>Noto Sans Mono CJK JP</string>
    </edit>
  </match>

  <alias>
    <family>Noto Sans Mono CJK JP</family>
    <default>
      <family>monospace</family>
    </default>
  </alias>
I'm not really up on FontConfig magic, but this sure looked like it was setting up a 'Noto Sans Mono CJK JP' font as a monospace font if you wanted things in Japanese. There's also KR, SC (Simplified Chinese), and TC (Traditional Chinese) variants of Noto Sans Mono CJK lurking in the depths of my Fedora system.
After looking at an xterm using WenQuanYi Zen Hei Mono side by side with one using Noto Sans Mono CJK JP, I decided that the Noto version was probably better looking (on my very limited sample of CJK text, mostly in file names and font names). I also felt slightly more confident in picking it, since it seemed more likely to be close to how eg gnome-terminal was operating and also the general trend of CJK font choices in various Linuxes. I wish I could find out what CJK font(s) gnome-terminal was using, but the design of current versions makes that difficult.
(Some experimentation suggests that in my setup, gnome-terminal may be using VL Gothic here. I guess I can live with all of this, however it comes out; mostly I just want CJK characters to show up as something other than boxes or especially spaces.)
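(For reference, once you've picked a CJK font, pointing xterm at it is a matter of the faceNameDoublesize resource. A sketch of the X resources involved; the regular faceName shown here is just an example, not a recommendation:

```
! Use TrueType fonts, with Noto Sans Mono CJK JP for the double-width
! (CJK) characters.
XTerm*faceName: DejaVu Sans Mono
XTerm*faceNameDoublesize: Noto Sans Mono CJK JP
```

This is the same font that the fontconfig snippets above prepend to the generic monospace family for Japanese.)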