Sometimes, firmware updates can be a good thing to do
There are probably places that routinely apply firmware updates to every piece of hardware they have. Oh, sure, with a delay and in stages (rushing into new firmware is foolish), but it's always in the schedule. We are not such a place. We have a long history of trying to do as few firmware updates as possible, for the usual reason; usually we don't even consider it unless we can identify a specific issue we're having that new firmware (theoretically) fixes. And if we're having hardware problems, 'update the firmware in the hope that it will fix things' is usually last on our list of troubleshooting steps; we tacitly consider it down around the level of 'maybe rebooting will fix things'.
I mentioned the other day that we've inherited a 16-drive machine with a 3ware controller card. As far as we know, this machine worked fine for the previous owners in a hardware (controller) RAID-6 configuration across all the drives, but we've had real problems getting it stable for us in a JBOD configuration (we much prefer to use software RAID; among other things, we already know how to monitor and manage that with Ubuntu tools). We had system lockups, problems installing Ubuntu, and under load such as trying to scan a 14-disk RAID-6 array, the system would periodically report errors such as:
sd 2:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.
(This isn't even for a disk in the RAID-6 array; sd 2:0:0:0 is one of the mirrored system disks.)
Some Internet searches turned up people saying 'upgrade the firmware'. That felt like a stab in the dark to me, especially if the system had been working okay for the previous owners, but I was getting annoyed with the hardware and the latest firmware release notes did talk about some other things we might want (like support for disks over 2 TB). So I figured out how to do a firmware update and applied the 'latest' firmware (which for our controller dates from 2012).
(Unsurprisingly the controller's original firmware was significantly out of date.)
I can't say that the firmware update has definitely fixed our problems with the controller, but the omens are good so far. I've been hammering on the system for more than 12 hours without a single problem report or hiccup, which is far better than it ever managed before, and some things that had been problems before seem to work fine now.
All of this goes to show that sometimes my reflexive caution about firmware updates is misplaced. I don't think I'm ready to apply all available firmware updates before something goes into production, not even long-standing ones, but I'm certainly now more ready to consider them than I was before (in cases where there's no clear reason to do so). Perhaps I should be willing to consider firmware updates as a reasonably early troubleshooting step if I'm dealing with otherwise mysterious failures.
Waiting for a specific wall-clock time in Unix
At least on Unix systems, time is a subtle but big pain for
programmers. The problem is that because the clock can jump forward,
stand still (during leap seconds), or even go backwards, your
expectations about what subtracting and adding times does can wind
up being wrong under uncommon or rare circumstances. For instance,
you can write code that assumes that now() minus a time in the past
is never negative. This assumption recently led to a Cloudflare DNS
outage during a leap second, as covered in Cloudflare's great writeup
of this incident.
The solution to this is a new sort of time. Instead of being based on wall-clock time, it is monotonic; it always ticks forward and ticks at a constant rate. Changes in wall-clock time don't affect the monotonic clock, whether those are leap seconds, large scale corrections to the clock, or simply your NTP daemon running the clock a little bit slow or fast in order to get it to the right time. Monotonic clocks are increasingly supported by Unix systems and more and more programming environments are either supporting them explicitly or quietly supporting them behind the scenes. All of this is good and fine and all that, and it's generally just what you want.
I have an unusual case, though, where I'd actually like the reverse functionality. I have a utility that wants to wait until a specific wall-clock time. If the system's wall-clock time is adjusted, I'd like my waiting to immediately be updated to reflect that and my program woken up if appropriate. Until I started writing this entry, I was going to say that this is impossible, but now I believe that it's possible in POSIX. Well, in theory it's possible in POSIX; in practice it's not portable to at least one major Unix OS, because FreeBSD doesn't currently support the necessary features.
On a system that supports this POSIX feature, you have two options:
just sleeping, or using timers. Sleeping is easier; you use
clock_nanosleep() with the CLOCK_REALTIME clock and the TIMER_ABSTIME
flag. The POSIX standard (and Linux)
specify that if the wall-clock time is changed, you still get woken
up when appropriate. With timers, you use a similar but more
intricate process. You create a CLOCK_REALTIME timer with
timer_create() and then use timer_settime() to set a
TIMER_ABSTIME wait time. When the timer expires, you
get signalled in whatever way you asked for.
In practice, though, this doesn't help me. Not only is this clearly
not supported on every Unix, but as far as I can see Go doesn't
expose any API for
clock_nanosleep or equivalent functionality.
This isn't terribly surprising, since sleeping in Go is already
deeply intertwined with its multi-threaded runtime. Right now my
program just approximates what I want by waking up periodically in
order to check the clock; this is probably the best I can do in
general for a portable program, even outside of Go.
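The portable approximation I'm using can be sketched along these lines (a minimal version; the poll interval is a tradeoff between how fast you notice clock adjustments and how often you wake up):

```python
import time


def wait_until(target_unix_time, poll_interval=1.0):
    """Wait until the wall clock reaches target_unix_time, re-checking
    periodically so that clock adjustments are noticed within one poll."""
    while True:
        remaining = target_unix_time - time.time()
        if remaining <= 0:
            return
        # Never sleep past the next poll, so that a wall-clock jump is
        # seen reasonably promptly instead of only at the original target.
        time.sleep(min(remaining, poll_interval))
```

This is nothing more than a polling loop, but it works identically everywhere that has a wall clock and a sleep call, which is the whole point.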
(If I was happy with a non-portable program that only worked on
Linux, probably the easiest path would be to use Python with the
ctypes module to directly call
clock_nanosleep with appropriate arguments.
I'm picking Python here because I expect it's the easiest language
for easy and reasonably general time parsing code. Anyways, I already
know Python and I've never used the
ctypes module, so it'd be fun.)
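A sketch of what that Linux-only ctypes approach might look like; the CLOCK_REALTIME and TIMER_ABSTIME constant values and the EINTR handling here are taken from Linux's headers, and I haven't used this in anger, so treat it as an illustration rather than tested code:

```python
import ctypes
import ctypes.util

CLOCK_REALTIME = 0  # value from Linux's <time.h>
TIMER_ABSTIME = 1   # ditto
EINTR = 4           # ditto, <errno.h>

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


class Timespec(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]


def sleep_until_wall_clock(unix_time):
    """Sleep until CLOCK_REALTIME reaches unix_time. Per POSIX, changes
    to the wall clock immediately affect when we get woken up.
    Returns 0 on success (clock_nanosleep returns the error directly)."""
    ts = Timespec(int(unix_time), int((unix_time % 1) * 1e9))
    while True:
        rc = libc.clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME,
                                  ctypes.byref(ts), None)
        if rc != EINTR:  # interrupted by a signal: just retry
            return rc
```

A target time that's already in the past should make clock_nanosleep return immediately, which is exactly the 'clock jumped forward past my deadline' behaviour I want.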
Sidebar: The torture case here is DST transitions
I started out thinking that DST transitions would be a real problem, since either an hour disappears or happens twice. For example, if I say 'wait until 2:30 am' on the night of a transition into DST, I probably want my code to wake up again when the wall-clock time ticks from 2 am straight to 3 am. Similarly, on a transition out of DST, should I say 'wake up at 2:10 am', I probably don't want my code waking up at the second 1:10 am.
However, the kernel actually deals in UTC time, not local time. In practice all of the complexity is in the translation from your (local time) time string into UTC time, and in theory a fully timezone and DST aware library could get this (mostly) right. For '2:30 am during the transition into DST', it would probably return an error (since that time doesn't actually exist), and for '2:10 am during the transition out of DST' it should return a UTC time that is an hour later than you'd innocently expect.
(This does suggest that parsing such times is sort of current-time dependent. Since there are two '1:30 am' times on the transition out of DST, which one you want depends in part on what time it is now. If the transition hasn't happened yet, you probably want the first one; if the transition has happened but it's not yet the new 1:30 am, you probably want the second.)
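Python's zoneinfo module exposes exactly this ambiguity through the datetime fold attribute, which picks between the two occurrences of a repeated wall-clock time. A small demonstration, assuming your system's timezone database includes America/New_York (which fell back on 2021-11-07):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")

# 1:30 am happened twice on 2021-11-07; fold selects which one you mean.
first = datetime(2021, 11, 7, 1, 30, tzinfo=tz, fold=0)   # still EDT
second = datetime(2021, 11, 7, 1, 30, tzinfo=tz, fold=1)  # now EST

# The same local time string maps to two UTC times an hour apart.
delta = second.astimezone(timezone.utc) - first.astimezone(timezone.utc)
```

Which fold a program should default to is precisely the 'depends on what time it is now' question above; the library can only give you both answers.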
Another risk of hardware RAID controllers is the manufacturer vanishing
We recently inherited a 16-drive machine with a 3ware hardware RAID controller and now I'm busy trying to put it to work. Our first preference is to ignore the RAID part and just use the raw disks, which may or may not work sufficiently well (the omens are uncertain at the moment). If we have to use the 3ware hardware RAID, we'll need the proprietary tools that 3ware supplies. And that has turned out to be a problem.
Once upon a time, 3ware was an independent company that made well regarded (at the time) hardware RAID controllers that were also a popular way to do JBOD IDE and SATA disks. Then it was bought by AMCC, which was bought by LSI, which was bought by Avago (now Broadcom). The 3ware website currently points to an Avago IP address that doesn't respond, and good luck finding links for anything that still works, or rather links that point to official sources for it (lots of people have made copies of this stuff and put them up on their own websites). At one point it looked like I might have to resort to the Wayback Machine in order to get something, although that probably didn't have the actual files we'd need.
(If you're ever in this situation, it turns out that you can dig things out of the Broadcom website with enough work. The downloads you want are in the 'Legacy Products' category, for example through this search link.)
I've been generally down on hardware RAID over the years for various reasons, including performance, ease of management and diagnostics, and the portability of software RAID across random hardware. But I have to admit that until now I hadn't really considered the risk of the maker of your hardware RAID card simply vanishing and taking with it the associated software that you needed to actually manage and monitor the RAID at anything except a very basic level.
(Monitoring is especially important for hardware RAID, where without special software you may not get notified about a failed disk until the second one dies and takes your entire array with it. Or a third one, for people using RAID-6.)
Of course even if the company doesn't vanish, products do get deprecated and software related to them stops being maintained. I'm reasonably hopeful that the 3ware utilities will still run on a modern Linux system, but I'm not entirely confident of it. And if they don't, we don't really have very many options. People who use less popular operating systems may have even bigger problems here (I think current versions of Illumos may have wound up with no support for 3ware, for example).
Does CR LF as a line ending cause extra problems with buffers?
When you said “state machine” in the context of network protocols, I thought you were going to talk about buffers. That’s an even more painful consequence than just the complexity of scanning for a sequence. [...]
My first reaction was that I didn't think a multi-byte line ending sequence causes extra problems, because dealing with line oriented input through buffering already gives you enough of them. Any time you read input in buffers but want to produce output in lines, you need to deal with the problem that a line may not end in the current buffer. This is especially common if you're reading through input in fixed-size chunks; you would have to be very lucky to always have a line end right at the end of every 4k block (or 16k block or whatever). Sooner or later a block boundary will happen in the middle and there you are. So you have to be prepared to glue lines together across buffers no matter what.
This is too simple a view, though, once you (ie, I) think about it more. When your line ending is a single byte, you have an unambiguous situation within a single buffer; either the line definitely ends in the buffer or it doesn't. Your check for the line ending is 'find occurrence of byte <X>' and once this fails you'll never have to re-check the current buffer's contents. This is not true with a multi-byte line ending, because the line ending CR LF sequence may be split over a buffer boundary. This means that you can no longer scan each buffer independently. Either you need to scan them together so that such split CR LF sequences are fused back together, or you need to remember that the last byte in the current buffer is a CR and look for a bare LF at the start of the next buffer.
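One way to deal with a split CR LF is to carry the unconsumed tail of each buffer over into the next one, so a trailing CR simply waits for its LF. A minimal sketch:

```python
def split_crlf_lines(chunks):
    """Yield lines from an iterable of byte chunks, treating CR LF as
    the line ending even when it straddles a chunk boundary."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        while True:
            i = pending.find(b"\r\n")
            if i < 0:
                break  # no complete line ending in what we have so far
            yield pending[:i]
            pending = pending[i + 2:]
    if pending:
        yield pending  # trailing data with no final CR LF
```

A CR at the end of one chunk stays in pending, so the search naturally finds the fused CR LF once the next chunk arrives; the cost is re-scanning the carried-over bytes.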
Of course, CR LF line endings aren't the only case in modern text processing where you have multi-byte sequences. A great deal of modern text is encoded in UTF-8, and many UTF-8 codepoints are multi-byte sequences; if you want to recognize such a codepoint in buffers of UTF-8 text, you have the same problem that the UTF-8 encoding may start at the end of one buffer and finish in the start of the next. It feels like there ought to be a general way of dealing with this that could then be trivially applied to the CR LF case.
(As Aristotle Pagaltzis kind of mentions later in his comment, this is going to involve storing state somewhere, either explicitly in a data structure or implicitly in the call stack of a routine that's pulling in the next buffer's worth of data.)
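For the UTF-8 version of this problem there is already a general mechanism in Python: incremental decoders, which keep exactly that state (the bytes of an incomplete sequence) between calls. For example:

```python
import codecs


def decode_utf8_chunks(chunks):
    """Decode an iterable of byte chunks as UTF-8, correctly handling a
    multi-byte sequence that is split across a chunk boundary."""
    dec = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # The decoder holds back an incomplete trailing sequence and
        # completes it when the next chunk supplies the rest.
        yield dec.decode(chunk)
    # Flush; this raises UnicodeDecodeError on a dangling partial sequence.
    yield dec.decode(b"", final=True)
```

The CR LF case could be handled by the same pattern: a stateful object that consumes buffers and holds back at most one trailing CR.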
What file types we see inside singleton nested zipfiles in email
Earlier, I wrote about how email attachments of a single .zip
inside another zip are suspicious. Given that
.doc malware using them has come back,
today I feel like reporting on what file types we've seen in such
cases over the past nine weeks.
(I'm picking nine weeks because we rotate this particular logfile once a week and it's thus easy to grep through just nine weeks worth.)
So here are the raw numbers:
  2292 inner zip exts: .js
  1261 inner zip exts: .doc
   606 inner zip exts: .lnk
   361 inner zip exts: .wsf
    15 inner zip exts: .jse
     5 inner zip exts: .exe
     1 inner zip exts: .txt
     1 inner zip exts: .scr
Of these 4542 emails, 3760 came from IP addresses that were listed in zen.spamhaus.org. In fact, here is the breakdown of how many of each different type were listed there:
  2051 inner zip exts: .js   (89%)
  1101 inner zip exts: .doc  (87%)
   386 inner zip exts: .lnk  (64%)
   214 inner zip exts: .wsf  (59%)
     4 inner zip exts: .jse  (27%)
     3 inner zip exts: .exe  (60%)
     1 inner zip exts: .scr  (100%)
A .jse file is an encoded JScript file (ie,
.js) under another name.
.wsf is a Windows Script File.
.lnk files are Windows shortcuts, but get abused in malware as
covered eg here
(or the interesting live scam covered here).
.scr is a Windows screensaver, which can also contain all sorts
of executable code.
There's nothing really surprising here; it's basically a greatest
hits collection of ways to run your own code on reasonably modern
Windows machines (apparently the likes of
.com are now too old for
most things). Since the
.lnk files are not with other files,
they're probably being used in the way mentioned here,
where they run Powershell or some other capable tool with a bunch
of command line arguments that pull down and run a nasty thing.
I don't know what to make of the variance in Zen listings between
the various file extensions. I suspect that it has something to do
with how big and broad a malware campaign is; if a campaign is
prolific, its sending IPs are probably more likely to trip the
detection for DNS blocklists. It seems at least reasonable that
the campaigns sending .js malware are more prolific than
the others; they certainly send us much more stuff.
I'm too much of a perfectionist about contributing to open source projects
I find it hard to contribute changes to open source projects, and in fact even the thought of trying to do so usually makes me shy away. There are a tangled bunch of reasons for why, but I've come to realize that a part of it is that I'm nervous about my work being perfect, or at least very good. Take a documentation change, for example.
If I'm going to make any public contribution of this nature, for example to make a correction to a manpage, I'm going to worry a great deal about how my writing reads. Does it flow well? Does it fit in? Have I written some clumsy, clunky sentences that are forever going to sit there as a stinker of a legacy, something I'll wind up wincing about every time I read the manpage myself? I write enough things here that don't quite work or that make me wince at the phrasing in retrospect, so it's very much a possibility and something I'm aware of.
Then there's the question of what I'm writing down. Do I actually understand the situation correctly and completely? If I'm writing about how to do something, have I picked the best way to do it, one that works effectively and is easy to do? What about backwards compatibility concerns, for example if this is different on older Linux kernels? Does the project even care about that, or do they want their documentation to only reflect the current state of affairs?
I'm not saying that I should throw half-baked, half-thought-out things at projects; that's clearly a bad idea itself. But many of these worries I wind up with are probably overblown. Maybe I don't get my writeup entirely complete on the first submission, and the existing people in the project have to point out stuff that I missed. That's probably okay. But there's this gnawing worry that it's not. I don't want to be an annoyance for open source projects and (potentially) incomplete work feels like it makes me into one.
There's also that making changes to a decent sized project feels like a terrifyingly large responsibility. After all, I could screw something up and create bugs or documentation problems or whatever, for a whole lot of people. It's much easier to just submit bug reports and make suggestions about fixes, which leaves the responsibility for the actual changes to someone else.
(Of course, submitting good changes is hard, too. It really is a surprisingly large amount of work to do it right. But that's another issue from being (overly) nervous about how good my work is, and in theory many projects are welcoming of incomplete or first-pass change submissions. I doubt I'll ever persuade myself to take them up on their offer, though.)
Different ways you can initialize a RAID-[567+] array
I was installing a machine today where we're using Linux software RAID to build a RAID-6 array of SATA HDs, and naturally one of the parts of the installation is creating and thus initializing the RAID-6 array. This is not something that goes very fast, and when I wandered past the server itself I noticed that the drive activity lights were generally blinking, not on solid. This got me thinking about various different ways that you might initialize a newly created RAID-N array.
It's obvious, but the reason newly created RAID-N arrays need to be initialized is to make the parity blocks consistent with the data blocks. The array generally starts with drives where all the blocks are in some random and unknown state, which means that the parity blocks of a RAID stripe are extremely unlikely to match with the data blocks. Initializing a RAID array fixes this in one way or another, so that you know that any parity mismatches are due to data corruption somewhere.
The straightforward way to initialize a RAID-N array is to read the current state of all of the data blocks for each stripe, compute the parity blocks, and write them out. This approach does minimal write IO, but it has the drawback that it sends an interleaved mixture of read and write IO to all drives, which may slow them down and force seeking. This happens because the parity blocks are normally distributed over all of the drives, rotating from drive to drive with each stripe. This rotation means that every drive will have parity blocks written to it and no drive sees pure sequential read or write IOs. This way minimizes write IO to any particular drive.
A clever way to initialize the array is to create it as a degraded array and then add new disks. If you have an M disk array with N-way parity, create the array with M-N disks active. This has no redundancy and thus no need to resynchronize the redundancy to be correct. Now add N more disks, and let your normal RAID resynchronization code go into effect. You'll read whatever random stuff is on those first M-N disks, assume it's completely correct, reconstruct the 'missing' data and parity from it, and write it to the N disks. The result is random garbage, but so what; it was always going to be random garbage. The advantage here is that you should be sequentially reading from the M-N disks and sequentially writing to the N disks, and disks like simple sequential read and write IO. You do however write over all of the N disks, and you still spend the CPU to do the parity computation for every RAID stripe.
The final way I can think of is to explicitly blank all the drives. You can pre-calculate the two parity blocks for a stripe with all zeros in the data blocks, then build appropriate large write IOs for each drive that interleave zero'd data blocks and the rotating parity blocks, and finally blast these out to all of the drives as fast as each one can write. There's no need to do any per-stripe computation or any read IO. The cost of this is that you overwrite all of every disk in the array.
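The arithmetic behind the blanking approach is easy to see with single (RAID-5 style) XOR parity; RAID-6's second parity block is a Reed-Solomon computation that I'm not going to sketch here. A toy illustration:

```python
def xor_parity(data_blocks):
    """Compute the single XOR parity block for a stripe of data blocks
    (RAID-5 style; RAID-6 adds a second, Reed-Solomon parity block)."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)


# Why blanking works: the parity of all-zero data blocks is itself all
# zeros, so writing zeros everywhere leaves every stripe consistent
# without reading anything back or computing per-stripe parity.
```

This is also why the pre-calculated writes in the blanking scheme are cheap: every stripe gets the identical zero parity block, so there is nothing to compute per stripe.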
(If you allow people to do regular IO to a RAID array being initialized, each scheme also needs a way to preempt itself and handle writes to a random place in the array.)
In a world with both HDs and SSDs, I don't think it's possible to say that one approach is right and the other approaches are wrong. On SSDs seeks and reads are cheap, writes are sometimes expensive, and holding total writes down will keep their lifetimes up. On HDs, seeks are expensive, reads are moderately cheap but not free, writes may or may not be expensive (depending in part on how big they are), and we usually assume that we can write as much data to them as we want with no lifetime concerns.
PS: There are probably other clever ways to initialize RAID-N arrays; these are just the three I can think of now.
(I'm deliberately excluding schemes where you don't actually initialize the RAID array but instead keep track of which parts have been written to and so have had their parity updated to be correct. I have various reactions to them that do not fit in the margins of this entry.)
PPS: The Linux software RAID people have a discussion of this issue from 2008. Back then, RAID-5 used the 'create as degraded' trick, but RAID-6 didn't; I'm not sure why. There may be some reason it's not a good idea.
Python won't (and can't) import native modules from zip archives
I've written before about running Python programs from zip files, and in general you can package up Python modules in zip files. Recently I was grumbling on Twitter about the hassles of copying multi-file Python things around, and Jouni Seppänen mentioned using zip files for this but wasn't sure whether they supported native modules (that is, compiled code, as opposed to Python code). It turns out that the answer is straightforward and unambiguous; Python doesn't support importing native modules from zip files. This is covered in the zipimport module:
Any files may be present in the ZIP archive, but only files .py and
.pyc are available for import. ZIP import of dynamic modules (.pyd,
.so) is disallowed. [...]
(The Python 2.7 documentation says the same thing. If you're like
me and had never heard of
.pyd files before, they're basically
Windows DLLs.)
I don't know about the Windows side of things, but on Unix (with
.so shared objects), this is not an arbitrary restriction Python
has imposed, which is what the term 'disallowed' might lead you to
think. Instead it's more or less inherent in the underlying API
that Python is using. Python loads native modules using the Unix
dlopen() function, and the
dlopen() manpage is specific
about its API:
void *dlopen(const char *filename, int flags);
Which is to say that
dlopen() takes a (Unix) filename as the
dynamic object to load. In order to call
dlopen(), you must have
an actual file on disk (or at least on something that can be
mmap()'d). You can't just hand
dlopen() a chunk of memory, for
example a file that you read out of a zip file on the fly, and say
'treat this as a shared object'.
dlopen() relies on
mmap() for relatively solid reasons due to
how regular shared libraries are loaded in order to share memory
between processes. Plus, wanting to turn a
block of memory into a loaded shared library is a pretty uncommon
thing; mapping existing files on disk is the common case. There are
probably potential users other than Python, but I doubt there are
very many.
In theory perhaps Python could extract the
.so file from the zip
file, write it to disk in some defined temporary location, and
dlopen() that scratch file. There are any number of potential
issues with that (including fun security ones), but what's certain
is that it would be a lot of work. Python declines to do that work
for you and to handle all of the possible special cases and so on;
if you need to deploy a native module in a bundle like this, you'll
have to arrange to extract it yourself. Among other things, that
puts the responsibility for all of the special cases on your
shoulders, not on Python's.
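If you do need to arrange this yourself, the mechanics are simple even though the special cases (cleanup, concurrent extraction, and the security of the scratch file) are not. A minimal sketch, with hypothetical archive and member names; the ctypes.CDLL call at the end is what performs the dlopen():

```python
import os
import tempfile
import zipfile


def extract_member(zip_path, member, suffix=".so"):
    """Copy one member of a zip archive out to a scratch file on disk
    and return its path; dlopen() needs a real file, not bytes in
    memory. The caller owns cleanup of the scratch file."""
    with zipfile.ZipFile(zip_path) as zf:
        data = zf.read(member)
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
    return path


# Usage (names are illustrative):
#   import ctypes
#   lib = ctypes.CDLL(extract_member("bundle.zip", "native.so"))
```

Everything awkward about this scheme lives outside the sketch: when to delete the scratch file, what directory it's safe to put it in, and what happens when two processes do this at once.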
Malware strains may go away sometimes, but they generally come back
I have a little confession. Last Tuesday I wrote about how we'd started rejecting .doc files in nested zipfiles. Although I didn't mention it in the entry, we did this because we'd seen them dominate our detected malware attempts over the weekend (with everything being identified by Sophos PureMessage as 'Mal/DrodZp-A'). Well, guess what? The moment we added that new rejection rule, all of those .docs in .zips in zipfiles vanished, with not one to be seen for our new rule to reject.
On the one hand, in theory this didn't matter; as I wrote, singleton nested zipfiles are suspicious in general and we had definitely not seen any legitimate cases of this sort of email. On the other hand, in practice we don't want to have rejection rules for everything we've seen once, because every rejection rule is a little bit of complexity added to our mail system and we want to keep the complexity down to only the things that are really worthwhile. With malware, there are always more things we could be looking for and rejecting on, so we have to draw the line somewhere; otherwise we could be playing whack-a-mole against obscure malware for months and building up a towering mass of complexity in the process. So it wasn't a good feeling to think that I might have put in a useless rejection rule and that maybe I should go back and take it out.
I won't say that I shouldn't have worried about it, but I can say
that I don't have to any more. Starting on February 6th, whatever
malware was sending this stuff our way came roaring back (well,
roaring for our traffic volume); we had 30 rejections on the 6th,
59 on the 7th, and 38 on the 8th. Just over 93% of these were from
IPs listed in the Spamhaus ZEN aggregate DNSBL, which suggests that we probably
rejected a bunch more that were sent to people who had opted in to
DNSBL based rejection (which happens at
RCPT TO time, before we
receive the message and start scanning MIME attachments).
Whatever strain of malware is responsible for sending these things
out may have temporarily turned its attention away from us for a
while, but it's back now, at least for a while.
I suppose this really shouldn't surprise me. We've seen that MyDoom is still around and there's no particular reason why a malware attack vector should stop being used as long as it's even vaguely working. Spam (malware included) comes and goes based on where the sending attention is focused today, but it's very likely to come back sooner or later. And even if a particular strain of malware is wiped out totally (by taking over its command & control infrastructure or arresting the people behind it or the like), I expect that any respite is only temporary. Sooner or later someone will come along to pick up the pieces and revive the attack techniques and address lists for their own benefit, and we'll get hit again by something that looks very much like the same old thing.
How to see and flush the Linux kernel NFS server's authentication cache
We're going to be running some Linux NFS (v3) servers soon (for
reasons beyond the scope of this entry), and we want to control
access to the filesystems that those servers will export by netgroup,
because we have a number of machines that should have access. Linux
makes this a generally very easy process, because unlike many systems
you don't need to run NIS in order to use netgroups. All you need
to do is change /etc/nsswitch.conf to say 'netgroup: files', and
then you can just put things in /etc/netgroup.
However, using netgroups makes obvious the important question of how you get your NFS server to notice changes in netgroup membership, as well as more general changes in authorizations such as changes in the DNS. If you add or delete a machine from a netgroup, or change an IP's PTR record in DNS, you want your NFS server to notice and start using the new information.
I will skip to the conclusion: the kernel maintains a cache of mappings from IP addresses to 'authentication domains' that the IP address is a member of. When it needs to know information about an IP address that it doesn't already have, the kernel asks mountd and mountd adds an entry to the cache. Entries are generally added with a time-to-live, after which they'll be automatically expired and then re-validated; mountd hard-codes this TTL to 30 minutes.
(You can read more information about this and several other interesting
things in the nfsd(7) manpage,
which describes what you'll find in the special NFS server related
/proc and associated virtual filesystems.)
You can see the contents of this cache by looking at
/proc/net/rpc/auth.unix.ip/content. Note that the cache includes
both positive entries and negative ones (where mountd has declined
to authorize a host, and so it's had mount permissions denied). To
clear this cache and force everything to revalidate, you write a
sufficiently large number to /proc/net/rpc/auth.unix.ip/flush.
So, what is a sufficiently large number?
The nfsd(7) manpage describes
flush this way:
When a number of seconds since epoch (1 Jan 1970) is written to this file, all entries in the cache that were last updated before that file become invalidated and will be flushed out. Writing 1 will flush everything. [...]
That bit about writing a
1 is incorrect and doesn't work (perhaps
this is a bug, but it's also the reality on all of the kernels that
you'll find on systems today). So you need to write something that
is a Unix timestamp that's in the future, perhaps well in the future.
If you feel like running a command to get such a number, the simple
thing is to use GNU date's relative time feature:
$ date -d tomorrow +%s
1486620251
The easier way is just to stack up
9s until everything gets
flushed. Of course there have been so many seconds since the Unix
epoch that you need quite a lot of 9s by now.
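A script along these lines could be as small as this sketch (the flush path is the one from nfsd(7); writing it requires root, and 'tomorrow' is just a comfortably future timestamp):

```python
import time


def flush_nfs_auth_cache(flush_path="/proc/net/rpc/auth.unix.ip/flush"):
    """Invalidate the kernel NFS server's IP authentication cache by
    writing a Unix timestamp from well in the future, so every entry
    was 'last updated before' it and gets flushed."""
    stamp = int(time.time()) + 86400  # one day from now
    with open(flush_path, "w") as f:
        f.write(str(stamp))
    return stamp
```

This is the programmatic equivalent of the date command above, without having to count 9s.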
Probably we're going to wrap this up in a script and put a big comment
at the start of
/etc/netgroup (and possibly
/etc/exports) about it.
Fortunately I don't expect our netgroups to change very often for these
fileservers. Their export lists will likely be mostly static, but we'll
slowly add some additional machines to the netgroup.