2016-06-24
Our new plan for creating our periodic long term backups
Our ordinary backups are done on the usual straightforward rolling basis, where we aim to have about 60 days' worth of backups. We also try to make an additional set of long term backups every so often, currently roughly three times a year, and keep these for as long as possible. Periodically this makes people very happy, because we can restore something they deleted six months ago without noticing.
Our long term backups are done with the same basic system as our regular disk-based backups. We have some additional Amanda servers that are used only for these long term backups; we load them up with disks, and then we have them do full backups of all of our filesystems to those spare disks. Obviously this requires careful scheduling and management, since we don't want to collide with the regular backups (which take priority). This is a simple approach and it works, but unfortunately over time it's become increasingly difficult and time consuming to actually do a long term backup run. The long term backups can only run during the day and require hands-on attention, sometimes the regular backups of our largest fileserver run into the day and block long term backups for that day entirely, the daytime backups go very slowly in general because our systems are actively in use, and so on. And many of these problems are only going to get worse in the future, as people use more space and are more active on our machines.
Recently, one of my co-workers had a great idea on how to deal with all of these problems: copy filesystem backups out of our existing Amanda servers. Instead of using additional Amanda servers to do additional backups, we can just make copies of the full filesystem backups from our existing regular backup system. When you do Amanda backups to 'tapes' that are actually disks, Amanda just writes each filesystem backup to a regular file. Want an extra copy, say for long term backups? Just copy it somewhere, say to the disks we're using for those long term backups. This copying doesn't bog down our fileservers, can easily be done when the Amanda servers are otherwise idle, and can be done any time we want, even days after the filesystem full backup was actually made. Effectively we've turned building the long term backups from a synchronous process into an asynchronous one.
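As a sketch, the copying really can be this simple (all of the paths here are invented, and how Amanda names its dump files varies with configuration):

    # Copy full (level 0) dumps from an Amanda vtape area over to a
    # long term backup disk. All names here are hypothetical examples.
    import os
    import shutil

    SRCDIR = "/amanda/vtapes/daily"     # where Amanda writes 'tapes'
    DSTDIR = "/ltbackups/run-2016-09"   # a long term backup disk

    for dirpath, dirnames, filenames in os.walk(SRCDIR):
        for name in filenames:
            # Assume full dumps can be picked out by a '.0' level
            # suffix in the file name; the real details depend on
            # your Amanda setup.
            if name.endswith(".0"):
                shutil.copy2(os.path.join(dirpath, name), DSTDIR)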
The drawback of abandoning Amanda is that we lose all of the Amanda
infrastructure for tracking where filesystems have been saved and
restoring filesystems (and files). It's entirely up to us to keep
track of which disk has which filesystem backup (and when it was
made) and to save per-filesystem index files. And any restores will
have to be entirely done by hand with raw dd and tar commands,
which makes them rather less convenient. But we think we can live
with all of this in exchange for it being much easier to make the
long term backups.
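For what it's worth, a by-hand restore isn't too involved. Here's a rough sketch of one in Python, assuming Amanda's default 32 KB header at the front of each dump file and an uncompressed tar format dump (the file name is invented):

    # Skip the Amanda header, then treat the rest of the file as a
    # streaming tar archive; this is the 'dd | tar' dance in Python.
    import tarfile

    AMANDA_HEADER = 32 * 1024    # Amanda's default block size

    with open("/ltbackups/run-2016-09/somehost._h_fs.0", "rb") as f:
        f.seek(AMANDA_HEADER)
        tf = tarfile.open(fileobj=f, mode="r|")
        tf.extractall("/var/tmp/restored")
        tf.close()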
Right now this is just a plan. We haven't done a long term backup run with it; the next one is likely to happen in September or October. We may find out that there are some unexpected complications or annoyances when we actually try it, although we haven't been able to think of any.
(In retrospect this feels like an obvious idea, but much like the last time, spotting it requires being able to look at things kind of sideways. All of my thoughts about the problem were focused on 'how can we speed up dumping filesystems for the long term backups' and 'how can we make them work more automatically' and so on, which were all stuck inside the existing overall approach.)
2016-06-22
I need to cultivate some new coding habits for Python 3 ready code
We have a Django application, and because various aspects of Django handle much of the heavy lifting, I expect that it's our most Python 3 ready chunk of code. Since I was working on it today anyways, I took a brief run at seeing if it would at least print help messages if I ran it under Python 3. As it turns out, making the attempt showed me that I need to cultivate some new coding habits in order to routinely write code that will be ready for Python 3.
What I stumbled over today is that I still like to write except
clauses in the old way:
    try:
        ....
    except SomeErr, err:
        ....
The new way to write except clauses is the less ambiguous 'except
SomeErr as err:'. Python 3 only supports the new style.
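To make the difference concrete, here are both spellings side by side in a trivial example:

    # Old style: Python 2 only; a syntax error under Python 3.
    try:
        open("/no/such/file")
    except IOError, err:
        print err

    # New style: works in Python 2.6+ and in Python 3.
    try:
        open("/no/such/file")
    except IOError as err:
        print(err)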
Despite writing at least some of the code in our Django application
relatively recently, I still wrote it using the old style for
except. Of course this means I need to change it all. I'm pretty
certain that writing except clauses this way is not something
that I think about; it's just a habit of how I write Python, developed
from years of writing Python before 'except CLS as ERR' existed
or at least was usable by me.
What I take away from this is that I'm going to need to make new
Python coding habits, or more exactly go through the more difficult
exercise of overwriting old habits with new ones. I'm sure to be
irrationally annoyed at some of the necessary changes, especially
turning 'print' statements into function calls.
(If I was serious about this, what I should do is force myself to write only in Python 3 for a while. Unfortunately that's not very likely.)
The good news is that I checked some code I wrote recently, and I
seem to have deliberately used the new style except clauses in
it. Now if I can remember to keep doing that, I might be in okay
shape.
(Having thought about it, what would be handy is a Python linter that complains about such unnecessary Python 3 incompatibilities. Then I'd at least have a chance of catching my issues here right away.)
PS: Modernizing an old code base is a related issue, of course. I need to modernize both code and habits to be ready for Python 3 in both current and future code.
Sidebar: The timing and rationality of using old-style except
New style except was introduced in Python 2.6, which dates back
to 2008. However, the new version of Python didn't propagate into
things like Linux distributions immediately; it took two years to
get it into an Ubuntu LTS release, for example (in 10.04). Looking
back at various records, it seems that the initial version of our
Django application was deployed on an
Ubuntu 8.04 machine that would have had only Python 2.5. In fact I may have written the first version
of all of the substantial code in the application while we were
still using 8.04 as the host machine and so new-style except would
have been unavailable to us.
This is of course no longer the case. Although not everything we
run today has Python 2.7 available (cf),
it all has at least Python 2.6. So I should be writing all my code
with new style except clauses and probably some other modernizations.
Moving from Python 2 to Python 3 calls for a code inventory
I was all set to write a blog entry breaking down what sort of Python code we had, how much it was exposed to security issues and other threats, and how much work it would probably be to migrate it to or towards Python 3. I even thought it was going to be a relatively short and simple entry. Then, as I was writing things down, I kept remembering more and more bits of Python code we're using in different contexts, and I'm pretty sure I'm still forgetting some.
So, here's my broad moral for today: if you have Python code, and you're either thinking of migrating at least some of it to Python 3 or considering whether you can ignore the alleged risks of continuing to use Python 2, your first step is (or should be) to get a code inventory. Expect this to probably take a while; you don't want just the big obvious applications, you also care about the little things in the corners.
Perhaps we're unusual, but we don't have our Python code in one or more big applications, where it's easy and obvious to look at things. Instead, we have all sorts of things written in Python, everything from a modest Django application through system management subsystems to little command line things (and not so little ones). These have accumulated over what is almost a decade by now, and if they work quietly we basically forget about them (and most of them do). It's clearly going to take me some work to find them all, categorize them, and probably in some cases discover that they're now unnecessary.
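If I ever do the inventory, the purely mechanical part of the hunt can at least be automated. Here's a quick sketch; the directory list is illustrative only, and real life would need more candidate locations and more care:

    # Walk some likely directories and report anything that looks
    # like Python: a .py suffix or a python shebang line.
    import os

    SEARCH_DIRS = ["/usr/local/bin", "/usr/local/sbin", "/opt/local"]

    def looks_like_python(path):
        if path.endswith(".py"):
            return True
        try:
            with open(path, "rb") as f:
                first = f.readline()
        except (IOError, OSError):
            return False
        return first.startswith(b"#!") and b"python" in first

    for top in SEARCH_DIRS:
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if looks_like_python(path):
                    print(path)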
Having written this, I don't know if I'm actually going to do such an inventory any time soon. The problem is that the work is a boring slog and the issue is not particularly urgent, even if we accept a '2020 or earlier' deadline on Python 2 support. Worse, if I do an inventory now and then do nothing with it, it's probably going to get out of date (wasting the work). I'd still like to know, though, if only for my own peace of mind.
2016-06-20
A tiny systemd convenience: it can reboot the system from RAM alone
One of the things I do a fair bit of is building and testing
from-scratch system installs. Not being crazy, I do this in virtual
machines (it's much faster that way). When you do this sort of work,
you live in a constant cycle of installing a machine from scratch,
testing it, and then damaging the install enough so that when you
reboot, your VM will repeat the 'install from scratch' part. Most
of the time, the most convenient way to damage the install is with
dd:
    dd if=/dev/zero of=/dev/sda bs=1024k count=32; sync
    reboot
(The sync can be important.)
Dd'ing over the start of the (virtual) disk makes sure that there isn't a partition table or a bootloader any more, and it also generally prevents the install CD environment from sniffing around and finding too many traces of your old installed OS.
On normal System V init or Upstart based systems, this sequence has
a minor little irritation: the reboot will usually fail. This is
because the reboot process needs to read files off the filesystem,
which you've overwritten and corrupted with the dd. Then you (I)
get to go off to the VM menus and say 'power cycle the machine',
which is just a tiny little interruption.
With systemd, at least in Ubuntu 16.04, this doesn't happen. Sure, a number of things run during the reboot process will spit out various errors, but systemd continues driving everything onwards anyways and will successfully reboot my virtual machine with no further activity on my part. The result is ever so slightly more convenient for my peculiar usage case.
I believe that systemd can do this for several reasons. First,
systemd parses and loads all unit files into memory when it starts
up (or you tell it 'systemctl daemon-reload'), which means that
it doesn't have to read anything from disk in order to know what
needs to be done to shut the system down. Second, systemd mostly
terminates processes itself; it doesn't need to repeatedly get
scripts to run kill and the like, which could fail if kill or
other necessary bits have been damaged by that dd. Finally,
I think that systemd can handle calling reboot() internally,
instead of having to run an executable (which might not be there)
in order to do this.
(Systemd clearly has internal support in PID 1 for rebooting the system under some circumstances. I'm not quite clear if this is the path that a normal reboot eventually takes; it's a bit tangled up in units and handling this and that and so on.)
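As an illustration of the general idea (and definitely not of how systemd actually does it), a process that's already running can reboot a Linux machine without reading anything from disk at all. Here's a sketch that calls glibc's reboot() through ctypes; run as root, this really will restart the machine on the spot, with no graceful shutdown:

    # Sync and then invoke reboot(2) directly; no /sbin/reboot
    # binary needs to exist or be readable for this to work.
    import ctypes

    LINUX_REBOOT_CMD_RESTART = 0x1234567   # from <linux/reboot.h>

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    libc.sync()
    libc.reboot(LINUX_REBOOT_CMD_RESTART)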
PS: Possibly there is a better way to damage a system this way than
dd. dd has the (current) virtue of being easy to remember and
clearly sufficient. And small variants of this dd command work on
any Unix, not just Linux (or a particular Linux).
2016-06-19
A lesson to myself: know your emergency contact numbers
Let's start with my tweets:
@thatcks: There's nothing quite like getting a weekend alert that a machine room we have network gear in is at 30C and climbing. Probably AC failure.
@thatcks: @isomer There is approximately nothing I can do, too. I'm not even sure who to potentially call, partly because it's not our machine room.
(This is the same machine room that got flooded because of an AC failure, which certainly added a degree of discomfort to the whole situation.)
In some organizations the answer here is 'go to the office and see about doing something, anything'. That is not how we work, for various reasons. It might be different if it was one of our main machine rooms, but an out of hours AC failure in a machine room we only have switches in is not a crisis sufficiently big to drag people to the office.
But, of course, there is a failure and a learning experience here, which is that I don't have any information written down about who to call to get the AC situation looked at by the university's Facilities and Services people. I've been through past machine room AC failures, and at the time I either read the signs we have on machine room doors or worked out (or heard) who to call to get it attended to, but I didn't write it down. Probably I thought that it was either obvious or surely I wouldn't forget it for next time around. Today I found out how well that went.
So, my lesson learned from this incident is that I should fix my ignorance problem once and for all. I should make a file with both in-hours and out-of-hours 'who to contact and/or notify' information for all of the machine rooms we're involved in. Probably we call the same people for a power failure as for an AC failure or another incident, but I should find out for sure and note this down too. Then I should replicate the file to at least my home machine, and probably keep a printout in the office (in case there's a failure in our main machine room, which would take our entire environment down).
(It would be sensible to also have contact information for, say, a failure in our campus backbone connection. I think I know who to try to call there, but I'm not sure and if it fails I won't exactly be able to look things up in the campus directory.)
Why ZFS can't really allow you to add disks to raidz vdevs
Today, the only change ZFS lets you make to a raidz vdev once you've created it is to replace a disk with another one. You can't do things like, oh, adding another disk to expand the vdev, which people wish for every so often. On the surface, this is an artificial limitation that could be bypassed if ZFS wanted to, although it wouldn't really do what you want. Underneath the surface, there is an important ZFS invariant that makes it impossible.
What makes this nominally easy in theory is that ZFS raidz vdevs already use variable width stripes. A conventional RAID system uses full width stripes, where all stripes span all disks. When you add another disk, the RAID system has to change how all of the existing data is laid out to preserve this full width; you go from having the data and parity striped across N disks to having it striped across N+1 disks. But with variable width stripes, ZFS doesn't have this problem; adding a disk to an existing vdev doesn't require touching any of the existing stripes, even ones that were full width stripes. All that happens is that they go from being full width stripes to being partial width stripes.
However, this is probably not really what you wanted because it doesn't get you as much new space as adding a disk does in a conventional RAID system. In a conventional RAID system, the reshaping involved both minimizes the RAID overhead and gives you a large contiguous chunk of free space at the end of the RAID array. In ZFS, simply adding a disk this way would obviously not do that; all of your old 'full width' stripes are now somewhat inefficient partial width stripes, and much of the free space is going to be scattered about in little bits at the end of those partial width stripes.
In fact, the free space issue is the fatal flaw here. ZFS raidz imposes a minimum size on chunks of free space; they must be large enough that it can write one data block plus its parity blocks (ie N+1 blocks, where N is the raidz parity level). Were we to just add another disk alongside the existing disks, much of the free space on it could in fact violate this invariant. For example, if the vdev previously had two consecutive full width stripes next to each other, adding a new disk will create a single-block chunk of free space in between them.
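A toy model makes this concrete (purely illustrative, not how ZFS actually tracks space). Take two full width stripes on a four-disk raidz1 vdev and add a fifth disk; each stripe row winds up with a single free block, below raidz1's minimum usable chunk size:

    # 'D' and 'P' are allocated data and parity blocks, '.' is free.
    # raidz1 needs free chunks of at least 2 blocks (one data block
    # plus one parity block) before it can use them.
    PARITY = 1
    MIN_CHUNK = 1 + PARITY

    old_rows = [["D", "D", "D", "P"],   # full width stripe
                ["D", "D", "D", "P"]]   # full width stripe

    # Add a fifth disk: every existing row gains one free block.
    new_rows = [row + ["."] for row in old_rows]

    for n, row in enumerate(new_rows):
        free = row.count(".")
        print("row %d: %s free=%d usable=%s"
              % (n, "".join(row), free, free >= MIN_CHUNK))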
You might be able to get around this by immediately marking such space on the new disk as allocated instead of free, but if so you could find that you got almost no extra space from adding the disk. This is probably especially likely on a relatively full pool, which is exactly the situation where you'd like to get space quickly by adding another disk to your existing raidz vdev.
Realistically, adding a disk to a ZFS raidz vdev requires the same sort of reshaping as adding a disk to a normal RAID-5+ system; you really want to rewrite stripes so that they span across all disks as much as possible. As a result, I think we're unlikely to ever see it in ZFS.
2016-06-18
It's easier to shrink RAID disk volumes than to reshape them
Once your storage system is using more than a single disk to create a pool of storage, there are a number of operations that you can want to do in order to restructure that pool of storage. Two of them are shrinking and reshaping. It's common for volume managers and modern filesystems like btrfs to be able to shrink a storage pool by removing a disk (or a set of mirrored disks), although not all modern filesystems support doing this. It's also becoming increasingly common for RAID (sub)systems to support reshaping RAID pools to do things like change from RAID-5 to RAID-6 (or vice versa); modern filesystems may also implement this sort of reshaping if they support RAID levels that can use it. Often shrinking and reshaping are lumped together as 'yeah, we support reorganizing storage in general'.
In thinking about this whole area lately, I've realized that shrinking is fundamentally easier to do than reshaping because of what it involves at a mechanical level. When you shrink a pool of storage, you do so by moving data to a new place; you move it from disk A, which you are getting rid of, to free space on other disks. When all the data has been moved off of disk A, you're done. By contrast, reshaping is almost always an in-place operation. You don't copy all the data to an entirely different set of disks, then copy it back in a different arrangement; instead you must very carefully shuffle it around in place, keeping exacting records of what has and hasn't been shuffled so you know how to refer to it.
For obvious reasons, filesystems et al already have plenty of code for allocating, writing, and freeing blocks. To implement shrinking, 'all' you need is an allocation policy that says 'never allocate on this entity' plus something that walks over the entire storage tree, finds anything allocated on the to-be-removed disk, triggers a re-allocation and re-write, and then updates bits of the tree appropriately. The tree walker is not trivial, but because all of this mimics what the filesystem is already doing you have natural answers for many questions about things like concurrent access by ordinary system activity, handling crashes and interruptions, and so on. Fundamentally, the whole thing is always in a normal and consistent state; it just has less and less of your data on the to-be-removed disk over time.
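Here's a quick toy of the 'exclude, then walk and rewrite' approach (all the names are invented, and a real filesystem must also update every pointer to the moved data, which the toy glosses over by just handing moved blocks new ids):

    # The simplest possible 'filesystem': numbered blocks that live
    # on disks, plus an allocator that can be told to avoid a disk.
    class ToyPool(object):
        def __init__(self, disks):
            self.disks = list(disks)
            self.excluded = set()
            self.blocks = {}              # block id -> (disk, data)
            self.nextid = 0

        def allocate(self, data):
            # The normal allocation path, honoring the exclusions.
            disk = [d for d in self.disks
                    if d not in self.excluded][0]
            self.nextid += 1
            self.blocks[self.nextid] = (disk, data)
            return self.nextid

        def shrink(self, gone):
            # 1: never allocate on the departing disk again.
            self.excluded.add(gone)
            # 2: walk everything, rewriting whatever is on that
            # disk through the ordinary allocate path.
            for bid in [b for b, (d, _) in self.blocks.items()
                        if d == gone]:
                _, data = self.blocks.pop(bid)
                self.allocate(data)
            self.disks.remove(gone)

    pool = ToyPool(["sda", "sdb"])
    for i in range(4):
        pool.allocate("data %d" % i)
    pool.shrink("sda")
    print(pool.blocks)        # everything now lives on 'sdb'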
This is not true for reshaping. Very few storage systems do anything like a RAID reshaping during normal steady state operation. This means you need a whole new set of code, you're going to have to be very careful to manage things like crash resistance, and a pool of storage that's in the middle of a reshaping looks very different from how it does in normal operation (which means that you can't just abandon a reshaping in mid-progress in the way you can abandon a shrink).
(This is a pretty obvious thing if you think about it, but I hadn't really considered it before now.)
PS: Not all 'shrinking' is actually shrinking in the form I'm considering here. Removing one disk from a RAID-5 or RAID-6 pool of storage is really a RAID reshape.
(It's theoretically possible to design a modern filesystem where RAID reshapes proceed like shrinking. I don't think anyone has done so, although maybe this is how btrfs works.)
2016-06-17
Why you can't remove a device from a ZFS pool to shrink it
One of the things about ZFS that bites people every so often is
that you can't remove devices from ZFS pools. If you do 'zpool add
POOL DEV', congratulations, that device or an equivalent replacement
is there forever. More technically, you cannot remove vdevs once
they're added, although you can add and remove mirrors from a
mirrored vdev. Since people do make mistakes with 'zpool add',
this is periodically a painful limitation. At this point you might
well ask why ZFS can't do this, especially since many other volume
managers do support various forms of shrinking.
The simple version of why not is ZFS's strong focus on 'write once' immutability and being a copy on write filesystem. Once it writes filesystem information to disk, ZFS never changes it; if you change data at the user level (by rewriting a file or deleting it or updating a database or whatever), ZFS writes a new copy of the data to a different place on disk and updates everything that needs to point to it. That disk blocks are not modified once written creates a whole lot of safety in ZFS and is a core invariant in the whole system.
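As a tiny illustration of the pattern (real ZFS block pointers are of course far more involved than a dict and an address):

    # Blocks are written exactly once; an 'update' is a new block at
    # a new address plus a pointer change, never an overwrite.
    blocks = {}                   # address -> data, write once

    def write_block(addr, data):
        assert addr not in blocks, "blocks are never rewritten"
        blocks[addr] = data

    write_block(0, "file contents v1")
    file_ptr = 0

    # A user-level rewrite: write a new copy at a new address and
    # repoint. The old block stays intact (eg, for snapshots) until
    # it is eventually freed.
    write_block(1, "file contents v2")
    file_ptr = 1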
Removing a vdev obviously requires breaking this invariant, because as part of removing vdev A you must move all of the currently in use blocks on A over to some other vdev and then change everything that points to those blocks to use the new locations. You need to do this not just for ordinary filesystem data (which can change anyways) but also for things like snapshots that ZFS normally never modifies once created. This is a lot of work (and code) that breaks a bunch of core ZFS invariants. As a result, ZFS was initially designed without the ability to do this and no one has added it since.
(This is/was known as 'block pointer rewrite' in the ZFS community. ZFS block pointers tell ZFS where to find things on disk (well, on vdevs), so you need to rewrite them if you move those things from one disk to another.)
About a year and a half ago, I wrote an entry about how ZFS pool shrinking might be coming. Given what I've written here, you might wonder how it works. The answer is that it cheats. Rather than touch the ZFS block pointers, it adds an extra layer underneath them that maps IO from one vdev to another. I'm sure this works, but it also implies that removing a vdev adds a more or less permanent extra level of indirection for access to all blocks that used to be on the vdev. In effect the removed vdev lingers on as a ghost instead of being genuinely gone.
(This obviously has an effect on, for example, ZFS RAM usage. That mapping data has to live somewhere, and may have to be fetched off disk, and we've seen this show before.)
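Mechanically, the indirection amounts to something like this sketch (invented names; the real mapping lives down in the ZFS vdev layer):

    # Block pointers still name the removed vdev; a mapping table
    # sends reads off to wherever the data actually went. Every
    # access to the ghost vdev pays this extra lookup, forever.
    remap = {("vdev0", 0x2000): ("vdev1", 0xa000)}   # old -> new

    def resolve(vdev, offset):
        return remap.get((vdev, offset), (vdev, offset))

    print(resolve("vdev0", 0x2000))   # redirected to vdev1
    print(resolve("vdev1", 0x4000))   # unaffected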
Having the ability to remove an accidentally added vdev is a good thing, but the more I look at the original Delphix blog entry, the more dubious I am about ever using it for anything big. A quick removal of an accidentally added vdev has the advantage that almost nothing should be on the new vdev, and normal churn might well get rid of the few bits that wound up on it (and so allow the extra indirection to go away). Shrinking an old, well used pool by a vdev or two is not going to be like that, especially if you have things like old snapshots.
2016-06-15
How (some) syndication feed readers deal with HTTP to HTTPS redirections
It's now been a bit over a year since Wandering Thoughts switched from HTTP to HTTPS, of course with a pragmatic permanent redirect from the HTTP version to the HTTPS version. In theory syndication feed readers should notice permanent HTTP redirections and update their feed fetching information to just get the feed from its new location (although there are downsides to doing this too rapidly).
In practice, apparently, not so much. Looking at yesterday's stats from this server, there are 6,700 HTTP requests for Atom feeds from 520 different IP addresses. Right away we can tell that a number of these IPs made a lot of requests, so they're clearly not updating their feed source information. Out of those IPs, 30 of them did not make HTTPS requests for my Atom feeds; in other words, they didn't even follow the redirection, much less update their feed source information. The good news is that these IPs are only responsible for 102 feed fetch attempts, and that a decent number of these come from Googlebot, Google Feedfetcher (yes, still), and another web spider of uncertain provenance and intentions. The bad news is that this appears to include some real syndication feed readers, based on their user agents, including Planet Sysadmin (which is using 'Planet/1.0'), Superfeedr, and Feedly.
The IPs that did at least seem to follow the HTTP redirection have a pretty wide variety of user agents. The good or bad news is that this includes a number of popular syndication feed readers. It's good that they're at least following the HTTP redirection, but it's bad that they're both popular and not updating feed source information after over a year of permanent HTTP redirections. Some of these feed readers include CommaFeed, NewsBlur, NetNewsWire, rss2email, SlackBot, newsbeuter, Feedbin, Digg's Feed Fetcher, Gwene, Akregator, and Tiny Tiny RSS (which has given me some heartburn before). Really, I think it's safer to assume that basically no feed readers ever update their feed source information on HTTP redirections.
As it turns out, the list of user agents here comes with a caveat. See the sidebar.
(Since it's been more than a year, I have no way to tell how many feed readers did update their feed source information. Some of the people directly fetching the HTTPS feeds may have updated, but certainly at least some of them are new subscribers I've picked up over the past year.)
At one level, this failure to update the feed source is harmless; the HTTP to HTTPS redirection here can and will continue basically forever without any problems. At another level it worries me, both for Wandering Thoughts and for blogs in general, because very few things on the web are forever and anything that makes it harder to move blogs around is worth concern. Blogs do move, and very few are going to be able to have a trail of HTTP redirections that lives forever.
(Of course the really brave way to move a blog is to just start a new one and announce it on the old one. That way it takes active interest for people to keep reading you; you'll lose the ones who aren't actually reading you any more (but haven't removed you from their feed reader) and the ones who decide they're not interested enough.)
Sidebar: Some imprecision in these results
Without more work than I'm willing to put in, I can't tell when a HTTPS request from a given IP is made due to following a redirection from a HTTP request. All I can say is that an IP address that made one or more HTTP requests also made some HTTPS requests. I did some spot checks (correlating the times of some requests from specific IPs) and they did look like HTTP redirections being followed, but this is far from complete.
The most likely place where I'd be missing a feed reader that doesn't follow redirections is shared feed reader services (ie, replacements for Google Reader). There it would be easy for one person to have imported the HTTP version of my feed and another person to have added the HTTPS version later, quite likely causing the same IP fetching both HTTPS and HTTP versions of my feed and leading me to conclude that it did follow the redirection.
I have some evidence that there is some amount of this sort of feed duplication, because as far as I can tell I see more HTTPS requests from these IPs than I do HTTP ones. Assuming my shell commands based analysis is correct, I see a number of cases where per-IP request counts are different, in both directions (more HTTPS than HTTP, more HTTP than HTTPS).
(This is where it would be really useful to be able to pull all of these Apache logs into a SQL database in some structured form so I could do sophisticated ad-hoc queries, instead of trying to do it with hacky, awkward shell commands that aren't really built for this.)
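For illustration, here's roughly what the counting could look like in Python instead of shell commands; the log file names, the combined log field positions, and the feed path are all assumptions about my particular setup:

    # Count per-IP feed fetches in the HTTP and HTTPS logs, then
    # find the IPs that never showed up on the HTTPS side.
    from collections import Counter

    def feed_counts(logfile, feedpath):
        counts = Counter()
        with open(logfile) as f:
            for line in f:
                fields = line.split()
                # Combined log format: the client IP is the first
                # field and the request URL is the seventh.
                if len(fields) > 6 and fields[6].startswith(feedpath):
                    counts[fields[0]] += 1
        return counts

    http = feed_counts("http-access.log", "/~cks/space/blog/?atom")
    https = feed_counts("https-access.log", "/~cks/space/blog/?atom")
    never = set(http) - set(https)
    print(len(never), "IPs never made HTTPS feed requests")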
ZFS on Linux has just fixed a long standing little annoyance
I've now been running ZFS on Linux for a while.
Over that time, one of the small little annoyances of the ZoL
experience has been that all ZFS commands required you to be root,
even if all you wanted to do was something innocuous like 'zpool
status' or 'zfs list'. This wasn't for any particularly good
reason and it's not how Solaris and Illumos behave; it was just
necessary because the ZoL kernel code itself had no permissions
restrictions on anything for complicated porting reasons. Anyone
who could talk to /dev/zfs could do any ZFS operation, including
dangerous and destructive ones, so it had to be restricted to root.
Like many people running ZoL, I dealt with this in a straightforward
way. To wit, I set up a /etc/sudoers.d/02-zfs file that allowed
no-password access to a great big list of ZFS commands that are
unprivileged on Solaris, and then I got used to typing things like
'sudo zpool status'. But this was never a really great experience
and it's always been a niggling annoyance.
I'm happy to report that as of a week or so ago, the latest development
version of ZoL now has fixed this issue. Normal non-root users can
now run all of the ZFS commands that are unprivileged on Solaris.
As part of this, ZoL now supports normal ZFS 'zfs allow' and 'zfs
unallow' for most operations, so you can (if desired) allow yourself
or other normal users to do things like create snapshots.
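For example, something like this should now work on ZoL the way it does on Illumos (the user, pool, and filesystem names are made up):

    zfs allow cks snapshot,mount tank/homes/cks

After that, that user can take snapshots of that filesystem without needing sudo or root.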
(Interestingly, poking around at this caused me to re-discover that
'zpool history' is a privileged operation even on Solaris. I
guess some bits of my sudoers file are going to stay.)
Things like this are part of why I've been pretty happy to run the development version of ZoL. Even the development version has been pretty stable, and it means that I've gotten a fair number of interesting and nice features well before they made it into one of the infrequent ZoL releases. I don't know how many people run the development version, but my impression is that it's not uncommon.
(I can't blame the ZoL people for the infrequent releases, because they want releases to be high quality. Making high quality releases is a bunch of work and takes careful testing. Plus sometimes the development tree has known outstanding issues that people want to fix before a release. (I won't point you at the ZoL Github issues to see this, because there's a fair amount of noise in them.))