The challenges of diagnosing slow backups
This is not the techblog entry I thought I was going to write. That entry would have been a quietly triumphal one about finding and squelching an interesting and obscure issue that was causing our backups to be unnaturally slow. Unfortunately, while the particular issue is squelched, our backups of at least our new fileservers are still (probably) unnaturally slow. Certainly they seem to be slower than we want. So this entry is about the challenge of trying to figure out why your backups are slow (and what you can do about it).
The first problem is that unless the cause is quite simple, you're going to wind up needing to make detailed observations of your systems while the backups are running. In fact you'll probably have to do this repeatedly. By itself this is not a problem as such. What makes it into one is that most backups are run out of hours, often well out of hours. If you need to watch the backups and the backups start happening at 11pm, you're going to be staying up. This has various knock-on consequences, including that human beings are generally not at their sharpest at 11pm.
(General purpose low level metrics collection can give you some immediate data but there are plenty of interesting backup slowdowns that cannot be caught with them, including our recent one. And backups run during the day (whether test runs or real ones) are generally going to behave somewhat differently from nighttime backups, at least if your overall load and activity are higher in the day.)
Beyond that issue, a running backup system is generally a quite complex beast with many moving parts. There are probably multiple individual backups in progress on multiple hosts, data streaming back to backup servers, and any number of processes in communication with each other about all of this. As we've seen, the actual effects of a problem in one piece can manifest far away from that piece. In addition, pieces may be interfering with each other; for example, perhaps running enough backups at once on a single machine causes them to contend for a resource (even an inobvious one, since it's pretty easy to spot saturated disks, networks, CPU, et al).
Complex systems create complex failure modes, which means that there are a lot of potential inobvious things that might be going on. That's a lot of things to winnow through for potential issues that pass the smell test, don't contradict any evidence of system behavior that you already have, and ideally that can be tested in some simple way.
(And the really pernicious problems don't have obvious causes, because if they did they would be easy to at least identify.)
What writing this tells me is that this is not unique to backup systems and that I should probably sit down to diagram out the overall backup system and its resources, then apply Brendan Gregg's USE Method to all of the pieces involved in backups in a systematic way. That would at least give me a good collection of data that I could use to rule things out.
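As a rough sketch of what "systematic" could look like here, the USE Method boils down to checking utilization, saturation, and errors for every resource. The resource names below are illustrative assumptions, not an inventory of any real backup setup.

```python
from itertools import product

# Hypothetical resource inventory for a backup pipeline; these names are
# assumptions made up for illustration.
RESOURCES = [
    "client disks",
    "client CPU",
    "network links",
    "backup server disks",
    "backup server CPU",
]

# The three things the USE Method asks about for every resource.
METRICS = ["utilization", "saturation", "errors"]


def use_checklist(resources, metrics=METRICS):
    """Return every (resource, metric) pair that should be looked at."""
    return list(product(resources, metrics))


def unchecked(checklist, observations):
    """Return the pairs that have no recorded observation yet."""
    return [pair for pair in checklist if pair not in observations]
```

Recording an observation for each pair as you go, even just "looked fine", leaves a list of what has been ruled out and what has not been examined at all.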
(It's nice to form hypotheses and then test them and if you get lucky you can speed everything up nicely. But there are endless possible hypotheses and thus endless possible tests, so at some point you need to make the effort to create mass winnowings.)
Caches should be safe by default
I've been looking at disk read caching systems recently. Setting aside my other issues, I've noticed something about basically all of them that makes me twitch as a sysadmin. I will phrase it this way:
Caches should be safe by default.
By 'safe' I mean that if your cache device dies, you should not lose data or lose your filesystem. Everything should be able to continue on, possibly after some manual adjustment. The broad purpose of most caches is to speed up reads; write accelerators are a different thing and should be labeled as such. When your new cache system is doing this for you, it should not be putting you at a greater risk of data loss because of some choice it's quietly made; if writes touch the cache at all, it should default to write-through mode. To do otherwise is just as dangerous as those databases that achieve great speed through quietly not really committing their data to disk.
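To make the distinction concrete, here is a toy in-memory model of the two write policies. It is only a sketch: the class and method names are invented for illustration and do not correspond to any real caching system.

```python
class CachedStore:
    """Toy model of a cache in front of a backing store (illustrative only)."""

    def __init__(self, write_through=True):
        self.backing = {}      # stands in for the real disks
        self.cache = {}        # stands in for the cache device
        self.dirty = set()     # keys that exist only in the cache
        self.write_through = write_through

    def write(self, key, value):
        self.cache[key] = value
        if self.write_through:
            # Safe default: the backing store always has the data.
            self.backing[key] = value
        else:
            # Write-back: faster, but the cache now holds the only copy.
            self.dirty.add(key)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.backing[key]
        self.cache[key] = value
        return value

    def flush(self):
        """Push all write-back data to the backing store."""
        for key in self.dirty:
            self.backing[key] = self.cache[key]
        self.dirty.clear()

    def lose_cache_device(self):
        """Simulate the cache device dying; return the keys that were lost."""
        lost = sorted(self.dirty)
        self.cache.clear()
        self.dirty.clear()
        return lost
```

With `write_through=True` a dead cache device loses nothing. With `write_through=False`, anything written since the last `flush()` is gone.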
There is a corollary that follows from this:
Caches should clearly document when and how they aren't safe.
After I've read a cache's documentation, I should not be in either doubt or ignorance about what will or can happen if the cache device dies on me. If I am, the documentation has failed. Especially it had better document the default configuration (or the default examples or both), because the default configuration is what a lot of people will wind up using. As a corollary to the corollary, the cache documentation should probably explain what I get for giving up safety. Faster than normal writes? It's just required by the cache's architecture? Avoiding a write slowdown that the caching layer would otherwise introduce? I'd like to know.
(If documenting this stuff makes the authors of the cache embarrassed, perhaps they should fix things.)
As a side note, I think it's fine to offer a cache that's also a write accelerator. But I think that this should be clearly documented, the risks clearly spelled out, and it should not be the default configuration. Certainly it should not be the silent default.
A consequence of NFS locking and unlocking not necessarily being fast
A while back I wrote Cross-system NFS locking and unlocking is not necessarily fast, about one drawback of using NFS locks to communicate between processes on different machines. This drawback is that it may take some time for process B on machine 2 to find out that process A on machine 1 has unlocked the shared coordination file. It turns out that this goes somewhat further than I realized at the time. Back then I looked at cross-system lock activity, but it turns out that you can see long NFS lock release delays even when the two processes are on the same machine.
If you have process A and process B on the same machine, both contending for access to the same file via file locking, you can easily see significant delays between active process A releasing the lock and waiting process B being notified that it now has the lock. I don't know enough about the NLM protocol to know if the client or server kernels can do anything to make the process go faster, but there are some client/server combinations where this delay does happen.
(If the client's kernel is responsible for periodically retrying pending locking operations until they succeed, it certainly could be smart enough to notice that another process on the machine just released a lock on the file and so now might be a good time for another retry.)
This lock acquisition delay can have a pernicious spiraling effect on an overall system. Suppose, not entirely hypothetically, that what a bunch of processes on the same machine are locking is a shared log file. Normally each process spends very little time doing logging and most of its time doing other work. When it goes to lock the log file to write a message, there's no contention, it gets the lock, it basically immediately releases the lock, and everyone goes on fine. But then you hit a lock collision, where processes A and B both want to write. A wins, writes its log message, and unlocks immediately. But the NFS unlock delay means that process B is then going to sit there for ten, twenty, or thirty seconds before it can do its quick write and release the lock in turn. Suppose during this time another process, C, also shows up to write a log message. Now C may be waiting too, and it too will have a big delay to acquire the lock (if locks are 'fair', eg FIFO, then it will have to wait both for B to get the lock and for the unlock delay after B is done). Pretty soon you have more and more processes piling up waiting to write to the log and things grind to a much slower pace.
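The pile-up can be sketched with a very small simulation. It assumes strictly FIFO locks, a fixed delay before a waiting process learns that the lock was released, and a tiny hold time; all of the numbers are made up for illustration rather than measured.

```python
def simulate_lock_waits(arrivals, hold=0.01, unlock_delay=20.0):
    """Return how long each process waits for the lock, in arrival order.

    arrivals     -- times (seconds) at which processes ask for the lock
    hold         -- how long each process keeps the lock once it has it
    unlock_delay -- assumed delay before a *waiting* process notices that
                    the previous holder released the lock
    """
    waits = []
    busy_until = 0.0  # time at which the current or queued holder is done
    for t in sorted(arrivals):
        if t >= busy_until:
            # Uncontended: the lock is granted immediately.
            start = t
        else:
            # Contended: queue behind the previous holder, then pay the
            # unlock notification delay on top.
            start = busy_until + unlock_delay
        waits.append(start - t)
        busy_until = start + hold
    return waits
```

For example, `simulate_lock_waits([0, 0.005, 1.0])` models A, B, and C from the text: A waits for nothing, B waits roughly one unlock delay, and C waits roughly two.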
I don't think that there's a really good solution to this for NFS, especially since an increasing number of Unixes are making all file locks be NFS aware (as we've found out the hard way before). It's hard to blame the Unixes that are doing this, and anyways the morally correct solution would be to make NLM unlocks wake up waiting people faster.
PS: this doesn't happen all of the time. Things are apparently somewhat variable based on the exact server and client versions involved and perhaps timing issues and so on. NFS makes life fun.
Sidebar: why the NFS server is involved here
Unless the client kernel wants to quietly transfer ownership of the lock being unlocked to another process instead of actually releasing it, the NFS server is the final authority on who has a NFS lock and it must be consulted about things. For all that any particular client machine knows, there is another process on another machine that very occasionally wakes up, grabs a lock on the log file, and does stuff.
Quietly transferring lock ownership is sleazy because it bypasses any other process trying to get the lock on another machine. One machine with a rotating set of processes could unfairly monopolize the lock if it did that.
Bind mounts with systemd and non-
Under normal circumstances the way you deal with Linux bind mounts on a systemd based system is the same as always: you put them in /etc/fstab and systemd makes everything work just like normal. If you can deal with your bind mounts this way, I recommend that you do it and keep your life simple. But sometimes life is not simple.
Suppose, not entirely hypothetically, that you are dealing with base filesystems that aren't represented in /etc/fstab for one reason or another; instead they appear through other mechanisms. For example, perhaps they appear when you import a ZFS pool. You want to use these filesystems as the source of bind mounts.
The first thing that doesn't work is leaving your bind mounts in /etc/fstab. There is no way to tell systemd to not create them until something else happens (eg your unit finishes or their source directory appears), so this is basically never going to do the right thing. If you get bind mounts at all they are almost certainly not going to be bound to what you want.
At this point you might be tempted to think 'oh, systemd makes /etc/fstab mounts into magic <name>.mount systemd units, I can just put files in /etc/systemd/system to add some extra dependencies to those magic units'. Sadly this doesn't work; the moment you have a real <name>.mount unit file it entirely replaces the information from /etc/fstab and systemd will tell you that your <name>.mount file is invalid because it doesn't specify what to mount.
In short, you need real .mount units for your bind mounts. You also need to force the ordering, and
here again we run into something that would be nice but doesn't
work. If you run 'systemctl list-units -t mount', you will see
that there are units for all of your additional non-
It's tempting to make your bind mount unit depend on an appropriate
mount unit for its source filesystem, eg if you have a bind mount
/archive/something you'd have it depend on
Unfortunately this doesn't work reliably because systemd doesn't
actually know about these synthetic mount units before the mount
appears. Instead you can only depend on whatever
actually does the mounting, such as
(In an extreme situation you could create a service unit that just
used a script to wait for the mounts to come up. With a
service unit, systemd won't consider the service successful until
the script exits.)
The maximally paranoid set of dependencies and guards is something like this:
    [Unit]
    After=zfs-mount.service
    Requires=zfs-mount.service
    RequiresMountsFor=/var
    ConditionPathIsDirectory=/local/var/local
(This is for a bind mount from
We can't use a RequiresMountsFor on /local/var, because as far
as systemd is concerned it's on the root filesystem and so the
dependency would be satisfied almost immediately. I don't think the
Condition will cause systemd to wait for
appear, just stop the bind mount from trying to be done if ZFS
mounts happened but they didn't manage to mount a
for some reason (eg a broken or missing ZFS pool).
If /var is actually on the root filesystem, the
RequiresMountsFor is likely gilding the lily; I don't think there's
any situation where this unit can even be considered before the
root filesystem is mounted. But if it's a separate filesystem you
definitely want this and so it's probably a good habit in general.)
I haven't tested using
local-var.mount in just the
here but I'd expect it to fail for the same reason that it definitely
doesn't work reliably in an
After. This is kind of a pity, but
there you go and the Condition is probably good enough.
(If you don't want to make a bunch of .mount files, one for each mount, you could make a single .service unit that has all of the necessary dependencies and runs appropriate commands to do the bind mounting (either directly or by running a script). If you do this, don't forget to have ExecStop stuff to also do the unmounts.)
Sidebar: the likely non-masochistic way to do this for ZFS on Linux
If I was less stubborn, I would have set all of my ZFS filesystems to have 'mountpoint=legacy' and then explicitly mentioned and mounted them in /etc/fstab. Assuming that it worked (ie that systemd didn't try to do the mounts before the ZFS pool came up), this would have let me keep the bind mounts in fstab too and avoided this whole mess.
How you create a systemd .mount file for bind mounts
One of the types of units that systemd supports is mount units (see
'man systemd.mount'). Normally you set up all your mounts with
/etc/fstab entries and you don't have to think about them, but
under some specialized circumstances you can wind up needing to
.mount service files for some mounts.
How to specify most filesystems is pretty straightforward, but it's not quite clear how you specify Linux bind mounts. Since I was just wrestling repeatedly with this today, here is what you need to put in a systemd .mount file to get a bind mount:
    [Mount]
    What=/some/old/dir
    Where=/the/new/dir
    Type=none
    Options=bind
This corresponds to the mount command 'mount --bind /some/old/dir /the/new/dir' and an /etc/fstab line of '/some/old/dir /the/new/dir none bind'. Note that the type of the mount is none, not bind as you might expect. This works because current versions of mount will accept arguments of '-t none -o bind' as meaning 'do a bind mount'.
(I don't know if you can usefully add extra options to the
setting or if you'd need an actual script if you need to, eg, make
a bind mountpoint read-only. If you can do it in
can probably do it here.)
A fully functioning
.mount unit will generally have other stuff
as well. What I've wound up using on Fedora 20 (mostly copied from
    [Unit]
    DefaultDependencies=no
    Conflicts=umount.target
    Before=local-fs.target umount.target

    [Mount]
    [[ .... whatever you need ...]]

    [Install]
    WantedBy=local-fs.target
Add additional dependencies, documentation, and so on as you need or want them. For what it's worth, I've also had bind mount units work without the three [Unit] bits I have here.
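One detail worth a sketch is the unit's file name. systemd requires a .mount unit to be named after the mount point it controls (the Where= path), using its path escaping rules; 'systemd-escape -p --suffix=mount /the/new/dir' is the authoritative way to get the name. The function below is an approximation of those rules for absolute paths without '.' or '..' components, and its name is made up for illustration.

```python
import string

# Characters that systemd leaves unescaped in unit names.
_SAFE = set(string.ascii_letters + string.digits + ":_.")


def mount_unit_name(where):
    """Approximate the .mount unit name systemd expects for a mount point.

    Assumes `where` is an absolute path with no '.' or '..' components.
    """
    parts = [p for p in where.split("/") if p]
    if not parts:
        # The root directory is a special case.
        return "-.mount"
    trimmed = "/".join(parts)
    out = []
    for i, byte in enumerate(trimmed.encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif ch in _SAFE and not (i == 0 and ch == "."):
            out.append(ch)
        else:
            # Everything else, including a literal '-', becomes \xXX.
            out.append("\\x%02x" % byte)
    return "".join(out) + ".mount"
```

So the example bind mount above would live in a file called the-new-dir.mount. Mount points containing a '-' get visibly uglier names, since the dash itself has to be escaped.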
Note that this assumes a 'local' filesystem, not a network one. If
you're dealing with a network filesystem or something depending on
one, you'll need to change bits of the targets (systemd documentation
Copying GPT partition tables from disk to disk
There's a number of situations where you want to replicate partition tables from one disk to another disk; for example, if you are setting up mirroring or (more likely) replacing a dead disk in a mirrored setup with a new one. If you're using old fashioned MBR partitioning, the best tool for this is sfdisk and it's done as follows:
sfdisk -d /dev/OLD | sfdisk /dev/NEW
Under some situations you may need '
If you're using new, modern GPT partitioning, the equivalent of sfdisk is sgdisk. However it gets used somewhat differently and you need two operations:
    sgdisk -R=/dev/NEW /dev/OLD
    sgdisk -G /dev/NEW
For obvious reasons you really, really don't want to accidentally flip the arguments. You need sgdisk -G to update the new disk's partitions to have different GUIDs from the original disk, because GUIDs should be globally unique even if the partitioning is the same.
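Since the argument order is so easy to get wrong, one option is to build the two commands from explicitly named parameters instead of typing them by hand. This is only a sketch: the function name is invented, it uses exactly the sgdisk invocations shown above, and it returns the commands rather than running them.

```python
def gpt_copy_commands(old, new):
    """Return the two sgdisk command lines that copy old's GPT onto new.

    old -- the existing disk whose partition table is the source
    new -- the replacement disk that will be overwritten
    """
    if old == new:
        raise ValueError("source and destination must be different disks")
    return [
        # Replicate the partition table of `old` onto `new`.
        ["sgdisk", "-R=" + new, old],
        # Give `new` fresh disk and partition GUIDs.
        ["sgdisk", "-G", new],
    ]
```

Calling it with keyword arguments, as in `gpt_copy_commands(old="/dev/OLD", new="/dev/NEW")`, makes it obvious at the call site which disk is about to be overwritten. The resulting lists can be reviewed and then passed to something like `subprocess.run`.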
The easiest way to see if your disks are using GPT or MBR partitioning
is probably to run '
fdisk -l /dev/DISK' and look at what the
'Disklabel type' says. If it claims GPT partitioning, you can
then run '
sgdisk -p /dev/DISK' to see if
sgdisk likes the full
GPT setup or if it reports problems. Alternately you can use '
-l /dev/DISK' and pay careful attention to the '
scan' results, but this option is actually kind of dangerous; under
gdisk will stop to prompt you about what to do
about 'corrupted' GPTs.
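If you need to check a lot of disks, picking the 'Disklabel type' line out of captured 'fdisk -l' output is easy to script. The sketch below assumes the 'Disklabel type: gpt' and 'Disklabel type: dos' line format printed by recent util-linux versions of fdisk; the function name is made up for illustration.

```python
def disklabel_type(fdisk_output):
    """Return the value of the 'Disklabel type' line, or None if absent.

    fdisk_output -- the text printed by 'fdisk -l /dev/DISK'
    """
    for line in fdisk_output.splitlines():
        if line.startswith("Disklabel type:"):
            return line.split(":", 1)[1].strip()
    # Older fdisk versions, or disks with no partition table, print no
    # such line at all.
    return None
```

A result of 'gpt' is the cue to follow up with 'sgdisk -p', as described above.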
sgdisk lacks any fully supported way of dumping and saving a relatively generic dump of partition information; 'sgdisk -b' explicitly creates something which the documentation says should not be restored on anything except the original disk. This is a hassle if you want to create a generic GPT based partitioning setup which you will exactly replicate on a whole fleet of disks (not that we use GPT partitioning on our new iSCSI backends, partly for this reason).
(I suspect that in practice you can use 'sgdisk -b' dumps for this even if it's not officially supported, but enhh. Don't forget to run 'sgdisk -G' on everything afterwards.)
(This is the kind of entry that I write so I have this information in a place where I can easily find it again.)
The problem with self-contained 'application bundle' style packaging
In a comment on my entry on FreeBSD vs Linux for me, Matt Campell asked (quoting me in an earlier comment):
I also fundamentally disagree with an approach of distributing applications as giant self contained bundles.
Why? Mac OS X, iOS, and Android all use self-contained app bundles, and so do the smarter third-party developers on Windows. It's a proven approach for packaged applications.
To answer this I need to add an important bit of context that may not have been clear in my initial comment and certainly isn't here in this extract: I was talking about PC-BSD in specific and in general the idea that the OS provider would distribute their packages this way.
Let's start with a question. Suppose that you start with a competently done .deb or RPM of Firefox and then convert it into one of these 'application bundles' instead. What's the difference between the contents of the two packagings of Firefox? Clearly it is that some of Firefox's dependencies are going to be included in the application bundle, not just Firefox itself. So what dependencies are included, or to put it another way, how far down the library stack do you go? GTK and FreeType? SQLite? The C++ ABI support libraries? The core C library?
The first problem with including some or all of these dependencies is that they are shared ones; plenty of other packages use them too. If you include separate copies in every package that uses them, you're going to have a lot of duplicate copies floating around your system (both on disk and in memory). I know disk and RAM are both theoretically cheap these days, but yes this still matters. In addition, packaging copies of things like GTK runs into problems with stuff that was designed to be shared, like themes.
(A sufficiently clever system can get around the duplication issue, but it has to be really clever behind the backs of these apparently self contained application bundles. Really clever systems are complex and often fragile.)
The bigger problem is that the capabilities enabled by bundling dependencies will in practice essentially never be used for packages supported by the OS vendor. Sure, in theory you could ship a different minor version of GTK or FreeType with Firefox than with Thunderbird, but in practice no sane release engineering team or security team will let things go out the door that way because if they do they're on the hook for supporting and patching both minor versions. In practice every OS-built application bundle will use exactly the same minor version of GTK, FreeType, the C++ ABI support libraries, SQLite, and so on. And if a dependency has to get patched because of one application, expect new revisions of all applications.
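The support cost can be made concrete with a toy model of a set of bundles. The applications' dependency lists and all of the version numbers below are hypothetical, invented only to show the bookkeeping an OS vendor would face.

```python
from collections import defaultdict

# Hypothetical bundles and versions, made up purely for illustration.
BUNDLES = {
    "firefox": {"gtk": "3.10", "freetype": "2.5", "sqlite": "3.8"},
    "thunderbird": {"gtk": "3.10", "freetype": "2.5", "sqlite": "3.8"},
    "liferea": {"gtk": "3.12", "sqlite": "3.8"},
}


def versions_in_use(bundles):
    """Map each library to the set of versions shipped across all bundles."""
    seen = defaultdict(set)
    for deps in bundles.values():
        for lib, version in deps.items():
            seen[lib].add(version)
    return dict(seen)


def bundles_to_reissue(bundles, lib):
    """List the bundles that need a new revision when `lib` is patched."""
    return sorted(app for app, deps in bundles.items() if lib in deps)
```

Every library with more than one entry in `versions_in_use()` is an extra version that the vendor's security team has to track and patch, and `bundles_to_reissue()` shows how a single library fix fans out into new revisions of every bundle that carries it.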
(In fact pretty much every source of variation in dependencies is a bad idea at the OS vendor level. Different compile options for different applications? Custom per-application patches? No, no, no, because all of them drive up the support load.)
So why is this approach so popular in Mac OS X, iOS, Windows, and so on? Because it's not being used by the OS vendor. Creators of individual applications have a completely different perspective, since they're only on the hook to support their own application. If all you support is Firefox, there is no extra cost to you if Thunderbird or Liferea is using a different GTK minor version because updating it is not your responsibility. In fact having your own version of GTK is an advantage because you can't have support costs imposed on you because someone else decided to update GTK.
What I want out of a Linux SSD disk cache layer
One of the suggestions in response to my SSD dilemma was a number of Linux kernel systems that are designed to add a caching layer on top of regular disks; the leading candidates here seem to be dm-cache and bcache. I looked at both of them and unfortunately I don't like either one because they don't work in the way I want.
Put simply, what I want is the ability to attach a SSD read accelerator to my filesystems or devices without changing how they are currently set up. What I had hoped for was some system where you told things 'start caching traffic from X, Y, and Z' and it would all transparently just happen; your cache would quietly attach itself to the rest of the system somehow and that would be that. Later you could say 'stop caching traffic from X', or 'stop entirely', and everything would go back to how it was before. Roughly speaking this is the traditional approach taken by local disks used to cache and accelerate NFS reads in a few systems that implemented that.
Unfortunately this isn't what dm-cache and bcache do. Both of them function as an additional, explicit layer in the Linux storage stack, and as explicit layers you don't mount, say, your filesystem from its real device, you mount it from the dm-cache or bcache version of it. Among other things, this makes moving between using a cached version and a non-cached version of your objects a somewhat hair raising exercise; for example, bcache explicitly needs to change an existing underlying filesystem. Want to totally back out from using bcache or dm-cache? You're probably going to have a headache.
(This is especially annoying because there are two cache options in Linux today and who knows which one will be better for me.)
Both dm-cache and bcache are probably okay for a large deployment where they are planned from the start. In a large deployment you will evaluate each in your scenario, determine which one you want and what sort of settings you want, and then install machines with the caching layer configured from the start. You expect to never remove your chosen caching layer; generally you'll have specifically configured your hardware fleet around the needs of the caching layer.
None of this describes the common scenario of 'I have an existing machine with a bunch of existing data, and I have enough money for a SSD. I'd like to speed up my stuff'. That is pretty much my scenario (at least to start with). I rather expect it's very much the scenario of any number of people with existing desktops.
(It's also effectively the scenario for new machines for people who do not buy their desktops in bulk. I'm not going to spec out and buy a machine configuration built around the assumption that some Linux caching layer will turn out to work great for me; among other things, it's too risky.)
PS: if I've misunderstood how dm-cache or bcache work, my apologies; I have only skimmed their documentation. Bcache at least has a kind of scary FAQ about using (or not using) it on existing filesystems.
Intel has screwed up their DC S3500 SSDs
I ranted about this on Twitter a few days ago when we discovered it the hard way but I want to write it down here and then cover why what Intel did is a terrible idea. The simple and short version is this:
Intel switched one of their 'datacenter' SSDs from reporting 512 byte 'physical sectors' to reporting 4096 byte physical sectors in a firmware update.
Specifically, we have the Intel DC S3500 80 GB model in firmware versions D2010355 (512b sectors) and D2010370 (4K sectors). Nothing in the part labeling changed other than the firmware version. Some investigation since our initial discovery has turned up that the 0370 firmware apparently supports both sector sizes and is theoretically switchable between them, and this apparently applies to both the DC S3500 and DC S3700 series SSDs.
This is a really terrible idea that should never have passed even a basic smell test in a product that is theoretically aimed at 'datacenter' server operations. There are applications where 512b drives and 4K drives are not compatible; for example, in some ZFS pools you can't replace a 512b SSD with a 4K SSD. Creating incompatible drives with the same detailed part number is something that irritates system administrators a great deal and of course it completely ruins the day of people who are trying to have and maintain a spares pool.
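A spares check along these lines can be sketched in a few lines. ZFS records a vdev's sector size as 'ashift', the base-2 logarithm of the size in bytes (9 for 512, 12 for 4096). The compatibility rule below is a deliberately simplified assumption that models the 'can't replace 512b with 4K' case; real ZFS implementations differ in exactly what they accept. The sysfs path is Linux-specific, and the function names are invented for illustration.

```python
from pathlib import Path


def physical_sector_size(dev, sysfs_root="/sys"):
    """Read the physical sector size a disk reports to Linux, in bytes."""
    path = Path(sysfs_root) / "block" / dev / "queue" / "physical_block_size"
    return int(path.read_text().strip())


def ashift_for(sector_bytes):
    """Return log2 of a power-of-two sector size (512 -> 9, 4096 -> 12)."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


def spare_is_compatible(pool_ashift, spare_sector_bytes):
    """Simplified assumption: a spare is usable only if its reported
    sector size is no larger than what the existing vdev was built for."""
    return ashift_for(spare_sector_bytes) <= pool_ashift
```

Under this model a pool built on the 0355 firmware drives (ashift 9) rejects a 0370 firmware drive that reports 4096 byte sectors, even though the part number on the label is identical.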
This Intel decision is especially asinine because the 'physical sector size' that these SSDs are reporting is essentially arbitrary (as we see here, it is apparently firmware-settable). The actual flash memory itself is clumped together in much larger units in ways that are not well represented by 'physical sector size', which is one reason that all SSDs report whatever number is convenient here.
There may well be good reason to make SSDs report as 4k sector drives instead of 512b drives; if nothing else it is a small bit closer to reality. But having started out with the DC S3500 series reporting 512b sectors Intel should have kept them that way in their out of box state (and made available a utility to switch them to 4k). If Intel felt it absolutely had to change that for some unfathomable reason, it should have at least changed the detailed part number when it updated the firmware; then people maintaining a spares stock would at least have some sign that something was up.
(Hopefully other SSD vendors are not going to get it into their heads to do something this irritating and stupid.)
In related news we now have a number of OmniOS fileservers which we literally have no direct spare system disks for, because their current system SSDs are the 0355 firmware ones.
(Yes, we are working on fixing that situation.)
Hardware can be weird, Intel 10G-T X540-AT2 edition
Every so often I get a pointed reminder that hardware can be very weird. As I mentioned on Twitter today, we've been having one of those incidents recently. The story starts with the hardware for our new fileservers and iSCSI backends, which is built around SuperMicro X9SRH-7TF motherboards. These have an onboard Intel X540-AT2 chipset that provides two 10G-T ports. The SuperMicro motherboard and BIOS lights up these ports no later than when you power the machine on and leave it sitting in the BIOS, and maybe earlier (I haven't tested).
On some but not all of our motherboards, the first 10G-T port lights up (in the BIOS) at 1G instead of 10G. When we first saw this on a board we thought we had a failed board and RMA'd it; the replacement board behaved the same way but when we booted an OS (I believe a Linux) the port came up at 10G and we assumed that all was well. Then we noticed that some but not all of our newly installed OmniOS fileservers had their first port (still) coming up at 1G. At first we thought we had cable issues, but the cables were good.
In the process of testing the situation out, we rebooted one OmniOS fileserver off a CentOS 7 live cd to see if Linux could somehow get 10G out of the hardware. Somewhat to my surprise it could (and a real full 10G at that). More surprising, the port stayed at 10G when we rebooted into OmniOS. It stayed at 10G in OmniOS over a power cycle and it even stayed at 10G after a full power off where we cut power to the entire case for several minutes. Further testing showed that it was sufficient merely to boot the CentOS 7 live cd on an affected server without ever configuring the interface (although it's possible that the live cd configures the interface up to try DHCP and then brings it down again).
There's a lot of weirdness here. It'd be one thing for the Linux driver to bring up 10G where the OmniOS one didn't; then it could be that the Linux driver was more comprehensive about setting up the chipset properly. For it to be so firmly persistent is another thing, though; it suggests that Linux is reprogramming something that stays programmed in nonvolatile storage. And then there's the matter of this happening only on some motherboards and only to one port out of two that are driven by the same chipset.
Ultimately, who knows. We're happy because we apparently have a full solution to the problem, one we've actually carried out on all of the machines now because we needed to get them into production.
(As far as we can easily tell, all of the motherboards and the motherboard BIOSes are the same. We haven't opened up the cases to check the screen printing for changes and aren't going to; these machines are already installed and in production.)