Wandering Thoughts archives


The shifting goals of our custom NFS mount authorization system

We've been doing custom authorization for NFS mounts in our overall environment for a very long time. Our most 'recent' system for this is the NSS-based custom NFS mount authorization scheme that we introduced on our original Solaris fileservers and now run on our OmniOS-based fileservers; this system has now been running for on the order of a decade. In one sense how this system operates has remained the same over that time (in that it still uses the same basic mechanics); in another sense, things have changed significantly because our goals and priorities for NFS mount authorization have changed in a decade.

In our system, NFS mount permission is based on what netgroup a machine is in, but we then authenticate that the machine hasn't been replaced with an impostor before we allow the NFS mount. We have two sorts of machines that do NFS mounts from our fileservers: our own (Linux) servers, which are only on a couple of our networks, and then a number of additional machines run by other people on various of our sandbox networks. Our custom authorization systems have historically verified the identity of all NFS clients, both our machines and other people's machines, and the initial decade-ago version of the current system was no different.

However, over time we ran into issues with verifying our own servers. There was a whole collection of failure modes where some or many of our servers could get verification failures, and then the entire world exploded because NFS mounts are absolutely critical to having a working machine. At one point we made a quick pragmatic decision to temporarily disable the host verification for our own servers, and then as time went on we became more and more convinced that this wasn't just an expedient hack, it was the correct approach. These days our servers live on a machine room network where no outside machines are allowed, so if you can swap your own impostor machine in you have physical access to our machine room and we have major problems.

(Well, there are other options, but they're all about equally bad for us.)

As a result of this, we've now explicitly shifted to viewing our custom NFS mount authorization system as being just for verifying not-us machines (or more exactly, machines on networks we don't trust). This matters because those machines shouldn't be as crucially dependent on our NFS filesystems as our own servers are, and so we can afford to design a system that works somewhat differently, for example by requiring some active step by the NFS client to get a machine authenticated.

(We have a central administrative filesystem that's so crucial to our machines that most of them won't finish booting until they can mount it. No non-us machine should be so dependent on our NFS infrastructure (hopefully we aren't going to find out someday that one of them is anyway).)

Especially with security-related systems, it's probably a good idea to sit down periodically and re-validate all of your assumptions about how they need to work. It's very easy for your threat model to shift (as ours did), as well as your goals and needs. There's also the question of how much security the system has to provide, and at what cost (in potential misfires, complexity, and so on). You may find that the passage of time has changed your views on this for various reasons.

NFSMountAuthShiftingGoals written at 01:11:45


An implementation difference in NSS netgroups between Linux and Solaris

NSS is the Name Service Switch, or as we normally know it, /etc/nsswitch.conf. The purpose of NSS is to provide a flexible way for sysadmins to control how various things are looked up, instead of hard-coding it. For flexibility and simplicity, the traditional libc approach is to use loadable shared objects to implement the various lookup methods that nsswitch.conf supports. The core C library itself has no particular knowledge of the files or dns nsswitch.conf lookup type; instead that's implemented in a shared library such as libnss_files.

(This is a traditional source of inconvenience when building software, because it usually makes it impossible to create a truly static binary that uses NSS-based functions. Those functions intrinsically want to parse nsswitch.conf and then load appropriate shared objects at runtime. Unfortunately this covers a number of important functions, such as looking up the IP addresses for hostnames.)

The general idea of NSS and the broad syntax of nsswitch.conf is portable between any number of Unixes, fundamentally because it's a good idea. The shared object implementation technique is reasonably common; it's used in at least Solaris and Linux, although I'm not sure about elsewhere. However, the actual API between the C library and the NSS lookups is not necessarily the same, not just in things like the names of functions and the parameters they get passed, but even in how operations are structured. As it happens we've seen an interesting example of this divergence in a fundamental way.

Because it comes from Sun, one of the traditional things that NSS supports looking up is netgroup membership, via getnetgrent() and friends. In the Solaris implementation of NSS's API for NSS lookup types, all of these netgroup calls are basically passed directly through to your library. When a program calls innetgr(), there is a whole chain of NSS API things that will wind up calling your specific handler function for this if you've set one. This handler function can do unusual things if you want, which we use for our custom NFS mount authorization.

We've looked at creating a similar NSS netgroup module for Linux (more than once), but in the end we determined it's fundamentally impossible because Linux implements NSS netgroup lookups differently. Specifically, Linux NSS does not make a direct call to your NSS module to do an innetgr() lookup. On Linux, NSS netgroup modules only implement the functions used for getting the entire membership of a netgroup, and glibc implements innetgr() internally by looping through all the entries of a given netgroup and checking each one. This reduces the API that NSS netgroup modules have to implement but unfortunately makes our hack impossible, because it relies on knowing which specific host you're checking for netgroup membership.
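To make the difference concrete, here is a minimal Python sketch of the two call structures. It is only an illustration under stated assumptions: the function names, the `MEMBERS` table, and the `VERIFIED` set are all hypothetical, and the real C library code also deals with nested netgroups and with the user and domain fields, which this ignores.

```python
from typing import Callable, Iterable, Optional, Tuple

# A netgroup entry is a (host, user, domain) triple; None acts as a wildcard.
Triple = Tuple[Optional[str], Optional[str], Optional[str]]


def solaris_style_innetgr(handler: Callable[[str, str], bool],
                          netgroup: str, host: str) -> bool:
    """The membership question goes to the module, which sees the host."""
    return handler(netgroup, host)


def linux_style_innetgr(enumerate_netgroup: Callable[[str], Iterable[Triple]],
                        netgroup: str, host: str) -> bool:
    """The module only lists entries; the caller does the matching itself."""
    for entry_host, _user, _domain in enumerate_netgroup(netgroup):
        if entry_host is None or entry_host == host:
            return True
    return False


# Hypothetical data standing in for a real netgroup source.
MEMBERS = {"nfs-clients": [("apps1", None, None), ("sandbox7", None, None)]}
VERIFIED = {"apps1"}  # hosts that passed some (hypothetical) identity check


def custom_handler(netgroup: str, host: str) -> bool:
    """A module-side check that can veto a specific host."""
    in_group = any(h == host for h, _, _ in MEMBERS.get(netgroup, []))
    return in_group and host in VERIFIED


def list_members(netgroup: str) -> Iterable[Triple]:
    """All a module can offer when it is only ever asked for the full list."""
    return MEMBERS.get(netgroup, [])
```

With the first structure, `custom_handler` can refuse `sandbox7` because it is told which host is being checked. With the second, `list_members` is never told the host, so the unverified `sandbox7` is accepted purely on list membership.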

At one level this is just an implementation choice (and a defensible one in both directions). At another level, this says something about how Solaris and Linux see netgroups and how they expect them to be used. Solaris's implementation permits efficient network-based innetgr() checks, where you only have to transmit the host and netgroup names to your <whatever> server and it may have pre-built indexes for these lookups. The Linux version requires you to implement a smaller API, but it relies on getting a list of all hosts in a netgroup being a cheap operation. That's probably true today in most environments, but it wasn't in the world where netgroups were first created, which is why Solaris does things the way it does.

(Like NSS, netgroups come from Solaris. Well, they come from Sun; netgroups predate Solaris, as they're part of YP/NIS.)

NSSNetgroupsDifference written at 01:33:20


The increasingly surprising limits to the speed of our Amanda backups

When I started dealing with backups the slowest part of the process was generally writing things out to tape, which is why Amanda was much happier when you gave it a 'holding disk' that it could stage all of the backups to before it had to write them out to tape. Once you had that in place, the speed limit was generally some mix between the network bandwidth to the Amanda server and the speed of how fast the machines being backed up could grind through their filesystems to create the backups. When networks moved to 1G, you (and we) usually wound up being limited by the speed of reading through the filesystems to be backed up.

(If you were backing up a lot of separate machines, you might initially be limited by the Amanda server's 1G of incoming bandwidth, but once most machines started finishing their backups you usually wound up with one or two remaining machines that had larger, slower filesystems. This slow tail wound up determining your total backup times. This was certainly our pattern, especially because only our fileservers have much disk space to back up. The same has typically been true of backing up multiple filesystems in parallel from the same machine; sooner or later we wind up stuck with a few big, slow filesystems, usually ones we're doing full dumps of.)

Then we moved our Amanda servers to 10G-T networking and, from my perspective, things started to get weird. When you have 1G networking, it is generally slower than even a single holding disk; unless something's broken, modern HDs will generally do at least 100 Mbytes/sec of streaming writes, which is enough to keep up with a full speed 1G network. However this is only just over 1G data rates, which means that a single HD is vastly outpaced by a 10G network. As long as we had a number of machines backing up at once, the Amanda holding disk was suddenly the limiting factor. However, for a lot of the run time of backups we're only backing up our fileservers, because they're where all the data is, and for that we're currently still limited by how fast the fileservers can do disk IO.

(The fileservers only have 1G network connections for reasons. However, usually it's disk IO that's the limiting factor, likely because scanning through filesystems is seek-limited. Also, I'm ignoring a special case where compression performance is our limit.)

All of this is going to change in our next generation of fileservers, which will have both 10G-T networking and SSDs. Assuming that the software doesn't have its own IO rate limits (which is not always a safe assumption), both the aggregate SSDs and all the networking from the fileservers to Amanda will be capable of anywhere from several hundred Mbytes/sec up to as much 10G bandwidth as Linux can deliver. At this point the limit on how fast we can do backups will be down to the disk speeds on the Amanda backup servers themselves. These will probably be significantly slower than the rest of the system, since even striping two HDs together would only get us up to around 300 Mbytes/sec at most.

(It's not really feasible to use a SSD for the Amanda holding disk, because it would cost too much to get the capacities we need. We currently dump over a TB a day per Amanda server, and things can only be moved off the holding disk at the now-paltry HD speed of 100 to 150 Mbytes/sec.)
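As a rough back-of-the-envelope check on those holding disk numbers, the following Python sketch works out how long it takes to move one TB off a holding disk at the quoted HD speeds. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbyte = 10^6 bytes) and a perfectly sustained rate, neither of which the text specifies.

```python
ONE_TB = 1_000_000_000_000  # decimal terabyte; an assumption about units


def drain_hours(data_bytes: float, mbytes_per_sec: float) -> float:
    """Hours needed to move data_bytes at a sustained rate in Mbytes/sec."""
    seconds = data_bytes / (mbytes_per_sec * 1_000_000)
    return seconds / 3600


for rate in (100, 150):
    print(f"{rate} Mbytes/sec: {drain_hours(ONE_TB, rate):.1f} hours per TB")
```

Under these assumptions, each TB ties up the holding disk for somewhere around two to three hours of pure streaming, before counting any contention with incoming dumps.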

This whole shift feels more than a bit weird to me; it's upended my perception of what I expect to be slow and what I think of as 'sufficiently fast that I can ignore it'. The progress of hardware over time has made it so the one part that I thought of as fast (and that was designed to be fast) is now probably going to be the slowest.

(This sort of upset in my world view of performance happens every so often, for example with IO transfer times. Sometimes it even sticks. It sort of did this time, since I was thinking about this back in 2014. As it turned out, back then our new fileservers did not stick at 10G, so we got to sleep on this issue until now.)

AmandaWhereSpeedLimits written at 23:28:38


A learning experience about the performance of our IMAP server

Our IMAP server has never been entirely fast, and over the years it has slowly gotten slower and more loaded down. Why this was so seemed reasonably obvious to us; handling mail over IMAP required a fair amount of network bandwidth and a bunch of IO (often random IO) to our NFS fileservers, and there was only so much of that to go around. Things were getting slowly worse over time because more people were reading and storing more mail, while the hardware wasn't changing.

We have a long-standing backwards compatibility issue with our IMAP server, where people's IMAP clients have full access to their $HOME and would periodically go searching through all of it. Recently this started causing us serious problems, like running out of inodes on the IMAP server, and it became clear that we needed to do something about it. After a number of false starts, we wound up doing two important things over the past two months. First we blocked Dovecot from searching through a lot of directories, and then we started manually migrating users one by one to a setup where their IMAP sessions could only see their $HOME/IMAP instead of all of their $HOME. The two changes together significantly reduce the number of files and directories that Dovecot is scanning through (and sometimes opening to count messages).

Well, guess what. Starting immediately with our first change and increasing as we migrated more and more high-impact users, the load on our IMAP server has been dropping dramatically. This is most clearly visible in the load average itself, where it's now entirely typical for the daytime load average to be under one (a level that was previously only achieved in the dead of night). The performance of my test Thunderbird setup has clearly improved, too, rising almost up to the level that I get on a completely unloaded test IMAP server. The change has basically been night and day; it's the most dramatic performance shift I can remember us managing (larger than finding our iSCSI problem in 2012). While the IMAP server's performance is not perfect and it can still bog down at some times, it's become clear that all of the extra scanning that Dovecot was doing was behind a great deal of the performance problems we were experiencing and that getting rid of it has had a major impact.

Technically, we weren't actually wrong about the causes of our IMAP server being slow; it definitely was due to network bandwidth and IO load issues. It's just that a great deal of that IO was completely unproductive and entirely avoidable, and if we had really investigated the situation we might have been able to improve the IMAP server long ago.

(And I think it got worse over time partly because more and more people started using clients, such as the iOS client, that seem to routinely use expensive scanning operations.)

The short and pungent version of what we learned is that IMAP servers go much faster if you don't let them do stupid things, like scan all through people's home directories. The corollary to this is that we shouldn't just assume that our servers aren't doing stupid things.

(You could say that another lesson is that if you know that your servers are occasionally doing stupid things, as we did, perhaps you should try to measure the impact of those things. But that's starting to smell a lot like hindsight bias.)

IMAPPerformanceLesson written at 02:06:21


Some numbers for how well various compressors do with our /var/mail backup

Recently I discussed how gzip --best wasn't very fast when compressing our Amanda (tar) backup of /var/mail, and mentioned that we were trying out zstd for this. As it happens, as part of our research on this issue I ran one particular night's backup of our /var/mail through all of the various compressors to see how large they'd come out, and I think the numbers are usefully illustrative.

The initial uncompressed tar archive is roughly 538 GB and is probably almost completely ASCII text (since we use traditional mbox format inboxes and most email is encoded to 7-bit ASCII). The compression ratios are relative to the uncompressed file, while the times are relative to the fastest compression algorithm. Byte sizes were counted with 'wc -c', instead of writing the results to disk, and I can be confident that the compression programs were the speed limit on this system, not reading the initial tar archive off SSDs.

               Compression ratio   Time ratio
uncompressed   1.0                 0.47
lz4            1.4                 1.0
gzip --fast    1.77                11.9
gzip --best    1.87                17.5
zstd -1        1.92                1.7
zstd -3        1.99                2.4

(The 'uncompressed' time is for 'cat <file> | wc -c'.)

On this very real-world test for us, zstd is clearly a winner over gzip; it achieves better compression with far less time. gzip --fast takes about 32% less time than gzip --best at only a moderate cost in compression ratio, but it's not competitive with zstd in either time or compression. Zstd is not as fast as lz4 but it's fast enough, while providing clearly better compression.
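The comparisons in the previous paragraph can be recomputed directly from the table. This short Python sketch does that; the `RESULTS` dictionary simply restates the table's numbers, and the helper names are made up for illustration.

```python
# (compression ratio, time ratio) for each compressor, copied from the table.
RESULTS = {
    "lz4": (1.4, 1.0),
    "gzip --fast": (1.77, 11.9),
    "gzip --best": (1.87, 17.5),
    "zstd -1": (1.92, 1.7),
    "zstd -3": (1.99, 2.4),
}


def time_saved_percent(faster: str, slower: str) -> float:
    """Percentage of run time saved by using `faster` instead of `slower`."""
    return 100 * (1 - RESULTS[faster][1] / RESULTS[slower][1])


def speedup(faster: str, slower: str) -> float:
    """How many times quicker `faster` is than `slower`."""
    return RESULTS[slower][1] / RESULTS[faster][1]


print(f"{time_saved_percent('gzip --fast', 'gzip --best'):.0f}% less time")
print(f"zstd -3 is {speedup('zstd -3', 'gzip --best'):.1f}x faster than gzip --best")
```

On these figures, zstd -3 comes out a bit over seven times faster than gzip --best while also producing a smaller result.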

We're currently using the default zstd compression level, which is 'zstd -3' (we're just invoking plain '/usr/bin/zstd'). These numbers suggest that we'd lose very little compression from switching to 'zstd -1' but get a significant speed increase. At the moment we're going to leave things as they are because our backups are now fast enough (backing up /var/mail is now not the limiting factor on their overall speed) and we do get something for that extra time. Also, it's simpler; because of how Amanda works, we'd need to add a script to switch to 'zstd -1'.

(Amanda requires you to specify a program as your compressor, not a program plus arguments, so if you want to invoke the real compressor with some non-default options you need a cover script.)
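For illustration, such a cover script could look something like the following Python sketch. This is not the script we use (we don't have one), and it rests on an assumption you should check against amanda.conf(5): that Amanda runs the custom compressor as a stdin-to-stdout filter and passes '-d' when it wants decompression. The `build_argv` and `run` names are hypothetical.

```python
import os
import sys

ZSTD = "/usr/bin/zstd"  # the zstd path mentioned in the text
LEVEL = "-1"            # the non-default compression level we would want


def build_argv(passed_args):
    """Build the zstd command line from whatever arguments we were given."""
    args = list(passed_args)
    if "-d" in args:
        # Decompressing: the compression level is irrelevant, so pass through.
        return [ZSTD] + args
    return [ZSTD, LEVEL] + args


def run(argv):
    """Replace this process with zstd, keeping stdin and stdout as given."""
    os.execv(ZSTD, build_argv(argv[1:]))


# A real cover script would end by calling:
#     run(sys.argv)
```

Using `os.execv` rather than starting a child process means the cover script does not stay around as an extra process between Amanda and zstd.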

Since someone is going to ask, pigz --fast got a compression ratio of 1.78 and a time ratio of 1.27. This is extremely unrepresentative of what we could achieve in production on our Amanda backup servers, since my test machine is a 16-CPU Xeon Silver 4108. The parallelism speed increase for pigz is not perfect, since it was only about 9.4 times faster than gzip --fast (which is single-core).

(Since I wanted to see the absolute best case for pigz in terms of speed, I ran it on all CPUs. I'm not interested in doing more tests to establish how it scales when run with fewer CPUs, since we're not going to use it; zstd is better for our case.)

PS: I'm not giving absolute speeds because these speeds vary tremendously across our systems and also depend on what's being compressed, even with just ASCII text.

BackupCompressionNumbers written at 01:13:23


Today's learning experience is that gzip is not fast

For reasons beyond the scope of this entry, we have a quite large /var/mail and we take a full backup of it every night. In order to save space in our disk-based backup system, for years we've been having Amanda compress these backups on the Amanda server; since we're backing up ASCII text (even if it represents encoded and compressed binary things), they generally compress very well. We did this in the straightforward way; as part of our special Amanda dump type that forces only full backups for /var/mail, we said 'compress server best'. This worked okay for years, which enticed us into not looking at it too hard until we recently noticed that our backups of /var/mail were taking almost ten hours.

(They should not take ten hours. /var/mail is only about 540 GB and it's on SSDs.)

It turns out that Amanda's default compression uses gzip, and when you tell Amanda to use the best compression it uses 'gzip --best', aka 'gzip -9'. Now, I was vaguely aware that gzip is not the fastest compression method in the world (if only because ZFS uses lz4 compression by default and recommends you avoid gzip), but I also had the vague impression that it was reasonably decently okay as far as speed went (and I knew that bzip2 and xz were slower, although they compress better). Unfortunately my impression turns out to be very wrong. Gzip is a depressingly slow compression system, especially if you tell it to go wild and try to get the best compression it can. Specifically, on our current Amanda server hardware 'gzip --best' appears to manage a rate of only about 16 MBytes a second. As a result, our backups of /var/mail are almost entirely constrained by how slowly gzip runs.

(See lz4's handy benchmark chart for one source of speed numbers. Gzip is 'zlib deflate', and zlib at the 'compress at all costs' -9 level isn't even on the benchmark chart.)
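The 16 MBytes a second figure lines up with the observed backup time, as this small Python calculation shows. It assumes decimal units (1 GB = 1000 MBytes) and that gzip is the only thing limiting the backup.

```python
def hours_to_compress(size_gbytes: float, mbytes_per_sec: float) -> float:
    """Hours to push size_gbytes through a compressor at a sustained rate."""
    seconds = (size_gbytes * 1000) / mbytes_per_sec
    return seconds / 3600


print(f"{hours_to_compress(540, 16):.1f} hours")  # prints "9.4 hours"
```

That is close enough to 'almost ten hours' that there's not much room for anything else to be the bottleneck.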

The good news is that there are faster compression programs out there, and at least some of them are available pre-packaged for Ubuntu. We're currently trying out zstd as probably having a good balance between running fast enough for us and having a good compression ratio. Compressing with lz4 would be significantly faster, but it also appears that it would get noticeably less compression.

It's worth noting that not even lz4 can keep up with full 10G Ethernet speeds (on most machines). If you have a disk system that can run fast enough (which is not difficult with modern SSDs) and you want to saturate your 10G network during backups, you can't do compression in-stream; you're going to have to capture the backup stream to disk and then compress it later.

PS: There's also parallel gzip, but that has various limitations in practice; you might have multiple backup streams to compress, and you might need that CPU for other things too.

GzipNotFast written at 02:14:06
