Wandering Thoughts archives

2014-08-31

We don't believe in DHCP for (our) servers

I understand that in some places it's popular for servers to get their IP addresses on boot through DHCP (presumably static IP addresses in most cases). I understand the appeal of this for places with large fleets of servers that are frequently being deployed or redeployed; to put it one way, I imagine that it means touching machines less. However, it is not something that we believe in or do in our core network, for at least two reasons.

First and lesser, it would add an extra step when setting up a machine or doing things like changing its network configuration (for example to move it to 10G-T). Not only would you have to rack it and connect it up but you'd also have to find out and note down the Ethernet address and then enter it into the DHCP server. Perhaps someday we will all have smartphones and all servers that we buy will come with machine readable QR codes of their MACs, but today neither is at all close to being true (and never mind the MACs of ports on expansion cards).

(By the way, pretty much every vendor who prints MACs on their cases uses way too small a font size.)
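
For concreteness, 'entering it into the DHCP server' means maintaining something like the following ISC dhcpd host declaration for every server; the host name, MAC, and IP address here are made up for illustration:

host apps0 {
    hardware ethernet 00:16:3e:12:34:56;
    fixed-address 192.168.10.20;
}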

Second and greater, it would add an extra dependency to the server boot process. In fact it would add several extra dependencies; we'd be dependent not just on the DHCP server being up and having good data, but also on the network path between the booting server and the DHCP server (here I'm thinking of switches, cables, and so on). The DHCP server would become a central point of near total failure, and we don't like having those unless we're forced into it.

(Sure, we could have two or more DHCP servers. But then we'd have to make very sure their data stayed synchronized and we'd be devoting two servers to something that we don't see much or any advantage from. And two servers with synchronized data doesn't protect against screwups in the data itself. The DHCP server data is a real single point of failure where a single mistake has great potential for serious damage.)

A smaller side issue is that we label our physical servers with what host they are, so assigning IPs (and thus hostnames) through DHCP creates new and exciting ways for a machine's label to not match the actual reality. We also have their power feeds labeled in the network interfaces of our smart PDUs, which would create similar possibilities for exciting mismatches.

NoDHCPForServers written at 21:40:13

The downside of expanding your storage through bigger disks

As I mentioned recently, one of the simplest ways of expanding your storage space is to replace your current disks with bigger ones and then tell your RAID system, file system, or volume manager to grow into the new space. Assuming that you have some form of redundancy so you can do this on the fly, it's usually the simplest and easiest approach. But it has some potential downsides.
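
As a minimal sketch of what 'grow into the new space' looks like in ZFS terms (hypothetical pool and device names, and a simple directly attached mirror rather than anything fancier):

zpool set autoexpand=on tank
zpool replace tank c0t1d0 c0t5d0    # resilver onto the first bigger disk
zpool replace tank c0t2d0 c0t6d0    # then its mirror partner
zpool online -e tank c0t5d0 c0t6d0  # expand into the extra capacity

Other environments have their own equivalents (mdadm's --grow, LVM's pvresize, and so on); the mechanics are rarely the hard part.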

The simplest way to put the downsides is that this capacity expansion is generally blind and not so much inflexible as static. Your storage systems (and thus your groups) get new space in proportion to however much space (or how many disks) they're currently using, and that's it. Unless you already have shared storage, you can't reassign this extra new space from one entity to another when, for example, one group with a lot of space doesn't need more but another group with only a little space used is now expanding a lot.

This is of course perfectly fine if all of your different groups or filesystems or whatever are going to use the extra space that you've given them, or if you only have one storage system anyways (so all space flowing to it is fine). But in other situations this rigidity in assigning new space may cause you heartburn and make you wish you could reshape the storage to a greater or lesser degree.

(One assumption I'm making is that you're going to do basically uniform disk replacement and thus uniform expansion; you're not going to replace only some disks or use different sizes of replacement disks. I make that assumption because mixed disks are as much madness as any other mixed hardware situation.)

BiggerDiskExpansionIssue written at 00:35:08

2014-08-27

One reason why we have to do a major storage migration

We're currently in the process of migrating from our old fileservers to our new fileservers. We're doing this by manually copying data around instead of anything less user visible and sysadmin intensive. You might wonder why, since the old environment and the new one have the same architecture (ZFS, iSCSI backends, mirroring, and so on). One reason is that the ZFS 4K sector issue has forced our hand. But even if that wasn't there it turns out that we probably would have had to do a big user visible migration because of a collision of technology and politics (and ZFS limitations). The simple version of the organizational politics is that we can't just give people free space.

Most storage migrations will increase the size of the disks involved, and ours was no exception; we're going from a mixture of 750 GB and 1 TB drives to 2 TB drives. If you just swap drives out for bigger ones, pretty much any decent RAID or filesystem can expand into the newly available space without any problems at all; ZFS is no exception here. If you went from 1 TB to 2 TB disks in a straightforward way, all of your ZFS pools would or could immediately double in size.

(Even if you can still get your original sized disks during an equipment turnover, you generally want to move up in size so that you can reduce the overhead per GB or TB. Plus, smaller drives may actually have worse cost per GB than larger ones.)

Except that we can't give space away for free, so if we did this we would have had to persuade people to buy all this extra space. There was very little evidence that we could do this, since people weren't exactly breaking down our doors to buy more space as it stood.

If you can't expand disks in place and give away (or sell) the space, what you want (and need) to do is to reshape your existing storage blobs. If you were using N disks (or mirror pairs) for a storage blob before the disk change, you want to move to using N/2 disks afterwards; you get the same space but on fewer disks. Unfortunately ZFS simply doesn't support this sort of reshaping of pool storage with existing pools; once a pool has so many disks (or mirrors), you're stuck with that many. This left us with having to do it by hand, in other words a manual migration from old pools with N disks to new pools with N/2 disks.

(If you've decided to go from smaller disks to larger disks you've already implicitly decided to sacrifice overall IOPs per TB in the name of storage efficiency.)
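
A minimal sketch of that by-hand reshaping in ZFS terms, with hypothetical pool, disk, and filesystem names (our real migration involves iSCSI backends and a number of additional steps):

# a new pool with half as many (but bigger) mirror pairs
zpool create newtank mirror c5t0d0 c6t0d0 mirror c5t1d0 c6t1d0
# copy a filesystem across
zfs snapshot -r tank/somefs@migrate
zfs send -R tank/somefs@migrate | zfs receive -F newtank/somefs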

Sidebar: Some hacky workarounds and their downsides

The first workaround would be to double the raw underlying space but then somehow limit people's ability to use it (ZFS can do this via quotas, for example). The problem is that this will generally leave you with a bit less than half of your space allocated but unused on a long term basis unless people wake up and suddenly start buying space. It's also awkward if the wrong people wake up and want a bunch more space, since your initial big users (who may not need all that much more space) are holding down most of the new space.
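
In ZFS terms this first workaround is basically a one-liner, along the lines of the following (hypothetical pool name and size):

zfs set quota=1T tank    # hold the visible space at roughly the old size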

The second workaround is to break your new disks up into multiple chunks, where one chunk is roughly the size of your old disks. This has various potential drawbacks, but in our case it would simply have put too many individual chunks on one disk; our old 1 TB disks already had four (old) chunks, which was about as many ways as we wanted to split up one disk.

StorageGrowthPolitics written at 23:52:00

2014-08-21

How data flows around on the client during an Amanda backup

An Amanda backup session involves a lot of processes running on the Amanda client. If you're having a problem with slow backups it can be somewhere between useful and important to understand how everything is connected to everything else and where things go.

A disclaimer: my current information is probably incomplete and it only covers one case. Hopefully it will at least give people an idea of where to start looking and how data is likely to flow around their Amanda setup.

Let's start with an ASCII art process tree of a tar-based backup with indexing (taken from an OmniOS machine with Amanda 3.3.5):

amandad
   sendbackup (1)
      sendbackup (2)
         sh -c '....'
            tar -tf -
            sed -e s/^\.//
      tar --create --file - /some/file/system ....

(In real life the two sendbackup processes are not numbered in pstree or whatever; I'm doing it here to be clear which one I'm talking about.)

Call the 'sh -c' and everything under it the 'indexing processes' and the 'tar --create' the 'backup process'. The backup process is actually making the backup; the indexing processes are reading a copy of the backup stream in order to generate the file index that amrecover will use.

Working outwards from the backup process:

  • the backup process is reading from the disk and writing to its standard output. Its standard output goes to sendbackup (2).

  • sendbackup (2) reads its standard input, which is from the backup process, and basically acts like tee; it feeds one copy of the data to amandad and another into the indexing processes (there's a rough shell sketch of this splitting after the list).

  • the indexing processes read standard input from sendbackup (2) and wind up writing their overall standard output to amandad (the tar and sed are in a shell pipeline).

  • sendbackup (1) sits there reading the standard error of all processes under it, which I believe it forwards to amandad if it sees any output. This process will normally not be doing anything.

  • amandad reads all of these incoming streams of data (the actual backup data, the index data, and the standard error output) and forwards them over the network to your Amanda server. If you use a system call tracer on this amandad, you'll also see it reading from and writing to a pipe pair. I believe this pipe pair is being used for signaling purposes, so that the process can wait for both file descriptor activity and other notifications at the same time.

    (Specifically this is being done by GLib's g_wakeup_signal() and g_wakeup_acknowledge() functions, based on tracing library call activity. I believe they're called indirectly by other GLib functions that Amanda uses.)
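
As a rough shell analogy of the splitting that sendbackup (2) does (this is not Amanda's actual code, just an illustration, and it uses bash's process substitution):

tar --create --file - /some/file/system |
  tee >(tar -tf - | sed -e 's/^\.//' >/tmp/index.list) >/tmp/backup.stream

Here tee stands in for sendbackup (2), the process substitution stands in for the indexing processes, and the final redirect stands in for the data flowing to amandad.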

Under normal circumstances, everything except sendbackup (1) will be churning away making various system calls, primarily to read and write from various file descriptors. The overall backup performance will be bounded by the slowest component in the whole pipeline of data shuffling (although the indexing processes shouldn't normally be any sort of bottleneck).

Depending on the exact Amanda auth setting you're using, streams from several simultaneous backups may flow from your amandad process to the server either as multiple TCP streams or as one multiplexed TCP stream. See amanda-auth(7) for the full details. In all cases I believe that all of this communication runs through a single amandad process, making it a potential bottleneck.

(At this point I don't know whether amandad is an actual bottleneck for us or something else weird is going on.)

(I knew some of the Amanda internal flow at the time that I wrote ReasoningBackwards, but I don't think I wrote it down anywhere apart from implicitly in that entry, and in any case I knew it in less detail than I do now.)

AmandaBackupDataFlows written at 00:42:45

2014-08-17

The challenges of diagnosing slow backups

This is not the techblog entry I thought I was going to write. That entry would have been a quietly triumphal one about finding and squelching an interesting and obscure issue that was causing our backups to be unnaturally slow. Unfortunately, while the particular issue is squelched, our backups of at least our new fileservers are still (probably) unnaturally slow. Certainly they seem to be slower than we want. So this entry is about the challenge of trying to figure out why your backups are slow (and what you can do about it).

The first problem is that unless the cause is quite simple, you're going to wind up needing to make detailed observations of your systems while the backups are running. In fact you'll probably have to do this repeatedly. By itself this is not a problem as such. What makes it into one is that most backups are run out of hours, often well out of hours. If you need to watch the backups and the backups start happening at 11pm, you're going to be staying up. This has various knock-on consequences, including that human beings are generally not at their sharpest at 11pm.

(General purpose low level metrics collection can give you some immediate data but there are plenty of interesting backup slowdowns that cannot be caught with them, including our recent one. And backups run during the day (whether test runs or real ones) are generally going to behave somewhat differently from nighttime backups, at least if your overall load and activity are higher in the day.)

Beyond that issue, a running backup system is generally a quite complex beast with many moving parts. There are probably multiple individual backups in progress in multiple hosts, data streaming back to backup servers, and any number of processes in communication with each other about all of this. As we've seen, the actual effects of a problem in one piece can manifest far away from that piece. In addition pieces may be interfering with each other; for example, perhaps running enough backups at once on a single machine causes them to contend for a resource (even an inobvious one, since it's pretty easy to spot saturated disks, networks, CPU, et al).

Complex systems create complex failure modes, which means that there are a lot of potential inobvious things that might be going on. That's a lot of things to winnow through for potential issues that pass the smell test, don't contradict any evidence of system behavior that you already have, and ideally that can be tested in some simple way.

(And the really pernicious problems don't have obvious causes, because if they did they would be easy to at least identify.)

What writing this tells me is that this is not unique to backup systems and that I should probably sit down to diagram out the overall backup system and its resources, then apply Brendan Gregg's USE Method to all of the pieces involved in backups in a systematic way. That would at least give me a good collection of data that I could use to rule things out.
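
As a concrete starting point for the utilization and saturation side of that, here is my own starter list of tools to leave running on a backup client during a backup window (shown for a Linux client; the illumos equivalents differ a bit):

iostat -xz 5     # per-disk utilization, queue lengths, service times
vmstat 5         # CPU usage, run queue length, memory pressure
sar -n DEV 5     # per-interface network throughput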

(It's nice to form hypotheses and then test them and if you get lucky you can speed everything up nicely. But there are endless possible hypotheses and thus endless possible tests, so at some point you need to do the effort to create mass winnowings.)

SlowBackupsChallenge written at 00:49:00

2014-08-15

Caches should be safe by default

I've been looking at disk read caching systems recently. Setting aside my other issues, I've noticed something about basically all of them that makes me twitch as a sysadmin. I will phrase it this way:

Caches should be safe by default.

By 'safe' I mean that if your cache device dies, you should not lose data or lose your filesystem. Everything should be able to continue on, possibly after some manual adjustment. The broad purpose of most caches is to speed up reads; write accelerators are a different thing and should be labeled as such. When your new cache system is doing this for you, it should not be putting you at a greater risk of data loss because of some choice it's quietly made; if writes touch the cache at all, it should default to write-through mode. To do otherwise is just as dangerous as those databases that achieve great speed through quietly not really committing their data to disk.
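
As one concrete illustration of the safe model (my example, not one of the caching systems I've been looking at): ZFS's L2ARC only ever holds clean copies of data that is also in the pool, so losing the cache device costs you performance, not data. With hypothetical names:

zpool add tank cache c0t9d0    # read cache only; safe to lose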

There is a corollary that follows from this:

Caches should clearly document when and how they aren't safe.

After I've read a cache's documentation, I should not be in either doubt or ignorance about what will or can happen if the cache device dies on me. If I am, the documentation has failed. In particular, it had better document the default configuration (or the default examples or both), because the default configuration is what a lot of people will wind up using. As a corollary to the corollary, the cache documentation should probably explain what I get for giving up safety. Faster than normal writes? It's just required by the cache's architecture? Avoiding a write slowdown that the caching layer would otherwise introduce? I'd like to know.

(If documenting this stuff makes the authors of the cache embarrassed, perhaps they should fix things.)

As a side note, I think it's fine to offer a cache that's also a write accelerator. But I think that this should be clearly documented, the risks clearly spelled out, and it should not be the default configuration. Certainly it should not be the silent default.

CachesShouldBeSafe written at 22:40:48

2014-08-10

The problem with self-contained 'application bundle' style packaging

In a comment on my entry on FreeBSD vs Linux for me, Matt Campell asked (quoting me in an earlier comment):

I also fundamentally disagree with an approach of distributing applications as giant self contained bundles.

Why? Mac OS X, iOS, and Android all use self-contained app bundles, and so do the smarter third-party developers on Windows. It's a proven approach for packaged applications.

To answer this I need to add an important bit of context that may not have been clear in my initial comment and certainly isn't here in this extract: I was talking specifically about PC-BSD and more generally about the idea that the OS provider would distribute their packages this way.

Let's start with a question. Suppose that you start with a competently done .deb or RPM of Firefox and then convert it into one of these 'application bundles' instead. What's the difference between the contents of the two packagings of Firefox? Clearly it is that some of Firefox's dependencies are going to be included in the application bundle, not just Firefox itself. So what dependencies are included, or to put it another way, how far down the library stack do you go? GTK and FreeType? SQLite? The C++ ABI support libraries? The core C library?

The first problem with including some or all of these dependencies is that they are shared ones; plenty of other packages use them too. If you include separate copies in every package that uses them, you're going to have a lot of duplicate copies floating around your system (both on disk and in memory). I know disk and RAM are both theoretically cheap these days, but yes this still matters. In addition, packaging copies of things like GTK runs into problems with stuff that was designed to be shared, like themes.

(A sufficiently clever system can get around the duplication issue, but it has to be really clever behind the backs of these apparently self contained application bundles. Really clever systems are complex and often fragile.)

The bigger problem is that the capabilities enabled by bundling dependencies will in practice essentially never be used for packages supported by the OS vendor. Sure, in theory you could ship a different minor version of GTK or FreeType with Firefox than with Thunderbird, but in practice no sane release engineering team or security team will let things go out the door that way because if they do they're on the hook for supporting and patching both minor versions. In practice every OS-built application bundle will use exactly the same minor version of GTK, FreeType, the C++ ABI support libraries, SQLite, and so on. And if a dependency has to get patched because of one application, expect new revisions of all applications.

(In fact pretty much every source of variation in dependencies is a bad idea at the OS vendor level. Different compile options for different applications? Custom per-application patches? No, no, no, because all of them drive up the support load.)

So why is this approach so popular in Mac OS X, iOS, Windows, and so on? Because it's not being used by the OS vendor. Creators of individual applications have a completely different perspective, since they're only on the hook to support their own application. If all you support is Firefox, there is no extra cost to you if Thunderbird or Liferea is using a different GTK minor version because updating it is not your responsibility. In fact having your own version of GTK is an advantage because you can't have support costs imposed on you because someone else decided to update GTK.

ApplicationBundleProblems written at 23:32:03

2014-08-05

Another piece of my environment: running commands on multiple machines

It's my belief that every sysadmin who has a middling number of machines (more than one or two and less than a large fleet) sooner or later winds up with a set of tools to let them run commands on each of those machines or on subsets of those machines. I am no exception, and over the course of my career I have adopted, used, and built several iterations of this.

(People with large fleets of machines generally never run commands on them by hand but have some sort of automated fleet management system based around Puppet, Chef, CFEngine, Salt, Ansible, or the like. Sometimes these systems come with the ability to do this for you.)

My current iteration is based around simple shell scripts. The starting point is a shell script called machines; it prints out the names of machines that fall into various categories. The categorization of machines is entirely hand maintained, which has both problems and advantages. As a result the whole thing looks like:

mach() {
  # Expand each argument; class names turn into lists of machines and
  # anything unrecognized is passed through as a machine name.
  for i in "$@"; do
    case "$i" in
    apps) echo apps0 apps1 apps2 apps3 testapps;;
    ....
    ubuntu) echo `mach apps comps ...` ...;;
    ....
    *) echo $i;;
    esac
  done
}

# Flatten the output onto a single space-separated line.
mach "$@" | tr '\012' ' '; echo

(I put all of the work in a shell function so that I could call it recursively, for classes that are defined partly in terms of other classes. The 'ubuntu' class here is an example of that.)
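
Given the definitions above, using it looks something like this:

$ machines apps
apps0 apps1 apps2 apps3 testapps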

So far we have few enough machines and few enough categories of machines that I'm interested in that this approach has not become unwieldy.

(There is also a script called mminus for doing subtractive set operations, so I can express 'all X machines except Y machines' or the like. This comes in handy periodically.)
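
The details of mminus don't matter much, but as a sketch of one way such a set subtraction could be written on top of machines (this is a guess at an implementation, not the actual script):

#!/bin/sh
# usage: mminus 'classes or machines' 'classes or machines to exclude'
machines $1 | tr ' ' '\012' | sort -u >/tmp/mm1.$$
machines $2 | tr ' ' '\012' | sort -u >/tmp/mm2.$$
comm -23 /tmp/mm1.$$ /tmp/mm2.$$ | tr '\012' ' '; echo
rm -f /tmp/mm1.$$ /tmp/mm2.$$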

The main script for actually doing things is called oneach, which does what you might think: given a list of machines it runs a command line on each of them via ssh. You can ask it to run the command in a pseudo-tty and without any special output handling, but normally it runs the command just with 'ssh machine command' and it prefixes all output with the name of the machine; you can see an example of that in this awk-based reformatting problem (an oneach run produced the input for my problem). Because I like neat formatting, oneach has an option to align the starting column of all output and I usually use that option (via a cover script called onea, because I'm lazy). The oneach script doesn't try to do anything fancy with concurrent execution or the like; it just does one ssh after the other.
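
As a minimal sketch of an oneach-style script (not the actual one, and without the alignment or pseudo-tty options):

#!/bin/sh
# usage: oneach 'machine1 machine2 ...' command [args ...]
mlist="$1"; shift
for m in $mlist; do
  ssh -n "$m" "$@" 2>&1 | sed "s/^/$m: /"
done

With the machines script from before, running something everywhere is then just 'oneach "$(machines ubuntu)" uptime' or the like.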

Finally, I've found it useful to have another script that I call replicate. Replicate uses rsync to copy one or more files to destination machines or machine classes (it can also use scp for some obscure cases). replicate is handy for little things like pushing changes to dotfiles or scripts out to all of the machines where I have copies of them.
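
Again as a sketch rather than the real thing, the heart of a replicate-style script is a loop around rsync that pushes each file to the same path on every target machine:

#!/bin/sh
# usage: replicate 'machine1 machine2 ...' file [file ...]
mlist="$1"; shift
for m in $mlist; do
  for f in "$@"; do
    rsync -a "$f" "$m:$f"
  done
done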

As a side note, machines has become a part of my dmenu environment. I use its list of machines as one of the inputs to dmenu's autocompletion (both for normal logins and for special '@<machine>' logins as root), which makes it really quick and convenient to log into most of our machines (this large list of machines is part of the things I hide from dmenu's initial display of completion in a little UI tweak that turned out to be quite important for me).

Note that I don't necessarily suggest that you adopt my approach for running commands on your machines, which is one reason I'm not currently planning to put these scripts up in public. There are a lot of ways to solve this particular problem, many of them better and more scalable than what I have. I just think that you should get something better than manual for loops (which is what I was doing before I gave in and wrote machines, oneach, and so on).

ToolsOneach written at 00:00:04

2014-08-04

Why our new SAN environment is separate from our old SAN environment

Our move from our old ZFS fileservers to our new ZFS fileservers is a total move; we're replacing both the fileservers and the backends and migrating all of the data to newly created ZFS pools (primarily because of the ZFS 4K sector disk issue). However, there are at least two ways to do this overall move.

The first way is to build the new environment out so that it is completely separated from the old environment and both run independently. The new fileservers get new virtual fileserver names and migrating a filesystem from the old to the new setup changes its NFS fileserver (which is slightly user visible) and requires various administrative changes, eg to our backup system. This is the straightforward and obvious approach.

ZFS gives us the option of a second, perhaps more clever way. First, connect up the new fileservers and backends to the existing iSCSI fabric. Then we could have the new fileservers take over from the existing fileservers, importing the existing ZFS pools from the existing iSCSI backends and using the same virtual fileserver names; from the outside it would be as if each fileserver had a (slow) reboot. Once the old pools were up and running, we could create new pools on the new storage and copy filesystems across (unfortunately the NFS clients would probably have to unmount and remount the filesystem during the migration).

(We would need to finesse some issues, but it's doable.)

I'll end any suspense here: we're taking the first approach. The new fileserver environment isn't connected in any way to the old one (in specific, the iSCSI networks are completely disconnected from each other). This isn't a decision that we directly considered, but I think that was because we all thought it was so obvious a choice that we didn't even need to talk about it. A fully independent new environment might or might not be somewhat more work, but it is undeniably less risky. We don't have to worry that the new environment will perturb the functioning of the old one or vice versa, because they can't.

(I admit that part of this is technical because it's not clear how we'd interconnect the 10G and 1G iSCSI networks so that fileservers on the 10G network had enough bandwidth to talk to the 1G iSCSI backends at decent aggregate speeds (our existing 1G iSCSI switches don't have SFP+ ports). If there's not enough bandwidth for decent performance, we might as well not interconnect them and then we're in the 'fully independent' setup.)

SANWhyTwoSeparate written at 00:10:52

