2011-09-30
Some thoughts from a close view of catastrophe
The Computer Science department (which I work for) is spread out over three buildings, which means that we have switch racks in wiring closets or machine rooms in all three buildings. For historical reasons, our switch racks in the smallest building are in an Electrical Engineering machine room. EE mostly uses the machine room to house some research clusters and computing, but they also have their switches for the building in a rack by ours in the corner. The weekend before last, a series of events led to this machine room's air conditioning failing and then a single ceiling water sprinkler activating. The sprinkler apparently ran for at least 45 minutes before it was shut off and for extra points, the power was live in the room while this was happening.
(Also in the room at the time was a just purchased, just barely turned on half rack of compute servers that belongs to one of our research groups.)
The EE department's building switches avoided the water entirely by about an inch. Most of our switches got wet and some of them died (we poured water out of a number of them, some of which seem to actually still work now that they've dried out). But the research machines were drenched (in fact often literally flooded), especially one densely packed full-height rack of compute servers that was basically in the worst possible location relative to the sprinkler and its flood of gunk and grunge laden water.
For the EE networking people this is a narrow escape from a bad situation (they have two relatively high end, relatively expensive routing switches in the room). For us it's unpleasant, but we have spares (well, had spares, many of them are now deployed).
For the researchers it's been catastrophic. It's now almost two weeks since the incident and their machines are still off, just sitting there. At least some of them may never be powered on again. A lot of their hard drives are probably dead, along with some unknown amount of other equipment like switches and KVMs. It's almost certain to be more weeks before there's any prospect of reassembling a running cluster. In a very real way they've lost the entire machine room for weeks.
As I've watched things unfold and periodically gone by the machine room to see that things are still powered off, I can't help but think uncomfortable thoughts. Our machine room is about a hundred yards away from this machine room. It has sprinklers, many of them at least as old as the one that activated. This could have happened to us.
We could be looking at all of our central machines being down for weeks; all of our fileservers, all of our backends, all of our login servers, all of the firewalls and routers, everything. We could be looking at trying to glue together some sort of vaguely functional environment on a crash basis using whatever spare hardware we can scrounge up or beg from people. We could be trying to prioritize what services come back and who gets their data restored versus who has to wait until we have enough disk space to hold it all.
I look at the EE machine room, and I can't help but think 'thank whatever that it's not us'.
2011-09-28
A multilevel view of DevOps (with more balance)
Phil Hollenback in a comment on my earlier entry:
My take is that DevOps Means Don't Be An A-Hole. That's kind of a compliment to what you are saying I think - devops is about improving communications.
I don't think it's this simple. Instead I think that the broad label of 'DevOps' is a reaction to three different sets of organizational problems, which I am going to list in decreasing order of severity.
First is the blame problem, where developers get blamed for not delivering features and operations gets blamed for services not being available. If you have the blame problem it trumps everything else, because the interests of developers and operations are fundamentally opposed to each other. All of the communication in the world cannot fix that.
Next is problems of ignorance, where development and operations don't understand each other's environments, problems, and constraints. It's stereotypical for sysadmins (me included) to see this as primarily a development problem where developers have ignored issues like installation, logging, performance, and operational reliability, but I'm sure that there are things about development that operations doesn't get. Fixing this requires education and possibly making development and operations care enough about each other's problems to get that education.
(See Ted Dziuba for some ways to make developers care.)
The final set of problems are cultural ones, where the two groups are assholes to each other mostly because that's how they've always behaved and it's just how operations and development deal with each other. If this is your only 'DevOps' problem, you can indeed fix it with just better communication and more respect.
(The quibble here is that sometimes cultural issues have roots in things like the amount of respect and pay that goes to various groups, because people are people.)
Each level of organizational problem creates the levels of problems below it (unless you're really lucky and have a staff of selfless, motivated saints). An organization with the blame problem is going to have cultural problems and almost certainly ignorance problems; an organization with ignorance issues is going to grow cultural ones. It follows that you need to fix problems from the top down in order to make lasting changes.
(Yes, this is a more balanced view of DevOps than my first entry on it. Sometimes I run a little bit too hard with an idea.)
2011-09-26
DevOps and the blame problem: an outsider's view
I'm an outsider on the whole DevOps movement (we don't have anything like a traditional operations or development environment here), but from my outside perspective it looks like DevOps is really an attempt to deal with the blame problem.
The traditional organization's approach is to blame operations when services aren't running (or aren't running well enough) and to blame development when features aren't delivered. When you blame devs for not delivering features, well, your devs deliver features but not necessarily things like stability and performance. Since ops is not stupid, it then does its best to refuse to install or change things from development, or even let them near what it'll get blamed for.
(Even if ops is either stupid or optimistic, it does not take very many rounds of 'do thing for devs, world explodes, get yelled at' for the negative conditioning to sink in.)
You can tell development that stability, performance, installability and so on are important too. But it doesn't help; when you blamed devs for not delivering features, you told them what their first and foremost priority was and you're going to get exactly what you asked for. Equally, you can tell ops that being responsive to development is important (either directly or by invoking the good of the whole business) but when you blamed them for services dying you told them what their big priority was. This is not surprising in the least; people are very good and very motivated at not getting yelled at.
(Some people think that they can fix the ops side of the problem by blaming ops both if services go down and if development updates aren't deployed promptly. This is a great way to lose anyone in ops who's smart enough to realize that they've been given all of the responsibility and none of the power. Or to put it the other way: people who get yelled at all the time quit.)
At its best, DevOps transforms this to 'devops gets blamed when features aren't there and reliable on the site', joining together both things that you actually want. At its worst, DevOps at least gives the sucker with all of the responsibility some power as well.
(See also Ted Dziuba.)
Sidebar: why ops gets the short end of the traditional stick
Developers at least have the chance of exceeding expectations and thus earning praise; they can deliver features really fast or they can deliver really impressive features, work beyond what people expected was reasonably possible. And anyone can be impressed by a well executed feature because it's generally quite visible.
Ops, well, how does ops exceed expectations? Ops are like janitors; we assume that clean buildings and working services are the natural state of things. You don't get points for either. It's just your job. But miss a spot and boy, the blame rolls in.
(Ops gets praise in exactly the situations where people understand that something exceptional is going on, which generally requires a disaster. Unfortunately this breeds a tendency towards 'heroism'.)
2011-09-14
Graphs are not enough (for your monitoring system)
There are a lot of monitoring systems out there that will accumulate historical data and draw you pretty graphs of it, and certainly it's useful to use one of them. However, these graphs are not enough by themselves and you should not settle for a closed monitoring system that only does graphs.
Graphs are good to look at to get a quick overview, and looking at them can show you things that you hadn't noticed before. But there are a lot of questions that you cannot answer and things that you cannot see just from looking at graphs, and many other things are hard to see on them. Thus, what you really need is for your monitoring system to give you the raw historical data in a documented format so that you can do your own data analysis with whatever stats tools you like. As a bonus, this allows you to graph whatever you want (and on whatever scale you want, and in whatever form of graph you like), instead of having to rely on whatever graphs the monitoring system is willing to give you.
(Note that one reason to use a stats package instead of just reading graphs is that reading (or closely estimating) numbers off graphs is hard. Graphs are designed for overviews and for quickly visualizing things, not for answering questions about specifics. Specifics require either the raw data or direct numbers from doing an analysis of the raw data.)
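As a concrete illustration, here is a minimal sketch (Python, standard library only) of answering a specific question from exported raw data. It assumes your monitoring system can dump a metric's history as 'timestamp,value' CSV rows; the file name and the two-second threshold are made up for the example.

#!/usr/bin/env python3
# A minimal sketch of answering a specific question from raw monitoring
# data instead of eyeballing a graph. It assumes the monitoring system
# can export a metric's history as 'timestamp,value' CSV rows; the file
# name and the 2.0 second threshold are made-up examples.
import csv
import statistics

def load_values(path):
    """Pull the metric values out of an exported CSV file."""
    values = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            try:
                values.append(float(row[1]))
            except (IndexError, ValueError):
                # Skip header lines and malformed rows.
                continue
    return values

values = load_values("webserver-response-times.csv")
p95 = statistics.quantiles(values, n=100)[94]
slow = sum(1 for v in values if v > 2.0)
print(f"samples: {len(values)}")
print(f"95th percentile: {p95:.3f}s")
print(f"samples over 2.0s: {slow} ({slow / len(values):.1%})")

The point isn't this particular calculation; it's that once you have the raw data you can compute whatever your local questions actually require.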
You might be tempted to say that the monitoring system should do this for you and that the lack of a graph shows that the monitoring system is incomplete. The problem is that any system will always be incomplete for someone, because different places need different statistics.
(The corollary to this is that the best way to view your monitoring system's graphs is as something that covers the easy or obvious cases, not as a comprehensive solution.)
2011-09-09
You really want to put your switches in server racks
Once upon a time, not that long ago, when we were perhaps smaller and switches were certainly more expensive, we put our switches in network racks over on one side of the machine room, all of our servers in server racks, and ran cables under our machine room's raised floor from the servers to the switches. Please learn from our painful experience and don't do that; put almost all of your switches in server racks.
Yes, really. Even if this requires putting a stack of switches in a rack to get enough ports (or enough subnets, if you sensibly use one switch per subnet). Even if you need to put one switch in the front of the rack and another in the rear, just to get enough in (switches are shallow, you can usually pull this trick off).
Why you want to do this is simple. The more network cables you run under the floor, the more you discover the charms of machine room archaeology and the more time you will spend trying to trace and pull old cables when you remove old machines. (Unless you don't have the time to pull 'harmless' unused cables, or you're going to get around to it on some slow day, or you're leaving the old cable in place for now because you're pretty sure you're going to put a new machine in the rack in a bit and you'll just be re-running a cable so let's save some work. Then it gets worse.)
Putting as many switches as necessary in your racks means that you'll run roughly one network cable per switch back to your core switch interconnect points, instead of one or more cables per server. This is a lot fewer cables under the floor (or overhead if you use overhead cable trays, and they get messy too), and that is a very good thing. It also makes it a lot easier to remove and add cables as you remove and re-add servers, which usually drastically increases the chances that you'll actually do it.
Four years ago when I wrote RackNetworking, we had just begun to think about moving away from our old approach and towards putting a bunch of switches in server racks. Since then we've almost entirely moved to the server rack approach, but we still have a number of machines that were cabled up with the old under-the-floor approach; every time I have to clean up after one of those machines (as I had to today), I'm reminded of how much better the new approach is.
Sidebar: our answer for uplink bandwidth
One of my concerns back in RackNetworking was uplink bandwidth from the server rack switches to the core interconnect. In practice this has not been an issue for us, because most of our machines are not heavy bandwidth consumers. We continue to run direct connections to the core interconnect switches for the few machines where we think it may actually matter; I wrote the details up in my writeup on how our network is implemented.
Things that could happen to your archives
In the spirit of my old entry on things that could happen to your backups and to reinforce yesterday's entry on not trying to archive things, here's an incomplete list of things that have been known to go wrong with archives. If you're thinking of doing archives, you should be thinking about how you're going to avoid these.
- you aren't archiving everything you need to archive.
- the archive program doesn't work right; it writes a corrupt or
incomplete archive, fails to notice or complain enough about read
errors, or its archive doesn't capture a consistent and usable
state of whatever you want to archive.
With archives you should definitely be doing a full read of the archive and verifying it against the data on disk before you remove anything from disk.
(In general archives are subject to many of the woes of backups. Take them as read.)
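(As an illustration of what that verification might look like, here's a minimal sketch in Python. It assumes the archive is a plain tar file whose member names match paths under the source directory, and both paths here are hypothetical stand-ins for your real ones.)

#!/usr/bin/env python3
# A minimal sketch of verifying a tar archive against the on-disk data
# before you delete anything. It assumes a plain (possibly compressed)
# tar file whose member names are paths relative to 'srcdir'; the
# archive name and source directory below are hypothetical examples.
import hashlib
import os
import tarfile

def sha256(fileobj):
    h = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(1 << 20), b""):
        h.update(chunk)
    return h.hexdigest()

def verify(archive, srcdir):
    problems = []
    with tarfile.open(archive) as tar:
        for member in tar:
            if not member.isfile():
                continue
            ondisk = os.path.join(srcdir, member.name)
            try:
                with open(ondisk, "rb") as f:
                    disk_sum = sha256(f)
            except OSError as e:
                problems.append(f"{member.name}: cannot read on-disk copy: {e}")
                continue
            if sha256(tar.extractfile(member)) != disk_sum:
                problems.append(f"{member.name}: archive and disk differ")
    return problems

for p in verify("project-archive.tar.gz", "/data/project"):
    print(p)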
- the archive media degrades over time.
This is what most everyone talks about, and for good reason; if your data isn't there any more, nothing else matters. But it's only the tip of the iceberg for what you need and what can go wrong.
- one or more pieces of archive media were physically damaged or
destroyed due to a mishap, accident, water leak, fire, etc.
If you care about real archives, you need more than one copy of any piece of data (and they should not be in the same place). Accidents and mishaps happen, especially to things sitting in the corner.
- you've lost track of one or more pieces of archive media; they're stored somewhere, but you don't know specifically where any more.
- in general you've lost track of what media you have and/or what data you've archived.
- you've lost track of what is on each piece of archive media, so
while you know you have an archival copy of <X> you don't know
which one of fifty tapes it's on (and no one is going to go search
through all fifty tapes unless it is really, really important).
- you don't have anything that can read the media any more.
- the media reading hardware that you carefully saved has quietly stopped working sometime during the years that it was in storage.
- you can't connect the media reading hardware to any of your current systems; it requires an obsolete interface that is no longer supported.
- you have an interface card for the obsolete interface you need, but
it uses a bus type that is no longer supported on your machines.
(I have some PCI SCSI cards. The odds that I will be able to put them in machines drops by the day.)
- you have all of the hardware you need and you even saved cables too,
but the OS driver for the hardware was removed several years ago
after it became unmaintained because no kernel hacker had a copy
of the hardware to test with any more.
- all of your hardware works for the first N tapes (or disks, or whatever),
then something breaks due to the amount of wear you're putting
on old hardware. Since it's all obsolete hardware, there are no
longer any spare parts, maintenance and cleaning kits, or the
expertise to use any of these even if you had them.
- you didn't write down what format the archives are in because it was obvious at the time.
- you don't have any software that can read the archive format.
- the details of the archive format either were never documented or
were only documented in ancient documentation that you got rid
of years ago. You earn bonus irony points if you carefully included
the documentation in your archives.
- the software you have that can read the archive format doesn't run on any of your current machines.
- the old OS you need to run the software to read the archive format doesn't work on any of your current machines.
- you have source code for software to read the archive format, but
it doesn't compile on the current version of the OS because the
compiler has gotten stricter, the library interfaces have changed,
and the OS has moved from 32-bit to 64-bit.
- your commercial archiving system requires a license key, but the
company that made it is out of business now and certainly not
issuing any new ones. Your old license key expired five years
ago.
(Yes, there are people who do long term archiving with commercial software.)
- you have forgotten all of the details about how to work with the
media, the archive format, and any surviving software. In theory
you could with sufficient effort re-master all of the pieces and
reverse engineer the format and extract the data. In practice you
don't have the time to do all of this (because it is not a high
enough priority), and so the archives are unreadable and
will never be extracted.
It's common to discover this shortly before your last media reader is decommissioned, because this is when everyone decides that you should move the data from the old media (and format) on to some new media. This is often the first time anyone has thought about the archives for years.
(Even if you can remember all of this, it not infrequently turns out that you simply don't have enough time to cycle all of your old media through to read all of the data off of it.)
There are probably many more, but I have less painful experience with archives than I do with backups.
(Although we had an interesting time when the last 9-track reel to reel tape drive was being taken out of service. I don't think we got all of the old historical 9-track tapes copied that we wanted to.)
2011-09-08
Archival storage in the modern world
Today, the following got asked on a university-wide mailing list for sysadmins:
I've had a request from a research lab about the availability of long term (10 years) backups. The amount of data will be roughly 10 - 20T by the end of that period growing at an estimated 1T/yr. [...]
(This isn't really backup, this is archiving.)
My view is that the right answer is not to archive the data at all. If you care about long term availability of some data, practically the last thing you want to do is archive it, because reliable archives are hard. Instead you want to keep it on live disks on a live fileserver (using RAID and ideally a filesystem that has data checksums), and just do backups.
(You're doing backups because RAID is not a backup. You're using some sort of checksums on the data so that you can notice corruption before you overwrite the last of your uncorrupt, pre-corruption rolling backups.)
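If you can't use a filesystem that checksums data itself, you can approximate this from user level with a checksum manifest that you rebuild and compare periodically (from cron, say). Here's a minimal sketch; the data directory and manifest location are invented for the example, and a real version would also need a way to bless legitimate changes.

#!/usr/bin/env python3
# A minimal sketch of noticing silent data corruption on a live
# fileserver when the filesystem doesn't checksum data itself: record a
# baseline of per-file SHA-256 checksums, then re-run periodically and
# report differences. The directory and manifest paths are hypothetical.
import hashlib
import json
import os
import sys

DATA_DIR = "/archive/research-data"
MANIFEST = "/archive/research-data.manifest.json"

def checksum_tree(root):
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            sums[os.path.relpath(path, root)] = h.hexdigest()
    return sums

current = checksum_tree(DATA_DIR)
if not os.path.exists(MANIFEST):
    # First run: record the current state as the baseline.
    with open(MANIFEST, "w") as f:
        json.dump(current, f, indent=1, sort_keys=True)
    sys.exit(0)

with open(MANIFEST) as f:
    baseline = json.load(f)

for path, old_sum in sorted(baseline.items()):
    if path not in current:
        print(f"MISSING: {path}")
    elif current[path] != old_sum:
        print(f"CHANGED: {path}")

A 'CHANGED' report may just mean someone legitimately updated a file; the point is that a human gets to look at the difference instead of corruption quietly going unnoticed until it's too late.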
Keeping the data live doesn't guarantee that the data will survive ten years, but provided that you pay attention to the fileserver it does mean that you and your successors will think about at least some of the issues, that you will notice if data starts to degrade, and that you have a chance to recover from problems. If you decide to turn off the fileserver and abandon the data, it will at least be a conscious choice instead of simply failing to notice that you've just gotten rid of (or lost) something that was necessary to recover the archives.
(If you abandon the fileserver in a corner and therefore fail to notice plaintive complaints about dying disks, failing backups, and so on, all bets are off. But there are lots of ways to screw up archival storage too.)
This might sound expensive, but even 20 TB of RAID storage space plus backups is not all that much money and it's getting cheaper all of the time. I wouldn't be surprised if it was cheaper than 20T of ten year archival storage, especially once you factor staff time into it (to research and build a ten year, multi-terabyte archival system). And as a bonus the researchers get to keep all of this historical data online, which may turn out to be useful or at least interesting.
(When costing out the archival system, don't forget to include the cost of redundant archival media so that damage to a single piece of media will not lose data. Even if the media is perfectly reliable, things like fires and accidents happen.)
One situation where this might not be good enough is if the research lab wants archives that cannot be altered after they're made, so that they can be sure that the data they've restored now is the data that they used for the research paper seven years ago and no one has accidentally modified it since then. You may still be able to come up with technological solutions like archival filesystems that you make read-only once data has been loaded onto them.
(This entry is adapted from comments I made on the mailing list, so local people may find that it looks familiar. The issues are generic, though. My earlier entry on the same subject was more oriented towards personal data, instead of this sort of larger scale.)
2011-09-06
How not to set up your DNS (part 21)
This one is creative, and best presented in point form.
- the nameservers for co. are ns1.cctld.co through ns6.cctld.co.
- if you query them for the NS records of hotmail.co, all of them point you to NS1.MSFT.NET., NS2.MSFT.NET., and NS5.MSFT.NET.
(They do this slightly oddly, with the aa bit unset, but nameservers for other important zones also do this so I assume that it's the modern style.)
- if you ask any of these MSFT.NET nameservers for the A record for www.hotmail.co or hotmail.co, you get answers (with the aa bit set, as you'd expect from an authoritative nameserver).
- if you ask any of these MSFT.NET nameservers for MX, NS, or SOA records for hotmail.co, you get an interesting reply:
flags: qr aa; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; AUTHORITY SECTION:
. 3600 IN SOA ns1.msft.net. msnhst.microsoft.com. 2009082101 900 600 86400 3600
;; ADDITIONAL SECTION:
ns1.msft.net. 3600 IN A 65.55.37.62
(For bonus weirdness, whether or not you get the A record for ns1.msft.net depends on what query you're making; MX and NS replies don't include it, but SOA replies do.)
We've seen grandiose claims of authority
before, and it doesn't work any better this time than it did
before. Specifically, if you do MX lookups on hotmail.co, your DNS
server will almost certainly give you a 'cannot resolve this right
now' temporary failure result. This is kind of important because
hotmail.co is one omitted letter away from hotmail.com and thus runs
into my small wish for parked domains.
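(If you want to see this from your own resolver's point of view, here is a minimal sketch using the third-party dnspython module and its 2.x resolve() API; a SERVFAIL-style temporary failure shows up as an exception rather than a clean 'no MX records' answer.)

#!/usr/bin/env python3
# A minimal sketch (using the third-party dnspython module, 2.x API) of
# reporting how an MX lookup for a domain turns out from your resolver's
# point of view.
import dns.exception
import dns.resolver

def mx_status(domain):
    try:
        answers = dns.resolver.resolve(domain, "MX")
        return "MX records: " + ", ".join(str(r.exchange) for r in answers)
    except dns.resolver.NoAnswer:
        return "definite answer: no MX records (mailers may fall back to A)"
    except dns.resolver.NXDOMAIN:
        return "definite answer: domain does not exist"
    except dns.exception.DNSException as e:
        return f"temporary failure, mail will be queued and retried: {e}"

print(mx_status("hotmail.co"))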
I guess I'm going to have to add another entry to our list of typo'd email domains that should have their email bounce explicitly.
(That hotmail.co has a working A record doesn't help; if an MX
record lookup returns a temporary failure, a mailer must retry the
MX lookup instead of falling back to the A record. It can only fall
back to the A record if there is a definite 'no MX record' answer.
Not that falling back to the A records would help in this case, as
hotmail.co's IP addresses currently block SMTP connection attempts.)
(It's been a while since the last installment.)
2011-09-01
Things I will do differently in the next building power shutdown
We recently had an overnight, building-wide power shutdown in the building with our machine room. As you can imagine, a total machine room shutdown (and later restart) is an interesting time. We made checklists for both the shutdown and the restart, and for the most part things went fine (although they took longer than expected). But still, there are a few things that I will do differently the next time that this happens:
- make a list of all of our machines and then go through the checklists
making sure that each machine is covered in both, either by name
or as part of a generic group like 'all fileservers' or 'all
Ubuntu machines' (see the sketch after this list). We left out
specifically covering a few machines, which led to uncertainty about
when they were intended to be taken down and brought back.
- for machines that are part of some generic category in the checklist
(eg 'now shut down all fileservers'), print out a list of the
machine names (in advance) so that they can be ticked off as you
shut them down or bring them back up.
(I forgot to do this for some generic categories and only noticed the omission after we'd shut down the print server, which led to me having to hand-write their names on my sheet.)
- when preparing shutdown checklists, try to be sure to remember any odd
bits of your network topology so that you don't shut down a gateway
before the machines behind it. Our firewalls and their hot spares
sit on an odd unrouted subnet that
is reached through one of our general Ubuntu machines, and of
course we shut down all of the Ubuntu machines early. This wasn't
fatal, but it did make us feel kind of silly that we'd missed a
chance to shut down all of the hot spares from the convenience
of our offices.
(We shut down the active firewalls very late, but the hot spares were pretty much unnecessary once our formal shutdown process started.)
- ask yourself what important cron jobs won't get run due to the
shutdown and if you need to do anything to run them by hand after
you bring everything back or if the next cron run will automatically
take care of things. As it turns out, our network traffic accounting
system's daily aggregation process didn't get run and now needs
to get fixed by hand.
- explicitly list and then tick off unusual things to check, even
if everyone is going to remember them. If nothing else, having
them explicitly listed makes it much more likely that only a
single person will check them instead of everyone remembering 'oh
yeah, we need to check weird service <X>' and doing it separately.
(When something is a step on the checklist, you're more likely to pause to at least ask co-workers if anyone has already done it.)
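Here's the sort of cross-check sketch I mean for the first item above. The inventory and checklist file names, the one-name-per-line format, and the 'GROUP:' convention for generic categories are all invented for the example.

#!/usr/bin/env python3
# A minimal sketch of making sure every machine in an inventory shows up
# in both the shutdown and startup checklists, either by name or via a
# generic group. The file names, the 'machine group' inventory format,
# and the 'GROUP:' checklist convention are invented for this example.

def read_lines(path):
    with open(path) as f:
        return [ln.strip() for ln in f if ln.strip() and not ln.startswith("#")]

# inventory.txt: one 'machine group' pair per line, e.g. 'fs1 fileservers'
inventory = {}
for line in read_lines("inventory.txt"):
    name, group = line.split(None, 1)
    inventory[name] = group

for checklist in ("shutdown-checklist.txt", "startup-checklist.txt"):
    entries = read_lines(checklist)
    names = {e for e in entries if not e.startswith("GROUP:")}
    groups = {e.split(":", 1)[1] for e in entries if e.startswith("GROUP:")}
    for machine, group in sorted(inventory.items()):
        if machine not in names and group not in groups:
            print(f"{checklist}: {machine} is not covered")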
Next time I will also remember to take my printed copy of the checklist with me everywhere, no matter what, so that I can always tick off things on it if I find out that they're done or unnecessary or whatever. It may just be my peculiarity, but I find that I really like having physical paper and a pen so that I can literally tick off or cross out things to keep track of them. After events like this my printed checklists are always a sea of tick marks, crossed out bits, and cryptic notations.
(I don't have any particular consistency in how I mark up my checklists; I just do whatever makes sense for me at the time.)
(Having written this down, hopefully I will remember all of these good intentions when the next major machine room shutdown happens. Hopefully it won't be any time soon, although they have more major power work to do to our building at some point in the future.)