2014-02-28
Yet another problem with configuration by running commands
One of the divides in how programs, daemons, and systems get configured is between configuration files and what I'll call 'configuration by command', where you set up the system by running commands and the system persists them behind your back in some magical or at least internal way. I've written previously about the ways that configuration by command harms manageability but today I stumbled over another problem with it.
Put simply, configuration by command robs you of things to copy. When you have a system that is set up through configuration files, you have a ready-made source of things to copy and modify; you find one of the configuration files, make a copy, and change things. When you have a system that's set up by running commands you can see the end state but you almost never have something that will dump out the commands necessary to recreate that end state so that you can copy and modify them. Want to set up another instance of the system? You get to reverse engineer the full set of magic commands and options necessary, which may mean that you have to learn, understand, and perhaps master the entire system.
The real fix for this is something that all too few 'configuration by command' systems have, which is a way to not just report the configuration but to dump out the commands necessary to reproduce it. Documentation is only a partial help and then only if it's clear and ideally has any number of examples.
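To make the gap concrete, here's a minimal Python sketch of the difference; it's a toy, not modeled on any real system, and the 'svc-set' command name is invented. The point is simply that the output of dump_commands() is something you can copy, tweak, and replay, while the output of show() is not.

    class CommandConfiguredService:
        """Toy 'configuration by command' system (not any real one)."""

        def __init__(self):
            self._settings = {}

        def run_command(self, name, value):
            # The magic persistence step: the setting vanishes into
            # internal state behind your back.
            self._settings[name] = value

        def show(self):
            # What most systems give you: a report of the end state.
            for name, value in sorted(self._settings.items()):
                print(f"{name} = {value}")

        def dump_commands(self):
            # What too few systems give you: replayable commands that
            # recreate the end state, ready to copy and modify.
            for name, value in sorted(self._settings.items()):
                print(f"svc-set {name} {value}")

    svc = CommandConfiguredService()
    svc.run_command("port", "ttyb")
    svc.run_command("baud", "9600")
    svc.dump_commands()   # prints 'svc-set baud 9600' and 'svc-set port ttyb'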
(This entry is brought to you by yet another attempt to set up a serial port under Solaris OmniOS. Unlike last time there is another configured serial port whose configuration I would be happy to copy and modify slightly, except for this exact problem.)
2014-02-24
Nerving myself up to running experimental setups in production
One of the things that I want to do is move towards gathering OS level performance metrics for our systems, ideally for basically any performance stat that we can collect. All of the IO stats for all disks? Lots of stats for NFS mounts? CPU and memory utilization? Network link utilization and error counts? Bring them on, because the modern view is that you never know when this stuff will be useful or show you something interesting. The good news is that this is not a novel idea and there's a decent number of systems out there for doing all of the pieces of this sort of thing (collecting the stats on machines, forwarding them to a central place, aggregating and collating everything, graphing and querying them, etc). The bad news, in a sense, is that I don't know what we're doing here.
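As a rough illustration of just the 'collect on each machine and forward somewhere central' piece, here's a minimal Python sketch. It's not what we deployed and not any particular tool's protocol; the collector host, the port, and the use of a Linux-style /proc/loadavg are all assumptions for the example, and real collection agents do a great deal more than this.

    import json
    import socket
    import time

    COLLECTOR = ("stats.example.com", 8125)   # hypothetical central host/port

    def read_loadavg():
        # Linux-style /proc/loadavg; other OSes need other sources.
        with open("/proc/loadavg") as f:
            one, five, fifteen = f.read().split()[:3]
        return {"load1": float(one), "load5": float(five),
                "load15": float(fifteen)}

    def main():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        host = socket.gethostname()
        while True:
            sample = read_loadavg()
            sample.update({"host": host, "time": int(time.time())})
            sock.sendto(json.dumps(sample).encode(), COLLECTOR)
            time.sleep(60)

    if __name__ == "__main__":
        main()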
Like many places, we like everything we run in production to be fully baked. We work out all of the pieces in advance with whatever experimentation is needed, test it all, document it, and then put the finalized real version into production. We don't like to be constantly changing, adjusting, and rethinking things that are in production; that's a sign that we screwed up in the pre-production steps. Unfortunately it's become obvious to me that I can't make this approach work for the whole stats gathering project.
Oh, I can build a test stats collection server and some test machines to feed it data and make sure that all of the basic bits work, and I can test the 'production' version with less important and more peripheral production machines. But it's become obvious to me that really working out the best way to gather and present stats is going to take putting a stats-gathering system on real production servers and then seeing what explodes and what doesn't work for us (and what does). I simply don't think I can build a fully baked system that's ready to deploy onto our production servers in a final, unchanging configuration; I just don't know enough and I can't learn with just an artificial test environment. Instead we're going to have to put a half-baked, tentative setup on to production servers and then evolve it. There are going to be changes on the production machines, possibly drastic ones. We won't have nice build instructions and other documentation until well after the fact (once all the dust settles and we fully understand things).
As mentioned, this is not how we want to do production systems. But it's how we're going to have to do this one and I have to live with that. More than that, I have to embrace it. I have to be willing to stop trying to polish a test setup and just go, just put things on (some of) the production servers and see if it all works and then change it.
(I've sold my co-workers on this. Now I have to sell myself on it too (and stop using any number of ways to duck out of actually doing this), which is part of what this entry is about.)
2014-02-08
You cannot have just one network install server
Suppose, hypothetically, that you want to install your fleet of machines in the approved modern way, which is through an automated over the network install system. You PXE boot your new server, pick 'install system' from a boot menu (for good reason), and after a while it's all done for you. This sounds great, doesn't it.
So here's a question: how many network install servers do you have?
Your network install server (or servers) is a crucial core resource. If it is down or broken or suffers data loss, you can't install or reinstall any of your regular servers. You can't add any new ones and you can't replace any broken ones. This is obviously a very undesirable situation to be in, especially if you have any non-redundant regular servers (where you only have, say, one mail gateway because after all if it breaks you can network install a replacement in half an hour out of the generic spares pool).
The obvious conclusion is that if your regular install method is network installs, you cannot have just one network install server. You need at least two for redundancy and you're going to want to think about backups (and just as importantly, restores) for all of those install configurations and possibly install data that you need.
(Whether or not you back up things like standard OS packages used in the install process depends in part on how fast you can re-fetch them from the master sources versus how fast you can restore them from your backups and in part on whether re-fetching them will take too long or use up too much of your network bandwidth or both. In some situations you may actually determine that re-fetching them will be faster than a restore.)
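If you want to put rough numbers on that trade-off, the arithmetic is trivial; every figure in this Python sketch is invented for illustration, not measured from any real setup.

    # Back-of-the-envelope: re-fetch from upstream mirrors vs restore
    # from backups.  All of these numbers are made up; plug in your own.
    package_set_gb = 40          # size of the OS package collection
    fetch_mbit_per_s = 100       # effective rate from the upstream mirrors
    restore_mbyte_per_s = 60     # effective rate out of the backup system

    fetch_hours = (package_set_gb * 8 * 1024) / fetch_mbit_per_s / 3600
    restore_hours = (package_set_gb * 1024) / restore_mbyte_per_s / 3600

    print(f"re-fetch: {fetch_hours:.1f}h  restore: {restore_hours:.1f}h")
    # With these invented numbers a restore wins (about 0.2h vs 0.9h);
    # raise the fetch bandwidth or slow the restore and the answer flips.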
Oh, and hopefully it goes without saying that you probably don't want to network install your network install servers. That can easily create a little chicken and egg situation, no matter how tempting it is once you have things going.
(Among other things, consider a new OS release where the network installer for the new release can only be hosted on a machine running the new release.)
Ideally the install servers shouldn't depend on anything that they're used to install, including for things like DNS service. In practice, avoiding circular dependencies here might get very irritating and may not be worthwhile. After all, if you've lost all redundant copies of, say, your install servers, your DNS servers, and your firewalls, you probably have bigger problems than the fact that you can't recreate your install servers following your usual canned procedures in order to then recreate the DNS servers and the firewalls. You'll likely be improvising a lot more than just the install server installs.
2014-02-07
A followup to what sudo emails to ignore and not ignore
So I wrote this entry on what sudo emails to ignore and not ignore the other day. Today we got some email from sudo, reporting:
appsN : Feb 7 12:36:24 : <redacted> : 3 incorrect password attempts ; TTY=pts/NN ; PWD=/h/<redacted> ; USER=root ; COMMAND=/bin/echo great post Chris!
I've got to award this a special bonus prize for probably the most amusing and clever blog feedback I've ever gotten. It certainly made me (and my co-workers) laugh. Well done!
(And yes, because I'm a cautious sysadmin I did indeed check our logs to see if the account might have been compromised and then just to be sure I also verified that the IP the user had logged in from had been used to request URLs here. I was pretty sure even before I started, but after recent events I'm just a little bit jumpy about ignoring things that I think have to be harmless.)
PS: For the record, I'm also pleased that at least one of our users finds my blog interesting enough to read. And I'm happy to take requests for bits of our infrastructure to write up here, if they (or other people) are curious. Email, Twitter, whatever.
2014-02-06
Some thoughts on what sudo emails to ignore and to not ignore
If you run a multi-user system with sudo and your users are anything
like ours, you will periodically get email alerts from sudo about
users trying to do things. In theory and in an ideal world all of these
emails would be evidence of malign intent because, after all, they're
all from unauthorized users trying to do things as root.
In the real world this is not at all the case. In practice we get a
lot of email from sudo about users trying to run things like, oh,
'easy_install PKG' or 'apt-get install PKG' or 'npm install PKG'.
You get the idea. Although we've never tracked down the users involved
to quiz them about it, my assumption is that they've found some 'how to
install X' guide on the web, the guide uses sudo because it's focused
on single-user machines (or machines where you're the administrator),
and they are rationally following the guide. I can't expect that our
users necessarily even know what sudo is and it's not as if sudo
gives you a big glaring warning about this when you run it; it just
prompts you for a password (and then doesn't tell you why things didn't
work, which of course leads users to think that they entered their
password wrong and to try several times).
(We've gotten enough of these emails that we actually wrote a sudo
cover script that just tells people 'this won't work, please contact
your point of contact for assistance'. It
doesn't work all of the time for various reasons but it really cuts
down on the noise, and at least we're trying to be friendly.)
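Our actual cover script isn't something I'm reproducing here, but the idea is simple enough to sketch. The following Python version is a guess at the general shape, with an invented admin group name and the assumption that it gets installed as 'sudo' ahead of the real binary in users' default $PATH; admins fall through to the real sudo, everyone else gets the friendly message.

    import grp
    import os
    import pwd
    import sys

    REAL_SUDO = "/usr/bin/sudo"     # assumed location of the real binary
    ADMIN_GROUP = "sysadmin"        # invented group name

    def is_admin():
        user = pwd.getpwuid(os.getuid()).pw_name
        try:
            # Only checks supplementary membership listed in the group
            # database; good enough for a sketch.
            return user in grp.getgrnam(ADMIN_GROUP).gr_mem
        except KeyError:
            return False

    if is_admin():
        os.execv(REAL_SUDO, [REAL_SUDO] + sys.argv[1:])

    sys.stderr.write("sudo won't do what you want here; please contact "
                     "your point of contact for assistance.\n")
    sys.exit(1)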
But every so often we get sudo emails
with commands that look far less innocent, commands like '/bin/su
-'. I've now learned that sudo attempts that don't look like they
have a straightforward or innocent reason ought to be fully
investigated. I think we're going to be bothering users in the
future for explanations even if there are no signs that their
accounts have been compromised. What exactly are such commands?
I'm not sure yet but I'll probably start with anything that doesn't
have an obvious explanation. We'll inevitably be refining our views
of this as we talk to users and see why they wound up innocently
running alarming-looking sudo commands. Cynically I expect to find that
there are an awful lot of instructions out there on the web that
have people doing remarkably alarming things through sudo.
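As a sketch of the triage I have in mind (with invented, obviously incomplete pattern lists, not our real rules), something like this Python is where I'd start:

    import re

    LOOKS_INNOCENT = [
        r"^(/usr/bin/)?apt-get install ",
        r"^(/usr/bin/)?pip install ",
        r"^(/usr/bin/)?npm install ",
    ]
    INVESTIGATE_NOW = [
        r"^(/usr/bin|/bin)?/?su( -)?$",     # 'sudo su -' and friends
    ]

    def classify(command):
        if any(re.search(p, command) for p in INVESTIGATE_NOW):
            return "drop everything and go digging"
        if any(re.search(p, command) for p in LOOKS_INNOCENT):
            return "probably a web guide; file it"
        return "no obvious explanation; ask the user"

    for cmd in ["/usr/bin/apt-get install foo", "/bin/su -", "/sbin/reboot"]:
        print(f"{cmd:35} -> {classify(cmd)}")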
(In our recent security incident the
intruders immediately tried to do 'sudo /bin/su -'. This is now
an immediate 'drop everything and go digging' trigger for me; the
odds appear very good that any account that tries to do this has
been compromised.)
2014-02-02
An illustration of the problem of noise
Our backup software was originally written in a world in which you backed up to comparatively slow tape drives and many of its core workings remain in that world even today. One absolutely necessary trick for speeding up tape backups is to first write backups to a 'holding disk' on the backup host, which can be done in parallel, and only later write them serially to tape. Of course if you don't have a holding disk, you must make backups one at a time so you can write them to tape as you make them. Although we back up to disks these days, we still do it with Amanda and Amanda still needs a 'holding disk' to do backups in parallel.
On Friday, I noticed that permissions issues meant that one of our backup servers wasn't able to use its holding disk and thus was forced into the much slower mode of only doing one backup at a time. In fact it had been operating this way for some time. More than that, it had been complaining about this for some time, faithfully mentioning the problem in every daily dump report. The mentions looked like this:
driver: WARNING: ignoring holding disk /dumps/holdingdisk: Permission denied
Actually that's not an accurate representation of how they really looked, because I've pulled the one significant line out of over a hundred lines of chatter and I haven't mentioned that this line was down in a section labeled 'Notes', a section that's always full of random unimportant things.
(Amanda backup reports have multiple sections, with more important ones coming earlier. The Notes section is the fourth of five. You can guess how much attention we pay to it based on that.)
This is yet another perfect example of the effects of noise. Amanda put a significant thing in a bunch of noise and we probably spent months not seeing it, because that's what happens when you shower humans with noise; they tune it all out. People don't just tune out the noise, they also tune out any signal that's in the general vicinity and doesn't stand out from the noise enough (generally this is almost all signal).
This reinforces two strong views that I've held for a long time: don't mix signal in with noise and don't generate noise at all. As we've seen here, mixing signal in with noise doesn't work at all (unless your objective is to hide the signal), and generating noise inevitably contaminates everything around it to some degree.
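The real fix is on the producing side, which is to say not generating the noise in the first place. But as a stopgap it takes very little to have a machine do the reading that humans won't; here's a hedged Python sketch that pre-filters a report and hoists lines matching known-significant patterns to the top. The pattern list is an invented example, not anything Amanda itself provides.

    import re
    import sys

    # Invented examples of 'always significant' patterns; not complete.
    SIGNIFICANT = [
        r"ignoring holding disk",
        r"Permission denied",
        r"FAILED",
    ]

    def hoist(report_text):
        hits = [line for line in report_text.splitlines()
                if any(re.search(p, line) for p in SIGNIFICANT)]
        if hits:
            print("*** ATTENTION ***")
            for line in hits:
                print(line)
            print()
        print(report_text)

    if __name__ == "__main__":
        hoist(sys.stdin.read())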
(I've undoubtedly banged on this particular drum before in other entries. Today I don't feel like hunting them down to add them as links here.)