Wandering Thoughts archives

2014-11-30

You should keep your system logs for longer than you probably are

One simple thing you can do to improve your life is to make your machines keep their logs for longer than they currently do. Most systems ship with relatively short log retention defaults that basically date from the days when systems had what are now very small disks and sysadmins got really grumpy about logs eating up lots of scarce disk space. Those days are over now for most systems; for example our new servers come with 500 GB HDs as the default. A 500 GB disk will hold really quite a lot of logs. SSDs change this a bit, but even small SSDs these days are in the 64 to 80 GB range and you usually have to work hard to get a system install to use more than a few GB. Even on SSDs we wind up with tens of GB free.
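If your systems use logrotate, keeping logs longer is mostly a matter of raising the rotation count in the existing stanzas. As a minimal sketch of what an edited stanza might look like (the file names and numbers here are purely illustrative and will vary by distribution, so don't take them as our settings):

/var/log/syslog /var/log/auth.log {
    weekly
    rotate 104       # roughly two years of weekly rotations instead of the usual four
    compress
    delaycompress
    missingok
    notifempty
}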

(Of course this goes well with having a central syslog server, because usually you can easily give the central syslog server a lot of disk space to store lots of logs. This isn't true in big environments where you have a lot of log traffic in the aggregate, but most sysadmins are not in such environments.)
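As an illustration of how little central logging takes with rsyslog, it can be one forwarding line on each client plus a small listener configuration on the log host. This is only a sketch; 'loghost' is a made-up name, and in real life you'd want to think about TCP versus UDP, queueing, and perhaps TLS.

# on each client, eg in /etc/rsyslog.d/forward.conf
*.*   @@loghost.example.com:514     # '@@' is TCP; a single '@' would be UDP

# on the central syslog server
$ModLoad imtcp
$InputTCPServerRun 514
$template PerHost,"/var/log/hosts/%HOSTNAME%/syslog"
*.*   ?PerHost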

The core reason to keep logs for a relatively long time is that you don't always find out about things that you want to look into right away. The longer you keep logs for, the further you can look back into history to see things. The obvious case where this is really important is if you ever experience a system compromise or security problem that you didn't detect immediately. But you can also be looking back to see how frequent something is, or even doing long term historical analysis on how things have changed over time.

Having said that, there are some concerns involved if you're thinking of doing this. We're lucky enough to be in a situation without real concerns about information sensitivity and anti-retention policies. For information sensitivity, we don't have any really sensitive logs that we have to closely safeguard and we consider all of our machines about as secure as each other.

(Of course I am a big fan of not logging sensitive information you don't need.)

Once you've made the decision to keep your logs for a relatively long time, there are of course a bunch of things you can do to improve the situation even more. The obvious ones are centralizing your logs and setting up a long-term archival system for them so that if need be you can go back over really extended periods of time. If you do periodic archival system backups, for example, you can make sure that your log storage is captured in the backups and that you keep enough logs to cover at least the full time interval between those archival backups.
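For the archival side even something crude will do; here's a sketch of the sort of periodic job you could run on the central syslog server (all of the paths are made up for illustration):

#!/bin/sh
# hypothetical archive sweep: copy only already-rotated (.gz) logs into a
# dated area that the regular archival backups will then capture
year=$(date +%Y)
mkdir -p "/srv/log-archive/$year"
rsync -a --include='*/' --include='*.gz' --exclude='*' \
    /var/log/hosts/ "/srv/log-archive/$year/"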

(This elaborates on a tweet of mine.)

KeepLogsLonger written at 22:10:24; Add Comment

2014-11-29

Sometimes you need to turn things into small, readily solvable problems

If you've read my entry on making IKE work you may have noticed that the ultimate configuration I wound up with doesn't sound all that complicated or as if it took all that much work to create. Yet I've previously been strongly uninterested in trying to create more or less the IKE configuration that I wound up with, and I expected it to take a daunting amount of work (cf). What happened between those two points is not quite as simple as me being wrong about how much work it was.

I said (and meant) that what made this whole thing feasible was my realization that IKE didn't need to actually manage my GRE tunnel; it could just do IPSec keying alone. The vitally important thing about this was that it drastically reduced the scope of the project. The initial project was 'replace everything with IKE'. This was a dauntingly massive thing that was clearly going to go on and on before I got to a point where the new setup actually did anything. Worse, it might not even be possible, so I could be spending a bunch of effort on something that would fail outright. By contrast, the reduced scope project would clearly take much less time to yield results. It was also much more likely to succeed, since negotiating IPSec keys is kind of the core job of an IKE daemon; it would be quite weird if I could not use one to put IPSec on a particular traffic flow. With the scope of the project narrowed I could invest some time and get a relatively immediate payoff, and so I did. Then I could push the project a little bit further for another payoff, and then a little bit further still for another payoff, and by the end of things I had done most if not all of the initial project idea.
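To make the reduced scope concrete, keying just the traffic between the two tunnel endpoints is a transport mode IPSec policy, which is something pretty much any IKE daemon can express in a handful of lines. Here's a strongSwan-style sketch; the addresses are placeholders and this is not my actual configuration, just an illustration of how small the problem becomes:

conn gre-keying
    type=transport
    authby=secret
    left=192.0.2.1
    right=198.51.100.1
    # only protect GRE (IP protocol 47) between these two hosts
    leftprotoport=gre
    rightprotoport=gre
    auto=start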

Based on my reaction here (and to other projects) I think one of my lessons from this particular experience is that I need to make this sort of shift more often. Faced with a big daunting project, I should look for ways to re-scope at least part of it into a small project that I can do easily and that will have an immediate payoff. This isn't just as simple as breaking the project up into separate steps; as it was here, it may require rethinking what the project has to do so that you can narrow the scope to something much more modest. If you're lucky you can re-expand the project later, but you may not be.

(Of course if I look at this right this is not exactly novel. There's a whole collection of advice about turning large 'big bang' style projects into a series of incremental changes, and some of the stated reasons for this are that you start getting improvements from the work sooner and that this makes it less risky.)

FindingSmallSolvableProblems written at 01:46:58; Add Comment

2014-11-24

Delays on bad passwords considered harmful, accidental reboot edition

Here is what I just did to myself, in transcript form:

$ /bin/su
Password: <type>
[delay...]
['oh, I must have mistyped the password']
[up-arrow CR to repeat the su]
bash# reboot <CR>

Cue my 'oh damn' reaction.

The normal root shell is bash and it had a bash history file with 'reboot' as the most recent command. When my su invocation didn't drop me into a root shell immediately I assumed that I'd fumbled the password and it was forcing a retry delay (as almost all systems are configured to do). These retry delays have trained me so that practically any time su stalls on a normal machine I just repeat the su; in a regular shell session this is through my shell's interactive history mechanism with an up-arrow and a CR, which I can type ahead before the shell prompt reappears (and so I do).

Except this time around su had succeeded and either the machine or the network path to it was slow enough that it had just looked like it had failed, so my 'up-arrow CR' sequence was handled by the just-started root bash and was interpreted as 'pull the last command out of history and repeat it'. That last command happened to be a 'reboot', because I'd done that to the machine relatively recently.

(The irony here is that following my own advice I'd turned the delay off on this machine. But we have so many others with the delay on that I've gotten thoroughly trained in what to do on a delay.)
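For reference, on a Linux machine using PAM the delay usually comes from pam_unix, and one way to turn it off is the 'nodelay' option in the relevant auth stack. This is only a sketch, assuming your distribution's su stack looks roughly like this (it may well not):

# fragment of /etc/pam.d/su
# 'nodelay' tells pam_unix not to pause after a failed password
auth       required     pam_unix.so nodelay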

PasswordAuthDelayHarm written at 16:01:43; Add Comment

2014-11-20

Sometimes the way to solve a problem is to rethink the problem

After a certain amount of exploration and discussion, we've come up with what we feel is a solid solution for getting our NFS mount authentication working on Linux. Our solution is to not use Linux; instead we'll use OmniOS, where we already have a perfectly working NFS mount authentication system.

To get there we had to take a step back and look at our actual objectives and constraints. The reason we wanted our NFS mount authentication on Linux is that we want to offer a service where people give us disks (plus some money for overhead) and we put them into something and make them available via NFS and Samba and so on. The people involved very definitely want their disk space available via NFS because they want their disk space to be conveniently usable (and fast) from various existing Linux compute machines and so on. We wanted to do this on Linux (as opposed to OmniOS (or FreeBSD)) because we trust Linux's disk drivers the most and in fact we already have Linux running happily on 16-bay and 24-bay SuperMicro chassis.

(I did some reading and experimentation with OmniOS management of LSI SAS devices and was not terribly enthused by it.)

We haven't changed our minds about using Linux instead of OmniOS to talk to the disks; we've just come to the blindingly obvious realization that we've already solved this problem and all it takes to reduce our current situation to our canned solution is adding a server running OmniOS in front of the Linux machine with the actual disks. Since we don't view this bulk disk hosting as a critical service and it doesn't need 10G Ethernet (even if that worked for us right now), this second server can be one of our standard inexpensive 1U servers that we have lying around (partly because we tend to buy in bulk when we have some money).

(Our first round implementation can even take advantage of existing hardware; since we're starting to decommission our old fileserver environment we have both spare servers and more importantly spare disk enclosures. These won't last forever, but they should last long enough to find out if there's enough interest in this service for us to buy 24-bay SuperMicro systems to be the disk hosts.)

This rethinking of the problem is not as cool and interesting as, say, writing a Go daemon to do efficient bulk authentication of machines and manage Linux iptables permissions to allow them NFS access, but it solves the problem and that's the important thing. And we wouldn't have come up with our solution if we'd stayed narrowly focused on the obvious problem in front of us, the problem of NFS mount authentication on Linux. Only when one of my coworkers stepped back and started from the underlying problem did we pivot to 'is there any reason we can't throw hardware at the problem?'.

There is a valuable lesson for me here. I just hope I remember it for the next time around.

SolvingTheRealProblem written at 00:27:04; Add Comment

2014-11-14

Sometimes there are drawbacks to replicating configuration files

This is a war story, but not my war story; this is all my coworkers' work.

Writing a working Samba configuration is a lot of painful work. There are many options, many of them interact with clients in odd and weird ways, and the whole thing often feels like a delicately balanced house of cards. As a result we have a configuration that we've painstakingly evolved over the many years that we've been using Samba. When we needed a second Samba server dedicated to a particular group but still using our NFS fileservers we of course copied the file, changed the server name, and used it as is. Starting from scratch would have been crazy; our configuration is battle-tested and we know it works.

We recently built out an infrastructure for cheap bulk storage, originally intended for system backups; the core idea is that people buy some number of disks, give them to us, and we make them accessible via Time Machine (for Macs) and Samba (for Windows). Of course we set up this machine's Samba using our master Samba configuration (again with server names changed, and this time around with a lot of things taken out because eg this server doesn't support printing). Recently we discovered that Samba write performance on this server was absolutely and utterly terrible (we're talking in the kilobytes or very small megabytes a second range). My coworkers chased all sorts of worrisome potential causes and wound up finding it in our standard smb.conf, which had the following lines:

# prevent Windows clients copying files to
# full disks without warning. This can lead
# to data loss.

strict sync = yes
sync always = yes

Surprisingly, when you tell your Samba server to fsync() your writes all the time, your write performance on local disks turns out to be terrible. Performance was okay on our main Samba servers for complex reasons involving our NFS servers.

The comment explains the situation we ran into fairly well; Windows clients copying files from the local disk to a Samba disk could run out of space on the filesystem backing the Samba disk, have the write fail, not notice, and delete the local file because it 'copied'. That was very bad. Forcing syncs flushed the writes from the Samba server to the NFS fileserver and guaranteed that if the fileserver accepted them there was space in the filesystem (and conversely that if you were out of space the Samba server knew before it replied to the client). All of this is perfectly rational; we ran into a Samba issue, found some configuration options that fixed it, put them in, and even documented them.
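One way to keep the safety behaviour where it matters without paying for it everywhere is to scope the options per share instead of globally. This is only a hypothetical sketch and not necessarily what we actually did, but it shows the shape of the idea:

# in the bulk storage server's smb.conf: this share lives on local disk,
# so skip the forced fsync()s for it
[backups]
    path = /bulk/backups
    strict sync = no
    sync always = no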

(Maybe there are other configuration options that would have fixed this problem and maybe this problem is not an issue any more on current versions of Samba and everything else in our environment, but remember what I said about us not rewriting Samba configuration files because they're a house of cards.)

This whole thing is a nice illustration of the downside of replicating configuration files when you're setting up new services. Not starting from scratch is a lot faster and may well save you a lot of painful bad experiences, but it can let things slip through that have unpleasant side effects in a new environment. And it's not like you can really avoid this problem without starting from scratch; going through to question and re-validate every configuration setting is almost certainly too mind-numbing to work. Plus there's no guarantee that even a thorough inspection would have caught this issue, since the setting looks perfectly rational unless you've got the advantage of hindsight.

CopyingConfigsDrawback written at 00:54:21; Add Comment

