2012-09-24
The wrong way to harvest system-level performance stats
I've recently been looking at packages to harvest system-level performance stats (because this is something that we should really be doing), and in the process I've repeatedly observed an anti-pattern that I need to rant about now.
One popular implementation technique is to have a central master program that runs 'plugin' scripts (or programs) to gather the specific stats. Each time tick the master program runs all the plugin programs, each of which is supposed to spit out some statistics that the master winds up forwarding to wherever. This model is touted as being flexible yet powerful; to add some new stats to track, you just write a script and drop it in the right directory.
Unfortunately you cannot do real stats gathering this way, not unless you are prepared to offload a significant amount of work to your stats backend. The problem is that too many interesting stats must be computed from the difference between two measurements. At the system level, such delta stats are simply presented as a count of events since a start time (which is very simple to implement). If you want to compute a rate, you need to look at the change in the counts over a time interval. This fundamentally requires an ongoing process, not a 'sample every time tick' one-shot script.
Delta stats are important and common. They occur all over Linux (many system performance stats are exported by the kernel this way) and you also find them in things like Solaris's network stats (and probably in other Solaris performance stats, but I haven't looked that closely). Given the simplicity of the basic stats-exporting interfaces, I'd expect to find them in pretty much any OS.
You can make delta stats work in a run-every-tick environment, but it requires significant extra work and generally an auxiliary database of some form. Doing so is hammering a round peg into a square hole; it can be done, but you're forcing things instead of using the right tool for the job. Unfortunately for the nice simple 'run command X every so often and take its output' model, it is fundamentally the wrong model for stats gathering. The right approach to stats harvesting is a constantly-running program that has the specific stats gathering embedded directly into itself somehow.
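As a concrete illustration, here is a minimal sketch of what such a constantly-running collector has to do to turn the kernel's running counters into rates. It assumes Linux and its /proc/net/dev received-bytes counters; the interval and the output format are arbitrary choices for the example:

  import time

  def read_rx_bytes():
      # /proc/net/dev has two header lines, then one line per interface
      # of the form 'eth0: <rx bytes> <rx packets> ...'.
      counts = {}
      with open("/proc/net/dev") as f:
          for line in list(f)[2:]:
              iface, data = line.split(":", 1)
              counts[iface.strip()] = int(data.split()[0])
      return counts

  def main(interval=10):
      prev = read_rx_bytes()
      while True:
          time.sleep(interval)
          cur = read_rx_bytes()
          for iface, count in cur.items():
              if iface in prev and count >= prev[iface]:
                  # The rate is the change in the kernel's running counter
                  # over our sampling interval. (Counter wraparound and
                  # resets are simply skipped here.)
                  print("%s rx %.1f bytes/sec" %
                        (iface, (count - prev[iface]) / float(interval)))
          prev = cur

  if __name__ == "__main__":
      main()

The point is the 'prev'/'cur' pair: the previous sample has to live somewhere between ticks, which is exactly what a one-shot plugin script doesn't have.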
2012-09-15
Sensible reboot monitoring
I have an embarrassing confession: we recently discovered that some of our machines had been spontaneously rebooting every so often, and we hadn't noticed. This is not really a good thing; if your servers are spontaneously rebooting, you should know about it. We have a monitoring system, so of course the right answer is to have the monitoring system alert us when a system reboots.
(Some of you are laughing sadly right now.)
The problem with alert-on-reboot is that you get alerted for every reboot. Including all of the times that you deliberately reboot a machine. And unless you have serious problems, almost all system reboots are deliberate reboots, which means that you've created an alert that is almost entirely noise. Pretty soon you're going to be completely habituated to reboot alerts and you'll screen them out automatically. Just like all other alerts, in order to make reboot alerts work you need to make them low-noise. In other words, reboot alerts need to ignore deliberate reboots and only alert you when a machine reboots unexpectedly.
The best way to do this depends on your monitoring system. You can do it in the notifier agent that you run on your systems (such that it only sends a 'machine rebooted unexpectedly' alert under some circumstances), or you may be able to do it in the monitoring system itself if it's smart enough.
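For illustration, here's a minimal sketch of the notifier-agent approach; the flag file path, the alert address, and the assumption that your 'planned reboot' wrapper touches the flag before rebooting are all invented for the example. It would run once at boot, say from an @reboot cron entry:

  import os, socket, smtplib
  from email.mime.text import MIMEText

  FLAG = "/var/db/planned-reboot"      # hypothetical: touched by a planned-reboot wrapper
  ALERT_TO = "sysadmins@example.com"   # hypothetical alert address

  def main():
      if os.path.exists(FLAG):
          # Planned reboot: consume the flag and stay quiet.
          os.unlink(FLAG)
          return
      host = socket.gethostname()
      msg = MIMEText("%s rebooted unexpectedly" % host)
      msg["Subject"] = "Unexpected reboot on %s" % host
      msg["From"] = "root@%s" % host
      msg["To"] = ALERT_TO
      s = smtplib.SMTP("localhost")
      s.sendmail(msg["From"], [ALERT_TO], msg.as_string())
      s.quit()

  if __name__ == "__main__":
      main()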
(If you do this, I think that you should also track all system reboots (but not alert on them). Partly this is a just-in-case measure, and partly it's because there may turn out to be other events that are correlated with system reboots, even deliberate ones. If you don't track all reboots, you can't spot those correlations.)
2012-09-08
When you can log bad usernames for failed authentications
I've said before that when you log failed authentications you should omit the login name if it's a nonexistent user; otherwise, sooner or later you will log someone's password. But sometimes this is inconvenient and you'd really like to log at least some nonexistent usernames; for example, you might want to find out which usernames attackers are probing.
The simple answer is that you can log nonexistent usernames when you know that they're not passwords. Of course, the devil is in the details, specifically on how you know this.
The straightforward way is to have a list of popular or interesting nonexistent usernames, for example standard or common logins that you've removed from your system (or never had in the first place). You probably don't have a guest account but you might want to know how often people try it.
The more advanced way is to have your software know something about your policies on strong passwords. If a nonexistent username fails your password strength tests, it almost certainly can't be a valid password for any of your accounts and you're free to log it. You don't have to implement all of your password checks; in fact, I suspect that in most environments you'd get the largest benefit from a few very basic and simple ones, most especially a 'no all lower case' rule if you have one (since most usernames are all lower case).
Web apps that use email addresses as the user identifier can apply similar basic heuristics. If you have a decent validation system for 'is this an email address', I think it's very unlikely that you have a user's password.
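Put together, the heuristics might look something like this minimal sketch; the probe list and the specific strength rule are stand-ins for whatever your environment actually uses:

  import re

  KNOWN_PROBES = {"guest", "test", "admin", "oracle"}   # hypothetical probe list
  EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

  def safe_to_log(name):
      # Interesting nonexistent logins that we know aren't anyone's password.
      if name.lower() in KNOWN_PROBES:
          return True
      # Anything that fails our (hypothetical) password rules can't be a
      # valid password; the example rule here is 'no all lower case'.
      if name.isalpha() and name.islower():
          return True
      # For email-as-username systems: something shaped like an address
      # is very unlikely to be a password.
      if EMAIL_RE.match(name):
          return True
      return False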
(Of course if failed authentications are not actionable logs then the only reason to do any of this is for your vague interest and there's no point in working particularly hard at it.)
2012-09-06
A little trick and gotcha with Exim ratelimits
In light of recent events, I've been exploring Exim's features for ratelimiting incoming email. In the process I have found something that I feel like documenting. Suppose you're trying to establish what ratelimit settings you actually want (ones that won't cut off your users). The simple way to do this is to create a do-nothing ratelimit ACL stanza that exists just to log the actual rate volumes, something like:
  warn ratelimit = 10 / 10m / per_rcpt / $sender_address
       log_message = SENDER RATE: $sender_rate
This looks like just what you want, except there's a gotcha; this is not actually measuring what you think it is and what you want. The default Exim behavior of ratelimits is that they are 'leaky'. In Exim terminology, this means that if an actual rate exceeds the limit, it doesn't update the rate counters; in effect, the rate counters stick at the point where they started to go over the limit. This behavior is generally what you want for rates that are actually enforced, but it's definitely not what you want if you're only using ratelimits to explore the actual real rates. In this situation what you'd see is a reported 'sender rate' that was almost always only slightly over 10, regardless of what the real sending rate was.
What you want in this situation is for the ratelimit to be 'strict', where it's updated for every action (whether or not the action pushes the rate over the limit). Then it actually tracks the real sending rate. So you actually want the ACL stanza:
  warn ratelimit = 10 / 10m / per_rcpt / strict / $sender_address
       log_message = SENDER RATE: $sender_rate
(If you're doing ratelimiting, you may or may not want to exclude email to internal addresses. Locally we're going to exclude them, because we mostly care about a compromised user here being exploited to spam outside sites and poison our sender reputation. Spamming our own users is annoying but we can deal with it.)
2012-09-05
Some thoughts on logging failed login attempts (for existing users)
When I wrote my entry on logging all successful user authentications, I thought that I had strong heretical views on logging failed authentications for existing users. Having thought about it more since then, I've backed off on that a bit, so this is somewhat more ambivalent than I was expecting.
(I've written before about why you should basically never log nonexistent user names for bad authentication attempts. The issue here is whether you should log failed authentications for usernames that do exist.)
The web app world these days has the concept of 'actionable data'; this is something that you measure where you will actually take action based on what you see. Actionable data drive actions; non-actionable data just create pretty graphs that you can distract yourself with. We can extend this concept to system administration, so the question you should ask yourself about logging failed authentication attempts is 'how actionable is this?'
In some environments, the answer is 'very actionable'; you will use the logging data to drive various anti-intrusion systems (often automated ones), to set off alerts, and so on. Many environments are very much the reverse; you'll never really do anything with this information except let it accumulate in a file and perhaps later search it to see whether a successful intruder did some door-rattling beforehand. In the middle are environments where you will periodically use the failed authentication logs to troubleshoot problems your users are having (so you can say, eg, 'we saw you fail to authenticate; are you sure the password is set right in your app?').
How actionable failed authentication information is tells you how important logging it is. In particular, if it's not actionable at all, tracking failed authentications is just a distraction; you can log them if you really want to, but you should put them in a place where they will not get in the way of more important messages.
My cynical view is that many developers over-estimate the importance and actionability of failed authentications. Outside of perhaps troubleshooting user problems once in a while, most sysadmins on the Internet today will never look at failed authentication logs; there are simply too many people rattling your doorknobs to make it worthwhile to pay attention to them.
Sidebar: what I'd like developers to do on Unix
I will assume that you've written your program to log to syslog. If so, provide an option to log authentication failures at priority DEBUG instead of priority INFO.
If you think that you should log authentication failures at any priority higher than INFO, you are wrong (unless you are writing software for a very specialized environment). There is only one sort of bad authentication that should be logged on any urgent basis, and that is 'the authenticator was right but local policy doesn't allow this to happen'. For example, 'someone used the right root password over SSH but we don't allow remote logins to root' or 'attempt to become root with the right password but from someone who is not in group wheel'.
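As a minimal sketch of what I mean, using Python's standard syslog module (the option name and the message text are my own inventions), the logging call might look like:

  import syslog

  def log_auth_failure(user, source, failures_at_debug=True):
      # Failed authentications default to DEBUG; INFO is the highest
      # priority this should normally use.
      pri = syslog.LOG_DEBUG if failures_at_debug else syslog.LOG_INFO
      syslog.syslog(pri | syslog.LOG_AUTH,
                    "authentication failed for %s from %s" % (user, source))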
2012-09-02
What I would like: testable mailers
Suppose that you are developing a change in your mailer's configuration files. There are very few 'clearly correct' changes to mailer configurations so of course you would like to test your change, to make sure that it does what you want and that it doesn't break anything.
If system administration worked like programming does today, you would fire up your suite of automated tests (they would probably be considered functional tests) and check that they all passed, both the old ones that verified existing functionality and the new tests you added to verify the change you're making. When your tests came up all green, you would have a pretty solid assurance that everything was good.
(Certainly you'd know that you weren't going to mangle email in any of the ways you've done it in the past, because you'd have tests to check for all of those.)
In today's world, well, this doesn't happen. The major reason it doesn't happen is that very few mailers are testable. Testability isn't just being able to test a program (you can always run the program by hand to test it that way); instead, it's the property of being easily tested by automated tests. To be easily tested, you need to expose the ability to ask the program questions and get answers that you can easily automate.
Today, a great deal of mailer functionality is not exposed this way at all and can only be tested through end to end tests where you set up a mailer, run messages through it, and examine the logs and the message deliveries to see what happened to them. When mailers let you ask them limited questions (such as 'what does this address expand to?'), the interfaces for doing this are usually intended for debugging mailer problems instead of verifying mailer functionality. As a result, the output is generally verbose, hard to parse, and not necessarily stable; after all, it was intended to be read by humans (who want a bunch of information and can sort it out themselves) rather than by programs.
A testable mailer would have interfaces that are explicitly designed to be used for automated tests. These interfaces would have stable and minimal output that directly answered your questions and let you control all of the context (in a sophisticated mailer, any number of decisions can depend on things like the source IP). This sort of testability would go well with a true programmable mailer, should we ever get one.
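Purely as an illustration of what I mean (everything here is invented: the 'mailerctl expand' command, its one-result-per-line output, and the addresses), a functional test against such an interface might look like:

  import subprocess
  import unittest

  class TestAliasExpansion(unittest.TestCase):
      def expand(self, addr):
          # Imagined stable, minimal query interface: one resulting
          # delivery address per output line, nothing else.
          out = subprocess.check_output(["mailerctl", "expand", addr])
          return sorted(out.decode().split())

      def test_staff_alias(self):
          self.assertEqual(self.expand("staff@example.com"),
                           ["alice@example.com", "bob@example.com"])

  if __name__ == "__main__":
      unittest.main()

The interesting part is how boring the test is; that's only possible because the imagined query interface gives stable, minimal answers instead of a screenful of debugging output.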