A gotcha with analyzing syslog logs
Consider the following command:
; grep 'gabba gabba hey' /var/log/messages | wc -l
This looks like a perfectly sane and sensible way of doing syslog log
analysis: fish out a pattern, then count or otherwise crunch it up. It's
just the sort of thing Unix tools are built for, and I do it all the
howmany program often gets used here.)
Unfortunately, as I was reminded recently, this appealingly simple approach is not quite correct. It can undercount things, because syslogd likes to save space and time by condensing repeated messages into things like:
Apr 16 03:49:34 gpu last message repeated 26 times
That's a convenient space savings when people are pounding on your mail server, but it does make producing accurate counts in log summaries a bit more complicated than it looks. In fact it was this very line that poked me about the whole issue, after I shuffled our SMTP filtering so that the CBL would be checked first. Once I saw it skimming the raw logs, it became clear to me that I wasn't going to get usefully accurate stats unless I did something.
My approach is crude and brute force: I now have a filter that gets fed logs and de-summarizes syslogd's summarization, expanding 'last message repeated N times' into N more copies of the previous line. The expanded logs can then be fed to other commands, which get to stay straightforward while now being correct. This is not quite accurate (the timestamps will be wrong, but then you can't recover the timestamps anyways) but is close enough for counting purposes.
(If you're just looking for whether or not certain messages show up at all you don't need to worry about this; you are guaranteed at least one copy of each message, even if syslogd then swallows a lot more.)
The Perl script I use for this is just large enough to not fit in this entry. So instead you can find it as expsyslog.pl; see the comments for usage. Typically it's something like:
; expsyslog.pl /var/log/messages | grep 'gabba gabba hey' | wc -l
(Yes, I write Perl scripts every so often. For some jobs you just can't beat its conciseness and convenient command of Unix idioms.)