In search of modest scale structured syslog analysis

September 30, 2016

Every general issue should start from a motivating use case, so here's ours: we want to be able to find users who haven't logged in with SSH or used IMAP in the past N months (this perhaps should include Samba authentication as well). As a university department that deals with graduate students, postdocs, visiting researchers, and various other sorts of ongoing or sporadic collaborations, we have a user population that's essentially impossible to keep track of centrally (and sometimes at all). So we want to be able to tell people things like 'this account that you sponsor doesn't seem to have been used for a year'.

As far as I can tell from Internet searches and so on, there are an assorted bunch of log aggregation, analysis, and querying tools. Logstash is the big one that many people have heard of, but then there's Graylog and fluentd and no doubt others. In theory any of these ought to be the solution to our issue. In practice, there seem to be two main drawbacks:

  • They all seem to be designed for large to very large environments. We have what I tend to call a midsized environment; what's relevant here is that we only have on the order of 20 to 30 servers. Systems designed for large environments seem to be both complicated and heavyweight, requiring things like JVMs and multiple servers and so on.

  • None of them appear to come with or have a comprehensive set of parsers to turn syslog messages from various common programs into the sort of structured information that these systems seem designed to work with. You can write your own parsers (usually with regular expressions), but doing that well requires a relatively deep knowledge of just what messages the programs can produce.
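To illustrate what 'write your own parsers' means in practice, here is a minimal Python sketch. The pattern and field names are my own invention, and it covers only OpenSSH's successful-login message; real coverage would need many more patterns for failures, disconnects, Dovecot, Samba, and so on.

```python
import re

# Covers only OpenSSH's "Accepted ..." success message; a real parser
# needs a pattern per message variant per daemon.
SSHD_ACCEPTED = re.compile(
    r"Accepted (?P<method>\S+) for (?P<user>\S+) "
    r"from (?P<ip>\S+) port (?P<port>\d+)"
)

def parse_sshd(msg):
    """Turn one sshd syslog message into a dict, or None if unrecognized."""
    m = SSHD_ACCEPTED.search(msg)
    if m:
        return dict(m.groupdict(), event="ssh-login")
    return None

rec = parse_sshd("Accepted publickey for cks from 128.100.3.1 port 50236 ssh2")
```

The hard part is not the code but knowing every message format each daemon can emit, which is exactly the deep knowledge mentioned above.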

(In general all of these systems feel as if they're primarily focused on application level logging of structured information, where you have your website or backend processing system or whatever emit structured messages into the logging backend. Or perhaps I don't understand how you're supposed to use these systems.)

We can undoubtedly make these systems solve our problem. We can set up the required collection of servers and services and get them talking to each other (and our central syslog server), and we can write a bunch of grok patterns to crack apart sshd and Dovecot and Samba messages. But all of this feels as if we are using the back of a large and very sharp axe to hammer in a small nail. It works, awkwardly, but it's probably not the right way.

It certainly feels as if structured capturing and analysis of syslog messages from common programs like sshd, Dovecot, and so on in a moderate sized environment ought to be a well solved problem. We can't be the first people to want to do this, so this particular wheel must have been reinvented repeatedly by now. But I can't find even a collection of syslog parsing patterns for common Unix daemons, much less a full system for this.

(If people know of systems or resources for doing this, we would of course be quite interested. There are some SaaS services that do log analysis for you, but as a university department we're not in a position to pay for this (staff time is free, as always).)

Comments on this page:

Sounds like something that can be done with pam_lastlog?

By dozzie at 2016-09-30 09:54:59:

First of all, Fluentd and logstash are message routers, where a message is JSON-compatible data. These routers don't (and normally shouldn't) collect logs at all. Their purpose is to pass whatever message you submit to wherever it should go (Fluentd uses categories, which are external to the message itself).

This means that (a) Fluentd and logstash can be used for whatever messages you generate (logs, monitoring, hardware/software inventory, others) and (b) for logs alone you can be perfectly happy with syslog's own mechanisms.

Now there's the issue of parsing unstructured logs into a JSON-compatible data structure. If you go with Fluentd/logstash, you probably need to parse the logs before submitting them to F/L, and for this I have written a daemon, logdevourer (it uses liblognorm instead of regexps, which is significantly better in long-term use). If you go with syslog forwarding, you can set up log parsing in a central place (if at all; more on that a little later).
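For context, a liblognorm rulebase entry for the same sort of sshd message looks roughly like this. This is from memory of liblognorm's v1 rulebase format, and the field names are illustrative, so check the liblognorm documentation before relying on it:

```
# Illustrative liblognorm rule for OpenSSH's successful-login message;
# the tag (ssh-login) and field names (method, user, ip, port) are my
# own choices, not anything standard.
rule=ssh-login: Accepted %method:word% for %user:word% from %ip:ipv4% port %port:number% ssh2
```

The advantage over raw regexps is that the rule reads like the message it matches, and liblognorm compiles the whole rulebase into one efficient matcher.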

Let's assume you have the logs parsed already. Next you need to store them somewhere. People usually use ElasticSearch for this (with Kibana as a web interface), since it's a search engine and stores JSON documents. This is also more or less what Graylog2 does (or at least used to do three years ago, when I last checked). But since the logs are JSON documents, they can be stored in any document store, like MongoDB, CouchDB, or PostgreSQL (json/jsonb columns), or even in flat files.

If you went with plain syslog forwarding, the logs can still be put into ElasticSearch with some basic parsing (e.g. date, host, program, message), and rsyslog even has a plugin for sending logs to ES. Unfortunately I'm not sure how much gain there will be (a single ES instance used to be slower than grepping through flat files, but it provides a query language, so I counted that as progress).
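As a concrete example, rsyslog's Elasticsearch output plugin (omelasticsearch) can be enabled with a config fragment roughly like the following; the parameter names here are from memory and should be checked against the rsyslog documentation for your version:

```
# Load rsyslog's Elasticsearch output module and ship everything to a
# local ES instance; parameter names may vary by rsyslog version.
module(load="omelasticsearch")
*.* action(type="omelasticsearch"
           server="localhost"
           serverport="9200")
```

This sends each message as a JSON document, so Kibana (or curl queries) can then search on the basic fields rsyslog extracts.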

I'm not sure if I would deploy ElasticSearch for a dozen servers (probably yes, because of the query language and Kibana as a somewhat sensible web interface). For Fluentd/logstash, I probably would use them, especially since I could use them for a monitoring system as well. But deploying a central syslog server is certainly low-hanging fruit.

I hope my explanation will be of some use to you.

By cks at 2016-09-30 12:08:19:

Unfortunately lastlog isn't suitable for us for a number of reasons, including that it's not centralized and that it doesn't capture all 'login' events (eg some ssh connections won't create lastlog entries).

dozzie: thank you, that's all useful information and logdevourer looks interesting. It looks like we'd still need to write patterns for all of the syslog stuff we're interested in, though, unless there's a collection of them somewhere.

(Our approach for ingestion will probably be to have all of the machines forward their syslog logs to the ingestion point (in addition to our central log server), since this seems relatively easy. We could also process the central syslog logs once a day when they get rolled over, but direct transmission seems more fire-and-forget.)
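For what it's worth, once messages have been parsed into JSON records, the 'find idle accounts' query itself is small. Here is a sketch in Python, assuming one JSON record per line with "user" and ISO-format "time" fields (both field names are my invention); note that accounts that never appear in the logs at all have to be checked separately.

```python
import json, datetime

def stale_users(lines, now, months=12):
    """Return users whose most recent login is older than `months` months
    (approximated as 30-day months)."""
    cutoff = now - datetime.timedelta(days=30 * months)
    last = {}
    for line in lines:
        rec = json.loads(line)
        t = datetime.datetime.fromisoformat(rec["time"])
        if rec["user"] not in last or t > last[rec["user"]]:
            last[rec["user"]] = t
    return sorted(u for u, t in last.items() if t < cutoff)

records = [
    '{"user": "cks", "time": "2016-09-01T10:00:00"}',
    '{"user": "old", "time": "2015-01-15T08:30:00"}',
]
now = datetime.datetime(2016, 9, 30)
idle = stale_users(records, now)
```

With records like this, `idle` would contain only "old", since that account's last login is more than twelve 30-day months before September 30 2016.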

By edgar n. at 2016-09-30 12:13:41:

Alternatively, perhaps use some mechanism (perhaps like that pam_lastlog thing posted in the first comment?) to just track this on each system. Could use that lastlog mechanism or use a log monitor like SEC to make a new log file of people's logins on each system. I know SEC can be set up to only log a given thing once in a specified time period.

Hmm, perhaps use SEC to feed logstash the exact info you want, so you don't have to learn too much about logstash formats?

By erlogan at 2016-09-30 14:32:01:

The Elasticsearch/Logstash/Kibana ("ELK") suite is popular in industry for this purpose. All the various centralized pieces can be set to run on the same machine, and I have done so on relatively modest hardware for clients in the past. It's quite good at either running predefined filters or handcrafted ones against source data to extract salient fields.

It does require a JVM, but I find this less onerous than I used to.

By liam at unc edu at 2016-10-05 12:40:24:

Splunk is an option for you if you log less than 500MB a day.

The free-as-in-beer tier is probably too low volume for you - but it's a good tool if you have small enough needs (it's a good tool if you have large needs and a budget too :-))
