2023-12-09
I recently used Grafana Loki for fast, flexible log searching
One of the ways our environment is different from the usual ones is that we have a bunch of different systems and services that lots of people log in to. We have a long-standing central syslog server that collects syslog logs from all of our Linux servers, and one of the things we've long used it for is to search for all of the recent logins across our environment for a particular person. We don't do this all that often, but we do it often enough that we have a script for it (which basically boils down to grep with the right patterns).
We also have a Grafana Loki server. For all that I'm not entirely happy with Loki and can't recommend it at our small scale, I do like using Loki for some things. One of the things that Loki is especially good at is narrow log searches, where you want to look at some specific logs in some specific time period. Recently, I decided to take our central syslog 'find all logins for a person' script and re-do a version of it that used Loki and was hopefully both easier to restrict to a narrow time range and perhaps faster.
(When the dust settled, supporting narrow time ranges required GNU Date crimes.)
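As an illustration of the time range handling (not the actual script, which leans on GNU date), here is a minimal Python sketch that turns short arguments like '2d' or '12h' into the RFC 3339 timestamps that logcli's --from and --to options accept. The argument syntax and the names parse_span and time_range are assumptions made up for this example.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical argument syntax: a number followed by m, h, d or w.
_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_span(text):
    """Turn something like '12h' or '3d' into a timedelta."""
    m = re.fullmatch(r"(\d+)([mhdw])", text.strip())
    if m is None:
        raise ValueError(f"unrecognized time span: {text!r}")
    return timedelta(**{_UNITS[m.group(2)]: int(m.group(1))})


def time_range(span, end_ago="0m", now=None):
    """Return (start, end) as RFC 3339 UTC strings.

    The range is 'span' long and ends 'end_ago' before now.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalize to UTC so the literal 'Z' suffix below is truthful.
    now = now.astimezone(timezone.utc)
    end = now - parse_span(end_ago)
    start = end - parse_span(span)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)
```

With this, time_range("2d", "1d") would describe the two days that ended one day ago, and the two strings could be passed to logcli as the values of --from and --to.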
On the one hand, this wasn't as straightforward as I was hoping it would be, mostly because of peculiar limitations in how logcli behaves. (Logcli is the Loki command line tool for making log queries, so any script like this is going to reach for it as the first option.) LogQL limitations also forced me to make multiple queries to Loki instead of rolling everything into a single one, which meant more work in the script to present log lines in time order across the different services.
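The 'present log lines in time order across separate queries' step can be sketched roughly as follows. This assumes each query's results have already been parsed into (timestamp, service, line) tuples, which is a made-up representation for this example and not what logcli hands you directly.

```python
import heapq


def merge_in_time_order(*result_sets):
    """Merge several per-query result lists into one time-ordered list.

    Each entry is assumed to be a (timestamp, service, line) tuple with a
    comparable timestamp, such as nanoseconds since the epoch.
    """
    # Sort each result set first, so it does not matter what order the
    # individual queries returned their lines in.
    ordered = [sorted(rs, key=lambda entry: entry[0]) for rs in result_sets]
    # heapq.merge then interleaves the already sorted lists by timestamp.
    return list(heapq.merge(*ordered, key=lambda entry: entry[0]))
```

Sorting everything in one go would work just as well at this size; the merge mostly makes it explicit that each query contributes its own ordered stream.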
On the other hand, the result works, and because I was working with LogQL it was straightforward to reformat some of the information into more useful forms by default (for example, defaulting to summarizing sources of IMAP logins rather than reporting each login individually). This reformatting was made easier by LogQL limitations forcing me into those separate queries; since I was only getting one sort of information from each query, it was easy to have LogQL's straightforward pattern matching pull out just the information I was looking for (usually the remote IP address) and report it.
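To make the 'summarize sources instead of reporting each login' idea concrete, here is a small Python sketch that counts logins per remote IP. It does the extraction on the client side with a regular expression rather than in LogQL, and it assumes Dovecot-style log lines that contain 'rip=ADDRESS'; both the line format and the function name are assumptions for this example.

```python
import re
from collections import Counter

# Assumes the remote IP appears as 'rip=ADDRESS' (IPv4 or IPv6),
# as it does in Dovecot login lines.
RIP_RE = re.compile(r"\brip=([0-9a-fA-F.:]+)")


def summarize_sources(lines):
    """Return (remote_ip, login_count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        m = RIP_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common()
```

Lines without a recognizable remote IP are silently skipped here; a real script might want to report them instead.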
(Recasting the syslog script (which was at its heart a giant 'grep' with a big set of patterns) into a script that made separate queries for each sort of information also made it easy to be selective about what information it was reporting. If we only want SSH logins, well, now that's easy.)
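One way to get this sort of selectivity is to keep a table of per-service LogQL queries and only build the ones that were asked for. The sketch below is a guess at the general shape; the label names, selectors, and the simple '|=' line filter are all hypothetical, since the real queries depend on how logs are labeled in Loki.

```python
# Hypothetical stream selectors; real label names depend on the Loki setup.
SELECTORS = {
    "ssh": '{program="sshd"}',
    "imap": '{program="dovecot"}',
}


def build_queries(user, kinds=None):
    """Return {kind: LogQL query} for the requested kinds of logins.

    With no kinds given, every known kind is queried. The user name is
    assumed to be a plain login name with no quotes or backslashes.
    """
    kinds = list(kinds) if kinds else list(SELECTORS)
    unknown = [k for k in kinds if k not in SELECTORS]
    if unknown:
        raise ValueError(f"unknown login kinds: {unknown}")
    # '|=' is LogQL's 'line contains this string' filter.
    return {k: f'{SELECTORS[k]} |= "{user}"' for k in kinds}
```

Asking only for SSH logins is then build_queries("someone", ["ssh"]), and each resulting query string can be handed to a separate 'logcli query' invocation.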
I haven't timed the Loki based script against our original version, but in practice it's basically guaranteed to be faster for many cases simply because it's easier to use a shorter time range in the new script, or only look for certain sorts of logins instead of all of them. Our syslog script uses a large time range by default, which was right for some uses but not for many others, and it was sufficiently painful and obscure to change that we mostly didn't. The Loki script accepts easy to use time arguments and defaults to a much smaller (and more accurate) time range.
(In theory the Loki based script should be faster because even if Loki's decompression and searching aren't as fast as gzip and grep, it's searching far fewer logs since I'm being narrowly selective in log labels. But I haven't tried to specifically time it, and it also does somewhat more than the syslog script because it has access to some non-syslog log data. In practice the Loki based script runs fast enough to be convenient.)
Overall I'm quite glad I got around to writing the Loki version. I expect to use it periodically and be glad that I have it, and I learned a certain amount about logcli that will be useful for the next time.
(Out of curiosity I just did a timing comparison, and for basically the same time duration the syslog version took three minutes and the Loki version two minutes. Shorter duration queries in Loki can be much faster, although there may be caching effects at work. Still, caching effects are useful if we're asking about several different logins, as we sometimes are.)