Wandering Thoughts archives

2023-12-10

Some notes on using the logcli program to query Grafana Loki

One of the pieces of Grafana Loki, sometimes misleadingly described as 'Prometheus for logs', is logcli, an all-purpose command line program for querying Loki in various ways. Some of what it can do is mostly of interest to Loki administrators, but it has two major sub-commands for making LogQL queries for either logs or metrics. I recently wrote a script that dealt with logcli and in the process I learned some things I want to write down for future use, although by the time I use them Loki may have changed some of them.

Logcli has two sub-commands for LogQL queries, 'logcli query' for queries over time and 'logcli instant-query' for instant queries. Although it is technically possible to make metrics queries with 'logcli query', you will normally use 'logcli instant-query' for this. Despite what logcli's help will tell you, instant queries will only output some form of JSON; you can't get their results in tabular form for text presentation, and you'll need to use, for example, jq's options for text output. Instant queries are made at some instant in time (the '--now' argument) and the metrics query itself will normally use some '*_over_time' operator with a duration. If you start out with a start time and an end time in a script, deriving the duration may involve GNU date crimes.
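
As a rough sketch of the mildest version of that date arithmetic, something like the following turns a start and an end time into a duration. It assumes GNU date (for '-d' and '+%s') and it assumes that LogQL is happy with a range duration given in plain seconds. The function name, the '{job="sshd"}' selector, and the timestamps are all made up for illustration, and the logcli invocation is left commented out because I haven't checked it against every logcli version.

```shell
# Turn a start time and an end time into a duration in seconds, e.g. "86400s".
# Assumes GNU date; both arguments are anything 'date -d' understands.
range_seconds() {
    start=$(date -d "$1" +%s) || return 1
    end=$(date -d "$2" +%s) || return 1
    echo "$(( end - start ))s"
}

# Hypothetical usage (the selector and the times are illustrative only):
# dur=$(range_seconds "2023-12-01 00:00" "2023-12-08 00:00")
# logcli instant-query --now "2023-12-08T00:00:00Z" \
#     "sum(count_over_time({job=\"sshd\"}[$dur]))"
```

Working in seconds sidesteps having to decide which larger unit the difference divides evenly into.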

To get log lines themselves, you start with 'logcli query'. For time ranges, you can give it either --from and --to together or '--since DUR', which is implicitly relative to 'now'. The largest time duration modifier LogQL and logcli accept is hours; if you want to query over days or weeks, you get to convert that into hours yourself. Because a LogQL query searches for a single thing in a single set of logs, if you want to get multiple sorts of logs, for example both SSH logins and IMAP logins, you'll need to run 'logcli' twice, each time with a separate query. If you want to put things in time order regardless of what query the log line came from, 'sort -V' is your brute force tool (in combination with some option to force the log lines to be presented as a single line with the timestamp first).

(Also, 'logcli query' defaults to printing log lines in reverse time order, so you probably want 'logcli query --forward ...', unless you're already using 'sort -V' and don't care.)
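
A minimal sketch of the two chores above, converting days into hours and merging the output of separate queries back into time order, might look like this. The function names and the selectors in the commented-out usage are hypothetical, and the merge assumes each log entry has already been printed as one line that starts with its timestamp, all in the same timezone and format.

```shell
# Convert a whole number of days into an hours duration, e.g. 7 -> "168h".
days_to_hours() {
    echo "$(( $1 * 24 ))h"
}

# Merge saved query output files into a single stream in time order.
# Assumes one log entry per line with the timestamp at the start of the line.
merge_by_time() {
    sort -V "$@"
}

# Hypothetical usage (selectors are illustrative only):
# logcli query --since "$(days_to_hours 7)" '{job="sshd"}' >ssh.out
# logcli query --since "$(days_to_hours 7)" '{job="dovecot"}' >imap.out
# merge_by_time ssh.out imap.out
```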

By default, 'logcli query' (silently) limits the output to 30 log entries. If you use '--limit 0', logcli issues multiple requests to Loki, each one asking for '--batch' log entries (1000 by default) and working out the time range that needs to be covered by the query. You can see this if you look carefully at the queries that logcli reports, and it's sort of covered in the logcli documentation for batched queries. However, even with '--limit 0' logcli (and perhaps Loki) will have problems reporting all of the log lines over a long enough time interval. To get around this you seem to need to use 'logcli query' parallelization, which is currently documented only in 'logcli help query' and then only vaguely (this is a Loki tradition). The simplest way to use query parallelization is to use 'logcli query --limit 0 --parallel-max-workers N' where N is some reasonable number like the number of CPUs you have. Apparently this can make the logs be out of order, which is another reason to put them back in the right order with 'sort -V'.
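
Put together, the approach above could be wrapped up roughly as follows. This is only a sketch: the function names are mine, it assumes 'nproc' from GNU coreutils is a fair stand-in for 'the number of CPUs you have' (with an arbitrary fallback of 4), and it uses only the logcli options mentioned above, which may need companions in other logcli versions.

```shell
# Pick a worker count: the CPU count if nproc is available, otherwise 4.
worker_count() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        echo 4
    fi
}

# Hypothetical wrapper: fetch everything for a query with parallel workers,
# then restore time order, since parallel output may be out of order.
# Arguments are passed straight through to 'logcli query'.
fetch_all() {
    logcli query --limit 0 --parallel-max-workers "$(worker_count)" "$@" |
        sort -V
}
```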

(In my traditional Loki experience, I don't really understand what's going on here and I couldn't find the answers when I poked at the documentation.)

In theory a LogQL metrics query ought to be more efficient and more reliable than dumping out the necessary information from the log lines and then generating the metric yourself. In practice, my metric queries started failing once the duration got long enough, so I abandoned doing them in favour of printing the necessary information from each log line and feeding it through, for example, 'sort | uniq -c | sort -nr'. This also got me out of the business of reformatting 'logcli instant-query' JSON into something textual.

(Because I was getting my metrics with 'sum( count_over_time(...) ) by (...)', it's possible that the inner count_over_time() had a high label cardinality (although most of them were ignored by the sum()) and that's what blew up Loki. I don't know, all I know is that now that I'm working out the metrics myself outside of Loki, it works.)
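
The do-it-yourself counting is simple enough to sketch. The function name is made up, and the commented-out extraction step is a hypothetical example that assumes sshd's usual 'Accepted ... for USER from ...' message format; what you actually extract depends on your logs.

```shell
# Count how often each distinct input line occurs, most frequent first.
count_values() {
    sort | uniq -c | sort -nr
}

# Hypothetical usage: count successful SSH logins per user.
# logcli query --limit 0 --since 24h '{job="sshd"} |= "Accepted"' |
#     sed -n 's/.*Accepted [a-z-]* for \([^ ]*\) from .*/\1/p' |
#     count_values
```

This is the moral equivalent of a 'sum( count_over_time(...) ) by (...)' metrics query, with all of the counting done outside Loki.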

sysadmin/GrafanaLokiLogcliNotes written at 23:18:17;

