2023-12-10
Some notes on using the logcli program to query Grafana Loki
One of the pieces of Grafana Loki,
sometimes misleadingly described as 'Prometheus for logs', is logcli,
an all-purpose command line program for querying Loki in various
ways. Some of what it can do is mostly of interest to Loki
administrators, but it has two major sub-commands for making LogQL
queries for either logs or metrics.
I recently wrote a script that dealt with logcli and in the process I learned
some things I want to write down for future use, although by the
time I use them Loki may have changed some of them.
Logcli has two sub-commands for log queries, 'logcli query' for
queries over a time range and 'logcli instant-query' for instant
queries. Although it is technically possible to make metrics queries
with 'logcli query', you will normally use 'logcli instant-query'
for this. Despite what logcli's help will tell you, instant queries
will only output some form of JSON; you can't get their results in
tabular form for text presentation, and you'll need to use, for
example, jq's options for text output.
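As a sketch of that last step, here is one way to turn an instant
query's vector result into plain text with jq. The JSON shape below is
an assumption on my part (it's the Prometheus-style vector format that
Loki's query API returns; logcli's own output options may wrap it
differently), and the 'host' label and the values are made up:

```shell
# Hypothetical: flatten a Loki instant-query vector result into text.
# This JSON is a hand-written stand-in for real query output.
cat >/tmp/iq.json <<'EOF'
[
  {"metric": {"host": "a", "prog": "sshd"}, "value": [1702200000, "42"]},
  {"metric": {"host": "b", "prog": "sshd"}, "value": [1702200000, "17"]}
]
EOF
# -r gives raw (unquoted) text; print 'host value' pairs, one per line.
jq -r '.[] | "\(.metric.host) \(.value[1])"' /tmp/iq.json
```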
Instant queries are made at some instant in time (the '--now'
argument) and the metrics query itself will normally use some
'*_over_time' operator with a duration. If you start out with
a start time and an end time in a script, deriving the duration may
involve GNU Date crimes.
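One version of those GNU Date crimes looks like this (the start and
end times here are made-up examples, pinned to UTC so the arithmetic
is predictable):

```shell
# Sketch: derive a LogQL-usable duration from a start and end time.
# Requires GNU date for '-d'. Times are examples.
START="2023-12-01 00:00 UTC"
END="2023-12-10 12:00 UTC"
secs=$(( $(date -d "$END" +%s) - $(date -d "$START" +%s) ))
# LogQL durations top out at hours, so round the duration up to hours.
hours=$(( (secs + 3599) / 3600 ))
echo "${hours}h"
```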
To get log lines themselves, you start with 'logcli query'. For
time ranges, you can give it either --from and --to together or
'--since DUR', which is implicitly relative to 'now'. The largest
time duration unit LogQL and logcli accept is hours; if you
want to query over days or weeks, you get to convert that into
hours yourself. Because a LogQL query searches for a single thing
in a single set of logs, if you want to get multiple sorts of logs,
for example both SSH logins and IMAP logins, you'll need to run
'logcli' twice, each time with a separate query. If you want to put
things in time order regardless of which query a log line came
from, 'sort -V' is your brute force tool (in combination with
some option to force the log lines to be presented as a single line
with the timestamp first).
(Also, 'logcli query' defaults to printing log lines in reverse time order, so you probably want 'logcli query --forward ...', unless you're already using 'sort -V' and don't care.)
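The brute-force merge looks like this. The two files stand in for
the output of two separate 'logcli query' runs (say SSH logins and
IMAP logins); the log lines are invented, but the important part is
real: with the timestamp first on each line, 'sort -V' interleaves
the streams into time order:

```shell
# Sketch: merge two single-query outputs back into one timeline.
# These files are hand-written stand-ins for two logcli runs.
cat >/tmp/ssh.log <<'EOF'
2023-12-10T09:00:01Z sshd: accepted login for fred
2023-12-10T09:05:42Z sshd: accepted login for barney
EOF
cat >/tmp/imap.log <<'EOF'
2023-12-10T09:03:10Z dovecot: imap login for wilma
EOF
# Version sort compares the digit runs in the timestamps numerically.
sort -V /tmp/ssh.log /tmp/imap.log
```

(And since durations top out at hours, something like
"--since $((14 * 24))h" is how you'd spell 'the last two weeks'.)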
By default, 'logcli query' (silently) limits the output to 30 log entries. If you use '--limit 0', logcli issues multiple requests to Loki, each one asking for '--batch' log entries (1000 by default) and working out the time range that needs to be covered by the query. You can see this if you look carefully at the queries that logcli reports, and it's sort of covered in the logcli documentation for batched queries. However, even with '--limit 0' logcli (and perhaps Loki) will have problems reporting all of the log lines over a long enough time interval. To get around this you seem to need to use 'logcli query' parallelization, which is currently documented only in 'logcli help query' and then only vaguely (this is a Loki tradition). The simplest way to use query parallelization is to use 'logcli query --limit 0 --parallel-max-workers N' where N is some reasonable number like the number of CPUs you have. Apparently this can make the logs be out of order, which is another reason to put them back in the right order with 'sort -V'.
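Putting the pieces from above together, the full-output parallel
query looks something like the following. logcli isn't actually run
here (the command is just assembled and shown); the flags are the
ones discussed above, N=4 is an arbitrary worker count, and the
query '{job="sshd"}' is a made-up example:

```shell
# Sketch: a 'get everything, in parallel' logcli invocation.
# Assembled but not executed; the query selector is hypothetical.
set -- logcli query --forward --limit 0 --parallel-max-workers 4 \
    --since "$((14 * 24))h" '{job="sshd"}'
echo "$@"
# To actually run it, you would invoke: "$@"
```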
(In what I've come to see as traditional Loki fashion, I don't really understand what's going on here and I couldn't find the answers when I poked at the documentation.)
In theory a LogQL metrics query ought to be more efficient and more reliable than dumping out the necessary information from the log lines and then generating the metric yourself. In practice, my metric queries started failing once the duration got long enough, so I abandoned doing them in favour of printing the necessary information from each log line and feeding it through, for example, 'sort | uniq -c | sort -nr'. This also got me out of the business of reformatting 'logcli instant-query' JSON into something textual.
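The do-it-yourself metrics pipeline is the classic one. The input
here is a made-up stand-in for whatever field (a login name, say)
you print from each log line:

```shell
# Sketch: count occurrences outside of Loki instead of using a
# metrics query. Input lines are invented example data.
cat >/tmp/logins.txt <<'EOF'
fred
barney
fred
wilma
fred
barney
EOF
# sort groups duplicates, uniq -c counts them, sort -nr ranks them.
sort /tmp/logins.txt | uniq -c | sort -nr
```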
(Because I was getting my metrics with 'sum( count_over_time(...) )
by (...)', it's possible that the inner count_over_time() had a
high label cardinality (although most of them were ignored by the
sum()) and that's what blew up Loki. I don't know; all I know is
that now that I'm working out the metrics myself outside of Loki,
it works.)