Grafana Loki and what can go wrong with label cardinality

July 18, 2022

Grafana Loki (documentation) is described as 'Prometheus for logs', or to quote its website, it's 'a log aggregation system designed to store and query logs from all your applications and infrastructure'. Similar to Prometheus, it has the idea of data points having both labels and a value (and a timestamp); where in Prometheus the value is always a number, in Loki the 'value' is the log message. On a modern Linux system, the obvious easy way to get started with Loki is to use the Promtail agent to ship the systemd journal into your Loki server. How to do this is covered in the "journal" section of the Promtail configuration documentation, and there's a convenient example journal configuration.

One of the nice things about the systemd journal is that messages logged in the journal come with a lot of metadata, as covered in systemd.journal-fields. The promtail journal collector allows you to turn some (or all) of these systemd metadata fields into labels. If you aren't shipping the raw JSON from journald to Loki as the 'log message', turning metadata into labels is the only good way to preserve it for later examination and use. Unfortunately there is a problem here, because Loki is more like Prometheus than you'd like.

The Loki documentation on labels will tell you that each combination of labels defines a stream. All log messages are associated with a stream, which serves to aggregate them together; like Prometheus metrics, this aggregation is defined by the labels and their label values. The documentation tells you:

High cardinality causes Loki to build a huge index (read: $$$$) and to flush thousands of tiny chunks to the object store (read: slow). Loki currently performs very poorly in this configuration and will be the least cost-effective and least fun to run and use.

This is all well and good, but suppose that you plan to operate a small scale Loki setup, where you'll be feeding it the logs for a modest number of systems and you don't really care if it's not that efficient. You might decide that a certain amount of cardinality explosion is okay because you really want to capture various attractive bits of systemd journal metadata, such as the process ID or session ID (so that you can at least search for all log messages from a particular process or session).
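
To put rough numbers on what 'a certain amount of cardinality explosion' means, here is a minimal Python sketch. It is not Loki code; the host names, unit names, and process counts are made up for illustration. It simply counts distinct label sets, which is what Loki treats as distinct streams.

```python
def count_streams(entries, label_names):
    """Count distinct label sets (what Loki calls streams) for the chosen labels."""
    return len({tuple((name, e.get(name)) for name in label_names) for e in entries})

# Hypothetical sample: 3 hosts, 2 units per host, 50 short-lived processes per unit,
# each process logging a single message.
entries = [
    {"host": f"host{h}", "unit": unit, "pid": str(1000 + h * 1000 + u * 100 + p)}
    for h in range(3)
    for u, unit in enumerate(("cron.service", "sshd.service"))
    for p in range(50)
]

low = count_streams(entries, ["host", "unit"])          # labels without the PID
high = count_streams(entries, ["host", "unit", "pid"])  # same logs, PID as a label
print(low, high)
```

With these made-up inputs, adding the PID label turns a handful of streams into one stream per process, even though the log volume is unchanged.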

So you start up your test Loki server and you feed it some systemd journal data from various systems through promtail, and you hook it up to Grafana (and query it in Grafana's 'Explore' stuff), and everything looks fine. Since this is a basic setup, you're using local filesystem storage. One day (hopefully very early), you happen to look in your /data/loki/db/chunks directory (or wherever you're storing it) and you notice that you have tens of thousands of files and almost all of them are very small, around 512 bytes or less. This is not good on basically any filesystem; very few of them handle tens or hundreds of thousands of small files very well, and some handle them very badly.
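
If you'd rather measure this than eyeball a directory listing, a small Python sketch along these lines will do. The function name and the 512-byte default are my choices for illustration, not anything Loki provides.

```python
import os

def count_small_files(root, threshold=512):
    """Return (total_files, small_files) for files under root.

    A file counts as small if it is at most `threshold` bytes.
    """
    total = small = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # the file vanished while we were scanning; skip it
            total += 1
            if size <= threshold:
                small += 1
    return total, small
```

Running `count_small_files("/data/loki/db/chunks")` (or wherever your chunks live) gives you the total file count and how many of them are tiny, which is a quick way to notice the problem early.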

What has happened to you is most lucidly explained by the write path section of the architecture documentation, specifically the picture. What is going on is that each separate log stream is being stored in a separate file; each separate chunk is also a separate file. When you have a high cardinality label, each separate label value for it creates a new chunk file. It's very likely that this chunk file will only have a few log messages in it (maybe even one). As a result, you basically wind up with 'one file per log message', which doesn't work very well.

(You'll get some long-running label sets that give you big chunk files, but there will be a long tail of very small chunk files.)

Ironically, the example promtail 'journal' configuration actually suffers from this cardinality explosion to some degree, because it adds a label for the systemd unit. The systemd unit for log messages can be a session scope, which has a name like 'session-341255.scope'. That number counts up, which gives you lots of cardinality. To fix this you're going to need to relabel all session scope units to a single name:

    - source_labels: ['unit']
      regex: 'session-\d+\.scope'
      replacement: 'session-NNN.scope'
      target_label: unit
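
As a sanity check on the pattern, here is a small Python sketch of what this kind of 'replace' rule does to unit names. It is only an approximation: Python's `re` stands in for the RE2 engine, `re.fullmatch` stands in for the way relabel regexes are anchored at both ends, and `relabel_unit` is a hypothetical helper. The sketch writes the dot as `\.` so that only a literal '.scope' suffix matches.

```python
import re

def relabel_unit(unit, pattern=r"session-\d+\.scope", replacement="session-NNN.scope"):
    """Mimic a 'replace' relabel rule: rewrite on a full match, else keep the value."""
    return replacement if re.fullmatch(pattern, unit) else unit

units = ["session-341255.scope", "session-341256.scope", "cron.service"]
relabeled = sorted({relabel_unit(u) for u in units})
print(relabeled)
```

All session scopes collapse into a single label value, while other units pass through unchanged. With an unescaped `.` the pattern would also accept names like 'session-12xscope', which is harmless in practice but not what was meant.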

It's an unfortunate limitation of Loki that it doesn't have any way other than labels to attach metadata to messages. I would be perfectly happy to have much of the systemd metadata merely preserved so that I could filter on it (as opposed to the efficient label searches).

You can preserve all of the systemd metadata in a way that doesn't cause cardinality explosions, but you have to ship journal entries to Loki as JSON blobs instead of (only) the message. Sending JSON to Loki means that by default what you see when you search or process your logs is large JSON blobs, with the message hiding in them; to see (only) the messages and not get drowned by the extra metadata, you have to post-process the logs every time you look at them. Your Loki-stored journal log 'messages' will also wind up different than most other log messages that you ship to Loki (for example, log messages from actual log files that aren't in the journal and you need to read separately).
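
The tradeoff can be sketched in Python. This is an illustration of the idea only, not how promtail works internally; `to_loki_entry`, the choice of `_HOSTNAME` as the sole label, and the sample entry are all assumptions of mine. The field names themselves are standard journal fields.

```python
import json

def to_loki_entry(journal_fields, label_fields=("_HOSTNAME",)):
    """Split a journal entry into (labels, line).

    Only a few low-cardinality fields become labels; every field, including
    MESSAGE, is kept inside the JSON log line.
    """
    labels = {
        name.lstrip("_").lower(): journal_fields[name]
        for name in label_fields
        if name in journal_fields
    }
    line = json.dumps(journal_fields, sort_keys=True)
    return labels, line

# Hypothetical journal entry.
entry = {
    "MESSAGE": "Started Session 341255 of user cks.",
    "_HOSTNAME": "apps0",
    "_PID": "1",
    "_SYSTEMD_UNIT": "session-341255.scope",
}
labels, line = to_loki_entry(entry)
print(labels)
print(line)
```

The PID and the session scope name survive in the log line, where they can be filtered on, without creating a new stream for every value.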

It's possible to somewhat patch up this difference and reformat the systemd JSON. Since I worked this out, I'm going to document the LogQL necessary:

{...} | json | line_format "{{ .MESSAGE | default __line__}}"

This will handle both systemd JSON logs and non-JSON logs from elsewhere at the same time (although Grafana Explore may plaintively tell you that there were errors processing some lines, which is true for lines that weren't in JSON). If you don't need to handle both at once, the simpler version is:

{...} | json | line_format "{{ .MESSAGE }}"

Journald logs always have a MESSAGE (JSON) field and (one hopes) their JSON always parses in Loki.
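
For clarity, here is roughly the effect of the first LogQL expression written out in Python. This is a sketch of the behaviour as I understand it, not Loki's implementation, and `display_line` is a hypothetical name: if a line parses as JSON and has a non-empty MESSAGE field, show that, and otherwise show the line unchanged.

```python
import json

def display_line(line):
    """Return MESSAGE from a journald JSON line, or the line itself otherwise."""
    try:
        fields = json.loads(line)
    except json.JSONDecodeError:
        return line  # not JSON at all, e.g. a line from an ordinary log file
    if isinstance(fields, dict) and fields.get("MESSAGE"):
        return str(fields["MESSAGE"])
    return line  # valid JSON but no usable MESSAGE field

print(display_line('{"MESSAGE": "Accepted publickey for cks", "_PID": "4242"}'))
print(display_line("plain text from a regular log file"))
```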

Although I don't like it, I suspect that we're going to wind up sending systemd journal entries to Loki in JSON format, in order to preserve all of the metadata. We already have a central syslog server for when we just want to read the text of log messages.

(This elaborates on a Fediverse toot and a tweet.)

Comments on this page:

By Walex at 2022-07-19 18:51:48:

The issue is not really high cardinality labels, it is a far more general one, that lazy and silly programmers love to use filesystems as if they were small-record databases (and POSIX filesystems are particularly bad at that). Here is a comparison of creating many records as individual files or as records in a simple database:

And a similar test some time later:

More comments:

Especially but not only web developers, because in many cases their careers depend so much more on delivering cool-looking demos to management and investors than on delivering scalable and reliable applications.

By cks at 2022-07-19 19:07:03:

I don't think this is lazy use of the filesystem, because individual records are not being stored in separate files. Instead, collections of records are being stored in separate files, one file per collection. If you use Loki labels the way you're supposed to, these collections will be large because plenty of log lines will be aggregated together into each collection. It is the combined set of choices of 'one collection per file' and '(often) one or two log lines per collection' that create the issue (and you're not supposed to do the second). If you don't have a label cardinality issue, you wind up with increasingly large files for each collection as lots of log lines are aggregated together.

I can't particularly blame Loki for optimizing their filesystem storage for how you're supposed to use their system, instead of building a more complicated filesystem storage backend.

Written on 18 July 2022.

Last modified: Mon Jul 18 22:33:42 2022