Grafana Loki doesn't compact log chunks and what this means for you

February 21, 2023

Suppose that you have Grafana Loki set up to collect system logs from a midsized number of Linux servers, say a hundred of them. Logs in Loki need labels so you can narrow searches down and understand the context of the messages; at least the host name of the server, and then it's sensible to want things like the systemd unit, the syslog priority, the syslog facility (since not all messages come from a systemd unit), and likely the syslog identifier (especially since you probably don't want to log in JSON format). Once you've carefully avoided certain traps with systemd unit names, this will probably give you several thousand unique combinations of labels (what Loki calls streams). While this is maybe a bit large, those label combinations will be reused over time and it may not sound like too much from the perspective of Prometheus (where 10,000 metrics from a single server is not unheard of). Unfortunately, with the current Loki defaults you've just aimed a gun at your foot, one that you may not discover for some time, because of the combination of two issues: many of these log streams will only get messages infrequently, and Loki doesn't compact log chunks once they've been written out.

Many of these streams of log messages are low volume and infrequent because some parts of their labels are uncommon. Some systemd units only emit a few messages a day; systems may not log kernel messages for weeks; 'warning' level syslog messages (especially for specific units) are uncommon, and so on. Certainly you'll have some frequent, high volume log streams, but you'll also have many that only emit at most a few messages a day. These low volume streams are where your problems are in a default Loki configuration.

Loki stores log data in chunks. As we've seen when things go wrong with Loki label cardinality, each chunk stores only a single stream's data, and in Loki's filesystem 'object' storage, each chunk is a separate file. And chunks are immutable once written out; they never get aggregated together the way Prometheus will eventually compact and aggregate blocks of metrics data. Although Loki has a component called the 'compactor', the compactor only acts on the index to chunks, not chunks themselves. So what you see once a chunk is written out (to the filesystem or to your cloud object store) is what you'll always have until the chunk is deleted for some reason (such as exceeding a retention limit). If you write out tiny chunks, that's what you're stuck with.

Chunks are written out by the Loki ingester component. In the default ingester configuration, a chunk is written out for a stream once it reaches about 1.5 MBytes of compressed size, once it's two hours old, or once it's been idle (has received no further log data) for 30 minutes. For low frequency log streams, that 30 minute idle timeout means that essentially every time you generate a (small) burst of log messages (or even a single log line) in the stream, you'll write out a new, tiny chunk, because you won't get any more messages for that stream within 30 minutes. For moderate frequency log streams (ones that generate at least one log line every 30 minutes so they don't go idle), the two hour maximum age means that you'll get a sequence of small log chunks two hours apart. Since chunks don't get touched after being written, those small chunk files will stay as they are even when you later receive more log messages in the stream. The new messages don't get appended to the end of the existing chunk, and when they get written out in another chunk, that chunk will never be compacted together with the first one.

(This behavior is very different from Prometheus, where all of your metrics are stored together and on top of that, they get compacted regularly.)
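To make the three flush rules concrete, here is a rough Python sketch that counts how many chunks a single stream would produce under them. It is a simplified model, not Loki's actual ingester logic: the function name count_chunks and the event format are made up for illustration, and it decides flushes when the next log line arrives rather than on a timer (which gives the same chunk count).

```python
# Default ingester limits as described in the text (treated as assumptions here).
CHUNK_IDLE_PERIOD = 30 * 60      # seconds with no new data before a chunk is flushed
MAX_CHUNK_AGE = 2 * 60 * 60      # seconds after which a chunk is flushed regardless
CHUNK_TARGET_SIZE = 1_500_000    # roughly 1.5 MBytes of compressed data


def count_chunks(events, idle=CHUNK_IDLE_PERIOD, max_age=MAX_CHUNK_AGE,
                 target_size=CHUNK_TARGET_SIZE):
    """Count the chunks written for one stream.

    events is a time-sorted list of (timestamp_seconds, compressed_bytes).
    """
    chunks = 0
    start = last = None          # first and most recent timestamps in the open chunk
    size = 0
    for ts, nbytes in events:
        if start is not None and (ts - last >= idle or ts - start >= max_age):
            # The open chunk went idle or aged out before this line arrived.
            chunks += 1
            start = None
        if start is None:
            start, size = ts, 0  # open a fresh chunk for this line
        last = ts
        size += nbytes
        if size >= target_size:
            chunks += 1          # flushed because it reached the target size
            start = None
    if start is not None:
        chunks += 1              # the final open chunk is eventually flushed too
    return chunks


# One log line every six hours for a day: every line becomes its own chunk.
every_6h = [(h * 3600, 200) for h in (0, 6, 12, 18)]
print(count_chunks(every_6h))    # 4

# One log line every ten minutes for a day: never idle, but cut every two hours.
every_10m = [(m * 60, 200) for m in range(0, 1440, 10)]
print(count_chunks(every_10m))   # 12
```

Passing a week (7 * 24 * 3600 seconds) for both idle and max_age collapses each of these example streams into a single chunk for the day, which is the effect that raising the ingester limits is after.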

The net effect is that even if you have only a few thousand sets of unique labels (ie, log streams), and a lot of those log streams generate log messages slowly, you can easily wind up creating tens of thousands of (small) chunks a day. If you use Loki's filesystem store these will all go into one directory, and in six months you may accumulate ten million files in this directory, which is a problem because filesystems can limit how many entries a single directory may hold.

(Even if you use a cloud object store, you may not be very happy.)
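If you want to see where you stand, a small Python sketch along these lines can count the files in the chunk directory and project growth. The function names are invented for this example, the directory path in the usage note is a placeholder for wherever your filesystem store actually lives, and the projection simply assumes a constant number of new chunks per day.

```python
import os


def count_files(directory):
    """Count regular files directly inside directory.

    os.scandir is used so that a huge directory is streamed rather than
    loaded into one big list.
    """
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_file(follow_symlinks=False))


def projected_files(current, chunks_per_day, days):
    """Project the file count, assuming a constant rate of new chunks per day."""
    return current + chunks_per_day * days


# At 30,000 new chunks a day, an empty directory reaches this many files
# after roughly six months (182 days):
print(projected_files(0, 30_000, 182))   # 5460000
```

For a real check you would call count_files() on your own chunk directory (for example something like "/path/to/loki/chunks", which is only a placeholder) on two different days and use the difference as chunks_per_day.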

To avoid this, you need to drastically increase several ingester configuration parameters, even though this will raise Loki's memory requirements. You'll want to raise max_chunk_age, probably into the range of days (we set it to a week); ideally set chunk_idle_period to the same value as max_chunk_age, so that streams never time out as idle; and likely increase chunk_target_size as well, so that even active streams write fewer chunks. If you've got an operating Loki environment, you can estimate your requirements from the number of streams you have over a day or three. If 'logcli series --analyze-labels --since=168h "{}"' says that you have 7,500 streams over the past week, then a maximum chunk age of a week means you'll write at least 7,500 chunk files every week, which is still roughly 390,000 files a year (7,500 times 52), and possibly more.

(Your low volume logs will be written out at the maximum chunk age, but your higher volume logs will be written out more frequently when they hit your ingester chunk target size.)
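The estimate above is simple enough to do by hand, but as a sketch in Python it looks like this. The function name is made up for illustration, and the result is only a lower bound: it assumes every stream seen in one maximum-chunk-age period shows up again in each later period and writes exactly one chunk per period, while higher volume streams will write more.

```python
def min_chunks_per_year(streams_per_period, max_chunk_age_days):
    """Lower bound on chunk files written per year.

    streams_per_period: distinct streams seen over one max_chunk_age period
                        (for example, as reported by logcli for that window).
    max_chunk_age_days: the configured maximum chunk age, in days.
    """
    periods_per_year = 365 / max_chunk_age_days
    return round(streams_per_period * periods_per_year)


# 7,500 streams a week with a one week maximum chunk age:
print(min_chunks_per_year(7_500, 7))   # 391071
```

This uses 365 days rather than 52 even weeks, so it comes out slightly above 7,500 times 52, but it is the same ballpark of almost 400,000 files a year.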

The current Loki documentation on Loki's filesystem storage does mention this, but in my opinion it doesn't mention it prominently enough. This isn't a scaling issue; it's a simple operational issue (and on top of that it applies beyond the filesystem storage backend, per issue 5605). If Loki considers the filesystem store unsuitable for any real use, it should say so explicitly, and it should say that Loki is not like Prometheus in this way (since one Loki tag line is 'Prometheus for logs', which leads people like me to have certain expectations).

(This is the background to some Fediverse posts.)

Written on 21 February 2023.

Last modified: Tue Feb 21 22:24:14 2023