What I think goes wrong periodically with our Grafana Loki on restarts

January 29, 2024

We have now had two instances where restarting Grafana Loki caused it to stop working. Specifically, shortly after restart, Loki began logging a flood of mysterious error messages of the form:

level=warn ts=2024-01-29T19:01:30.[…]Z caller=logging.go:123 [...] msg="POST /loki/api/v1/push (500) 148.309µs Response: \"empty ring\\n\" [...] User-Agent: promtail/2.9.4; [...]"

This is obviously coming from promtail trying to push logs into Loki, but I got 'empty ring' errors from trying to query the logs too. With a flood of error messages and these messages not stopping, both times I resorted to stopping Loki and deleting and restarting its log database (which we've also had to do for other reasons).

As far as I can tell from Internet searches, what Loki's 'empty ring' error message actually means here is that some component of Loki has not (yet) started properly. Although we operate it in an all in one configuration that I can't recommend, Loki is 'supposed' to be operated as a cooperative fleet of a whole horde of individual microservices, or at least as three separate services ("three-target" mode). To operate in these modes with possibly multiple instances of each (micro)service, Loki uses hash rings to locate which instance of a particular component should be used. When Loki reports an 'empty ring' error, what it means is that there's nothing registered in the hash ring it attempted to use to find an instance. Which hash ring? Loki doesn't tell you; you're presumably expected to deduce it from context. Although we're operating Loki in an all-in-one configuration, Loki apparently still internally has hash rings (most of them probably with exactly one thing registered) and those hash rings can be empty if there are issues.

(As best I can tell from Loki metrics, our current configuration has hash rings for the ingester, the (index) compactor, and the (query?) scheduler.)

Since this error comes from promtail log pushing, the most likely component to have not registered itself is the ingester, which receives and processes incoming log lines, eventually writing them to your chunk storage, which in our case is the filesystem. The ingester doesn't immediately write each new log line to storage; instead it aggregates them into those chunks in memory and then writes chunks out periodically (when they are big enough or old enough). To avoid losing chunks that aren't yet full if Loki is stopped for some reason, the ingester uses a write ahead log (WAL). As the Loki documentation says, when the ingester restarts (which in our case means when Loki restarts), it must replay the WAL into memory before 'registering itself as ready for subsequent writes' to quote directly from the documentation. I have to assume that what Loki really does is that the ingester replays the WAL before adding itself to the ingester hash ring. So while WAL replay is happening there is probably no ingesters registered in your ingester hash ring and attempts to push logs will fail (well, be rejected) with 'empty ring'.

Due to how Loki actually creates and stores chunks and doesn't compact them we try to have as few chunks as possible, which means that we have a very long chunk lifetime and maximum chunk size. This naturally leads to having a lot of chunks and log data sitting in the memory of our Loki process, and (probably) a big WAL, although how big will depend partly on timing. The ingester's WAL can be configured to be flushed on regular shutdown (the flush_on_shutdown wal option) but we have historically turned this off so that Loki restarts don't flush out a bunch of small chunks (plus flushing a big WAL will take time). So after our Loki has been running for long enough, when it shuts down it will have a large WAL to replay on startup.

So what I believe happened is that our configuration wound up with a very big ingester WAL, and when Loki started, the ingester just sat there replaying the WAL (which is actually visible in the Loki metrics like loki_ingester_wal_replay_active and loki_ingester_wal_recovered_bytes_total). Since the ingester was not 'ready', it did not register in the ingester hash ring, and log pushing was rejected with 'empty ring'. Probably if I had left Loki alone long enough (while it spewed messages into the log), it would have finished WAL replaying and all would have been fine. There's some indication in historical logs that this has actually happened in the past when we did things like reboot the Loki host machine for kernel updates, although to a lesser extent than this time. Deleting and restarting the database fixes the problem for the obvious reason that with no database there's no WAL.

(This didn't happen on my Loki test machine because my test machine has far fewer things logging to it, only a couple of other test machines. And this also explains how the first time around, reverting to our previous Loki version didn't help. We'd have seen the same problem if we'd restarted Loki without an upgrade, which is accidentally what happened this time.)

Probably the most important fix to this is to enable flushing the WAL to chunk storage on shutdown (along with vastly lengthening systemd's shutdown timeout for Loki, since this flushing may take a while). In practice we restart Loki very infrequently, so this won't add too many chunks (although it will make me more reluctant to restart Loki), and when it works it will avoid having to replay the WAL on startup. A related change is to raise the ingester wal parameter replay_memory_ceiling, because otherwise we'll wind up flushing a bunch of chunks on startup if we start with a big WAL (for example, if the machine lost power or crashed). And the broad fix is to not take 'empty ring' failures seriously unless they last for quite a long time. How long is a long time? I don't know, but probably at least ten minutes after startup.

(I believe that promtail will keep retrying after receiving this reply from Loki, and we have relatively long retry times configured for promtail before it starts discarding logs. So if this issue clears after ten or twenty minutes, the only large scale harm is a massive log spam.)

PS: Based on past experience, I won't know if I'm right for a fairly long time, probably at least close to a year.

Written on 29 January 2024.
« Servers are (probably) starting to drop serial ports
Putting a Python executable in venvs is probably a necessary thing »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Jan 29 21:13:47 2024
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.