I can't recommend serious use of an all-in-one local Grafana Loki setup

April 27, 2023

Grafana Loki is often (self-)described as 'Prometheus for logs'. Like Prometheus, it theoretically has a simple all-in-one local installation mode (a type of monolithic deployment), where you install the Loki server binary, point it at some local disk space, and run Promtail to feed your system logs (i.e., the systemd journal) into Loki. This is what we do, to supplement our central syslog server. Although you might wonder why you'd have two different centralized log collection systems, I've found that there are things I like using Grafana Loki for.

However, I can no longer recommend running such an all-in-one Grafana Loki setup for anything serious, including what you might call 'production', and I think you should be wary about attempting to run Grafana Loki yourself in any configuration.

The large-scale reason I say this is that most available evidence suggests that Grafana Inc, the developers of Loki, are simply not very interested in supporting this use case, or possibly any use case that involves other people running Loki. Unlike Prometheus, where local use is considered real and is how many people operate Prometheus (us included), Loki's 'local usage' comes across as a teaser to convince you of Loki's general virtues, and ingesting systemd logs through Promtail is merely the most convenient way to get a bunch of logs (you can even get them in JSON format, although you probably shouldn't in real usage).

If you do try to operate Grafana Loki in this all-in-one configuration (and perhaps in other ones), you'll likely run into an ongoing series of issues. In general I've found the Loki documentation to be frustratingly brief in important areas such as what all of the configuration file settings mean and what the implications of setting them to various values are. The documentation's example configuration for promtail reading systemd logs is actively dangerous due to cardinality issues in systemd labels, and while Loki is called 'Prometheus for logs' it differs from Prometheus in critical ways that can force you to destroy your accumulated log data. The documentation will not tell you about this.
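To make the cardinality issue concrete: in Loki, every distinct combination of label values is its own stream, so a label whose values keep changing multiplies the number of streams. The following is only a rough sketch of that arithmetic, not anything Loki or Promtail actually runs. The `count_streams` helper, the `hostname` and `unit` label names, and the sample entries are all made up for illustration; the transient `session-N.scope` units stand in for any systemd unit name that is different every time.

```python
def count_streams(entries, label_names):
    """Count distinct label-value combinations among the entries.

    Each distinct combination would be a separate stream if these
    labels were attached to the log lines.
    """
    return len({tuple(e.get(name, "") for name in label_names) for e in entries})


# Hypothetical journal entries. Transient units such as login session
# scopes get a new name each time, so 'unit' has unbounded values.
entries = [
    {"hostname": "web1", "unit": "sshd.service"},
    {"hostname": "web1", "unit": "session-101.scope"},
    {"hostname": "web1", "unit": "session-102.scope"},
    {"hostname": "web1", "unit": "session-103.scope"},
    {"hostname": "web2", "unit": "sshd.service"},
]

streams_host_only = count_streams(entries, ["hostname"])
streams_with_unit = count_streams(entries, ["hostname", "unit"])

print("streams with hostname only:", streams_host_only)
print("streams with hostname + unit:", streams_with_unit)
```

In this toy data the hostname label alone yields a stream per host, while adding the unit label yields a stream per host and unit pair, and that number keeps growing as new sessions appear.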

Even if you do everything as right as you can, things may well still go wrong. Grafana Inc shipped a Linux x86 promtail 2.8.0 binary that didn't read the systemd journal, which is only one of the (nominal) headline features of promtail on its dominant platform. An attempt to upgrade our Loki 2.7.4 to 2.8.1 failed badly and could not be reverted, forcing us to delete all of our accumulated log data for the second time in a few months. Worse, I feel that diagnosing and thus fixing this issue would have been all but impossible within a reasonable time because Loki simply didn't log enough useful information for a system administrator. When the only reported 'error', to the extent that there is one, is 'empty ring', there is both a specific problem (which 'ring' out of several, and how do you make it non-empty given that you're running in a monolith and don't have rings as such) and a deep-seated problem.

The deep-seated problem is that Loki doesn't feel like it's been built to be operable by people who don't know its code and its internal details. If you are a Loki specialist who understands everything there is to know about Loki, perhaps you can diagnose '"empty ring" as the response to everything'. But if you're running Loki in the all-in-one filesystem setup as a busy system administrator, you probably aren't such a specialist and never will be. Loki doesn't feel like it's built to be run in production by you and me, not safely and reliably, and I don't expect Grafana Inc to ever change that.

We will probably keep running Grafana Loki, because it's already there, I derive some value from it, it's been integrated a bit into our Grafana dashboards, and since we already have our central syslog server I can live with periodically throwing away the accumulated log data and starting over from scratch, although I don't like it. But if I ever leave I'll advise my co-workers to rip out all of the Loki and Promtail infrastructure, which is also my plan if dealing with it becomes too time-consuming and irritating. If I'd known what I now know back when I started to set up Loki, I'm not sure I'd have bothered.

(This elaborates on some Fediverse posts.)

PS: Loki also has some containerized multi-component run-it-yourself example setups. I don't have any experience with them so I have no idea if they're better supported and more reliable in practice than the all-in-one version (which isn't particularly, as we've seen). A container-based setup ingesting custom application logs with low label cardinality and storing the actual logs in the cloud instead of the filesystem may be a much better place to be for using Loki in practice than 'all-in-one systemd journal ingestion to the filesystem'. Certainly it's closer to how Grafana Inc probably runs Loki in their 'Grafana Cloud' service, and the VC-funded Grafana Inc certainly wants you to use Grafana Cloud instead of wrestling with Loki configuration and operation.

Written on 27 April 2023.


Last modified: Thu Apr 27 23:20:09 2023