Your Grafana Loki setup needs security and access control
Grafana Loki is a nice and easy to set up way to send your (Linux) systemd logs (and other logs too with more work) to a central log server where you can then conveniently search them and do some other things. In a simple setup, you're going to set up a single 'does everything' Loki server, install Promtail on your client machines to ship their logs to Loki, and then probably install Grafana (perhaps behind a web server frontend) and connect it to Loki so you have a convenient way to query your logs. All of this is very similar to how you might set up Prometheus with Grafana. However, you don't want to stop here, because Loki really needs some security around who can access it, often more so than the metrics systems (like Prometheus) that it resembles.
The direct reason that you want access control for Loki is that Loki provides direct access to your logs, in full and basically raw form. All of your logs, from all of the systems that you're having feed in to Loki, with all of the potentially sensitive information that might be appearing in them. In many situations, you don't want to provide this sort of log access to everyone internally and you would be much more restrictive about who had access to read the logs on, say, a central syslog server. This applies both to direct access to Loki's HTTP API endpoints and to access to Loki through, say, Grafana's 'Explore' ad-hoc query system (which is a convenient way to poke through your Loki logs in a browser, instead of using LogCLI to do it from the command line).
(Even if you (collectively) don't have any concerns about co-workers and other internal users having access to the logs, consider how much of a potential treasure trove they could be to an attacker who gains access to your internal systems. For example, such an attacker could get a great deal of information about what user accounts have (SSH) access to which systems, and how they authenticate, as well as internal processing flows. And by default all of this can be accessed via HTTP, which means that any vulnerability that allows an attacker to make HTTP requests to internal web servers can probably be used to extract logs.)
Grafana Loki perhaps makes this not entirely clear, since there's nothing in the current documentation that explicitly says that leaving a normally set up 'all in one' Loki server exposed to your intranet gives everyone on the intranet the ability to ask it for logs. And unfortunately, adding access control to Loki is not entirely easy because Loki is a 'push' system, where client machines running Promtail must be able to talk to some Loki API endpoints (and if you make them require credentials to do so, it's likely that a sufficiently privileged person or attacker on a client machine can get those credentials).
That Loki is a push based system has another effect, which is that the Loki server isn't really in control of what logs it ingests from clients the way a Prometheus server is in control of what metrics it pulls from whom. Again, unless you go out of your way, your Loki server will probably accept logs from anyone on your intranet who cares to ship them to you, and it will normally believe a lot of labels in those logs (such as the hostname they're allegedly from). Loki is almost certainly not the log capturing system you want to use in a hostile environment, or even in an environment where other people may copy your system configurations (complete with your setting for the Loki server).
(We wound up with a brute force Apache solution to this, which is made more complicated because of how Loki co-mingles its various API endpoints.)
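As a sketch of what a frontend can do here, the following Apache reverse proxy fragment requires authentication for everything under Loki's API while carving out an unauthenticated exception for the push endpoint that Promtail clients need. The paths are Loki's actual API routes; the port, the auth file, and the choice of basic auth are illustrative assumptions only.

```apache
# Later <Location> sections override earlier ones for the same request,
# so the general (authenticated) block must come first.
<Location "/loki/api/v1/">
    ProxyPass "http://localhost:3100/loki/api/v1/"
    AuthType Basic
    AuthName "Loki log queries"
    AuthUserFile /etc/apache2/loki.htpasswd
    Require valid-user
</Location>

# Promtail on client machines must be able to push without credentials
# (or you must manage credentials on every client).
<Location "/loki/api/v1/push">
    ProxyPass "http://localhost:3100/loki/api/v1/push"
    Require all granted
</Location>
```

This only deals with read access; anyone who can reach the push endpoint can still ship logs to you.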
PS: I don't blame the Grafana Loki developers for not addressing this. Access control is a hard problem and it's probably best solved by a frontend, which is much better placed to take on all of the various potential complexities. I do wish Loki gave you somewhat more control over log ingestion and that the documentation had a bigger warning about this issue.
PPS: System and application metrics can be potentially sensitive too, but my general sense is that they're less dangerous, partly because they're generally less specific and more aggregated. Logs are extremely specific, that's their purpose.
Reaching past our firewalls with WireGuard (some thoughts)
Our network design carries with it the implicit assumption that all of our machines (and all machines run by other people) are within our network perimeter, so we can safely expose dangerous services to them, like an unauthenticated SMTP 'smarthost' and a central syslog server. In the beginning this was completely true, but over time we've acquired a few machines that are outside our network perimeter (and if there is cloud in our future, we'll get more). We'd rather like for these external machines to have some access to our standard services; at a minimum we want them to be able to email us when they have issues like failed disks, and it would be nice to be able to collect logs and so on. Our current solution to this is to poke holes through our firewall, but recently I've been tempted by the idea of using WireGuard as a more secure approach.
The appeal of WireGuard for this is that it's a lightweight service that requires little configuration or operation, and is now supported across all of our Ubuntu fleet. This creates two obvious options, depending on how much work we want to do on these external machines. The first option is to run WireGuard in a non-routed, "point to point" mode on each of the internal machines that have services we want to provide access to. The internal machine would expose its service(s) on a private WireGuard network as well as its normal IP address (in many cases this requires no service changes), and external machines would reach the service by talking to the internal machine's private WireGuard IP address. The one drawback to this is that it requires configuring each external machine to use the appropriate magic WireGuard IPs for these services, instead of the hostnames we normally use.
The other approach would be to create a NAT'ing WireGuard gateway machine and then configure external machines to route traffic to specific machines that run relevant services (our mail smarthost, our central syslog server, etc) through their WireGuard tunnel to the gateway. Because of the problem of network tunnels and asymmetric routing this mostly requires these internal machines to never want to initiate connections to the external ones (which is true for services today), and it makes the WireGuard NAT gateway machine a central point of failure for all external machines talking to all internal services. But it avoids configuring WireGuard on a bunch of internal machines and changing the service configuration on the external machines (since they can keep on using the regular hostnames for things like our SMTP smarthost).
(Not having to change the configuration of things on external hosts makes life much easier, because we don't have to keep two copies of configurations.)
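To make the second approach concrete, here is a minimal sketch of what the NAT gateway's WireGuard configuration could look like. Everything here (addresses, interface names, keys) is invented for illustration; the AllowedIPs, PostUp/PostDown, and MASQUERADE pieces are the load-bearing parts. The gateway also needs IP forwarding enabled (net.ipv4.ip_forward=1).

```ini
# /etc/wireguard/wg0.conf on the hypothetical NAT gateway.
# Keys and addresses are placeholders.
[Interface]
Address = 192.0.2.1/24
ListenPort = 51820
PrivateKey = <gateway-private-key>
# NAT tunneled traffic out to the internal network; the interface
# name is an assumption about the gateway's setup.
PostUp = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer]
# One external machine, identified by its public key and its
# (made up) WireGuard tunnel address.
PublicKey = <external-machine-public-key>
AllowedIPs = 192.0.2.10/32
```

On each external machine, the matching [Peer] section for the gateway lists the internal service machines' addresses in its AllowedIPs, which is what causes traffic for those machines to be routed through the tunnel.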
You can do this (in either approach) with any VPN or secure IP tunneling system. The advantage of WireGuard is that it's very easy and lightweight, so it's simple to configure on the external machines and viable to configure on a bunch of internal machines. It also just works, and in the kernel (on Linux and modern OpenBSD), which means we don't have to worry about it breaking or about necessary daemons dying for one reason or another.
(So far all of this is me thinking about things. We'll probably keep on opening holes in our external firewall until this becomes impossible for some external machine, perhaps because it has to keep changing IP addresses.)
Authenticated SMTP and IMAP authentication attacks and attempts we see here
A while back I wrote about how large scale SSH brute force attacks seem to have stopped here. SSH isn't the only form of authentication that we have exposed to the Internet; we also have both an IMAP server and an authenticated SMTP server, and unsurprisingly they also see activity. To my surprise, the activity patterns are quite different (which took some time to discover, since they both actually authenticate through Dovecot).
Our authenticated SMTP server sees widespread and determined probes from a wide range of IP addresses that appear to be attempting to brute force email addresses here; basically the kind of activity that I expected to see for SSH. However, many of these brute force attacks have no chance of success because they're being directed against either logins that no longer exist or email addresses that were never logins in the first place, and were only aliases or mailing lists. The obvious guess is that attackers targeting authenticated SMTP simply scrape every From: address from your domain that they can find and then set their hordes loose on brute force attacks.
(Over the past 7 days, the most targeted name is a mailing list, for over 18,000 attempts, and the next most targeted is an alias, for almost 4,000 attempts.)
The source IPs of these probes change over time. Although some sources continue to probe us over a long time scale, it's more common to see a source active for a day or two (usually against more than one login name) and then go away. My guess is that either the attacker loses access to that IP or they lose interest in us and change targets for a while.
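Seeing these patterns is mostly a matter of tallying authentication failures by target login and by source IP from your mail logs. A small sketch of that in Python, with a made-up, Dovecot-ish log line format (adjust the regular expression to whatever your logs actually look like):

```python
import re
from collections import Counter

# Hypothetical failed-auth log line format; real Dovecot logs differ.
FAIL_RE = re.compile(r"auth failed.*user=<(?P<user>[^>]*)>.*rip=(?P<rip>[\d.]+)")

def tally_failures(lines):
    """Count failed-auth attempts per target login and per source IP."""
    users, sources = Counter(), Counter()
    for line in lines:
        m = FAIL_RE.search(line)
        if m:
            users[m.group("user")] += 1
            sources[m.group("rip")] += 1
    return users, sources

# Invented sample lines for illustration.
log = [
    "smtp-login: auth failed, 1 attempts: user=<announce> rip=203.0.113.5",
    "smtp-login: auth failed, 1 attempts: user=<announce> rip=203.0.113.9",
    "imap-login: auth failed, 1 attempts: user=<oldlogin> rip=198.51.100.7",
]
users, sources = tally_failures(log)
print(users.most_common(1))  # the most-probed login name and its count
```

Sorting the per-login counts is how you discover things like a mailing list being the most targeted "login".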
By contrast, our IMAP server sees only a very low level of what appear to be brute force attempts. Instead it sees an entirely different pattern, where it appears that people who once had logins here still have some devices that are attempting to log in to their IMAP accounts. The typical pattern is that a single IP or a few closely related IPs will make ongoing attempts to log in to a single previously-valid login name. These attempts can continue from the same IP for weeks. I'd like to say I'm surprised that there are any IMAP clients that would be this determined, but I'm not that optimistic about IMAP client quality. I find it entirely believable that there are clients who won't stop even after months of failure (and people who don't notice that an IMAP account doesn't work or even still exists).
(This elaborates on a tweet of mine about the IMAP situation.)
There's probably nothing we can sensibly do about these IMAP clients, and they're not doing us any harm (apart from cluttering up the logs). If we seemed to have attackers going after IMAP instead of authenticated SMTP I might be more worried about the log clutter, but as it is the extra IMAP stuff seems harmless. The clear attacker action is in authenticated SMTP, which leads to the guess that this is would-be spammers looking for a way to send their spam.
Grafana Loki doesn't duplicate a central syslog server (or vice versa)
We've had a central syslog server for a long time, and recently we've set up a Grafana Loki server as well, where we're sending pretty much a duplicate of the logs that go to the syslog server. After using Loki for a while, I've come to the conclusion that the two serve different purposes and neither makes the other unnecessary.
(Grafana Loki is concisely called "Prometheus for logs", or to quote its website it's 'a log aggregation system designed to store and query logs from all your applications and infrastructure'. You can see how this might sound like it duplicates a central syslog server.)
Our central syslog server is a central source of accessible, text based truth. It requires almost no infrastructure to be working on any machine in order to get logs to it and accept logs, and while the logs are unstructured and generally have little metadata, they are in text and so are extremely accessible. Anything can read, search, and process text, and it's straightforward to back up and otherwise deal with. It's also quite compact when compressed. We have logs from most of our systems going back to late 2017, and /var/log on the central syslog server takes up less than 150 GB.
Grafana Loki stores logs in its own format, which is generally less space efficient than a few giant compressed text files, and requires a much more complicated set of programs and systems to be running in order to accept logs. While Loki can store more metadata more readily than syslog text files can, it's fragile if you feed it too much metadata. Loki also has no track record of longevity or of storage durability, and is a project almost entirely developed by a single VC funded corporation (cf).
However, Loki has its own features. It does let you readily capture more metadata than syslog does, it lets you integrate multiple log sources together (since the Promtail agent can read from log files as well as the systemd journal), it integrates well with Grafana dashboards so that information from your logs is more accessible (letting you see things that were always there but previously too tedious to look at), and in practice it's faster to query for small scale things (really long time ranges or big results are probably still best done with your text syslogs). Once you learn enough LogQL, it's also possibly easier to make somewhat complex log queries.
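For a taste of what those somewhat complex queries look like, here is a hypothetical LogQL query that counts error-mentioning kernel messages per host over the last day (the label names here are assumptions about your Promtail labeling, not anything standard):

```logql
sum by (host) (
  count_over_time({job="syslog", syslog_identifier="kernel"} |= "error" [1d])
)
```

The equivalent answer from compressed text syslogs is a grep-and-sort pipeline that has to scan a day of raw logs; Loki answers it from its index and chunks.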
(Having more metadata about even syslog level logs is useful both for narrowing searches down and discovering additional things about log messages, like what systemd unit or executable is associated with them. It actually makes syslog priorities somewhat useful, in contrast to my usual feelings about them. Effectively Loki throws all messages in one place and lets you sort them out later.)
So what I've come to feel is that our central syslog server is there to be our ultimate source of truth, while Grafana Loki is there to make logs accessible. We may not go to the central syslog server very often, but we definitely want it to be there (and we actually do drive some things from its logs). Meanwhile, Loki's accessibility through Grafana makes it more likely to be used.
(In theory we could get the same usefulness from our central syslog server if there was a sidecar system that indexed some basic information from the logs (and saved the file names and offsets of where to read the raw text) and then provided an API to it that Grafana could use. In practice, as far as I know there's no such thing; you can feed your logs to various things to index them, but just like Loki the things want to keep their own copy of the logs in their own format.)
How we monitor the temperature of our machine rooms
As I mentioned recently, we have machine rooms (many of them rather old) and with them a setup that monitors their temperatures, along with the temperatures of some of our important "wiring closets". The functional difference between a machine room and a wiring closet is that wiring closets are smaller and only have switches in them, not servers (and generally they have two-post aka "telco" racks instead of four-post server racks). We are what I'd consider a mid-sized organization, and here that means that we have real machine rooms (with dedicated AC and generally with raised floors) but not the latest and most modern datacenter grade equipment and setups. Including, of course, our temperature monitoring.
The actual temperature reading is done by network-accessible temperature sensor units, like the Control By Web X-DAQ-2R1-4T-E, which are basically boxes that you mount somewhere and plug into the network (and sometimes power). Each of these has some number of temperature probes connected to them by wires (which can be pretty long wires), and then the unit reports the readings of all connected temperature probes via either HTTP or SNMP depending on the unit. I suspect that the actual units are small embedded Linux machines, and thus are guaranteed to be running ancient versions of Linux. Our units have been very reliable so far, which is good because they're all at least fifteen years old and I'm not sure what modern replacements are available.
(Most or all of our units use Power over Ethernet, which was once very convenient because we already had a network with PoE switches, but which now means we have a more or less dedicated set of switches for them.)
These units aren't cheap; based on looking at list prices for current equivalents that I could easily find, you're probably looking at upwards of $300 US for a reasonably equipped setup. Plus the units need their own network connection on an isolated or secure network, because I certainly wouldn't trust their network stack. This limited how many of them we have and where we put them, and means that we don't have any spares. There are probably less expensive options if you want a single temperature sensor somewhere (even without going the DIY route). On the other hand, these have been solid, trouble free performers and we trust their temperature readings to be pretty accurate.
To get temperature readings from the units into our Prometheus system, I wrote some brute force scripts that either scrape their built in HTTP servers or query them by SNMP (for the one unit that really wants us to do that). The script collects the data, generates Prometheus metrics from it, and sends the metrics to Pushgateway, where Prometheus scrapes them. There's no particularly strong reason to use Pushgateway over, say, the host agent's "textfiles" collector; it's just how we started out. Once the temperature readings are in Prometheus, we use them to trigger alerts through some alert rules. We also have alerts if the temperature readings are stale or missing, and we have Prometheus ping all of the sensor units and alert if one of them stops responding.
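The general shape of such a script is simple: turn the readings into Prometheus exposition format and then PUT (or POST) them to Pushgateway under a job name. A sketch in Python, where the metric name, probe names, and Pushgateway URL are all invented for illustration:

```python
import urllib.request

def temperature_payload(readings):
    """Render {probe_name: celsius} readings in Prometheus exposition format."""
    lines = ["# TYPE machineroom_temp_celsius gauge"]
    for probe, temp in sorted(readings.items()):
        lines.append('machineroom_temp_celsius{probe="%s"} %s' % (probe, temp))
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"

def push(readings, url="http://pushgw.example.org:9091/metrics/job/machineroom"):
    """Push the readings to a (hypothetical) Pushgateway; PUT replaces the
    whole metric group for the job, which is what we want for gauges."""
    data = temperature_payload(readings).encode("ascii")
    req = urllib.request.Request(url, data=data, method="PUT")
    urllib.request.urlopen(req)

payload = temperature_payload({"mr1-front": 21.5, "mr1-back": 24.0})
print(payload)
```

In real life the readings would come from scraping the sensor unit's HTTP pages or querying it with SNMP first.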
Getting these temperature readings requires a fair amount of infrastructure to be working; there's the temperature unit itself, its PoE switch, and the network switches between it and the Prometheus server (although mostly not a router or firewall; the Prometheus server is directly on most of the networks the temperature sensors are on). Because we consider machine room temperature monitoring to be relatively critical, we've recently been looking at backup temperature data sources. One of them is that some of our servers have IPMI temperature sensors for the 'ambient' or 'inlet' temperature. We don't currently trust these readings as much as we trust the sensor units, but we can at least trigger alerts based on clearly extreme readings.
(Hence also part of our recent interest in USB temperature sensors.)
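A hedged sketch of such an alert rule on the IPMI readings, firing only on clearly extreme values; the metric name and threshold are invented for illustration and depend on what your IPMI exporter or scripts actually expose:

```yaml
groups:
  - name: machineroom-temps
    rules:
      - alert: AmbientTempVeryHigh
        # 'ipmi_ambient_celsius' is a placeholder metric name.
        expr: ipmi_ambient_celsius > 35
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Ambient temperature on {{ $labels.host }} is extreme"
```

The 'for: 10m' clause keeps a single spurious reading from paging anyone.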
PS: There are more DIY approaches to temperature monitoring units, but if you have a genuine machine room I'd strongly suggest paying the money for a dedicated unit. Among other things, I suspect that you'll get more trustworthy temperature readings. And in many settings, the cost of your time to build out and maintain a DIY solution is more than a dedicated unit will cost.
Grafana's problem with the order of dashboard panel legends and Prometheus
Although it's purely to deal with a missing Grafana feature, I do wish that Prometheus had some way to sort the returned query results by the value of some label. (I could hope for Grafana to add some way to deal with this, but good luck with that.)
It's common to build Grafana dashboards with graph panels (including bar graphs) that have multiple things in them. Generally when you do this, you include a legend with labels. The legend labels come from the query or queries you make, as covered in the query options for Prometheus as a datasource (the same is true for Loki as a datasource).
If each separate thing on the panel comes from a separate query, it's easy to put the legend labels in any order that you want, because the legend order comes from the order of the queries, and Grafana lets you freely re-order them. You might have this if you're graphing various memory metrics from a Linux server, for example, because the Prometheus host agent exposes each different thing from /proc/meminfo as a separate metric (and some things aren't exposed directly anyway, such as the amount of swap space used; you get only 'total swap' and 'free swap').
However, often you generate multiple time series from a single query and give them legend labels based on the value of some label from the metric; for example, all disk writes are together under one metric name, node_disk_written_bytes_total, and it has a 'device' label. In this case, the order of legend labels is the order that Prometheus returns all the time series in. This is not necessarily the order that you actually want (although I believe it's stable), and also isn't necessarily the same between different queries about the same things (so, for example, different queries about different disk metrics may return results in different order, resulting in graphs with different order of disks in the legend).
(I don't think Prometheus documents the order that query results are returned in. Experimentally, as of the current 2.38.0, Prometheus seems to order the labels alphabetically by name and then sort alphabetically by the resulting combination of label names and label values.)
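That observed ordering can be sketched as follows: each series' sort key is its label set taken as (name, value) pairs with the names in alphabetical order, and series are compared lexicographically on those pairs. A small Python illustration of the observed behavior (this mimics what 2.38 appeared to do, not any documented contract):

```python
def observed_order(series):
    """Sort label sets the way Prometheus 2.38 appeared to return them:
    lexicographically by (name, value) pairs, names sorted first."""
    return sorted(series, key=lambda labels: sorted(labels.items()))

# Invented label sets for illustration.
series = [
    {"device": "sdb", "instance": "a:9100"},
    {"device": "sda", "instance": "b:9100"},
    {"device": "sda", "instance": "a:9100"},
]
for s in observed_order(series):
    print(s["device"], s["instance"])
```

Note that with this scheme the first differing label (here 'device', then 'instance') dominates the order, which is why two queries with different label sets can order the same disks differently.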
Grafana provides no feature to change this order, although it could probably easily offer a 'sort by label name (ascending/descending)' option in the query options, or perhaps offer it as a general transformation. Prometheus currently provides no documented feature to control the order that time series are returned in based on label values, which isn't surprising since it mostly doesn't make any promises about the order in the first place.
Prometheus does provide sort() and sort_desc() to sort the time series by value. Although this can be useful to force an order, there can be three issues with it. The first is that it's generally not going to be consistent from panel to panel in the same dashboard. Second, the order can change when the dashboard refreshes (or when you explicitly look at a different time range). And third, I'm not sure how this interacts with actual graphs, as opposed to 'value over entire time interval' bar graphs, since the order may be different at every different step across the range.
Machine room temperatures and the value of long Prometheus metrics history
We have a few machine rooms. These aren't high-tech, modern server rooms, which is not surprising since they've generally been there for decades. As part of this, our machine rooms don't really have a specific set temperature that they're supposed to stay at. They're not supposed to get too hot, but the actual temperature they're at varies over the year and depends on a lot of things, including what we're running in them at the moment. To make sure that everything is (still) working, we have temperature sensors in the machine rooms that feed into our Prometheus setup.
Recently we were looking at our dashboards and noticed that one of the machine rooms had an oddly high temperature. It wasn't alarmingly high, and we could see it going up and then jumping back down in a familiar pattern that we see in all of our machine rooms as the AC cycles on and off. But it felt like the temperature of that machine room should be lower and maybe something was wrong. Since we have a long metrics history (we keep years worth of Prometheus metrics), we started looking at historical temperature data for this machine room, both in the past of this year and at this time in previous years (to see if this was something that had happened at this time of year before).
Looking at historical data showed a clear difference in the pattern of temperatures between the recent past and before then, especially in the minimum temperatures; starting in late June, things start drifting slowly upward. This is a pattern we've never seen before and it's a pattern we don't see in the temperatures of our other machine room in the same building. We don't know if this is really a problem or if things are still okay and the AC is behaving safely and as expected, but at least we know that there's something clearly exceptional going on.
(And if there is a real problem, we've been given a chance to fix it before the temperature drifts so high it's a real problem and triggers our alarms. Well, we've been given a chance to call in the people who are responsible for the AC so they can fix it. Who is responsible for what in a university building can be complicated and a little tangled.)
However, getting this confidence took quite a deep metrics history, far longer than the 14-day retention that Prometheus defaults to. Right now, going back 90 days is barely enough to show the clear start of the deviation with some time before it, which means we really want to point at more than 90 days of data to show that this wasn't happening before then in smaller form. Being able to go back years (our metrics go back to late 2018) means we can more readily see how unusual this is.
Relatively short metrics retention works if the change you're looking at or into is obvious and big, and you catch it soon enough (and sometimes it's all that you can afford). But not all changes happen that fast; sometimes things just drift quietly over time. This incident shows me once again that it's useful to have a real historical reference so that you can go back to see how things used to be far enough ago that you've forgotten.
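For reference, Prometheus's retention is controlled by a command line flag on the server; a sketch of the relevant startup options (the duration and path here are choices, not defaults):

```
prometheus --storage.tsdb.path=/var/lib/prometheus \
           --storage.tsdb.retention.time=5y [...]
```

The default if you set nothing is the 15-day '--storage.tsdb.retention.time' behavior, which is far too short for this kind of long-baseline comparison.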
Our Prometheus host metrics saved us from some painful experiences
A couple of weeks ago, a few days after a kernel upgrade on our servers, we had an Ubuntu 22.04 server basically die with a constant series of out-of-memory kills from the kernel of both vital daemons and random bystander processes. There was no obvious clue as to why, with no program or cgroup consuming an unusual amount of memory. In the constant spew of text from the kernel as it repeatedly OOM killed things, we did notice something:
  kernel: [361299.864757] Unreclaimable slab info:
  kernel: [361299.864757] Name                      Used        Total
  [...]
  kernel: [361299.864924] kmalloc-2k           6676584KB    6676596KB
  [...]
Among our Grafana dashboards is one that provides a relatively detailed look into the state of a particular server, including various bits of memory usage. After we rebooted the failing server I took a look at its dashboard, and immediately noticed that the 'Slab' memory usage was basically a diagonal line going up over time from when it had its kernel update a few days ago and been rebooted.
This caused me to immediately go look at the Slab memory usage for other servers, and all of our 22.04 servers had the same behavior. All of them had a constantly increasing amount of slab memory usage (and in some digging, 'unreclaimable' slab memory usage); it was just that this particular server had a combination of usage and low(er) RAM that caused it to run out of memory sooner than anything else. It was clear we had a systemic issue that would take down every one of our 22.04 servers sooner or later, with a number of them already being alarmingly close to also running out of memory (including our Prometheus metrics server).
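Since the node_exporter host agent exposes the kernel's slab numbers as node_memory_Slab_bytes and node_memory_SUnreclaim_bytes on Linux, a single query can show this growth across the fleet; for example (the one-day window is an arbitrary choice):

```promql
# Per-host growth rate of unreclaimable slab memory; a steady positive
# slope everywhere suggests a systemic kernel-side leak rather than a
# problem on one machine.
deriv(node_memory_SUnreclaim_bytes[1d])
```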
At first this looked very much like an issue with the new kernel. But it occurred to us that we'd effectively made another kernel change at the same time. Back at the start of August, after discovering that AppArmor profiles had started activating themselves, we'd set a kernel command line option to turn off AppArmor in the kernel. However, activating that option requires a reboot (to use the new command line), and on most machines we hadn't rebooted them until our kernel update. But a few 22.04 machines had been rebooted earlier with the command line update in place, and some of those machines were even running older kernels. Inspection of Prometheus host metrics showed that their Slab usage had started going up in the same pattern from the moment they were rebooted, including the machines that had older kernels.
We immediately reverted this kernel command line change on a few machines that we could readily reboot without affecting people (including the Prometheus metrics server), while leaving them using the current kernel. Within a few hours it was fairly clear that disabling AppArmor on the kernel command line was the trigger for this kernel memory leak, and by the next morning it was basically certain. We reverted the kernel command line change everywhere and started scheduling server reboots for all of our 22.04 machines.
(We also filed Ubuntu bug 1987430.)
Without our Prometheus and Grafana setup, this most likely would have been a rather different and more painful experience. We probably would have written off the first server going out of memory as a one time piece of weirdness and only started reacting when a second server had the same thing happen to it a day or two later (and there probably would have been a succession of servers hitting limits at that time). Then it might have taken longer to realize that we had a steady slab leak over time, and we'd probably have blamed the recent kernel update and spent a bunch of time and effort reverting to a previous 22.04 kernel without actually fixing the problem. As it was, our Grafana dashboards surfaced a big indicator of the problem right away and then our historical data let us see that it wasn't actually the recent kernel update at fault.
Most of the time our metrics system just seems nice and useful, not a critical thing (alerts are critical, but those don't necessarily require metrics and metric history). This was not one of those times; it's one of the few times where having metrics, both current and historical, clearly saved our bacon. A part of me feels that this incident justifies our metrics systems all by itself.
An rsyslog(d) syslog forwarding setup for Grafana Loki (via Promtail)
Suppose, not hypothetically, that you have a shiny new Grafana Loki setup to store and query your logs (or at least the logs that come from the systemd journal on your Linux machines, and maybe some additional log files on them). Also suppose that you have some OpenBSD machines whose logs you would like to get into Loki. OpenBSD doesn't have the systemd journal, or for that matter a build of Promtail, the Loki log-shipping client. However, Promtail can receive logs via syslog, and OpenBSD can send syslog logs to remote servers (which we're already using for our central syslog server). Unfortunately Promtail only accepts syslog messages in RFC 5424 format, and OpenBSD doesn't send that. Instead, OpenBSD syslog sends what is usually called RFC 3164 format, which is really "the BSD syslog protocol" written down for informational purposes. In order to send OpenBSD syslog to Promtail, we need a converter in the middle (which is the recommended configuration anyway).
In an ideal world there would be a simple standalone converter that did this for you. As far as I can tell, such a thing doesn't exist; instead, people doing this use a full scale syslog daemon with a minimal configuration file, generally either syslog-ng or rsyslog. Syslog-ng is more popular and you can find suitable configuration examples with Internet searches (eg); however, Ubuntu uses rsyslog by default and you can't install syslog-ng without throwing rsyslog out. Thus, rsyslog it is.
On Ubuntu 22.04 LTS, a minimal 'UDP or TCP syslog to RFC 5424 TCP syslog' forwarding configuration file is the following:
  # Listen on UDP and TCP 50514
  # forward to TCP 127.0.0.1:40514
  module(load="imudp")
  module(load="imtcp")

  # Pick your port to taste
  input(type="imudp" port="50514")
  input(type="imtcp" port="50514")

  # Forward everything
  *.* action(type="omfwd"
          protocol="tcp" target="127.0.0.1" port="40514"
          Template="RSYSLOG_SyslogProtocol23Format"
          TCP_Framing="octet-counted" KeepAlive="on"
          action.resumeRetryCount="-1"
          queue.type="linkedlist" queue.size="50000")
Much of the action(...) portion comes from the Promtail output configuration for rsyslog. I've added the queueing options to ensure that we don't drop messages if the local receiving Promtail is down for a bit (because it's being restarted, or because we're starting during system boot).
On the Promtail side, the corresponding scrape configuration for receiving this forwarded syslog traffic is:

- job_name: syslog-receiver
  syslog:
    listen_address: 127.0.0.1:40514
    # Don't disconnect the forwarder
    idle_timeout: 12h
    use_incoming_timestamp: true
    labels:
      job: syslog-receiver
  # Copy syslog bits to standard labels
  relabel_configs:
    - source_labels: ['__syslog_message_hostname']
      target_label: host
    - source_labels: ['__syslog_message_severity']
      target_label: level
    - source_labels: ['__syslog_message_facility']
      target_label: syslog_facility
    - source_labels: ['__syslog_message_app_name']
      target_label: syslog_identifier
There are some additional RFC 5424 syslog message fields that Promtail can theoretically record, such as the procid (which is the traditional syslog PID). In practice, you can't record the procid because it will lead to a Loki cardinality explosion, and syslog messages from OpenBSD machines don't seem to have a msgid. OpenBSD syslog messages also have no RFC 5424 structured data, so setting 'label_structured_data' is pointless. You may also want to do some label remapping on the reported host name, if (for example) your OpenBSD machines or other syslog sources report their fully qualified domain names while you normally use short names in your Loki labels.
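As a sketch of that sort of host name remapping (the regex is an assumption for illustration, not from our actual configuration), a relabel rule that keeps only the short name could look like this, replacing the plain hostname-to-host rule above:

```yaml
    - source_labels: ['__syslog_message_hostname']
      # Capture everything before the first dot. If the reported name
      # has no dot, the regex doesn't match and the host label is
      # simply not set by this rule.
      regex: '([^.]+)\..*'
      target_label: host
      replacement: '$1'
```

This uses standard Prometheus-style relabeling semantics, where a non-matching regex in a replace rule leaves the target label alone.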
Since our intermediate rsyslog may hold on to messages in case of problems, we set use_incoming_timestamp so that the syslog log messages carry their original incoming times instead of whenever our Promtail actually received them.
You can run the rsyslog forwarder and the Promtail it forwards to on more than one host if you want to. This may be useful if you have a Linux machine that has an interface on an otherwise isolated network segment with OpenBSD machines, which happens to describe one of our 'sandbox' networks. As far as Promtail goes, you can run a separate Promtail instance with just the syslog scrape configuration, or include the syslog scrape configuration in the host's standard Promtail setup. The latter is what we're going to do, since it involves fewer daemons and configuration files.
Sidebar: Whether or not to label the Promtail host
If you're going to run multiple copies of this setup, you might or might not want to include a label to identify which host and Promtail instance logs were submitted through. On the one hand, this lets you identify where any surprise logs are coming from. On the other hand, this increases Loki label cardinality if your syslog machines can naturally change which host they forward their logs through.
In our environment we're probably going to add such a label, because anything that's using a non-standard forwarding host will be doing it because it can't reach the standard one (which is also the Loki host).
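As an illustration (the 'syslog_via' label name is made up for this sketch), such a label is just another entry in the static labels of the syslog scrape configuration:

```yaml
    labels:
      job: syslog-receiver
      # Hypothetical label recording which forwarding host this
      # Promtail instance runs on; only worth it if your syslog
      # sources don't routinely hop between forwarders.
      syslog_via: forwarder1
```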
Large scale Internet SSH brute force attacks seem to have stopped here
The last time I paid attention to what happened when you exposed an SSH port on the Internet was years and years ago, when I gave up being annoyed by log messages and either stopped paying attention or firewalled off my SSH ports from the general Internet. Back then, it was received wisdom (and my general experience) that having an SSH port open drew a constant stream of SSH brute force attacks against a revolving cast of whatever logins the attackers could come up with.
Recently I set up a Grafana Loki setup that captures our systemd logs. As part of getting some use out of it (beyond questions about how server clocks drift), I built a Grafana dashboard that reports on SSH authentication failures across our Ubuntu fleet (among other things). What I saw surprised me, because what our exposed SSH servers experience today seems to be nothing like it was in the past.
(One caution is that it may be that most attackers no longer direct their attention against universities at all, and now aim their scans at, say, cloud providers, which could be much richer territory for insecure SSH servers.)
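As a hedged sketch of the sort of dashboard query involved (the exact label names here are assumptions based on the syslog setup described earlier, and the sshd log phrasing may need adjusting), counting authentication failures per host over a week in LogQL might look like:

```
sum by (host) (
  count_over_time(
    {syslog_identifier="sshd"} |~ "Failed password|Invalid user" [7d]
  )
)
```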
For the most part, SSH brute force attacks against us are gone. When they appear in some time period, they come in high volume from single IP addresses (or only a few IP addresses); some of the time these are cloud server IPs. Almost all of the brute force attacks are directed against the 'root' account, and any single round tends to be directed against only a single one of our servers rather than being spread over multiple ones. As mentioned, attacks are bursty; there are periods with no login attempts and then periods where someone apparently fires up a single attacking IP address for an hour or a day.
For some numbers, over the past 7 days we had 24,000 attempts against 'root' and only 749 against the next most popular target, which is a login name ('admin') that doesn't even exist here. Just over 10,000 of those attempts came from a single IP address, and just four IPs made 1,000 or more attempts against anything. Besides 'root', only five login names had more than 100 attempts (and none of them exist here): 'admin', 'user', 'ubuntu', 'debian', and 'pi'. And only three machines saw more than 1,000 attempts (across all targeted login names).
One of the things I've learned from this is that targeted blocking of only a few IPs is disproportionately effective at stopping brute force SSH attacks here. Also, since we already block Internet logins to 'root', we're in almost no danger. No matter how many times they try, they have literally no chance of success.
(It does make me curious about what sort of passwords they're trying for 'root'. But not curious enough to set up a honeypot SSH server and then try to give it a hostname that's interesting enough to attract attackers.)