
2022-08-08

Two example Grafana Loki log queries to get things from ntpdate logs

Recently I wanted to estimate how much the clocks on our servers drift over time (to give me a quick comparison to how BMC clocks drift). Most of our servers keep their time in sync by periodically running ntpdate, which syslogs messages that look like this:

adjust time server 192.168.0.1 offset +0.000579 sec

These days we have a Grafana Loki installation in addition to our long-standing central syslog server, so rather than write some awk I decided to use Loki's LogQL query language to answer my question, and then mentioned the final query on the Fediverse. Today I want to run down how this query is put together, along with a less obvious version that tells me how frequently our different NTP servers are used as a time source by ntpdate.

We collect logs from the systemd journal in non-JSON format, so the starting point is a query that will give me just the ntpdate logs with the log line being the plain text of the ntpdate message (as opposed to a JSON blob which we'd have to further unpack):

{syslog_identifier="ntpdate"}

We could in theory get all logs and then use Loki's text matching to trim them down, but Loki is very efficient at identifying just the logs for a particular label (which has some downsides) and the less work we give it the better (especially if we want to query over a long time range, such as a week or a month).
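As an illustration of what we're avoiding here (a sketch, assuming that every one of our log streams carries a host label), the brute force version would be something like:

{host=~".+"} |= "adjust time server "

This makes Loki scan and text-filter every log line from every system, instead of reading only the ntpdate streams.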

In order to calculate things with the offset delta, we need to extract it as a label (as far as I know). The best way to do this is with a pattern log parser expression:

{syslog_identifier="ntpdate"}
   | pattern "adjust time server <server> offset <delta> sec"

Now we have server and delta labels for each ntpdate log line. Well, we have those labels for each ntpdate log line that matches them, and also errors for any line that doesn't. We don't want the errors (Grafana's Explore view will complain about them, for example), so we'll match a plausible line first:

{syslog_identifier="ntpdate"} |= "adjust time server "
   | pattern "adjust time server <server> offset <delta> sec"

Since we want to calculate with the delta label, we're in the land of log metric queries. To get a rough per-24 hour clock drift figure, we'll sum over time on the delta label over the last 24 hours of logs:

sum_over_time( {syslog_identifier="ntpdate"}
    |= "adjust time server "
    | pattern "adjust time server <server> offset <delta> sec"
    | unwrap delta [24h] )

('unwrap' is how you tell LogQL what label value you want to work on.)

If you try this query you'll discover that it isn't quite what we want, which is partly our own fault. Sum_over_time() and other metrics functions produce one result for each different set of label values, including the server label that we added, which is going to vary over time for a single host. To give us the number we really want, we need to sum that by the host label, the hostname:

sum ( sum_over_time( {syslog_identifier="ntpdate"}
    |= "adjust time server "
    | pattern "adjust time server <server> offset <delta> sec"
    | unwrap delta [24h] ) ) by (host)

Wrapping everything in the sum() also gets rid of any other source of variable label values; we want a sum by the host, and that's what we get.

(Not generating the server label at all would probably be more efficient in Loki, since this query never uses it. You'd do that by replacing the '<server>' in the pattern with a '<_>' placeholder.)
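As a sketch of what that leaner version would look like (I haven't run this exact query against our Loki), the drift query becomes:

sum ( sum_over_time( {syslog_identifier="ntpdate"}
    |= "adjust time server "
    | pattern "adjust time server <_> offset <delta> sec"
    | unwrap delta [24h] ) ) by (host)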

One tricky thing to think about here is that ntpdate can move the clock in both directions (it can use both positive and negative offsets). By summing over time we're letting those cancel out, generating a number that is theoretically how much the clock would have drifted if ntpdate wasn't correcting it. But we aren't generating a number for how much correction ntpdate had to apply; if ntpdate corrected by +0.5 one time and then -0.5 the next time, we'll sum that out to 0. As far as I know, LogQL has no abs() function currently, so if we wanted to get and sum up the absolute delta we'd have to write a more complicated regular expression that skipped the '+' or '-' in front of ntpdate's offset and just gave us the raw (absolute) number.
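As a sketch of that approach, assuming I have LogQL's regexp parser syntax right (and with 'absdelta' being a label name I've just made up), the query might look like this:

sum ( sum_over_time( {syslog_identifier="ntpdate"}
    |= "adjust time server "
    | regexp "adjust time server [^ ]+ offset [+-](?P<absdelta>[0-9.]+) sec"
    | unwrap absdelta [24h] ) ) by (host)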

There's another question we can ask here, which is how frequently ntpdate selects one time server or another over the past day (or other time period). If ntpdate significantly prefers one server, there might be something going on.

LogQL is deliberately modeled after PromQL, the Prometheus query language, but it has enough differences that PromQL instincts don't necessarily work. One of them is that in LogQL, count_over_time() works on log lines, not on metric values. So our starting point is a version of our sum over time that omits the unwrap (instead of trying to unwrap server and work on it):

count_over_time( {syslog_identifier="ntpdate"}
    |= "adjust time server "
    | pattern "adjust time server <server> offset <delta> sec" [24h] )

If you try this query, you will probably get an error message about how the maximum number of series has been reached for a single query, and this is our fault. Like sum_over_time(), count_over_time() produces a separate series for every unique set of labels, and we've made the ntpdate adjustment delta into a label. It's very likely that every single adjustment delta will be unique, which means we've created a very high cardinality label. We have to sum things up anyway, but we want to give Loki an easier time, so we want to not create the delta label:

count_over_time( {syslog_identifier="ntpdate"}
    |= "adjust time server "
    | pattern "adjust time server <server> offset <_> sec" [24h] )

Instead of summing by host, we want to sum by (NTP) server:

sum( count_over_time( {syslog_identifier="ntpdate"}
    |= "adjust time server "
    | pattern "adjust time server <server> offset <_> sec" [24h] )
) by (server)

Running this query on our logs reveals a systematic bias in which of our NTP servers our systems pick. We have three NTP servers, each listed separately under its own name on ntpdate's standard command line. The first and the third server run chrony on Ubuntu (18.04 and 20.04 respectively), and the second runs ntpd on CentOS 7. Ntpdate picks the first listed server about 90% of the time and the third listed server about 10% of the time, and never seems to pick the second (although if you ask, ntpdate does seem to consider it a viable time source).
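One way to ask is ntpdate's query-only mode, which queries the servers and reports their offsets without setting the clock; with made-up server names that would look like:

ntpdate -q ntp1.example.org ntp2.example.org ntp3.example.org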

PS: One of the lessons I've taken from this is that life is simpler if you only create the label or labels you actually need in your 'pattern' patterns. In both cases we'd have been better off not extracting the label we didn't want, even though the information is there. On the other hand, customizing your 'pattern' pattern for each specific query may make it harder to verify that all of them are actually matching the same pattern and there hasn't been any quiet drift.

sysadmin/GrafanaLokiNtpdateQueries written at 20:17:31

