The temptation of smartctl's JSON output format given NVMe SSDs

May 4, 2022

Over on the Fediverse, I said something:

I have a real temptation to combine smartctl's (new) JSON output with jq to generate Prometheus metrics from SMART data (instead of my current pile of awk of non-JSON smartctl output). On the other hand, using jq for this feels like a Turing tarpit; it feels like the right answer is having a Python/etc program ingest the JSON and do all the reformatting and gathering in a real programming language that I'll be able to read and follow in a few months.

We believe in putting data from SMART into our metrics system so that we have it captured and can do various things with it, now and in the future. Today, this is done by processing the normal output of 'smartctl -i' and 'smartctl -A' for our SATA and SAS drives using a mix of awk and other Unix programs in a shell script. The fly in the ointment on a few machines today (and more machines in the future) is NVMe SSDs, because NVMe SSDs have health information but not SMART attributes, so while 'smartctl -A' works on them it produces output in a completely different format that my script has no idea how to deal with.

There are three attractions to using smartctl's new-ish JSON output format with some post-processing step. The first is that I can run smartctl only once for each drive, because the JSON output format makes it straightforward to handle the output of 'smartctl -iA' all at once. The second is that I could probably condense a lot of the extraction of various fields and the chopping up of various bits into a single program that runs once, instead of a bunch of Unix programs that run repeatedly. The third and biggest is that I could unify the processing of SMART attributes and NVMe health information and handle both in a single pass over the JSON output. The processing would simply look for SMART attributes and NVMe health information in the JSON and output whatever it found, rather than having to tell the two apart from how the input was formatted.

(In other words, the JSON output comes conveniently pre-labeled.)
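
As a rough sketch of what 'look for whatever is there' could look like in Python: the code below runs smartctl once per drive and then pulls values out of either section of the JSON. The key names ('ata_smart_attributes' with a 'table' list, and 'nvme_smart_health_information_log') are what I believe smartctl's JSON output uses, but treat them as assumptions and check them against real output from your version of smartctl. The function names are made up for illustration.

```python
import json
import subprocess


def read_smartctl(device):
    """Run smartctl once for a device and return its parsed JSON output."""
    # -j asks for JSON; -iA is the same information as 'smartctl -i' plus
    # 'smartctl -A'. smartctl's exit status is a bitmask of drive conditions,
    # so a non-zero status is not treated as a failure here.
    res = subprocess.run(
        ["smartctl", "-j", "-iA", device],
        capture_output=True,
        text=True,
    )
    return json.loads(res.stdout)


def extract_values(data):
    """Yield (kind, name, value) for whatever health data is present."""
    # Assumed layout: SATA/SAS drives report a table of SMART attributes,
    # each with a name and a raw value.
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        yield ("smart", attr["name"], attr["raw"]["value"])

    # Assumed layout: NVMe drives report a flat log of named values.
    # Anything that is not a plain number (such as a list) is skipped.
    log = data.get("nvme_smart_health_information_log", {})
    for name, value in log.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            yield ("nvme", name, value)
```

The same extract_values() call works on both sorts of drives; a drive that has neither section simply produces nothing.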

Using smartctl's JSON output format doesn't solve all of the problems presented by NVMe SSDs, because the health information presented by NVMe SSDs doesn't map exactly onto SMART attributes. If I wanted to be honest, I would generate different Prometheus metrics for them that didn't pretend to have, for example, a SMART attribute ID number. But if I did that, I would make it harder to do metrics queries like 'show us the most heavily written to drives' across all of our drives regardless of their type.

(Or, more likely, 'show us all of the drive temperatures', since how things like power-on hours and write volume are represented in SMART varies a lot between different drives.)
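
One way to keep such cross-drive queries easy is to emit a single metric name for all drives and put the drive type in a label. The sketch below does this for temperature. It assumes that smartctl's JSON has a top-level 'temperature' object with a 'current' field for both SATA and NVMe drives, which should be verified against real output. The metric name and label names are hypothetical, not what our actual setup uses.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    text = str(value)
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def metric_line(name, labels, value):
    """Format one metric sample as a line of Prometheus text format."""
    parts = ",".join(
        '%s="%s"' % (key, escape_label(val))
        for key, val in sorted(labels.items())
    )
    return "%s{%s} %s" % (name, parts, value)


def drive_temperature(disk, data):
    """Return a temperature metric line for a drive, or None if unknown."""
    # Assumption: 'temperature.current' is present for both drive types.
    temp = data.get("temperature", {}).get("current")
    if temp is None:
        return None
    # The drive type goes in a label, so one query covers every drive.
    kind = "nvme" if "nvme_smart_health_information_log" in data else "smart"
    return metric_line(
        "disk_temperature_celsius", {"disk": disk, "kind": kind}, temp
    )
```

With this shape, a query on the one metric name shows every drive, and the 'kind' label is still there for the cases where the difference matters.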

The usual tool for processing JSON in shell scripts is jq. In theory jq might be able to do all of the selection and processing of smartctl's JSON output that's needed for this. In practice, I suspect I will be much happier doing this in Python, because the logic of what is extracted and reported (and how it's mangled) will be much clearer in a programming language than in jq's terse filtering and formatting mini-language.


Comments on this page:

By Eike at 2022-05-05 12:50:43:

Why not write a small program in Go to perform this task?

There already is a Prometheus exporter for smartctl data: https://github.com/prometheus-community/smartctl_exporter

By cks at 2022-05-05 21:09:38:

The problem with Go is that while the source code might be small, the executable will be big. It's also still dependent on smartctl for the actual work, and I feel Go isn't as flexible as Python for basically bashing a nested set of maps (dicts) around in odd ways, which is what this calls for.

The existing Prometheus community smartctl exporter unfortunately doesn't meet our needs in a variety of ways. For example, we need to collect both /dev/disk/by-path and /dev/sdX names for all disks on our fileservers because the sdX names change over time for various reasons, but are what appears in other places that we need to correlate against SMART metrics (eg, in kernel messages). There are probably other differences of opinion on what should be collected and how it should be presented between our setup and that exporter, since I haven't looked into it in detail.

(Some of the output of the smartctl exporter seems to have never been examined by someone who wants to actually use the results. SMART attribute raw values often need hacky post-processing to be really useful; the smartctl exporter seems to punt on anything that smartctl itself doesn't process and generate JSON for.)

Written on 04 May 2022.

Last modified: Wed May 4 21:50:27 2022