Some things on ZFS (on Linux) per-dataset basic IO statistics
Sufficiently recent versions of OpenZFS on Linux have not just performance statistics for overall pool IO, but also some additional per-dataset IO statistics. Conveniently, these IO statistics are exposed through the Prometheus host agent, so if you're using Prometheus (as we are) you don't have to write something to collect and manipulate them yourself. However, what these statistics actually mean is a little bit underexplained.
(I believe these first appeared in ZFS 0.8.0, based on the project's git history.)
The per-dataset statistics appear in files in /proc/spl/kstat/zfs/<pool> that are called objset-0x<hex>. A typical such file looks like this:
28 1 0x01 7 2160 5523398096 127953381091805
name                            type data
dataset_name                    7    ssddata/homes
writes                          4    718760
nwritten                        4    7745788975
reads                           4    29614153
nread                           4    616619157258
nunlinks                        4    77194
nunlinked                       4    77189
(For what the header means, see kstat_seq_show_headers() in spl-kstat.c.)
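As a concrete illustration, here is a small Python sketch that parses this objset kstat format into a dictionary, assuming the whitespace-separated 'name type data' layout shown above (type 7 is a string kstat, type 4 a uint64); the sample values are the ones from the file above.

```python
def parse_objset_kstat(text):
    """Parse a ZFS objset-* kstat file into a dict of statistics.

    The first line is a kstat header and the second gives column
    names; after that, each line is 'name type data'. Type 7 is a
    string; the numeric kstats here are type 4 (uint64).
    """
    stats = {}
    for line in text.splitlines()[2:]:  # skip header and column names
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, ktype, data = parts
        stats[name] = data if ktype == "7" else int(data)
    return stats

sample = """\
28 1 0x01 7 2160 5523398096 127953381091805
name                            type data
dataset_name                    7    ssddata/homes
writes                          4    718760
nwritten                        4    7745788975
reads                           4    29614153
nread                           4    616619157258
nunlinks                        4    77194
nunlinked                       4    77189
"""

s = parse_objset_kstat(sample)
# the difference between nunlinks and nunlinked is the pending deletes
print(s["dataset_name"], s["nunlinks"] - s["nunlinked"])
```

Run against the sample above, this reports the dataset name and five pending deletes.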
Paradoxically, the easiest fields to explain are the last two, nunlinks and nunlinked. These reflect the number of files, directories, and so on that have been queued for deletion in the ZFS delete queue and the number of things that have actually been deleted (they may start out at non-zero; see dataset_kstats.h).
In many cases these two numbers will be the same, because you have
no pending deletes. In this
case, there are some files in my home directory that have been
deleted but that are still in use by programs.
The writes, nwritten, reads, and nread fields count the number of writes and reads and the bytes written and read, but what makes them complicated is what is and isn't included in them. I believe the simple version is that they count normal user level read and write IO performed through explicit system calls, starting with read() and write() but probably including various other
related system calls. They definitely don't count internal ZFS IO
to do things like read directories, and I don't think they count
IO done through
mmap()'d files. However it appears that they may
include some IO to read (and perhaps write) ZFS xattrs, if you use
those extensively. It may not include user level IO that is performed
as direct IO; I'm not sure. This isn't documented explicitly and the
code is unclear to me.
I have no idea if these read and write statistics count NFS IO (I
have to assume that the
nunlinked statistics do
count things deleted over NFS). Not counting NFS IO would make them
much less useful in our fileserver environment,
because we couldn't use it to find active filesystems. Of course,
even if these dataset statistics don't include NFS IO now (as of
ZFS 2.0.4 and an impending ZFS 2.1.0 release), they may well in the
future. If you're tempted to use these dataset statistics, you
should probably conduct some experiments to see how they react to
your specific IO load.
(Our fileservers are currently running Ubuntu 18.04, which has an Ubuntu version of ZFS 0.7.5. This is recent enough to have the pool level IO statistics, but it doesn't have these per-dataset ones.)
Update: Based on some experimentation, the Ubuntu 20.04 version of ZFS on Linux 0.8.3 does update these per-dataset read and write statistics for NFS IO.
What NFSv3 operations can be in the Linux nfsd reply cache
The Linux kernel NFS server (nfsd) provides a number of statistics in /proc/net/rpc/nfsd, which are often then exposed by metrics agents such as the Prometheus host agent. One guide to what overall information is in this rpc/nfsd file is SvennD's nfsd stats explained. The
first line is for the "reply cache", which caches replies to NFS
requests so they can be immediately sent back if a duplicate request
comes in. The three numbers provided for the reply cache are cache
hits, cache misses, and the number of requests that aren't cacheable
in the first place. A common explanation of this 'nocache' number is,
well, I will quote the Prometheus host agent's help text for its version
of this metric:
# HELP node_nfsd_reply_cache_nocache_total Total number of NFSd Reply Cache non-idempotent operations (rename/delete/…).
Knowing how many renames, deletes, and so on were going on seemed like a good idea, so I put a graph of this (and some other nfsd RPC numbers) into one of our Grafana dashboards. To my slowly developing surprise, generally almost all of the requests to the NFS servers I was monitoring fell into this 'nocache' category (which was also SvennD's experience, recounted in their entry). So I decided to find out what NFSv3 operations were cacheable and which ones weren't. The answer was surprising.
For NFSv3 operations, the answer is in the big array at the end of fs/nfsd/nfs3proc.c. Operations marked RC_NOCACHE aren't cacheable; entries with other values are. The non-cacheable NFSv3 operations are:
access commit fsinfo fsstat getattr lookup null pathconf read readdir readdirplus readlink
The cacheable ones are:
create link mkdir mknod remove rename rmdir setattr symlink write
NFS v3 operations that read information are not cacheable in the reply cache and show up in the 'nocache' category, while NFS v3 operations that change the filesystem mostly are cacheable.
Contrary to what you might guess from the Prometheus host agent's help text and various other sources, the non-cacheable NFS v3 operations aren't things like RENAME and CREATE, they are instead the NFSv3 operations that just read things from the filesystem (with the exception of COMMIT). In particular, GETATTR is an extremely frequent operation, so it's no wonder that most of the time the 'nocache' category dominated in my stats.
If you want to track the number of creates, writes, and so on, what you want to track is the number of misses to the reply cache. Tracking the 'nocache' number tells you how many read operations are happening.
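To make that interpretation concrete, here is a small Python sketch that pulls the reply cache counters out of /proc/net/rpc/nfsd-style text; the cacheable/non-cacheable split is transcribed from the lists above, and the sample counter values are made up for illustration.

```python
# Transcribed from the RC_NOCACHE split in fs/nfsd/nfs3proc.c
# described above.
NONCACHEABLE = {
    "access", "commit", "fsinfo", "fsstat", "getattr", "lookup",
    "null", "pathconf", "read", "readdir", "readdirplus", "readlink",
}
CACHEABLE = {
    "create", "link", "mkdir", "mknod", "remove", "rename",
    "rmdir", "setattr", "symlink", "write",
}

def reply_cache_counts(nfsd_text):
    """Return (hits, misses, nocache) from /proc/net/rpc/nfsd text."""
    for line in nfsd_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "rc":
            return tuple(int(f) for f in fields[1:4])
    raise ValueError("no rc line found")

sample = "rc 0 26045 1306231\n"   # hypothetical counter values
hits, misses, nocache = reply_cache_counts(sample)
# 'nocache' is dominated by read-style operations (GETATTR, READ, ...);
# 'misses' approximates modifying operations (CREATE, WRITE, RENAME, ...)
print(hits, misses, nocache)
```

On a typical server the nocache number dwarfs the others, for the reasons above.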
(All of this makes sense once you understand why the reply cache is (or was) necessary, which is for another entry. I actually knew this as background NFS protocol knowledge, but I didn't engage that part of my memory when I was putting together the Grafana graph and had that tempting help text staring me in the face.)
Modern Linux can require a link signal before it configures IP addresses
I recently had an interesting troubleshooting experience when an Ubuntu 18.04 Dell server would boot but not talk to the network, or in fact even configure its IP address and other networking. I was putting it into production in place of a 16.04 server, which meant I had changed its netplan configuration and recabled it (to reuse the 16.04 server's network wire). I spent some time examining the netplan configuration and trawling logs before I took a second look at the rear of the machine and realized that when I had shuffled the network cable around I had accidentally plugged it into the server's second network port instead of the first one.
What had fooled me about where the problem lay was that when I logged in to the machine on the console, ip reported that the machine didn't have its IP address or other networking set
up. Because I come from an era where networking was configured
provided that the network device existed at all, that made me assume
that something was wrong with netplan or with the underlying
configuration it generated. In fact what was going on is that these
days nothing may get configured if a port doesn't have link signal.
The inverse is also true; your full IP and network configuration may appear the moment you plug in a network cable and give the port a link signal.
(I think this is due to netplan using systemd's networkd to actually handle network setup, instead of it being something netplan itself was doing.)
People using NetworkManager have been experiencing this for a long time, but I'm more used to static server network configurations that are there from the moment the server boots up and finds its network devices. This behavior is definitely something I'm going to have to remember for future troubleshooting, along with making sure that the network cable is plugged into the port it should be.
This does have some implications for what you can expect to happen if your servers ever boot without the switch they're connected to being powered on. In the past they would boot with IP networking fully configured but just not be able to talk to anything; now they'll boot without IP networking (and some things may wait for some time before they start, although not forever, since systemd's wait for the network to be online has a 120 second timeout by default).
(There may be some other implications if networkd also withdraws configured IP addresses when an interface loses link signal, for various reasons including someone unplugging the wrong switch port. I haven't tested this.)
ZFS on Linux is improving the handling of disk names on import
Back in 2017 I wrote up how ZFS on Linux names disks in ZFS pools. One of the important parts of this is that ZFS on Linux only stored a single name for each disk (the full path to it), and if the path wasn't there when you tried to import the pool (including the import during boot using a cachefile), you had a problem. The good news is that ZFS on Linux (well, OpenZFS) has recently landed a change that significantly improves this situation.
To summarize what the change does (assuming I understand it correctly),
if an expected device in a pool is not found at the path for it in
the cachefile (which is the same as what the pool stores), ZoL will
attempt to search for all of the disks in the pool, as if you were doing a plain 'zpool import'. This search has two restrictions; first,
it looks for the pool by GUID instead of by name (which is what
you'd expect), and second it only looks at devices in the same
directory or directories as the pool's original device names. So
if you specified all your devices by /dev/disk/by-path paths, the
import won't scan, say, /dev or /dev/disk/by-id. In the common case
this won't make a difference because all disks will be present in
all the /dev directories.
(Of course this assumes that udev is populating your /dev properly and promptly.)
It looks like this change will be in ZFS on Linux 2.1.0 when that gets released (and there was a release candidate just made recently for 2.1.0, so hopefully that won't be too far away). Interested parties can read the full commit message and description in commit 0936981d8. Based on reading the code, there seems to be no way to turn this off if for some reason you really want pool import to fail unless the disks are exactly where or what you expected. You can probably effectively disable it by setting up your own directory of symlinks to your pool's disks and then arranging for your pool to use that directory in all of its device names.
(I don't know if this change is even necessary in FreeBSD. If it is, presumably it will get there someday, but I have no idea how fast things move from OpenZFS to FreeBSD.)
Programs that read IPMI sensors tell you subtly different things
In my first installment, I talked about how I had started out reading IPMI sensors using ipmitool, or more specifically the command 'ipmitool sensor', which took too long on some of our machines. I then discovered that ipmi-sensors ran quite fast on all our machines and appeared to give us the same information, so I was going to switch to it. Unfortunately it turned out that ipmi-sensors doesn't give us all of the information that ipmitool does. Fortunately I was able to find an ipmitool command that did run fast enough for the sensors we really care about: 'ipmitool -c sdr list full'. The 'full' is important, and doesn't actually seem to mean what you'd think.
Based on this Sun/Oracle documentation, the important and unclear arguments to 'ipmitool sdr list' are 'full', which really means 'temperature, power, and fan sensors', and 'compact', which really means 'failure and presence sensors' (which I think 'ipmitool sensor' will call 'discrete'). On some of our Dell servers, reading all of the various 'discrete' sensors can take some time, while reading only the temperature, power, and fan sensors runs fast enough.
Ipmitool and ipmi-sensors give the same information on a lot of sensors. However, on at least some sensors ipmitool appears to be reading additional information. For instance, on one machine ipmi-sensors reports PSUs and fans as:
ID,Name,Type,Reading,Units,Event
[...]
3,Power Supply 1,Power Supply,N/A,N/A,'Presence detected'
[...]
13,Fan 1,Fan,N/A,N/A,'transition to Running'
Ipmitool's 'sdr list' reports the same PSU and fan as:
Power Supply 1,200,Watts,ok,10.1,Presence detected
Fan 1,39.20,percent,ok,7.1,Transition to Running
Ipmitool is somehow extracting additional readings from the IPMI that tell it the PSU wattage and the fan speed, where for some reason the ipmi-sensors standard sensor reading gives it nothing. Not all IPMI sensors need this extra information extracted (or perhaps have it), so for many of them ipmi-sensors can supply a reading (and it matches what ipmitool reports).
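Since '-c' makes ipmitool emit CSV, the output is easy to consume programmatically; here is a sketch, with the field layout (name, reading, units, status, then trailing entity/event fields) inferred from the sample lines above rather than from any documentation.

```python
import csv
import io

def parse_sdr_csv(text):
    """Parse 'ipmitool -c sdr list full' CSV output into dicts.

    The field positions are inferred from sample output: name,
    reading, units, status, then trailing fields we ignore here.
    """
    sensors = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 4:
            continue
        sensors.append({
            "name": row[0],
            "reading": row[1],
            "units": row[2],
            "status": row[3],
        })
    return sensors

sample = """\
Power Supply 1,200,Watts,ok,10.1,Presence detected
Fan 1,39.20,percent,ok,7.1,Transition to Running
"""
sensors = parse_sdr_csv(sample)
for sensor in sensors:
    print(sensor["name"], sensor["reading"], sensor["units"])
```

This is roughly how we turn the output into Prometheus-style metrics, though real output will vary by machine and sensor type.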
(The obvious assumption is that this extra reading is also why ipmitool is much slower than ipmi-sensors when reporting on some sensors on some IPMIs. My guess would be that doing the extra reading of those specific sensors is extra-slow.)
If I knew enough about how IPMIs worked and how you communicated with them, I would probably understand more of what the difference was between what the two programs are doing. As it is, I'm just thankful that I've found a way to read all of the important IPMI sensors that's fast enough. I do find it a little frustrating that the ipmitool manpage is so uninformative, and that the ipmi-sensors and ipmitool manpages seem to describe what they do the same way (at least to a non-specialist) despite clearly doing things somewhat differently.
The myhostname NSS module surprised me recently
The other day I did a plain traceroute from my Fedora 33 office workstation (instead of my usual 'traceroute -n') and happened to notice that the first hop was being reported as '_gateway'. This is very much not the name associated with that IP address, so I was rather surprised and went looking for the cause.
Although I initially suspected systemd-resolved because of a Fedora 33 change to use it, the actual cause turned out to be the myhostname NSS module, which was listed relatively early in the hosts: line in my /etc/nsswitch.conf.
If configured in nsswitch.conf, the myhostname module provides three services, only two of which have to do with your machine's hostname. The
simplest one is that
localhost and variants on it all resolve to the
appropriate localhost IPv4 and IPv6 addresses, and those localhost IPv4
and IPv6 addresses resolve back to 'localhost' in reverse lookups. The second one is that the exact system host name
resolves to all of the IP addresses on all of your interfaces; this
is the name that
hostname prints, and nothing else. Shortened or
lengthened variants of the hostname don't do this. As with localhost,
all of these IP addresses also resolve back to the exact local host
name. This is where the first peculiarity comes up. To quote the manpage:
- The local, configured hostname is resolved to all locally configured IP addresses ordered by their scope, or — if none are configured — the IPv4 address 127.0.0.2 (which is on the local loopback) and the IPv6 address ::1 (which is the local host).
If you do a reverse lookup on 127.0.0.2,
myhostname will always report
that it has the name of your machine, even if you have configured IP
addresses and so
myhostname would not give you 127.0.0.2 as an IP
address for your hostname. A reverse lookup of ::1 will report that it's
called both 'localhost' and your machine's name.
The third service is that the hostname "_gateway" is resolved to all current default routing gateway addresses. As with the first two services, the IP addresses of these gateways will also be resolved to the name "_gateway", which is what I stumbled over when I actually paid attention to the first hop in my traceroute output.
The current manpage for the module doesn't document that it also affects resolving IP addresses into
names as well as names into IP addresses. A very charitable person
could say that this is implied by saying that various hostnames
'are resolved' into IP addresses, as proper resolution of names to
IP addresses implies resolving them the other way too.
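To summarize the behavior described above, here is a small Python model of the reverse (address to name) answers the module gives. This is only a model of the rules as described, not the module's actual code, and the hostname 'myhost' and the demo addresses are hypothetical.

```python
import ipaddress

def myhostname_reverse(addr, hostname, configured_addrs, gateways):
    """Model the reverse (address -> name) answers of the myhostname
    NSS module, per the behavior described above. A simplified model,
    not the real resolver logic."""
    a = ipaddress.ip_address(addr)
    names = []
    if addr in gateways:
        names.append("_gateway")
    if addr in configured_addrs:
        names.append(hostname)
    if a == ipaddress.ip_address("127.0.0.2"):
        # 127.0.0.2 always reverse-resolves to the machine's name,
        # even when real addresses are configured
        names.append(hostname)
    if a.is_loopback:
        names.append("localhost")
        if a == ipaddress.ip_address("::1"):
            # ::1 reports both 'localhost' and the machine's name
            names.append(hostname)
    return names

# Hypothetical machine 'myhost' with one configured address:
print(myhostname_reverse("127.0.0.2", "myhost", ["192.168.1.1"], []))
```

Even with a configured address, 127.0.0.2 still claims to be the machine, which is the first peculiarity above.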
Which of these effects trigger for you depends on where myhostname is in your nsswitch.conf. For instance, if it's present at all (even at the end), the special hostname "_gateway" will resolve to your gateway IPs, and names like "localhost" will resolve to your IPv4 and IPv6 localhost IPs (and probably 127.0.0.2
will resolve to your hostname). If it's present early, it will steal
the resolution of more names and more IPs from DNS and other sources.
The myhostname NSS module is part of systemd and has worked like this for over half a decade (although it started out using "gateway" instead of "_gateway"). However, it's not necessarily packaged,
installed, and configured along with the rest of systemd. Ubuntu splits
it out into a separate package, libnss-myhostname, which isn't installed
on our Ubuntu servers. Fedora packages it as part of 'systemd-libs',
which means it's guaranteed to be installed, and appears to default to
using it in nsswitch.conf.
(What I believe is a stock installed Fedora 33 VM image has a hosts: line of "files mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] myhostname dns". You might think that this would make DNS results from systemd-resolved take precedence over myhostname, but in a quiet surprise systemd-resolved does this too; see the "Synthetic Records" section in its documentation.)
PS: I don't know why I never noticed this special name before, since myhostname has been doing all of this for some time (and I've had it in my nsswitch.conf ever since Fedora started shoving it in there). Possibly I just never noticed the name of the first hop when I ran plain 'traceroute', because I always knew what it was.
PPS: The change from "gateway" to "_gateway" happened in systemd 235, released 2017-10-06. The "gateway" feature for myhostname was introduced in systemd 218, released 2014-12-10. All of this is from systemd's NEWS file.
What tool you use to read IPMI sensor information can matter
Most modern servers have some sort of onboard IPMI (sometimes called a service processor), and there are good reasons to want to read the IPMI's sensors instead of relying purely on whatever sensor readings you can get from the Linux kernel. There are a number of open source projects for reading IPMI information on Linux (and sometimes other systems); I have direct experience with ipmitool and FreeIPMI, and the Arch wiki lists a few more.
Until very recently I would have told you that these tools were
more or less equivalent for reading IPMI sensors and you could use
whichever one you wanted to (or whichever had the more convenient
output format). Locally we've used
ipmitool for this, and generated
Prometheus metrics by parsing its output. Except that we haven't
been able to do this on all of our machines, because on some of our
models of Dell servers, reading the IPMI sensors is achingly slow.
I was all set to write a grumpy entry about the need for reading
IPMI sensors to be fast in order to be really useful when I decided
to test FreeIPMI to see if it would give me better diagnostics
about why reading IPMI sensors on these Dell servers was so slow.
To my surprise, FreeIPMI's ipmi-sensors was able to read IPMI sensors with blazing speed on these servers.
This blazing speed is partly the result of building a cache of
sensor data records (SDRs),
but even when
ipmi-sensors has to build its cache it runs several
times faster than
ipmitool (on the Dell server models that are a problem for ipmitool). Even on Dell server models where ipmitool runs decently fast, ipmi-sensors seems faster.
I have no idea what the two projects are doing differently when they
read IPMI sensors, but there clearly is a real difference in FreeIPMI's
favor, contrary to what I expected. My minimum learning experience from
this is that if I run into IPMI problems when I'm using one project,
such as ipmitool, I should try other projects just to see. I'm also
probably somewhat too cynical about the quality of implementation of
IPMIs, at least from major server companies; Dell probably wouldn't ship
an IPMI that really took five minutes to read out all of its sensors.
I've got a lot of ingrained habits around using ipmitool, along with some scripting. I'm probably going to switch at least the
scripting to use FreeIPMI (which means installing and testing it
on more servers here). I may not switch my habits, especially right
away, since they're pretty ingrained and ipmitool is convenient in
how everything is right there in one program and one manpage.
PS: I was lucky by already having some familiarity with FreeIPMI,
since I'd used one of its programs to set some OEM-specific options in the IPMIs for some servers.
Linux's hardware monitoring can lie to you
Let's start with my tweets:
Fedora 33's 5.11.7 kernel seems to do a real number on power/fan/temperature levels of my Radeon RX 550 sitting idle in framebuffer text console. Fan RPMs went from 780 or so to 2100 and temperature jumped from 29C to 31C and climbing.
Ah. The reason my GPU's temperature is steadily climbing despite the fans running at 2100 RPM or so, as reported by Linux hardware monitoring, is because the fans are in fact not running at all.
The Linux kernel exposes hardware monitoring information in /sys, as covered in some kernel documentation, although you need the relevant drivers to support this. My office machine has an AMD Radeon RX 550, and the kernel amdgpu driver module for it exposes various sensor information through this general hardware monitoring interface. Lm_sensors reports the driver's reported sensors as 'vddgfx' (in volts), 'fan1', 'edge' (temperature), and 'power1' (in watts).
(The exact /sys path for a GPU is somewhat arcane, but you can usually get to it with /sys/class/drm/card0/device/hwmon/ and then some numbered 'hwmonN' subdirectory. There's also /sys/kernel/debug/dri/0 with various things, including an amdgpu_pm_info file that reports things in text.)
My GPU's fan (really fans) seem to use pulse width modulation (pwm), based on PWM-related sensor information showing up in amdgpu's hwmon directory. Under 5.11.7 (and 5.11.8), the PWM value appears to be 0 (instead of its usual '81'). I suspect that this means that regardless of the reported RPMs, the PWM duty cycle was 0% and so the fan wasn't turning. Why the GPU and the amdgpu driver together reported 2100 RPM instead of some other value, I have no idea (and it wasn't a constant 2100 RPM, it fluctuated around a bit).
At a minimum, this tells me that straightforward interpretations of hwmon values may be misleading because you need to look at other hwmon values for context. More generally, hwmon values are only as trustworthy as the combination of the hardware and the driver reporting them and clearly some combinations don't report useful values. Common tools, like lm_sensors, may not cover corner cases (such as the PWM duty cycle being 0), so looking at their output may mislead you about the state of your hardware. In the end, nothing beats actually looking at things in person, which is a little bit alarming in these work from home times when that's a bit difficult.
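If you want to cross-check the fan reading against the duty cycle yourself, here is a sketch that reads the standard hwmon file names (fan1_input for RPM, pwm1 for the 0-255 duty cycle) and flags the inconsistency; the demo uses a fake directory with the readings from above, since real hwmon paths vary per machine and driver.

```python
import os
import tempfile

def read_gpu_fan(hwmon_dir):
    """Read fan RPM and PWM duty cycle from a hwmon directory and
    flag a fan that claims to be spinning at 0% duty cycle.
    fan1_input and pwm1 are the standard hwmon file names."""
    def read_int(name):
        with open(os.path.join(hwmon_dir, name)) as f:
            return int(f.read().strip())
    rpm = read_int("fan1_input")
    pwm = read_int("pwm1")          # duty cycle, 0-255
    suspect = rpm > 0 and pwm == 0  # reported RPM but 0% duty
    return rpm, pwm, suspect

# Demo with a fake hwmon directory mimicking the readings above:
demo = tempfile.mkdtemp()
for name, val in (("fan1_input", "2100"), ("pwm1", "0")):
    with open(os.path.join(demo, name), "w") as f:
        f.write(val + "\n")

rpm, pwm, suspect = read_gpu_fan(demo)
print(rpm, pwm, suspect)
```

On my office machine under 5.11.7, this check would have flagged the fans as suspect long before the climbing temperature did.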
(The good news is that the Prometheus host agent does capture the hwmon pwm, so you can go back and look for anomalies.)
What OpenSSH sshd logs when a session disconnects (on Linux)
On Twitter, I recently was disgruntled about sshd's logging:
I really wish sshd logged one single line that said a session had ended, what user it was for (and where from), and what the reason for ending the session was. You might think it already did this, but sadly not.
There are many reasons you might care about the causes of SSH session disconnections, including that you're trying to troubleshoot potential network or firewall problems and you want to see if people are getting abruptly disconnected from their SSH sessions or if the sessions are ending normally.
For many sessions,
sshd will log something like:
sshd: Received disconnect from 192.168.10.10 port 51726:11: disconnected by user
sshd: Disconnected from user ckstst 192.168.10.10 port 51726
sshd: pam_unix(sshd:session): session closed for user ckstst
(You can also see 'disconnected by server request'.)
This is already less than ideal, because the user is mentioned on a different line than the nominal disconnection reason. In addition, the PID logged for the first two messages appears from nowhere; it's not in any other log lines for the session, which instead use the PID from the third line.
However, a suspicious system administrator might see the message about 'disconnected by user' and wonder what happens when the user's TCP connection just gets cut for some reason, such as their machine shutting down abruptly. At least some of the time, what you get is more obscure:
sshd[14867]: Timeout, client not responding
sshd[14719]: pam_unix(sshd:session): session closed for user ckstst
The second PID logged, PID 14719 here, is associated with other earlier log lines that give you the remote IP. The first PID, PID 14867, has never been seen before or since (apart from PID rollover).
However, sshd can log even less. Suppose that something either abruptly terminates the user's SSH client program or causes TCP resets (RSTs) to be generated for the TCP connection. Then you will get only the log line:
sshd: pam_unix(sshd:session): session closed for user ckstst
One reason that TCP resets might be generated is state table entries timing out on some firewall between your server and the person logging in. Many home routers are also NAT firewalls and often have small state tables and aggressively time out entries in them.
All of this lack of clear logging forces you into reasoning by omission. If there is a 'Received disconnect' and a 'Disconnected' message logged by sshd, the session was disconnected in an orderly way and you can get the reason from the specific trailer. Even here in the best case you need to correlate three log lines to recover all information about the session. If there are no messages about the session ending from sshd but there is a 'Timeout' or other logged sshd error immediately before the PAM message, the TCP connection was most likely lost. Finally, if there's nothing other than the PAM message, the session probably ended because of some abrupt termination of the TCP connection (either by client death causing the TCP connection to be closed or by firewalls deciding to reset it).
The one bright spot in all of this is that you always get the PAM message (as far as I know) and it always has the login name.
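That reasoning-by-omission can be captured in a small Python classifier. The message substrings are taken from the log examples in this entry; real logs have PID prefixes, timestamps, and more message variants, so treat this as a sketch rather than a complete rule set.

```python
def classify_session_end(lines):
    """Classify how an SSH session ended from its sshd log lines,
    using the reasoning-by-omission rules described above."""
    text = "\n".join(lines)
    if "session closed for user" not in text:
        return "no session end recorded"
    if "Received disconnect" in text or "Disconnected from" in text:
        return "orderly disconnect"
    if "Timeout, client not responding" in text:
        return "connection lost (timeout)"
    return "abrupt termination (client death or TCP reset)"

result = classify_session_end([
    "sshd: Timeout, client not responding",
    "sshd: pam_unix(sshd:session): session closed for user ckstst",
])
print(result)
```

In practice you would first have to group the lines for one session together by PID, which is exactly the correlation work the logging makes awkward.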
Systemd needs (or could use) a linter for unit files
Today I made a discovery:
Today's Fedora packaging failure: /usr/lib/systemd/system/lttng-session.service (from lttng-tools) is a PGP key, not a systemd .service unit. (Insert joke about people once again not managing to use PGP properly)
Yes, bug filed: #1935426
I discovered this because I was watching 'journalctl -f' while I upgraded my office workstation to Fedora 33 in my usual way. The upgrade process causes systemd to re-examine
your unit files and complain about things. Most of the complaints
were normal ones like:
mcelog.service:8: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether
xinetd.service:10: PIDFile= references a path below legacy directory /var/run/, updating /var/run/xinetd.pid → /run/xinetd.pid; please update the unit file accordingly.
But mixed in with those complaints I noticed the much more unusual:
lttng-sessiond.service:1: Assignment outside of section. Ignoring.
lttng-sessiond.service:3: Assignment outside of section. Ignoring.
[.. for a bunch more lines ..]
That made me curious what was in the file, whereupon I discovered that it was actually a PGP public key instead of a systemd unit file.
We can laugh at this mistake because it's funny in several different ways (given that it involves PGP for extra spice). But it's actually pointing out a systematic problem, one that is also illustrated by those other messages about other systemd units, which is that there's no easy way to check your unit files to see if systemd is happy with them. In other words, there is no linter for systemd unit files.
If there was a linter, none of these problems would be there, or at least any that were still present would be ones that Fedora (or any other Linux distribution) had decided were actually okay. With a linter, Linux distributions could make it a standard packaging rule (in whatever packaging system they use) that all systemd units in a package had to pass the linter; this would have automatically detected the lttng-tools problem, probably among others. Without a linter, the only way to detect systemd unit problems is to enable them and see if systemd complains. This is not something that's easy to automate, especially during package builds, and so it's fallible and limited.
(Because of that it invites people to file bugs for things that may not be bugs. Are these issues with the PIDFile location actual oversights in the packaging or an area where Fedora's standard doesn't line up with the systemd upstream? I can't tell.)
An automatically applied linter would be especially useful for the less frequently used packages and programs, where an issue has a much easier time lurking for some time. Probably not very many people have lttng-tools even installed on Fedora, and clearly not very many of them use things that require the lttng sessiond service.
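The easy cases of such a linter would be simple to write. Here is a Python sketch of just one check, flagging content before any [Section] header, which is roughly what systemd's 'Assignment outside of section' complaint above is about; a real linter would also need systemd's actual list of valid sections and directives.

```python
def lint_unit_file(text):
    """Tiny sketch of one check a systemd unit linter could do:
    flag content that appears before any [Section] header, which
    systemd ignores with 'Assignment outside of section' style
    complaints."""
    problems = []
    in_section = False
    for n, line in enumerate(text.splitlines(), 1):
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", ";")):
            continue  # blank lines and comments are fine anywhere
        if stripped.startswith("[") and stripped.endswith("]"):
            in_section = True
        elif not in_section:
            problems.append(f"line {n}: content outside of any section")
    return problems

# A PGP key shipped as a .service file fails on every line:
pgp_key = "-----BEGIN PGP PUBLIC KEY BLOCK-----\nVersion: 1\n"
problems = lint_unit_file(pgp_key)
print(problems)
```

Run as a packaging check, even something this crude would have caught the lttng-tools mistake at build time instead of at upgrade time.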
PS: This isn't the only systemd project where standards have changed and some systemd bit is now complaining. Systemd-tmpfiles complains about various things wanting to clean up bits in /var/run, for example.