Wandering Thoughts archives

2018-03-14

What I think I want out of a hypothetical nfsiotop for Linux

I tweeted:

I wish there was a version of Linux's nfsiostat that worked gracefully when you have several hundred NFS mounts across multiple NFS fileservers.

(I'm going to have to write one, aren't I.)

Linux exposes a very large array of per-filesystem NFS client statistics in /proc/self/mountstats (see here) and there are some programs that digest this data and report it, such as nfsiostat(8). Nfsiostat generally works decently to give you useful information, but it's very much not designed for systems with, for example, over 250 NFS mounts. Unfortunately that describes us, and we would rather like to have a tool which tells us what the NFS filesystem hotspots are on a given NFS client if and when it's clearly spending a lot of time waiting for NFS IO.

(We have some machines with this sort of problem.)
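To give a feel for what digesting this data involves, here is a minimal Python sketch that pulls the per-operation counters for NFS mounts out of mountstats-style text. The sample text is made up, the function name parse_mountstats is hypothetical, and the sketch deliberately doesn't interpret the individual counter columns, since their exact layout depends on the statistics version.

```python
# Hypothetical sample in the general shape of /proc/self/mountstats.
SAMPLE = """\
device proc mounted on /proc with fstype proc
device fs1:/export/home mounted on /home with fstype nfs statvers=1.1
\tage:\t1000
\tper-op statistics
\t        READ: 10 10 0 1200 40960 5 30 40
\t       WRITE: 4 4 0 8192 400 2 50 60
"""


def parse_mountstats(text):
    """Return {mountpoint: {"device": ..., "ops": {opname: [counters]}}}.

    Only NFS mounts are kept. The counters are returned as raw integers;
    what each column means is left to the caller (an assumption-free choice).
    """
    mounts = {}
    current = None
    in_ops = False
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "device":
            # device SERVER:/path mounted on /mnt with fstype nfs statvers=...
            current = None
            in_ops = False
            if len(words) >= 8 and words[7] in ("nfs", "nfs4"):
                current = {"device": words[1], "ops": {}}
                mounts[words[4]] = current
        elif current is not None and words[:2] == ["per-op", "statistics"]:
            in_ops = True
        elif current is not None and in_ops and words[0].endswith(":"):
            current["ops"][words[0][:-1]] = [int(w) for w in words[1:]]
    return mounts


if __name__ == "__main__":
    for mountpoint, info in parse_mountstats(SAMPLE).items():
        print(mountpoint, info["device"], sorted(info["ops"]))
```

On a real client you would feed it the contents of /proc/self/mountstats and take the difference between two samples to get rates.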

As suggested by the name, a hypothetical nfsiotop would have to only report on the top N filesystems, which raises the question of how you sort NFS filesystems here. Modern versions of nfsiostat sort by operations per second, which is a start, but I think that one should also be able to sort by total read and write volume and probably also by write volume alone. Other likely interesting things to sort on are the average response time and the current number of operations outstanding. An ideal tool would also be able to aggregate things into per-fileserver statistics.

(All of this suggests that the real answer is that you should be able to sort on any field that the program can display, including some synthetic ones.)
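The "sort on any field" idea, plus per-fileserver aggregation, can be sketched in a few lines of Python. Everything here is hypothetical: the field names (ops_s, read_kb_s, write_kb_s, total_kb_s), the sample numbers, and the helper functions are illustrations only, not anything nfsiostat provides.

```python
from collections import defaultdict

# Hypothetical per-mount rates, as if already computed from two samples.
STATS = [
    {"mount": "/h/100", "server": "fs1", "ops_s": 120.0, "read_kb_s": 300.0, "write_kb_s": 50.0},
    {"mount": "/h/101", "server": "fs1", "ops_s": 15.0, "read_kb_s": 20.0, "write_kb_s": 900.0},
    {"mount": "/h/200", "server": "fs2", "ops_s": 400.0, "read_kb_s": 100.0, "write_kb_s": 10.0},
]


def with_synthetic(row):
    """Return a copy of row with a synthetic total read+write volume field."""
    out = dict(row)
    out["total_kb_s"] = row["read_kb_s"] + row["write_kb_s"]
    return out


def top_n(rows, field, n):
    """Return the n rows with the largest value of any displayable field."""
    return sorted(rows, key=lambda r: r[field], reverse=True)[:n]


def by_server(rows):
    """Sum every numeric field per fileserver, giving one row per server."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        for key, value in row.items():
            if isinstance(value, float):
                totals[row["server"]][key] += value
    return [dict(server=server, **vals) for server, vals in totals.items()]


if __name__ == "__main__":
    rows = [with_synthetic(r) for r in STATS]
    for row in top_n(rows, "write_kb_s", 2):
        print(row["mount"], row["write_kb_s"])
    for row in top_n(by_server(rows), "total_kb_s", 2):
        print(row["server"], row["total_kb_s"])
```

Summing only makes sense for rate-like fields; something like average response time would need to be weighted by operation count rather than added up.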

As my aside in the tweet suggests, I suspect that I'm going to have to write this myself, and probably mostly from scratch. While nfsiostat is written in Python and so is probably reasonably straightforward for me to modify, I suspect that it has too many things I'd want to change. I don't want little tweaks for things like its output, I want wholesale restructuring. Hopefully I can reuse its code to parse the mountstats file, since that seems reasonably tedious to write from scratch. On the other hand, the current nfsiostat Python code seems amenable to a quick gut job to prototype the output that I'd want.

(Mind you, prototypes tend to drift into use. But that's not necessarily a bad thing.)

PS: I've also run across kofemann/nfstop, which has some interesting features such as a per-UID breakdown, but it works by capturing NFS network traffic and that's not the kind of thing I want to have to use on a busy machine, especially at 10G.

PPS: I'd love to find out that a plausible nfsiotop already exists, but I haven't been able to turn one up in Internet searches so far.

linux/NfsiotopDesire written at 22:48:49;

Why Let's Encrypt's short certificate lifetimes are a great thing

I recently had a conversation on Twitter about what we care about in TLS certificate sources, and it got me to realize something. I've written before about how our attraction to Let's Encrypt has become all about the great automation, but what I hadn't really thought about back then was how important the short certificate lifetimes are. What got me really thinking about it was a hypothetical: suppose we could get completely automatically issued and renewed free certificates, but they had the typical one or more year lifetime of most TLS certificates to date. Would we be interested? I realized that we would not be, and that we would probably consider the long certificate lifetime to be a drawback, not a feature.

There is a general saying in modern programming to the effect that if you haven't tested it, it doesn't work. In system administration, we tend towards a modified version of that saying; if you haven't tested it recently, it doesn't work. Given our generally changing system environments, the 'recently' is an important qualification; it's too easy for things to get broken by changes around them, so the longer it's been since you tried something, the less confidence you can have in it. The corollary for infrequent certificate renewal is obvious, because even in automated systems things can happen.

With Let's Encrypt, we don't just have automation; the short certificate lifetime ensures that we exercise it frequently. Our client of choice (acmetool) renews certificates when they're 30 days from expiring, so although the official Let's Encrypt lifetime is 90 days, we roll over certificates every sixty days. Having a rollover happen once every two months is great for building and maintaining our confidence in the automation, in a way that wouldn't happen if it was once every six months, once a year, or even less often. If it was that infrequent, we'd probably end up paying attention during certificate rollovers even if we let automation do all of the actual work. With the frequent rollover due to Let's Encrypt's short certificate lifetimes, they've become things we trust enough to ignore.
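The arithmetic behind the sixty days is simple enough to write down. This is only an illustrative sketch; the function name is hypothetical, and the one-year comparison assumes the same 30-day renewal margin, which is not something stated above.

```python
def rollover_interval(lifetime_days, renew_before_days):
    """Return (days between rollovers, approximate rollovers per year)."""
    interval = lifetime_days - renew_before_days
    return interval, 365 / interval


if __name__ == "__main__":
    # Let's Encrypt lifetime of 90 days, renewed 30 days before expiry.
    print(rollover_interval(90, 30))
    # Assumed comparison: a one-year certificate with the same margin.
    print(rollover_interval(365, 30))
```

The 90-day case comes out at roughly six exercises of the automation a year, against about one for a one-year certificate.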

(Automatic certificate renewal for long-duration certificates is not completely impossible here, because the university central IT has already arranged for free certificates for the university. Right now they're managed through a website and our university-wide authentication system, but in theory there could be automation for at least renewals. Our one remaining non-Let's Encrypt certificate was issued through this service as a two-year certificate.)

sysadmin/LetsEncryptDurationGood written at 01:24:45;


