Wandering Thoughts archives

2018-05-18

How I usually divide up NFS (operation) metrics

When you're trying to generate metrics for local disk IO, life is generally relatively simple. Everyone knows that you usually want to track reads separately from writes, especially these days when they may have significantly different performance characteristics on SSDs. While there are sometimes additional operations issued to physical disks, they're generally not important. If you have access to OS-level information it can be useful to split your reads and writes into synchronous versus asynchronous ones.

Life with NFS is not so simple. NFS has (data) read and write operations, like disks do, but it also has a large collection of additional protocol operations that do various things (although some of these protocol operations are strongly related to data writes, for example the COMMIT operation, and should probably be counted as data writes in some way). If you're generating NFS statistics, how do you want to break up or aggregate these other operations?

One surprisingly popular option is to ignore all of them on the grounds that they're obviously unimportant. My view is that this is a mistake in general, because these NFS operations can have an IO impact on the NFS server and create delays on the NFS clients if they're not satisfied fast enough. But if we want to say something about these and we don't want to go to the extreme of presenting per-operation statistics (which is probably too much information, and in any case can hide patterns in noise), we need some sort of breakdown.

The breakdown that I generally use is to split up NFS operations into four categories: data reads, data writes (including COMMIT), operations that cause metadata writes such as MKDIR and REMOVE, and all other operations (which are generally metadata reads, for example READDIRPLUS and GETATTR). This split is not perfect, partly because some metadata read operations are far more common (and are far more cached on the server) than other operations; specifically, GETATTR and ACCESS are often the backbone of a lot of NFS activity, and it's common to see GETATTR as by far the most common single operation.

(I'm also not entirely convinced that this is the right split; as with other metrics wrestling, it may just be a convenient one that feels logical.)

Sidebar: Why this comes up less with local filesystems and devices

If what you care about is the impact that IO load is having on the system (and how much IO load there is), you don't entirely care why an IO request was issued, you only care that it was. From the disk drive's perspective, a 16 KB read is a 16 KB read, and it takes as much work to satisfy a 16 KB file as it does a 16 KB directory or a free space map. This doesn't work for NFS because NFS is more abstracted, and both the amount of operations and the amount of bytes that flow over the wire don't necessarily give you a true picture of the impact on the server.

Of course, in these days of SSDs and complicated disk systems, just having IO read and write information may not be giving you a true picture either. With SSDs especially, we know that bursts of writes are different from sustained writes, that writing to a full disk is often different than writing to an empty one, and apparently giving drives some idle time to do background processing and literally cool down may change their performance. But many things are simplifications so we do the best we can.

(Actual read and write performance is a 'true picture' in one sense, in that it is giving you information about what results the OS is getting from the drive. But it doesn't necessarily help to tell you why, or what you can do to improve the situation.)

NFSMyMetricsSplit written at 01:44:01; Add Comment

By day for May 2018: 18 20 23 24 28; before May; after May.

Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.