2021-12-09
Why it's good to explicitly document the purposes of things, illustrated
Today we decommissioned some internal DNS zones for things that we aren't using or aren't doing any more. Among them were three internal iSCSI zones, which makes sense since we haven't used iSCSI since we moved away from our OmniOS fileservers to our Linux fileservers. But when my co-worker doing the work told me about this, I was surprised that there were three iSCSI networks, since our Solaris and OmniOS fileservers only ever used two.
We have a worklog system to record changes, and I was able to find the worklog entry for the addition of the 'iscsi3' iSCSI network, but it didn't have an explanation of what the network was for. Instead it was just written as a matter-of-fact 'I added this internal zone to DNS' report. Only after a surprising amount of searching through our email archives was I able to turn up another entry about a related change, with a brief aside at the bottom that partially explained the intended purpose of this third iSCSI zone. Without that aside, I would probably still be in the dark about the purpose of this mysterious zone.
Broadly speaking, this isn't a surprising new issue for me with our worklog system. I've written about how our worklog messages often assume a bunch of context and how, for example, we lost and regained a piece of Amanda knowledge. But I think this is the first time I've seen us lose track of something as large as the purpose of an entire internal DNS zone (even if we never actually used the zone or really implemented the idea it was for). I'm glad that we sort of documented the 'obvious' (at the time) purpose of the zone in an aside, even if ideally we would have written the whole thing down explicitly.
One of the things this suggests to me is that we should consider sending design documents and other large scale 'what we are doing here' and 'what this is for' documentation to our worklog system, or at least coming up with some way of recording them for posterity. Design documents aren't changes, so we don't naturally think to write them up and send them to our worklog system.
(We do tend to worklog how complicated systems work, so that we have a reference document we can search for later, but for other, more 'obvious' things we so far just sort of assume we'll remember the context.)
Some NVMe drive temperature things from my drives
I said some things on Twitter about the temperatures of the NVMe drives in my home and work machines (as reported by the Linux kernel), so I'm going to write down more on that here. I don't know if it means anything, but it's at least some data points.
Both my work desktop and my home desktop now have a pair of NVMe drives in them, all of which report the nominal drive temperature to the Linux kernel. The machines are built using the same case but have different motherboards, different sorts of CPUs, and different PCIe card layouts (which influences heat flow inside the case). And of course they sit in different environments, so both the ambient exterior temperature and the case interior temperature are likely different.
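(For concreteness, here's a minimal sketch of how these temperatures can be read, assuming a reasonably modern kernel with NVMe hwmon support; the sysfs layout is what I see on my machines, so check yours before trusting it.)

    #!/usr/bin/python3
    # Print the 'Composite' temperature of every NVMe drive that the
    # kernel exposes through hwmon.
    import glob, os

    for hwmon in glob.glob("/sys/class/hwmon/hwmon*"):
        with open(os.path.join(hwmon, "name")) as f:
            if f.read().strip() != "nvme":
                continue
        # temp1 is the NVMe 'Composite' temperature, in millidegrees C.
        with open(os.path.join(hwmon, "temp1_input")) as f:
            mdeg = int(f.read().strip())
        print(hwmon, "%.1f C" % (mdeg / 1000.0))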
At home I now have a pair of Crucial P5 NVMe drives, which are reported to run comparatively hot. Both are mounted in motherboard M.2 slots; one slot is low on the board, away from the CPU, and covered by a motherboard heatsink, while the other is just below the CPU. Both drives are unused right now, and at idle they're both consistently around 41 C. When I put them under test load, the one under the heatsink goes up to 53 C, while the other one goes up to 62 C (despite using only two PCIe lanes instead of four). So my first moderate surprise is that the motherboard M.2 NVMe heatsink actually does seem to be doing something. Whether it makes a performance difference I don't know, but it clearly makes a heat difference.
(The Crucial P5 NVMe drives have two temperature sensors; one source suggests that one is on the controller and the other on the flash. Apparently the flash is usually lower temperature than the controller. My numbers are for what Linux reports as the 'Composite' temperature. At idle, the controller temperature seems to be 4 to 5 C higher than the composite temperature, while at load the two controllers reach 60 C and 77 C.)
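(If you want to see the individual sensors as well as the composite reading, one way is to walk the drive's labeled hwmon channels. This is a sketch under the assumption that your kernel exposes them as 'Sensor 1', 'Sensor 2', and so on, which is what I see; the specific hwmon path here is a made-up placeholder.)

    #!/usr/bin/python3
    # Dump every labeled temperature channel for one NVMe hwmon device,
    # eg 'Composite', 'Sensor 1', and 'Sensor 2' on drives that report
    # more than one sensor.
    import glob, os

    HWMON = "/sys/class/hwmon/hwmon2"   # placeholder; find your drive's hwmon

    for labelf in sorted(glob.glob(os.path.join(HWMON, "temp*_label"))):
        with open(labelf) as f:
            label = f.read().strip()
        with open(labelf.replace("_label", "_input")) as f:
            mdeg = int(f.read().strip())
        print("%-10s %.1f C" % (label, mdeg / 1000.0))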
At work I have a pair of Kingston A2000 NVMe drives, which apparently run relatively cool. One is mounted in a motherboard M.2 slot that has no heatsink and is located right between my Radeon RX 550 and the Ryzen CPU; the other is mounted on a PCIe expansion card, also without a heatsink. These are in use for the machine's root filesystem and core ZFS pool, but the machine itself is mostly idle since I'm not at work. The NVMe drive on the PCIe card seems to idle around 30 C, while the one on the motherboard idles around 33 C. Under normal heavier IO load like compiling Firefox, they can get as high as 45 C and 47 C respectively. Under artificial load of full streaming reads from the block devices, they will go to 50 C and 54 C respectively. The moderate surprise here is how much location and perhaps mounting can matter.
All four drives warm up and cool down fairly rapidly. This is probably not really surprising, since the drives are all physically small and thus presumably don't store much heat. It's still impressive to see a drive at a high temperature lose 10 C in the space of thirty seconds or so once the load goes away.
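(One way to watch this happen is a trivial polling loop; here's a sketch of the sort of thing I mean, again with a made-up hwmon path that you'd substitute with your own.)

    #!/usr/bin/python3
    # Poll one drive's Composite temperature every second so you can
    # watch it climb under load and then fall off afterward.
    import time

    TEMP = "/sys/class/hwmon/hwmon2/temp1_input"   # placeholder path

    while True:
        with open(TEMP) as f:
            mdeg = int(f.read().strip())
        print(time.strftime("%H:%M:%S"), "%.1f C" % (mdeg / 1000.0), flush=True)
        time.sleep(1)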
PS: Right now, both motherboard temperature sensors claim to be reading about 30 C and all of the NVMe drives are at their 'idle' temperature. Where the various motherboard temperature sensors are located and what actually influences them is an open question; it's not like vendors label them on the board for you.