2024-02-23
Fixing my problem of a stuck 'dnf updateinfo info' on Fedora Linux
I apply Fedora updates only by hand, and as part of this I like to
look at what 'dnf updateinfo info' will tell me about why they're
being done. For some time, there's been an issue on my work desktop
where 'dnf updateinfo info' would report on updates that I'd already
applied, often drowning out information about the updates that I
hadn't. This was a bit frustrating, because my home Fedora machine
didn't do this but I couldn't spot anything obviously wrong (and at
various times I'd cleaned all of the DNF caches that I could find).
(Now that I look, it seems I've been having some variant of this problem for a while.)
Recently I took another shot at troubleshooting this. In the system programmer way, I started by locating the Python source code of the DNF updateinfo subcommand and reading it. This showed me a bunch of subcommand specific options that I could have discovered by reading 'dnf updateinfo --help' and led me to find 'dnf updateinfo list', which lists which RPM (or RPMs) a particular update will update. When I used 'dnf updateinfo list' and looked at the list of RPMs, something immediately jumped out at me, and it turned out to be the cause.
My 'dnf updateinfo info' problems were because I had old Fedora 37 'debugsource' RPMs still installed (on a machine now running Fedora 39).
The '-debugsource' and '-debuginfo' RPMs for a given RPM contain symbol information and then source code that is used to allow better debugging (see Debuginfo packages and this change to create debugsource as well). I tend to wind up installing them if I'm trying to debug a crash in some standard packaged program, or sometimes code that heavily uses system libraries. Possibly these packages get automatically cleaned up if you update Fedora releases in one of the officially supported ways, but I do a live upgrade using DNF (following this Fedora documentation). Clearly, when I do such an upgrade, these packages are not removed or updated.
(It's possible that these packages are also not removed or updated within a specific Fedora release when you update their base packages, but since they were installed a long time ago I can't tell at this point.)
With these old debugsource packages hanging around, DNF appears to have reasonably seen more recent versions of them available and duly reported the information on the 'upgrade' (in practice the current version of the package) in 'dnf updateinfo info' when I asked for it. That the packages would not be updated if I did a 'dnf update' was not updateinfo's problem. Removing the debugsource packages eliminated this and now 'dnf updateinfo info' is properly only reporting actual pending updates.
('dnf updateinfo' has various options for what packages to select, but as covered in the updateinfo command documentation apparently they're mostly the same in practice.)
In the future I'm going to have to remember to remove all debugsource and debuginfo packages before upgrading Fedora releases. Possibly I should remove them after I'm done with whatever I installed them for. If I needed them again (in that Fedora release) I'd have to re-fetch them, but that's rare.
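As a rough aid for that cleanup, here is a minimal Python sketch that picks out debuginfo and debugsource packages left over from an older Fedora release. It assumes you feed it package names in the usual 'rpm -qa' form and that the release shows up as a '.fcNN' dist tag; the function name and the sample package names are made up for illustration.

```python
import re

# Matches the Fedora dist tag (for example ".fc37") in a package's release.
DIST_TAG = re.compile(r"\.fc(\d+)(?=[.\-]|$)")

def stale_debug_packages(installed, current_release):
    """Return debuginfo/debugsource packages built for an older Fedora release.

    'installed' is an iterable of package names as printed by 'rpm -qa'.
    """
    stale = []
    for pkg in installed:
        if "-debuginfo-" not in pkg and "-debugsource-" not in pkg:
            continue
        m = DIST_TAG.search(pkg)
        if m and int(m.group(1)) < current_release:
            stale.append(pkg)
    return stale

# Made-up package names, for illustration only.
SAMPLE = [
    "bash-5.2.26-1.fc39.x86_64",
    "bash-debugsource-5.2.9-3.fc37.x86_64",
    "glibc-debuginfo-2.36-9.fc37.x86_64",
    "glibc-debuginfo-2.38-16.fc39.x86_64",
]
print(stale_debug_packages(SAMPLE, 39))
```

The result is only a list of candidates to look at; removing them is still a by-hand decision.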
PS: In reading the documentation, I've discovered that it's really
'dnf updateinfo --info'; updateinfo just accepts 'info' (and
'list') as equivalent to the switches.
(This elaborates on a Fediverse post I made at the time.)
2024-02-21
What ZIL metrics are exposed by (Open)ZFS on Linux
The ZFS Intent Log (ZIL) is effectively
ZFS's version of a filesystem journal, writing out hopefully brief
records of filesystem activity to make them durable on disk before
their full version is committed to the ZFS pool. What the ZIL is
doing and how it's performing can be important for the latency (and
thus responsiveness) of various operations on a ZFS filesystem,
since operations like fsync() on an important file must wait for
the ZIL to write out (commit) their information before they can
return from the kernel. On Linux, OpenZFS
exposes global information about the ZIL in /proc/spl/kstat/zfs/zil,
but this information can be hard to interpret without some knowledge
of ZIL internals.
(In OpenZFS 2.2 and later, each dataset also has per-dataset ZIL information in its kstat file, /proc/spl/kstat/zfs/<pool>/objset-0xXXX, for some hexadecimal '0xXXX'. There's no overall per-pool ZIL information the way there is a global one, but for most purposes you can sum up the ZIL information from all of the pool's datasets.)
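A minimal sketch of that summing in Python, assuming the per-dataset files use the usual 'name type value' kstat line layout and that their ZIL counters all start with 'zil_' (which is my reading of them, not something I've checked against every version). The function name and the 'root' parameter are mine; 'root' only exists so the code can be pointed at a test directory.

```python
import glob
import os

def pool_zil_totals(pool, root="/proc/spl/kstat/zfs"):
    """Sum every zil_* counter across a pool's objset-* kstat files."""
    totals = {}
    for path in glob.glob(os.path.join(root, pool, "objset-*")):
        with open(path) as f:
            for line in f:
                fields = line.split()
                # Data lines look like 'name type value'; skip headers and
                # non-numeric entries such as the dataset name.
                if len(fields) != 3 or not fields[2].isdigit():
                    continue
                if fields[0].startswith("zil_"):
                    totals[fields[0]] = totals.get(fields[0], 0) + int(fields[2])
    return totals
```

On a system without per-dataset ZIL statistics this simply returns an empty dictionary.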
The basic background here is the flow of activity in the ZIL and also the comments in zil.h about
the members of the zil_stats struct.
The (ZIL) data you can find in the "zil" file (and the per-dataset
kstats in OpenZFS 2.2 and later) is as follows:
- zil_commit_count counts how many times a ZIL commit has been requested through things like fsync().
- zil_commit_writer_count counts how many times the ZIL has actually committed. More than one commit request can be merged into the same ZIL commit, if two people fsync() more or less at the same time.
- zil_itx_count counts how many intent transactions (itxs) have been written as part of ZIL commits. Each separate operation (such as a write() or a file rename) gets its own separate transaction; these are aggregated together into log write blocks (lwbs) when a ZIL commit happens.
When ZFS needs to record file data into the ZIL, it has three options,
which it calls 'indirect', 'copied', and 'needcopy' in ZIL
metrics. Large enough amounts of file data are handled with an
indirect write, which writes the data to its final location in the
regular pool; the ZIL transaction only
records its location, hence 'indirect'. In a copied write, the data
is directly and immediately put in the ZIL transaction (itx), even
before it's part of a ZIL commit; this is done if ZFS knows that the
data is being written synchronously and it's not large enough to trigger
an indirect write. In a needcopy write, the data just hangs around in
RAM as part of ZFS's regular dirty data, and if a ZIL commit happens
that needs that data, the process of adding its itx to the log write
block will fetch the data from RAM and add it to the itx (or at least
the lwb).
There are ZIL metrics about this:
- zil_itx_indirect_count and zil_itx_indirect_bytes count how many indirect writes have been part of ZIL commits, and the total size of the indirect writes of file data (not of the 'itx' records themselves, per the comments in zil.h).

  Since these are indirect writes, the data written is not part of the ZIL (it's regular data blocks), although it is put on disk as part of a ZIL commit. However, unlike other ZIL data, the data written here would have been written even without a ZIL commit, as part of ZFS's regular transaction group commit process. A ZIL commit merely writes it out earlier than it otherwise would have been.
- zil_itx_copied_count and zil_itx_copied_bytes count how many 'copied' writes have been part of ZIL commits and the total size of the file data written (and thus committed) this way.
- zil_itx_needcopy_count and zil_itx_needcopy_bytes count how many 'needcopy' writes have been part of ZIL commits and the total size of the file data written (and thus committed) this way.
A regular system using ZFS may have little or no 'copied' activity. Our NFS servers all have significant amounts of it, presumably because some NFS data writes are done synchronously and so this trickles through to the ZFS stats.
In a given pool, the ZIL can potentially be written to either the
main pool's disks or to a separate log device (a slog, which can
also be mirrored). The ZIL metrics have a collection of
zil_itx_metaslab_* metrics about data actually written to the
ZIL in either the main pool ('normal' metrics) or to a slog (the
'slog' metrics).
- zil_itx_metaslab_normal_count counts how many ZIL log write blocks (not ZIL records, itxs) have been committed to the ZIL in the main pool. There's a corresponding 'slog' version of this and all further zil_itx_metaslab metrics, with the same meaning.
- zil_itx_metaslab_normal_bytes counts how many bytes have been 'used' in ZIL log write blocks (for ZIL commits in the main pool). This is a rough representation of how much space the ZIL log actually needed, but it doesn't necessarily represent either the actual IO performed or the space allocated for ZIL commits.

  As I understand things, this size includes the size of the intent transaction records themselves and also the size of the associated data for 'copied' and 'needcopy' data writes (because these are written into the ZIL as part of ZIL commits, and so use space in log write blocks). It doesn't include the data written directly to the pool as 'indirect' data writes.
If you don't use a slog in any of your pools, the 'slog' versions of these metrics will all be zero. I think that if you have only slogs, the 'normal' versions of these metrics will all be zero.
In ZFS 2.2 and later, there are two additional statistics for both normal and slog ZIL commits:
- zil_itx_metaslab_normal_write counts how many bytes have actually been written in ZIL log write blocks. My understanding is that this includes padding and unused space at the end of a log write block that can't fit another record.
- zil_itx_metaslab_normal_alloc counts how many bytes of space have been 'allocated' for ZIL log write blocks, including any rounding up to block sizes, alignments, and so on. I think this may also be the logical size before any compression done as part of IO, although I'm not sure if ZIL log write blocks are compressed.
You can see some additional commentary on these new stats (and the code) in the pull request and the commit itself.
PS: OpenZFS 2.2 and later has a currently undocumented 'zilstat'
command, and its 'zilstat -v' output may provide some guidance on
what ratios of these metrics the ZFS developers consider interesting.
In its current state it will only work on 2.2 and later because it
requires the two new stats listed above.
Sidebar: Some typical numbers
Here is the "zil" file from my office desktop, which has been up for long enough to make it interesting:
    zil_commit_count                4 13840
    zil_commit_writer_count         4 13836
    zil_itx_count                   4 252953
    zil_itx_indirect_count          4 27663
    zil_itx_indirect_bytes          4 2788726148
    zil_itx_copied_count            4 0
    zil_itx_copied_bytes            4 0
    zil_itx_needcopy_count          4 174881
    zil_itx_needcopy_bytes          4 471605248
    zil_itx_metaslab_normal_count   4 15247
    zil_itx_metaslab_normal_bytes   4 517022712
    zil_itx_metaslab_normal_write   4 555958272
    zil_itx_metaslab_normal_alloc   4 798543872
With these numbers we can see interesting things, such as that the average number of ZIL transactions per commit is about 18 and that my machine has never done any synchronous data writes.
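For what it's worth, this sort of arithmetic is easy to script. The following Python sketch parses 'name type value' lines and works out a few ratios from my desktop's numbers; the function names and the particular ratios are my own choices for illustration, not anything OpenZFS defines, and the last two need the OpenZFS 2.2 'write' and 'alloc' stats.

```python
def parse_kstat(text):
    """Turn 'name type value' kstat lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats

def zil_ratios(stats):
    """Compute some per-commit and space overhead ratios (main pool only)."""
    commits = stats["zil_commit_writer_count"]
    used = stats["zil_itx_metaslab_normal_bytes"]
    return {
        "itxs_per_commit": stats["zil_itx_count"] / commits,
        "lwbs_per_commit": stats["zil_itx_metaslab_normal_count"] / commits,
        "written_per_used": stats["zil_itx_metaslab_normal_write"] / used,
        "alloc_per_used": stats["zil_itx_metaslab_normal_alloc"] / used,
    }

# The relevant lines from my office desktop's "zil" file.
DESKTOP = """\
zil_commit_writer_count 4 13836
zil_itx_count 4 252953
zil_itx_metaslab_normal_count 4 15247
zil_itx_metaslab_normal_bytes 4 517022712
zil_itx_metaslab_normal_write 4 555958272
zil_itx_metaslab_normal_alloc 4 798543872
"""

for name, value in zil_ratios(parse_kstat(DESKTOP)).items():
    print("%-18s %.2f" % (name, value))
```

On a freshly booted machine with no ZIL commits yet, the divisions would fail, so a real version would want to guard against zero counts.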
Here's an excerpt from one of our Ubuntu 22.04 ZFS fileservers:
    zil_commit_count                4 155712298
    zil_commit_writer_count         4 155500611
    zil_itx_count                   4 200060221
    zil_itx_indirect_count          4 60935526
    zil_itx_indirect_bytes          4 7715170189188
    zil_itx_copied_count            4 29870506
    zil_itx_copied_bytes            4 74586588451
    zil_itx_needcopy_count          4 1046737
    zil_itx_needcopy_bytes          4 9042272696
    zil_itx_metaslab_normal_count   4 126916250
    zil_itx_metaslab_normal_bytes   4 136540509568
Here we can see the drastic impact of NFS synchronous writes (the significant 'copied' numbers), and also of large NFS writes in general (the high 'indirect' numbers). This machine has written many times more data in ZIL commits as 'indirect' writes than it has written to the actual ZIL.
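To put rough numbers on 'many times more', here is a small Python sketch using the byte counters from the fileserver excerpt. The function names are mine and this is just one way of slicing the numbers.

```python
# Byte counters copied from the fileserver excerpt.
FILESERVER = {
    "zil_itx_indirect_bytes": 7715170189188,
    "zil_itx_copied_bytes": 74586588451,
    "zil_itx_needcopy_bytes": 9042272696,
    "zil_itx_metaslab_normal_bytes": 136540509568,
}

def data_write_shares(stats):
    """Return each write type's share of the file data in ZIL commits."""
    kinds = ("indirect", "copied", "needcopy")
    sizes = {k: stats["zil_itx_%s_bytes" % k] for k in kinds}
    total = sum(sizes.values())
    if total == 0:
        return {k: 0.0 for k in kinds}
    return {k: sizes[k] / total for k in kinds}

def indirect_to_zil_ratio(stats):
    """How many bytes went out as indirect writes per byte used in the ZIL."""
    return stats["zil_itx_indirect_bytes"] / stats["zil_itx_metaslab_normal_bytes"]

print(data_write_shares(FILESERVER))
print("indirect bytes per ZIL byte: %.1f" % indirect_to_zil_ratio(FILESERVER))
```

The comparison is a bit apples to oranges, since the 'metaslab' bytes also include the itx records themselves, but it gives the general scale.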
2024-02-20
NetworkManager won't share network interfaces, which is a problem
Today I upgraded my home desktop to Fedora 39. It didn't entirely
go well; specifically, my DSL connection broke because Fedora
stopped packaging some scripts with rp-pppoe and Fedora's
old ifup, which is used by my very old-fashioned setup, still requires those scripts. After I got
back on the Internet, I decided to try an idea I'd toyed with,
namely using NetworkManager to handle (only) my DSL link. Unfortunately this did not go well:
audit: op="connection-activate" uuid="[...]" name="[...]" pid=458524 uid=0 result="fail" reason="Connection '[...]' is not available on device em0 because device is strictly unmanaged"
The reason that em0 is 'unmanaged' by NetworkManager is that it's managed by systemd-networkd, which I like much better. Well, also I specifically told NetworkManager not to touch it by setting it as 'unmanaged' instead of 'managed'.
Although I haven't tested, I suspect that NetworkManager applies this restriction to all VPNs and other layered forms of networking, such that you can only run a NetworkManager managed VPN over a network interface that NetworkManager is controlling. I find this quite unfortunate. There is nothing that NetworkManager needs to change on the underlying Ethernet link to run PPPoE or a VPN over it; the network is a transport (a low level transport in the case of PPPoE).
I don't know if it's theoretically possible to configure NetworkManager so that an interface is 'managed' but NetworkManager doesn't touch it at all, so that systemd-networkd and other things could continue to use em0 while NetworkManager was willing to run PPPoE on top of it. Even if it's possible in theory, I don't have much confidence that it will be problem free in practice, either now or in the future, because fundamentally I'd be lying to NetworkManager and networkd. If NetworkManager really had a 'I will use this interface but not change its configuration' category, it would have a third option besides 'managed' or '(strictly) unmanaged'.
(My current solution is a hacked together script to start pppd and pppoe with magic options researched through extrace and a systemd service that runs that script. I have assorted questions about how this is going to interact with various things, but someday I will get answers, or perhaps unpleasant surprises.)
PS: Where this may be a special problem someday is if I want to run a VPN over my DSL link. I can more or less handle running PPPoE by hand, but the last time I looked at a by hand OpenVPN setup I rapidly dropped the idea. NetworkManager is or would be quite handy for this sort of 'not always there and complex' networking, but it apparently needs to own the entire stack down to Ethernet.
(To run a NetworkManager VPN over 'ppp0', I would have to have NetworkManager manage it, which would presumably require I have NetworkManager handle the PPPoE DSL, which requires NetworkManager not considering em0 to be unmanaged. It's NetworkManager all the way down.)
2024-02-13
What is in (Open)ZFS's per-pool "txgs" /proc file on Linux
As part of (Open)ZFS's general 'kstats' system for reporting information about ZFS overall and your individual pools and datasets, there is a per-pool /proc file that reports information about the most recent N transaction groups ('txgs'), /proc/spl/kstat/zfs/<pool>/txgs. How large N is depends on the zfs_txg_history parameter, which defaults to 100. The information in here may be quite important for diagnosing certain sorts of performance problems but I haven't found much documentation on what's in it. Well, let's try to fix that.
The overall format of this file is:
    txg      birth            state ndirty   nread nwritten reads writes otime      qtime wtime stime
    5846176  7976255438836187 C     1736704  0     5799936  0     299    5119983470 2707  49115 27910766
    [...]
    5846274  7976757197601868 C     1064960  0     4702208  0     236    5119973466 2405  48349 134845007
    5846275  7976762317575334 O     0        0     0        0     0      0          0     0     0
(This example is coming from a system with four-way mirrored vdevs, which is going to be relevant in a bit.)
So let's take these fields in order:
- txg is the transaction group number, which is a steadily increasing number. The file is ordered from the oldest txg to the newest, which will be the current open transaction group.

  (In the example, txg 5846275 is the current open transaction group and 5846274 is the last one that committed.)
- birth is the time when the transaction group (txg) was 'born', in nanoseconds since the system booted.
- state is the current state of the txg; this will most often be either 'C' for committed or 'O' for open. You may also see 'S' for syncing, 'Q' (being quiesced), and 'W' (waiting for sync). An open transaction group will most likely have 0s for the rest of the numbers, and will be the last txg (there's only one open txg at a time).

  I originally wrote that any transaction group except the second last will be in state 'C', because you can only have one transaction group in the process of being written out. Update: per the comment from Arnaud Gomes, you can have multiple transaction groups at the end that aren't committed. I believe you can only have one that is syncing ('S'), because that happens in a single thread for only one txg, but you may have another that is quiescing or waiting to sync.

  A transaction group's progress through its life cycle is open, quiescing, waiting for sync, syncing, and finally committed. In the open state, additional transactions (such as writing to files or renaming them) can be added to the transaction group; once a transaction group has been quiesced, nothing further will be added to it.

  (See also ZFS fundamentals: transaction groups, which discusses how a transaction group can take a while to sync; the content has also been added as a comment in the source code in txg.c.)
- ndirty is how many bytes of directly dirty data had to be written out as part of this transaction; these bytes come, for example, from user write() IO.

  It's possible to have a transaction group commit with a '0' for ndirty. I believe that this means no IO happened during the time the transaction group was open, and it's just being closed on the timer.
- nread is how many bytes of disk reads the pool did between when syncing of the txg starts and when it finishes ('during txg sync').
- nwritten is how many bytes of disk writes the pool did during txg sync.
- reads is the number of disk read IOs the pool did during txg sync.
- writes is the number of disk write IOs the pool did during txg sync.

  I believe these IO numbers include at least any extra IO needed to read in on-disk data structures to allocate free space and any additional writes necessary. I also believe that they track actual bytes written to your disks, so for example with two-way mirrors they'll always be at least twice as big as the ndirty number (in my example above, with four way mirrors, their base is four times ndirty).

  As we can see it's not unusual for nread and reads to be zero. However, I don't believe that the read IO numbers are restricted to transaction group commit activities; if something is reading from the pool for other reasons during the transaction group commit, that will show up in nread and reads. They are thus a measure of the amount of read IO going on during the txg sync process, not the amount of IO necessary for it.

  I don't know if ongoing write IO to the ZFS Intent Log can happen during a txg sync. If it can, I would expect it to show up in the nwritten and writes numbers. Unlike read IO, regular write IO can only happen in the context of a transaction group and so by definition any regular writes during a txg sync are part of that txg and show up in ndirty.
- otime is how long the txg was open and accepting new write IO, in nanoseconds. Often this will be around the default zfs_txg_timeout time, which is normally five seconds. However, under (write) IO pressure this can be shorter or longer (if the current open transaction group can't be closed because there's already a transaction group in the process of trying to commit).
- qtime is how long the txg took to be quiesced, in nanoseconds; it's usually small.
- wtime is how long the txg took to wait to start syncing, in nanoseconds; it's usually pretty small, since all it involves is the separate syncing thread picking up the txg and starting to sync it.
- stime is how long the txg took to actually sync and commit, again in nanoseconds. It's often appreciable, since it's where the actual disk write IO happens.
In the example "txgs" I gave, we can see that despite the first committed txg listed having more dirty data than the last committed txg, its actual sync time was only about a fifth of the last txg's sync time. This might cause you to look at underlying IO activity patterns, latency patterns, and so on.
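If you want to do this sort of comparison routinely, a small parser helps. This is a minimal Python sketch that assumes the twelve-column layout shown in the example; the names are mine, and the sample data is just the rows from the example (minus the elided ones).

```python
from collections import namedtuple

FIELDS = ("txg birth state ndirty nread nwritten reads writes "
          "otime qtime wtime stime").split()
Txg = namedtuple("Txg", FIELDS)

def parse_txgs(text):
    """Parse the contents of a "txgs" file into a list of Txg records."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS) or not parts[0].isdigit():
            continue  # skip header lines and anything malformed
        values = [p if name == "state" else int(p)
                  for name, p in zip(FIELDS, parts)]
        rows.append(Txg(*values))
    return rows

SAMPLE = """\
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
5846176 7976255438836187 C 1736704 0 5799936 0 299 5119983470 2707 49115 27910766
5846274 7976757197601868 C 1064960 0 4702208 0 236 5119973466 2405 48349 134845007
5846275 7976762317575334 O 0 0 0 0 0 0 0 0 0
"""

for t in parse_txgs(SAMPLE):
    if t.state == "C":
        # Times are in nanoseconds; report sync time in milliseconds.
        print(t.txg, "dirty %.2f MiB" % (t.ndirty / 2**20),
              "sync %.1f ms" % (t.stime / 1e6))
```

From here it's easy to flag txgs with unusually long sync times relative to their dirty data, or to feed the numbers into whatever metrics system you use.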
As far as I know, there's no per-pool source of information about
the current amount of dirty data in the current open transaction
group (although once a txg has quiesced and is syncing, I believe
you do see a useful ndirty for it in the "txgs" file). A system
wide dirty data number can more or less be approximated from the
ARC memory reclaim statistics in
the anon_size kstat plus the arc_tempreserve kstat, although
the latter seems to never get very big for us.
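A minimal Python sketch of that approximation, assuming both kstats appear as ordinary 'name type value' lines in the text of /proc/spl/kstat/zfs/arcstats. The function name is mine and the sample numbers are invented.

```python
def approx_dirty_bytes(arcstats_text):
    """Approximate system-wide dirty data as anon_size + arc_tempreserve."""
    wanted = {"anon_size", "arc_tempreserve"}
    total = 0
    for line in arcstats_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] in wanted:
            total += int(fields[2])
    return total

# Invented numbers in the arcstats line format, for illustration only.
SAMPLE_ARCSTATS = """\
name type data
hits 4 1000
anon_size 4 4194304
arc_tempreserve 4 8192
"""
print(approx_dirty_bytes(SAMPLE_ARCSTATS))
```

In real use you would read the arcstats file and pass in its contents.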
A new transaction group normally opens as the current transaction
group begins quiescing. We can verify this in the example output
by adding the birth time and the otime of txg 5846274, which add
up to exactly the birth time of txg 5846275, the current open txg.
If this sounds suspiciously exact down to the nanosecond, that's
because the code involved freezes the current time at one point and
uses it for both the end of the open time of the current open txg
and the birth time of the new txg.
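The arithmetic is simple enough to check mechanically. Here is a small Python sketch using the last two rows of the example; the function name is mine.

```python
# (txg, birth, otime) from the last two rows of the example output.
PREV = (5846274, 7976757197601868, 5119973466)
CURRENT = (5846275, 7976762317575334, 0)

def opened_back_to_back(prev, cur):
    """True if 'cur' was born at the instant 'prev' stopped being open."""
    prev_txg, prev_birth, prev_otime = prev
    cur_txg, cur_birth, _ = cur
    return cur_txg == prev_txg + 1 and prev_birth + prev_otime == cur_birth

print(opened_back_to_back(PREV, CURRENT))
print("open for %.3f seconds" % (PREV[2] / 1e9))
```

The second line also shows that this txg was open for just over the usual five seconds.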
Sidebar: the progression through transaction group states
Here is what I can deduce from reading through the OpenZFS kernel code, and since I had to go through this I'm going to write it down.
First, although there is a txg 'birth' state, 'B' in the 'state' column, you will never actually see it. Transaction groups are born 'open', per spa_txg_history_add() in spa_stats.c. Transaction groups move from 'O' open to 'Q' quiescing in txg_quiesce() in txg.c, which 'blocks until all transactions in the group are committed' (which I believe means they are finished fiddling around adding write IO). This function is also where the txg finishes quiescing and moves to 'W', waiting for sync. At this point the txg is handed off to the 'sync thread', txg_sync_thread() (also in txg.c). When the sync thread receives the txg, it will advance the txg to 'S', syncing, call spa_sync(), and then mark everything as done, finally moving the transaction group to 'C', committed.
(In the spa_stats.c code, the txg state is advanced by a call to spa_txg_history_set(), which will always be called with the old state we are finishing. Txgs advance to syncing in spa_txg_history_init_io(), and finish this state to move to committed in spa_txg_history_fini_io(). The tracking of read and write IO during the txg sync is done by saving a copy of the top level vdev IO stats in spa_txg_history_init_io(), getting a second copy in spa_txg_history_fini_io(), and then computing the difference between the two.)
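One way to capture this progression is as a sanity check on the 'state' column of a "txgs" file. The following Python sketch encodes my understanding (one open txg, at most one syncing txg, and older txgs always at least as far along as newer ones); it's not derived from any OpenZFS interface and the function name is made up.

```python
# Life-cycle order of the one-letter codes seen in the 'state' column:
# open, quiescing, waiting for sync, syncing, committed.
LIFE_CYCLE = "OQWSC"

def plausible_states(states):
    """Check a 'state' column, listed from the oldest txg to the newest."""
    ranks = [LIFE_CYCLE.index(s) for s in states]
    # Older txgs must be at least as far along as newer ones.
    in_order = all(a >= b for a, b in zip(ranks, ranks[1:]))
    return (in_order
            and states.count("S") <= 1   # one sync thread, one syncing txg
            and states.count("O") <= 1)  # only one open txg at a time

print(plausible_states("CCCSO"))
print(plausible_states("COC"))
```

An unknown state letter (including 'B', which should never be visible) raises a ValueError rather than being silently accepted.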
Why it might take some visible time to quiesce a transaction group is more or less explained in the description of how ZFS's implementations of virtual filesystem operations work, in the comment at the start of zfs_vnops_os.c. Roughly, each operation (such as creating or renaming a file) starts by obtaining a transaction that will be part of the currently open txg, then doing its work, and then committing the transaction. If the transaction group starts quiescing while the operation is doing its work, the quiescing can't finish until the work does and commits the transaction for the rename, create, or whatever.
2024-02-12
Linux kernel boot messages and seeing if your AMD system has ECC
Consumer x86 desktops have generally not supported ECC memory, at least not if you wanted the 'ECC' bit to actually do anything. With Intel this seems to have been an issue of market segmentation, but things with AMD were more confusing. The initial AMD Ryzen series seemed to generally support ECC in the CPU, but the motherboard support was questionable, and even if your motherboard accepted ECC DIMMs there was an open question of whether the ECC was doing anything on any particular motherboard (cf). Later Ryzens have apparently had an even more confusing ECC support story, but I'm out of touch on that.
When we put together my work desktop we got ECC DIMMs for it and I thought that theoretically the motherboard supported ECC, but I've long wondered if it was actually doing anything. Recently I was looking into this a bit for reasons and ran across Rain's ECC RAM on AMD Ryzen 7000 desktop CPUs, which contained some extremely useful information about how to tell this from your boot messages on AMD systems. I'm going to summarize this and add some extra information I've dug out of things.
Modern desktop CPUs talk to memory themselves, but not quite directly from the main CPU; instead, they have a separate on-die memory controller. On AMD Zen series CPUs, this is the AMD Unified Memory Controller, and there are special interfaces to talk to it. As I understand things, ECC is handled (or not) in the UMC, where it receives the raw bits from your DIMMs (if your DIMMs are wide enough, which you may or may not be able to tell). Therefore, to have ECC support active, you need ECC DIMMs and for ECC to be enabled in your UMC (which I believe is typically controlled by the BIOS, assuming the UMC supports ECC, which depends on the CPU).
In Linux, reporting and managing ECC is handled through a general subsystem called EDAC, with specific hardware drivers. The normal AMD EDAC driver is amd64_edac, and as covered by Rain, it registers for memory channels only if the memory channel has ECC on in the on-die UMC. When this happens, you will see a kernel message to the effect of:
EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
It follows that if you do see this kernel message during boot, you almost certainly have fully supported ECC on your system. It's very likely that your DIMMs are ECC DIMMs, your motherboard supports ECC in the hardware and in its BIOS (and has it enabled in the BIOS if necessary and applicable), and your CPU is willing to do ECC with all of this. Since the above kernel message comes from my office desktop, it seems almost certain that it does indeed fully support ECC, although I don't think I've ever seen any kernel messages about detecting and correcting ECC issues.
You can see more memory channels in larger systems and they're not necessarily sequential; one of our large AMD machines has 'MC0' and 'MC2'. You may also see a message about 'EDAC PCI0: Giving out device to [...]', which is about a different thing.
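If you have a lot of machines to check, you can pick these messages out of saved kernel logs mechanically. This Python sketch assumes the message format matches the example above; the regular expression and function name are mine, and other kernel versions may word things differently.

```python
import re

EDAC_MC = re.compile(
    r"EDAC (MC\d+): Giving out device to module (\S+) "
    r"controller (.+?): DEV (\S+)")

def edac_controllers(log_text):
    """Return (channel, driver, controller, device) for each registered MC."""
    found = []
    for line in log_text.splitlines():
        m = EDAC_MC.search(line)
        if m:
            found.append(m.groups())
    return found

SAMPLE_LOG = """\
EDAC MC: Ver: 3.0.0
EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
EDAC PCI0: Giving out device to [...]
"""
print(edac_controllers(SAMPLE_LOG))
```

The 'PCI0' message and the version banner are deliberately not matched, since they don't say anything about ECC being active.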
In the normal Linux kernel way, various EDAC memory controller information can be found in sysfs under /sys/devices/system/edac/mc (assuming that you have anything registered, which you may not on a non-ECC system). This appears to include counts of corrected errors and uncorrected errors both at the high level of an entire memory controller and at the level of 'rows', 'ranks', and/or 'dimms' depending on the system and the kernel version. You can also see things like the memory EDAC mode, which could be 'SECDED' (what my office desktop reports) or 'S8ECD8ED' (what a large AMD server reports).
(The 'MC<n>' number reported by the kernel at boot time doesn't necessarily match the /sys/devices/system/edac/mc<n> number. We have systems which report 'MC0' and 'MC2' at boot, but have 'mc0' and 'mc1' in sysfs.)
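Reading the per-controller error counts out of sysfs is straightforward. This is a minimal Python sketch that assumes each 'mcN' directory has 'ce_count' and 'ue_count' files (corrected and uncorrected errors); the function name and the 'root' parameter are mine, with 'root' there so the code can be tried against a test directory.

```python
import glob
import os

def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Return {'mcN': {'ce_count': n, 'ue_count': n}} for each controller."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[os.path.basename(mc)] = entry
    return counts
```

On a system with no registered memory controllers this returns an empty dictionary, which is itself a hint that ECC isn't active.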
The Prometheus host agent exposes this EDAC information as metrics, primarily in node_edac_correctable_errors_total and node_edac_uncorrectable_errors_total. We have seen a few corrected errors over time on one particular system.
Sidebar: EDAC on Intel hardware
While there's an Intel memory controller EDAC driver, I don't know if it can get registered even if you don't have ECC support. If it is registered with identified memory controllers, and you can see eg 'SECDED' as the EDAC mode in /sys/devices/system/edac/mc/mcN, then I think you can be relatively confident that you have ECC active on that system. On my home desktop, which definitely doesn't support ECC, what I see on boot for EDAC (with Fedora 38's kernel 6.7.4) is:
    EDAC MC: Ver: 3.0.0
    EDAC ie31200: No ECC support
    EDAC ie31200: No ECC support
As expected there are no 'mcN' subdirectories in /sys/devices/system/edac/mc.
Two Intel servers where I'm pretty certain we have ECC support report, respectively:
EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:64:0a.0 (INTERRUPT)
and
EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (POLLED)
As we can see here, Intel CPUs have more than one EDAC driver, depending on CPU generation and so on. The first EDAC message comes from a system with a Xeon Silver 4108, the second from a system with a Xeon E3-1230 v5.
2024-02-10
My plan for backups of my home machine (as of early 2024)
In theory, what I should do to back up my home desktop is fairly straightforward. I should get one or two USB hard drives of sufficient size, then periodically connect one and do a backup to it (probably using tar, and potentially not compressing the tar archives to make them more recoverable in the face of disk errors). If I'm energetic, I'll have two USB hard drives and periodically rotate one to the office as an offsite backup. Modern USB should be fast enough for this, and hopefully using (fast) USB drives will no longer kill my performance the way it used to. Large HDDs are reasonably affordable, especially if I decide to live with 5400 RPM ones (which I hope run cooler), so I could store multiple full system backups on a single HDD.
In practice this is a lot of things to remember to do on a regular basis, and although I have some of the pieces (and have for years), those pieces have dust on them from disuse. So this approach isn't workable as a way to get routine backups; at best I might manage to do it once every few months. So instead I long ago came up with a plan that is not so much better as more likely to succeed. The short version of the plan is that I will make backups to an additional live HDD in my home desktop.
My home desktop's storage used to be a mirrored pair of SSDs and a mirrored but mismatched pair of HDDs. Back in early 2023, this became all solid state, with a pair of NVMe drives and a pair of SSDs (not the same SSDs, the new pair is much larger). This leaves me with an unused 4 TB HDD, which I actually (still) have in the case. So I can reuse this 4 TB HDD as an always-live backup drive, or what is really 'a second copy' drive. Because the drive will always be there and live, I can automate copies to it, run them from cron, and more or less forget about it (once it's working).
The obvious and most readily automated way to make the backups is to use ZFS snapshots. I'll make a new ZFS pool on the HDD, and then use snapshots with 'zfs send' and 'zfs receive' to move them from the solid state storage to the HDD pool. ZFS's read only snapshots will ensure that I can't accidentally damage the backup copies, and I can scrub the HDD's ZFS pool periodically as insurance against disk corruption. My total space usage in both my current solid state ZFS pools is still a bit under 2 TB, so I should have plenty of space for both on a 4 TB HDD.
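Since I haven't implemented this yet, the following is only a sketch of what one round of copying might look like, written as a Python function that builds the commands without running them. The dataset, pool, and snapshot names are placeholders, and a real script would also need error handling and pruning of old snapshots.

```python
def backup_commands(dataset, target, new_snap, prev_snap=None):
    """Build (but do not run) the commands for one snapshot-based copy.

    Returns three argument lists: take the snapshot, send it (incrementally
    if a previous snapshot is named), and receive it into 'target'.
    """
    new = "%s@%s" % (dataset, new_snap)
    snapshot = ["zfs", "snapshot", new]
    send = ["zfs", "send"]
    if prev_snap is not None:
        # Incremental stream from the previous snapshot to the new one.
        send += ["-i", "%s@%s" % (dataset, prev_snap)]
    send.append(new)
    receive = ["zfs", "receive", target]
    return snapshot, send, receive

# Placeholder names, for illustration only.
for cmd in backup_commands("ssdpool/home", "hddbackup/home",
                           "2024-02-10", prev_snap="2024-02-09"):
    print(" ".join(cmd))
```

The output of the 'zfs send' command would be piped into the 'zfs receive' command, and the first run would have to omit the previous snapshot to do a full send.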
This is obviously imperfect, since various sorts of problems could cost me both the live storage and the HDD, and I could have ZFS problems too. But it's a lot better than nothing, and sometimes the perfect is the enemy of the good.
(Having written this, perhaps I will actually implement it. The current obstacle is that the old HDDs are still running my old LVM setup, as backup for the ZFS pool I created on the new SSDs and then theoretically moved all of the LVM's contents to. So I'd have to hold my breath and tear down those filesystems and the LVM storage first. Destroying even supposedly completely surplus data makes me twitch just a bit, and so far it's been easier to do nothing.)
2024-02-07
What I'd like in a hypothetical new desktop machine in 2024
My current work desktop and home desktop are getting somewhat long in the tooth, which has caused me to periodically think about what I'd want in new hardware for them. Sometimes I even look at potential hardware choices for such a replacement desktop (which can lead to grumbling). Today I want to write down my ideal broad specifications for such a new desktop, what I'd get if I could get it all in one spot for an affordable price.
In addition to all of the expected things (like onboard sound), I'd like:
- 64 GB of RAM instead of my current 32 GB. It would be nice if it
was ECC RAM in a system that genuinely supported it, and it would
also be nice if it was fast, but those two attributes are often in
opposition to each other.
(Today I suspect this means choosing DDR5 over DDR4.)
- Three motherboard M.2 NVMe drive slots. I'd like three because I
currently have a mirrored pair of NVMe drives, and having a third
slot would let me replace one of the live two without having to
pull it outright. Two motherboard M.2 NVMe slots (both operating
at PCIe x4) is probably my minimum these days, and I already have
a PCIe M.2 NVMe card for the current work desktop.
My work desktop has 500 GB NVMe drives currently and I'd like to get bigger ones. My home desktop is fine with its current drives.
- At least four SATA ports and ideally more. My office desktop has
two SSDs and a SATA DVD-RW drive (because we still sometimes use
those), and I want to be able to run three SSDs at once while
replacing one of the two SSDs. Six SATA ports would be better,
so perhaps I should say I can live with four SATA ports but I'd
like six.
(My home desktop will also need three SATA ports on a routine basis with a fourth available for drive replacement, but that's for another entry.)
- At least three 1G Ethernet ports for my work desktop. Since I don't
think there are any reasonable desktop motherboards with this
many Ethernet ports, this needs at least a dual-port PCIe card
and perhaps a quad-port card, which I already have at work. It
also needs a suitable PCIe slot to be free and usable given any
other cards in the machine. My home desktop can get by with one
port but I'd probably like to have two or three there too.
(I wouldn't need that many but Linux's native virtualization works best if you give it its own network port.)
Various desktop motherboards have started offering speeds above 1G (although often not full 10G-T), but our work wiring situation is such that there's no real prospect of taking advantage of that any time soon. Still, if a motherboard comes with '2.5G' or '5G' networking with a chipset that's decent and well supported by Linux, I wouldn't say no.
- At least two DisplayPort and/or HDMI outputs that support at least
4K at 60 Hz, and I'd like more for future-proofing. I would prefer
two DisplayPort outputs to a DisplayPort + HDMI pairing; this is
readily available in GPU cards but not really in motherboards and
integrated graphics. At work I currently have two 27" HiDPI
displays and at home I currently have one; in both locations the
biggest constraint on larger displays or more of them is physical
space.
(I'd love it if we were moving into a bright future of high resolution, high DPI, high refresh rate displays, but I don't think we are, so I don't really expect to want more than dual 4K at 60Hz for the next half decade or more. It's possible this is too pessimistic and there are viable 5K+ monitors that I might want at home in place of my current 27" 4K HiDPI display.)
- Open source friendly graphics, which in practice excludes Nvidia
GPUs (especially if I care about good Wayland support), and
possibly the discrete Intel GPU cards (I'm not sure of their
state). I think anything reasonably modern will support whatever
OpenGL features Wayland needs or is likely to need. The easy way
to get this might well be integrated graphics on a current
generation CPU, assuming I can get the output ports that I want.
On the other hand, the Intel ARC A380 seems to be okay on Linux (from some Internet searches), and while it has a fan it's alleged to be able to operate very quietly. It would give me the multiple DisplayPort outputs and high resolution, high refresh rate support.
- A decent number of both USB-A and USB-C ports. I'd like a reasonable number of USB-A ports because I still have a lot of USB-A things and I'd like not to have a whole collection of USB-A hubs sitting around on either my office or my home desk. But probably more hubs (or larger ones) are in my future.
I'd like it if the machine still supported old fashioned BIOS MBR booting and didn't require (U)EFI booting (I have my reasons), although UEFI booting is probably better on desktop motherboards than it used to be. The UEFI story for people who want booting from mirrored pairs of drives may be better on Fedora than it used to be, since Ubuntu 22.04 has some support for duplicate UEFI boot partitions.
(I'm absolutely not interested in trying to mirror the EFI System Partition behind the back of the UEFI BIOS.)
It would be nice to get a good CPU performance increase from my current desktops, but on the one hand I sort of assume that any decent desktop CPU today is going to be visibly better than something from more than five years ago, and on the other hand I'm not sure how noticeable the performance improvement is these days, and on the third hand I've been wrong before. If my current (five year old) desktops have reached the point where CPU performance mostly doesn't matter to me, then I'd probably prefer to get a midrange CPU with decent thermal performance and perhaps no funny slow 'efficiency' cores that can give you and Linux's kernel CPU scheduling various sorts of heartburn. On the other hand, my Firefox build times keep getting slower and slower, so I suspect that the world of software just assumes current CPUs and current good performance.
PS: I have no plans to do GPU computation on my desktops, for a variety of reasons including that I don't want to deal with Nvidia GPUs in my machines. If I need to do GPU stuff for work, our SLURM cluster has GPUs, and I don't have to care how much power they use, how noisy they are, and how much heat they put out because they're in the machine room (and I'm not).
2024-02-06
What the max_connect Linux NFS v4 mount parameter seems to do
Suppose, not hypothetically, that you've converted your fleet from using NFS v3 to using basic Unix security NFS v4 mounts when they mount their hordes of NFS filesystems from your NFS fileservers. When your NFS clients boot or at some other times, you notice that you're getting a bunch of copies of a new kernel message:
SUNRPC: reached max allowed number (1) did not add transport to server: <IP address>
Modern NFS uses TCP, which means that the NFS client needs to make some number of TCP connections to each NFS server. In NFS v3, Linux normally only makes one connection to each server. The same is sort of true in NFS v4 as well, but NFS v4 is more complex about what is 'a server'. In NFS v3, servers are identified by at least their IP address (and perhaps their name; I'm not sure if two different names that map to the same IP will share the same connection). In NFS v4.1+, servers have some sort of intrinsic identity that is visible to clients even if you're talking to them by multiple IP addresses.
This new 'reached max allowed number (<N>) did not add transport to server' kernel message is reporting about this case. You (we) have a single NFS server that for historical reasons has two different IPs, one for most of its filesystems and one for our central administrative filesystem, and now NFS v4 considers these the 'same' server and won't make an extra connection to the second IP.
You might wonder if you can change this, and the answer is that you can but it gets complex and I'm not quite sure how it all works to distribute the actual NFS traffic. There appear to be two interlinked things that you can control: how many connections an NFS v4 client will make to a single NFS server, and how many different IPs of the server that the NFS v4 client will connect to. How many connections NFS v4 will make to a single server is mostly controlled by nfs(5)'s 'nconnect' setting, sort of like nconnect's behavior with NFS v3. How many connections NFS v4 will make to separate server IPs is controlled by 'max_connect'. Both of these default to 1. However, how they interact is confusing and I'm not sure I fully understand it.
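To make the two settings concrete, here is a small Python sketch that assembles an NFS mount option string with them. The helper name 'nfs4_mount_options' is invented for this example; the 1 to 16 range check reflects the limit that nfs(5) documents for both options, and 'vers=4.2' in the usage below is just one plausible companion option.

```python
from typing import Iterable


def nfs4_mount_options(nconnect: int = 1, max_connect: int = 1,
                       extra: Iterable[str] = ()) -> str:
    """Return a comma-separated mount option string.

    Options left at their default of 1 are omitted from the result.
    """
    for name, value in (("nconnect", nconnect), ("max_connect", max_connect)):
        # nfs(5) documents an upper limit of 16 for both options.
        if not 1 <= value <= 16:
            raise ValueError(f"{name} must be between 1 and 16, got {value}")
    opts = list(extra)
    if nconnect != 1:
        opts.append(f"nconnect={nconnect}")
    if max_connect != 1:
        opts.append(f"max_connect={max_connect}")
    return ",".join(opts)
```

For a fileserver with two IPs, something like nfs4_mount_options(max_connect=2, extra=["vers=4.2"]) gives an option string that could go after 'mount -o' or in the options field of /etc/fstab.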
The easy case is not setting nconnect and setting max_connect to at least as many different IP aliases as you have for each fileserver. In this case you'll get one TCP connection per server IP (although don't ask me what traffic flows over what connection). If you set nconnect without max_connect, you'll get that many connections to the first IP address of each server (well, the first IP address that the client finds), assuming that you mount at least that many NFS filesystems from that server.
However, if you set both nconnect and max_connect, what seems to happen (on Ubuntu 22.04) is that you get nconnect TCP connections to each server's first (encountered) IP address, and then one TCP connection to every other IP address (up to the max_connect limit). This is why I described 'nconnect' as controlling how many connections NFS v4 would make to a single server, instead of a single server IP (or name). It would be a bit more useful if you could set nconnect on a per-IP (or name) basis in NFS v4, or otherwise make it so that the first IP didn't get all of the connections.
(This is apparently called 'trunking' in NFS v4, per RFC 5661 section 2.10.5.)
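The behavior described above can be summarized as a toy Python model. This only encodes what I seemed to observe on Ubuntu 22.04, not the kernel's actual logic, and it assumes that max_connect counts the first server IP as well (which fits the default of 1 producing the 'reached max allowed number (1)' message). The function name is made up for the example.

```python
from typing import List


def connections_per_ip(server_ips: int, nconnect: int = 1,
                       max_connect: int = 1) -> List[int]:
    """Return the apparent TCP connection count for each server IP.

    Index 0 is the first IP the client encounters. This assumes enough
    filesystems are mounted for all nconnect connections to be opened.
    """
    # Only the first max_connect IPs get any connection at all.
    used = min(server_ips, max_connect)
    counts = [0] * server_ips
    if used >= 1:
        # The first IP gets all of the nconnect connections.
        counts[0] = nconnect
    for i in range(1, used):
        # Every additional IP gets a single connection.
        counts[i] = 1
    return counts
```

For our two-IP fileserver, the defaults give [1, 0] (and the kernel message), max_connect=2 gives [1, 1], and nconnect=4 with max_connect=2 gives [4, 1].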