Wandering Thoughts

2024-02-21

What ZIL metrics are exposed by (Open)ZFS on Linux

The ZFS Intent Log (ZIL) is effectively ZFS's version of a filesystem journal, writing out hopefully brief records of filesystem activity to make them durable on disk before their full version is committed to the ZFS pool. What the ZIL is doing and how it's performing can be important for the latency (and thus responsiveness) of various operations on a ZFS filesystem, since operations like fsync() on an important file must wait for the ZIL to write out (commit) their information before they can return from the kernel. On Linux, OpenZFS exposes global information about the ZIL in /proc/spl/kstat/zfs/zil, but this information can be hard to interpret without some knowledge of ZIL internals.

(In OpenZFS 2.2 and later, each dataset also has per-dataset ZIL information in its kstat file, /proc/spl/kstat/zfs/<pool>/objset-0xXXX, for some hexadecimal '0xXXX'. There's no overall per-pool ZIL information the way there is a global one, but for most purposes you can sum up the ZIL information from all of the pool's datasets.)
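
As a concrete illustration, here is a rough sketch of summing those per-dataset ZIL counters into pool-wide numbers on OpenZFS 2.2 and later. The pool name is a stand-in, and it assumes the usual 'name type data' layout of kstat lines:

package main

import (
    "bufio"
    "fmt"
    "os"
    "path/filepath"
    "strconv"
    "strings"
)

func main() {
    pool := "tank" // stand-in pool name
    files, _ := filepath.Glob("/proc/spl/kstat/zfs/" + pool + "/objset-0x*")
    totals := make(map[string]uint64)
    for _, path := range files {
        f, err := os.Open(path)
        if err != nil {
            continue
        }
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            // kstat rows look like 'zil_commit_count  4  13840'
            fld := strings.Fields(sc.Text())
            if len(fld) != 3 || !strings.HasPrefix(fld[0], "zil_") {
                continue
            }
            if v, err := strconv.ParseUint(fld[2], 10, 64); err == nil {
                totals[fld[0]] += v
            }
        }
        f.Close()
    }
    for name, total := range totals {
        fmt.Printf("%-32s %d\n", name, total)
    }
}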

The basic background here is the flow of activity in the ZIL and also the comments in zil.h about the members of the zil_stats struct.

The (ZIL) data you can find in the "zil" file (and the per-dataset kstats in OpenZFS 2.2 and later) is as follows:

  • zil_commit_count counts how many times a ZIL commit has been requested through things like fsync().
  • zil_commit_writer_count counts how many times the ZIL has actually committed. More than one commit request can be merged into the same ZIL commit, if two people fsync() more or less at the same time.

  • zil_itx_count counts how many intent transactions (itxs) have been written as part of ZIL commits. Each separate operation (such as a write() or a file rename) gets its own separate transaction; these are aggregated together into log write blocks (lwbs) when a ZIL commit happens.

When ZFS needs to record file data into the ZIL, it has three options, which it calls 'indirect', 'copied', and 'needcopy' in ZIL metrics. Large enough amounts of file data are handled with an indirect write, which writes the data to its final location in the regular pool; the ZIL transaction only records its location, hence 'indirect'. In a copied write, the data is directly and immediately put in the ZIL transaction (itx), even before it's part of a ZIL commit; this is done if ZFS knows that the data is being written synchronously and it's not large enough to trigger an indirect write. In a needcopy write, the data just hangs around in RAM as part of ZFS's regular dirty data, and if a ZIL commit happens that needs that data, the process of adding its itx to the log write block will fetch the data from RAM and add it to the itx (or at least the lwb).

There are ZIL metrics about this:

  • zil_itx_indirect_count and zil_itx_indirect_bytes count how many indirect writes have been part of ZIL commits, and the total size of the indirect writes of file data (not of the 'itx' records themselves, per the comments in zil.h).

    Since these are indirect writes, the data written is not part of the ZIL (it's regular data blocks), although it is put on disk as part of a ZIL commit. However, unlike other ZIL data, the data written here would have been written even without a ZIL commit, as part of ZFS's regular transaction group commit process. A ZIL commit merely writes it out earlier than it otherwise would have been.

  • zil_itx_copied_count and zil_itx_copied_bytes count how many 'copied' writes have been part of ZIL commits and the total size of the file data written (and thus committed) this way.

  • zil_itx_needcopy_count and zil_itx_needcopy_bytes count how many 'needcopy' writes have been part of ZIL commits and the total size of the file data written (and thus committed) this way.

A regular system using ZFS may have little or no 'copied' activity. Our NFS servers all have significant amounts of it, presumably because some NFS data writes are done synchronously and so this trickles through to the ZFS stats.

In a given pool, the ZIL can potentially be written to either the main pool's disks or to a separate log device (a slog, which can also be mirrored). The ZIL metrics have a collection of zil_itx_metaslab_* metrics about data actually written to the ZIL in either the main pool ('normal' metrics) or to a slog (the 'slog' metrics).

  • zil_itx_metaslab_normal_count counts how many ZIL log write blocks (not ZIL records, itxs) have been committed to the ZIL in the main pool. There's a corresponding 'slog' version of this and all further zil_itx_metaslab metrics, with the same meaning.

  • zil_itx_metaslab_normal_bytes counts how many bytes have been 'used' in ZIL log write blocks (for ZIL commits in the main pool). This is a rough representation of how much space the ZIL log actually needed, but it doesn't necessarily represent either the actual IO performed or the space allocated for ZIL commits.

    As I understand things, this size includes the size of the intent transaction records themselves and also the size of the associated data for 'copied' and 'needcopy' data writes (because these are written into the ZIL as part of ZIL commits, and so use space in log write blocks). It doesn't include the data written directly to the pool as 'indirect' data writes.

If you don't use a slog in any of your pools, the 'slog' versions of these metrics will all be zero. I think that if all of your pools have slogs, the 'normal' versions of these metrics will all be zero.

In ZFS 2.2 and later, there are two additional statistics for both normal and slog ZIL commits:

  • zil_itx_metaslab_normal_write counts how many bytes have actually been written in ZIL log write blocks. My understanding is that this includes padding and unused space at the end of a log write block that can't fit another record.

  • zil_itx_metaslab_normal_alloc counts how many bytes of space have been 'allocated' for ZIL log write blocks, including any rounding up to block sizes, alignments, and so on. I think this may also be the logical size before any compression done as part of IO, although I'm not sure if ZIL log write blocks are compressed.

You can see some additional commentary on these new stats (and the code) in the pull request and the commit itself.

PS: OpenZFS 2.2 and later has a currently undocumented 'zilstat' command, and its 'zilstat -v' output may provide some guidance on what ratios of these metrics the ZFS developers consider interesting. In its current state it will only work on 2.2 and later because it requires the two new stats listed above.

Sidebar: Some typical numbers

Here is the "zil" file from my office desktop, which has been up for long enough to make it interesting:

zil_commit_count                4    13840
zil_commit_writer_count         4    13836
zil_itx_count                   4    252953
zil_itx_indirect_count          4    27663
zil_itx_indirect_bytes          4    2788726148
zil_itx_copied_count            4    0
zil_itx_copied_bytes            4    0
zil_itx_needcopy_count          4    174881
zil_itx_needcopy_bytes          4    471605248
zil_itx_metaslab_normal_count   4    15247
zil_itx_metaslab_normal_bytes   4    517022712
zil_itx_metaslab_normal_write   4    555958272
zil_itx_metaslab_normal_alloc   4    798543872

With these numbers we can see interesting things, such as that the average number of ZIL transactions per commit is about 18 and that my machine has never done any synchronous data writes.
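
If you want to reproduce those sorts of derived figures, the arithmetic is straightforward; here's a quick back-of-the-envelope version using the numbers above (nothing official, just division):

package main

import "fmt"

func main() {
    // Numbers from the 'zil' kstats shown above.
    fmt.Printf("itxs per ZIL commit: %.1f\n", 252953.0/13836.0)         // ~18.3
    fmt.Printf("bytes used per lwb:  %.0f\n", 517022712.0/15247.0)      // ~33910
    fmt.Printf("written vs used:     %.2fx\n", 555958272.0/517022712.0) // ~1.08x
    fmt.Printf("allocated vs used:   %.2fx\n", 798543872.0/517022712.0) // ~1.54x
}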

Here's an excerpt from one of our Ubuntu 22.04 ZFS fileservers:

zil_commit_count                4    155712298
zil_commit_writer_count         4    155500611
zil_itx_count                   4    200060221
zil_itx_indirect_count          4    60935526
zil_itx_indirect_bytes          4    7715170189188
zil_itx_copied_count            4    29870506
zil_itx_copied_bytes            4    74586588451
zil_itx_needcopy_count          4    1046737
zil_itx_needcopy_bytes          4    9042272696
zil_itx_metaslab_normal_count   4    126916250
zil_itx_metaslab_normal_bytes   4    136540509568

Here we can see the drastic impact of NFS synchronous writes (the significant 'copied' numbers), and also of large NFS writes in general (the high 'indirect' numbers). This machine has written many times more data in ZIL commits as 'indirect' writes than it has written to the actual ZIL.

linux/ZFSGlobalZILInformation written at 23:44:14; Add Comment

2024-02-20

NetworkManager won't share network interfaces, which is a problem

Today I upgraded my home desktop to Fedora 39. It didn't entirely go well; specifically, my DSL connection broke because Fedora stopped packaging some scripts with rp-pppoe, and Fedora's old ifup (which my very old-fashioned setup still uses) requires those scripts. After I got back on the Internet, I decided to try an idea I'd toyed with, namely using NetworkManager to handle (only) my DSL link. Unfortunately this did not go well:

audit: op="connection-activate" uuid="[...]" name="[...]" pid=458524 uid=0 result="fail" reason="Connection '[...]' is not available on device em0 because device is strictly unmanaged"

The reason that em0 is 'unmanaged' by NetworkManager is that it's managed by systemd-networkd, which I like much better. Well, also I specifically told NetworkManager not to touch it by setting it as 'unmanaged' instead of 'managed'.

Although I haven't tested, I suspect that NetworkManager applies this restriction to all VPNs and other layered forms of networking, such that you can only run a NetworkManager managed VPN over a network interface that NetworkManager is controlling. I find this quite unfortunate. There is nothing that NetworkManager needs to change on the underlying Ethernet link to run PPPoE or a VPN over it; the network is a transport (a low level transport in the case of PPPoE).

I don't know if it's theoretically possible to configure NetworkManager so that an interface is 'managed' but NetworkManager doesn't touch it at all, so that systemd-networkd and other things could continue to use em0 while NetworkManager was willing to run PPPoE on top of it. Even if it's possible in theory, I don't have much confidence that it will be problem free in practice, either now or in the future, because fundamentally I'd be lying to NetworkManager and networkd. If NetworkManager really had an 'I will use this interface but not change its configuration' category, it would have a third option besides 'managed' or '(strictly) unmanaged'.

(My current solution is a hacked together script to start pppd and pppoe with magic options researched through extrace and a systemd service that runs that script. I have assorted questions about how this is going to interact with various things, but someday I will get answers, or perhaps unpleasant surprises.)

PS: Where this may be a special problem someday is if I want to run a VPN over my DSL link. I can more or less handle running PPPoE by hand, but the last time I looked at a by hand OpenVPN setup I rapidly dropped the idea. NetworkManager is or would be quite handy for this sort of 'not always there and complex' networking, but it apparently needs to own the entire stack down to Ethernet.

(To run a NetworkManager VPN over 'ppp0', I would have to have NetworkManager manage it, which would presumably require I have NetworkManager handle the PPPoE DSL, which requires NetworkManager not considering em0 to be unmanaged. It's NetworkManager all the way down.)

linux/NetworkManagerDoesNotShare written at 22:55:01; Add Comment

2024-02-19

The flow of activity in the ZFS Intent Log (as I understand it)

The ZFS Intent Log (ZIL) is a confusing thing once you get into the details, and for reasons beyond the scope of this entry I recently needed to sort out the details of some aspects of how it works. So here is what I know about how things flow into the ZIL, both in memory and then on to disk.

(As always, there is no single 'ZFS Intent Log' in a ZFS pool. Each dataset (a filesystem or a zvol) has its own logically separate ZIL. We talk about 'the ZIL' as a convenience.)

When you perform activities that modify a ZFS dataset, each activity creates its own ZIL log record (a transaction in ZIL jargon, sometimes called an 'itx', probably short for 'intent transaction') that is put into that dataset's in-memory ZIL log. This includes both straightforward data writes and metadata activity like creating or renaming files. You can see a big list of all of the possible transaction types in zil.h as all of the TX_* definitions (which have brief useful comments). In-memory ZIL transactions aren't necessarily immediately flushed to disk, especially for things like simply doing a write() to a file. The reason that plain write()s to a file are (still) given ZIL transactions is that you may call fsync() on the file later. If you don't call fsync() and the regular ZFS transaction group commits with your write()s, those ZIL transactions will be quietly cleaned out of the in-memory ZIL log (along with all of the other now unneeded ZIL transactions).

(All of this assumes that your dataset doesn't have 'sync=disabled' set, which turns off the in-memory ZIL as one of its effects.)
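
To make the application side of this concrete, here is a minimal Go sketch (the file path is made up, and this assumes it lives on a ZFS dataset with normal sync settings). The write() only creates in-memory ZIL transactions and dirty data; it's the fsync(), f.Sync() in Go, that asks for durability:

package main

import (
    "log"
    "os"
)

func main() {
    // A made-up path; assume it's on a ZFS dataset.
    f, err := os.Create("/tank/scratch/example.dat")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // This write gets ZIL transactions, but nothing is durable yet.
    if _, err := f.Write([]byte("some important data\n")); err != nil {
        log.Fatal(err)
    }
    // fsync(): this is what requests a ZIL commit and waits for it to finish.
    if err := f.Sync(); err != nil {
        log.Fatal(err)
    }
}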

When you perform an action such as fsync() or sync() that requests that in-memory ZFS state be made durable on disk, ZFS gathers up some or all of those in-memory ZIL transactions and writes them to disk in one go, as a sequence of log (write) blocks ('lwb' or 'lwbs' in ZFS source code), which pack together those ZIL transaction records. This is called a ZIL commit. Depending on various factors, the flushed out data you write() may or may not be included in the log (write) blocks committed to the (dataset's) ZIL. Sometimes your file data will be written directly into its future permanent location in the pool's free space (which is safe) and the ZIL commit will have only a pointer to this location (its DVA).

(For a discussion of this, see the comments about the WR_* constants in zil.h. Also, while in memory, ZFS transactions are classified as either 'synchronous' or 'asynchronous'. Sync transactions are always part of a ZIL commit, but async transactions are only included as necessary. See zil_impl.h and also my entry discussing this.)

It's possible for several processes (or threads) to all call sync() or fsync() at once (well, before the first one finishes committing the ZIL). In this case, their requests can all be merged together into one ZIL commit that covers all of them. This means that fsync() and sync() calls don't necessarily match up one to one with ZIL commits. I believe it's also possible for a fsync() or sync() to not result in a ZIL commit if all of the relevant data has already been written out as part of a regular ZFS transaction group (or a previous request).

Because of all of this, there are various different ZIL related metrics that you may be interested in, sometimes with picky but important differences between them. For example, there is a difference between 'the number of bytes written to the ZIL' and 'the number of bytes written as part of ZIL commits', since the latter would include data written directly to its final space in the main pool. You might care about the latter when you're investigating the overall IO impact of ZIL commits but the former if you're looking at sizing a separate log device (a 'slog' in ZFS terminology).

solaris/ZFSZILActivityFlow written at 21:58:13; Add Comment

2024-02-18

Even big websites may still be manually managing TLS certificates (or close)

I've written before about how people's soon to expire TLS certificates aren't necessarily a problem, because not everyone manages their TLS certificates the way Let's Encrypt encourages, with '30 days in advance' automated renewal and perhaps short-lived TLS certificates. For example, some places (like Facebook) have automation but seem to only deploy TLS certificates that are quite close to expiry. Other places at least look as if they're still doing things by hand, and recently I got to watch an example of that.

As I mentioned yesterday, the department outsources its public website to a SaaS CMS provider. While the website has a name here for obvious reasons, it uses various assets that are hosted on sites under the SaaS provider's domain names (both assets that are probably general and assets, like images, that are definitely specific to us). For reasons beyond the scope of this entry, we monitor the reachability of these additional domain names with our metrics system. This only checks on-campus reachability, of course, but that's still important even if most visitors to the site are probably from outside the university.

As a side effect of this reachability monitoring, we harvest the TLS certificate expiry times of these domains, and because we haven't done anything special about it, they get shown on our core status dashboard alongside the expiry times of TLS certificates that we're actually responsible for. The result of this was that recently I got to watch their TLS expiry times count down to only two weeks away, which is lots of time from one view while also alarmingly little if you're used to renewals 30 days in advance. Then they flipped over to a new year-long TLS certificate and our dashboard was quiet again (except for the next such external site that has dropped under 30 days).
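
For what it's worth, harvesting a certificate's expiry time doesn't take much. Here's a rough sketch of the idea; the host name is a placeholder and real monitoring does this more carefully:

package main

import (
    "crypto/tls"
    "fmt"
    "log"
    "time"
)

func main() {
    host := "www.example.org" // placeholder host name
    conn, err := tls.Dial("tcp", host+":443", &tls.Config{ServerName: host})
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    cert := conn.ConnectionState().PeerCertificates[0]
    left := time.Until(cert.NotAfter)
    fmt.Printf("%s: not before %s, not after %s (%.0f days left)\n",
        host, cert.NotBefore.Format("2006-01-02"),
        cert.NotAfter.Format("2006-01-02"), left.Hours()/24)
}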

Interestingly, the current TLS certificate was issued about a week before it was deployed, or at least its Not-Before date is February 9th at 00:00 UTC and it seems to have been put into use this past Friday, the 16th. One reason for this delay in deployment is suggested by our monitoring, which seems to have detected traces of a third certificate sometimes being visible, this one expiring June 23rd, 2024. Perhaps there were some deployment challenges across the SaaS provider's fleet of web servers.

(Their current TLS certificate is actually good for just a bit over a year, with a Not-Before of 2024-02-09 and a Not-After of 2025-02-28. This is presumably accepted by browsers, even though it's a bit over 365 days; I haven't paid attention to the latest restrictions from places like Apple.)

web/TLSCertsSomeStillManual written at 22:06:08; Add Comment

2024-02-17

We outsource our public web presence and that's fine

I work for a pretty large Computer Science department, one where we have the expertise and need to do a bunch of internal development and in general we maintain plenty of things, including websites. Thus, it may surprise some people to learn that the department's public-focused web site is currently hosted externally on a SaaS provider. Even the previous generation of our outside-facing web presence was hosted and managed outside of the department. To some, this might seem like the wrong decision for a department of Computer Science (of all people) to make; surely we're capable of operating our own web presence and thus should as a matter of principle (and independence).

Well, yes and no. There are two realities. The first is that a modern content management system is both a complex thing (to develop, and generally to operate and maintain securely) and a commodity, with many organizations able to provide good ones at competitive prices. The second is that both the system administration and the publicity side of the department only have so many people and so much time. Or, to put it another way, all of us have work to get done.

The department has no particular 'competitive advantage' in running a CMS website; in fact, we're almost certain to be worse at it than someone doing it at scale commercially, much like what happened with webmail. If the department decided to operate its own CMS anyway, it would be as a matter of principle (which principles would depend on whether the CMS was free or paid for). So far, the department has not decided that this particular principle is worth paying for, both in direct costs and in the opportunity costs of what that money and staff time could otherwise be used for.

Personally I agree with that decision. As mentioned, CMSes are a widely available (but specialized) commodity. Were we to do it ourselves, we wouldn't be, say, making a gesture of principle against the centralization of CMSes. We would merely be another CMS operator in an already crowded pond that has many options.

(And people here do operate plenty of websites and web content on our own resources. It's just that the group here responsible for our public web presence found it most effective and efficient to use a SaaS provider for this particular job.)

web/OutsourcedWebCMSSensible written at 21:39:20; Add Comment

2024-02-16

Options for genuine ECC RAM on the desktop in (early) 2024

A traditional irritation with building (or specifying) desktop computers is the issue of ECC RAM, which for a long time was either not supported at all or was being used by Intel for market segmentation. First generation AMD Ryzens sort of supported ECC RAM with the right motherboard, but there are many meanings of 'supporting' ECC RAM and questions lingered about how meaningful the support was (recent information suggests the support was real). Here in early 2024 the situation is somewhat better and I'm going to summarize what I know so far.

The traditional route to getting ECC RAM support (along with a bunch of other things) was to buy a 'workstation' motherboard that was built to support Intel Xeon processors. These were available from a modest number of vendors, such as SuperMicro, and were generally not inexpensive (and then you had to buy the Xeon). If you wanted a pre-built solution, vendors like Dell would sell you desktop Xeon-based workstation systems with ECC RAM. You can still do this today.

Update: I forgot AMD Threadripper and Epyc based systems, which you can get motherboards for and build desktop systems around. I think these are generally fairly expensive motherboards, though.

Back in 2022, Intel introduced their W680 desktop chipset. One of the features of this chipset is that it officially supported ECC RAM with 12th generation and later (so far) Intel CPUs (or at least apparently the non-F versions), along with official support for memory overclocking (and CPU overclocking), which enables faster 'XMP' memory profiles than the stock ones (should your ECC RAM actually support this). There are a modest number of W680 based motherboards available from (some of) the usual x86 PC desktop motherboard makers (and SuperMicro), but they are definitely priced at the high end of things. Intel has not yet announced a 'Raptor Lake' chipset version of this, which would presumably be called the 'W780'. At this date I suspect there will be no such chipset.

(The Intel W680 chipset was brought to my attention by Brendan Shanks on the Fediverse.)

As mentioned, AMD support for ECC on early generation Ryzens was a bit lackluster, although it was sort of there. With the current Socket AM5 and Zen 4, a lot of mentions of ECC seem to have (initially) been omitted from documentation, as discussed in Rain's ECC RAM on AMD Ryzen 7000 desktop CPUs, and Ryzen 8000G series APUs don't support ECC at all. However, at least some AM5 motherboards do support ECC, provided that you have a recent enough BIOS and enable ECC support in it (per Rain). These days, it appears that a number of current AM5 motherboards list ECC memory as supported (although what 'supported' means is a question) and it will probably work, especially if you find people who have already reported success. It seems that even some relatively inexpensive AM5 motherboards may support ECC.

(Some un-vetted resources are here and here.)

If you can navigate the challenges of finding a good motherboard, it looks like an AM5, Ryzen 7000 system will support ECC at a lower cost than an Intel W680 based system (or an Intel Xeon one). If you don't want to try to thread those rapids and can stand Intel CPUs, a W680 based system will presumably work, and a Xeon based system would be even easier to purchase as a fully built desktop with ECC.

(Whether ECC makes a meaningful difference that's worth paying for is a bit of an open question.)

tech/DesktopECCOptions2024 written at 23:52:09; Add Comment

2024-02-15

(Some) X window managers deliberately use off-screen windows

I mentioned recently that the X Window System allows you to position (X) windows so that they're partially or completely off the screen (when I wrote about how I accidentally put some icons off screen). Some window managers, such as fvwm, actually make significant use of this X capability.

To start with, windows can be off screen in any direction, because X permits negative coordinates for window locations (both horizontally and vertically). Since the top left of the screen is 0, 0 in the coordinate system, windows with a negative X are often said to be off screen to the left, and ones with a negative Y are off screen 'above', to go with a large enough positive X being 'to the right' and a positive Y being 'below'. If a window is completely off the screen, its relative location is in some sense immaterial, but this makes it easier to talk about some other things.

(Windows can also be partially off screen, in which case it does matter that negative Y is 'above' and negative X is 'left', because the bottom or the right part of such a window is what will be visible on screen.)

Fvwm has a concept of a 'virtual desktop' that can be larger than your physical display (or displays added together), normally expressed in units of your normal monitor configuration; for example, my virtual desktop is three wide by two high, creating six of what Fvwm calls pages. Fvwm calls the portion of the virtual desktop that you can see the viewport, and many people (me included) keep the viewport aligned with pages. You can then talk about things like flipping between pages, which is technically moving the viewport to or between pages.

When you change pages or in general move the viewport, Fvwm changes the X position of windows so that they are in the right (pixel) spot relative to the new page. For instance, if you have a 1280 pixel wide display and a window positioned with its left edge at 0, and then you move one Fvwm page to your right, Fvwm changes the window's X coordinate to be -1280. If you want, you can then use X tools or other means to move the window around on its old page, and when you flip back to the page Fvwm will respect that new location. If you move the window to be 200 pixels away from the left edge, making its X position -1080, when you change back to that page Fvwm will put the window's left edge at an X position of 200 pixels.

This is an elegant way to avoid having to keep track of the nominal position of off-screen windows; you just have X do it for you. If you have a 1280 x 1024 display and you move one page to the left, you merely add 1280 pixels to the X position of the (X) windows being displayed. Windows on the old page will now be off screen, while windows on the new page will come back on screen.
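
Here's a toy illustration of that bookkeeping (not fvwm's actual code, just the same arithmetic with the numbers from the example above):

package main

import "fmt"

func main() {
    const screenWidth = 1280 // pixels, as in the example above
    windowX := 0             // window at the left edge of the current page

    // Move the viewport one page to the right: every window's X coordinate
    // drops by one screen width, so this window is now off screen to the left.
    windowX -= screenWidth
    fmt.Println(windowX) // -1280

    // Move the (off screen) window to 200 pixels from its page's left edge,
    // then flip back to that page: add a screen width again.
    windowX = -1080
    windowX += screenWidth
    fmt.Println(windowX) // 200
}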

I think most X desktop environments and window managers have moved away from this simple and brute force approach to handle windows that are off screen because you've moved your virtual screen or workspace or whatever the environment's term is. I did a quick test in Cinnamon, and it didn't seem to change window positions this way.

(There are other ways in X to make windows disappear and reappear, so Cinnamon is using one of them.)

unix/XOffscreenWindowsUse written at 22:51:41; Add Comment

2024-02-14

Understanding a recent optimization to Go's reflect.TypeFor

Go's reflect.TypeFor() is a generic function that returns the reflect.Type for its type argument. It was added in Go 1.22, and its initial implementation was quite simple but still valuable, because it encapsulated a complicated bit of reflect usage. Here is that implementation:

func TypeFor[T any]() Type {
  return TypeOf((*T)(nil)).Elem()
}

How this works is that it constructs a nil pointer value of the type 'pointer to T', gets the reflect.Type of that pointer, and then uses Type.Elem() to go from the pointer's Type to the Type for T itself. This requires constructing and using this 'pointer to T' type (and its reflect.Type) even though we only want the reflect.Type of T itself. All of this is necessary for reasons to do with interface types.

Recently, reflect.TypeFor() was optimized a bit, in CL 555597, "optimize TypeFor for non-interface types". The code for this optimization is a bit tricky and I had to stare at it for a while to understand what it was doing and how it worked. Here is the new version, which starts with the new optimization and ends with the old code:

func TypeFor[T any]() Type {
  var v T
  if t := TypeOf(v); t != nil {
     return t
  }
  return TypeOf((*T)(nil)).Elem()
}

What this does is optimize for the case where you're using TypeFor() on a non-interface type, for example 'reflect.TypeFor[int64]()' (although you're more likely to use this with more complex things like struct types). When T is a non-interface type, we don't need to construct a pointer to a value of the type; we can directly obtain the Type from reflect.TypeOf. But how do we tell whether or not T is an interface type? The answer turns out to be right there in the documentation for reflect.TypeOf:

[...] If [TypeOf's argument] is a nil interface value, TypeOf returns nil.

So what the new code does is construct a zero value of type T, pass it to TypeOf(), and check what it gets back. If type T is an interface type, its zero value is a nil interface and TypeOf() will return nil; otherwise, the return value is the reflect.Type of the non-interface type T.

The reason that reflect.TypeOf returns nil for a nil interface value is because it has to. In Go, nil is only sort of typed, so if a nil interface value is passed to TypeOf(), there is effectively no type information available for it; its old interface type is lost when it was converted to 'any', also known as the empty interface. So all TypeOf() can return for such a value is the nil result of 'this effectively has no useful type information'.

Incidentally, the TypeFor() code is also another illustration of how in Go, interfaces create a difference between two sorts of nils. Consider calling 'reflect.TypeFor[*os.File]()'. Since this is a pointer type, the zero value 'v' in TypeFor() is a nil pointer. But *os.File isn't an interface type, so TypeOf() won't be passed a nil interface and can return a Type, even though the underlying value in the interface that TypeOf() receives is a nil pointer.
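
Here's a small demonstration of all of this, which needs Go 1.22 or later since it uses reflect.TypeFor:

package main

import (
    "fmt"
    "io"
    "os"
    "reflect"
)

func main() {
    // Non-interface types take the new fast path via TypeOf of a zero value.
    fmt.Println(reflect.TypeFor[int64]())    // int64
    fmt.Println(reflect.TypeFor[*os.File]()) // *os.File

    // An interface type's zero value is a nil interface, so TypeFor has to
    // fall back to the pointer trick.
    fmt.Println(reflect.TypeFor[io.Reader]()) // io.Reader

    // The underlying TypeOf behavior that the optimization relies on:
    var r io.Reader
    fmt.Println(reflect.TypeOf(r)) // <nil>
}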

programming/GoReflectTypeForOptimization written at 23:12:03; Add Comment

2024-02-13

What is in (Open)ZFS's per-pool "txgs" /proc file on Linux

As part of (Open)ZFS's general 'kstats' system for reporting information about ZFS overall and your individual pools and datasets, there is a per-pool /proc file that reports information about the most recent N transaction groups ('txgs'), /proc/spl/kstat/zfs/<pool>/txgs. How large N is depends on the zfs_txg_history parameter, which defaults to 100. The information in here may be quite important for diagnosing certain sorts of performance problems but I haven't found much documentation on what's in it. Well, let's try to fix that.

The overall format of this file is:

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
5846176  7976255438836187 C     1736704      0            5799936      0        299      5119983470   2707         49115        27910766    
[...]
5846274  7976757197601868 C     1064960      0            4702208      0        236      5119973466   2405         48349        134845007   
5846275  7976762317575334 O     0            0            0            0        0        0            0            0            0           

(This example is coming from a system with four-way mirrored vdevs, which is going to be relevant in a bit.)

So let's take these fields in order:

  1. txg is the transaction group number, which is a steadily increasing number. The file is ordered from the oldest txg to the newest, which will be the current open transaction group.

    (In the example, txg 5846275 is the current open transaction group and 5846274 is the last one that committed.)

  2. birth is the time when the transaction group (txg) was 'born', in nanoseconds since the system booted.

  3. state is the current state of the txg; this will most often be either 'C' for committed or 'O' for open. You may also see 'S' for syncing, 'Q' (being quiesced), and 'W' (waiting for sync). An open transaction group will most likely have 0s for the rest of the numbers, and will be the last txg (there's only one open txg at a time). Any transaction group except the second last will be in state 'C', because you can only have one transaction group in the process of being written out.

    Update: per the comment from Arnaud Gomes, you can have multiple transaction groups at the end that aren't committed. I believe you can only have one that is syncing ('S'), because that happens in a single thread for only one txg, but you may have another that is quiescing or waiting to sync.

    A transaction group's progress through its life cycle is open, quiescing, waiting for sync, syncing, and finally committed. In the open state, additional transactions (such as writing to files or renaming them) can be added to the transaction group; once a transaction group has been quiesced, nothing further will be added to it.

    (See also ZFS fundamentals: transaction groups, which discusses how a transaction group can take a while to sync; the content has also been added as a comment in the source code in txg.c.)

  4. ndirty is how many bytes of directly dirty data had to be written out as part of this transaction group; these bytes come, for example, from user write() IO.

    It's possible to have a transaction group commit with a '0' for ndirty. I believe that this means no IO happened during the time the transaction group was open, and it's just being closed on the timer.

  5. nread is how many bytes of disk reads the pool did between when syncing of the txg starts and when it finishes ('during txg sync').
  6. nwritten is how many bytes of disk writes the pool did during txg sync.
  7. reads is the number of disk read IOs the pool did during txg sync.
  8. writes is the number of disk write IOs the pool did during txg sync.

    I believe these IO numbers include at least any extra IO needed to read in on-disk data structures to allocate free space and any additional writes necessary. I also believe that they track actual bytes written to your disks, so for example with two-way mirrors they'll always be at least twice as big as the ndirty number (in my example above, with four way mirrors, their base is four times ndirty).

    As we can see it's not unusual for nread and reads to be zero. However, I don't believe that the read IO numbers are restricted to transaction group commit activities; if something is reading from the pool for other reasons during the transaction group commit, that will show up in nread and reads. They are thus a measure of the amount of read IO going during the txg sync process, not the amount of IO necessary for it.

    I don't know if ongoing write IO to the ZFS Intent Log can happen during a txg sync. If it can, I would expect it to show up in the nwritten and writes numbers. Unlike read IO, regular write IO can only happen in the context of a transaction group and so by definition any regular writes during a txg sync are part of that txg and show up in ndirty.

  9. otime is how long the txg was open and accepting new write IO, in nanoseconds. Often this will be around the default zfs_txg_timeout time, which is normally five seconds. However, under (write) IO pressure this can be shorter or longer (if the current open transaction group can't be closed because there's already a transaction group in the process of trying to commit).

  10. qtime is how long the txg took to be quiesced, in nanoseconds; it's usually small.
  11. wtime is how long the txg took to wait to start syncing, in nanoseconds; it's usually pretty small, since all it involves is that the separate syncing thread pick up the txg and start syncing it.

  12. stime is how long the txg took to actually sync and commit, again in nanoseconds. It's often appreciable, since it's where the actual disk write IO happens.

In the example "txgs" I gave, we can see that despite the first committed txg listed having more dirty data than the last committed txg, its actual sync time was only about a quarter of the last txg's sync time. This might cause you to look at underlying IO activity patterns, latency patterns, and so on.

As far as I know, there's no per-pool source of information about the current amount of dirty data in the current open transaction group (although once a txg has quiesced and is syncing, I believe you do see a useful ndirty for it in the "txgs" file). A system wide dirty data number can more or less be approximated from the ARC memory reclaim statistics in the anon_size kstat plus the arc_tempreserve kstat, although the latter seems to never get very big for us.

A new transaction group normally opens as the current transaction group begins quiescing. We can verify this in the example output by adding the birth time and the otime of txg 5846274, which add up to exactly the birth time of txg 5846275, the current open txg (7976757197601868 + 5119973466 = 7976762317575334). If this sounds suspiciously exact down to the nanosecond, that's because the code involved freezes the current time at one point and uses it for both the end of the open time of the current open txg and the birth time of the new txg.

Sidebar: the progression through transaction group states

Here is what I can deduce from reading through the OpenZFS kernel code, and since I had to go through this I'm going to write it down.

First, although there is a txg 'birth' state, 'B' in the 'state' column, you will never actually see it. Transaction groups are born 'open', per spa_txg_history_add() in spa_stats.c. Transaction groups move from 'O' open to 'Q' quiescing in txg_quiesce() in txg.c, which 'blocks until all transactions in the group are committed' (which I believe means they are finished fiddling around adding write IO). This function is also where the txg finishes quiescing and moves to 'W', waiting for sync. At this point the txg is handed off to the 'sync thread', txg_sync_thread() (also in txg.c). When the sync thread receives the txg, it will advance the txg to 'S', syncing, call spa_sync(), and then mark everything as done, finally moving the transaction group to 'C', committed.

(In the spa_stats.c code, the txg state is advanced by a call to spa_txg_history_set(), which will always be called with the old state we are finishing. Txgs advance to syncing in spa_txg_history_init_io(), and finish this state to move to committed in spa_txg_history_fini_io(). The tracking of read and write IO during the txg sync is done by saving a copy of the top level vdev IO stats in spa_txg_history_init_io(), getting a second copy in spa_txg_history_fini_io(), and then computing the difference between the two.)

Why it might take some visible time to quiesce a transaction group is more or less explained in the description of how ZFS's implementations of virtual filesystem operations work, in the comment at the start of zfs_vnops_os.c. Roughly, each operation (such as creating or renaming a file) starts by obtaining a transaction that will be part of the currently open txg, then doing its work, and then committing the transaction. If the transaction group starts quiescing while the operation is doing its work, the quiescing can't finish until the work does and commits the transaction for the rename, create, or whatever.

linux/ZFSPoolTXGsInformation written at 22:26:14; Add Comment

2024-02-12

Linux kernel boot messages and seeing if your AMD system has ECC

Consumer x86 desktops have generally not supported ECC memory, at least not if you wanted the 'ECC' bit to actually do anything. With Intel this seems to have been an issue of market segmentation, but things with AMD were more confusing. The initial AMD Ryzen series seemed to generally support ECC in the CPU, but the motherboard support was questionable, and even if your motherboard accepted ECC DIMMs there was an open question of whether the ECC was doing anything on any particular motherboard (cf). Later Ryzens have apparently had an even more confusing ECC support story, but I'm out of touch on that.

When we put together my work desktop we got ECC DIMMs for it and I thought that theoretically the motherboard supported ECC, but I've long wondered if it was actually doing anything. Recently I was looking into this a bit for reasons and ran across Rain's ECC RAM on AMD Ryzen 7000 desktop CPUs, which contained some extremely useful information about how to tell from your boot messages on AMD systems. I'm going to summarize this and add some extra information I've dug out of things.

Modern desktop CPUs talk to memory themselves, but not quite directly from the main CPU; instead, they have a separate on-die memory controller. On AMD Zen series CPUs, this is the AMD Unified Memory Controller, and there are special interfaces to talk to it. As I understand things, ECC is handled (or not) in the UMC, where it receives the raw bits from your DIMMs (if your DIMMs are wide enough, which you may or may not be able to tell). Therefore, to have ECC support active, you need ECC DIMMs and for ECC to be enabled in your UMC (which I believe is typically controlled by the BIOS, assuming the UMC supports ECC, which depends on the CPU).

In Linux, reporting and managing ECC is handled through a general subsystem called EDAC, with specific hardware drivers. The normal AMD EDAC driver is amd64_edac, and as covered by Rain, it registers for memory channels only if the memory channel has ECC on in the on-die UMC. When this happens, you will see a kernel message to the effect of:

EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)

It follows that if you do see this kernel message during boot, you almost certainly have fully supported ECC on your system. It's very likely that your DIMMs are ECC DIMMs, your motherboard supports ECC in the hardware and in its BIOS (and has it enabled in the BIOS if necessary and applicable), and your CPU is willing to do ECC with all of this. Since the above kernel message comes from my office desktop, it seems almost certain that it does indeed fully support ECC, although I don't think I've ever seen any kernel messages about detecting and correcting ECC issues.

You can see more memory channels in larger systems and they're not necessarily sequential; one of our large AMD machines has 'MC0' and 'MC2'. You may also see a message about 'EDAC PCI0: Giving out device to [...]', which is about a different thing.

In the normal Linux kernel way, various EDAC memory controller information can be found in sysfs under /sys/devices/system/edac/mc (assuming that you have anything registered, which you may not on a non-ECC system). This appears to include counts of corrected errors and uncorrected errors both at the high level of an entire memory controller and at the level of 'rows', 'ranks', and/or 'dimms' depending on the system and the kernel version. You can also see things like the memory EDAC mode, which could be 'SECDED' (what my office desktop reports) or 'S8ECD8ED' (what a large AMD server reports).

(The 'MC<n>' number reported by the kernel at boot time doesn't necessarily match the /sys/devices/system/edac/mc<n> number. We have systems which report 'MC0' and 'MC2' at boot, but have 'mc0' and 'mc1' in sysfs.)
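
If you want to read these counts out of sysfs yourself, here's a rough sketch; it assumes the usual ce_count and ue_count attribute names and that some memory controller is registered at all:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// readAttr returns a sysfs attribute's contents, or "?" if it can't be read.
func readAttr(path string) string {
    b, err := os.ReadFile(path)
    if err != nil {
        return "?"
    }
    return strings.TrimSpace(string(b))
}

func main() {
    mcs, _ := filepath.Glob("/sys/devices/system/edac/mc/mc*")
    if len(mcs) == 0 {
        fmt.Println("no EDAC memory controllers registered (probably no active ECC)")
        return
    }
    for _, mc := range mcs {
        fmt.Printf("%s: corrected %s, uncorrected %s\n",
            filepath.Base(mc),
            readAttr(filepath.Join(mc, "ce_count")),
            readAttr(filepath.Join(mc, "ue_count")))
    }
}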

The Prometheus host agent exposes this EDAC information as metrics, primarily in node_edac_correctable_errors_total and node_edac_uncorrectable_errors_total. We have seen a few corrected errors over time on one particular system.

Sidebar: EDAC on Intel hardware

While there's an Intel memory controller EDAC driver, I don't know if it can get registered even if you don't have ECC support. If it is registered with identified memory controllers, and you can see eg 'SECDED' as the EDAC mode in /sys/devices/system/edac/mc/mcN, then I think you can be relatively confident that you have ECC active on that system. On my home desktop, which definitely doesn't support ECC, what I see on boot for EDAC (with Fedora 38's kernel 6.7.4) is:

EDAC MC: Ver: 3.0.0
EDAC ie31200: No ECC support
EDAC ie31200: No ECC support

As expected there are no 'mcN' subdirectories in /sys/devices/system/edac/mc.

Two Intel servers where I'm pretty certain we have ECC support report, respectively:

EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:64:0a.0 (INTERRUPT)

and

EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (POLLED)

As we can see here, Intel CPUs have more than one EDAC driver, depending on CPU generation and so on. The first EDAC message comes from a system with a Xeon Silver 4108, the second from a system with a Xeon E3-1230 v5.

linux/AMDWithECCKernelMessages written at 22:37:18; Add Comment
