What is in (Open)ZFS's per-pool "txgs" /proc file on Linux

February 13, 2024

As part of (Open)ZFS's general 'kstats' system for reporting information about ZFS overall and your individual pools and datasets, there is a per-pool /proc file that reports information about the most recent N transaction groups ('txgs'), /proc/spl/kstat/zfs/<pool>/txgs. How many N is depends on the zfs_txg_history parameter, and defaults to 100. The information in here may be quite important for diagnosing certain sorts of performance problems but I haven't found much documentation on what's in it. Well, let's try to fix that.

The overall format of this file is:

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
5846176  7976255438836187 C     1736704      0            5799936      0        299      5119983470   2707         49115        27910766    
5846274  7976757197601868 C     1064960      0            4702208      0        236      5119973466   2405         48349        134845007   
5846275  7976762317575334 O     0            0            0            0        0        0            0            0            0           

(This example is coming from a system with four-way mirrored vdevs, which is going to be relevant in a bit.)

So lets take these fields in order:

  1. txg is the transaction group number, which is a steadily increasing number. The file is ordered from the oldest txg to the newest, which will be the current open transaction group.

    (In the example, txg 5846275 is the current open transaction group and 5846274 is the last one the committed.)

  2. birth is the time when the transaction group (txg) was 'born', in nanoseconds since the system booted.

  3. state is the current state of the txg; this will most often be either 'C' for committed or 'O' for open. You may also see 'S' for syncing, 'Q' (being quiesced), and 'W' (waiting for sync). An open transaction group will most likely have 0s for the rest of the numbers, and will be the last txg (there's only one open txg at a time). Any transaction group except the second last will be in state 'C', because you can only have one transaction group in the process of being written out.

    Update: per the comment from Arnaud Gomes, you can have multiple transaction groups at the end that aren't committed. I believe you can only have one that is syncing ('S'), because that happens in a single thread for only one txg, but you may have another that is quiescing or waiting to sync.

    A transaction group's progress through its life cycle is open, quiescing, waiting for sync, syncing, and finally committed. In the open state, additional transactions (such as writing to files or renaming them) can be added to the transaction group; once a transaction group has been quiesced, nothing further will be added to it.

    (See also ZFS fundamentals: transaction groups, which discusses how a transaction group can take a while to sync; the content has also been added as a comment in the source code in txg.c.)

  4. ndirty is how many bytes of directly dirty data had to be written out as part of this transaction; these bytes come, for example, from user write() IO.

    It's possible to have a transaction group commit with a '0' for ndirty. I believe that this means no IO happened during the time the transaction group was open, and it's just being closed on the timer.

  5. nread is how many bytes of disk reads the pool did between when syncing of the txg starts and when it finishes ('during txg sync').
  6. nwritten is how many bytes of disk writes the pool did during txg sync.
  7. reads is the number of disk read IOs the pool did during txg sync.
  8. writes is the number of disk write IOs the pool did during txg sync.

    I believe these IO numbers include at least any extra IO needed to read in on-disk data structures to allocate free space and any additional writes necessary. I also believe that they track actual bytes written to your disks, so for example with two-way mirrors they'll always be at least twice as big as the ndirty number (in my example above, with four way mirrors, their base is four times ndirty).

    As we can see it's not unusual for nread and reads to be zero. However, I don't believe that the read IO numbers are restricted to transaction group commit activities; if something is reading from the pool for other reasons during the transaction group commit, that will show up in nread and reads. They are thus a measure of the amount of read IO going during the txg sync process, not the amount of IO necessary for it.

    I don't know if ongoing write IO to the ZFS Intent Log can happen during a txg sync. If it can, I would expect it to show up in the nwritten and writes numbers. Unlike read IO, regular write IO can only happen in the context of a transaction group and so by definition any regular writes during a txg sync are part of that txg and show up in ndirty.

  9. otime is how long the txg was open and accepting new write IO, in nanoseconds. Often this will be around the default zfs_txg_timeout time, which is normally five seconds. However, under (write) IO pressure this can be shorter or longer (if the current open transaction group can't be closed because there's already a transaction group in the process of trying to commit).

  10. qtime is how long the txg took to be quiesced, in nanoseconds; it's usually small.
  11. wtime is how long the txg took to wait to start syncing, in nanoseconds; it's usually pretty small, since all it involves is that the separate syncing thread pick up the txg and start syncing it.

  12. stime is how long the txg took to actually sync and commit, again in nanoseconds. It's often appreciable, since it's where the actual disk write IO happens.

In the example "txgs" I gave, we can see that despite the first committed txg listed having more dirty data than the last committed txg, its actual sync time was only about a quarter of the last txg's sync time. This might cause you to look at underlying IO activity patterns, latency patterns, and so on.

As far as I know, there's no per-pool source of information about the current amount of dirty data in the current open transaction group (although once a txg has quiesced and is syncing, I believe you do see a useful ndirty for it in the "txgs" file). A system wide dirty data number can more or less be approximated from the ARC memory reclaim statistics in the anon_size kstat plus the arc_tempreserve kstat, although the latter seems to never get very big for us.

A new transaction group normally opens as the current transaction group begins quiescing. We can verify this in the example output by adding the birth time and the otime of txg 5846274, which add up to exactly the birth time of txg 5846275, the current open txg. If this sounds suspiciously exact down to the nanosecond, that's because the code involve freezes the current time at one point and uses it for both the end of the open time of the current open txg and the birth time of the new txg.

Sidebar: the progression through transaction group states

Here is what I can deduce from reading through the OpenZFS kernel code, and since I had to go through this I'm going to write it down.

First, although there is a txg 'birth' state, 'B' in the 'state' column, you will never actually see it. Transaction groups are born 'open', per spa_txg_history_add() in spa_stats.c. Transaction groups move from 'O' open to 'Q' quiescing in txg_quiesce() in txg.c, which 'blocks until all transactions in the group are committed' (which I believe means they are finished fiddling around adding write IO). This function is also where the txg finishes quiescing and moves to 'W', waiting for sync. At this point the txg is handed off to the 'sync thread', txg_sync_thread() (also in txg.c). When the sync thread receives the txg, it will advance the txg to 'S', syncing, call spa_sync(), and then mark everything as done, finally moving the transaction group to 'C', committed.

(In the spa_stats.c code, the txg state is advanced by a call to spa_txg_history_set(), which will always be called with the old state we are finishing. Txgs advance to syncing in spa_txg_history_init_io(), and finish this state to move to committed in spa_txg_history_fini_io(). The tracking of read and write IO during the txg sync is done by saving a copy of the top level vdev IO stats in spa_txg_history_init_io(), getting a second copy in spa_txg_history_fini_io(), and then computing the difference between the two.)

Why it might take some visible time to quiesce a transaction group is more or less explained in the description of how ZFS's implementations of virtual filesystem operations work, in the comment at the start of zfs_vnops_os.c. Roughly, each operation (such as creating or renaming a file) starts by obtaining a transaction that will be part of the currently open txg, then doing its work, and then committing the transaction. If the transaction group starts quiescing while the operation is doing its work, the quiescing can't finish until the work does and commits the transaction for the rename, create, or whatever.

Written on 13 February 2024.
« Linux kernel boot messages and seeing if your AMD system has ECC
Understanding a recent optimization to Go's reflect.TypeFor »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Feb 13 22:26:14 2024
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.