What is in (Open)ZFS's per-pool "txgs" /proc file on Linux
As part of (Open)ZFS's general 'kstats' system for reporting information about ZFS overall and your individual pools and datasets, there is a per-pool /proc file that reports information about the most recent N transaction groups ('txgs'), /proc/spl/kstat/zfs/&lt;pool&gt;/txgs. What N is depends on the zfs_txg_history parameter; it defaults to 100. The information in here can be quite important for diagnosing certain sorts of performance problems, but I haven't found much documentation on what's in it. Well, let's try to fix that.
The overall format of this file is:
txg      birth             state  ndirty   nread  nwritten  reads  writes  otime       qtime  wtime  stime
5846176  7976255438836187  C      1736704  0      5799936   0      299     5119983470  2707   49115  27910766
[...]
5846274  7976757197601868  C      1064960  0      4702208   0      236     5119973466  2405   48349  134845007
5846275  7976762317575334  O      0        0      0         0      0       0           0      0      0
(This example is coming from a system with four-way mirrored vdevs, which is going to be relevant in a bit.)
So let's take these fields in order:
txg
is the transaction group number, which is a steadily increasing number. The file is ordered from the oldest txg to the newest, which will be the current open transaction group. (In the example, txg 5846275 is the current open transaction group and 5846274 is the last one that committed.)
birth
is the time when the transaction group (txg) was 'born', in nanoseconds since the system booted.

state
is the current state of the txg; this will most often be either 'C' for committed or 'O' for open. You may also see 'S' for syncing, 'Q' (being quiesced), and 'W' (waiting for sync). An open transaction group will most likely have 0s for the rest of the numbers, and will be the last txg (there's only one open txg at a time). Any transaction group except the second last will be in state 'C', because you can only have one transaction group in the process of being written out.

Update: per the comment from Arnaud Gomes, you can have multiple transaction groups at the end that aren't committed. I believe you can only have one that is syncing ('S'), because that happens in a single thread for only one txg, but you may have another that is quiescing or waiting to sync.
A transaction group's progress through its life cycle is open, quiescing, waiting for sync, syncing, and finally committed. In the open state, additional transactions (such as writing to files or renaming them) can be added to the transaction group; once a transaction group has been quiesced, nothing further will be added to it.
(See also ZFS fundamentals: transaction groups, which discusses how a transaction group can take a while to sync; the content has also been added as a comment in the source code in txg.c.)
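The life cycle just described can be written down as a tiny forward-only state machine. This is only an illustrative sketch (the helper name is mine, not anything in OpenZFS); the single-letter codes are the ones the 'state' column uses:

```python
# A txg only moves forward through these states, in this order:
# open -> quiescing -> waiting for sync -> syncing -> committed.
TXG_LIFECYCLE = ["O", "Q", "W", "S", "C"]

def advances(old, new):
    """True if a txg in state 'old' moves to state 'new' next."""
    i = TXG_LIFECYCLE.index(old)
    return i + 1 < len(TXG_LIFECYCLE) and TXG_LIFECYCLE[i + 1] == new

assert advances("O", "Q")       # open txgs quiesce...
assert advances("S", "C")       # ...and syncing txgs commit.
assert not advances("O", "S")   # a txg can't skip quiescing.
```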
ndirty
is how many bytes of directly dirty data had to be written out as part of this transaction; these bytes come, for example, from user write() IO. It's possible to have a transaction group commit with a '0' for ndirty. I believe that this means no IO happened during the time the transaction group was open, and it's just being closed on the timer.

nread
is how many bytes of disk reads the pool did between when syncing of the txg starts and when it finishes ('during txg sync').

nwritten
is how many bytes of disk writes the pool did during txg sync.

reads
is the number of disk read IOs the pool did during txg sync.

writes
is the number of disk write IOs the pool did during txg sync.

I believe these IO numbers include at least any extra IO needed to read in on-disk data structures to allocate free space and any additional writes necessary. I also believe that they track actual bytes written to your disks, so for example with two-way mirrors they'll always be at least twice as big as the ndirty number (in my example above, with four-way mirrors, their base is four times ndirty).

As we can see, it's not unusual for nread and reads to be zero. However, I don't believe that the read IO numbers are restricted to transaction group commit activities; if something is reading from the pool for other reasons during the transaction group commit, that will show up in nread and reads. They are thus a measure of the amount of read IO going on during the txg sync process, not the amount of IO necessary for it.

I don't know if ongoing write IO to the ZFS Intent Log can happen during a txg sync. If it can, I would expect it to show up in the nwritten and writes numbers. Unlike read IO, regular write IO can only happen in the context of a transaction group, and so by definition any regular writes during a txg sync are part of that txg and show up in ndirty.

otime
is how long the txg was open and accepting new write IO, in nanoseconds. Often this will be around the default zfs_txg_timeout time, which is normally five seconds. However, under (write) IO pressure this can be shorter or longer (if the current open transaction group can't be closed because there's already a transaction group in the process of trying to commit).

qtime
is how long the txg took to be quiesced, in nanoseconds; it's usually small.

wtime
is how long the txg took to wait to start syncing, in nanoseconds; it's usually pretty small, since all it involves is the separate syncing thread picking up the txg and starting to sync it.

stime
is how long the txg took to actually sync and commit, again in nanoseconds. It's often appreciable, since it's where the actual disk write IO happens.
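To make the fields concrete, here's a small parsing sketch (the helper name is mine; it just assumes the space-separated columns shown above, with the header line giving the field names):

```python
# Sketch: parse the text of /proc/spl/kstat/zfs/<pool>/txgs into a list
# of per-txg dicts. The field names come from the file's header line.
FIELDS = ["txg", "birth", "state", "ndirty", "nread", "nwritten",
          "reads", "writes", "otime", "qtime", "wtime", "stime"]

def parse_txgs(text):
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if parts[0] == "txg":          # skip the header line
            continue
        row = dict(zip(FIELDS, parts))
        for f in FIELDS:               # everything but 'state' is a number
            if f != "state":
                row[f] = int(row[f])
        rows.append(row)
    return rows

# Two rows from the example above: the last committed txg and the open one.
sample = """\
txg birth state ndirty nread nwritten reads writes otime qtime wtime stime
5846274 7976757197601868 C 1064960 0 4702208 0 236 5119973466 2405 48349 134845007
5846275 7976762317575334 O 0 0 0 0 0 0 0 0 0
"""
rows = parse_txgs(sample)
assert rows[0]["state"] == "C" and rows[1]["state"] == "O"
assert rows[0]["ndirty"] == 1064960
```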
In the example "txgs" I gave, we can see that despite the first committed txg listed having more dirty data than the last committed txg, its actual sync time was only about a fifth of the last txg's sync time. This might cause you to look at underlying IO activity patterns, latency patterns, and so on.
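As a quick illustration of the arithmetic (using the example numbers above; nwritten is bytes, stime is nanoseconds, and the helper name is mine):

```python
# Rough effective write bandwidth during a txg's sync, from nwritten
# (bytes) and stime (nanoseconds).
def sync_mbytes_per_sec(nwritten, stime_ns):
    return nwritten / (stime_ns / 1e9) / 1e6

first = sync_mbytes_per_sec(5799936, 27910766)   # first committed txg
last = sync_mbytes_per_sec(4702208, 134845007)   # last committed txg
# The first txg wrote more data in much less time, so its effective
# sync bandwidth is several times higher than the last txg's.
assert first > last
```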
As far as I know, there's no per-pool source of information about
the current amount of dirty data in the current open transaction
group (although once a txg has quiesced and is syncing, I believe
you do see a useful ndirty
for it in the "txgs" file). A system
wide dirty data number can more or less be approximated from the
ARC memory reclaim statistics in
the anon_size
kstat plus the arc_tempreserve
kstat, although
the latter seems to never get very big for us.
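As a sketch of that approximation (assuming the usual 'name type data' line format of /proc/spl/kstat/zfs/arcstats; the sample values below are made up, and the helper name is mine):

```python
# Sketch: approximate system-wide dirty data as anon_size plus
# arc_tempreserve from the arcstats kstat file. Its body is
# 'name type data' lines, after a couple of header lines.
def kstat_values(text, names):
    vals = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] in names:
            vals[parts[0]] = int(parts[2])
    return vals

# Made-up sample in the arcstats kstat format:
sample = """\
13 1 0x01 123 5984 76856249766 7976762317575334
name                            type data
anon_size                       4    262144
arc_tempreserve                 4    0
"""
v = kstat_values(sample, {"anon_size", "arc_tempreserve"})
approx_dirty = v["anon_size"] + v["arc_tempreserve"]
assert approx_dirty == 262144
```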
A new transaction group normally opens as the current transaction
group begins quiescing. We can verify this in the example output
by adding the birth time and the otime
of txg 5846274, which add
up to exactly the birth time of txg 5846275, the current open txg.
If this sounds suspiciously exact down to the nanosecond, that's
because the code involved freezes the current time at one point and
uses it for both the end of the open time of the current open txg
and the birth time of the new txg.
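You can check this with the example numbers (Python's arbitrary-precision integers handle the full nanosecond values exactly):

```python
# birth + otime of the last committed txg (5846274) equals the birth
# time of the current open txg (5846275), exactly to the nanosecond.
birth_5846274 = 7976757197601868
otime_5846274 = 5119973466
birth_5846275 = 7976762317575334
assert birth_5846274 + otime_5846274 == birth_5846275
```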
Sidebar: the progression through transaction group states
Here is what I can deduce from reading through the OpenZFS kernel code, and since I had to go through this I'm going to write it down.
First, although there is a txg 'birth' state, 'B' in the 'state' column, you will never actually see it. Transaction groups are born 'open', per spa_txg_history_add() in spa_stats.c. Transaction groups move from 'O' open to 'Q' quiescing in txg_quiesce() in txg.c, which 'blocks until all transactions in the group are committed' (which I believe means they are finished fiddling around adding write IO). This function is also where the txg finishes quiescing and moves to 'W', waiting for sync. At this point the txg is handed off to the 'sync thread', txg_sync_thread() (also in txg.c). When the sync thread receives the txg, it will advance the txg to 'S', syncing, call spa_sync(), and then mark everything as done, finally moving the transaction group to 'C', committed.
(In the spa_stats.c code, the txg state is advanced by a call to spa_txg_history_set(), which will always be called with the old state we are finishing. Txgs advance to syncing in spa_txg_history_init_io(), and finish this state to move to committed in spa_txg_history_fini_io(). The tracking of read and write IO during the txg sync is done by saving a copy of the top level vdev IO stats in spa_txg_history_init_io(), getting a second copy in spa_txg_history_fini_io(), and then computing the difference between the two.)
Why it might take some visible time to quiesce a transaction group is more or less explained in the description of how ZFS's implementations of virtual filesystem operations work, in the comment at the start of zfs_vnops_os.c. Roughly, each operation (such as creating or renaming a file) starts by obtaining a transaction that will be part of the currently open txg, then doing its work, and then committing the transaction. If the transaction group starts quiescing while the operation is doing its work, the quiescing can't finish until the work does and commits the transaction for the rename, create, or whatever.