Wandering Thoughts archives

2025-03-18

How ZFS knows and tracks the space usage of datasets

Anyone who's ever had to spend much time with 'zfs list -t all -o space' knows the basics of ZFS space usage accounting, with space used by the datasets, data unique to a particular snapshot (the 'USED' value for a snapshot), data used by snapshots in total, and so on. But today I discovered that I didn't really know how it all worked under the hood, so I went digging in the source code. The answer is that ZFS tracks all of these types of space usage directly as numbers, and updates them as blocks are logically freed.

(Although all of these are accessed from user space as ZFS properties, they're not conventional dataset properties; instead, ZFS materializes the property version any time you ask, from fields in its internal data structures. Some of these fields are different and accessed differently for snapshots and regular datasets, for example what 'zfs list' presents as 'USED'.)

All changes to a ZFS dataset happen in a ZFS transaction group, and transaction groups are assigned ever-increasing numbers, the 'transaction group number' (txg). This includes allocating blocks, which remember their 'birth txg', and making snapshots, which carry the txg they were made in and necessarily don't contain any blocks that were born after that txg. When ZFS wants to free a block in the live filesystem (either because you deleted the object or because you're writing new data and ZFS is doing its copy on write thing), it looks at the block's birth txg and the txg of the most recent snapshot. If the block is old enough that it has to be in that snapshot, the block is not actually freed and its space is transferred from 'USEDDS' (used by the filesystem itself) to 'USEDSNAP' (used only in snapshots). ZFS will then further check the block's birth txg against the txg of the snapshot before that one to see if the block is unique to the most recent snapshot, in which case its space will be added to that snapshot's 'USED'.
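To make the txg comparisons concrete, here is a minimal sketch of the decision in Python. This is a toy model and not the OpenZFS code; the class names, field names, and `free_block()` are all made up for illustration, and it assumes a snapshot contains every block born at or before the snapshot's txg.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    txg: int         # txg the snapshot was made in
    unique: int = 0  # what 'zfs list' shows as USED for a snapshot


@dataclass
class Dataset:
    used_by_dataset: int = 0    # hypothetical stand-in for 'USEDDS'
    used_by_snapshots: int = 0  # hypothetical stand-in for 'USEDSNAP'
    snapshots: list = field(default_factory=list)  # oldest first


def free_block(ds, birth_txg, size):
    """Free a block from the live filesystem and adjust the counters."""
    ds.used_by_dataset -= size
    if not ds.snapshots or birth_txg > ds.snapshots[-1].txg:
        # Born after the most recent snapshot: no snapshot has it.
        return "freed"
    # The most recent snapshot still holds the block, so the space
    # moves to the snapshots instead of being released.
    ds.used_by_snapshots += size
    latest = ds.snapshots[-1]
    prev_txg = ds.snapshots[-2].txg if len(ds.snapshots) > 1 else 0
    if birth_txg > prev_txg:
        # Born after the previous snapshot, so only the most recent
        # snapshot has it: it is unique to that snapshot.
        latest.unique += size
    return "kept"
```

With snapshots at txg 10 and 20, freeing a block born in txg 25 really frees it, a block born in txg 15 becomes unique to the second snapshot, and a block born in txg 5 only moves into the snapshot total.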

ZFS goes through a similar process when you delete a snapshot. As it runs around trying to free up the snapshot's space, it may discover that a block it's trying to free is now used only by one other snapshot, based on the relevant txgs. If so, the block's space is added to that snapshot's 'USED'. If the block is freed entirely, ZFS will decrease the 'USEDSNAP' number for the entire dataset. If the block is still used by several snapshots, no usage numbers need to be adjusted.

(Determining if a block is unique in the previous snapshot is fairly easy, since you can look at the birth txgs of the two previous snapshots. Determining if a block is now unique in the next snapshot (or for that matter is still in use in the dataset) is more complex and I don't understand the code involved; presumably it involves somehow looking at what blocks were freed and when. Interested parties can look into the OpenZFS code themselves, where there are some surprises.)
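The three outcomes when deleting a snapshot can be illustrated with a brute-force toy in Python. To be clear, this is not how ZFS does it: ZFS works from txgs, while this sketch cheats by keeping an explicit set of block ids for each snapshot and for the live filesystem. The function name and all of the data structures are hypothetical.

```python
def delete_snapshot(snapshots, live, sizes, usedsnap, name):
    """Delete snapshot 'name' and return (new usedsnap, unique gains).

    snapshots: dict of snapshot name -> set of block ids it contains
    live:      set of block ids in the live filesystem
    sizes:     dict of block id -> size in bytes
    usedsnap:  current space used only by snapshots
    """
    doomed = snapshots.pop(name)
    gained = {}
    for blk in doomed:
        if blk in live:
            # Still in the live filesystem; it was never counted as
            # snapshot-only space, so nothing changes.
            continue
        holders = [n for n, blocks in snapshots.items() if blk in blocks]
        if not holders:
            # No one else has it: the block is freed entirely.
            usedsnap -= sizes[blk]
        elif len(holders) == 1:
            # Now unique to one other snapshot; its USED goes up.
            gained[holders[0]] = gained.get(holders[0], 0) + sizes[blk]
        # Still shared by several snapshots: no numbers change.
    return usedsnap, gained
```

For example, with snapshots a = {1, 2, 3, 4}, b = {2, 3, 5}, and c = {3, 5}, a live filesystem holding only block 4, and every block 10 bytes, deleting a frees block 1, makes block 2 unique to b, and leaves the shared block 3 alone.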

PS: One consequence of this is that there's no way after the fact to find out when space shifted from being used by the filesystem to used by snapshots (for example, when something large gets deleted in the filesystem and is now present only in snapshots). All you can do is capture the various numbers over time and then look at your historical data to see when they changed. The removal of snapshots is captured by ZFS pool history, but as far as I know this doesn't capture how the deletion affected the various space usage numbers.
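If you want to capture the numbers over time, one simple approach is to periodically record the parseable output of 'zfs list'. The following Python is a minimal sketch of that; the function names are made up, and it assumes that 'zfs list -Hp' prints tab-separated exact byte counts and uses '-' for properties that don't apply to an entry.

```python
import subprocess
import time

# Real ZFS property names; 'usedbydataset' and 'usedbysnapshots' are
# what 'zfs list -o space' shows as USEDDS and USEDSNAP.
FIELDS = ["name", "used", "usedbydataset", "usedbysnapshots"]


def parse_zfs_list(text):
    """Turn 'zfs list -Hp -o ...' output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        row = {"name": parts[0]}
        for key, val in zip(FIELDS[1:], parts[1:]):
            # '-' means the property has no value for this entry.
            row[key] = None if val == "-" else int(val)
        rows.append(row)
    return rows


def sample():
    """Take one timestamped sample of all datasets and snapshots."""
    cmd = ["zfs", "list", "-Hp", "-t", "all", "-o", ",".join(FIELDS)]
    out = subprocess.run(cmd, check=True, capture_output=True,
                         text=True).stdout
    return time.time(), parse_zfs_list(out)
```

Calling sample() from cron or a metrics agent and storing the results somewhere gives you the historical data to see when space shifted between the filesystem and its snapshots.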

solaris/ZFSSpaceUsageHowTracked written at 22:44:37;


