How Linux swap files (and swap partitions) find where to read and write

November 11, 2022

In a comment on my entry about how swap files don't update their modification time when they're used for swapping, Stephen Kitt noted:

Swap files have a specific code path in the kernel: when they are added, the kernel lists the blocks which they contain, and any actual swap I/O is done directly to the underlying block device. [...]

This made me curious how this worked and how sophisticated the kernel's process for it was. The answer turns out to be relatively straightforward at a high level, and it's covered in a comment in mm/swapfile.c, just before setup_swap_extents() (currently here):

A `swap extent' is a simple thing which maps a contiguous range of pages onto a contiguous range of disk blocks. A rbtree of swap extents is built at swapon time and is then used at swap_writepage/swap_readpage time for locating where on disk a page belongs.

If the swapfile is an S_ISBLK block device, a single extent is installed. This is done so that the main operating code can treat S_ISBLK and S_ISREG swap files identically.

(There are more details in the full comment; if you're interested, it's worth reading the whole thing.)
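To make the idea of a swap extent concrete, here is a minimal user-space sketch of the lookup in Python. It is not the kernel's code: the field names are modeled on the kernel's struct swap_extent, but the SwapExtent class and find_block() function are hypothetical, and a sorted list with binary search stands in for the rbtree.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class SwapExtent:
    """Hypothetical stand-in for one swap extent (not the kernel struct)."""
    start_page: int   # first swap page offset this extent covers
    nr_pages: int     # number of contiguous pages in the extent
    start_block: int  # disk block (in page-sized units) backing start_page


def find_block(extents, page):
    """Map a swap page offset to a disk block.

    `extents` must be sorted by start_page and must not overlap. The
    kernel uses an rbtree for this; binary search on a sorted list is
    an assumption made to keep the sketch short.
    """
    starts = [e.start_page for e in extents]
    i = bisect_right(starts, page) - 1
    if i < 0:
        raise LookupError(f"page {page} is before the first extent")
    e = extents[i]
    if page >= e.start_page + e.nr_pages:
        raise LookupError(f"page {page} is not covered by any extent")
    return e.start_block + (page - e.start_page)


# A swap partition: one extent covering the whole device.
partition = [SwapExtent(start_page=0, nr_pages=1000, start_block=0)]

# A swap file in two pieces on disk (made-up block numbers).
swapfile = [
    SwapExtent(start_page=0, nr_pages=4, start_block=100),
    SwapExtent(start_page=4, nr_pages=2, start_block=500),
]
```

With these made-up extents, find_block(partition, 42) is simply 42, while find_block(swapfile, 5) lands in the second extent at block 501. The lookup code is the same in both cases, which is the point of installing a single extent for block devices.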

The kernel has two ways to build these swap extents for a swap file. First, a filesystem may support an explicit way of obtaining these swap mappings through a swap_activate method it provides. How these functions work varies from filesystem to filesystem; the ext4 one uses a generic 'iomap' system with an ext4-specific callback (cf fs/ext4/inode.c), while the NFS code basically pretends that the NFS swap file is a block device, creating a single file-wide direct mapping (see nfs_swap_activate() in fs/nfs/file.c).

Second, for filesystems that don't have specific support, the kernel will go through every block in the file and attempt to map it to a disk block, then identify contiguous runs of blocks and merge them together into a single extent (see bmap() in fs/inode.c and generic_swapfile_activate() in mm/page_io.c). This only works if the filesystem can map file blocks to disk blocks for you, which not all filesystems can. Presumably this is rather slow, especially for large swap files, since it means one mapping lookup per block. I believe that most filesystems that expect (and want) to be used for swap have their own fast swap_activate function because of this.
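The "map every block, then merge contiguous runs" approach can be sketched in a few lines of Python. This is a simplified illustration and not what generic_swapfile_activate() literally does: the build_extents() function is hypothetical, it assumes one disk block per page, and it takes the per-block mapping as a ready-made list instead of calling into a filesystem.

```python
def build_extents(block_map):
    """Merge a per-page block mapping into contiguous extents.

    `block_map[i]` is the disk block backing page i of the file, or
    None if that page is a hole. Returns a list of
    (start_page, nr_pages, start_block) tuples.
    """
    extents = []
    for page, block in enumerate(block_map):
        if block is None:
            # The kernel also refuses to swap on files with holes.
            raise ValueError(f"swap file has a hole at page {page}")
        if extents:
            start_page, nr_pages, start_block = extents[-1]
            if start_block + nr_pages == block:
                # This block directly follows the previous extent on
                # disk, so grow that extent instead of starting a new one.
                extents[-1] = (start_page, nr_pages + 1, start_block)
                continue
        extents.append((page, 1, block))
    return extents


# Six pages stored as two separate runs on disk (made-up block numbers).
example = build_extents([100, 101, 102, 103, 500, 501])
```

Here example comes out as two extents, (0, 4, 100) and (4, 2, 500). The loop visits every page once, which is why this approach scales with the size of the swap file and not with how fragmented it is.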

(Currently btrfs, cifs, ext4, f2fs, NFS, xfs, and 'zonefs' appear to have specific swap activation functions.)

If a filesystem lacks support for both specifically activating a swapfile on it and mapping blocks, then you can't put a swapfile on it; one example is ZFS on Linux. Somewhat surprisingly, btrfs does support swapfiles (despite normally being a copy-on-write filesystem like ZFS). Btrfs goes to some lengths to allow this and using a swapfile on btrfs has a number of restrictions. You can read the gory details in the swap-related portions of fs/btrfs/inode.c, starting with btrfs_swap_activate().

(A significant number of filesystems appear to have support for mapping blocks, too many for me to list them here. Many of them don't seem like filesystems you'd want to use for swapping, and a few of them are read-only, for example 'isofs'.)
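If you want to poke at block mapping from user space, the FIBMAP ioctl is one way to ask a filesystem for the disk block behind a file block. The sketch below is a rough probe only: the fibmap() helper is hypothetical, FIBMAP normally requires root privileges, and a failure here doesn't by itself tell you whether swapon would succeed (a filesystem may have its own swap activation function, and swapon has other requirements too).

```python
import fcntl
import struct

FIBMAP = 1  # _IO(0x00, 1) from <linux/fs.h>


def fibmap(path, logical_block=0):
    """Ask the filesystem which disk block backs `logical_block` of `path`.

    Returns the block number as an int (0 generally means the block is
    not mapped, for example a hole), or None if the file can't be
    opened or the ioctl fails (no permission, or the filesystem can't
    map blocks this way).
    """
    try:
        with open(path, "rb") as f:
            # The kernel reads the logical block number from the buffer
            # and writes the physical block number back into it.
            arg = struct.pack("i", logical_block)
            out = fcntl.ioctl(f.fileno(), FIBMAP, arg)
    except OSError:
        return None
    return struct.unpack("i", out)[0]
```

Calling fibmap() on a file as root on a filesystem like ext4 should give you a block number, while on a filesystem that can't map blocks it returns None.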

Written on 11 November 2022.
