Wandering Thoughts archives


Putting cron jobs into systemd user slices doesn't always work (on Ubuntu 16.04)

As part of dealing with our Ubuntu 16.04 shutdown problem, we now have our systems set up to put all user cron jobs into systemd user slices so that systemd will terminate them before it starts unmounting NFS filesystems. Since we made this change, we've rebooted all of our systems and thus had an opportunity to see how it works in practice in our environment.

Unfortunately, what we've discovered is that pam_systemd apparently doesn't always work right. Specifically, we've seen some user cron @reboot entries create processes that wound up still under cron.service, although other @reboot entries for the same user on the same machine wound up with their processes in user slices. When things fail, pam_systemd doesn't log any sort of errors that I can see in the systemd journal.

(Since no failures are logged, this doesn't seem related to the famous systemd issue where pam_systemd can't talk to systemd, eg systemd issue 2863 or this Ubuntu issue.)

The pam_systemd source code isn't very long and doesn't do very much itself. The most important function here appears to be pam_sm_open_session, and reading the code I can't spot a failure path that doesn't cause pam_systemd to log an error. The good news is that turning on debugging for pam_systemd doesn't appear to result in an overwhelming volume of extra messages, so we can probably do this on the machines where we've seen the problem in the hopes that something useful shows up.

(It will probably take a while, since we don't reboot these machines very often. I have not seen or reproduced this on test machines, at least so far.)

Looking through what 'systemctl list-dependencies' with various options says for cron.service, it's possible that we need an explicit dependency on systemd-logind.service (although systemd-analyze on one system says that systemd-logind started well before crond). In theory it looks like pam_systemd should be reporting errors if systemd-logind hasn't started, but in practice, who knows. We might as well adopt a cargo cult 'better safe than sorry' approach to unit dependencies, even if it feels like a very long shot.

(Life would be simpler if systemd had a simple way of discovering the relationship, if any, between two units.)

linux/SystemdCronUserSlicesII written at 23:58:12; Add Comment

ZFS's recordsize, holes in files, and partial blocks

Yesterday I wrote about using zdb to peer into ZFS's on-disk storage of files, and in particular I wondered if you wrote a 160 Kb file, would ZFS really use two 128 Kb blocks for it. The answer appeared to be 'no', but I was a little bit confused by some things I was seeing. In a comment, Robert Milkowski set me right:

In your first case (160KB file with 128KB recordsize) it does actually create 2x 128KB blocks. However, because you have compression enabled, the 2nd 128KB block has 32KB of random data (non-compressible) and 96KB of 0s which nicely compresses. You can actually see it reported by zdb as 0x20000L/0x8400P (so 128KB logical and 33KB physical).

He suggested testing on a filesystem with compression off in order to see the true state of affairs. Having done so and done some more digging, he's correct and we can see some interesting things here.

The simple thing to report is the state of a 160 Kb file (the same as yesterday) on a filesystem without compression. This allocates two full 128 Kb blocks on disk:

    0  L0 0:53a40ed000:20000 20000L/20000P F=1 B=19697368/19697368
20000  L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368

     segment [0000000000000000, 0000000000040000) size  256K

These are 0x20000 bytes long on disk and the physical size is no different from the logical size. The file size in the dnode is reported as 163840 bytes, and presumably ZFS uses this to know when to return EOF as we read the second block.

One consequence of this is that it's beneficial to turn on compression even for filesystems with uncompressible data, because doing so gets you 'compression' of partial blocks (by compressing those zero bytes). On the filesystem without compression, that 32 Kb of uncompressible data forced the allocation of 128 Kb of space; on the filesystem with compression, the same 32 Kb of data only required 33 Kb of space.

A more interesting test file has holes that cover an entire recordsize block. Let's make one that has 128 Kb of data, skips the second 128 Kb block entirely, has 32 Kb of data at the end of the third 128 Kb block, skips the fourth 128 Kb block, and has 32 Kb of data at the end of the fifth 128 Kb block. Set up with dd, this is:

dd if=/dev/urandom of=testfile2 bs=128k count=1
dd if=/dev/urandom of=testfile2 bs=32k seek=11 count=1 conv=notrunc
dd if=/dev/urandom of=testfile2 bs=32k seek=19 count=1 conv=notrunc

Up until now I've been omitting the output for the L1 indirect block that contains block information for the L0 blocks. With it included, the file's data blocks look like this:

# zdb -vv -O ssddata/homes cks/tmp/testfile2
Indirect blocks:
     0 L1  0:8a2c4e2c00:400 20000L/400P F=3 B=3710016/3710016
     0  L0 0:8a4afe7e00:20000 20000L/20000P F=1 B=3710011/3710011
 40000  L0 0:8a2c4cec00:8400 20000L/8400P F=1 B=3710015/3710015
 80000  L0 0:8a2c4da800:8400 20000L/8400P F=1 B=3710016/3710016

     segment [0000000000000000, 0000000000020000) size  128K
     segment [0000000000040000, 0000000000060000) size  128K
     segment [0000000000080000, 00000000000a0000) size  128K

The blocks at 0x20000 and 0x60000 are missing entirely; these are genuine holes. The blocks at 0x40000 and 0x80000 are 128 Kb logical but less physical, and are presumably compressed. Can we tell for sure? The answer is yes:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile2
     0 L1  DVA[0]=<0:8a2c4e2c00:400> DVA[1]=<0:7601b4be00:400> [L1 ZFS plain file] fletcher4 lz4 [...]
     0  L0 DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed [...]
 40000  L0 DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
 80000  L0 DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 [...]

(That we need to use both -vv and -bbbb here is due to how zdb's code is set up, and it's rather a hack to get what we want. I had to read the zdb source code to work out how to make it work.)

Among other things (which I've omitted here), this shows us that the 0x40000 and 0x80000 blocks are compressed with lz4, while the 0x0 block is uncompressed (which is what we expect from 128 Kb of random data). ZFS always compresses metadata (or at least tries to), so the L1 indirect block is also compressed with lz4.

This shows us that sparse files benefit from compression being turned on even if they contain uncompressible data. If this was a filesystem with compression off, the blocks at 0x40000 and 0x80000 would each have used 128 Kb of space, not the 33 Kb of space that they did here. ZFS filesystem compression thus helps space usage both for trailing data (which is not uncommon) and for sparse files (which may be relatively rare on your filesystems).

It's sometimes possible to dump the block contents of things like L1 indirect blocks, so you can see a more direct representation of them. This is where it's important to know that metadata is compressed, so we can ask zdb to decompress it with a magic argument:

# zdb -R ssddata 0:8a2c4e2c00:400:id
DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=3710011L/3710011P fill=1 cksum=3fcb4949b1aa:ff8a4656f2b87fd:d375da58a32c3eee:73a5705b7851bb59
HOLE [L0 unallocated] size=200L birth=0L
DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710015L/3710015P fill=1 cksum=1079fbeda2c0:117fba0118c39e9:3534e8d61ddb372b:b5f0a9e59ccdcb7b
HOLE [L0 unallocated] size=200L birth=0L
DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710016L/3710016P fill=1 cksum=10944482ae3e:11830a40138e0c8:2f1dbd6afa0ee9b4:7d3d6b2c247ae44
HOLE [L0 unallocated] size=200L birth=0L

Here we can see the direct representation of the L1 indirect block with explicit holes between our allocated blocks. (This is a common way of representing holes in sparse files; most filesystems have some variant of it.)

PS: I'm not using 'zdb -ddddd' today because when I dug deeper into zdb, I discovered that 'zdb -O' would already report this information when given the right arguments, thereby saving me an annoying step.

Sidebar: Why you can't always dump blocks with zdb -R

To decompress a (ZFS) block, you need to know what it's compressed with and its uncompressed size. This information is stored in whatever metadata points to the block, not in the block itself, and so currently zdb -R simply guesses repeatedly until it gets a result that appears to work out right:

# zdb -R ssddata 0:8a2c4e2c00:400:id
Found vdev type: mirror
Trying 00400 -> 00600 (inherit)
Trying 00400 -> 00600 (on)
Trying 00400 -> 00600 (uncompressed)
Trying 00400 -> 00600 (lzjb)
Trying 00400 -> 00600 (empty)
Trying 00400 -> 00600 (gzip-1)
Trying 00400 -> 00600 (gzip-2)
Trying 00400 -> 20000 (lz4)
DVA[0]=<0:8a4afe7e00:20000> [...]

The result that zdb -R gets may or may not actually be correct, and thus may or may not give you the actual decompressed block data. Here it worked; at other times I've tried it, not so much. The last 'Trying' that zdb -R prints is the one it thinks is correct, so you can at least see if it got it right (here, for example, we know that it did, since it picked lz4 with a true logical size of 0x20000 and that's what the metadata we have about the L1 indirect block says it is).

Ideally zdb -R would gain a way of specifying the compression algorithm and the logical size for the d block flag. Perhaps someday.

solaris/ZFSFilePartialAndHoleStorage written at 00:14:11; Add Comment

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.