2017-09-27
Putting cron jobs into systemd user slices doesn't always work (on Ubuntu 16.04)
As part of dealing with our Ubuntu 16.04 shutdown problem, we now have our systems set up to put all user cron jobs into systemd user slices so that systemd will terminate them before it starts unmounting NFS filesystems. Since we made this change, we've rebooted all of our systems and thus had an opportunity to see how it works in practice in our environment.
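The mechanism behind this is having pam_systemd in cron's PAM stack, so that cron jobs get registered with systemd-logind and their processes wind up under user slices. As a rough sketch (exactly which PAM file this goes in varies, so take the placement as an assumption), it comes down to a line like this in cron's PAM configuration:

  # Register each cron job's session with systemd-logind so its processes
  # wind up in a scope under the user's slice; 'optional' means a
  # pam_systemd failure doesn't stop the cron job from running.
  session    optional    pam_systemd.so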
Unfortunately, what we've discovered is that pam_systemd apparently doesn't always work right. Specifically, we've seen some user cron @reboot entries create processes that wound up still under cron.service, although other @reboot entries for the same user on the same machine wound up with their processes in user slices. When things fail, pam_systemd doesn't log any sort of errors that I can see in the systemd journal.
(Since no failures are logged, this doesn't seem related to the famous systemd issue where pam_systemd can't talk to systemd, eg systemd issue 2863 or this Ubuntu issue.)
The pam_systemd source code isn't very long and doesn't do very much itself. The most important function here appears to be pam_sm_open_session, and reading the code I can't spot a failure path that doesn't cause pam_systemd to log an error. The good news is that turning on debugging for pam_systemd doesn't appear to result in an overwhelming volume of extra messages, so we can probably do this on the machines where we've seen the problem in the hopes that something useful shows up.
(It will probably take a while, since we don't reboot these machines very often. I have not seen or reproduced this on test machines, at least so far.)
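For the record, turning on pam_systemd debugging is just a matter of adding the debug option to its line in the relevant PAM configuration, something like:

  # Have pam_systemd log what it's doing; the messages wind up in
  # syslog and thus the systemd journal.
  session    optional    pam_systemd.so debug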
Looking through what 'systemctl list-dependencies' with various options says for cron.service, it's possible that we need an explicit dependency on systemd-logind.service (although systemd-analyze on one system says that systemd-logind started well before crond). In theory it looks like pam_systemd should be reporting errors if systemd-logind hasn't started, but in practice, who knows. We might as well adopt a cargo cult 'better safe than sorry' approach to unit dependencies, even if it feels like a very long shot.
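If we do add the dependency, a drop-in for cron.service is the least intrusive way to do it. A sketch of what that might look like (the exact directives are my guess at what's needed):

  # 'systemctl edit cron.service' creates and opens
  # /etc/systemd/system/cron.service.d/override.conf; in it:
  [Unit]
  Wants=systemd-logind.service
  After=systemd-logind.service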
(Life would be simpler if systemd had a simple way of discovering the relationship, if any, between two units.)
ZFS's recordsize, holes in files, and partial blocks
Yesterday I wrote about using zdb to peer into ZFS's on-disk storage of files, and in particular I wondered whether, if you wrote a 160 Kb file, ZFS would really use two 128 Kb blocks for it. The answer appeared to be 'no', but I was a little bit confused by some things I was seeing. In a comment, Robert Milkowski set me right:
In your first case (160KB file with 128KB recordsize) it does actually create 2x 128KB blocks. However, because you have compression enabled, the 2nd 128KB block has 32KB of random data (non-compressible) and 96KB of 0s which nicely compresses. You can actually see it reported by zdb as 0x20000L/0x8400P (so 128KB logical and 33KB physical).
He suggested testing on a filesystem with compression off in order to see the true state of affairs. Having done so and done some more digging, he's correct and we can see some interesting things here.
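As a side note, yesterday's 160 Kb test file is simply 160 Kb of random data, so recreating it on another filesystem is a one-liner along these lines (the file name here is purely illustrative):

  # Five 32 Kb chunks of incompressible random data makes 160 Kb.
  dd if=/dev/urandom of=testfile bs=32k count=5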
The simple thing to report is the state of a 160 Kb file (the same as yesterday) on a filesystem without compression. This allocates two full 128 Kb blocks on disk:
  0 L0 0:53a40ed000:20000 20000L/20000P F=1 B=19697368/19697368
  20000 L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368

  segment [0000000000000000, 0000000000040000) size 256K
These are 0x20000 bytes long on disk and the physical size is no different from the logical size. The file size in the dnode is reported as 163840 bytes, and presumably ZFS uses this to know when to return EOF as we read the second block.
One consequence of this is that it's beneficial to turn on compression even for filesystems with uncompressible data, because doing so gets you 'compression' of partial blocks (by compressing those zero bytes). On the filesystem without compression, that 32 Kb of uncompressible data forced the allocation of 128 Kb of space; on the filesystem with compression, the same 32 Kb of data only required 33 Kb of space.
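It's also easy to see what compression is set to and how much you're getting out of it overall; a quick sketch, using the filesystem from these tests as the example name:

  # What compression algorithm is in use, and the accumulated savings
  # across data written while compression was on:
  zfs get compression,compressratio ssddata/homes
  # Turning compression on (it only affects data written afterward):
  zfs set compression=lz4 ssddata/homes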
A more interesting test file has holes that cover an entire recordsize block. Let's make one that has 128 Kb of data, skips the second 128 Kb block entirely, has 32 Kb of data at the end of the third 128 Kb block, skips the fourth 128 Kb block, and has 32 Kb of data at the end of the fifth 128 Kb block. Set up with dd, this is:
  dd if=/dev/urandom of=testfile2 bs=128k count=1
  dd if=/dev/urandom of=testfile2 bs=32k seek=11 count=1 conv=notrunc
  dd if=/dev/urandom of=testfile2 bs=32k seek=19 count=1 conv=notrunc
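One quick cross-check that doesn't require zdb is to compare the file's apparent size with the space actually allocated for it, for instance:

  # 'ls -l' reports the apparent size, which is 640 Kb here (the last
  # write ends at 20 x 32 Kb); 'ls -s' and du report the blocks actually
  # allocated, which will be much smaller because of the holes (and
  # compression).
  ls -ls testfile2
  du -k testfile2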
Up until now I've been omitting the output for the L1 indirect block that contains block information for the L0 blocks. With it included, the file's data blocks look like this:
  # zdb -vv -O ssddata/homes cks/tmp/testfile2
  [...]
  Indirect blocks:
  0 L1 0:8a2c4e2c00:400 20000L/400P F=3 B=3710016/3710016
  0 L0 0:8a4afe7e00:20000 20000L/20000P F=1 B=3710011/3710011
  40000 L0 0:8a2c4cec00:8400 20000L/8400P F=1 B=3710015/3710015
  80000 L0 0:8a2c4da800:8400 20000L/8400P F=1 B=3710016/3710016

  segment [0000000000000000, 0000000000020000) size 128K
  segment [0000000000040000, 0000000000060000) size 128K
  segment [0000000000080000, 00000000000a0000) size 128K
The blocks at 0x20000 and 0x60000 are missing entirely; these are genuine holes. The blocks at 0x40000 and 0x80000 are 128 Kb logical but less physical, and are presumably compressed. Can we tell for sure? The answer is yes:
  # zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile2
  [...]
  0 L1 DVA[0]=<0:8a2c4e2c00:400> DVA[1]=<0:7601b4be00:400> [L1 ZFS plain file] fletcher4 lz4 [...]
  0 L0 DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed [...]
  40000 L0 DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
  80000 L0 DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
(That we need to use both -vv and -bbbb here is due to how zdb's code is set up, and it's rather a hack to get what we want. I had to read the zdb source code to work out how to make it work.)
Among other things (which I've omitted here), this shows us that the 0x40000 and 0x80000 blocks are compressed with lz4, while the 0x0 block is uncompressed (which is what we expect from 128 Kb of random data). ZFS always compresses metadata (or at least tries to), so the L1 indirect block is also compressed with lz4.
This shows us that sparse files benefit from compression being turned on even if they contain uncompressible data. If this was a filesystem with compression off, the blocks at 0x40000 and 0x80000 would each have used 128 Kb of space, not the 33 Kb of space that they did here. ZFS filesystem compression thus helps space usage both for trailing data (which is not uncommon) and for sparse files (which may be relatively rare on your filesystems).
It's sometimes possible to dump the block contents of things like L1 indirect blocks, so you can see a more direct representation of them. This is where it's important to know that metadata is compressed, so we can ask zdb to decompress it with a magic argument:
  # zdb -R ssddata 0:8a2c4e2c00:400:id
  [...]
  DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=3710011L/3710011P fill=1 cksum=3fcb4949b1aa:ff8a4656f2b87fd:d375da58a32c3eee:73a5705b7851bb59
  HOLE [L0 unallocated] size=200L birth=0L
  DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710015L/3710015P fill=1 cksum=1079fbeda2c0:117fba0118c39e9:3534e8d61ddb372b:b5f0a9e59ccdcb7b
  HOLE [L0 unallocated] size=200L birth=0L
  DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710016L/3710016P fill=1 cksum=10944482ae3e:11830a40138e0c8:2f1dbd6afa0ee9b4:7d3d6b2c247ae44
  HOLE [L0 unallocated] size=200L birth=0L
  [...]
Here we can see the direct representation of the L1 indirect block with explicit holes between our allocated blocks. (This is a common way of representing holes in sparse files; most filesystems have some variant of it.)
PS: I'm not using 'zdb -ddddd' today because when I dug deeper into zdb, I discovered that 'zdb -O' would already report this information when given the right arguments, thereby saving me an annoying step.
Sidebar: Why you can't always dump blocks with zdb -R
To decompress a (ZFS) block, you need to know what it's compressed with and its uncompressed size. This information is stored in whatever metadata points to the block, not in the block itself, and so currently zdb -R simply guesses repeatedly until it gets a result that appears to work out right:
  # zdb -R ssddata 0:8a2c4e2c00:400:id
  Found vdev type: mirror
  Trying 00400 -> 00600 (inherit)
  Trying 00400 -> 00600 (on)
  Trying 00400 -> 00600 (uncompressed)
  Trying 00400 -> 00600 (lzjb)
  Trying 00400 -> 00600 (empty)
  Trying 00400 -> 00600 (gzip-1)
  Trying 00400 -> 00600 (gzip-2)
  [...]
  Trying 00400 -> 20000 (lz4)
  DVA[0]=<0:8a4afe7e00:20000> [...]
The result that zdb -R gets may or may not actually be correct, and thus may or may not give you the actual decompressed block data. Here it worked; at other times I've tried it, not so much. The last 'Trying' that zdb -R prints is the one it thinks is correct, so you can at least see if it got it right (here, for example, we know that it did, since it picked lz4 with a true logical size of 0x20000, and that's what the metadata we have about the L1 indirect block says it is).
Ideally zdb -R would gain a way of specifying the compression algorithm and the logical size for the d block flag. Perhaps someday.