Wandering Thoughts archives

2014-08-30

How to change your dm-cache write mode on the fly in Linux

Suppose that you are using a dm-cache based SSD disk cache, probably through the latest versions of LVM (via Lars, and see also his lvcache). Dm-cache is what I'll call an 'interposition' disk read cache, where writes to your real storage go through it; as a result it can be in either writethrough or writeback mode. It would be nice to be able to find out what mode your cache is in and also to be able to change it. As it happens this is possible, although far from obvious. The procedure also comes with a bunch of caveats and disclaimers.

(The biggest disclaimer is that I'm fumbling my way around all of this stuff and I am in no way an expert on dm-cache. Instead I'm just writing down what I've discovered and been able to do so far.)

Let's assume we're doing our caching through LVM, partly because that's what I have and partly because if you're dealing directly with dm-cache you probably already know this stuff. Our cache LV is called testing/test and it has various sub-LVs under it (the original LV, the cache metadata LV, and the cache LV itself). Our first job is to find out what the device mapper calls it.

# dmsetup ls --tree
[...]
testing-test (253:5)
 |-testing-test_corig (253:4)
 |  `- (9:10)
 |-testing-cache_data_cdata (253:2)
 |  `- (8:49)
 `-testing-cache_data_cmeta (253:3)
    `- (8:49)

We can see the current write mode with 'dmsetup status' on the top level object, although the output is what they call somewhat hard to interpret:

# dmsetup status testing-test
0 10485760 cache 8 259/128000 128 4412/81920 16168 6650 39569 143311 0 0 0 1 writeback 2 migration_threshold 2048 mq 10 random_threshold 4 sequential_threshold 512 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 8

The important bit is the 'writeback' near the end of the line. This says that the cache is in the potentially dangerous writeback mode. To change the write mode we must redefine the cache, which is a little bit alarming and also requires temporarily suspending and resuming the device; the latter may have impacts if, for example, you are actually using it for a mounted filesystem at the time.

DM devices are defined through what the dmsetup manpage describes as 'a table that specifies a target for each sector in the logical device'. Fortunately the tables involved are relatively simple and better yet, we can get dmsetup to give us a starting point:

# dmsetup table testing-test
0 10485760 cache 253:3 253:2 253:4 128 0 default 0

To change the cache mode, we reload an altered table and then suspend and resume the device to activate our newly loaded table. For now I am going to just present the new table; the change is the '1 writethrough' that replaces the original '0':

# dmsetup reload --table '0 10485760 cache 253:3 253:2 253:4 128 1 writethrough default 0' testing-test
# dmsetup suspend testing-test
# dmsetup resume testing-test

At this point you can rerun 'dmsetup status' to see that the cache device has changed to writethrough.
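(If you just want the mode and not the whole status line, a quick and dirty check is to pattern-match the status output for either keyword; this is a rough sketch that will obviously break if the status format changes:

# dmsetup status testing-test | grep -o -E 'writethrough|writeback'
writethrough

It simply greps the text shown earlier, nothing more clever than that.)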

So let's talk about the DM table we (re)created here. The first two numbers are the logical sector range and the rest of it describes the target specification for that range. The format of this specification is, to quote the big comment in the kernel's drivers/md/dm-cache-target.c:

cache <metadata dev> <cache dev> <origin dev> <block size>
      <#feature args> [<feature arg>]*
      <policy> <#policy args> [<policy arg>]*

The original table's ending of '0 default 0' thus meant 'no feature arguments, default policy, no policy arguments'. Our new version of '1 writethrough default 0' is a change to '1 feature argument of writethrough, still the default policy, no policy arguments'. Also, if you're changing from writethrough back to writeback you don't end the table with '1 writeback default 0' because it turns out that writeback isn't a feature, it's just the default state. So you write the end of the table as '0 default 0' (as it was initially here).
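To make that concrete, switching this example cache back to writeback is the same reload, suspend, and resume dance, just with the table ending restored to its original form:

# dmsetup reload --table '0 10485760 cache 253:3 253:2 253:4 128 0 default 0' testing-test
# dmsetup suspend testing-test
# dmsetup resume testing-test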

Now it's time for the important disclaimers. The first disclaimer is that I'm not sure what happens to any dirty blocks on your cache device if you switch from writeback to writethrough mode. I assume that they still get flushed back to your real device and that this happens reasonably fast, but I can't prove it from reading the kernel source or my own tests and I can't find any documentation. At the moment I would call this caveat emptor until I know more. In fact I'm not truly confident what happens if you switch between writethrough and writeback in general.

(I do see some indications that there is a flush and it is rapid, but I feel very nervous about saying anything definite until I'm sure of things.)

The second disclaimer is that at the moment the Fedora 20 LVM cannot set up writethrough cache LVs. You can tell it to do this and it will appear to have succeeded, but the actual cache device as created at the DM layer will be writeback. This issue is what prompted my whole investigation of this at the DM level. I have filed Fedora bug #1135639 about this, although I expect it's an upstream issue.

The third disclaimer is that all of this is as of Fedora 20 and its 3.15.10-200.fc20 kernel (on 64-bit x86, in specific). All of this may change over time and probably will, as I doubt that the kernel people consider very much of this to be a stable interface.

Given all of the uncertainties involved, I don't plan to consider using LVM caching until LVM can properly create writethrough caches. Apart from the hassle involved, I'm just not happy with converting live dm-cache setups from one mode to the other right now, not unless someone who really knows this system can tell us more about what's really going on and so on.

(A great deal of my basic understanding of dmsetup usage comes from Kyle Manna's entry SSD caching using dm-cache tutorial.)

Sidebar: forcing a flush of the cache

In theory, if you want to be more sure that the cache is clean in a switch between writeback and writethrough you can explicitly force a cache clean by switching to the cleaner policy first and waiting for it to stabilize.

# dmsetup reload --table '0 10485760 cache 253:3 253:2 253:4 128 0 cleaner 0' testing-test
# dmsetup wait testing-test

I don't know how long you have to wait for this to be safe. If the cache LV (or other dm-cache device) is quiescent at the user level, I assume that you should be okay when IO to the actual devices goes quiet. But, as before, caveat emptor applies; this is not really well documented.

Sidebar: the output of 'dmsetup status'

The output of 'dmsetup status' is mostly explained in a comment in front of the cache_status() function in drivers/md/dm-cache-target.c. To save people looking for it, I will quote it here:

<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
<#read hits> <#read misses> <#write hits> <#write misses>
<#demotions> <#promotions> <#dirty>
<#features> <features>*
<#core args> <core args>
<policy name> <#policy args> <policy args>*

Unlike with the definition table, 'writeback' is considered a feature here. By cross-referencing this with the earlier 'dmsetup status' output we can discover that mq is the default policy and it actually has a number of arguments, the exact meanings of which I haven't researched (but see here and here).
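Going by that layout, the dirty count is the 14th whitespace-separated field of the status line (the three zeros before the feature count in the earlier output are demotions, promotions, and dirty blocks). So a crude way to wait for a writeback cache to drain, for example after switching to the cleaner policy as in the first sidebar, might be a sketch like this (which obviously depends on the field positions staying put in this kernel):

# while [ "$(dmsetup status testing-test | awk '{print $14}')" != 0 ]; do sleep 5; done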

DmCacheChangeWriteMode written at 00:55:18

2014-08-13

Bind mounts with systemd and non-fstab filesystems

Under normal circumstances the way you deal with Linux bind mounts on a systemd based system is the same as always: you put them in /etc/fstab and systemd makes everything work just like normal. If you can deal with your bind mounts this way, I recommend that you do it and keep your life simple. But sometimes life is not simple.

Suppose, not entirely hypothetically, that you are dealing with base filesystems that aren't represented in /etc/fstab for one reason or another; instead they appear through other mechanisms. For example, perhaps they appear when you import a ZFS pool. You want to use these filesystems as the source of bind mounts.

The first thing that doesn't work is leaving your bind mounts in /etc/fstab. There is no way to tell systemd to not create them until something else happens (eg your zfs-mount.service systemd unit finishes or their source directory appears), so this is basically never going to do the right thing. If you get bind mounts at all they are almost certainly not going to be bound to what you want.

At this point you might be tempted to think 'oh, systemd makes /etc/fstab mounts into magic <name>.mount systemd units, I can just put files in /etc/systemd/system to add some extra dependencies to those magic units'. Sadly this doesn't work; the moment you have a real <name>.mount unit file it entirely replaces the information from /etc/fstab and systemd will tell you that your <name>.mount file is invalid because it doesn't specify what to mount.

In short, you need real .mount units for your bind mounts. You also need to force the ordering, and here again we run into something that would be nice but doesn't work. If you run 'systemctl list-units -t mount', you will see that there are units for all of your additional non-fstab mounts. It's tempting to make your bind mount unit depend on an appropriate mount unit for its source filesystem, eg if you have a bind mount from /archive/something you'd have it depend on archive.mount. Unfortunately this doesn't work reliably because systemd doesn't actually know about these synthetic mount units before the mount appears. Instead you can only depend on whatever .service unit actually does the mounting, such as zfs-mount.service.

(In an extreme situation you could create a service unit that just used a script to wait for the mounts to come up. With a Type=oneshot service unit, systemd won't consider the service successful until the script exits.)
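As an illustration only, such a wait unit might look something like this; the unit name, the path, and the use of mountpoint(1) are all just assumptions on my part:

[Unit]
Description=Wait for /local/var to be mounted
After=zfs-mount.service
Requires=zfs-mount.service

[Service]
Type=oneshot
RemainAfterExit=yes
# poll until the filesystem actually shows up (assumes util-linux's mountpoint)
ExecStart=/bin/sh -c 'while ! mountpoint -q /local/var; do sleep 1; done'

Your bind mount units would then have their After= and Requires= point at this unit instead of (or in addition to) zfs-mount.service.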

The maximally paranoid set of dependencies and guards is something like this:

[Unit]
After=zfs-mount.service
Requires=zfs-mount.service
RequiresMountsFor=/var
ConditionPathIsDirectory=/local/var/local

(This is for a bind mount from /local/var/local to /var/local.)

We can't use a RequiresMountsFor on /local/var, because as far as systemd is concerned it's on the root filesystem and so the dependency would be satisfied almost immediately. I don't think the Condition will cause systemd to wait for /local/var/local to appear; it just stops the bind mount from being attempted if the ZFS mounts happened but didn't manage to mount a /local/var for some reason (eg a broken or missing ZFS pool).

(Since my /var is actually on the root filesystem, the RequiresMountsFor is likely gilding the lily; I don't think there's any situation where this unit can even be considered before the root filesystem is mounted. But if it's a separate filesystem you definitely want this and so it's probably a good habit in general.)

I haven't tested using local-var.mount in just the Requires here but I'd expect it to fail for the same reason that it definitely doesn't work reliably in an After. This is kind of a pity, but there you go and the Condition is probably good enough.
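Putting the pieces together, the complete unit for this particular bind mount would be a var-local.mount file along these lines (a sketch; the [Mount] section follows the bind mount format covered in the next entry):

[Unit]
After=zfs-mount.service
Requires=zfs-mount.service
RequiresMountsFor=/var
ConditionPathIsDirectory=/local/var/local

[Mount]
What=/local/var/local
Where=/var/local
Type=none
Options=bind

[Install]
WantedBy=local-fs.target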

(If you don't want to make a bunch of .mount files, one for each mount, you could make a single .service unit that has all of the necessary dependencies and runs appropriate commands to do the bind mounting (either directly or by running a script). If you do this, don't forget to have ExecStop stuff to also do the unmounts.)
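A bare-bones sketch of that single-service approach, with a made-up unit name and one hard-coded bind mount for illustration:

[Unit]
Description=Local bind mounts
After=zfs-mount.service
Requires=zfs-mount.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/mount --bind /local/var/local /var/local
ExecStop=/bin/umount /var/local

[Install]
WantedBy=local-fs.target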

Sidebar: the likely non-masochistic way to do this for ZFS on Linux

If I was less stubborn, I would have set all of my ZFS filesystems to have 'mountpoint=legacy' and then explicitly mentioned and mounted them in /etc/fstab. Assuming that it worked (ie that systemd didn't try to do the mounts before the ZFS pool came up), this would have let me keep the bind mounts in fstab too and avoided this whole mess.

SystemdAndBindMounts written at 23:19:00

How you create a systemd .mount file for bind mounts

One of the types of units that systemd supports is mount units (see 'man systemd.mount'). Normally you set up all your mounts with /etc/fstab entries and you don't have to think about them, but under some specialized circumstances you can wind up needing to create real .mount unit files for some mounts.

How to specify most filesystems is pretty straightforward, but it's not quite clear how you specify Linux bind mounts. Since I was just wrestling repeatedly with this today, here is what you need to put in a systemd .mount file to get a bind mount:

[Mount]
What=/some/old/dir
Where=/the/new/dir
Type=none
Options=bind

This corresponds to the mount command 'mount --bind /some/old/dir /the/new/dir' and an /etc/fstab line of '/some/old/dir /the/new/dir none bind'. Note that the type of the mount is none, not bind as you might expect. This works because current versions of mount will accept arguments of '-t none -o bind' as meaning 'do a bind mount'.

(I don't know if you can usefully add extra options to the Options setting or if you'd need an actual script if you need to, eg, make a bind mountpoint read-only. If you can do it in /etc/fstab you can probably do it here.)

A fully functioning .mount unit will generally have other stuff as well. What I've wound up using on Fedora 20 (mostly copied from the standard tmp.mount) is:

[Unit]
DefaultDependencies=no
Conflicts=umount.target
Before=local-fs.target umount.target

[Mount]
[[ .... whatever you need ...]]

[Install]
WantedBy=local-fs.target

Add additional dependencies, documentation, and so on as you need or want them. For what it's worth, I've also had bind mount units work without the three [Unit] bits I have here.
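One wrinkle to remember is that systemd requires a .mount unit's file name to match its Where= directory, with the slashes turned into dashes (and special characters escaped). For the example above that means the file has to be called the-new-dir.mount; with that in place, enabling and starting it is the usual:

# systemctl enable the-new-dir.mount
# systemctl start the-new-dir.mount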

Note that this assumes a 'local' filesystem, not a network one. If you're dealing with a network filesystem or something that depends on one, you'll need to change bits of the targets (the systemd documentation suggests remote-fs.target).

SystemdBindMountUnits written at 00:33:41

2014-08-11

Copying GPT partition tables from disk to disk

There are a number of situations where you want to replicate partition tables from one disk to another; for example, if you are setting up mirroring or (more likely) replacing a dead disk in a mirrored setup with a new one. If you're using old-fashioned MBR partitioning, the best tool for this is sfdisk and it's done as follows:

sfdisk -d /dev/OLD | sfdisk /dev/NEW

Under some situations you may need 'sfdisk -f'.

If you're using new, modern GPT partitioning, the equivalent of sfdisk is sgdisk. However it gets used somewhat differently and you need two operations:

sgdisk -R=/dev/NEW /dev/OLD
sgdisk -G /dev/NEW

For obvious reasons you really, really don't want to accidentally flip the arguments. You need sgdisk -G to update the new disk's partitions to have different GUIDs from the original disk, because GUIDs should be globally unique even if the partitioning is the same.

The easiest way to see if your disks are using GPT or MBR partitioning is probably to run 'fdisk -l /dev/DISK' and look at what the 'Disklabel type' says. If it claims GPT partitioning, you can then run 'sgdisk -p /dev/DISK' to see if sgdisk likes the full GPT setup or if it reports problems. Alternately you can use 'gdisk -l /dev/DISK' and pay careful attention to the 'Partition table scan' results, but this option is actually kind of dangerous; under some situations gdisk will stop to prompt you about what to do about 'corrupted' GPTs.
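For illustration, with a disk that really is using GPT the check boils down to something like this (the exact output wording varies between fdisk versions, and /dev/DISK is a stand-in as before):

# fdisk -l /dev/DISK | grep 'Disklabel type'
Disklabel type: gpt
# sgdisk -p /dev/DISK
[... partition listing, plus warnings if sgdisk is unhappy with the GPT ...]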

Unfortunately sgdisk lacks any fully supported way of saving a relatively generic dump of partition information; 'sgdisk -b' explicitly creates something which the documentation says should not be restored onto anything except the original disk. This is a hassle if you want to create a generic GPT based partitioning setup that you will exactly replicate across a whole fleet of disks (this is partly why we don't use GPT partitioning on our new iSCSI backends).

(I suspect that in practice you can use 'sgdisk -b' dumps for this even if it's not officially supported, but enhh. Don't forget to run 'sgdisk -G' on everything afterwards.)
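If you do experiment with this officially unsupported approach, I believe the mechanics would go something like the following (the backup file name is arbitrary, and as mentioned the documentation doesn't promise this works across disks):

# sgdisk -b=/tmp/gpt.backup /dev/OLD
# sgdisk -l=/tmp/gpt.backup /dev/NEW
# sgdisk -G /dev/NEW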

(This is the kind of entry that I write so I have this information in a place where I can easily find it again.)

CopyingGPTPartitioning written at 13:41:09

2014-08-10

What I want out of a Linux SSD disk cache layer

One of the suggestions in response to my SSD dilemma was a number of Linux kernel systems that are designed to add a caching layer on top of regular disks; the leading candidates here seem to be dm-cache and bcache. I looked at both of them and unfortunately I don't like either one because they don't work in the way I want.

Put simply, what I want is the ability to attach a SSD read accelerator to my filesystems or devices without changing how they are currently set up. What I had hoped for was some system where you told things 'start caching traffic from X, Y, and Z' and it would all transparently just happen; your cache would quietly attach itself to the rest of the system somehow and that would be that. Later you could say 'stop caching traffic from X', or 'stop entirely', and everything would go back to how it was before. Roughly speaking this is the traditional approach taken by the few systems that used local disks to cache and accelerate NFS reads.

Unfortunately this isn't what dm-cache and bcache do. Both of them function as an additional, explicit layer in the Linux storage stack, and as explicit layers you don't mount, say, your filesystem from its real device; you mount it from the dm-cache or bcache version of it. Among other things, this makes moving between using a cached version and a non-cached version of your objects a somewhat hair-raising exercise; for example, bcache explicitly needs to change an existing underlying filesystem. Want to totally back out from using bcache or dm-cache? You're probably going to have a headache.

(This is especially annoying because there are two cache options in Linux today and who knows which one will be better for me.)

Both dm-cache and bcache are probably okay for a large deployment where they are planned from the start. In a large deployment you will evaluate each in your scenario, determine which one you want and what sort of settings you want, and then install machines with the caching layer configured from the start. You expect to never remove your chosen caching layer; generally you'll have specifically configured your hardware fleet around the needs of the caching layer.

None of this describes the common scenario of 'I have an existing machine with a bunch of existing data, and I have enough money for a SSD. I'd like to speed up my stuff'. That is pretty much my scenario (at least to start with). I rather expect it's very much the scenario of any number of people with existing desktops.

(It's also effectively the scenario for new machines for people who do not buy their desktops in bulk. I'm not going to spec out and buy a machine configuration built around the assumption that some Linux caching layer will turn out to work great for me; among other things, it's too risky.)

PS: if I've misunderstood how dm-cache or bcache work, my apologies; I have only skimmed their documentation. Bcache at least has a kind of scary FAQ about using (or not using) it on existing filesystems.

SSDDiskCacheDesire written at 00:47:14

