My views on using LVM for your system disk and root filesystem

May 4, 2017

In a comment on my entry about perhaps standardizing the size of our server root filesystems, Goozbach asked a good question:

Any reason not to put LVM on top of raid for OS partitions? (it's saved my bacon more than once both resizing and moving disks)

First, let's be clear what we're talking about here. This is the choice between putting your root filesystem directly into a software RAID array (such as /dev/md0) or creating a LVM volume group on top of the software RAID array and then having your root filesystem be a logical volume in it. In a root-on-LVM-on-MD setup, I'm assuming that the root filesystem would still use up all of the disk space in the LVM volume group (for most of the same reasons outlined for the non-LVM case in the original entry).
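For concreteness, a minimal sketch of the root-on-LVM-on-MD layering (with invented device and volume group names; a real install would normally set this up through the installer) looks something like this:

    # mirror the two system disks' main partitions
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

    # put a volume group on top of the mirror
    pvcreate /dev/md0
    vgcreate rootvg /dev/md0

    # give all of the space to a single root logical volume
    lvcreate -n root -l 100%FREE rootvg
    mkfs.ext4 /dev/rootvg/root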

For us, the answer is that there is basically no payoff for routinely doing this, because in order to need LVM for this we'd need a number of unusual things to be true all at once:

  • we can't just use space in the root filesystem; for some reason, it has to be an actual separate filesystem.
  • but this separate filesystem has to use space from the system disks, not from any additional disks that we might add to the server.
  • and there needs to be some reason why we can't just reinstall the server from scratch with the correct partitioning and must instead go through the work of shrinking the root filesystem and root LVM logical volume in order to make up enough spare space for the new filesystem.

Probably an important part of this is that our practice is to reinstall servers from scratch when we repurpose them, using our install system that makes this relatively easy. When we do this we get the option to redo the partitioning (although it's generally easier to keep things the same, since that means we don't even have to repartition, just tell the installer to use the existing software RAIDs). If we had such a special need for a separate filesystem, it's probably a sufficiently unique and important server that we would want to start it over from scratch, rather than awkwardly retrofitting an existing server into shape.

(One problem with a retrofitted server is that you can't be entirely sure you can reinstall it from scratch if you need to, for example because the hardware fails. Installing a new server from scratch does help a great deal to assure that you can reinstall it too.)

We do have servers with unusual local storage needs. But those servers mostly use additional disks or unusual disks to start with, especially now that we've started moving to small SSDs for our system disks. With small SSDs there just isn't much space left over for a second filesystem, especially if you want to leave a reasonable amount of space free on both it and the root filesystem in case of various contingencies (including just 'more logs got generated than we expected').

I also can't think of many things that would need a separate filesystem instead of just being part of the root filesystem and using up space there. If we're worried about this whatever-it-is running the root filesystem out of space, we almost certainly want to put in big, non-standard system disks in the first place rather than try to wedge it into whatever small disks the system already has. Leaving all the free space in a single (root) filesystem that everything uses gives us much the same space flexibility as ZFS, and we're lazy enough to like that. It's possible that I'm missing some reasonably common special case here because we just don't do whatever it is that really needs a separate local filesystem.

(We used to have some servers that needed additional system filesystems because they needed or at least appeared to want special mount options. Those needs quietly went away over the years for various reasons.)

Sidebar: LVM plus a fixed-size root filesystem

One possible option to advance here is a hybrid approach between a fixed size root partition and a LVM setup: you make the underlying software RAID and LVM volume group as big as possible, but then you assign only a fixed and limited amount of that space to the root filesystem. The remaining space is left as uncommitted free space, and then is either allocated to the root if it needs to grow or used for additional filesystems if you need them.
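As a sketch (the volume group name and sizes here are invented), this hybrid setup amounts to something like:

    # the volume group already covers the whole software RAID array;
    # give the root filesystem only a fixed slice of it
    lvcreate -n root -L 30G rootvg
    mkfs.ext4 /dev/rootvg/root

    # later, if the root filesystem needs more space:
    lvextend -L +10G -r /dev/rootvg/root    # -r grows the filesystem too

    # or, if a separate filesystem turns out to be needed after all:
    lvcreate -n scratch -L 40G rootvg
    mkfs.ext4 /dev/rootvg/scratch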

I don't see much advantage to this setup, though. Since the software RAID array is maximum-sized, you still have the disk replacement problems that motivated my initial question. You add the chance of the root filesystem running out of space if you don't keep an eye on it and make the time to grow it as needed, and in order for this setup to pay off you still have to need the space in a separate filesystem for some reason, instead of as part of the root filesystem. What you save is the hassle of shrinking the root filesystem if you ever need to make that additional filesystem with its own space.


Comments on this page:

There's another option that I want you to be aware of, even if you decide not to go with it. (Informed decisions and all....)

LVM does (and has for nearly 10 years) support RAID at the LV level. In this case, you don't use traditional software RAID (mdadm). Instead you add each disk as a PV to the VG, then create the (root) LV with an option specifying it as a RAID (0, 1, or 5) LV. LVM will then use the kernel's software RAID (MD) code internally / behind the scenes for that LV, but not for the entire VG.
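As a rough illustration (the disk names, size, and VG name here are all made up):

    # both system disks (or partitions on them) go into the VG as plain PVs
    pvcreate /dev/sda2 /dev/sdb2
    vgcreate sysvg /dev/sda2 /dev/sdb2

    # the root LV itself is created as a RAID-1 mirror across the two PVs
    lvcreate --type raid1 -m 1 -n root -L 20G sysvg
    mkfs.ext4 /dev/sysvg/root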

Keep your LV(s) fairly small and only grow them when necessary (say, when there's less than 20% free space), and then only by enough to create a comfortable amount of head room.

This has the added advantage that you can replace a 1 TB PV with a 2 TB PV with the typical vgextend myVG /dev/newDisk, pvmove /dev/oldDisk, vgreduce myVG /dev/oldDisk sequence.

LVM will then do the grunt work behind the scenes to migrate data between disks.
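Spelled out (device and VG names are placeholders, and this assumes your LVM version can pvmove the LV types involved), the replacement is the usual LVM sequence for swapping out a PV:

    pvcreate /dev/newDisk         # initialize the new, larger disk as a PV
    vgextend myVG /dev/newDisk    # add it to the volume group
    pvmove /dev/oldDisk           # migrate all extents off the old PV
    vgreduce myVG /dev/oldDisk    # remove the now-empty old PV from the VG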

You aren't tied to an underlying software RAID and all the shackles that it brings with it.

You can even be creative if you want and have critical file systems (LVs) use RAID 1 mirroring while using RAID 5 for the bulk and then something like RAID 0 for /var/spool/news. ;-)

I used to (prior to migrating to ZFS on Linux) have a small (~1G) /boot partition that I would mirror (/dev/md0) and then add the remaining disk space of all the disks as PVs to the VG. Then carve out LVs as I wanted to.

I have also successfully moved the root LV from a disk in the notebook to an external USB (or eSATA? for speed) disk so that I could re-partition the internal disk on the fly. - Add the external disk to the VG, vacate the internal PV with pvmove, remove the internal PV with vgreduce, do whatever I want to (change LUKS, etc), and then do the process in reverse to bring the running root file system / LV back to the freshly partitioned drive.
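A rough outline of that round trip (device names are hypothetical):

    vgextend vg0 /dev/sdc1     # the external disk joins the VG
    pvmove /dev/sda2           # vacate the internal PV onto the external one
    vgreduce vg0 /dev/sda2     # drop the internal PV from the VG

    # ... repartition, change LUKS, etc on the internal disk ...

    pvcreate /dev/sda2         # re-initialize the (re-made) internal partition
    vgextend vg0 /dev/sda2
    pvmove /dev/sdc1           # move everything back to the internal disk
    vgreduce vg0 /dev/sdc1     # and detach the external disk again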

It works quite well. Surprisingly well.

Feel free to poke me, @DrScriptt, if you want to talk more details.

By Ewen McNeill at 2017-05-04 02:04:36:

And there's another, another, option which you might want to be aware of, that I traditionally used to do on larger slower disks (where the rebuild time was Distinctly Non-Trivial): divide the disk up into smaller chunks (eg, N * 512GB/1TB/2TB chunks), and RAID 1 each chunk with its match on the second disk. So you have, say, /dev/mdN for N = 1 to 8. Then use LVM to glue these all back together again, into one logical volume. And put the file systems on top of that, but perhaps don't make the file systems take up the entire volume immediately.
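A hedged sketch of this chunked layout, for two disks split into just two chunks each (all device names, sizes, and the VG name are invented):

    # each chunk on the first disk is mirrored with its match on the second
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

    # glue the mirrors back together into one volume group
    pvcreate /dev/md1 /dev/md2
    vgcreate datavg /dev/md1 /dev/md2

    # carve out a filesystem that doesn't (yet) use all of the space
    lvcreate -n data -L 400G datavg
    mkfs.ext4 /dev/datavg/data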

This complexity gives you:

  • potentially faster RAID rebuild time if only portions of the disk are dirty when the machine loses power/crashes hard/etc, since potentially only some of the /dev/mdN chunks need rebuilding

  • incremental RAID rebuild over multiple power cycles/reboots, rather than, eg, needing 6-8TB to all remirror again from scratch if it is restarted

  • the ability to trivially drop one or more of the /dev/mdN chunks off the end of the volume group (particularly if you don't add them until they're needed), to replace one disk with a smaller disk

  • more flexibility in "different size" disk RAID mirrors (since you just need to find a matching "chunk" to pair with).

If your disks are, eg, 1TB and thus so small that you'd make them one chunk anyway, there's probably no point. But if your disks are "OMG, remirroring takes forever" sized it's worth considering. This is of course another thing that "hardware RAID" controllers won't do.

Ewen

PS: Historically I still put / straight on its own /dev/mdN chunk, but then mounted a bunch of other LVs at various convenient points. These days I would probably put / on the VG as well, and usually use a single partition + /boot + swap setup.

By Miksa at 2017-05-11 10:48:01:

@Ewen

I have a similar setup on my home server; the biggest difference is that I use partitions that are only 80GB, which means a whole lot of partitions on a 4TB drive. The size is mainly because I originally built the setup with 250-300GB hard drives and 40GB partitions, and increasing the partition size is a hassle. I was forced to increase it to 80GB because old Linux kernels didn't support more than 16 partitions on a SATA drive (IDE could handle more).

It is a complicated setup, but I certainly appreciate the flexibility. It is wonderful for people who can't afford to just buy half a dozen new hard drives when they run out of space.

The most recent change was the addition of a second 4TB drive to replace the oldest 1TB drive. In the process the new drive replaced the 1TB member of a 6×1TB RAID-6 array. It also enabled me to upgrade a 4×1TB RAID-6 array made of 3×2TB and 1×4TB drives to a 5×1TB RAID-6, and since I now had two 4TB drives I also upgraded the 2TB JBOD to a 2×2TB RAID-1.

Except for plugging in the new drive, the whole process was done online. No shutting down or rebooting, no volume unmounted, all the data was available and in use for the one or two weeks it took. Just a lot of 'pvmove', 'mdadm --stop' and 'mdadm --create'.
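For a single array in that kind of reshuffle, the per-array sequence is roughly the following (array and partition names invented, and it assumes enough free space elsewhere in the VG to park the data during the move):

    pvmove /dev/md5            # push everything off the old array's PV
    vgreduce vg0 /dev/md5      # take the PV out of the volume group
    mdadm --stop /dev/md5      # tear down the old array

    # recreate the array with an extra member (here going from 4 to 5 devices)
    mdadm --create /dev/md5 --level=6 --raid-devices=5 /dev/sd[abcde]7

    pvcreate /dev/md5          # the bigger array comes back as a fresh PV
    vgextend vg0 /dev/md5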

The LVM volume group has existed for about a decade and has gone through many hard drive additions and replacements, and many conversions from RAID-1 to RAID-5 and RAID-6.

The next evolution I would want for the system is data checksumming with ZFS or Btrfs, but without sacrificing the flexibility. The straightforward method would be to format an LVM logical volume with ZFS, but I'm not sure if that would cause any problems.
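For illustration, the straightforward version of that would just be handing an LV to ZFS as its device (names here are placeholders, and whether this behaves well in practice is exactly the open question):

    lvcreate -n forzfs -L 500G vg0
    zpool create tank /dev/vg0/forzfs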
