My current dilemma: is it worth putting the root filesystem on an SSD

November 30, 2015

Due mostly to an extremely good deal at the end of last week, I'm getting a mirrored pair of 250 GB SSDs to add to my work machine. As 250 GB is not enough to hold all of the data that's currently on the machine, I'm going to have to be selective about what I put on the SSDs. Which leads to the question I'm currently considering: is it worth putting my machine's system filesystem on the SSDs, or should I use all of the space for a ZFS pool?

(As is the modern way, my system filesystem has /, /usr, and /var all in a single large filesystem.)

The largest downside of putting my system filesystem on the SSDs is that it would consume at least 60 GB of the limited SSD space for good, and probably more so that I'd have a safety margin for various sorts of space usage (my current root filesystem is about 80 GB, but I'd likely offload some of the things that normally use space there). A secondary downside is that I would have to actively partition the SSDs instead of just giving them to ZFS as whole disks (even on Linux, ZFS is a bit happier to be handed whole disks). The data of my own that I'd like to move to the SSDs already uses enough disk space that losing 60-80 GB of it hurts at least a bit.

(In theory one might worry about /var/log and /var/log/journal seeing constant write traffic. In practice write endurance doesn't seem to be a big issue with SSDs these days, and these particular ones have a good reputation.)

The largest downside of not putting my system filesystem on the SSDs is of course that I don't get SSD performance for it. The big question to me is how much this matters on my system. On the one hand, certainly one of the things I do is to compile code, which requires a bunch of programs and header files from /usr and so on, and I would like this to be fast. On the other hand, my office machine already has 32 GB of RAM so I would hope that the compiler, the headers, and so on are all in the kernel's buffer cache before too long, at which point the SSD speed shouldn't matter. On the third hand, I don't have any actual numbers for how often I'm currently reading things off disk for / as opposed to already having them in memory. I can certainly believe that a modern system loads random scattershot programs and shared libraries and so on from / and /usr on a routine basis, all with relatively random IO, and that this would be accelerated by an SSD.

If I were really determined, I suppose I would first try out an SSD root filesystem to see how much activity it saw while I did various things on the system. If it was active, I'd keep it; if it wasn't, I'd move the root filesystem back to HDs and give the whole SSDs to ZFS. The problem with this approach is that it involves several shifts of the root filesystem, each of which is a disruptive pain in the rear (I probably want to boot off the SSDs if the root filesystem is there, for example). I'm not sure I'm that enthused.

(I'm not interested in trying to make root-on-ZFS work for me, and thus have the root filesystem sharing space flexibly with the rest of the ZFS pool.)

(What I should really do here is watch IO stats on my current software RAID mirror of the root filesystem to see how active it is at various times. If it's basically flatlined, well, I've got an answer. But it sure would be handy if someone had already investigated this (yes, sometimes I'm lazy).)
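
(A minimal sketch of what that watching could look like is below; 'md0' is just a guess at what the root filesystem's software RAID mirror is called, and the sampling interval is arbitrary.)

    # Rough sketch: sample /proc/diskstats for one device and report how
    # much read and write activity it saw in each interval. 'md0' is an
    # assumption about where the root filesystem's mirror lives.
    import time

    DEVICE = "md0"
    INTERVAL = 10          # seconds between samples

    def read_counters(dev):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    # reads completed, sectors read, writes completed,
                    # sectors written
                    return (int(fields[3]), int(fields[5]),
                            int(fields[7]), int(fields[9]))
        raise SystemExit("no %s in /proc/diskstats" % dev)

    prev = read_counters(DEVICE)
    while True:
        time.sleep(INTERVAL)
        cur = read_counters(DEVICE)
        reads, rsect, writes, wsect = [c - p for c, p in zip(cur, prev)]
        # sectors are 512 bytes, so sectors // 2 is KB
        print("%s: %d reads (%d KB), %d writes (%d KB) in the last %ds" %
              (DEVICE, reads, rsect // 2, writes, wsect // 2, INTERVAL))
        prev = cur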

Sidebar: the caching alternative

In theory Linux now has several ways to add SSDs as caches of HD based filesystems; there's at least bcache and dm-cache. I've looked at these before and my reaction from then still mostly stands; each of them would require me to migrate my root filesystem in order to set them up. They both do have the advantage that I could use only a small amount of disk space on the SSDs and get a relatively significant benefit.

They also both have the disadvantage that the thought of trying to use either of them for my root filesystem and getting the result to be reliable in the face of problems gives me heartburn (for example, I don't know if Fedora's normal boot stuff supports either of them). I'm pretty sure that I'd be a pioneer in the effort, and I'd rather not go there. Running ZFS on Linux is daring enough for me.


Comments on this page:

By Ewen McNeill at 2015-12-01 00:36:20:

I'm struggling to imagine what you have in your ZFS filesystem(s), outside your root disk (including /usr and /var), which is more performance sensitive/intensive than anything on your root disk and would not also benefit from the "well, I have 32GB of RAM" rationale that you are considering as a reason not to put your root on SSD.

SSDs basically offer two performance boosts:

1. random reads are significantly faster (no waiting on mechanics)

2. writes flush to disk significantly faster

The first (random reads) is basically relevant at boot time, and any time you access something that is not already in your cache. Given a large enough cache and infrequent enough reboots, it should tend towards "relevant at boot time" after a while. So if your RAM is larger than the union of all your working sets, after N days this may be irrelevant. But getting there might take a while (at least "run all the usual applications in all the usual ways with all the usual data").

The second (writes commit faster) is relevant any time anything is waiting on fsync before proceeding. Which is surprisingly often for a wide range of tasks (database-like tasks being an obvious one, but by no means the only one -- eg, even writing from an editor will typically fsync() the file for safety).
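
(As a rough sketch of that pattern, with made-up names -- not any particular editor's code -- the usual "safe write" looks something like this:)

    # Sketch of the editor-style safe write: nothing proceeds until
    # fsync() reports the data is on stable storage, so fsync latency
    # is exactly what an SSD speeds up. Names here are illustrative.
    import os

    def safe_write(path, data):
        tmp = path + ".tmp"            # hypothetical temporary file name
        with open(tmp, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # wait for the drive to confirm the write
        os.rename(tmp, path)           # then atomically replace the old file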

If it were me, my null hypothesis would be "mirror SSDs with MD to provide 80GB root disk, and put root on there", and then give the remainder of each disk to ZFS. AFAICT (eg, this ZFS performance FAQ) the main considerations for ZFS on partitions are making sure they're erase-block-aligned (which you also want to do with MD RAID and extN/xfs filesystems), and that ZFS won't turn on write caching within the device (which should matter much less with a SSD).
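
(The alignment part is just arithmetic: round the partition start up to a generous power-of-two boundary. A tiny sketch, assuming 512-byte sectors and a 1 MiB boundary:)

    # Round a partition's start sector up to a 1 MiB boundary, which is
    # a safe multiple of common erase block sizes. 512-byte sectors assumed.
    def aligned_start(sector, align_bytes=1024 * 1024, sector_bytes=512):
        align = align_bytes // sector_bytes
        return ((sector + align - 1) // align) * align

    print(aligned_start(63))    # the old DOS-style start sector becomes 2048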

Possibly you'd just want to give ZFS the remainder of the SSD for the ZIL and/or L2ARC, and leave the (presumably larger) underlying data on the existing disks. In which case the capacity loss should be less of a concern. (IIRC ZFS will acknowledge writes as "on stable storage" once they hit the ZIL and/or L2ARC so both of those on SSD should give you the write boost of the SSD; and the L2ARC will cache more data for reading, and for longer, than RAM -- including over reboots.)

Ewen

PS: About the only thing I can think of is if you have a large database on your ZFS now which is sufficiently large and performance critical to justify a dedicated SSD set. But you seem to be talking about a desktop rather than a transaction server...

PPS: It seems to be generally accepted now that if you purchase a reasonable quality SSD and don't constantly thrash it with writes (eg, busy database/logging) you can expect it to last at least as long as the warranty -- probably much longer if it's lightly written to (eg, desktop rather than server). For the write-intensive case the rationale seems to be to treat the SSD like "racing tires" which are known to wear out quicker, but worth it for the performance; most modern SSDs can give some estimate of write lifetime left -- which is basically a function of reserve capacity left for when blocks reach their write limits. (And larger SSDs benefit from having proportionally fewer writes for the same write traffic -- eg, double the disk size with the same writes, and you're writing half as many "full disk writes" per period.)
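
(To put made-up but plausible numbers on that last point -- these are illustrative figures, not any specific drive's rating:)

    # Illustrative only: the same daily write traffic on a bigger drive
    # means fewer "full disk writes" per day, and rated endurance (TBW)
    # typically scales with capacity. Both input numbers are assumptions.
    writes_per_day_gb = 20.0            # assumed host writes per day
    tbw_per_gb = 0.3                    # assumed rated TB-written per GB

    for size_gb in (125, 250, 500):
        dwpd = writes_per_day_gb / size_gb      # "full disk writes" per day
        rated_tbw = size_gb * tbw_per_gb
        years = rated_tbw * 1000.0 / writes_per_day_gb / 365.0
        print("%d GB: %.2f disk writes/day, ~%.1f years to %.0f TBW" %
              (size_gb, dwpd, years, rated_tbw))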

By Grant:

Have you considered using the SSDs as ZIL / SLOG (?) for ZFS? I think you would get a LOT of benefit from the SSDs for all of your ZFS pool if you did that.

(I'm relatively new to ZFS on Linux, so I may have the incorrect terms, as I've not done this yet myself.)

I did not realize that you were running ZFS on Linux until the end of the article. What file system(s) are you using for your root? Do they have any FS specific methods to benefit from the SSD?

By The Col at 2015-12-01 01:14:34:

I think that this is an absolute no-brainer. Given the amount of logs that Linux writes to disk and the amount of IO generated, why wouldn't you want this running as fast as possible? I have not seen a HDD outperform an SSD in ages. The only possible exception would be for sensor monitoring, where you get a lot of little writes sequentially.

The only caveat would be if you are using HW RAID: how much overhead does that RAID controller place on your mirrored pair (assuming RAID 1)?

But if you are in any doubt, do an iostat on your OS and see how much disk activity there is with no workload and how much disk activity there is under load.

By Alan at 2015-12-01 08:53:20:

You've mentioned bloating / (or /var) by caching RPMs. That's a good example of something that could live on a hard drive. In theory one can use bind mounts to play more fine-grained games here; personally I find them quite annoying, e.g. they pollute the output of df. dpkg is explicitly happy for symlinks to be used in this case instead. Unfortunately I don't know if that's supported in rpm / Fedora; I just know some people are doing it anyway.

---

Systemd does now support separate /usr, and mounts it in the initramfs. Looks like it was a nasty transition :( finally fixed in F18. I'm sure they don't consider it deprecated, because it's useful for the work on "stateless systems" etc. (Lennart blog post). Given how powerful that solution is, I don't think systemd has any problem with separate /var.

The /var issue you mentioned on Fedora seems unrelated to systemd. It's an issue with yum upgrade from a live cd style upgrader (and neglecting to mount that fs). It was considered a bug and (eventually?) a workaround provided. Fedora upgrades now run on the host (as a special boot target), starting with fedup and now dnf-plugin-system-upgrade. So all OS filesystems would be mounted, including /var.

The /var/run insanity you debugged on 2006 Ubuntu is clearly pre-systemd. Hopefully the new /run has sorted all that out. (It solved a real problem, as everyone was using random tmpfs's like /dev/.mystuff for things needed before / becomes writable.)

I understand this is not necessarily reassuring enough to try it again :). Personally I like the initramfs mounting, merged /usr, new /run, and the organisation for "stateless systems". But I've never had to deal with separate /usr or /var. I agree separate /usr partitions are currently unloved (I certainly don't have any use for one). I would really have expected separate /var to work properly, but it sounds like distros aren't always getting it right.

The bright side is that where/when systemd gets it right, that applies to all distros :).

By cks at 2015-12-01 09:56:04:

What I have in ZFS filesystems is things like my home directory and all of the source code that I build things from (and the build areas where the object files go and so on). I may also put some virtual machine images on the SSD, depending on how much disk space I wind up with. I believe that all of these are significantly 'hotter' than the root filesystem and the collective working set exceeds my RAM, although I may be wrong.

(Building software also involves writing things to disk, which SSDs help with.)

I already have an L2ARC attached to the pool, but it's only a 60 GB SSD. I suppose a simple first step would actually be to swap that out for one of the 250 GB SSDs and see what happens. If I'm daring I could partition the disk as a split L2ARC/ZIL and experiment with how well it performs. This is of course less exciting than a ZFS pool on the SSDs, but it's a lot easier to set up.
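
(For the record, the ZFS side of that split experiment is only a few zpool operations. The sketch below is roughly the shape of it; the pool and device names are invented and I haven't actually run this.)

    # Sketch only: swap the existing 60 GB L2ARC out for one 250 GB SSD
    # that has been partitioned into a small ZIL slice and a big L2ARC
    # slice. Pool and partition names are invented.
    import subprocess

    POOL = "maindata"                    # hypothetical pool name
    OLD_CACHE = "oldssd-part1"           # the current 60 GB L2ARC device
    NEW_LOG = "newssd-part1"             # small partition for the ZIL
    NEW_CACHE = "newssd-part2"           # the rest of the SSD as L2ARC

    def zpool(*args):
        subprocess.check_call(["zpool"] + list(args))

    zpool("remove", POOL, OLD_CACHE)        # cache devices can be removed live
    zpool("add", POOL, "log", NEW_LOG)      # single, unmirrored ZIL (it's an experiment)
    zpool("add", POOL, "cache", NEW_CACHE)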

(The one annoying thing about using a L2ARC instead of an actual pool is that L2ARC is not persistent over reboots, which means a long process of reloading the L2ARC every time you reboot. Persistent L2ARC is a feature that's coming, well, sometime.)

By cks at 2015-12-01 10:57:04:

The other issue with an L2ARC is that it eats ZFS ARC memory in order to store the L2ARC metadata, in a way that a native SSD pool does not. See eg this message from Richard Elling. A back of the envelope calculation suggests that a 250 GB L2ARC could easily take a GB or more of RAM for metadata.

(On the other hand, I currently have 49.6 GB in my L2ARC and I'm using ~76 MB of RAM for L2ARC headers. This is the l2_hdr_size stat in /proc/spl/kstat/zfs/arcstats for ZFS on Linux.)
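
(That arithmetic is easy to redo on any ZFS on Linux system; here's a little sketch that reads those two stats and scales the measured ratio up to a hypothetical 250 GB L2ARC.)

    # Sketch: pull l2_size and l2_hdr_size out of the ZFS on Linux kstats
    # and estimate the ARC header memory a 250 GB L2ARC would need at the
    # same header-to-data ratio as the current one.
    def arcstats():
        stats = {}
        with open("/proc/spl/kstat/zfs/arcstats") as f:
            for line in f:
                fields = line.split()
                if len(fields) == 3 and fields[1].isdigit():
                    stats[fields[0]] = int(fields[2])
        return stats

    st = arcstats()
    l2_size, l2_hdr = st["l2_size"], st["l2_hdr_size"]
    ratio = float(l2_hdr) / l2_size
    print("L2ARC: %.1f GB of data, %.1f MB of headers (%.3f%% overhead)" %
          (l2_size / 1e9, l2_hdr / 1e6, ratio * 100))
    print("at that ratio, 250 GB of L2ARC needs ~%.0f MB of headers" %
          (250e9 * ratio / 1e6))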

Grant: My current root filesystem is ext3 (on a software RAID mirror) with no special tuning. I don't know if ext3 does anything particularly special on SSDs.

(I'm deliberately conservative with my root filesystem for obvious reasons, and this is an old root filesystem anyways. I'd probably make a new one a native ext4 filesystem but not change anything else.)

By Anon at 2015-12-01 15:11:46:

If you can't stay cutting edge, I'd give bcache a miss for now... - http://news.gmane.org/gmane.linux.kernel.bcache.devel

By Anon at 2015-12-01 15:12:28:

If you can't stay cutting edge, I'd give bcache a miss for now... - http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3097/focus=3098

By Anon at 2015-12-02 01:53:34:

(Apologies for the previous double comment)

If you don't mind getting technical and your system is new enough, you can use the perf-based technique described at http://www.brendangregg.com/blog/2014-12-31/linux-page-cache-hit-ratio.html to work out page cache hit rates. You might also be able to work out which parts of files are currently cached using pcstat (https://github.com/tobert/pcstat).
