Making my machine stay responsive when writing to USB drives

January 17, 2017

Yesterday I talked about how writing things to USB drives made my machine not very responsive, and in a comment Nolan pointed me to LWN's The pernicious USB-stick stall problem. According to LWN's article, the core problem is an excess accumulation of dirty write buffers, and they give some VM system sysctls that you can use to control this.

I was dubious that this was my problem, for two reasons. First, I have a 16 GB machine and I rarely use all of that memory, so I thought that allowing a process to grab a bit over 3 GB of it for dirty buffers wouldn't make much of a difference. Second, I had actually been running sync frequently (in a shell loop) during the entire process, because I have sometimes had it make a difference in these situations; I figured that frequent syncs should limit the amount of dirty buffers accumulating in general. But I figured it couldn't hurt to try, so I used the dirty_background_bytes and dirty_bytes settings to limit this to 256 MB and 512 MB respectively and tested things again.
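
For concreteness, setting these by hand looks roughly like this (the byte values are simply 256 MB and 512 MB written out in bytes; note that writing the _bytes sysctls zeroes the corresponding _ratio ones):

    # lower the writeback thresholds; needs root
    sysctl -w vm.dirty_background_bytes=268435456   # 256 MB: background flushing starts here
    sysctl -w vm.dirty_bytes=536870912              # 512 MB: hard limit where writers block

To make the settings persistent you can put the same values in a file under /etc/sysctl.d; the file name below is arbitrary:

    # /etc/sysctl.d/99-dirty-bytes.conf
    vm.dirty_background_bytes = 268435456
    vm.dirty_bytes = 536870912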

It turns out that I was wrong. With these sysctls turned down, my machine stayed quite responsive for once, despite me doing various things to the USB flash drive (including things that had had a terrible effect just yesterday). I don't entirely understand why, though, which makes me feel as if I'm doing fragile magic instead of system tuning. I also don't know if setting these down is going to have a performance impact on other things that I do with my machine; intuitively I'd generally expect not, but clearly my intuition is suspect here.

(Per this Bob Plankers article, you can monitor the live state of your system with egrep 'dirty|writeback' /proc/vmstat. This will tell you the number of currently dirty pages and the thresholds (in pages, not bytes). I believe that nr_writeback is the number of pages actively being flushed out at the moment, so you can also monitor that.)
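
If you want to watch the counters move while doing IO to a slow device, something like the following works (the one-second interval is arbitrary; the interesting fields are nr_dirty, nr_writeback, and the two *_threshold values, all counted in pages):

    watch -n 1 "egrep 'dirty|writeback' /proc/vmstat"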

PS: In a system with drives (and filesystems) of vastly different speeds, a global dirty limit or ratio is a crude tool. But it's the best we seem to have on Linux today, as far as I know.

(In theory, modern cgroups support the ability to have per-cgroup dirty_bytes settings, which would let you add extra limits to processes that you knew were going to do IO to slow devices. In practice this is only supported on a few filesystems and isn't exposed (as far as I know) through systemd's cgroups mechanisms.)


Comments on this page:

By Anon at 2017-01-17 01:53:35:

It's something of a bufferbloat situation where reads get trapped behind slow writes.

You start doing a large amount of writes to a USB device. The USB device can't keep up with the writes being done to it. With normal buffered I/O you do your write and, if there's buffer space, the kernel tells the program "got your writes" immediately (hence the need to fsync if you care about whether the data is really on disk), which just makes the application send more quickly. The Linux kernel continues to buffer the writes up to some limit, at which point the write call will block in the original application. Then a sync comes along that says "you must flush all writes everywhere now". At that point new reads (which can't be deferred and will make a program waiting on them hang) aren't allowed to be reordered past any writes, and anything doing those reads is made to wait while the writeback is drained (and draining happens slowly because the USB device is slow). Once the writeback is empty those new reads are allowed through again, but the system quickly accumulates a large amount of writeback/dirty data again because the USB device is still slow, and the next time a sync comes along...

Your dd oflag=direct works because you are saying you want the I/O to bypass Linux's caches, forcing only the application that issued the writes to wait on them directly rather than filling a multi-megabyte buffer in the kernel to fullness and then waiting. Setting the maximum accumulated writeback to be smaller works because when the sync comes along you have fewer writes to flush before those new reads can be serviced.
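
For illustration, the sort of dd invocations being contrasted here might look like the following (the file and device names are placeholders, and writing directly to a device overwrites whatever is on it):

    # buffered: returns quickly, fills the page cache, and the stall hits everyone later
    dd if=some-image of=/dev/sdX bs=1M
    # direct: bypasses the page cache, so only this dd waits on the slow device
    dd if=some-image of=/dev/sdX bs=1M oflag=direct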

By cks at 2017-01-17 12:03:32:

What puzzles me about my situation is that there are completely different devices involved. The only thing that was doing IO to the USB flash drive was an rsync, yet everything lurched to a pause, even though at most they wanted to do IO to drives that were only being read from by rsync. The only things I can think of are that either there was way more memory eviction happening than I expected, or all of the pending writes got merged into one big pool across all of the devices and the writes to the USB flash drive caused other writes to stall or be (significantly) delayed.

Descriptions of dirty write buffering and ways to deal with it that I've read (eg) generally seem to talk about it in per-device terms. Per-device limits and operation are clearly what you want in general; if I'm doing IO to device A and you flood completely unrelated device B, I shouldn't be affected by your activity. But I'm now unclear if the Linux kernel actually operates this way.

(If writeback is global, I'm not sure I understand why a smaller dirty pool helps. My USB flash drive writes at roughly 10 MB/sec, so even a 512 MB pool is going to rapidly fill up and force everyone to throttle on writebacks, even with background writes starting at 256 MB of dirty data.)
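
(To put rough numbers on that: draining a full 512 MB pool at about 10 MB/sec takes on the order of 50 seconds, and even the 256 MB background threshold represents roughly 25 seconds of writeback to the flash drive.)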

By Guus Snijders at 2017-01-17 15:35:33:

What puzzles me about my situation is that there are completely different devices involved. The only thing that was doing IO to the USB flash drive was an rsync, yet everything lurched to a pause, even though at most they wanted to do IO to drives that were only being read from by rsync.

After reading your entry (and the comments), I got curious and did some reading (luckily, I had a slow day at $work ;)).

As I understand it, there is a global limit on dirty memory; once that limit is reached, the kernel has to drain some of it. The really nasty bit comes from sync()s: the kernel /has/ to write out all of its dirty memory, including what is destined for the slow USB stick.

From that point on, it's a simple calculation: the bigger the buffers, the longer it takes to write them out. Not much of a problem for fast HDD/SSD storage, but when you include slow storage in the mix...

If writeback is global, I'm not sure I understand why a smaller dirty pool helps

AFAIK: a smaller dirty pool means less data to write, meaning less time waiting on the device to complete. It looks like the (optional) per-device limits are a fairly recent addition. Looks like a nice target for a udev rule.
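
For illustration only, a sketch of what such a rule could look like, assuming the kernel's per-device writeback knobs under /sys/class/bdi/; the rule file name and the 1% cap are made up:

    # /etc/udev/rules.d/90-usb-writeback.rules (hypothetical)
    # cap a USB disk's share of the global dirty limit at 1%
    ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_BUS}=="usb", RUN+="/bin/sh -c 'echo 1 > /sys/class/bdi/%M:%m/max_ratio'"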

As for why the whole system slows down: well, you're much better qualified to explain that than me ;).

By Anon at 2017-01-17 15:38:14:

While there are per-device queues for IOs below the block layer, each of these individual queues does not extend all the way up to the program doing buffered writes.

The buffered writes queue up in memory and only through writeback do they make their way to disk. In this case there's a shared initial queue (dirty pages) that later feeds other queues, so if you overwhelm any intermediary queue, everyone who has to pass I/O through it is punished too. If one of these queues can't service any more writes, then writes for any device can't be serviced until it is less full. What's worse, reads can become trapped behind the writes too if a sync takes place, because a flush has to follow all those writes before the new read is allowed through. Another problem is that the program writing to the USB disk can probably dirty the page cache faster than anything else (it's not reading anything, so why does it have to slow down?), so soon many queues will be clogged with its data the moment there's space that no one else needs... yet. Once data destined for the USB disk gets into one of these queues, that queue will drain slowly because the underlying device is slow, applying back pressure. You need a way to quickly apply the pressure all the way back to the source so as to throttle the speed at which the initial program dirties pages and allow everything else a look in...

A smaller writeback limit helps because it means that no program can dirty many pages before being throttled, which in turn means that it can't fill the later queues so deeply without others getting a look in, and those queues will drain faster (at a cost of throughput) as there is less in flight at any given time. Ultimately, if the queues are shallow and a program is writing to a slow device, then that program will be throttled sooner thanks to back pressure than if the queues are deep.

Any more convincing?

By Anon at 2019-06-25 08:45:38:

Chris: Did this issue ever get any better "out of the box" (i.e. without hand tuning) with more recent kernels (e.g. see https://unix.stackexchange.com/questions/526124/what-are-the-outstanding-problems-stalls-which-might-be-mitigated-by-limiti)?

By cks at 2019-06-26 17:35:54:

I'm afraid that I don't know. I stuck sysctl settings for this into /etc/sysctl.d back in 2017 and on top of that I don't write to USB stuff very much any more, so I don't have any recent experience with how this goes.

I actually found your blog back in 2016 when we were observing something similar in a scenario where the main job of our application was copying large amounts of data between devices. (Not sure which article I found back then, but I think you had some older ones about dirty_ratio and similar from before 2016.)

On some servers, we had some fast NFS-mounted storage and some slower USB-attached drives to copy data from. In that scenario, we also saw systems almost grinding to a halt. When it happened we generated system-wide kernel stack traces (maybe using `echo l > /proc/sysrq-trigger`?). What we saw was that once you got to the point where there was an excess of dirty pages, waiting for those to be flushed could get into basically any call path that dealt with allocating pages. Unfortunately, I don't have the stack traces around any more, but iirc it was basically "you want to allocate a page? Too many pages are dirty, let's clean them up before handing out a new one. Ah, someone is already flushing, let's wait for that to finish". That would explain why almost everything can be affected even if a process is not really accessing the slow device or writing to disk at all.
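
For reference, that kind of capture can be done roughly like this, assuming root and sysrq support (the exact commands used back then may have differed):

    # make sure sysrq functions are enabled for this boot
    echo 1 > /proc/sys/kernel/sysrq
    # dump backtraces for all active CPUs into the kernel log
    echo l > /proc/sysrq-trigger
    # 'w' lists blocked (uninterruptible) tasks, often more telling for stalls
    echo w > /proc/sysrq-trigger
    # then read the results
    dmesg | tail -n 200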

Written on 17 January 2017.