Wandering Thoughts archives

2014-09-05

A DTrace script to help figure out what process IO is slow

I recently made public a DTrace script I wrote, which gives you per-file-descriptor IO breakdowns for a particular process. I think it's both an interesting, useful tool and probably not quite the right approach for diagnosing this sort of problem, so I want to talk about both the problem and what the script tells you. To start with, the problem.

Suppose, not entirely hypothetically, that you have a relatively complex multi-process setup with data flowing between the various processes and the whole thing is (too) slow. Somewhere in the whole assemblage is a bottleneck. Basic monitoring tools for things like disk IO and network bandwidth will give you aggregate status over the entire assemblage, but they can only point out the obvious bottlenecks (total disk IO, total network bandwidth, etc). What we'd like to do here is peer inside the multi-process assemblage to see which data flows are fast and which are slow. This per-data-flow breakdown is why the script shows IO on a per file descriptor basis.

What the DTrace script's output looks like is this:

s fd   7w:   10 MB/s  waiting ms: 241 / 1000   ( 10 KB avg *   955)
p fd   8r:   10 MB/s  waiting ms:  39 / 1000   ( 10 KB avg *   955)
s fd  11w:    0 MB/s  waiting ms:   0 / 1000   (  5 KB avg *     2)
p fd  17r:    0 MB/s  waiting ms:   0 / 1000   (  5 KB avg *     2)
s fd  19w:   12 MB/s  waiting ms: 354 / 1000   ( 10 KB avg *  1206)
p fd  21r:   12 MB/s  waiting ms:  43 / 1000   ( 10 KB avg *  1206)
  fd 999r:   22 MB/s  waiting ms:  83 / 1000   ( 10 KB avg *  2164)
  fd 999w:   22 MB/s  waiting ms: 595 / 1000   ( 10 KB avg *  2164)
IO waits:  read:  83 ms  write: 595 ms  total: 679 ms 

(These are per-second figures averaged over ten seconds, and file descriptor 999 is the total read and write activity. pfiles can be used to tell you what each file descriptor is connected to if you don't already know.)
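This isn't the actual fdrwmon.d, but as a rough sketch of the approach, a cut-down D script that tracks per-file-descriptor bytes and wait times for plain read() and write() might look something like this (it assumes you attach to the process with 'dtrace -s sketch.d -p PID' and it ignores all the other IO syscalls):

#pragma D option quiet

syscall::read:entry, syscall::write:entry
/pid == $target/
{
    /* remember which fd we're doing IO on and when we started waiting */
    self->fd = arg0;
    self->ts = timestamp;
}

syscall::read:return, syscall::write:return
/self->ts/
{
    /* count bytes moved (errors return -1 and count as 0) and wait time */
    @bytes[probefunc, self->fd] = sum((int)arg0 > 0 ? arg0 : 0);
    @waitms[probefunc, self->fd] = sum((timestamp - self->ts) / 1000000);
    self->fd = 0;
    self->ts = 0;
}

tick-10s
{
    /* show per-second averages over the ten second window */
    normalize(@bytes, 10);
    normalize(@waitms, 10);
    printa("%s fd %d: %@d bytes/s, %@d ms/s waiting\n", @bytes, @waitms);
    trunc(@bytes);
    trunc(@waitms);
}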

Right away we can tell a fair amount about what this process is doing; it's clearly copying two streams of data from inputs to outputs (with a third stream not doing much). It's also spending much more of its IO wait time writing the data than waiting for more input to arrive, although the picture here is somewhat misleading because the process is also making pollsys() calls and I wasn't tracking the time spent waiting in those (or the time spent in other syscalls).

(The limited measurement is partly an artifact of what I needed to diagnose our problem.)
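If you wanted to cover that gap, the same timing approach extends to other syscalls; as a rough sketch (not part of the actual script), clauses like these would accumulate the time the process spends sitting in pollsys(), given a matching printa() in the tick-10s clause:

syscall::pollsys:entry
/pid == $target/
{
    self->pollts = timestamp;
}

syscall::pollsys:return
/self->pollts/
{
    /* total milliseconds spent waiting in pollsys() */
    @pollms = sum((timestamp - self->pollts) / 1000000);
    self->pollts = 0;
}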

What I'm not sure about is whether this DTrace script is the most useful and informative way to peer into this problem. Its output points straight at network writes being the bottleneck (for reasons that I don't know), but that discovery seems indirect and kind of happenstance, visible only because I decided to track how long IO on each file descriptor took. In particular it feels like there are things I ought to be measuring here that would give me more useful and pointed information, but I can't think of what else to measure. It's as if I'm not asking quite the right questions.

(I've looked at Brendan Gregg's Off-CPU Analysis; an off-cpu flamegraph analysis actually kind of pointed in the direction of network writes too, but it was hard to interpret and get too much from. Wanting some degree of confirmation and visibility into this led me to write fdrwmon.d.)

solaris/DTraceFDIOVolScript written at 23:34:57

Some uses for SIGSTOP and some cautions

If you ask, many people will tell you that Unix doesn't have a general mechanism for suspending processes and later resuming them. These people are correct in general, but sometimes you can cheat and get away with a good enough substitute. That substitute is SIGSTOP, which is at the core of job control. Although processes can catch and react to other job control signals, SIGSTOP cannot be caught or blocked, just like SIGKILL (aka 'kill -9'). When a process is sent SIGSTOP, the kernel stops it on the spot and suspends it until it gets a SIGCONT (more or less). You can thus pause processes and continue them by manually sending SIGSTOP and SIGCONT as appropriate and desired.

(Since it's a regular signal, you can use a number of standard mechanisms to send SIGSTOP to an entire process group or all of a user's processes at once.)
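For concreteness (the PID and user name here are made up), the shell side of this is just:

kill -STOP 12345           # pause the process
kill -CONT 12345           # let it run again
pkill -STOP -u someuser    # pause all of someuser's processes
pkill -CONT -u someuser    # resume them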

There are any number of uses for this. Do you have too many processes banging away on the disk (or just think you might)? You can stop some of them for a while. Is a process saturating your limited network bandwidth? Pause it while you get a word in edgewise. And so on. Basically this is job control for more or less arbitrary user processes, as you might expect.

Unfortunately there are some cautions and limitations attached to using SIGSTOP on arbitrary processes. The first one is straightforward: if you SIGSTOP something that is talking to the network or to other processes, its connections may break if you leave it stopped too long. The other processes don't magically know that the first process has been suspended and that they should therefore leave it alone, and many of them will have limits on how much data they'll queue up or how long they'll wait for responses and the like. Hit those limits and they'll assume something has gone wrong and cut your suspended process off.

(The good news is that it will be application processes that do this, and only if they go out of their way to have timeouts and other limits. The kernel is perfectly happy to leave things be for however long you want to wait before a SIGCONT.)

The other issue is that some processes will detect and react to one of their children being hit with a SIGSTOP. They may SIGCONT the child or they may kill it outright; in either case it's probably not what you wanted to happen. Generally you're safest when the parent of the process you want to pause is something simple, like a shell script. In particular, init (PID 1) is historically somewhat touchy about SIGSTOP'd processes and may often either SIGCONT them or kill them rather than leave them be. This is especially likely if init inherits a SIGSTOP'd process because its original parent process died.

(This is actually relatively sensible behavior to avoid init having a slowly growing flock of orphaned SIGSTOP'd processes hanging around.)

These issues, especially the second, are why I say that SIGSTOP is not a general mechanism for suspending processes. It's a mechanism, and on one level it always works, but the problem is the potential side effects and aftereffects. You can't just SIGSTOP an arbitrary process and be confident that it will still be there to be continued ten minutes later (much less over longer time intervals). Sometimes or often you'll get away with it, but every so often you won't.

unix/SIGSTOPUsesAndCautions written at 01:01:50

