Your system's performance is generally built up in layers

January 13, 2016

There are many facets of and many approaches to troubleshooting performance issues, but there are also some basic principles that can really help to guide your efforts. One of them, one so fundamental that it often doesn't get mentioned, is that your system and its performance are built up in layers. To troubleshoot system performance, you therefore want to test and measure each layer, working upwards from the base layers (whatever they are).

(A similar 'working upwards' process can be used to estimate the best performance possible in any particular environment. This too can be useful, for example to assess how close you are to that best case, or whether even the best case can meet your needs.)
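As a minimal sketch of that sort of estimate, assume the layers sit in series, so that end-to-end throughput can't exceed whatever the slowest layer can do. The layer names and MB/s figures here are made-up placeholders rather than measurements, and `best_possible` and `meets_need` are hypothetical helper names.

```python
# Illustrative per-layer ceilings in MB/s. These are placeholder numbers,
# NOT measurements; substitute what you actually measure at each layer.
layer_ceilings_mb_s = [
    ("raw disks, all at once", 600.0),
    ("target RAID", 450.0),
    ("network link", 117.0),
    ("iSCSI stack", 110.0),
]


def best_possible(layers):
    """Return the (name, rate) pair of the slowest layer.

    Assumes the layers are strictly in series, so the slowest one
    bounds the end-to-end throughput.
    """
    return min(layers, key=lambda layer: layer[1])


def meets_need(layers, required_mb_s):
    """True if even the slowest layer can deliver required_mb_s."""
    _, rate = best_possible(layers)
    return rate >= required_mb_s


if __name__ == "__main__":
    name, rate = best_possible(layer_ceilings_mb_s)
    print(f"ceiling is set by {name}: {rate} MB/s")
    print("can meet a 200 MB/s requirement:", meets_need(layer_ceilings_mb_s, 200.0))
```

If the ceiling is already below what you need, no amount of tuning the upper layers is going to help.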

To make this more concrete, suppose that you have an iSCSI-based fileserver environment and the filesystems on your fileservers are performing badly. There are a lot of moving parts here; you have the physical disks on the iSCSI targets, the network link(s) between the fileservers and the iSCSI targets, the iSCSI software stack on both sides, and then the filesystem that's using the disks on the fileserver (and perhaps a RAID implementation on the iSCSI targets). Each of these layers in the stack is a chance for a performance problem to sneak in, so you want to test them systematically:

  • how fast is a single raw disk on the iSCSI targets, measured locally on a target?
  • how fast are several raw disks on the iSCSI targets when they're all operating at once?
  • if the iSCSI targets are doing their own RAID, how fast can that go compared to the raw disk performance?

  • how fast is the network between the fileserver and the iSCSI targets?

  • how fast is the iSCSI stack on the initiator and targets? Some iSCSI target software supports 'dummy' targets that don't do any actual IO, so you can test raw iSCSI speed. Otherwise, perhaps you can swap in a very fast SSD or the like for testing purposes.

  • how fast can the fileserver talk to a single raw disk over iSCSI? To several of them at once? To an iSCSI target's RAID array, if you're using that?
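To illustrate the kind of measurement involved in the disk questions above, here is a rough sketch of a sequential read test. It is not a substitute for a real benchmarking tool; in particular it makes no attempt to bypass the operating system's cache, so it assumes you read more data than fits in memory or test a device that hasn't been read recently. `measure_read` is a hypothetical name, and the block and read sizes are arbitrary defaults.

```python
import os
import time


def measure_read(path, block_size=1024 * 1024, max_bytes=256 * 1024 * 1024):
    """Sequentially read up to max_bytes from path.

    Returns (bytes_read, seconds, mb_per_second), with MB meaning 10**6 bytes.
    This reads through the OS cache, so repeated runs over the same data
    may report cache speed instead of disk speed.
    """
    fd = os.open(path, os.O_RDONLY)
    total = 0
    start = time.perf_counter()
    try:
        while total < max_bytes:
            chunk = os.read(fd, min(block_size, max_bytes - total))
            if not chunk:
                break  # hit end of file or device
            total += len(chunk)
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    rate = total / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, elapsed, rate
```

The idea would be to run the same function against the same disk at different layers: first locally on an iSCSI target against the raw disk device, then on the fileserver against that disk as seen over iSCSI. The difference between the two numbers is roughly what the network and the iSCSI stack are costing you.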

By working through the layers of the stack like this, you have a much better chance of identifying where your performance is leaking out. Not all performance problems are neatly isolated to a single layer of the stack (there can be all sorts of perverse interactions across multiple layers), but many are, and that case is definitely worth checking for first. If nothing else you'll rule out obvious and easily identified problems, like 'our network is only running at a third of the speed we really ought to be getting'.
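Once you have a measurement and an expectation for each layer, spotting this sort of problem can be mechanical. A small sketch, where the numbers are invented for illustration, the 80% threshold is an arbitrary assumption, and `find_suspect_layers` is a hypothetical name:

```python
def find_suspect_layers(results, threshold=0.8):
    """Return (name, ratio) for each layer measured below threshold * expected.

    results is a list of (name, expected_mb_s, measured_mb_s) tuples,
    ordered from the bottom of the stack upwards.
    """
    suspects = []
    for name, expected, measured in results:
        ratio = measured / expected
        if ratio < threshold:
            suspects.append((name, ratio))
    return suspects


if __name__ == "__main__":
    # Placeholder figures, not real measurements.
    results = [
        ("single raw disk", 150.0, 145.0),
        ("network", 117.0, 39.0),  # about a third of what it should be
        ("iSCSI to one disk", 110.0, 37.0),
    ]
    for name, ratio in find_suspect_layers(results):
        print(f"{name}: only {ratio:.0%} of expected")
```

Because the results are ordered from the bottom up, the first suspect in the list is the place to start; layers above it can't be expected to do better than it does.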

Perhaps you think that this layering approach should be obvious, but let me assure you that I've seen people skip it. I've probably skipped it myself on occasion, when I felt I was in too much of a hurry to really analyze the problem systematically.

PS: when assessing each layer, you probably want to look at something like Brendan Gregg's USE Method in addition to measuring the performance you can get in test situations.


Comments on this page:

By liam at unc edu at 2016-01-14 09:17:10:

Do you perceive a significant advantage in working upwards (from raw hardware on up) over working downwards (from the top layer of the stack)? I would typically work down from the client view of the world, then strip off layers.

By cks at 2016-01-14 12:27:31:

I don't have strong feelings about it, but I feel that in practice working upwards winds up with you making fewer assumptions about what performance is good or bad and what you 'should' be getting. And unless you can determine that you're getting the maximum performance possible, you're going to wind up going through all of your layers sooner or later; the only question is in which order.

(I also tend to feel that low level benchmarking and testing is simpler and has fewer moving parts, so it's easier to do. Starting with easy stuff and then gradually moving up to harder and harder things makes me happier. And if it turns up issues right away, well, you haven't spent a lot of time working out how to do good testing on a complex high level thing.)

Written on 13 January 2016.


Last modified: Wed Jan 13 01:49:17 2016