Your system's performance is generally built up in layers
There are many facets and many approaches to troubleshooting performance issues, but there are also some basic principles that can really help to guide your efforts. One of them, one so fundamental that it often doesn't get mentioned, is that your system and its performance is built up of layers and thus to troubleshoot system performance you want to test and measure each layer, working upwards from the base layers (whatever they are).
(A similar 'working upwards' process can be used to estimate the best performance possible in any particular environment. This too can be useful, for example to assess how close to it you are or if the best possible performance can possibly meet your needs.)
To make this more concrete, suppose that you have an iSCSI based fileserver environment and the filesystems on your fileservers are performing badly. There are a lot of moving parts here; you have the physical disks on the iSCSI targets, the network link(s) between the fileservers and the iSCSI targets, the iSCSI software stack on both sides, and then the filesystem that's using the disks on the fileserver (and perhaps a RAID implementation on the iSCSI targets). Each of these layers in the stack is a chance for a performance problem to sneak in, so you want to test them systematically:
- how fast is a single raw disk on the iSCSI targets, measured locally on a target?
- how fast are several raw disks on the iSCSI targets when they're all operating at once?
- if the iSCSI targets are doing their own RAID, how fast can that go
compared to the raw disk performance?
- how fast is the network between the fileserver and the iSCSI targets?
- how fast is the iSCSI stack on the initiator and targets? Some iSCSI
target software supports 'dummy' targets that don't do any actual IO,
so you can test raw iSCSI speed. Otherwise, perhaps you can swap in a
very fast SSD or the like for testing purposes.
- how fast can the fileserver talk to a single raw disk over iSCSI? To several of them at once? To an iSCSI target's RAID array, if you're using that?
By working through the layers of the stack like this, you have a much better chance of identifying where your performance is leaking out. Not all performance problems are neatly isolated to a single layer of the stack (there can be all sorts of perverse interactions across multiple layers), but many are and it's definitely worth checking out first. If nothing else you'll rule out obvious and easily identified problems, like 'our network is only running at a third of the speed we really ought to be getting'.
Perhaps you think that this layering approach should be obvious, but let me assure you that I've seen people skip it. I've probably skipped it myself on occasion, when I felt I was in too much of a hurry to really analyze the problem systematically.
PS: when assessing each layer, you probably want to look at something like Brendan Gregg's USE Method in addition to measuring the performance you can get in test situations.