Wandering Thoughts archives

2013-10-31

Our likely future backend and fileserver hardware

At this point we've finalized almost all of the hardware for the renewal of our fileserver infrastructure, unless something terribly bad turns up (which is always possible). So today I feel like talking about what hardware we're choosing and the size and scope of our project, partly because it seems uncommon for people to write this sort of thing up.

The base backend hardware is a SuperMicro X9SRH-7TF motherboard with 8G of (ECC) RAM, an Intel E5-2603 CPU, an extra LSI 9207-8i SAS controller card, and some additional networking. This gives us a single-socket motherboard with dual Intel 10G-T ports, which is the important thing for us because it makes 10G-T cheap enough that we can afford it. We need the LSI card because we're talking to SATA disks (so we want to avoid SAS expanders) and the motherboard only has 8 SAS ports onboard. All of this is in a SuperMicro SC 836BA-R920 case, which gives us 16 3.5" front panel drive bays for iSCSI data disks and two rear 2.5" drive bays for mirrored SSDs as the system disks.

(For backends the additional networking is likely to be a cheap Realtek 1G card. For fileservers it'll be a dual-port Intel 1G or 10G-T card, depending on what we can afford. Fileservers will also have other hardware differences, such as a lot more memory and probably no LSI card.)

Our current plan for backend disks is twelve 2TB WD Se drives (7200 RPM SATA drives with a five year warranty) plus four SSDs for ZFS ZILs; we haven't selected the SSDs yet. It's possible that we'll shift to one or two more HDs and fewer ZIL SSDs. The system SSDs will be a pair of semi-random 60 GB SSDs, since you don't need more than that for your system disks (well, you hardly need even that).

At the moment we have three primary HD-based fileservers with two backends each, one SSD-based fileserver with three backends, one further fileserver which now doesn't need to be on separate hardware, a hot spare backend (with disks) and fileserver, and some test hardware that I'm going to ignore. The most urgent things to replace are the HD-based fileservers because our current disks are starting to die at an accelerating rate and you can't really get SATA drives with 512b sectors any more.

Thus a full-scale replacement of the HD side requires eleven units (assuming we use the same case for fileservers and backends) and at least 84 WD Se drives. Due to the ZFS 4K sector mess we have to replace hardware in units of 'one fileserver and its backends', ie three units and 24 HDs at a time. Replacing the SSD-based fileserver requires three units but no new data drives; our current SSDs are new enough to last us for a while. I'd like three units of test hardware (a fileserver and two backends), but I suspect we can't afford that.

(The current SSD-based fileserver has three backends for reasons that boil down to hardware issues with our current SSD enclosures. We wouldn't need to replicate this with new hardware.)
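
As a sanity check, here's a quick sketch (in Python, since that's easy to read) of where the eleven units and 84 drives appear to come from. The assumption is that the hot spare fileserver and hot spare backend are counted in and that each backend gets the twelve data HDs planned above; this is illustrative arithmetic, not a quote sheet.

    # Where 'eleven units' and '84 drives' seem to come from, assuming the
    # hot spare fileserver and backend are included and each backend gets
    # the planned twelve 2TB WD Se data drives.
    hd_fileservers = 3            # current HD-based fileservers to replace
    backends_per_fileserver = 2
    spare_fileservers = 1         # hot spare fileserver
    spare_backends = 1            # hot spare backend (with disks)
    hds_per_backend = 12          # planned WD Se data drives per backend

    backends = hd_fileservers * backends_per_fileserver + spare_backends
    units = backends + hd_fileservers + spare_fileservers
    data_hds = backends * hds_per_backend
    print("%d units, %d data drives" % (units, data_hds))   # 11 units, 84 drives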

I'm going to skip doing a tentative costing out of all of this for fuzzy reasons. Interested parties can use the item and quantity counts here to do it for themselves.

Disclaimer: We haven't gone through any sort of competitive evaluation process to select this particular set of hardware out of the vast universe of possible hardware that meets our general specifications. We've just found hardware that meets our needs, has prices that seem sane, and works in our testing (so far). As such I can't say anything about whether or not this would be your best and/or cheapest option in this area. We've also deliberately chosen not to put too many disks in a single physical unit or to use disks that are too large, partly out of a desire to keep up our IOPS.

Sidebar: software and other details

We'll use some version of Linux with our usual iSCSI target software on the backends. The frontends will run OmniOS (and use ZFS). Using a single modest CPU on the fileservers may strike some people as eyebrow-raising, but we aren't going to be touching ZFS dedup at all, and after thinking about some of the issues involved I don't think we want compression either. This makes me feel that anything beefier would be overkill.

(I've tested both Linux and OmniOS on this hardware and they work, although tuning 10G performance is clearly going to be interesting.)

FutureFileserverHardware written at 23:18:06

2013-10-30

An open question: part uniformity versus unit cost

We're in the process of renewing the hardware for our fileserver infrastructure, and at this point we've basically settled on both the motherboard we'll use for the iSCSI backends and the fileservers and the rest of the hardware for the iSCSI backends; the backends will be built around a 3U case with 16x 3.5" drive bays on the front and 2x 2.5" drive bays on the back (plus dual power supplies and various other bits). As we were discussing things today, we wound up asking a question: does it make sense to specify out a different case for the fileservers?

Because they don't need the drive count of the iSCSI backends, the fileservers don't need a 3U case; they could certainly fit in a 2U case without problems, even with room for the hack of local L2ARC SSDs. Based on some preliminary investigation I just did, using a 2U case instead of the backend 3U case could save us perhaps $200 (maybe more, maybe less; it's hard to tell). This is not pocket change for us; if nothing else, it would buy that much more RAM for each fileserver.

But there are two advantages of using the current 3U case for fileservers as well as backends. First, it means that we don't have to search out, specify out, price out, and probably get in an initial evaluation unit of a 2U fileserver setup; instead we can simply buy, right now, as many servers as we want. This points to the second advantage: parts uniformity. If the only hardware differences between backends and fileservers are how much memory they have and maybe what expansion network card they have, we simplify our lives as far as parts and spares go. We're only buying and deploying one thing instead of two, and our spares become more flexible and possibly cheaper.

(There's also extra flexibility that we're unlikely to ever use; for example, we'd have the ability to build a low(er)-cost ZFS fileserver that uses local disks instead of iSCSI, at the cost of no failover and so on.)

I honestly don't know if this parts uniformity is worth the cost. It's attractive, though (partly because I've already spent enough of the past few months trying to get vendors to talk to me and take our money).

(True parts uniformity would require more than just the same case; we'd also have to throw in an extra SAS card for each fileserver.)

UniformityVsUnitCost written at 00:55:13

2013-10-21

Thinking about how I want to test disk IO on an iSCSI backend

We're in the process of renewing the hardware for our fileserver infrastructure and we've just got in the evaluation unit for what we hope will be the new backend hardware. One important part of evaluating it will be assessing how it does disk IO, so this entry is me thinking out loud about a high level view of what I want to test there.

In general we need to find out two things: how well the hardware performs and whether it explodes under high load or other abnormal conditions. Since the disks will ultimately be exported as iSCSI targets, local filesystem performance is broadly uninteresting; I might as well test with raw disk access when possible to remove filesystem-level effects.

In performance tests:

  • Streaming read and write bandwidth to an individual drive. I should test drives in all slots of the enclosure to check for slot-dependent performance impacts. (Ideally with the same drive, but that may be too much manual work.)

  • Streaming read and write bandwidth to multiple drives at once. What aggregate performance can we get, where does it seem to level off, and what are the limiting factors? I would expect at least part of this to correlate with controller topology; since we have two controllers in the system, I should also make sure that they perform more or less the same.

  • Single-drive random IOPS rates, then how the IOPS rate scales up to multiple drives being driven simultaneously. In theory I may need some SSDs to really test this, but on the other hand we don't really care what the real limit is if we can drive all of the HDs at their full IOPS rate.

I should probably also test for an IOPS decay on one drive when other drives are being driven at full speed with streaming IO, in case there are controller limits (or OS limits) in effect there.
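
For concreteness, here is a rough sketch of the shape the streaming tests could take. It is not our actual test harness; it assumes fio is installed and that the data disks show up as /dev/sdb and onwards (adjust for the real device names), and it only does reads because the write and random-IOPS variants (--rw=write, --rw=randread, and so on) are destructive or need more care.

    # Sketch: streaming direct reads against whole drives, first one at a
    # time (per-slot numbers) and then all at once (aggregate numbers).
    import subprocess

    DRIVES = ["/dev/sdb", "/dev/sdc"]     # hypothetical device list

    def fio_read(dev, seconds=60):
        # Direct (uncached) sequential reads against the raw device.
        return subprocess.Popen([
            "fio", "--name=seqread", "--filename=" + dev,
            "--rw=read", "--bs=1M", "--direct=1",
            "--ioengine=libaio", "--iodepth=4",
            "--time_based", "--runtime=%d" % seconds,
        ])

    # Single-drive numbers, one drive at a time.
    for dev in DRIVES:
        fio_read(dev).wait()

    # Aggregate numbers: drive everything at once, see where it levels off.
    procs = [fio_read(dev) for dev in DRIVES]
    for p in procs:
        p.wait()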

The above tests should also check that the system keeps working correctly in the face of basic high IO load, but there are other aspects to this. For proper functionality, what I can think of right now is:

  • Basic hotplugging of drives. Both inserted and removed drives should be recognized, and promptly. Insertion and removal of multiple drives should work.

  • Test the effects of hotplugging drives on IO being done to other drives at the same time. The hardware topology involved should (I believe) make this a non-issue but we want to test this.

  • Test the effects of flushing write caches under high load, both to the same disk and to other disks. Again this should be a non-issue, but, well, 'should' is the important word here.

  • As a trivial test, make sure that a fully dead disk doesn't cause any controller problems.

  • Test how the controller behaves for a 'failing but not dead' disk, one that gives erratic results or read errors or both. IO to other disks should continue working without problems while we should get clear errors on the affected disk.

(I think that we have such failing disks around to test with, since we've had a run of failures and semi-failures recently. Hopefully I can find a properly broken disk without too much work.)
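
As a sketch of what the basic hotplug recognition test could look like, here's a little Python loop that just watches /sys/block for sd* devices appearing and disappearing and timestamps the changes; the idea is to run it while doing the insertions and removals, possibly with streaming IO going to other drives. This is an illustration of the approach, not a finished test script.

    # Watch /sys/block and report when sd* devices appear or disappear.
    import os
    import time

    def block_devices():
        # sd* covers the SAS/SATA data disks; ignore everything else.
        return set(d for d in os.listdir("/sys/block") if d.startswith("sd"))

    seen = block_devices()
    print("starting with: " + " ".join(sorted(seen)))
    while True:
        time.sleep(1)
        now = block_devices()
        for dev in sorted(now - seen):
            print(time.ctime() + " appeared: " + dev)
        for dev in sorted(seen - now):
            print(time.ctime() + " disappeared: " + dev)
        seen = now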

I'm probably missing some useful things that I'll come up with later, but just writing this list down now has made me realize that I want to do tests with a failing disk.

(Note that this is not testing network level things or how the iSCSI software will work on this hardware and so on. That will get tested later; for the first pass I'm interested only in the low-level disk performance because everything else depends on that.)

Sidebar: the rough test hardware

This evaluation unit has 8x SAS on the motherboard and we've added another 8x SAS via an LSI board (I don't have the exact model number handy right now). The (data) disks are 7200 RPM 2TB SATA HDs, directly connected without a SAS expander. One obvious choke point is the single PCIE board with 8 drives on it; another one may be how the motherboard SAS ports are connected up. This time around I should actually work out the PCIE bandwidth limits as best I can (well, assuming that eight drives going at once deliver less than the expected full bandwidth).

(The system disks are separate from all of this.)
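
As a back-of-the-envelope sketch of the PCIE side, here is the sort of arithmetic I have in mind, assuming the LSI card runs as a PCIe 3.0 x8 device (with PCIe 2.0 numbers included in case it trains lower) and guessing at roughly 150 MB/s of streaming throughput per 7200 RPM SATA drive; the real per-drive number is exactly what the tests above are supposed to establish.

    # Rough PCIe limits versus aggregate streaming throughput from 8 drives.
    # Per-lane figures are approximate data rates after encoding overhead.
    PER_LANE_MB_S = {"PCIe 2.0": 500.0, "PCIe 3.0": 985.0}
    LANES = 8
    DRIVES_ON_CARD = 8
    MB_PER_DRIVE = 150.0      # assumed; to be measured for real

    aggregate = DRIVES_ON_CARD * MB_PER_DRIVE
    for gen in sorted(PER_LANE_MB_S):
        limit = PER_LANE_MB_S[gen] * LANES
        print("%s x%d: ~%.0f MB/s limit vs ~%.0f MB/s from drives" %
              (gen, LANES, limit, aggregate))

If those guesses are in the right ballpark, the PCIE slot itself shouldn't be the streaming choke point even at PCIe 2.0 speeds; the more interesting question is how the motherboard's own SAS ports are connected up.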

DiskIOTestingThoughts written at 00:58:28

2013-10-20

Thoughts inspired by the abstract idea of Docker-like things

One of the things that's been in the air in the 'devops' world lately is Docker. The vague and probably partially inaccurate things I've heard about it, plus the general idea of 'containers', have inspired some thoughts, which I will summarize by saying that I'm all for the abstract idea.

The thing is, I don't really want to be managing machines. It's a pain in the rear. Nor do I really want virtualization, because that actually means managing more machines (now I get to manage not just the virtual servers but also some more real ones as the hosts). By contrast, containers seem like a great abstraction to me, if they work.

The way I like to imagine this is that I have a substrate of container-running machines, which are all generic and effectively identical (although you might have ones with more or less CPU power and RAM, depending). Containers get rolled on to these generic building-block machines, one or more to a machine. Containers are lightweight and minimal (so they capture only what is important to the system they implement) and they're isolated from the underlying machine, so I can maintain them separately (and maintain machines without worrying about the containers on them). Installing a machine to do X becomes 'install base machine; run command to roll container X on to the machine'. If I'm lucky, most containers might even be basically indifferent to the underlying distribution and OS release we run them on.

(There are a bunch of issues here, of course. For example, I suspect that containers need to be able to come with their own IP aliases.)

And while I'm imagining lovely features, another nice advantage of a well-done container system is that I could manage things in larger abstractions than 'a daemon'. Again, I'd love it if I could just say 'start container X' or 'stop container X' and not have to remember the exact daemons that are needed for it and so on.

(I couldn't expect to do all management at the container level precisely because it is generic. For the foreseeable future there will still be daemon-specific management that needs to be done.)

This is a future that would simplify my life. It'd do so by decoupling two things that are currently entangled together (base OS and the applications on top of it) and allowing me to handle them separately.

To answer a potential question: I don't think you can get quite this with automation tools like Puppet or Chef. You can get close, but the problem is that those tools are still intrinsically entangled with the host machine's state because they have to install and manage things in the host machine itself, not in something that is deliberately isolated from it the way that containers should be.

ContainerThoughts written at 01:25:01

