2007-07-30
What we want out of our new fileserver design
We're still in the process of designing our next generation of fileserver architecture, so I've decided to write down what we want out of the design.
- good long term storage management, as we are likely to keep the architecture for at least five years and probably more.
- we'd like to be able to do highly available storage that has no
single point of failure, because some of our filesystems are crucial
to our entire infrastructure.
- we should be able to add special purpose capacity without huge
expense. While SATA RAID-5 or RAID-6 bulk storage meets our general
needs, we may someday really need some fast RAID-1 storage or the like.
- people (and groups) should be able to incrementally buy chunks of
additional storage for a relatively modest cost, on the order of a few
hundred dollars.
In a sense, charging people for it is an unfair way of allocating storage, but we have to allocate it somehow and we know that not all groups want or need the same amount of disk space. Keeping the price low is necessary so that no one gets frozen out; a professor can always afford to buy another unit of space if they really need it.
(The cost is a one-time cost for similar reasons.)
- expanding the overall storage pool should be cheap, not expensive.
This implies that both the storage units and the infrastructure
needed to hook a new one up shouldn't cost lots of money.
- groups should be able to buy large amounts of storage with grant
funding.
It turns out that the easiest way to buy storage with grant funding is to buy commodity servers with disks from a large vendor that is set up to do the special grant funding dances. This means that we want to be able to plug general servers into our environment, even if we're not building our own storage pool out of them.
(Disclaimer: when I say 'we' here I really mean it. I owe a big debt to my co-workers for educating me about a lot of the issues involved, including things like the peculiar constraints of grant funding.)
2007-07-24
Using iSCSI and AOE to create artificial disk errors
One of the nice things that you can do with iSCSI and AOE is use them to test how your system (volume management, filesystem, programs, etc) really deals with low-level disk errors. All sorts of interesting issues can come crawling out of the woodwork when you do this; it is very educational and occasionally rather alarming.
(Testing this sort of thing is otherwise fairly difficult, because few people have controllable error-producing hard drives sitting around, especially hard drives that will repeatedly run fine for a while and then start spewing errors. Since there are all-software implementations of iSCSI and AOE, they are much more controllable.)
In my experience, it's easier to do this with AOE than with iSCSI.
Neither set of target drivers directly supports this, but most AOE target
drivers are small and run in user space, so they are easy to understand,
modify, and run. (My current tool of choice is something called
aoedisk, which comes with the beta Solaris drivers you can get from
Coraid.)
There are lots of interesting things to test:
- turn a disk read-only.
- start returning errors on all IO, or all read IO, or all write IO.
- start returning errors randomly, or only for requests for some sectors, or the like.
- return corrupted data without reporting an error, either consistently or randomly.
- change a disk's serial number, either while live or while idle, optionally zeroing the contents; this simulates swapping a physical disk (without the upper layers getting any disk changed hotswap events).
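Most of these injections boil down to a thin shim between the target's request handling and its backing store. Here is a minimal sketch of the idea; the class and its knobs are entirely hypothetical, not taken from aoedisk or any real target's code:

```python
import os
import random

class FaultyBackingStore:
    """A hypothetical error-injecting wrapper around a target's backing
    store file. Flip the attributes at runtime to start injecting errors
    after the disk has been running fine for a while."""

    SECTOR = 512

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR)
        self.read_only = False
        self.fail_reads = False
        self.fail_writes = False
        self.error_rate = 0.0      # fraction of requests to fail at random
        self.corrupt_rate = 0.0    # fraction of reads to silently corrupt

    def read(self, lba, nsectors):
        if self.fail_reads or random.random() < self.error_rate:
            raise IOError("injected read error at LBA %d" % lba)
        data = os.pread(self.fd, nsectors * self.SECTOR, lba * self.SECTOR)
        if random.random() < self.corrupt_rate:
            # flip one byte but report success: silent corruption
            i = random.randrange(len(data))
            data = data[:i] + bytes([data[i] ^ 0xFF]) + data[i + 1:]
        return data

    def write(self, lba, data):
        if self.read_only:
            raise IOError("injected write error: disk is read-only")
        if self.fail_writes or random.random() < self.error_rate:
            raise IOError("injected write error at LBA %d" % lba)
        os.pwrite(self.fd, data, lba * self.SECTOR)
```

The important property is that the failure modes are switches you can throw while the target is live, so the initiator gets a disk that works long enough to be trusted and then starts misbehaving.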
While you can also test what happens when a target device goes away or when requests start timing out, it's less useful because it's hard to be sure that the iSCSI or AOE initiator driver is behaving in the same way that the driver for the physical disk would. Of course, test away if you plan to run iSCSI or AOE, because then it's directly relevant; you may someday have a target device drop off the net or the like.
Of these, the Linux iSCSI target driver I'm familiar with can only change serial numbers and make a disk go read-only. None of the AOE tools have direct support for introducing errors after the disk has been running for a while, but it's relatively easy to add to their code. You can always force disk corruption by scribbling on bits of the backing store on the target's host, and you can always make the disk or the entire target host go away.
(Disks that fail immediately when the system tries to look at them are less interesting than disks that work long enough for the initiator to mount the filesystem and start doing IO.)
2007-07-19
A safety tip: keep your different sorts of source trees separate
Like many places, we are slowly moving out of an era where we ran Unixes that came from the vendor with very limited selections of packages, so we had to build and install all sorts of packages ourselves. Some of them we just compiled as-is, some we had to modify, and some we wrote ourselves from scratch.
And we put the source code for all of them in the same local source tree.
Allow me to suggest that you not do this. If you keep local source code, separate it out into (at least) packages that you just (re)compiled, packages from elsewhere that you had to modify, and entirely local programs. A few years from now, this will make it much easier to figure out what you can gleefully throw out, what you might want to look through, and what you need to keep at least for reference to figure out just what it did.
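For example, a top-level split along these lines might look like the following (the directory names are just an illustration, not a standard):

```
/local/src/
    vendor/      third-party packages we just (re)compile as-is
    modified/    packages from elsewhere with local changes
    local/       programs written here from scratch
```

The exact names matter much less than the fact that the three sorts never get mixed together in one directory.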
Note that this applies just as much if you are building .debs or RPMs
or whatever instead of just doing 'make install'. I also believe that
it applies even if you are building your own entire private copy of, for
example, a Ruby on Rails environment.
2007-07-09
How not to set up your DNS (part 16)
Sort of presented in the traditional illustrated format:
; sdig ns ibc.com.au.
ns1.ibc.com.au.
ns2.ibc.com.au.
; dig cname ibc.com.au. @ns1.ibc.com.au.
[...]
;; flags: qr aa; QUERY: 1, ANSWER: 1, [...]
[...]
;; ANSWER SECTION:
ibc.com.au. IN SOA ns1.ibc.com.au. \
hostmaster.localdomain. [....]
(The TTL has been omitted and the line wrapped for clarity.)
This is not how you are supposed to say 'I do not have a CNAME record'. What ibc.com.au should be doing is returning a reply with nothing in the 'answer' section and their SOA record in the 'authority' section.
The net result of this issue is that a number of resolving nameservers will return SERVFAIL when asked to see if ibc.com.au is a CNAME, which has various interesting downstream consequences.
(Technically the com.au zone says that ibc.com.au has two other nameservers, however a) ibc.com.au disagrees, since the extras are not in the NS records that the first two return and b) the extra two are non-authoritative anyways.)
2007-07-06
What the flags on DNS query responses mean
Responses from DNS servers come with various useful and informative flags. Since I just looked them up while figuring out just what was going on with a peculiar nameserver, I'm going to write them down for my future reference.
- qr: Yes, this is really a DNS response that dig is printing.
- aa: The server is authoritative for the domain.
- rd: You asked for recursive resolution of your query.
- ra: The server is willing to do recursive queries for you.
- tc: The response was truncated because it was too big to fit in a UDP packet.
These come from RFC1035 section 4.1.1, which is worth reading in full (it's short).
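These flag bits live in the second 16-bit word of the DNS message header. A small sketch of pulling them out of that word, using the bit positions from RFC 1035 section 4.1.1 (the function name is just for illustration):

```python
def dns_flags(flags16):
    """Decode the single-bit flags from the 16-bit flags word of a DNS
    header, per RFC 1035 section 4.1.1. (Opcode and RCODE occupy the
    remaining bits and are ignored here.)"""
    return {
        'qr': bool(flags16 & 0x8000),  # bit 15: response (vs query)
        'aa': bool(flags16 & 0x0400),  # bit 10: authoritative answer
        'tc': bool(flags16 & 0x0200),  # bit 9:  truncated
        'rd': bool(flags16 & 0x0100),  # bit 8:  recursion desired
        'ra': bool(flags16 & 0x0080),  # bit 7:  recursion available
    }
```

So a flags word of 0x8580 decodes to qr, aa, rd, and ra set: an authoritative response from a server that is also willing to recurse.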
Every nameserver for a domain should be an authoritative server for the
domain and so its responses about the domain should always have the aa
bit set. These days, seeing ra from a domain's nameserver should make
you nervous, especially if the nameserver does not report itself as
authoritative (ie, doesn't set aa).
(Real secondary servers for a domain are authoritative for the domain
and know it, even though they do not hold a permanent local copy of
the domain's DNS records. Informal secondaries, where you just list a
nameserver that will do recursive queries for the Internet as one of
your NS records, are not authoritative and will not set aa on replies.
Yes, people really do that.)
How not to set up your DNS (part 15)
This is one of those interesting little DNS glitches:
- the nameservers for the pk country domain say that gem.net.pk lists as nameservers sooraj.gem.net.pk and chand.gem.net.pk.
- if you ask sooraj what gem.net.pk's nameservers are, it gives you a non-authoritative reply saying that they are sooraj, chand, and ns1.gem.net.pk.
- ns1.gem.net.pk doesn't respond.
- if you ask chand what gem.net.pk's nameservers are, you sometimes get a reply without any actual data but with an 'authority' section that says that chand and sooraj are the nameservers, as if chand wasn't actually an authoritative nameserver for gem.net.pk.
The net result seems to be that every so often, our nameservers can't resolve anything to do with gem.net.pk because they have decided to query chand and have gotten answers back that make them throw up their hands in disgust.
What seems to be going on is that sooraj and chand are actually general recursive nameservers (for example, neither claims to be authoritative in any answers) that can also talk to ns1, which is presumably an internal-only machine. For some reason sooraj has a local copy of the data (for example, its TTLs on gem.net.pk results never count down) but chand does not; if you query chand at a time when it doesn't have things in its cache, you get useless results.
2007-07-03
How not to set up your DNS (part 14)
In the traditional illustrated format:
; sdig cname scrubber2.dom1.com @ns1.dom1.com
mta1.otherdom.com
mta2.otherdom.com
This is a well-intentioned and noble attempt to do round-robin CNAMEs.
Unfortunately it doesn't work, because you can't have multiple CNAME
records; you can have either one CNAME record or any number of other
sorts of records. For what this domain is trying to do, they need to
get the other domain to set up an mta-cluster.otherdom.com record
with all of the IP addresses of their MTAs, and then CNAME to that.
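For illustration, the fix might look something like this in zone-file form (the names are the anonymized ones from above and the 192.0.2.* addresses are made up):

```
; in otherdom.com's zone:
mta-cluster      IN A     192.0.2.10
                 IN A     192.0.2.11

; in dom1.com's zone:
scrubber2        IN CNAME mta-cluster.otherdom.com.
```

The CNAME stays unique, and the round-robin happens among the A records behind it, which is perfectly legal.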
The effects on caching DNS servers are actually pretty interesting. Some DNS servers will refuse entirely to deal with this, returning server failure messages. Other DNS servers will give both CNAMEs on an initial query but only cache one of the two CNAME records (picking which one at random) and thereafter only give you that one back for the record's TTL.
(The domains involved have been anonymized at the request of the person who showed this to me.)
2007-07-01
The optimization rule for systems
One of the rules of optimizing program performance is that there is no point in optimizing code that is not run very often; you optimize the performance critical code, and then move on to other work. There is a similar principle in system design and system administration: improving an area that is not a concern to people does not really excite them. To significantly improve people's lives, you have to work on something that they care about, something that is hurting them now.
An important consequence of people not caring is that they aren't willing to put up with pain to get your improvement. Since your improvement has to make their lives better (or at least no worse), dealing with your improvement has to be less of a pain than what they're feeling now.
(In fact, this crops up all over in various forms.)
One problem that bedevils boosters of new technologies is that they over-estimate the degree of pain people are feeling from the area that they're improving. Put crudely, they're so close to their area that they wind up thinking that everyone cares about it as much as they do. (A common side effect is for such people to also underestimate how much of a pain their new technology is in practice.)
Note that sysadmins are as much at risk of this as boosters of new technology; when we create bright shining solutions to some issue, it's always important to find out if the users actually care about it. Similarly, finding out what things are pains for the users is the first step in figuring out what's important to work on.