2016-06-29
Modern DNS servers (especially resolvers) should have query logging
Since OpenBSD has shifted to using Unbound as their resolving DNS server, we've been in the process of making this shift ourselves as we upgrade, for example, our local OpenBSD-based resolver machines. One of the things this caused me to look into again is what Unbound offers for logging, and this has made me just a little bit grumpy.
So here is my opinion:
Given the modern Internet environment, every DNS server should be capable of doing compact query logging.
By compact query logging, I mean something that logs a single line with the client IP, the DNS lookup, and the resolution result for each query. This logging is especially important for resolving nameservers, because they're the place where you're most likely to want this data.
(How the logging should be done is an interesting question. Sending it to syslog is probably the easiest way; the best is probably to provide a logging plugin interface.)
What you want this for is pretty straightforward: you want to be able to spot and find compromised machines that are trying to talk to their command & control nodes. These machines leave traces in their DNS traffic, so you can use logs of that traffic to pick them out (either at the time or later, as you go back through log records). Sometimes what you want to search for is the hosts and domains being looked up; other times, it's the IP addresses coming back (attackers may use fast-flux host names but point them all at the same IPs).
(Quality query logging will allow you relatively fine-grained control over what sort of queries from whom get logged. For example, you might decide that you're only interested in successful A record lookups, and then only for outside domains, not your own.)
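To make this concrete, here is a sketch of what searching such compact logs could look like. The log format and every name, IP address, and path below are invented for illustration; real log lines would come from whatever your DNS server or logging plugin emits.

```shell
# Hypothetical compact log format, one line per query:
#   <unix-time> <client-ip> <qname> <qtype> <result>
# Build a tiny sample log to search (all data made up):
cat > /tmp/qlog <<'EOF'
1467200000 10.1.1.5 www.example.com A 93.184.216.34
1467200001 10.1.1.9 evil.example.net A 198.51.100.7
1467200002 10.1.1.5 evil.example.net A 198.51.100.7
EOF
# Which clients looked up names in a suspect domain:
awk '$3 ~ /\.example\.net$/ { print $2 }' /tmp/qlog | sort -u
# Which names resolved to a known-bad IP (fast-flux hunting):
awk '$5 == "198.51.100.7" { print $3 }' /tmp/qlog | sort -u
```

The same searches work just as well over syslog output, which is one reason syslog is an adequate starting point for this kind of logging.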
Query logging for authoritative servers is probably less useful, but I think that it should still be included. You might not turn it on for your public DNS servers, but there are other cases such as internal DNS.
As for Unbound, it can sort of do query logging but it's rather verbose. Although I haven't looked in detail, it seems to be just the sort of thing you'd want when you have to debug DNS name resolution problems, but not at all what you want to deal with if you're trying to do this sort of DNS query monitoring.
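For reference, the Unbound knobs involved are roughly these (a sketch of an unbound.conf fragment; check the unbound.conf manpage for your version, since the exact options and their output have varied over time):

```
server:
    # Log through syslog instead of to a file.
    use-syslog: yes
    # One log line per query (client IP, name, type, class) --
    # but note this does not log the result of the query.
    log-queries: yes
    # Higher verbosity levels produce the chatty per-resolution
    # debugging output, which is not compact at all.
    verbosity: 1
```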
2016-06-26
How not to maintain your DNS (part 22)
Much like the previous installment, this example of bad DNS setup is sufficiently complicated that it's best illustrated in text instead of trying to show DNS output.
We start with the domain zshine.com. At the moment its WHOIS registration says that it has the DNS servers ns1.gofreeserve.com and ns2.gofreeserve.com. If you query the nameservers for .com, they will agree with this and give you an IP for each nameserver, 192.196.158.106 and 192.196.159.106 respectively.
According to WHOIS, gofreeserve.com's registered nameservers are (ns1 ns2 ns11 ns12).lampnetworks.com, and the .com nameservers agree with this. All of these nameservers report themselves as authoritative for gofreeserve.com. None of them know about either ns1.gofreeserve.com or ns2.gofreeserve.com; in fact, they authoritatively claim that neither exists.
As the capstone, neither 192.196.158.106 nor 192.196.159.106 respond to DNS requests, so even if you accept the glue records from the .com nameservers you can't actually resolve anything about zshine.com. Nor do the lampnetworks.com nameservers have any information about zshine.com.
The results of this are somewhat interesting. Obviously, zshine.com essentially doesn't exist in DNS; you can't look up an A or MX record for it. Working out why can be a little bit tricky, though. With at least some resolving DNS servers, all you get is a timeout when you query for even just zshine.com's NS records. In order to hunt things down, I had to go digging in WHOIS data and then look at gofreeserve.com's own DNS data.
As far as I can guess, this is a version of glue record hell. Gofreeserve does appear to offer DNS handling as one of their services, and at some point it was clearly done through those ns1 and ns2 DNS names. However, things have changed since then, and not all domains that used them have had their WHOIS data updated. In fact, perhaps some domains have been dropped entirely by Gofreeserve but haven't changed anything. Without glue records in the DNS, we'd probably get a failure to resolve the listed nameservers. With glue records, well, clearly some of the time we get a timeout trying to query them.
(Some casual Internet searches suggest that there are any number of domains still using ns[12].gofreeserve.com as their DNS servers. I won't speculate why the people behind these domains don't seem to have noticed that they don't work any more, although this case may have a relatively sensible reason, namely that this is probably a secondary domain name for a firm with their primary domain name in .cn.)
PS: Since the occasion for me noticing this issue with zshine.com is something claiming to be it trying to send email to my spamtraps, I'm not too upset about its DNS issues.
2016-06-24
Our new plan for creating our periodic long term backups
Our ordinary backups are done on the usual straightforward rolling basis, where we aim to have about 60 days' worth of backups. We also try to make an additional set of long term backups every so often, currently roughly three times a year, and keep these for as long as possible. Every so often this makes people very happy, because we can restore something they deleted six months ago without noticing.
Our long term backups are done with the same basic system as our regular disk-based backups. We have some additional Amanda servers that are used only for these long term backups; we load them up with disks and then have them do full backups of all of our filesystems to the spare disks. Obviously this requires careful scheduling and managing, since we don't want to collide with the regular backups (which take priority). This is a simple approach and it works, but unfortunately over time it's become increasingly difficult and time-consuming to actually do a long term backup run. The long term backups can only run during the day and require hands-on attention; sometimes the regular backups of our largest fileserver run into the day and block long term backups for that day entirely; the daytime backups go very slowly in general because our systems are actively in use; and so on. And many of these problems are only going to get worse in the future, as people use more space and are more active on our machines.
Recently, one of my co-workers had a great idea on how to deal with all of these problems: copy filesystem backups out of our existing Amanda servers. Instead of using additional Amanda servers to do additional backups, we can just make copies of the full filesystem backups from our existing regular backup system. When you do Amanda backups to 'tapes' that are actually disks, Amanda just writes each filesystem backup to a regular file. Want an extra copy, say for long term backups? Just copy it somewhere, say to the disks we're using for those long term backups. This copying doesn't bog down our fileservers, can easily be done when the Amanda servers are otherwise idle, and can be done any time we want, even days after the filesystem full backup was actually made. Effectively we've turned building the long term backups from a synchronous process into an asynchronous one.
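A sketch of what the copy step could look like. All paths and file names here are made up (the actual vtape layout depends on how the Amanda servers are configured), but the point stands: with Amanda's file 'tape' driver, each filesystem dump is just an ordinary file, so an extra long-term copy is an ordinary file copy.

```shell
# Stand-in for an Amanda vtape slot directory and a long-term
# destination disk (both paths hypothetical):
mkdir -p /tmp/vtapes/daily-012 /tmp/longterm/2016-06
printf 'dump data\n' > /tmp/vtapes/daily-012/00001.fs1._h_users.0
# Copy last night's full dumps whenever the server is idle; this
# reads only the Amanda server's disks, not the fileservers.
cp -a /tmp/vtapes/daily-012/. /tmp/longterm/2016-06/
ls /tmp/longterm/2016-06
```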
The drawback of abandoning Amanda is that we lose all of the Amanda infrastructure for tracking where filesystems have been saved and restoring filesystems (and files). It's entirely up to us to keep track of which disk has which filesystem backup (and when it was made) and to save per-filesystem index files. And any restores will have to be entirely done by hand with raw dd and tar commands, which makes them rather less convenient. But we think we can live with all of this in exchange for it being much easier to make the long term backups.
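As an illustration of what such a hand restore might look like: Amanda's file-driver dump files start with an Amanda header (normally 32 KiB), after which the raw dump stream (here, tar) follows, so the usual idiom is dd with skip=1 piped into tar. The paths below are invented, and the demo builds a stand-in 'dump file' first so the pipeline has something to work on.

```shell
# Build a stand-in Amanda-style dump file: 32 KiB of header
# padding followed by a tar stream (all paths hypothetical).
mkdir -p /tmp/demo/data && echo "hello" > /tmp/demo/data/file.txt
tar -cf /tmp/demo/dump.tar -C /tmp/demo data
dd if=/dev/zero bs=32k count=1 2>/dev/null > /tmp/demo/00001.fs1._data.0
cat /tmp/demo/dump.tar >> /tmp/demo/00001.fs1._data.0
# The actual restore idiom: skip the 32 KiB header, pipe to tar.
mkdir -p /tmp/demo/restore
dd if=/tmp/demo/00001.fs1._data.0 bs=32k skip=1 2>/dev/null |
    tar -xf - -C /tmp/demo/restore
cat /tmp/demo/restore/data/file.txt
```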
Right now this is just a plan. We haven't done a long term backup run with it; the next one is likely to happen in September or October. We may find out that there are some unexpected complications or annoyances when we actually try it, although we haven't been able to think of any.
(In retrospect this feels like an obvious idea, but much like the last time, spotting it required being able to look at things kind of sideways. All of my thoughts about the problem were focused on 'how can we speed up dumping filesystems for the long term backups' and 'how can we make them work more automatically' and so on, all of which were stuck inside the existing overall approach.)
2016-06-19
A lesson to myself: know your emergency contact numbers
Let's start with my tweets:
@thatcks: There's nothing quite like getting a weekend alert that a machine room we have network gear in is at 30C and climbing. Probably AC failure.
@thatcks: @isomer There is approximately nothing I can do, too. I'm not even sure who to potentially call, partly because it's not our machine room.
(This is the same machine room that got flooded because of an AC failure, which certainly added a degree of discomfort to the whole situation.)
In some organizations the answer here is 'go to the office and see about doing something, anything'. That is not how we work, for various reasons. It might be different if it was one of our main machine rooms, but an out of hours AC failure in a machine room we only have switches in is not a crisis sufficiently big to drag people to the office.
But, of course, there is a failure and a learning experience here, which is that I don't have any information written down about who to call to get the AC situation looked at by the university's Facilities and Services people. I've been through past machine room AC failures, and at the time I either read the signs we have on machine room doors or worked out (or heard) who to call to get it attended to, but I didn't write it down. Probably I thought that it was obvious, or that surely I wouldn't forget it by the next time around. Today I found out how well that went.
So, my lesson learned from this incident is that I should fix my ignorance problem once and for all. I should make a file with both in-hours and out-of-hours 'who to contact and/or notify' information for all of the machine rooms we're involved in. Probably we call the same people for a power failure as for an AC failure or another incident, but I should find that out for sure and note it down too. Then I should replicate the file to at least my home machine, and probably keep a printout in the office (in case there's a failure in our main machine room, which would take our entire environment down).
(It would be sensible to also have contact information for, say, a failure in our campus backbone connection. I think I know who to try to call there, but I'm not sure and if it fails I won't exactly be able to look things up in the campus directory.)
2016-06-11
I accept that someday I'll give up MH and move to IMAP mail clients
My current email tooling is strongly built around MH, using both command line tools and exmh. MH assumes a traditional Unix mail environment where your inbox can be accessed through the filesystem, and more than that it fundamentally assumes that it entirely owns your email. As many people who try it out find out to their regret, MH's only interaction with the regular Unix mail ecosystem is to get your mail out of your Unix inbox as fast as possible.
So far I've been able to use and keep on using MH because I've worked (and had my personal email) on Unix systems that handled email in the traditional Unix way, with your inbox directly accessible through the filesystem in /var/mail and so on. However, these are a vanishing breed, for reasonably good reasons, and in the modern world the generic way you get at your email is IMAP. IMAP is not very Unixy, but it's what we've got and it's better than being stuck with proprietary network mail protocols.
MH and IMAP not so much don't get along as don't interact with each other. As far as I know, MH has no protocol support for IMAP, which is not surprising; IMAP is designed to keep all of your email on the IMAP server, which is completely opposite to how MH operates. It might be nice to have an 'IMH' system that was a collection of command line programs to manipulate IMAP mail and folders, but no such thing exists that I know of and it's unlikely that anyone will ever write one.
Some day I will have to use a mail system that can only be accessed over IMAP. In theory I could deal with this by using a program to pull all of my email out of IMAP and hand it over to MH as local mail; there are a number of things that will do variants of this job. In practice my feeling is that doing this is swimming upstream against a pretty powerful current, and thus generally a mistake. Certainly I expect that I won't be energetic and annoyed enough to do it. By that point MH will have had an extremely good multi-decade run for me, and very few programs last forever. I can change.
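(For instance, fetchmail is one of the programs that can do this kind of IMAP-to-local-mail job. A minimal, hypothetical ~/.fetchmailrc sketch; the server name, account, and rcvstore path are all assumptions, and nmh's rcvstore files each message it reads on stdin into an MH folder:

```
# Hypothetical ~/.fetchmailrc: pull mail from an IMAP server
# and hand each message to MH's rcvstore instead of relying on
# local delivery.
poll imap.example.org protocol IMAP
    user "cks" password "not-really"
    ssl
    mda "/usr/bin/mh/rcvstore +inbox"
```

This is exactly the sort of swimming upstream the entry describes, of course; it works, but you're now maintaining the plumbing yourself.)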
(Also, by that point I expect that I will be really tired of how MH and exmh don't really deal all that well with HTML email, because I expect that HTML email is only going to get more and more common from now onwards.)
PS: The clever reader will have guessed that I don't currently have a smartphone or other limited secondary device that I want to read my email from, because all of those really want you to use IMAP (or something even less MH-cooperative, like GMail). That may change someday, at which point I may have to re-think all of this.
Sidebar: I don't see this happening any time soon
Locally we're still very strongly attached to filesystem accessible inboxes and procmail and all of the other artifacts of a traditional Unix mail system. There would be quite a lot of upset users if we took that away from them, so I don't expect it to happen short of something truly drastic happening with our overall mail system.
(Nor is the department likely to give up running its own email system any time soon.)
As for my personal email, well, that's tangled but certainly my attachment to MH complicates my life. There are lots of places that I could get IMAP mail, probably even IMAP mail for a custom domain, so if I was happy with IMAP alone life would be quite a bit easier. Until I de-tangle my personal email it gets a free ride on work email's MH friendliness; after I de-tangle it, I will probably still run my own servers for it and so I could run MH there if I wanted to.
(At that point I might want to switch to IMAP for various reasons.)
2016-06-10
An email mistake I've made as a long-term university sysadmin
I've been a sysadmin here at the University of Toronto for quite a long time, which has enabled me to make a natural and probably common (university) mistake with my email. Namely, I have by now totally commingled my work email and my personal email by the simple mechanism of just using my normal university account for everything. I get random technical mailing lists I'm interested in sent to my address here, any number of people who correspond with me use my address here, and so on.
This is unfortunate for various reasons, including that it makes taking a break away from work email and work systems much harder. I can't just not log in, because if I do I'll miss personal email, and the moment I log in it starts being tempting to take a peek at various work things.
Some people would and could deal with this by moving their personal email to an outside provider (GMail is obviously a popular choice). This is completely possible, and in fact I got people to change the personal email address they used for me once (when I moved it between systems here). Unfortunately I'm very attached to how I handle my email today. The tools that I use intrinsically require a Unix system with a local mail spool, and running my own Unix system for this is (still) enough of a hassle that I haven't taken a deep breath and just gotten down to it.
(The problem with running my own Unix system is not the basic work, it's all of the additional things I'd have to worry about and spend my limited free time on. There's backups and maintenance and monitoring and keeping track of how to rebuild everything and so on and so forth. All of this is simple at work because we've built an entire infrastructure to make it that way. And then there's the whole issue of anti-spam filtering, where I currently get to lean on a commercial package.)
I know it's less than ideal to keep my email commingled, but it's just easier to let this situation go on. Everything works today, I don't have to worry about any number of things, and most of the time the commingling barely matters. Inertia is a powerful force, as are little incremental steps; they're how I wound up in this situation in the first place. Big bang changes like setting up a new mail system et al from scratch are hard, because there's so much work before you get anywhere.
(I should do it anyways. Someday.)
Sidebar: My past efforts at this
At one point I attempted to have at least a personal versus work email address split here, but that went down in flames many years ago because spammers got their hands on my then personal email address (partly because the address predated spammers so I did not take the extensive array of precautions that I do today). Today that old address lives on basically only as a way of getting information about spammer behavior (eg).
Almost ten years ago when I shifted to Computer Science I had the opportunity to (re)split personal email from work email (since I was changing my primary email address anyways), but at the time I was so busy with other aspects of the transition that I didn't really have either time or energy to even think about setting up a new Unix system with a new email setup and so on. Anyway, ten years ago one didn't have the modern wide variety of inexpensive hosting options, at least as I imperfectly remember it.
2016-06-06
I work in what is increasingly a pretty different sysadmin environment
I've written before about what our sysadmin environment is like, but that description doesn't really convey how and why our environment is increasingly different from what the rest of the world seems to be moving to, with the resulting very different needs. Today I'm going to describe our environment from another perspective.
In our environment, we broadly do three different things as far as computing goes. First off, we provide a number of standard services to people in the department, things like Samba file service, printing, email with IMAP (and webmail), DNS, and so on. This sort of internal service provision is probably still quite common in reasonable sized organizations (trendy ones have no doubt outsourced it to GMail, Dropbox, et al). Users are not exposed to the backend details of what software stacks power these services, and although our current stacks are stable we have shuffled them around in the past and could in the future. We definitely feel no need to run the latest and greatest software stacks and versions, and generally prefer to leave these services alone for as long as possible (this too is probably common in organizations doing this).
Second, we provide general multiuser computing to our users in several forms (general login service, compute login service, and various forms of web service ranging from plain HTML pages through completely custom web servers that they run). Naturally, how people use these services and what they run varies widely; we frequently get requests to install various bits of open source software that people want, for example. Our users obviously are pretty exposed to what OS and software we're running, and we couldn't make significant changes in it without serious disruption (even mild changes like Ubuntu version upgrades can be disruptive). Our users also care a fair bit about having current or relatively current software in this environment (for a wide variety of open source software). My impression is that we are one of the few environments left that provides this sort of computing.
Finally, we run a certain number of custom services and applications, both for people inside the department and for people outside it. Some of these services are developed by the same sysadmins who run the hosts they're on, but others are increasingly going to have separate developers and sysadmin people (these are generally the complicated applications). Users of these systems aren't exposed to the backend details, but obviously the developers are, since they have to write code for some deployment environment. The developers probably care (to some extent) about working with commonly chosen environments (eg Linux and Apache) and with current or reasonably current versions of things like databases, web servers, and so on. This sort of thing is closest to ordinary 'operations' or 'devops' work in the outside world, but is generally less demanding for various reasons (there is very little here that could be described as 'business critical', for example).
(I wouldn't be surprised if some day we wind up with developers who want to deliver their applications as Docker containers or the like, rather than dancing around with asking us to set up a database this way and an Apache/PHP web environment that way and so on.)
So far, nothing in our environment faces high load or high demand, in either our standard services or our custom services. When an unusual surge of demand descends, so far it is not really our problem; we have some responsibility to keep the overall system up and responding, but very little to make sure the specific service affected does not collapse under the load it's experiencing.
In theory we could run these three different sorts of environments using different operating systems and software stacks. In practice we have limited staff and so the needs of the multiuser computing wind up spilling over to affect what we run for the other environments; provided that it works reasonably well, it's simply less work to have a uniform setup across all three environments. Today most of that uniform environment is Ubuntu LTS, because Ubuntu LTS remains the best environment for providing the multiuser computing part of things.