Wandering Thoughts archives


In practice, Go's slices are two different data structures in one

As I've seen them in Go code and used them myself, Go's slices are generally used in two pretty distinctly separate situations. As a result, I believe that many people have two different conceptual models of slices and their behavior, depending on which situation they're using slices in.

The first model and use of slices is as views into a concrete array (or string) that you already have in your code. You're taking an efficient reference to some portion of the array and saying 'here, deal with this chunk' to some piece of code. This is the use of slices that is initially presented in A Tour of Go and that is implicitly used in, for example, io.Reader and io.Writer, both of which are given a reference to an underlying concrete byte array.

The second model and use of slices is as dynamically resizable arrays. This is the usage where, for example, you start with 'var accum []string', and then add things to it with 'accum = append(accum, ...)'. In general, any code using append() is using slices this way, as is code that uses explicit slice literals ('[]string{a, b, ..}'). Dynamically resizable arrays are a very convenient thing, so this sort of slice shows up in lots of Go code.
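
To make the two uses concrete, here is a minimal sketch of both side by side. The function names fill and collect are made up for illustration; nothing here is special API, just ordinary slicing and append().

```go
package main

import "fmt"

// fill is handed a view of someone else's array and writes into it,
// much like an io.Reader writing into the buffer it is given.
func fill(buf []byte, b byte) int {
	for i := range buf {
		buf[i] = b
	}
	return len(buf)
}

// collect uses a slice as a growable array; the backing store is
// allocated and reallocated by the runtime as append() needs it.
func collect(words ...string) []string {
	var accum []string
	for _, w := range words {
		accum = append(accum, w)
	}
	return accum
}

func main() {
	var storage [8]byte // a concrete array that we own
	// First model: 'here, deal with this chunk'.
	n := fill(storage[2:5], 'x')
	fmt.Println(n, storage)

	// Second model: we never see or name the backing array.
	fmt.Println(collect("a", "b", "c"))
}
```

In the first case the caller can point at the array being modified; in the second there is no array anywhere in the source code, only the slice.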

(Part of this is that Go's type system strongly encourages you to use slices instead of arrays, especially in arguments and return values.)

Slices as dynamically resizable arrays actually have an anonymous backing store behind them, but you don't normally think about it; it's materialized, managed, and deallocated for you by the runtime and you can't get a direct reference to it. As we've seen, it's easy to not remember that the second usage of slices is actually a funny, GC-driven special case of the first sort of use. This can lead to leaking memory or corrupting other slices.
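
A small sketch of the corruption case, assuming nothing beyond the language itself (firstTwo and firstTwoCopy are hypothetical helper names). The point is that a slice that looks like an independent resizable array can still be a view onto someone else's backing array:

```go
package main

import "fmt"

// firstTwo returns a view of the first two elements of s. The result
// still shares s's backing array and has spare capacity beyond len 2.
func firstTwo(s []int) []int {
	return s[:2]
}

// firstTwoCopy returns an independent copy instead of a view.
func firstTwoCopy(s []int) []int {
	out := make([]int, 2)
	copy(out, s)
	return out
}

func main() {
	orig := []int{1, 2, 3, 4}
	view := firstTwo(orig)
	view = append(view, 99) // fits in the spare capacity, so no reallocation
	fmt.Println(orig)       // prints [1 2 99 4]: orig[2] was overwritten

	orig2 := []int{1, 2, 3, 4}
	safe := firstTwoCopy(orig2)
	safe = append(safe, 99)  // grows its own backing array
	fmt.Println(orig2, safe) // prints [1 2 3 4] [1 2 99]
}
```

Copying also deals with the memory leak side of this, since a small view keeps the whole of a large backing array alive for as long as the view itself is reachable.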

(It's not quite fair to call the anonymous backing array an implementation detail, because Go explicitly documents it in the language specification. But I think people are often going to wind up working that way, with the slice as the real thing they deal with and the backing array just an implementation detail. This is especially tempting since it works almost all of the time.)

This distinct split in usage and conceptual model (and the glitches that result at the edges of it) is why I've wound up feeling that in practice, Go's slices are two different data structures in one. The two concepts may be implemented with the same language features and runtime mechanisms, but people treat them differently and have different expectations and beliefs about them.

programming/GoSlicesTwoViews written at 00:26:37


Some notes on using Go to check and verify SSH host keys

For reasons beyond the scope of this entry, I recently wrote a Go program to verify the SSH host keys of remote machines, using the golang.org/x/crypto/ssh package. In the process of doing this, I found a number of things in the package's documentation to be unclear or worth noting, so here are some notes about it.

In general, you check the server's host key by setting your own HostKeyCallback function in your ClientConfig structure. If you only want to verify a single host key, you can use FixedHostKey(), but if you want to check the server key against a number of them, you'll need to roll your own callback function. This includes the case where you have both an RSA and an ed25519 key for the remote server and you don't necessarily know which one you'll wind up verifying against.

(You can and should set your preferred order of key types in HostKeyAlgorithms in your ClientConfig, but you may or may not wish to accept multiple key types if you have them. There are potential security considerations because of how SSH host key verification works, and unless you go well out of your way you'll only verify one server host key.)

Although it's not documented that I can see, the way you compare two host keys to see if they're the same is to .Marshal() them to bytes and then compare the bytes. This is what the code for FixedHostKey() does, so I consider it official:

type fixedHostKey struct {
  key PublicKey
}

func (f *fixedHostKey) check(hostname string, remote net.Addr, key PublicKey) error {
  if f.key == nil {
    return fmt.Errorf("ssh: required host key was nil")
  }
  if !bytes.Equal(key.Marshal(), f.key.Marshal()) {
    return fmt.Errorf("ssh: host key mismatch")
  }
  return nil
}

In a pleasing display of sanity, your HostKeyCallback function is only called after the crypto/ssh package has verified that the server can authenticate itself with the asserted host key (ie, that the server knows the corresponding private key).
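
The same Marshal()-and-compare approach extends to checking against several known keys. The following is only a sketch of the comparison logic, written against the standard library alone: marshaler is a local stand-in for the Marshal() method of ssh.PublicKey, and fakeKey and checkAgainst are hypothetical names. In a real program the body of checkAgainst would sit inside your HostKeyCallback function.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// marshaler is a stand-in for the Marshal() method of ssh.PublicKey,
// declared locally so this sketch builds with only the standard library.
type marshaler interface {
	Marshal() []byte
}

// fakeKey is a hypothetical test key; real code would use ssh.PublicKey values.
type fakeKey []byte

func (k fakeKey) Marshal() []byte { return k }

// checkAgainst returns nil if presented matches any of the known keys,
// comparing marshalled bytes the same way FixedHostKey() does.
func checkAgainst(known []marshaler, presented marshaler) error {
	if len(known) == 0 {
		return errors.New("no known host keys to check against")
	}
	got := presented.Marshal()
	for _, k := range known {
		if bytes.Equal(got, k.Marshal()) {
			return nil
		}
	}
	return errors.New("host key mismatch")
}

func main() {
	known := []marshaler{fakeKey("rsa-key-bytes"), fakeKey("ed25519-key-bytes")}
	fmt.Println(checkAgainst(known, fakeKey("ed25519-key-bytes")))
	fmt.Println(checkAgainst(known, fakeKey("someone-else")))
}
```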

Unsurprisingly but a bit unfortunately, crypto/ssh does not separate out the process of using the SSH transport protocol to authenticate the server's host keys and create the encrypted connection from then trying to use that encrypted connection to authenticate as a particular user. This generally means that when you call ssh.NewClientConn() or ssh.Dial(), it's going to fail even if the server's host key is valid. As a result, you need your HostKeyCallback function to save the status of host key verification somewhere where you can recover it afterward, so you can distinguish between the two errors of 'server had a bad host key' and 'server did not let us authenticate with the "none" authentication method'.

(However, you may run into a server that does let you authenticate and so your call will actually succeed. In that case, remember to call .Close() on the SSH Conn that you wind up with in order to shut things down neatly; otherwise you'll have some resource leaks in your Go code.)

Also, note that it's possible for your SSH connection to the server to fail before it gets to host key authentication and thus to never have your HostKeyCallback function get called. For example, the server might not offer any key types that you've put in your HostKeyAlgorithms. As a result, you probably want your HostKeyCallback function to have to affirmatively set something to signal 'server's keys passed verification', instead of having it set a 'server's keys failed verification' flag.

(I almost made this mistake in my own code, which is why I'm bothering to mention it.)
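
One way to structure the 'affirmatively signal success' idea is sketched below. This is not code from my program; hostKeyState, record, and outcome are made-up names, and the error strings are purely illustrative. The important property is that the zero value means 'the callback never ran', so nothing counts as verified unless the callback explicitly said so.

```go
package main

import (
	"errors"
	"fmt"
)

// hostKeyState records what the host key callback saw. The zero value
// means "the callback never ran", so success has to be set explicitly.
type hostKeyState struct {
	called   bool // the callback was invoked at all
	verified bool // the presented key matched one we know
}

// record is what a HostKeyCallback would call with its match result.
func (s *hostKeyState) record(matched bool) error {
	s.called = true
	s.verified = matched
	if !matched {
		return errors.New("host key mismatch")
	}
	return nil
}

// outcome classifies the result once the connection attempt has returned.
// connErr is the error from the connection attempt; it is usually non-nil
// even for good host keys, because user authentication normally fails.
func (s *hostKeyState) outcome(connErr error) string {
	switch {
	case s.verified:
		return "host key verified"
	case s.called:
		return "host key rejected"
	case connErr != nil:
		return "failed before host key check: " + connErr.Error()
	default:
		return "host key never checked"
	}
}

func main() {
	var st hostKeyState
	fmt.Println(st.outcome(errors.New("no common host key algorithm")))

	st.record(true)
	fmt.Println(st.outcome(errors.New("unable to authenticate")))
}
```

With a 'failed' flag instead, the first case in main() would have looked exactly like success.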

As a cautious sysadmin, it's my view that you shouldn't use ssh.Dial() but should instead net.Dial() the net.Conn yourself and then use ssh.NewClientConn(). The problem with relying on ssh.Dial() is that you can't set any sort of timeout for the SSH authentication process; all you have control over is the timeout of the initial TCP connection. You probably don't want your check of SSH host keys to hang if the remote server's SSH daemon is having a bad day, which does happen from time to time. To avoid this, you need to call .SetDeadline() with an appropriate timeout value on the net.Conn after it's connected but before you let the crypto/ssh code take it over.
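
Here is a sketch of the deadline part using only the standard library. dialWithDeadline and stalledServerTimesOut are hypothetical names, and the 200 millisecond timeout is just for the demonstration; in a real checker the returned net.Conn would be passed on to ssh.NewClientConn() instead of being read from directly.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialWithDeadline connects to addr and puts an overall deadline on the
// connection before anything else uses it. In a real checker the returned
// net.Conn would then be handed to ssh.NewClientConn().
func dialWithDeadline(addr string, timeout time.Duration) (net.Conn, error) {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return nil, err
	}
	// Covers every later read and write, including the SSH handshake.
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		conn.Close()
		return nil, err
	}
	return conn, nil
}

// stalledServerTimesOut demonstrates the effect against a local listener
// that lets connections complete but never sends anything.
func stalledServerTimesOut() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()

	conn, err := dialWithDeadline(ln.Addr().String(), 200*time.Millisecond)
	if err != nil {
		return false, err
	}
	defer conn.Close()

	// Without the deadline this read would hang on the silent server.
	_, err = conn.Read(make([]byte, 1))
	nerr, ok := err.(net.Error)
	return ok && nerr.Timeout(), nil
}

func main() {
	fmt.Println(stalledServerTimesOut())
}
```

Note that the deadline is an absolute time for the whole connection, not a per-operation timeout, which is what you want for a one-shot host key check.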

The crypto/ssh package has a convenient function for iteratively parsing a known_hosts file, ssh.ParseKnownHosts(). Unfortunately this function is not suitable for production use by itself, because it completely gives up the moment it encounters even a single significant error in your known_hosts file. This is not how OpenSSH ssh behaves, for example; by and large ssh will parse all valid lines and ignore lines with errors. If you want to duplicate this behavior, you'll need to split your known_hosts file up into lines with bytes.Split(), then feed each non-blank, non-comment line to ParseKnownHosts (if you get an io.EOF error here, it means 'this line isn't formatted like a SSH known hosts line'). You'll want to think about what you do about errors; I accumulate them all, report up to the first N of them, and then only abort if there's been too many.

(In our case we want to keep going if it looks like we've only made a mistake in a line or two, but if it looks like things are badly wrong we're better off giving up entirely.)
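
The line-by-line approach might be sketched like this. To keep the example standard-library only, the per-line parser is passed in as a function; in real use it would be a small wrapper around ssh.ParseKnownHosts(). The names parseTolerantly and threeFields and the maxErrs threshold are all assumptions for illustration, and threeFields is a deliberately crude stand-in parser.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// parseTolerantly splits data into lines and feeds each non-blank,
// non-comment line to parseLine. It collects per-line errors and gives
// up only once more than maxErrs lines have failed.
func parseTolerantly(data []byte, maxErrs int, parseLine func([]byte) error) (good int, errs []error, err error) {
	for i, line := range bytes.Split(data, []byte("\n")) {
		line = bytes.TrimSpace(line)
		if len(line) == 0 || line[0] == '#' {
			continue
		}
		if perr := parseLine(line); perr != nil {
			errs = append(errs, fmt.Errorf("line %d: %w", i+1, perr))
			if len(errs) > maxErrs {
				return good, errs, errors.New("too many bad lines, giving up")
			}
			continue
		}
		good++
	}
	return good, errs, nil
}

// threeFields is a hypothetical stand-in parser: it only checks that a
// line has at least "hosts keytype key" fields.
func threeFields(line []byte) error {
	if len(bytes.Fields(line)) < 3 {
		return errors.New("not formatted like a known_hosts line")
	}
	return nil
}

func main() {
	data := []byte("# comment\nhost1 ssh-ed25519 AAAA\n\nbroken line\nhost2 ssh-rsa BBBB\n")
	good, errs, err := parseTolerantly(data, 5, threeFields)
	fmt.Println(good, len(errs), err)
}
```

The caller gets back both the accumulated per-line errors (to report the first few of) and a final error that is only set when things look badly wrong.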

Sidebar: Collecting SSH server host keys

If all you want to do is collect SSH server host keys for hosts, you need a relatively straightforward variation of this process. You'll repeatedly connect to the server with a different single key type in HostKeyAlgorithms each time, and your HostKeyCallback function will save the host key it gets called with. If I was doing this, I'd save the host key in its []byte marshalled form, but that's probably overkill.

programming/GoSSHHostKeyCheckingNotes written at 23:41:41


Some notes and considerations on SSH host key verification

Suppose, not entirely hypothetically, that you want to verify the SSH host keys of a server and that you're doing so with code that's reasonably under your control (instead of relying on, say, OpenSSH's ssh program). Then there are a number of things that you're going to want to think about because of how the SSH protocol works and how it interacts with security decisions.

The first thing to know is that you can only verify one type of host key in a single connection. As covered in RFC 4253 section 7.1, the client (you) and the server (the remote end) send each other a list of supported host key algorithms, and then the two of you pick one of the supported algorithms and verify the server's key in that algorithm. If you know multiple types of host keys for a server and you want to verify that the server knows all of them, you need to verify each type of key in a separate connection.

In theory, the client controls the preference order of the SSH host key types; you can say that you prefer ed25519 keys to RSA keys and the server should send its ed25519 key instead of its RSA key. In practice, a server can get away with sending you any type of host key that you said you'd accept, even if it's not your preference, because a server is allowed to claim that it doesn't have your preferred sort of host key (but good servers should be obedient to your wishes, because that's what the protocol requires). As a result, if you're verifying host keys you have a security decision to make: are you willing to accept any type of host key you have on file, or if you have your preferred type of host key on file, do you insist that the server present that type of key?

To be concrete, suppose that you have ed25519 and RSA keys for a server, you prefer ed25519 keys, and when you try to verify the server it offers you its RSA key instead of its ed25519 key. You could reject this on the grounds that either the server does not have the ed25519 key it should or that it's not following the protocol specification, or you could accept it because the server has a SSH host key that you have on file for it.

(As far as I can tell, OpenSSH's ssh command behaves the second way; it'll accept an RSA key even if you also have an ed25519 key for the server in your known_hosts.)

If you pick the first approach, you want to configure your SSH connection to the server to only accept a single key type, that being the best key type you have on file for the server. If you pick the second approach, you'll want to list all key types you have, in preference order (I prefer ed25519 to RSA and skip (EC)DSA keys entirely, while the current OpenSSH ssh_config manpage prefers ECDSA to ed25519 to RSA).
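
As a sketch of how the two approaches turn into a host key algorithm list (in Go, the sort of thing you would put in HostKeyAlgorithms), assuming my ed25519-then-RSA preference. The function name and the map representation of 'key types on file' are made up for illustration; the strings are the standard SSH names for these two key types.

```go
package main

import "fmt"

// preference is this sketch's key type ordering, best first.
var preference = []string{"ssh-ed25519", "ssh-rsa"}

// hostKeyAlgorithms returns the list to offer the server, given the key
// types we have on file for it. With strict set, only the single most
// preferred type on file is offered; otherwise all types on file are
// offered, in preference order.
func hostKeyAlgorithms(onFile map[string]bool, strict bool) []string {
	var algos []string
	for _, kt := range preference {
		if !onFile[kt] {
			continue
		}
		algos = append(algos, kt)
		if strict {
			break
		}
	}
	return algos
}

func main() {
	onFile := map[string]bool{"ssh-rsa": true, "ssh-ed25519": true}
	fmt.Println(hostKeyAlgorithms(onFile, true))  // prints [ssh-ed25519]
	fmt.Println(hostKeyAlgorithms(onFile, false)) // prints [ssh-ed25519 ssh-rsa]
}
```

An empty result means you have no usable key on file for the server, which should be treated as a failure rather than as 'accept anything'.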

Under normal circumstances, the server will present only a single host key to be verified (and it certainly can only present a single type of key). This means that if you reject the initial host key the server presents, you will never be called on to verify another type of host key. If the server presents an ed25519 key and you reject it, you'll never get asked to verify an RSA key; the connection just fails. If you wanted to fall back to checking the RSA key in this case, you would have to make a second connection (during which you would only ask for RSA keys). In other words, if the server presents a key it must be correct. With straightforward code, your condition is not 'the server passes if it can eventually present any key that you know', your condition is 'the server passes if the first and only key it presents is one you know'.

PS: If you want to match the behavior of OpenSSH's ssh command, I think you're going to need to do some experimentation with how it actually behaves in various situations. I'm sure that I don't fully understand it myself. Also, you don't necessarily want to imitate ssh here; it doesn't necessarily make the most secure choices. For instance, ssh will happily accept a known_hosts file where a server has multiple keys of a given type, and pass the server if it presents a key that matches any one of them.

Sidebar: How a server might not know some of its host keys

The short version is re-provisioning servers. If you generate or record a server's host key of a given type, you need to also make sure that the server is (re-)provisioned with that key when it gets set up. If you miss a key type, you'll wind up with the server generating and presenting a new key of that type. This has happened to us every so often; for example, we missed properly re-provisioning ed25519 keys on Ubuntu 14.04 machines for a while.

tech/SSHHostKeyVerificationNotes written at 23:38:02

My new Linux office workstation for fall 2017

My past two generations of office Linux desktops have been identical to my home machines, and when I wrote up my planned new home machine I expected that to be the case for my next work machine as well (we have some spare money and my work machine is six years old, so replacing it was always in the plans). It turns out that this is not going to be the case this time around; to my surprise and for reasons beyond the scope of this entry, my next office machine is going to be AMD Ryzen based.

(It turns out that I was wrong on how long it's been since I used AMD CPUs. My current desktop is Intel, but my previous 2006-era desktop was AMD based.)

The definitive parts list for this machine is as follows. Much of it is based on my planned new home machine, but obviously the switch from Intel to AMD required some other changes, some of which are irritating ones.

AMD Ryzen 1800X
Even though we're not going to overclock it, this is still the best Ryzen CPU. I figure that I can live with the 95W TDP and the cooling it requires, since that's what my current desktop has (and this time I'm getting a better CPU cooler than the stock Intel one, so it should run both cooler and quieter).

ASUS Prime X370-Pro motherboard
We recently got another Ryzen-based machine with this motherboard and it seems fine (as a CPU/GPU compute server). The motherboard has a decent assortment of SATA ports, USB, and so on, and really there's not much to say about it. I also looked at the slightly less expensive X370-A, but the X370-Pro has more than enough improvements to strongly prefer it (including two more SATA ports and onboard Intel-based networking instead of Realtek-based).

It does come with built in colourful LED lighting, which looks a bit odd in the machine in our server room. I'll live with it.

(This motherboard is mostly an improvement on the Intel version since it has more SATA ports, although I believe it has one less M.2 NVME port. But with two x16 PCIE slots, you can fix that with an add-on card.)

2x16 GB DDR4-2400 Kingston ECC ValueRAM
Two DIMMs is what you want on Ryzens today. We're using ECC RAM basically because we can; it's available and is only a bit more expensive than non-ECC RAM, runs fast enough, and is supported to at least some degree by the motherboard. We don't know if it will correct any errors, but probably it will.

(You can't get single-rank 16GB DIMMs, so that this ECC RAM is double-rank is not a drawback.)

The RAM speed issues with Ryzen are one of the irritations of building this machine around an AMD CPU instead of an Intel one. It may never be upgraded to 64 GB RAM over its lifetime (which will probably be at least five years).

Noctua NH-U12-SE-AM4 CPU cooler
We need some cooler for the Ryzen 1800X (since it doesn't come with one). These are well reviewed as both effective and quiet, and the first Ryzen machine we got has a Noctua cooler as well (although a different one).

Gigabyte Radeon RX 550 2GB video card
That I need a graphics card is one of the irritations of Ryzens. Needing a discrete graphics card means an AMD/ATI card right now, and I wanted one with a reasonably modern graphics architecture (and I needed one with at least two digital video outputs, since I have dual monitors). I sort of threw darts here, but reviewers seem to say that this card should be quiet under normal use.

As a Linux user I don't normally stress my graphics, but I expect to have to run Wayland by the end of the lifetime of this machine and I suspect that it will want something better than a vintage 2011 chipset. A modern Intel integrated GPU would likely have been fine, but Ryzens don't have integrated graphics so I have to go with a separate card.

(The Prime X370-Pro has onboard HDMI and DisplayPort connectors, but a footnote in the specifications notes that they only do anything if you have an Athlon CPU with integrated graphics. This disappointed me when I read it carefully, because at first I thought I was going to get to skip a separate video card.)

Commentary on my planned home machine pushed me to a better PSU than I initially put in that machine's parts list. Going to 550W buys me some margin for increased power needs for things like a more powerful GPU, if I ever need it.

(There are vaguely plausible reasons I might want to temporarily put in a GPU capable of running things like CUDA or Tensorflow. Some day we may need to know more about them than we currently do, since our researchers are increasingly interested in GPU computing.)

Fractal Design Define R5 case
All of the reasons I originally had for my home machine apply just as much for my work machine. I'm actively looking forward to having enough drive bays (and SATA ports) to temporarily throw hard drives into my case for testing purposes.

This is an indulgence, but it's an inexpensive one, I do actually burn DVDs at work every so often, and the motherboard has 8 SATA ports so I can actually connect this up all the time.

Unlike my still-theoretical new home machine (which is now unlikely to materialize before the start of next year at the earliest), the parts for my new office machine have all been ordered, so this is final. We're going to assemble it ourselves (by which I mean that I'm going to, possibly with some assistance from my co-workers if I run into problems).

On the bright side of not doing anything about a new home machine, now I'm going to get experience with a bunch of the parts I was planning to use in it (and with assembling a modern PC). If I decide I dislike the case or whatever for some reason, well, now I can look for another one.

(However, there's not much chance that I'll change my mind on using an Intel CPU in my new home machine even if this AMD-based one goes well. The 1800X is a more expensive CPU, although not as much so as I was expecting, and then there's the need for a GPU and the whole set of issues with memory and so on. Plus I remain more interested in single-thread CPU performance in my home usage. Still, I could wind up surprising myself here, especially if ECC turns out to be genuinely useful. Genuinely useful ECC would be a bit disturbing, of course, since that implies that I'd be seeing single-bit RAM errors far more than I think I should be.)

linux/WorkMachine2017 written at 01:13:27


I'm basically giving up on syslog priorities

I was recently writing a program where I was logging things to syslog, because that's our default way of collecting and handling logs. For reasons beyond the scope of this entry I was writing my program in Go, and unfortunately Go's standard syslog package makes it relatively awkward to deal with varying syslog priorities. My first pass at the program dutifully slogged through the messy hoops to send various different messages with different priorities, going from info for routine events, to err for reporting significant but expected issues, and ending up at alert for things like 'a configuration file is broken and I can't do anything'. After staring at the resulting code for a while with increasingly unhappy feelings, I ripped all of it out in favour of a much simpler use of basic Go logging that syslogged everything at priority info.
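
For illustration, the 'everything at priority info' approach can be about this small. This is a sketch rather than my actual code: newLogger is a made-up name, the daemon facility and the program tag are assumptions, and the fallback to standard error is only there so the example still runs on a machine without a reachable syslog.

```go
package main

import (
	"log"
	"log/syslog"
	"os"
)

// newLogger returns a standard *log.Logger that sends everything to
// syslog at priority info. If syslog can't be reached (for example in a
// container), it falls back to standard error so the program still logs.
func newLogger(tag string) *log.Logger {
	// LOG_DAEMON is an assumed facility; pick whichever suits your setup.
	w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, tag)
	if err != nil {
		return log.New(os.Stderr, tag+": ", log.LstdFlags)
	}
	// No prefix or timestamp flags; syslog adds its own tag and time.
	return log.New(w, "", 0)
}

func main() {
	logger := newLogger("hostkeycheck") // hypothetical program name
	// Routine and serious messages go through the same call.
	logger.Printf("did my thing with client machine %s", "example")
	logger.Printf("configuration file %s is broken", "/etc/example.conf")
}
```

After this, the rest of the program only ever deals with an ordinary log.Logger and never mentions priorities at all.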

At a theoretical level, this is clearly morally wrong. Syslog priorities have meanings and the various sorts of messages my program can generate are definitely of different importance to us; for example, we care far more about 'a configuration file is broken' than 'I did my thing with client machine <X>'. At a practical level, though, syslog priorities have become irrelevant and thus unimportant. For a start, we make almost no attempt to have our central syslog server split messages up based on their priority. The most we ever look at is different syslog facilities, and that's only because it helps reduce the amount of messages to sift through. We have one file that just gets everything (we call it allmessages), and often we just go look or search there for whatever we're interested in.

In my view there are two pragmatic reasons we've wound up in this situation. First, the priority that a particular message of interest is logged at is something we'd have to actively remember in order for it to be of use. Carefully separating out the priorities into different files only actually helps us if we can remember that we want to look at, say, all.alert for important messages from our programs. In practice we can barely remember which syslog facility most things use, which is one reason we often just look at allmessages.

More importantly, we're mostly looking at syslog messages from software we didn't write and it turns out that what syslog priorities get used are both unpredictable and fairly random. Some programs dump things we want to know all the way down at priority debug; others spray unimportant issues (or what we consider unimportant) over nominally high priorities like err or even crit. This effectively contaminates most syslog priorities with a mixture of messages we care about and messages we don't, and also makes it very hard to predict what priority we should look at. We're basically down to trying to remember that program <X> probably logs the things we care about at priority <Y>. There are a bunch of program <X>s and in practice it's not worth trying to remember how they all behave (and they can change their minds from version to version, and we may have both versions on our servers on different OSes).

(There is a similar but somewhat smaller issue with syslog facilities, which is one reason we use allmessages so much. A good illustration of this is trying to predict or remember which messages from which programs will wind up in facility auth and which wind up in authpriv.)

This whole muddle of syslog priority usage is unfortunate but probably inevitable. The end result is that syslog priorities have become relatively meaningless and so there's no real harm in me giving up on them and logging everything at one level. It's much more important to capture useful information that we'll want for troubleshooting than to worry about what exact priority it should be recorded at.

(There's also an argument that fine-grained priority levels are the wrong approach anyway and you have maybe three or four real priority levels at most. Some people would say even less, but I'm a sysadmin and biased.)

sysadmin/SyslogPrioritiesGivingUp written at 23:23:03


We're broadly switching to synchronizing time with systemd's timesyncd

Every so often, simply writing an entry causes me to take a closer look at something I hadn't paid much attention to before. I recently wrote a series of entries on my switch from ntpd to chrony on my desktops and why we don't run NTP daemons but instead synchronize time through a cron entry. Our hourly crontab script for time synchronization dates back to at least 2008 and perhaps as early as 2006 and our first Ubuntu 6.06 installs; we've been carrying it forward ever since without thinking about it very much. In particular, we carried it forward into our standard 16.04 installs. When we did this, we didn't really pay attention to the fact that 16.04 is different here, because 16.04 is systemd based and includes systemd's timesyncd time synchronization system. Ubuntu installed and activated systemd-timesyncd (with a stock setup that got time from ntp.ubuntu.com), we installed our hourly crontab script, and nothing exploded so we didn't really pay attention to any of this.

When I wrote my entries, they caused me to start actually noticing systemd-timesyncd and paying some attention to it, which included noticing that it was actually running and synchronizing the time on our servers (which kind of invalidates my casual claim here that our servers were typically less than a millisecond out in an hour, since that was based on ntpdate's reports and I was assuming that there was no other time synchronization going on). Coincidentally, one of my co-workers had also had timesyncd come to his attention recently for reasons outside of the scope of this entry. With timesyncd temporarily in our awareness, my co-workers and I talked over the whole issue and decided that doing time synchronization the official 16.04 systemd way made the most sense.

(Part of it is that we're likely to run into this issue on all future Linuxes we deal with, because systemd is everywhere. CentOS 7 appears to be just a bit too old to have timesyncd, but a future CentOS 8 very likely will, and of course Ubuntu 18.04 will and so on. We could fight city hall, but at a certain point it's less effort to go with the flow.)

In other words, we're switching over to officially using systemd-timesyncd. We were passively using it before without really realizing it since we didn't disable timesyncd, but now we're actively configuring it to use our local time servers instead of Ubuntu's and we're disabling and removing our hourly cron job. I guess we're now running NTP daemons on all our servers after all; not because we need them for any of the reasons I listed, but just because it's the easiest way.

(At the moment we're also using /etc/default/ntpdate (from the Ubuntu ntpdate package) to force an initial synchronization at boot time, or technically when the interface comes up. We'll probably keep doing this unless timesyncd picks up good explicit support for initially force-setting the system time; when our machines boot and get on the network, we want them to immediately jump their time to whatever we currently think it is.)

linux/SwitchingToTimesyncd written at 21:37:12

The cost of memory access across a NUMA machine can (probably) matter

We recently had an interesting performance issue reported to us by a researcher here. We have a number of compute machines, none of them terribly recent; some of them are general access and some of them can be booked for exclusive usage. The researcher had a single-core job (I believe using R) that used 50 GB or more of RAM. They first did some computing on a general-access compute server with Xeon E5-2680s and 96 GB of RAM, then booked one of our other servers with Xeon X6550s and 256 GB of RAM to do more work on (possibly work that consumed significantly more RAM). Unfortunately they discovered that the server they'd booked was massively slower for their job, despite having much more memory.

We don't know for sure what was going on, but our leading theory is NUMA memory access effects because the two servers have significantly different NUMA memory hierarchies. In fact they are the two example servers from my entry on getting NUMA information from Linux. The general access server had two sockets for 48 GB of RAM per socket, while the bookable compute server with 256 GB of RAM had eight sockets and so only 32 GB of RAM per socket. To add to the pain, the high-memory server also appears to have a higher relative cost for access to the memory of almost all of the other sockets. So on the 256 GB machine, memory access was likely going to other NUMA nodes significantly more frequently and then being slower to boot.

Having said that, I just investigated and there's another difference; the 96 GB machine has DDR3 1600 MHz RAM, while the 256 GB machine has DDR3 RAM at 1333 MHz (yes, they're old machines). This may well have contributed to any RAM-related slowdown and makes me glad that I checked; I don't usually even consider RAM module speeds, but if we think there's a RAM-related performance issue it's another thing to consider.

I found the whole experience to be interesting because it pointed out a blind spot in my usual thinking. Before the issue came up, I just assumed that a machine with more memory and more CPUs would be better, and if it wasn't better it would be because of CPU issues (here they're apparently generally comparable). That NUMA layout (and perhaps RAM speed) made the 'big' machine substantially worse was a surprise. I'm going to have to remember this for the future.

PS: The good news is that we had another two-socket E5-2680 machine with 256 GB that the researcher could use, and I believe they're happy with its performance. And with 128 GB of RAM per socket, they can fit even quite large R processes into a single socket's memory.

tech/NUMAMemoryCanMatter written at 00:07:52


Sometimes the right thing to do about a spate of spam is nothing (probably)

We have a program to capture information about what sort of email attachments our users get. As part of its operation it tries to peer inside various types of archive files, because you can find suspicious things there (and, of course, outright bad things, some of them surprising). This program is written in Python, which means that its ability to peer inside types of archive files is limited to what I can find in convenient Python packages. One of the archive formats that it can't look inside right now is .7z files.

Stretching through August, September, and October we received a drizzle of email messages with .7z attachments that our commercial anti-spam system labeled as various sorts of 7z-based malware (a typical identification was CXmail/7ZDl-B). I rather suspect that if our program had the ability to peer into 7z archives it would have found one of the file types that we now block, such as .exe files. While the drizzle was coming in, it was frustrating to sit there with no visibility inside these 7z archives. I was definitely tempted by the idea of using one somewhat complicated option to add 7z support to the program.

I didn't, though. I sat on my hands. And now the drizzle of these 7z-attachment malware emails has gone away (we haven't seen any for weeks now). I'm pretty sure that I made the right decision when I decided on inaction. There are a number of reasons for this, but the narrow tactical one is simply that this format of malware appears to have been temporary (and our commercial anti-spam system was doing okay against it). Waiting it out meant that I spent no time and effort on building an essentially permanent feature to deal with a temporary issue.

(I admit that I was also concerned about security risks in libarchive, since parsing file formats is a risk area.)

It's become my view that however tempting it is to jump to doing something every time I see spam, it's a bad practice in the long run. At a minimum I should make sure that whatever I'm planning to do will have a long useful lifetime, which requires that the spam it's targeting to also have a long lifetime. Adding features in response to temporary spammer and malware behaviors is not a win.

(The other thought is that although some things are technically nifty, maybe there is a better way, such as just rejecting all .7z files. This would need some study to see how prevalent probably legitimate .7z files are, but that's why we're gathering the attachment information in the first place.)

PS: Yes, this malware could come back too; that's why this is only probably the right decision.

spam/NotReactingToTemporarySpam written at 01:31:29


Code stability in my one Django web application

We have one Django web application, a system for automating the handling of much of our new Unix account requests. It was started in early 2011 (using Django 1.2) and I did a retrospective at the end of 2014 where I called it a faithful web app, one that had just kept on quietly working without problems. That's continued through to today; the app needs no routine attention, although every so often I tweak it to better handle an obscure situation.

One of the interesting aspects of that quiet stability is the relative stability of the application's Python code over those nearly six years so far. There are web frameworks where in six years you'd need to significantly rework and restructure your code to deal with changing APIs and approaches. For us, Django hasn't been one of them. Although we're not quite current on Django versions, we're not that far back, yet much of the code is basically the same (or literally the same) as it started out all those years ago. I'm pretty sure that almost all of our model and view code is untouched over that time, and I think a lot of our templates are untouched or only minorly changed.

However, this is not a complete picture of code churn in our app, because there have been Django changes over that time in areas such as routing, command argument processing, template processing, and project structure. These changes have forced code changes in the areas of our app that deal with such things (and the change in project structure eventually forced a massive renaming of files when we went to Django 1.9). While this sounds kind of bad, I've wound up considering all of them to be relatively peripheral. In a way, all of the code involved is plumbing and glue. None of it really touches the heart of our web application, which (for us) lives mostly in the models and views and somewhat in the core logic of the templates. Django has been very good about keeping that core code from needing any substantive changes. We still validate form submissions and generate views and process model data in basically the same way we did in 2011, and all of that is what I think of as the hard stuff.

(Although I haven't measured, I think it's also most of the app's code by line count.)

This code stability is one reason why Django upgrades have been somewhat painful but not deeply painful. If we'd needed major code restructuring, well, I'd probably have done it eventually because we might have had no choice, but we'd have likely updated Django versions more sporadically than we have so far.

PS: Although Django is going from version 1.11 to version 2.0 in the next release, the Django people say that this shouldn't be any more of an upgrade than usual. And speaking of that, I should get working on updating us to 1.11, since security updates for 1.10 will end soon (if they haven't already).

python/DjangoAppCodeStability written at 23:13:02

The dig program now needs some additional options for useful DNS server testing

I've been using the venerable dig program for a long time as my primary tool to diagnose odd name server behavior. Recently, I've discovered that I need to start using some additional options in order for it to make useful tests, where by 'useful tests' I mean that dig's results correspond to results I would get through a real DNS server such as Unbound.

(Generally my first test with DNS issues is just to query my local Unbound server, but if I want to figure out why that failed I need some tool that will let me find out specific details about what didn't work.)

For some time now I've known that some nameservers reject your queries if you ask for recursive lookups, so I try to use +norecurs. In exploring an issue today, I discovered that some nameservers also don't respond if you ask for some EDNS options, which, it turns out, dig apparently now sets by default. Specifically, they don't respond to DNS queries that include an EDNS COOKIE option, although they will respond to queries that are merely EDNS ones without the COOKIE option or any other options.

(Some experimentation with dig suggests that including any EDNS option causes these DNS servers to not respond. I tried both +nocookie +nsid and +nocookie +expire, and neither got a response.)

This means that for testing I now want to use 'dig +norecurs +nocookie', at least. It's possible that I want to go all the way to 'dig +norecurs +noedns', although that may be sufficiently different from what modern DNS servers send that I'll get failures when a real DNS server would succeed. I expect that I'm going to want to wrap all of this in a script, because otherwise I'll never remember to set all of the switches all of the time and I'll sometimes get mysterious failures.

(Some experimentation suggests that my Unbound setup sends EDNS0 queries with the 'DNSSEC Okay' bit set and no EDNS options, which would be 'dig +norecurs +nocookie +dnssec' if I'm understanding the dig manpage correctly. These options appear to produce DNS queries that the balky DNS server will respond to. With three options, I definitely want to wrap this in a script.)
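A minimal sketch of such a wrapper, written in Python, might look like the following. The three dig options are the ones discussed above; the names DIG_TEST_OPTIONS, build_dig_command, and run_dig are hypothetical, and the sketch assumes that dig is on $PATH.

```python
import subprocess

# The options discussed above: no recursion, no EDNS COOKIE option, and the
# 'DNSSEC Okay' bit set. These are put first so that anything given on the
# command line comes after them.
DIG_TEST_OPTIONS = ["+norecurs", "+nocookie", "+dnssec"]


def build_dig_command(args):
    """Return the full dig argument list for the caller's arguments."""
    return ["dig"] + DIG_TEST_OPTIONS + list(args)


def run_dig(args):
    """Run dig with the testing options and return its exit status."""
    return subprocess.run(build_dig_command(args)).returncode
```

An actual script would finish with something like 'sys.exit(run_dig(sys.argv[1:]))', so that it could be used as a drop-in replacement for typing the dig options by hand.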

What this suggests to me in general is that dig is not going to be the best tool for this sort of thing in the future. The dig people clearly feel free to change its default behavior, and in ways that don't necessarily match what DNS servers do; future versions may include more such changes, causing more silent failures or behavior differences until I notice and carefully read the manpage to find what to turn off in this new version.

(A casual search turns up drill, which is another thing from NLNet Labs, the authors of Unbound and NSD. Like dig, it defaults to 'recursion allowed' queries, but that's probably going to be a common behavior. Drill does have an interesting -T option to do a full trace from the root nameservers on down, bypassing whatever your local DNS resolver may have cached. Unfortunately it doesn't have an option to report the IP address of the DNS server it gets each set of answers from; you have to go all the way to dumping the full queries with -V 5.)

sysadmin/DigOptionsForUsefulTests written at 02:20:26

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.