Wandering Thoughts

2017-12-06

In practice, Go's slices are two different data structures in one

As I've seen them in Go code and used them myself, Go's slices are generally used in two distinctly separate situations. As a result, I believe that many people have two different conceptual models of slices and their behavior, depending on which situation they're using slices in.

The first model and use of slices is as views into a concrete array (or string) that you already have in your code. You're taking an efficient reference to some portion of the array and saying 'here, deal with this chunk' to some piece of code. This is the use of slices that is initially presented in A Tour of Go here and that is implicitly used in, for example, io.Reader and io.Writer, both of which are given a reference to an underlying concrete byte array.
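
In code, this first usage looks like the following little sketch (the names are my own, made up for illustration):

package main

import "fmt"

// zero is handed an efficient view into a portion of an existing
// array; writing through the slice writes to the underlying array.
func zero(chunk []byte) {
  for i := range chunk {
    chunk[i] = 0
  }
}

func main() {
  var buf [8]byte
  for i := range buf {
    buf[i] = byte(i + 1)
  }
  zero(buf[2:5])   // 'here, deal with this chunk'
  fmt.Println(buf) // [1 2 0 0 0 6 7 8]
}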

The second model and use of slices is as dynamically resizable arrays. This is the usage where, for example, you start with 'var accum []string', and then add things to it with 'accum = append(accum, ...)'. In general, any code using append() is using slices this way, as is code that uses explicit slice literals ('[]string{a, b, ..}'). Dynamically resizable arrays are a very convenient thing, so this sort of slice shows up in lots of Go code.

(Part of this is that Go's type system strongly encourages you to use slices instead of arrays, especially in arguments and return values.)

Slices as dynamically resizable arrays actually have an anonymous backing store behind them, but you don't normally think about it; it's materialized, managed, and deallocated for you by the runtime and you can't get a direct reference to it. As we've seen, it's easy to not remember that the second usage of slices is actually a funny, GC-driven special case of the first sort of use. This can lead to leaking memory or corrupting other slices.
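
Here's a little sketch of the corruption case, where two slices quietly share one anonymous backing array (again, my own illustration):

package main

import "fmt"

func main() {
  a := []int{1, 2, 3, 4}
  b := a[:2] // b shares a's anonymous backing array

  // cap(b) is 4, so this append writes into the shared backing
  // array instead of allocating a new one.
  b = append(b, 99)
  fmt.Println(a) // [1 2 99 4]: a[2] has been silently overwritten
}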

(It's not quite fair to call the anonymous backing array an implementation detail, because Go explicitly documents it in the language specification. But I think people are often going to wind up working that way, with the slice as the real thing they deal with and the backing array just an implementation detail. This is especially tempting since it works almost all of the time.)

This distinct split in usage and conceptual model (and the glitches that result at the edges of it) is why I've wound up feeling that in practice, Go's slices are two different data structures in one. The two concepts may be implemented with the same language features and runtime mechanisms, but people treat them differently and have different expectations and beliefs about them.

GoSlicesTwoViews written at 00:26:37

2017-12-04

Some notes on using Go to check and verify SSH host keys

For reasons beyond the scope of this entry, I recently wrote a Go program to verify the SSH host keys of remote machines, using the golang.org/x/crypto/ssh package. In the process of doing this, I found a number of things in the package's documentation to be unclear or worth noting, so here are some notes about it.

In general, you check the server's host key by setting your own HostKeyCallback function in your ClientConfig structure. If you only want to verify a single host key, you can use FixedHostKey(), but if you want to check the server key against a number of them, you'll need to roll your own callback function. This includes the case where you have both an RSA and an ed25519 key for the remote server and you don't necessarily know which one you'll wind up verifying against.

(You can and should set your preferred order of key types in HostKeyAlgorithms in your ClientConfig, but you may or may not wish to accept multiple key types if you have them. There are potential security considerations because of how SSH host key verification works, and unless you go well out of your way you'll only verify one server host key.)

Although it's not documented that I can see, the way you compare two host keys to see if they're the same is to .Marshal() them to bytes and then compare the bytes. This is what the code for FixedHostKey() does, so I consider it official:

type fixedHostKey struct {
  key PublicKey
}

func (f *fixedHostKey) check(hostname string, remote net.Addr, key PublicKey) error {
  if f.key == nil {
    return fmt.Errorf("ssh: required host key was nil")
  }
  if !bytes.Equal(key.Marshal(), f.key.Marshal()) {
    return fmt.Errorf("ssh: host key mismatch")
  }
  return nil
}
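
Extending that to a 'roll your own' callback that accepts any of several keys is straightforward. Here's a minimal sketch (hostKeyChecker is my own name, not the package's):

package sshcheck

import (
  "bytes"
  "fmt"
  "net"

  "golang.org/x/crypto/ssh"
)

// hostKeyChecker returns a HostKeyCallback that accepts any one of
// several known keys, eg both an RSA and an ed25519 key.
func hostKeyChecker(known []ssh.PublicKey) ssh.HostKeyCallback {
  return func(hostname string, remote net.Addr, key ssh.PublicKey) error {
    for _, k := range known {
      if bytes.Equal(key.Marshal(), k.Marshal()) {
        return nil
      }
    }
    return fmt.Errorf("ssh: no known host key for %s matched", hostname)
  }
}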

In a pleasing display of sanity, your HostKeyCallback function is only called after the crypto/ssh package has verified that the server can authenticate itself with the asserted host key (ie, that the server knows the corresponding private key).

Unsurprisingly but a bit unfortunately, crypto/ssh does not separate out the process of using the SSH transport protocol to authenticate the server's host keys and create the encrypted connection from then trying to use that encrypted connection to authenticate as a particular user. This generally means that when you call ssh.NewClientConn() or ssh.Dial(), it's going to fail even if the server's host key is valid. As a result, you need your HostKeyCallback function to save the status of host key verification somewhere where you can recover it afterward, so you can distinguish between the two errors of 'server had a bad host key' and 'server did not let us authenticate with the "none" authentication method'.

(However, you may run into a server that does let you authenticate and so your call will actually succeed. In that case, remember to call .Close() on the SSH Conn that you wind up with in order to shut things down neatly; otherwise you'll have some resource leaks in your Go code.)

Also, note that it's possible for your SSH connection to the server to fail before it gets to host key authentication and thus to never have your HostKeyCallback function get called. For example, the server might not offer any key types that you've put in your HostKeyAlgorithms. As a result, you probably want your HostKeyCallback function to have to affirmatively set something to signal 'server's keys passed verification', instead of having it set a 'server's keys failed verification' flag.

(I almost made this mistake in my own code, which is why I'm bothering to mention it.)
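
Putting those last two points together, here's one sketch of the overall pattern (checkHost and the callback-wrapping approach are my own invention):

package sshcheck

import (
  "net"

  "golang.org/x/crypto/ssh"
)

// checkHost assumes cfg already has a HostKeyCallback set; it wraps
// that callback so success affirmatively sets verified, and reports
// the verification status separately from the overall error.
func checkHost(addr string, cfg *ssh.ClientConfig) (verified bool, err error) {
  inner := cfg.HostKeyCallback
  cfg.HostKeyCallback = func(host string, remote net.Addr, key ssh.PublicKey) error {
    e := inner(host, remote, key)
    if e == nil {
      verified = true // only ever set on success
    }
    return e
  }
  client, err := ssh.Dial("tcp", addr, cfg)
  if err == nil {
    // The server actually let us authenticate; close things neatly.
    client.Close()
  }
  // err may just mean 'none auth was rejected' even though the host
  // key was fine; verified is what tells us about the key itself.
  return verified, err
}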

As a cautious sysadmin, I feel that you shouldn't use ssh.Dial() but should instead net.Dial() the net.Conn yourself and then use ssh.NewClientConn(). The problem with relying on ssh.Dial() is that you can't set any sort of timeout for the SSH authentication process; all you have control over is the timeout of the initial TCP connection. You probably don't want your check of SSH host keys to hang if the remote server's SSH daemon is having a bad day, which does happen from time to time. To avoid this, you need to call .SetDeadline() with an appropriate timeout value on the net.Conn after it's connected but before you let the crypto/ssh code take it over.
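
A sketch of what this looks like (the function name and the specific timeout values are mine):

package sshcheck

import (
  "net"
  "time"

  "golang.org/x/crypto/ssh"
)

// dialWithDeadline bounds the whole SSH handshake with a deadline,
// not just the initial TCP connection.
func dialWithDeadline(addr string, cfg *ssh.ClientConfig) (ssh.Conn, error) {
  conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
  if err != nil {
    return nil, err
  }
  // This deadline covers everything crypto/ssh does from here on.
  conn.SetDeadline(time.Now().Add(30 * time.Second))
  sconn, _, reqs, err := ssh.NewClientConn(conn, addr, cfg)
  if err != nil {
    conn.Close()
    return nil, err
  }
  conn.SetDeadline(time.Time{}) // clear it if you keep the connection
  go ssh.DiscardRequests(reqs)
  return sconn, nil
}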

The crypto/ssh package has a convenient function for iteratively parsing a known_hosts file, ssh.ParseKnownHosts(). Unfortunately this function is not suitable for production use by itself, because it completely gives up the moment it encounters even a single significant error in your known_hosts file. This is not how OpenSSH ssh behaves, for example; by and large ssh will parse all valid lines and ignore lines with errors. If you want to duplicate this behavior, you'll need to split your known_hosts file up into lines with bytes.Split(), then feed each non-blank, non-comment line to ParseKnownHosts (if you get an io.EOF error here, it means 'this line isn't formatted like an SSH known hosts line'). You'll want to think about what you do about errors; I accumulate them all, report up to the first N of them, and then only abort if there have been too many.

(In our case we want to keep going if it looks like we've only made a mistake in a line or two, but if it looks like things are badly wrong we're better off giving up entirely.)
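
Concretely, the loop I'm describing looks something like this sketch (loadKnownHosts is my own name, and the error accumulation is simplified):

package sshcheck

import (
  "bytes"
  "fmt"
  "io"
  "io/ioutil"

  "golang.org/x/crypto/ssh"
)

// loadKnownHosts parses a known_hosts file a line at a time, so one
// bad line doesn't abort the whole parse.
func loadKnownHosts(fname string) ([]ssh.PublicKey, []error) {
  data, err := ioutil.ReadFile(fname)
  if err != nil {
    return nil, []error{err}
  }
  var keys []ssh.PublicKey
  var errs []error
  for n, line := range bytes.Split(data, []byte("\n")) {
    line = bytes.TrimSpace(line)
    if len(line) == 0 || line[0] == '#' {
      continue
    }
    _, _, key, _, _, err := ssh.ParseKnownHosts(line)
    if err == io.EOF {
      err = fmt.Errorf("not formatted like a known_hosts line")
    }
    if err != nil {
      errs = append(errs, fmt.Errorf("line %d: %v", n+1, err))
      continue
    }
    keys = append(keys, key)
  }
  return keys, errs
}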

Sidebar: Collecting SSH server host keys

If all you want to do is collect SSH server host keys for hosts, you need a relatively straightforward variation of this process. You'll repeatedly connect to the server with a different single key type in HostKeyAlgorithms each time, and your HostKeyCallback function will save the host key it gets called with. If I was doing this, I'd save the host key in its []byte marshalled form, but that's probably overkill.
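
A sketch of this collection loop might look like the following (the names are mine, and you may want a different list of key algorithms):

package sshcheck

import (
  "net"
  "time"

  "golang.org/x/crypto/ssh"
)

// collectKeys makes one connection per key type, with the callback
// squirreling away whatever host key we're given each time.
func collectKeys(addr string) map[string][]byte {
  keys := make(map[string][]byte)
  algos := []string{ssh.KeyAlgoED25519, ssh.KeyAlgoECDSA256, ssh.KeyAlgoRSA}
  for _, algo := range algos {
    cfg := &ssh.ClientConfig{
      HostKeyAlgorithms: []string{algo},
      HostKeyCallback: func(host string, remote net.Addr, key ssh.PublicKey) error {
        keys[key.Type()] = key.Marshal() // save in marshalled form
        return nil
      },
    }
    conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
    if err != nil {
      continue
    }
    sconn, _, _, err := ssh.NewClientConn(conn, addr, cfg)
    if err == nil {
      sconn.Close()
    } else {
      conn.Close()
    }
  }
  return keys
}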

GoSSHHostKeyCheckingNotes written at 23:41:41

2017-11-24

Shooting myself in the foot by using exec in a shell script

Once upon a time, we needed a simple shell script that did some setup stuff and then ran a program that did all of the work; this shell script was going to run from crontab once a day. Because I'm sometimes very me, I bummed the shell script like so:

#!/bin/sh
setup
yadda

exec realprog --option some-args

Why not use exec? After all, it's one less process, even if it doesn't really matter.

Later, we needed to run the program a second time with some different arguments, because we'd added some more things for it to process. So we updated the script:

#!/bin/sh
setup
yadda

exec realprog --option some-args
exec realprog --switch other-args

Yeah, there's a little issue here. It probably looks obvious written up here, but I was staring at this script just last week, knowing that for some reason it hadn't generated the relatively infrequent bit of email we'd expected, and I didn't see it then; my eyes and my mind skipped right over the implications of the repeated exec. I only realized what I was seeing today, when I took another, more determined look and this time it worked.

(The implication is that anything after the first exec will never get run, because the first exec replaces the shell that's running the shell script with realprog.)

This was a self-inflicted injury. The script in no way needed the original exec; putting it in there anyway in the name of a trivial optimization led directly to the error in the updated script. If I hadn't tried to be clever, we'd have been better off.
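
The fixed script is the obvious one, with exec used (at most) only on the final command:

#!/bin/sh
setup
yadda

realprog --option some-args
exec realprog --switch other-args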

(I'm not going to call this premature optimization, because it wasn't. I wasn't trying to optimize the script, not particularly; I was just being (too) clever and bumming my shell script for fun because I could.)

PS: In the usual way of things, I felt silly for not having spotted the repeated exec issue last week. But we see what we're looking for, and I think that last week my big question before I checked the script was 'have we updated the script to work on the other-args'. We had a line to run the program on them, so the answer was 'yes' and I stopped there. Today I knew things had once again not happened, so I asked myself 'what on earth could be stopping the second command from running?' and the light suddenly dawned.

BourneMyTooCleverExec written at 00:24:13

2017-11-23

Understanding a tricky case of Bourne shell redirection and command parsing

I recently saw this tweet by Dan Bornstein (via) that poses a Bourne shell puzzle, or more specifically a Bash puzzle:

Shell fun time! Do you know what the following script does? Did you have to run it?

#!/bin/bash
function boop {
  echo "out $1"
  echo "err $1" 1>&2
}

 x='2';   boop A >& $x
 x='2';   boop B >& "$x"
 x=' 2 '; boop C >& $x
 x=' 2 '; boop D >& "$x"

There's a number of interesting things going on here, including a Bash feature I didn't know about and a surprising (to me) difference between Bourne shells, and even between Bash run as bash and run as /bin/sh. So let's talk about what's going on and where Bourne shells differ in their interpretations of this (and why).

(If you want a version of this tweet that you can test in other Bourne shells, change the 'function boop' to 'boop()'.)

To start with, we can mostly ignore the specifics of boop; its job is to produce some output to standard output and to standard error. The '1>&2' just says 'redirect file descriptor 1 to file descriptor 2' (technically it means 'make file descriptor 1 be a duplicate of file descriptor 2'). For standard output (fd 1), the number is optional, so this can also be written as '>&2', which should look sort of familiar given the last four lines.

Now we need to break down each of the last four lines, one by one, starting with the first.

x='2';   boop A >& $x

After expanding $x, this is simply 'boop A >& 2', and so all Bourne shells redirect boop's standard output to standard error. The only mild surprise here is that Bash processes redirections after doing variable expansions, which isn't really a surprise if you read the very large Bash manpage (because it documents the order; see here). I believe that this order is in fact the historical Bourne shell behavior and it also seems to be the POSIX-required order.

x='2';   boop B >& "$x"

After variable expansion, this is 'boop B >& "2"'. In some languages, putting the 2 in quotes would cause the language to see it purely as a string, blocking its interpretation as a file descriptor number. In the Bourne shell, this is not what quotes do; they mostly just block word splitting, which there isn't any of here anyway. So this is the same as 'boop B >& 2', which is the same as the first line and has the same effect.

Now things start getting both interesting and different between shells.

x=' 2 '; boop C >& $x

In Bash, invoked in full Bash mode (and not as /bin/sh), this expands $x without any particular protection from the usual word splitting on whitespace. The application of word splitting trims off the spaces at the start and the end, leaving this as basically 'boop C >& 2 ', which is simply the same as the first two lines. Dash behaves the same way as Bash does.

(The value of $x has spaces in it, but in general when $x is used unquoted those spaces get removed by word splitting after the expansion happens. This reinterpretation of $x's value is required partly because the Bourne shell doesn't have lists.)

If you run Bash as /bin/sh (or in POSIX mode in general), it doesn't do word splitting on the expanded value of $x in this context. This leaves it with what is effectively boop C >& ' 2 '. Bash then interprets this as meaning to write standard output and standard error to a file called ' 2 ', for reasons I'll cover later.

(This mode difference is documented in Chet Ramey's page on Bash's POSIX mode; see number 11.)

Most other shells (apart from Bash and Dash) interpret this as an error, reporting some variant of 'illegal file descriptor name'. These shells don't do word splitting here either, and without word splitting and whitespace trimming we don't have a digit, we have a digit surrounded by spaces, which is not a valid file descriptor number.

(Not doing word splitting here appears to be the correct POSIX behavior, based on a careful reading of portions of 2.7 Redirection. See the third-last paragraph.)

x=' 2 '; boop D >& "$x"

At this point, the expansion of $x is fully protected from having its whitespace stripped; we wind up with effectively boop D >& ' 2 '. Bash in both normal and /bin/sh mode writes both standard output and standard error to a file called ' 2 ', which is probably a good part of the surprise that Dan Bornstein had in mind. Bash interprets things this way because it has a specific feature of its own for redirecting standard output and standard error. Bash recommends writing this as '&>word' but accepts '>&word' as a synonym, with the dry note that if word is a number, 'other redirection operators apply'. Here word is no longer a number (it's a number with spaces on either side), so Bash's special >& feature takes over.

(This interpretation of >&word is permitted by the Single Unix Standard, which says that the behavior in this case is unspecified. Since it's unspecified, Bash is free to do whatever it wants.)

Every other Bourne shell that I have handy to test with (Dash, FreeBSD /bin/sh, OmniOS /bin/sh, FreeBSD pdksh, and official versions of ksh) reports the same 'illegal file descriptor name' as most of them did for the third line (and for the same reason; ' 2 ' is not a file descriptor number). This too is allowed by the Single Unix Standard; since the behavior is unspecified, we're allowed to make it an error.

Sidebar: The oddity of Dash's behavior

I was going to say that Dash is not POSIX compliant here, but that's wrong. POSIX specifically says that '>& word' where the word is not a plain number is unspecified, so basically anything goes in both the third and the fourth line. However, Dash is inconsistent and behaves oddly. The most clear oddity is the results of the following:

x=' 2 1 '; echo hi >& $x

This produces a 'hi' on standard error and nothing else; the trailing '1' in $x has simply vanished.

BourneRedirectionAndQuoting written at 00:40:34

2017-11-20

Major version changes are often partly a signal about support

Recently (for some value of recently) I read The cargo cult of versioning (via). To simplify, one of the things that it advocates is shifting the major version number to being part of the package name:

Next, move the major version to part of the name of a package. "Rails 5.1.4" becomes "Rails-5 (1, 4)". By following Rich Hickey's suggestion above, we also sidestep the question of what the default version should be. There's just no way to refer to a package without its major version.

I am biased to see versioning as ultimately a social thing. Looking at major version number changes through that lens, I feel that one of their social uses is being overlooked by this suggested change.

Many projects take what is effectively the default approach of having a single unified home page, a single repository, and so on, not separate home pages, repositories, and so on for each major version. In these setups, incrementing the major version number of your project and advertising the shift is a strong social signal that past major versions are no longer being supported unless you say otherwise. If you move from mypackage version 1.2.3 to mypackage version 2.0.0, you're signalling not just an API change but also that by default there will not be a mypackage 1.2.4. In fact you're further signalling that people should not build (new) projects using mypackage version 1.2.3; if they do so anyway, they're going to extra work to fish an old tag out of your repository or an old tarball out of your 'archived past versions' download area.

As I see it, this works because people think of there being a single project 'mypackage', not two separate projects 'mypackage version 1' and 'mypackage version 2'. Since there is only one project, it's relatively clear that by default it only has one actively developed major version, the latest, since I don't think people expect most projects to be working in several directions at once.

(Where projects have diverged from this model the result has often been confusion and annoyance. The case I'm closest to is Python, where having both 'Python 2' and 'Python 3' has caused all sorts of problems. But I'm not sure that calling them 'Python-2' and 'Python-3' would have helped. Incompatible major version changes in major projects appear to be a problem, period.)

I initially thought that you'd have problems getting this clear signal in a world where things were 'mypackage-1' and 'mypackage-2' instead of 'mypackage version ...'. Now, though, I'm not so sure. You'd probably have to be more explicit on your project web pages, which might not be such a bad thing, but it might be clear enough to people in general.

MajorVersionSupportSignal written at 23:35:08

2017-10-18

Using Shellcheck is good for me

A few months ago I wrote an entry about my views on Shellcheck where I said that I found it too noisy to be interesting or useful to me. Well, you know what, I have to take that back. What happened is that as I've been writing various shell scripts since then, I've increasingly found myself reaching for Shellcheck as a quick syntax and code check that I could use without trying to run my script. Shellcheck is a great tool for this, and as a bonus it can suggest some simplifications and improvements.

(Perhaps there are other programs that can do the same sort of checking that shellcheck does, but if so I don't think I've run across them yet. The closest I know of is shfmt.)

Yes, Shellcheck is what you could call nitpicky (it's a linter, not just a code checker, so part of its job is making style judgments). But going along with it doesn't hurt (I've yet to find a situation where a warning was actively wrong) and it's easier to spot real problems if 'shellcheck <script>' is otherwise completely silent. I can live with the cost of sprinkling a bunch of quotes over the use of shell variables, and the result is more technically correct even if it's unlikely to ever make a practical difference.
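
To illustrate the sort of quoting nit I mean, here's a contrived fragment; Shellcheck warns about the first ls (SC2086, 'Double quote to prevent globbing and word splitting') and is silent about the second:

#!/bin/sh
ls -ld $1     # unquoted: subject to globbing and word splitting
ls -ld "$1"   # the quoted version that Shellcheck wants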

In other words, using Shellcheck is good for me and my shell scripts even if it can be a bit annoying. Technically more correct is still 'more correct', and Shellcheck is right about the things it complains about regardless of what I think about it.

(With that said, I probably wouldn't bother using Shellcheck and fixing its complaints about unquoted shell variable usage if that was all it did. The key to its success here is that it adds value over and above its nit-picking; that extra value pushes me to use it, and using it pushes me to do the right thing by fixing my variable quoting to be completely correct.)

ShellcheckGoodForMe written at 23:55:25

2017-10-10

An interesting way to leak memory with Go slices

Today I was watching Prashant Varanasi's Go release party talk Analyzing production using Flamegraphs, and starting at around 28 minutes in the talk he covered an interesting and tricky memory leak involving slices. For my own edification, I'm going to write down a version of the memory leak here and describe why it happens.

To start with, the rule of memory leaks in garbage collected languages like Go is that you leak memory by retaining unexpected references to things. The garbage collector will find and free things for you, but only if they're actually unused. If you're retaining a reference to them, they stick around. Sometimes this is ultimately straightforward (perhaps you're deliberately holding on to a small structure but not realizing that it has a reference to a bigger one), but sometimes the retention is hiding in the runtime implementation of something. Which brings us around to slices.

Simplified, the code Prashant was dealing with was maintaining a collection of currently used items in a slice. When an item stopped being used, it was rotated to the end of the slice and then the slice was shrunk by truncating it (maintaining the invariant that the slice only had used items in it). However, shrinking a slice doesn't shrink its backing array; in Go terms, it reduces the length of a slice but not its capacity. With the underlying backing array untouched, that array retained a reference to the theoretically discarded item and all other objects that the item referenced. With a reference retained, even one invisible to the code, the Go garbage collector correctly saw the item as still used. This resulted in a memory leak, where items that the code thought had been discarded weren't actually freed up.
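
Here's my simplified sketch of the pattern (an illustration I made up, not Uber's actual code):

package main

import "fmt"

type Item struct {
  data []byte // imagine this referencing lots of memory
}

var active []*Item

// discard rotates active[i] to the end and truncates, maintaining
// the invariant that active only holds used items.
func discard(i int) {
  last := len(active) - 1
  active[i], active[last] = active[last], active[i]
  // len(active) shrinks, but the backing array still holds a
  // pointer to the discarded Item in its last slot.
  active = active[:last]
}

func main() {
  for i := 0; i < 4; i++ {
    active = append(active, &Item{data: make([]byte, 1<<20)})
  }
  discard(1)
  // Prints '3 4'; that fourth backing array slot still pins an Item.
  fmt.Println(len(active), cap(active))
}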

Now that I've looked at the Go runtime and compiler code and thought about the issue a bit, I've come to the obvious realization that this is a generic issue with any slice truncation. Go never attempts to shrink the backing array of a slice, and it's probably impossible to do so in general since a backing array can be shared by multiple slices or otherwise referenced. This obviously strongly affects slices that refer to things that contain pointers, but it may also matter for slices of plain old data things, especially if they're comparatively big (perhaps you have a slice of Points, with three floats per point).

For slices containing pointers or structures with pointers, the obvious fix (which is the fix adopted in Uber's code) is to nil out the trailing pointer(s) before you truncate the slice. This retains the backing array at full size but discards references to other memory, and it's the other memory where things really leak.
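
In terms of my sketch above, the fix is one extra line:

func discard(i int) {
  last := len(active) - 1
  active[i], active[last] = active[last], active[i]
  active[last] = nil // drop the backing array's reference
  active = active[:last]
}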

For slices where the actual backing array may consume substantial memory, I can think of two things to do here, one specific and one generic. The specific thing is to detect the case of 'truncation to zero size' in your code and specifically nil out the slice itself, instead of just truncating it with a standard slice truncation. The generic thing is to explicitly force a slice copy instead of a mere truncation (as covered in my entry on slice mutability). The drawback here is that you're forcing a copy, which might be much more expensive. You could optimize this by only forcing a copy in situations where the slice capacity is well beyond the new slice's length.
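
Here's a sketch of the generic approach with that optimization, reusing my Item type from above (the threshold is arbitrary):

func shrink(s []*Item) []*Item {
  // Only pay for a copy when the backing array is well over twice
  // as big as it needs to be.
  if cap(s) <= 2*len(s) {
    return s
  }
  ns := make([]*Item, len(s))
  copy(ns, s)
  return ns
}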

Sidebar: Three-index slice truncation considered dangerous (to garbage collection)

Go slice expressions allow a rarely used third index to set the capacity of the new slice in addition to the starting and ending points. You might thus be tempted to use this form to limit the slice as a way of avoiding this garbage collection issue:

slc = slc[:newlen:newlen]

Unfortunately this doesn't do what you want it to and is actively counterproductive. Setting the new slice's capacity doesn't change the underlying backing array in any way or cause Go to allocate a new one, but it does mean that you lose access to information about the array's size (which would otherwise be accessible through the slice's capacity). The one effect it does have is forcing a subsequent append() to reallocate a new backing array.

GoSlicesMemoryLeak written at 00:26:09

2017-10-09

JavaScript as the extension language of the future

One of the things I've been noticing as I vaguely and casually keep up with technology news is that JavaScript seems to be showing up more and more as an extension language, especially in heavy-usage environments. The most recent example is Cloudflare Workers, but there are plenty of other places that support it, such as AWS Lambda. One of the reasons for picking JavaScript here is adequately summarized by Cloudflare:

After looking at many possibilities, we settled on the most ubiquitous language on the web today: JavaScript.

JavaScript is ubiquitous not just in the browser but beyond it. Node.js is popular on the server, Google's V8 JavaScript engine is apparently reasonably easy to embed into other programs, and then there's at least Electron as an environment to build client-side applications on (and if you build your application in JavaScript, you might as well allow people to write plugins in JavaScript). But ubiquity isn't JavaScript's only virtue here; another is that it's generally pretty fast and a lot of people are putting a lot of money into keeping it that way and speeding it up.

(LuaJIT might be pretty fast as well, but Lua(JIT) lacks the ecology of JavaScript, such as NPM, and apparently there are community concerns.)

This momentum in JavaScript's favour seems pretty strong to me as an outside observer, especially since its use in browsers ensures an ongoing supply of people who know how to write JavaScript (and who probably would prefer not to have to learn another language). JavaScript likely isn't the simplest option as an extension language (either to embed or to use), but if you want a powerful, fast language and can deal with embedding V8, you get a lot from using it. There are also alternatives to V8, although I don't know if any of them are particularly small or simple.

(The Gnome people have Gjs, for example, which is now being used as an implementation language for various important Gnome components. As part of that, you write Gnome Shell plugins using JavaScript.)

Will JavaScript start becoming common in smaller scale situations, where today you'd use Lua or Python or the like? Certainly the people who have to write things in the extension language would probably prefer it; for many of them, it's one fewer language to learn. The people maintaining the programs might not want to embed V8 or a similar full-powered engine, but there are probably lighter weight alternatives (there's at least one for Go, for example). These may not support full modern JavaScript, though, which may irritate the users of them (who now have to keep track of who supports what theoretically modern feature).

PS: Another use of JavaScript as an 'extension' language is various NoSQL databases that are sometimes queried by sending them JavaScript code to run instead of SQL statements. That databases are picking JavaScript for this suggests that more and more it's being seen as a kind of universal default language. If you don't have a strong reason to pick another language, go with JavaScript as the default. This is at least convenient for users, and so we may yet get a standard by something between default and acclamation.

JavaScriptExtensionLanguage written at 01:46:27

2017-09-25

What I use printf for when hacking changes into programs

I tweeted:

Once again I've managed to hack a change into a program through brute force, guesswork, determined grepping, & printf. They'll get you far.

First off, when I say that I hacked a change in, I don't mean that I carefully analyzed the program and figured out the correct and elegant place to change the program's behavior to what I wanted. I mean that I found a spot where I could add a couple of lines of code that reset some variables and when the dust settled, it worked. I didn't carefully research my new values for those variables; instead I mostly experimented until things worked. That's why I described this as being done with brute force and guesswork.

The one thing in my list that may stand out is printf (hopefully the uses of grep are pretty obvious; you've got to find things in the source code somehow, since you're certainly not reading it systematically). When I'm hacking a program up like this, the concise way to describe what I'm using printf for is that I use printf to peer into the dynamic behavior of the program.

In theory how the program behaves at runtime is something that you can deduce from understanding the source code, the APIs that it uses, and so on. This sort of understanding is vital if you're actively working on the program and you want relatively clean changes, but it takes time (and you have to actually know the programming languages involved, and so on). When I'm hacking changes into a program, I may not even know the language and I certainly don't have the time and energy to carefully read the code, learn the APIs, and so on. Since I'm not going to deduce this from the source code, I take the brute force approach of modifying the program so that it just tells me things like whether some things are called and what values variables have. In other words, I shove printf calls into various places and see what they report.

I could do the same thing with a debugger, but generally I find printf-based debugging easier and often I'd have to figure out how to hook up a debugger to the program and then make everything work right. For that matter, I may not have handy a debugger that works well with whatever language the program happens to use. Installing and learning a new debugger just to avoid adding some printfs (or the equivalent) is rather a lot of yak shaving.

PrintfToSeeDynamicBehavior written at 00:14:53

2017-09-24

Reading code and seeing what you're biased to see, illustrated

Recently I was reading some C code in systemd, one of the Linux init systems. This code is run in late-stage system shutdown and is responsible for terminating any remaining processes. A simplified version of the code looks like this:

void broadcast_signal(int sig, [...]) {
   [...]
   kill(-1, SIGSTOP);

   killall(sig, pids, send_sighup);

   kill(-1, SIGCONT);
   [...]
}

At this point it's important to note that the killall() function manually scans through all remaining processes. Also, this code is called to send either SIGTERM (plus SIGHUP) or SIGKILL to all or almost all processes.

The use of SIGSTOP and SIGCONT here is a bit unusual, since you don't need to SIGSTOP processes before you kill them (or send them signals in general). When I read this code, what I saw in their use was an ingenious way of avoiding any 'thundering herd' problems when processes started being signalled and dying, so I wrote it up in yesterday's entry. I saw this, I think, partly because I've had experience with thundering herd wakeups in response to processes dying and partly because in our situation, the remaining processes are stalled.

Then in comments on that entry, Davin noted that SIGSTOPing everything first also did something else:

So, I would think it's more likely that the STOP/CONT pair are designed to create a stable process tree which can then be walked to build up a list of processes which actually need to be killed. By STOPping all other processes you prevent them from forking or worse, dying and the process ID being re-used.

If you're manually scanning the process list in order to kill almost everything there, you definitely don't want to miss some processes because they appeared during your scan. Freezing all of the remaining processes so they can't do inconvenient things like fork() thus makes a lot of sense. In fact, it's quite possible that this is the actual reason for the SIGSTOP and SIGCONT code, and that the systemd people consider avoiding any thundering herd problems to be just a side bonus.

When I read the code, I completely missed this use. I knew all of the pieces necessary to see it, but it just didn't occur to me. It took Davin's comment to shift my viewpoint, and I find that sort of fascinating; it's one thing to know intellectually that you can have a too-narrow viewpoint and miss things when reading code, but another thing to experience it.

(I've had the experience where I read code incorrectly, but in this case I was reading the code correctly but missed some of the consequences and their relevance.)

CodeReadingNarrowness written at 00:53:22
