Wandering Thoughts archives

2024-08-30

Mercurial's extdiff extension and reporting filenames in diffs

We have a long standing Mercurial 'alias' (it's not an alias in the Git sense) called 'hg sdiff' that provides diffs in non-context form, because for system administrator usage we're often changing things where the context of standard (context) diffs isn't useful and we want the terseness of standard diffs. For a long time we've had a little irritation, where if you changed only one file in a Mercurial repository 'hg sdiff' wouldn't show you the file name, but if you changed several files, 'hg sdiff' would. Today I dug into what was going on and it is more peculiar than I expected.

Mercurial has no native 'hg diff' option to do non-context diffs, so we actually do this through the standard Extdiff extension, which allows you to use external programs to provide diffs. We configure our 'sdiff' custom extdiff command to run 'diff -Nr', which on our Ubuntu machines uses GNU diffutils. However, GNU Diff has no command line option to print filenames. Nor are the file names printed by the code of the Extdiff extension itself when it runs your external diff command.

The clue to what is going on is in the '-r' argument to diff, or alternately this sentence in the Extdiff documentation:

The external diff programs are called with a configurable set of options and two non-option arguments: paths to directories containing snapshots of files to compare, or paths to the files themselves if only one file is modified.

If we run 'hg sdiff' and only one file has been changed in the repository, Extdiff will invoke 'diff -Nr file1.old file1' (to simplify the arguments a bit), and diff itself won't print any filenames. However, if we run 'hg sdiff' and we've changed two or more files, Extdiff will create two directories and invoke 'diff -Nr directory1 directory2', and then since it's running in recursive mode, diff will print the filenames along with the changes in the files.

(I believe things may be slightly more complex if you run Extdiff to compare two revisions, instead of comparing a revision to the working tree, but even then our 'hg sdiff' doesn't print the file name for single-file changes.)

As far as I can tell there's no Extdiff option to always create directories and do recursive diffs, which would do what we want here. Extdiff does have the '--per-file' option to do the reverse and always pass your external program two files. One could write a cover script for Extdiff usage that detects the two-files case and prints filenames appropriately, but we're going to just continue to live with the situation.

(We're not interested in switching to zero-context unified or context diffs, both of which would print the filenames but be more verbose and potentially confusing.)

MercurialExtdiffAndFilenames written at 22:49:21;

2024-08-19

It's not simple to add function keyword arguments to Go

I recently read An unordered list of things I miss in Go (via). One of those things is 'keyword and default arguments for functions', which in the article was not labeled as requiring a major revision to the language. In one sense this is true, but in another sense it's not, because adding keyword arguments to any compiled language raises ABI questions. This is especially the case in Go, which is generally supposed to be a low-overhead language in terms of how things are stored in memory and passed around in function calls (and Go went through an ABI change not too long ago to enable passing function arguments in registers instead of on the stack).

(I think that keyword arguments with default values don't really raise general API issues, assuming that keyword arguments can only be passed as keywords.)

If Go wants keyword arguments to be useful for small functions that will be called often and should be efficient, such as the article's example of strings.Replace() making the final 'n' argument be a keyword argument with a default value, then calling such a function needs to be roughly as efficient as calling a version without a keyword argument. This implies that such arguments should be passed more or less like non-keyword arguments are today, rather than assembled into a separate mechanism that would necessarily have more overhead; basically you'd treat them as regular arguments that can be specified by name instead of position.

Making default values efficient is tricky and has implications for what sort of default values are allowed. If default values must be compile time constants, every call site can add them in for unspecified keyword arguments. If they can be values only established at runtime (as the initial value of variables can be), then you need some scheme to record these values and then either fetch them at the call sites or pass information on what keyword arguments need default arguments to the function being called. If default values are only determined at the time the function is called, you must do the last, but this will probably be the least efficient option.

All of these choices have implications for ABI stability, which affects what Go shared libraries can be used for and how. For instance, people using shared libraries for Go packages would probably like it if adding a new keyword argument to some function did not break existing compiled code that was calling that function. But it certainly would be simpler if all code had to be compiled together with exactly current information and there was no shared library ABI compatibility of the form that is common for C shared libraries.

(In C, adding an extra argument to a function is an API break, but as mentioned, this isn't necessarily true for adding a keyword argument with a default argument. If it's not an API break, it would be convenient if it's not an ABI break either, but that's challenging, especially for efficient calls even with keyword arguments.)

Generally, it would be best if the (hypothetical) Go language specification for function keyword arguments didn't preclude some of these options, even if the main Go compiler and toolchain was not going to use them. Go already has several implementations, so someday there might be an implementation that values C-like ABI stability. Nor do you want to preclude efficient implementations with no such ABI compatibility, although some of the choices for things like what default argument values are allowed affect that.

(Python doesn't have problems with all of this because it's not trying to be Go's kind of language. Python can define abstract semantics without worrying about efficient implementation or whether some of the language semantics require various inefficiencies. And even then, Python has traps in default argument values, and also.)

GoAndFunctionKeywordArguments written at 22:30:17;

2024-08-18

A downside or two of function keyword arguments (and default values)

Recently I read An unordered list of things I miss in Go (via). One of the things is 'keyword and default arguments for functions', to which I had a reaction:

Hot take: having keyword arguments and optional arguments (and default values) for function calls in your language encourages people to make functions that take too many arguments.

(I can see why Python did it, because on the one hand class/object construction and on the other hand, in a dynamic language it lets you change a function's API without having to hunt down absolutely everyone who calls it and now doesn't have that extra argument. Just make the new argument a keyword one, done.)

Technically you can have keyword arguments without supporting optional arguments or default values, but I don't think very many languages do. The languages I've seen with this generally tend to make keyword arguments optional and then sometimes let you set a default value if no value is supplied (otherwise an unsupplied argument typically gets a zero value specific to its type or the language; for example I believe Emacs Lisp makes them all nil).

I'm sure it's possible to make tasteful, limited use of keyword arguments to good benefit; the article suggests one (in Go) and I'm sure I've seen examples in Python. But it's far more common to create sprawling APIs with a dozen or more parameters, such as Python's subprocess.run() (or the even larger subprocess.Popen()). I'm personally not a fan of such APIs, although this is one of those taste issues that is hard to quantify.

(The usual excuse is that you don't normally use all that many of those keyword arguments. But if you do, things are messy, and the API is messy because all of them exist. There are other patterns that can be used for APIs that intrinsically have lots of options; some are common Go practice. These patterns are more verbose, but in my view that verbosity is not necessarily a bad thing.)

Another unfortunate aspect of optional keyword arguments with default values is that they enable a lazy way of expanding and changing function APIs (and method APIs and constructor APIs and so on). Rather than add or change a regular argument and have to update all of the call sites, you add a new keyword argument with a default value, and then only update or add call sites that need to use the new part of the API. I've done this myself because it was quick and easy, and I've also wound up with the end state of this that you sometimes get, where all of my call sites were using the new API with the new keyword argument, so I could have gotten rid of it as a keyword and made it a regular argument.

I also feel that keyword arguments encourage a certain bad way of evolving APIs (by making this way the easiest way to move forward). In one version, some bad behavior is allowed to linger because everyone is supposed to know to turn it off with a keyword argument. In another version, some incompatibility is eventually allowed to be added because everyone who doesn't want it is supposed to have used a keyword argument to disable it. In either situation, if you don't know about the magic trick with the keyword argument, you lose out.

What you can say for keyword arguments is that taking a lot of keyword arguments is better than taking a lot of non-keyword arguments, but to me this is not much of an endorsement. It's better not to have lots of arguments in general.

AgainstFunctionKeywordArguments written at 15:28:46;

2024-08-07

Maybe understanding Crowdstrike's argument arity problem

Crowdstrike recently released an "External Technical Root Cause Analysis" [PDF] (via) for their recent extreme failure. The writeup is rather unclear about what exactly happened, but I think I understand it and if I do, it's an uncomfortably easy programming mistake to make. So here is my version of the core programming issue.

Part of Crowdstrike's agent is a signature matching system, where they match signature patterns (templates) against what's going on. There are different types of patterns (template types) depending on what sort of activity the signature pattern is matching. Rather than have one big regular expression or equivalent in each pattern that matches against all data for the activity, all smashed together and encoded somehow, Crowdstrike's code breaks this out so pattern types have some number of fields (each of them a separate regular expression or equivalent) and are supplied with matching input values (or input data) extracted from the activity in order to do the field by field matching.

(You could invert this so that during the matching of each pattern (template), the matching code calls back to obtain the input values from the activity, but since you can have a lot of different templates for a given template type, it's more efficient to extract the data once and then reuse it across all of the templates of that type. You can still do this with the call back pattern, but it's more complicated.)

For whatever reason, the input values were passed not as separate arguments but as a single variable size array, the input data array (although this might not have been as a literal array but instead as varargs function arguments). It's possible that the core matching code was in fact generic; it was given a template with some number of fields and an input data array with some number of fields and it walked through matching them up. If all of the field matching genuinely is text regular expression matches, this wouldn't be a crazy way to structure the code. You'd have one general matching system plus a collection of template type specific code to that was passed information about the activity and had the job of pulling data out of it and populating the input data array for its patterns.

The mismatch occurred when the IPC templates specified and matched against 21 fields, but the input data array only had 20 pieces of data. The issue had previously been masked because all previous IPC templates had a wildcard match for the 21st field and this skipped the actual comparison. In hand-waved Go code, one version of this might look like:

func (t *Template) DoesMatch(eventdata SomeType[]) bool {
  for i := range t.Fields {
    if t.Fields[i] != MatchesAll &&
       !FieldMatches(t.Fields[i], eventdata[i]) {
      return false
    }
  }
  return true
}

(To be clear, I'm not claiming that this is what Crowdstrike's code was doing. All we know about the actual code from Crowdstrike's report is that it involved an arity mismatch that wasn't detected at compile time.)

Now, you would certainly like this code to start with a check that 'len(t.Fields) == len(eventdata)' so that it doesn't panic if there's a length mismatch. But you might not put that check in, depending on the surrounding context. And in general this code and design is, in my opinion, not particularly bad or unnatural; you might write something like it in all sorts of situations where you have a variable amount of data that needs to be provided to something (or various different things).

Compilers and language environments are broadly not all that good at statically detecting dynamic arity mismatches; it's a hard problem involving things like dataflow analysis. You can sometimes change the code to avoid being dynamic this way, but it might involve awkward restructuring that leads to other issues. For example, you might have a per template type 'myEventData' structure with named fields, insuring that you can never access an undefined field, but then you probably have to have per template type matching code so that you can create and fill in the right myEventData structure and call everything in a type-safe way (although remember to write tests that verify that every field in the structure is actually filled in). This would involve duplicating the logic of 'check all of these templates of the give type' across these functions, and that code might have complexities of its own.

(Modern language environments have done a lot of work to detect arity problems in things like printf() or its equivalents, but this is generally done by specific knowledge of how to determine the arity (and types) given a format string or the like.)

If you have a master source of truth for what the arity should be (how many fields or arguments are required), such as a data file definition, you can build additional tooling to check that the arity is correct or to automatically generate test cases that will verify that. But this requires having that source of truth and building the generally custom tooling to use it. You also have to insure that the test cases are doing what you think they're doing, which might have been another issue in play here (the people writing the IPC template type code and its tests may not have known that wildcard field matches were short circuited so that they didn't even look at the matching element of the input data array).

(IDLs also don't have a great reputation among programmers, partly because there have been a lot of bad IDLs over the years that people have been forced to deal with.)

There are also intermediate positions between the fully dynamic arity issue that Crowdstrike apparently had and the fully type safe version you could write with enough work. But they all involve some runtime dynamic behavior, which means at least a possibility for runtime mistakes, which would have to be carefully caught lest they crash your code. For instance, in Go you could have a per template type structure, pass it through the common code as an interface{}, and then type-assert it back to the real structure type in your per template type matcher function. But you'd have to handle the possibility of that type assertion failing, however unlikely it should be.

PS: None of this should be taken as excusing Crowdstrike. They were writing software that ran in an extremely privileged and important situation, and they should have done that (much) better.

PPS: Neither Go nor Rust by themselves will save you from this specific sort of dynamic arity mistake; they merely change how your code will fail, from an invalid memory dereference crash to a panic in the language runtime or your (macro-expanded) code.

CrowdstrikeArityProblem written at 23:46:42;


Page tools: See As Normal.
Search:
Login: Password:

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.