2024-08-07
Maybe understanding Crowdstrike's argument arity problem
Crowdstrike recently released an "External Technical Root Cause Analysis" [PDF] (via) for their recent extreme failure. The writeup is rather unclear about what exactly happened, but I think I understand it and if I do, it's an uncomfortably easy programming mistake to make. So here is my version of the core programming issue.
Part of Crowdstrike's agent is a signature matching system, where they match signature patterns (templates) against what's going on. There are different types of patterns (template types) depending on what sort of activity the signature pattern is matching. Rather than have one big regular expression or equivalent in each pattern that matches against all data for the activity, all smashed together and encoded somehow, Crowdstrike's code breaks this out so pattern types have some number of fields (each of them a separate regular expression or equivalent) and are supplied with matching input values (or input data) extracted from the activity in order to do the field by field matching.
(You could invert this so that during the matching of each pattern (template), the matching code calls back to obtain the input values from the activity, but since you can have a lot of different templates for a given template type, it's more efficient to extract the data once and then reuse it across all of the templates of that type. You can still do this with the call back pattern, but it's more complicated.)
For whatever reason, the input values were passed not as separate arguments but as a single variable size array, the input data array (although this might not have been as a literal array but instead as varargs function arguments). It's possible that the core matching code was in fact generic; it was given a template with some number of fields and an input data array with some number of fields and it walked through matching them up. If all of the field matching genuinely is text regular expression matches, this wouldn't be a crazy way to structure the code. You'd have one general matching system plus a collection of template type specific code to that was passed information about the activity and had the job of pulling data out of it and populating the input data array for its patterns.
The mismatch occurred when the IPC templates specified and matched against 21 fields, but the input data array only had 20 pieces of data. The issue had previously been masked because all previous IPC templates had a wildcard match for the 21st field and this skipped the actual comparison. In hand-waved Go code, one version of this might look like:
func (t *Template) DoesMatch(eventdata SomeType[]) bool { for i := range t.Fields { if t.Fields[i] != MatchesAll && !FieldMatches(t.Fields[i], eventdata[i]) { return false } } return true }
(To be clear, I'm not claiming that this is what Crowdstrike's code was doing. All we know about the actual code from Crowdstrike's report is that it involved an arity mismatch that wasn't detected at compile time.)
Now, you would certainly like this code to start with a check that 'len(t.Fields) == len(eventdata)' so that it doesn't panic if there's a length mismatch. But you might not put that check in, depending on the surrounding context. And in general this code and design is, in my opinion, not particularly bad or unnatural; you might write something like it in all sorts of situations where you have a variable amount of data that needs to be provided to something (or various different things).
Compilers and language environments are broadly not all that good at statically detecting dynamic arity mismatches; it's a hard problem involving things like dataflow analysis. You can sometimes change the code to avoid being dynamic this way, but it might involve awkward restructuring that leads to other issues. For example, you might have a per template type 'myEventData' structure with named fields, insuring that you can never access an undefined field, but then you probably have to have per template type matching code so that you can create and fill in the right myEventData structure and call everything in a type-safe way (although remember to write tests that verify that every field in the structure is actually filled in). This would involve duplicating the logic of 'check all of these templates of the give type' across these functions, and that code might have complexities of its own.
(Modern language environments have done a lot of work to detect arity problems in things like printf() or its equivalents, but this is generally done by specific knowledge of how to determine the arity (and types) given a format string or the like.)
If you have a master source of truth for what the arity should be (how many fields or arguments are required), such as a data file definition, you can build additional tooling to check that the arity is correct or to automatically generate test cases that will verify that. But this requires having that source of truth and building the generally custom tooling to use it. You also have to insure that the test cases are doing what you think they're doing, which might have been another issue in play here (the people writing the IPC template type code and its tests may not have known that wildcard field matches were short circuited so that they didn't even look at the matching element of the input data array).
(IDLs also don't have a great reputation among programmers, partly because there have been a lot of bad IDLs over the years that people have been forced to deal with.)
There are also intermediate positions between the fully dynamic
arity issue that Crowdstrike apparently had and the fully type safe
version you could write with enough work. But they all involve some
runtime dynamic behavior, which means at least a possibility for
runtime mistakes, which would have to be carefully caught lest they
crash your code. For instance, in Go you could have a per template
type structure, pass it through the common code as an interface{}
,
and then type-assert it back to
the real structure type in your per template type matcher function.
But you'd have to handle the possibility of that type assertion
failing, however unlikely it should be.
PS: None of this should be taken as excusing Crowdstrike. They were writing software that ran in an extremely privileged and important situation, and they should have done that (much) better.
PPS: Neither Go nor Rust by themselves will save you from this specific sort of dynamic arity mistake; they merely change how your code will fail, from an invalid memory dereference crash to a panic in the language runtime or your (macro-expanded) code.