Some notes on using 'join' to supplement one file with data from another

May 9, 2025

Recently I said something vaguely grumpy about the venerable Unix 'join' tool. As the POSIX specification page for join will unhelpfully tell you, join is a 'relational database operator', which means that it implements the rough equivalent of SQL joins. One way to use join is to add additional information for some lines in your input data.

Suppose, not entirely hypothetically, that we have an input file (or data stream) that starts with a login name and contains some additional information, and that for some logins (but not all of them) we have useful additional data about them in another file. Using join, the simple case of this is easy, if the 'master' and 'suppl' files are already sorted:

join -1 1 -2 1 -a 1 master suppl

(I'm sticking to POSIX syntax here. Some versions of join accept '-j 1' as an alternative to '-1 1 -2 1'.)

Our specific options tell join to join each line of 'master' and 'suppl' on the first field in each (the login) and print them, and also print all of the lines from 'master' that didn't have a login in 'suppl' (that's the '-a 1' argument). For lines with matching logins, we get all of the fields from 'master' and then all of the extra fields from 'suppl'; for lines from 'master' that don't match, we just get the fields from 'master'. Generally you'll tell apart which lines got supplemented and which ones didn't by how many fields they have.

If we want something other than all of the fields in the order that they are in the existing data source, in theory we have the '-o <list>' option to tell join what fields from each source to output. However, this option has a little problem, which I will show you by quoting the important bit from the POSIX standard (emphasis mine):

The fields specified by list shall be written for all selected output lines. Fields selected by list that do not appear in the input shall be treated as empty output fields.

What that means is that if we're also printing non-joined lines from our 'master' file, our '-o' still applies and any fields we specified from 'suppl' will be blank and empty (unless you use '-e'). This can be inconvenient if you were re-ordering fields so that, for example, a field from 'suppl' was listed before some fields from 'master'. It also means that you want to use '1.1' to get the login from 'master', which is always going to be there, not '2.1', the login from 'suppl', which is only there some of the time.

(All of this assumes that your supplementary file is listed second and the master file first.)

On the other hand, using '-e' we can simplify life in some situations. Suppose that 'suppl' contains only one additional interesting piece of information, and it has a default value that you'll use if 'suppl' doesn't contain a line for the login. Then if 'master' has three fields and 'suppl' two, we can write:

join -1 1 -2 1 -a 1 -e "$DEFVALUE" -o '1.1,1.2,1.3,2.2' master suppl

Now we don't have to try to tell whether or not a line from 'master' was supplemented by counting how many fields it has; everything has the same number of fields, it's just sometimes the last (supplementary) field is the default value.

(This is harder to apply if you have multiple fields from the 'suppl' file, but possibly you can find a 'there is nothing here' value that works for the rest of your processing.)

Written on 09 May 2025.
« In Apache, using OIDC instead of SAML makes for easier testing
Using WireGuard seriously as a mesh network needs a provisioning system »

Page tools: View Source.
Search:
Login: Password:

Last modified: Fri May 9 22:45:37 2025
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.