Some notes on using 'join
' to supplement one file with data from another
Recently I said something vaguely grumpy about the venerable Unix
'join
' tool. As
the POSIX specification page for join
will unhelpfully tell you, join
is a 'relational database operator',
which means that it implements the rough equivalent of SQL joins. One
way to use join
is to add additional information for some lines in
your input data.
Suppose, not entirely hypothetically, that we have an input file
(or data stream) that starts with a login name and contains some
additional information, and that for some logins (but not all of
them) we have useful additional data about them in another file.
Using join
, the simple case of this is easy, if the 'master
'
and 'suppl
' files are already sorted:
join -1 1 -2 1 -a 1 master suppl
(I'm sticking to POSIX syntax here. Some versions of join
accept '-j 1' as an alternative to '-1 1 -2 1'.)
Our specific options tell join
to join each line of 'master
'
and 'suppl
' on the first field in each (the login) and print them,
and also print all of the lines from 'master
' that didn't have a
login in 'suppl
' (that's the '-a 1
' argument). For lines with
matching logins, we get all of the fields from 'master
' and then
all of the extra fields from 'suppl
'; for lines from 'master
'
that don't match, we just get the fields from 'master
'. Generally
you'll tell apart which lines got supplemented and which ones didn't
by how many fields they have.
If we want something other than all of the fields in the order that
they are in the existing data source, in theory we have the '-o
<list>
' option to tell join
what fields from each source to
output. However, this option has a little problem, which I will
show you by quoting the important bit from the POSIX standard
(emphasis mine):
The fields specified by list shall be written for all selected output lines. Fields selected by list that do not appear in the input shall be treated as empty output fields.
What that means is that if we're also printing non-joined lines
from our 'master
' file, our '-o' still applies and any fields we
specified from 'suppl
' will be blank and empty (unless you use
'-e'). This can be inconvenient if you were re-ordering fields so
that, for example, a field from 'suppl
' was listed before some
fields from 'master
'. It also means that you want to use '1.1'
to get the login from 'master
', which is always going to be there,
not '2.1', the login from 'suppl
', which is only there some of
the time.
(All of this assumes that your supplementary file is listed second and the master file first.)
On the other hand, using '-e' we can simplify life in some situations.
Suppose that 'suppl
' contains only one additional interesting
piece of information, and it has a default value that you'll use
if 'suppl
' doesn't contain a line for the login. Then if 'master
'
has three fields and 'suppl
' two, we can write:
join -1 1 -2 1 -a 1 -e "$DEFVALUE" -o '1.1,1.2,1.3,2.2' master suppl
Now we don't have to try to tell whether or not a line from 'master
'
was supplemented by counting how many fields it has; everything has
the same number of fields, it's just sometimes the last (supplementary)
field is the default value.
(This is harder to apply if you have multiple fields from the
'suppl
' file, but possibly you can find a 'there is nothing
here' value that works for the rest of your processing.)
|
|