2012-04-18
Why you should never use file (or libmagic) to identify files
Every so often, someone needs their program to figure out what sort of
thing a file is; is it text, or HTML, or a JPEG image, or Postscript, or
whatever? When this happens it must be very tempting to use the file
program to classify things, especially since some versions of file will
give you a MIME type for the file (instead of just a text label).
Here, presented in the traditional illustrated form, is why you do not want to do this:
    ; file example
    example: Netpbm PGM image text
    ; cat example
    P238: An introduction
    Lorem ipsum dolor sit amet.
File is exceedingly generous with classifications. It does not
verify that your target file contains anything like a valid instance of
the file type; instead, it checks for signatures. Over time, lots of
people have added lots of signatures for lots of file formats. A certain
number of these signatures are very minimal and so will match lots of
things. This creates misclassifications where unknown file formats and
plain data can match a minimal signature if things are just right (or
just wrong, from some perspectives).
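To make the difference concrete, here is a minimal Python sketch that contrasts a signature-style check with a check that actually parses the data. It is not how file or libmagic is implemented; it assumes, for illustration only, a rule that looks at nothing but a leading 'P2', and the function names looks_like_pgm and is_valid_plain_pgm are made up for this example. The validator follows the plain (ASCII) PGM layout: the magic number, then width, height, and maximum value, then width times height samples.

```python
def looks_like_pgm(data: bytes) -> bool:
    """Signature-style check (assumed rule): only the first two bytes matter."""
    return data.startswith(b"P2")


def is_valid_plain_pgm(data: bytes) -> bool:
    """Stricter check: parse the header and count the samples."""
    # The magic number must be followed by whitespace.
    if not data.startswith(b"P2") or not data[2:3].isspace():
        return False
    tokens = []
    for line in data[2:].splitlines():
        # Drop '#' comments, then split what is left on whitespace.
        tokens.extend(line.split(b"#", 1)[0].split())
    # Everything after the magic number must be a decimal number.
    if len(tokens) < 3 or not all(t.isdigit() for t in tokens):
        return False
    width, height, maxval = (int(t) for t in tokens[:3])
    if width == 0 or height == 0 or not 0 < maxval < 65536:
        return False
    samples = [int(t) for t in tokens[3:]]
    return len(samples) == width * height and all(s <= maxval for s in samples)


# The plain text file from the example above, and a genuine 2x2 image.
text_file = b"P238: An introduction\nLorem ipsum dolor sit amet.\n"
real_pgm = b"P2\n# tiny\n2 2\n255\n0 255\n255 0\n"

print(looks_like_pgm(text_file), is_valid_plain_pgm(text_file))
print(looks_like_pgm(real_pgm), is_valid_plain_pgm(real_pgm))
```

The text file passes the two-byte check but fails as soon as anything tries to read a width out of '38:'. A real validator would also have to handle the other Netpbm variants; this one covers only the plain PGM case.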
People and programs who use file to identify and classify files are
operating under a mistaken impression of what it really says. File does
not say 'this is definitely a <whatever>'; instead it merely says 'this
kind of looks like a <whatever> to me'. The difference is important.
Some of you might think that this is theoretical and will never come up in real life. I regret to inform you that our CUPS print system just did this to someone, causing their plain text files to get fed to an image converter (which choked, meaning no printouts for this person).
(CUPS is probably not literally running file, but these days file is
just a wrapper around the libmagic shared library. Which exists so that
people can use it for exactly this purpose, sadly.)
Note that this is not merely a Linux issue. The version of file on,
e.g., a not all that current FreeBSD machine will also misidentify this
plaintext file as a Netpbm PGM image.