Why you should never use file (or libmagic) to identify files

April 18, 2012

Every so often, someone needs their program to figure out what sort of thing a file is: is it text, HTML, a JPEG image, PostScript, or something else? When this happens, it must be very tempting to use the file program to classify things, especially since some versions of file will give you a MIME type for the file (instead of just a text label).

Here, presented in the traditional illustrated form, is why you do not want to do this:

; file example
example: Netpbm PGM image text
; cat example
P238: An introduction

Lorem ipsum dolor sit amet.

File is exceedingly generous with classifications. It does not verify that your target file contains anything like a valid instance of the file type; instead, it checks for signatures. Over time, lots of people have added lots of signatures for lots of file formats. A certain number of these signatures are very minimal and so will match lots of things. This creates misclassifications, where unknown file formats and plain data can match a minimal signature if things are just right (or just wrong, from some perspectives).

People and programs that use file to identify and classify files are operating under a mistaken impression of what it really says. File does not say 'this is definitely a <whatever>'; instead it merely says 'this kind of looks like a <whatever> to me'. The difference is important.
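To make the difference concrete, here is a minimal Python sketch that contrasts a signature-only check with a check that actually parses the header. It is an illustration, not how file or libmagic is implemented: the 'starts with P2' rule is an assumed stand-in for a minimal signature, and the function names looks_like_pgm and is_plausible_pgm are hypothetical.

```python
import re


def looks_like_pgm(data: bytes) -> bool:
    """Signature-only check (assumed rule): starts with the bytes 'P2'."""
    return data.startswith(b"P2")


def is_plausible_pgm(data: bytes) -> bool:
    """Stricter check: parse the plain PGM header fields.

    A plain PGM header is the token 'P2' followed by width, height and
    maxval, separated by whitespace; '#' starts a comment to end of line.
    """
    # Only the start of the file matters for the header.
    head = re.sub(rb"#[^\r\n]*", b"", data[:512])
    tokens = head.split()
    if len(tokens) < 4 or tokens[0] != b"P2":
        return False
    try:
        width, height, maxval = (int(t) for t in tokens[1:4])
    except ValueError:
        return False
    return width > 0 and height > 0 and 0 < maxval < 65536


# The plain text file from the example above.
sample = b"P238: An introduction\n\nLorem ipsum dolor sit amet.\n"
# A tiny, genuinely valid plain PGM image.
real = b"P2\n# tiny image\n2 2\n255\n0 255\n255 0\n"

print(looks_like_pgm(sample), is_plausible_pgm(sample))  # True False
print(looks_like_pgm(real), is_plausible_pgm(real))      # True True
```

The signature check accepts both inputs; only the header parse tells the text file apart from the image. Even the stricter check only says the header is plausible, not that the rest of the file is a valid image.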

Some of you might think that this is theoretical and will never come up in real life. I regret to inform you that our CUPS print system just did this to someone, causing their plain text files to get fed to an image converter (which choked, meaning no printouts for this person).

(CUPS is probably not literally running file, but these days file is just a wrapper around the libmagic shared library, which sadly exists so that people can use it for exactly this purpose.)

Note that this is not merely a Linux issue. The version of file on, e.g., a not all that current FreeBSD machine will also misidentify this plain text file as a Netpbm PGM image.
