Wandering Thoughts archives

2012-04-18

Why you should never use file (or libmagic) to identify files

Every so often, someone needs their program to figure out what sort of thing a file is: is it text, or HTML, or a JPEG image, or PostScript, or whatever? When this happens, it must be very tempting to use the file program to classify things, especially since some versions of file will give you a MIME type for the file (instead of just a text label).

Here, presented in the traditional illustrated form, is why you do not want to do this:

; file example
example: Netpbm PGM image text
; cat example
P238: An introduction

Lorem ipsum dolor sit amet.

File is exceedingly generous with classifications. It does not verify that your target file contains anything like a valid instance of the file type; instead, it only checks for signatures. Over time, lots of people have added lots of signatures for lots of file formats. A certain number of these signatures are very minimal and so will match lots of things. This creates misclassifications, where unknown file formats and plain data match a minimal signature if things are just right (or just wrong, from some perspectives). People and programs that use file to identify and classify files are operating under a mistaken impression of what it is really telling them. File does not say 'this is definitely a <whatever>'; instead it merely says 'this kind of looks like a <whatever> to me'. The difference is important.
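To make the difference concrete, here is a minimal Python sketch of 'looks like' versus 'actually is'. It is not how file or libmagic is implemented; looks_like_pgm is an assumed stand-in for a very minimal signature (the real magic entry may differ between versions), and parse_plain_pgm_header is a simplified, hypothetical check of a plain PGM header (the 'P2' magic followed by width, height, and maximum gray value as decimal numbers).

```python
import re

# The sample file from the transcript above.
SAMPLE = b"P238: An introduction\n\nLorem ipsum dolor sit amet.\n"


def looks_like_pgm(data: bytes) -> bool:
    """Signature-only check. ASSUMPTION: this imitates a minimal
    two-byte signature, not libmagic's actual rule."""
    return data.startswith(b"P2")


def parse_plain_pgm_header(data: bytes):
    """Return (width, height, maxval) if data starts with a plausible
    plain PGM header, otherwise None. Simplified: only looks at the
    first 1024 bytes and does not validate the pixel data."""
    # Header comments run from '#' to the end of the line; drop them.
    header = re.sub(rb"#[^\r\n]*", b"", data[:1024])
    tokens = header.split()
    if len(tokens) < 4 or tokens[0] != b"P2":
        return None
    fields = tokens[1:4]
    # Each field must be a non-empty run of ASCII digits.
    if not all(t.isdigit() for t in fields):
        return None
    width, height, maxval = (int(t) for t in fields)
    if width <= 0 or height <= 0 or not (0 < maxval < 65536):
        return None
    return width, height, maxval


if __name__ == "__main__":
    print("signature match:", looks_like_pgm(SAMPLE))
    print("header parse:   ", parse_plain_pgm_header(SAMPLE))
```

The sample text passes the signature check but fails the header parse, because its first whitespace-separated token is 'P238:' rather than 'P2'. A program that wants to hand a file to an image converter would be better served by something like the second check, or by simply trying to decode the image and falling back when that fails.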

Some of you might think that this is theoretical and will never come up in real life. I regret to inform you that our CUPS print system just did this to someone, causing their plain text files to get fed to an image converter (which choked, meaning no printouts for this person).

(CUPS is probably not literally running file, but these days file is just a wrapper around the libmagic shared library, which sadly exists so that people can use it for exactly this purpose.)

Note that this is not merely a Linux issue. The version of file on, e.g., a not especially current FreeBSD machine will also misidentify this plain text file as a Netpbm PGM image.

unix/NeverUseFile written at 01:55:36

