2016-08-22
A belated realization about 'TLS suicide' and user CGIs et al
As part of my general 'web infosec' reading habit, I recently wound up going through Scott Helme's Using security features to do bad things (via). This discusses a number of ways to use HSTS and HPKP for evil, both for sniffing out what sites people have visited and for damaging sites that you've compromised. It's neat work and I like keeping up on this sort of stuff in general, but initially I didn't think it had any particular relevance to us. Then a little light went on in my mind: user CGI scripts can add HTTP headers to their responses, as can the user run web servers we use to solve the multiuser PHP problem.
We have innocently, ignorantly, and accidentally given everyone on our primary web server the ability to inflict a certain amount of what I'll call 'TLS suicide' on us. With no work at all they can use HSTS to force all future access to any part of our web server to be over TLS for some time (which isn't too big a problem, as we're not likely to drop TLS on the server). With work they can probably inflict some degree of HPKP suicide on us, although this mostly isn't something they could do by accident.
(There doesn't even have to be any malign intent, just people's ignorance or default software configurations. I can easily see someone simply following directions on how to increase the security of their site, directions that include 'add HSTS headers', and not realizing that this affects our entire site instead of just their URLs. HPKP would be harder to do this way, but it might be possible; I'm sure there are going to be canned directions on how to set up HPKP for a site that uses Let's Encrypt certificates.)
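As an illustration of how little it takes, here's a minimal sketch of a user CGI that sets HSTS for the whole origin. This is a made-up example (the max-age value is just a typical one from HSTS guides), but any CGI that emits this header has the same effect:

```shell
#!/bin/sh
# Hypothetical user CGI.  Served over HTTPS, this one header tells
# browsers to require TLS for the *entire* origin for a year, not
# just for this user's URLs.
hsts='Strict-Transport-Security: max-age=31536000'
printf 'Content-Type: text/plain\r\n'
printf '%s\r\n' "$hsts"
printf '\r\n'
printf 'hello from a user CGI\n'
```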
Fortunately I think the fix is simple; we just need to specifically configure Apache to strip out any (user-added) Strict-Transport-Security, Public-Key-Pins, or Public-Key-Pins-Report-Only headers in responses. The standard Apache mod_headers module will do this for you if configured appropriately, and I think the appropriate configuration is just:

```
Header unset Strict-Transport-Security
Header unset Public-Key-Pins
Header unset Public-Key-Pins-Report-Only
```
This prevents accidents, but the bad news is that people can add their own Header directives if you allow FileInfo overrides for .htaccess files. Unfortunately a ton of important Apache options are all under FileInfo; if you turn off allowing FileInfo in .htaccess, you disable things like Redirect and RewriteRule.
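For example, with FileInfo overrides allowed, a hypothetical user .htaccess fragment like this would simply re-add the header after a server-level unset (because per-directory Header directives are merged in after the global ones):

```apache
# Hypothetical user .htaccess fragment; only takes effect if
# FileInfo overrides are allowed for this directory.
Header always set Strict-Transport-Security "max-age=31536000"
```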
Not to mention that there are entirely legitimate reasons to add headers in .htaccess files. Based on carefully reading the Apache documentation on configuration sections, I think we can do what we want here by putting these Header directives inside a <Location> directive, because that will be applied last. In order to be sure, I'm going to have to test this (carefully and probably on a test server, just in case).
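Concretely, I believe the configuration would look something like the following. This is an untested sketch; note that I've used the `always` condition so that the unset also applies to non-2xx responses, which plain `Header unset` may not cover:

```apache
# Strip dangerous TLS headers from all responses.  Wrapping this in
# <Location> should make it merge (and so run) after per-directory
# Header directives from .htaccess files.  Untested sketch.
<Location "/">
    Header always unset Strict-Transport-Security
    Header always unset Public-Key-Pins
    Header always unset Public-Key-Pins-Report-Only
</Location>
```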
(I suppose I can test the behavior here using a harmless X-something header instead of one of the dangerous TLS ones.)
An interesting case of NFS traffic (probably) holding a ZFS snapshot busy
We have a few filesystems on our fileservers that are considered sufficiently important that we take hourly snapshots during the working day. We use a simple naming and expiry scheme for these snapshots, where they're called <Day>-<Hour> (eg Tue-15) and the script simply deletes any old version before creating the new one. Both because it's the default and because it enables self-serve restores, we NFS-export the ZFS snapshots as well as the main filesystem. Recently that script threw up an error:

```
cannot destroy snapshot POOL/h/NNN@Mon-16: dataset is busy
cannot create snapshot 'POOL/h/NNN@Mon-16': dataset already exists
```
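The core of such a rotation script can be sketched as follows. This is an assumed shape (the real script isn't shown here, and the dataset name is hypothetical); the commands are echoed rather than executed so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Sketch of the hourly <Day>-<Hour> snapshot rotation.  The same
# snapshot name recurs exactly one week later, which is when the
# old one gets deleted.
fs="POOL/h/NNN"                   # hypothetical dataset name
snap="$(LC_ALL=C date +%a-%H)"    # eg 'Mon-16'

# Destroy last week's snapshot of this name (if any), then create
# the new one.  The errors above came from the destroy failing with
# 'dataset is busy', which then made the create fail with 'dataset
# already exists'.
echo "zfs destroy ${fs}@${snap}"
echo "zfs snapshot ${fs}@${snap}"
```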
We believe that this ultimately happened because an hour or two beforehand, a runaway IMAP process was traversing its way through that ZFS snapshot via the NFS export. The runaway IMAP process had been terminated well before this, but that may not have mattered; an NFS server doesn't know when an NFS client is done with the filehandles it has requested, so the server has to guess and it may well guess conservatively (saying, for example, 'if I still have them in my server side cache, they're not old enough yet').
This was several weeks ago and the snapshot in question was quietly recycled a week later without any problems, so this did go away after a while. I can't even definitively say that past NFS activity in the snapshot was the problem; we haven't tried to reproduce it, and unfortunately as far as I know OmniOS lacks tools to give us visibility into this sort of thing (fuser reported nothing for the snapshot, for example, which is not surprising; there was no user-level activity on the fileserver that involved the snapshot).
This instance wasn't urgent and went away on its own. I'm not sure what we'd do if that weren't the case, because I don't know if there are any good ways of pushing the kernel to give up things like old(er) NFS filehandles and so on. Shutting down NFS service or rebooting the fileserver would probably do it, but both are rather drastic steps.
(It may be possible to write some DTrace to give us more information about why a dataset is still busy. Or, since DTrace is not always the answer to everything, possibly mdb can give us results too.)