2019-09-25
It's always convenient when malware is clear about its nature (7z edition)
A certain amount of malware these days likes 7z files, or at least things that claim to be 7z files through their file extension. We've been getting a run of malware that claims its file extension is .pdf.7z and that Sophos PureMessage detects as 'CXmail/MalPE-AS', which I suspect means that there's actually a Windows executable in there. We've also got others that are being reported by our attachment logger as simply .7z files (and also, apparently, as MalPE-AS).
All of this caused me to take a look at our logs, where I found some attachments with the listed extension of .exe.7z (which were also detected as MalPE-AS). This is actually quite convenient for us, because we already reject email with .exe attachments. If you're going to helpfully label your attachment as a .exe in some way, well, we'll extend our rejection to cover it too, which we now do.
(We've also decided to reject .pdf.7z attachments. As far as we can tell we don't get any real ones. We're not sure we get any real .7z attachments in general, but rejecting those is currently a little bit more chancy. For various reasons, we will probably be augmenting our attachment logger to try to peer into 7z archives, as it currently does for ZIP and RAR archives.)
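(As an illustration of what that peering might involve, not how our attachment logger actually works: assuming the p7zip '7z' tool is available, its '-slt' listing mode produces machine-parseable 'Path = ...' lines, so something like the following would pull out the member names. Note that the first Path line is the archive itself:

7z l -slt attachment.7z | sed -n 's/^Path = //p'
)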
As a side note, the reason I said that the .7z attachments were 'apparently' detected as MalPE-AS is that all of the email messages with them actually had two attachments:

application/octet-stream; MIME file ext: .7z
application/msword; MIME file ext: .doc; zip exts: .bin .emf .rels .xml[9] none
PureMessage only gives us a report for the entire message, and it reported that these emails have both CXmail/DocDrp-C and CXmail/MalPE-AS. I suspect that DocDrp is the .doc and MalPE is the .7z, but I don't know for sure.
(This would be another example of malware covering its bases.)
Our workaround for Ubuntu 16.04 and 18.04 failing to reliably reboot some of our servers
A few years ago I wrote about how and why systemd on Ubuntu 16.04 couldn't reliably reboot some of our servers. At the time I finished off the entry by suggesting that we'd live with the intermittent failures that caused some of our systems to hang during reboot attempts, forcing us to go power cycle them. Shortly afterward, we changed our minds and decided to work around the situation by resorting to a bigger hammer. These days we use our bigger hammer on both Ubuntu 16.04 and Ubuntu 18.04; the latter may have improved some aspects of the shutdown situation, but our experience is that it hasn't fixed all of them.
The fundamental problem is that systemd can leave descendant processes running even when it has nominally terminated a systemd service, such as Apache, cron, or Exim. These lingering processes are not killed (or attempted to be killed) until very late and can cause a variety of problems during NFS unmounts, turning off swap, or various other portions of system shutdown. To deal with this, we use the big hammer of doing it ourselves; during shutdown, we run a script to kill lingering processes from various service units.
The script has a list of systemd services. For each service, it first looks in the systemd cgroup hierarchy to see if there are still processes associated with the service, by counting how many lines there are in /sys/fs/cgroup/systemd/system.slice/<what>.service/tasks. If there are processes still associated with the service, it kills them with SIGTERM and then SIGKILL (if necessary), using systemd itself to do the work with:
systemctl --kill-who=all --signal=SIG... kill <what>.service
(The actual implementation is slightly more complicated.)
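As a rough sketch of the core loop (this is not our actual script; the service list and the five-second grace period here are illustrative assumptions):

#!/bin/sh
# Sketch: kill lingering processes from a list of services at shutdown.
for svc in cron apache2 exim4 atd slurmd; do
    tasks="/sys/fs/cgroup/systemd/system.slice/$svc.service/tasks"
    [ -e "$tasks" ] || continue
    # Count the processes still in the service's cgroup.
    if [ "$(wc -l < "$tasks")" -gt 0 ]; then
        systemctl --kill-who=all --signal=SIGTERM kill "$svc.service"
        sleep 5
        # Escalate to SIGKILL if anything survived the grace period.
        if [ "$(wc -l < "$tasks")" -gt 0 ]; then
            systemctl --kill-who=all --signal=SIGKILL kill "$svc.service"
        fi
    fi
done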
The script has a bunch of logging to report on whether it had to do anything, what it did, and what the process tree looked like before and after it did various killing (as reported through systemd-cgls, because that will show us what systemd units the stray processes are under).
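(For illustration, capturing that tree from a script might look like the following; the 'logger' tag is an invented example, not what we actually use:

systemd-cgls --no-pager /system.slice 2>&1 | logger -t kill-lingering
)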
All of this is driven by a systemd .service unit with the following relevant bits:
[Unit]
After=remote-fs.target
Before=cron.service apache2.service exim4.service atd.service slurmd.service

[Service]
Type=oneshot
RemainAfterExit=True
ExecStop=/path/to/script
We set After so that our stop action is run before NFS unmounting starts, and Before so that the stop action happens after those listed services are shut down. Not all of those services exist and are enabled on all machines, but listing a Before service that isn't enabled is harmless. The Before list is basically 'what has caused us problems'; we add things to it as we run into problem services.
(Slurmd is a recent addition, for example.)
Right now the list of 'before' services is duplicated between the script and the systemd unit. It feels tempting to try to eliminate that, but on the other hand I'm not sure I want to be introspecting systemd too much during shutdown. We could also try to be more general by sniffing around the cgroup hierarchy to find stray processes from any unit we don't whitelist (or at least any unit that's theoretically been shut down); a sketch of that is below. However, that might not be very useful on modern systems, where 'KillMode=control-group' is the default.
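A sketch of what that more general sniffing could look like; the whitelist here is an invented example, and the tasks files hold PIDs, one per line:

#!/bin/sh
# Sketch: report stray processes in any service cgroup not whitelisted.
for tasks in /sys/fs/cgroup/systemd/system.slice/*.service/tasks; do
    [ -e "$tasks" ] || continue
    unit="$(basename "$(dirname "$tasks")")"
    case "$unit" in
        ssh.service|systemd-journald.service) continue ;;  # assumed whitelist
    esac
    if [ "$(wc -l < "$tasks")" -gt 0 ]; then
        echo "stray processes in $unit:"
        cat "$tasks"
    fi
done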
The good news is that the script's logging suggests that it usually doesn't need to do anything during system shutdown on our 18.04 machines. But usually isn't always, which is what prompted the addition of slurmd.service.
Sidebar: A potential alternate approach
Basically this is making these units behave as if they were set to 'KillMode=control-group' during shutdown. You can change systemd unit properties on the fly and only for the current system boot (with 'systemctl --runtime set-property', which we use for our per-user CPU and memory limits), so perhaps it would work to switch to this KillMode on the relevant service units early in the shutdown process.
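If we tried it, it would presumably look something like the following, although I haven't tested this and whether set-property will accept KillMode at all is an open question (our existing set-property use is only for resource control properties):

systemctl --runtime set-property cron.service KillMode=control-group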
This option didn't even occur to me until I wrote this entry, and in general it seems more uncertain and chancy than just killing things (even if we're killing things indirectly through systemd). But it'd give you a much smaller and simpler script.