2019-11-30
Operating spam and malware filtering is ultimately a social problem
Successfully filtering spam and malware is a technical issue, full of problems like recognizing new sorts of spam and malware, developing recognition rules, and not having your servers eaten by expensive code (even when people send you gigantic files, or malicious ones such as compressed archives that expand hugely or endlessly). However, operating spam and malware filtering is ultimately a social problem, because the people you are doing the filtering for need to be happy with what your system does and how you operate it.
No spam and malware filtering can be perfect, because of the fundamental problem of spam. This means that all filtering in operation is a tradeoff between rejecting good email that looks too suspicious and letting in too much bad email because it doesn't look sufficiently clearly suspicious (or because you can't recognize it yet). Where you set this for various sorts of email ultimately comes down to what your users want and will accept, and also what sorts of email they get from where.
(You may also have to force certain sorts of anti-malware filtering on people regardless of how they feel about it, because the risks are too high and the malware recognition too imperfect. One manifestation of this is how GMail and many other places reject a whole raft of attachment types despite potential valid uses for a number of them. A place with high enough security needs and concerns might reject all Microsoft Office attachments in email and tell outside people 'upload them to our upload service here instead'; this would be inconvenient for everyone, but inconvenience versus security is another social problem and tradeoff.)
One corollary to this is that perceptions matter even if the ultimate outcomes are the same, because perceptions are part of what drive people's reactions to how your spam filtering works. For example, having spam filtering that is a black box that's impossible for you to tune is different from having spam filtering with a lot of adjustments, and we can't say that one is universally better or worse in the large scale. If you can't tune your spam filtering, on the one hand your users can't demand that you constantly tune it to deal with small issues but on the other hand you may have to completely throw it away if significant issues come up. If you can tune, actually tuning it may be considered one of your responsibilities (and people will blame you if you could but didn't), but you may have a greater ability to deal with significant problems.
(In practice, spam levels and scores are mostly a copout, because most people do not want to be tuning your spam filtering; they want it to just work.)
(This is an obvious observation and it's been in the back of my mind for some time, but for various reasons I feel like writing it down explicitly.)
Counting the number of distinct labels in a Prometheus metric
Suppose, not hypothetically, that you're collecting Prometheus metrics on your several VPN servers, including a per user count of sessions on each server. The resulting metric looks like this:
vpn_user_sessions{ user="cks", server="vpn1", ... }  1
vpn_user_sessions{ user="fred", server="vpn1", ... } 1
vpn_user_sessions{ user="cks", server="vpn2", ... }  1
We would like to know how many different users are currently connected across our entire collection of VPN servers. As we see here, the same user may be connected to multiple VPN servers for whatever reason, including that different devices prefer to use different VPN software (such as L2TP or OpenVPN). In Prometheus terms, we want to count the number of distinct label values in vpn_user_sessions for the 'user' label, which I will shorten to the number of distinct labels.
To do this, our first step is to somehow reduce this down to something with one metric point per user, with no other labels. Throwing away labels is done with the 'by (...)' modifier to PromQL aggregation operators. For our purposes we can use any of the straightforward operators such as sum, min, or max; I'll use sum. Using 'sum(...) by (user)' will produce a series like this:
{ user="cks" }  2
{ user="fred" } 1
Having generated this new vector, we simply count how many elements are in it with count(). The final expression is:
count( sum( vpn_user_sessions ) by (user) )
This will give us the number of different users that are connected right now.
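As an illustration of what the two aggregation steps are doing (this is a sketch in Python, not anything Prometheus itself runs, with made-up sample data mirroring the vpn_user_sessions metric above):

```python
# Hypothetical samples from vpn_user_sessions: one entry per (user, server).
samples = [
    {"user": "cks", "server": "vpn1", "value": 1},
    {"user": "fred", "server": "vpn1", "value": 1},
    {"user": "cks", "server": "vpn2", "value": 1},
]

# Step 1: 'sum(...) by (user)' collapses away all labels except 'user',
# summing the values of the elements that merge together.
by_user = {}
for s in samples:
    by_user[s["user"]] = by_user.get(s["user"], 0) + s["value"]
# by_user is now {"cks": 2, "fred": 1}

# Step 2: 'count(...)' counts how many elements the resulting vector has;
# the summed values themselves are irrelevant at this point.
distinct_users = len(by_user)
print(distinct_users)  # 2
```

The summed values (2 for cks, 1 for fred) are discarded by count(); all that matters is how many distinct 'user' values survived the first step.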
Next, suppose that we want to know how many different users have used our VPNs over some span of time, such as the past day. To do this in the most straightforward way, we'll start by basically aggregating our time span down to something that has an element (with a full set of labels) if the user was connected to a particular VPN server at some point in the time span. Since we don't care about the values, we can use any reasonable <aggregation>_over_time function, such as 'min':
min_over_time( vpn_user_sessions[24h] )
(The choice of aggregation to use is relatively arbitrary; we're using it to sweep up all of the different sets of labels that have appeared in the last 24 hours, not for its output value. Min does this and is simple to compute.)
This gives us an instant vector that we can then process in the same way as we did with vpn_user_sessions when we generated our number of currently connected users; we aggregate it to get rid of all labels other than 'user', and then we count how many distinct elements we have. The resulting query is:
count( sum( min_over_time( vpn_user_sessions[24h] ) ) by (user) )
This is not the only way to create a query that does this, but it's the simplest and probably also the best performing.
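The over-time version can be sketched the same way (again illustrative Python with made-up data, not PromQL; the timestamps and the 'barney' user are invented for the example):

```python
# Hypothetical history of vpn_user_sessions samples, as
# (hours_ago, labels) pairs.
history = [
    (20, {"user": "cks", "server": "vpn1"}),
    (12, {"user": "fred", "server": "vpn2"}),
    (1,  {"user": "cks", "server": "vpn2"}),
    (30, {"user": "barney", "server": "vpn1"}),  # outside the 24h window
]

# 'min_over_time(vpn_user_sessions[24h])': produce one element per distinct
# label set that appeared at any point in the window. The aggregated value
# is irrelevant; we only care that the label set shows up.
in_window = {tuple(sorted(labels.items()))
             for age, labels in history if age <= 24}

# 'sum(...) by (user)' then 'count(...)': reduce to distinct users
# and count them.
users = {dict(labels)["user"] for labels in in_window}
print(len(users))  # 2
```

Note that 'barney' is not counted because that session falls outside the 24-hour window, even though its label set exists in the metric's history.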
(I initially wrote a 'how many different users over time' query that didn't produce correct numbers, which I didn't realize until I tested it, and then my next attempt used a subquery and some brute force. It wasn't until I sat down to systematically work out what I wanted and how to get there that I came up with these current versions. This is a valuable learning experience; whenever I'm faced with a complex PromQL query situation, I shouldn't just guess, I should tackle the problem systematically, building up the solution in steps and verifying each one interactively.)
PS: It's possible that this trick is either well known or obvious, but if so I couldn't find it in my initial Internet searches before I started flailing around writing my own queries.