Old and new addresses and spam
In response to an aside wondering how fast spam fell off for disused email addresses, Henry Spencer wrote me to mention that his older address (disused now for many years) gets a lot more spam than his current address. I've been thinking about this since then and I've realized that I implicitly divide disused addresses into at least two different categories. Let us call these the old active addresses and everything else.
Put simply, the old active addresses were actively and generally widely used on the Internet in what is roughly the pre-spam era. Henry Spencer's old address is definitely one example of this, since Henry spent years being active (and famous) on Usenet. Old active addresses were visible to spammers in the era when spammers began accumulating address lists and as a result they made it onto a huge number of such lists. These lists seem to still circulate and recombine today, even though an increasing number of the addresses are no longer valid; effectively they have an exceptionally, and I suspect atypically, long half-life.
(One of my old addresses seems to be like this, in fact, although not the address that prompted my earlier entry.)
Other addresses either weren't visible enough to make it onto those early spammer address lists or postdate them in general. These addresses are not so universal in spammer usage and so get hit less and, I assume, also fall out of usage faster and to a larger degree. These are the addresses where it's interesting to ask about the half-life of spam. Of course, what I think of as a general category here is probably some number of different ones that I don't really see, because I don't have enough exposure to information about how spammers harvest and pass around addresses today.
(My impression is that one reason old active addresses are so heavily spammed is that these old addresses have become pervasively and basically freely available to spammers via many paths. I assume that newer addresses are harder and more costly for spammers to get, so they are less pervasive. This is probably an incorrect assumption.)
The real thing this has made me realize is that I don't really know much about how modern spammers operate. Is there a modern equivalent of the old 'million addresses' CDs that spammers apparently used to sell and pass around a decade ago, for example? I have no idea.
(I'm not likely to find out, either, since doing so would take a bunch of work even to find reliable sources of information and I just don't care enough any more. My spam problems have been basically solved by us outsourcing the work to commercial software.)
Some things I've learned from transitioning a website to HTTPS
A while back I first added a HTTPS version of my personal site alongside the existing HTTP version and then decided that I was going to actively migrate it to HTTPS. The whole thing has been running for a few months now, so it seems about time to write up some things I've learned from it.
The first set of lessons I learned were about everything on my side,
especially my own code. The first layer of problems was code and
related files with 'http:' bits hardcoded in them; it was amazing
and depressing how many places I was just automatically doing that
(you could call this 'HTTP blindness' if you wanted a trendy term
for it). The more subtle problem
areas were things like caches, where a HTTP version of a page might be
different from a HTTPS version yet I was storing them under the same
cache key. I also ran into a situation where I wanted to generate output
for a HTTP URL request but use the 'canonical' HTTPS URLs for links
embedded in the result; this required adding a feature to DWiki.
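The cache issue can be sketched in a few lines of Python. This is a hypothetical illustration, not DWiki's actual code (none of these names come from DWiki); the point is simply that folding the URL scheme into the cache key keeps the HTTP and HTTPS renderings of a page from colliding:

```python
# Hypothetical sketch of scheme-aware caching.

cache = {}

def cache_key(scheme, host, path):
    # Including the scheme gives the http:// and https:// renderings
    # of the same page separate cache slots.
    return "%s://%s%s" % (scheme, host, path)

def cached_render(scheme, host, path, render):
    # Render and cache a page; a second request with the same scheme,
    # host, and path is served from the cache.
    key = cache_key(scheme, host, path)
    if key not in cache:
        cache[key] = render(scheme, host, path)
    return cache[key]
```

Before a fix like this, the key was effectively just host plus path, so whichever version happened to be rendered first got served to everyone.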
(I also found a certain amount of other software that didn't cope well.
For example, the Fedora 19 version of mod_wsgi doesn't seem to cope
with a single WSGI application group that's served over both HTTP and
HTTPS; the HTTPS environment value latches to one value and never
changes afterward.)
Once I had my own code working I got to find out all sorts of depressing things about how other people's code deals with such a transition. In no particular order:
- while search engines did eventually switch over to returning HTTPS
results and to crawling only the HTTPS version of my site, it
took a surprisingly long time (and the switch may not be complete
even now, it's hard to tell).
- Many syndication feed fetchers have not changed to the HTTPS version;
they still request the HTTP URL and then get redirected. I will
reluctantly concede that there are sensible reasons for this behavior.
It does mean that the HTTP redirects will probably live on forever.
- There are a certain number of syndication feed fetchers that still
don't deal with HTTPS feeds or at least with redirections to them.
Yes, really, in 2013. Unfortunately two of these are FeedBurner
and the common Planet software, both of which I at least sort of
care about. This led to the 'generate HTTP version but use the
canonical HTTPS links' situation for my software.
- Some web spiders don't follow redirects for robots.txt. Rather than
block those spiders outright in the server configuration, I decided
to exempt that one URL from the redirect, partly because it was a
bit easier than blocking them.
(I already totally ban the spiders in robots.txt, which is one
reason I wanted them to see it.)
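As a concrete sketch of the robots.txt exemption, this is illustrative Apache mod_rewrite configuration (the directives are real, but this is not my actual server setup; other web servers would express it differently):

```apache
# Redirect all HTTP requests to HTTPS, except robots.txt itself,
# so spiders that won't follow a redirect for robots.txt can still
# read it (and see that they're banned).
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^/?(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]
```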
Despite all of this the process has been relatively straightforward and mostly without problems. To the extent that there were problems, I'm more or less glad to know about them (and to fix my code; it was always broken, I just didn't realize it).