Wandering Thoughts archives

2013-10-27

Old and new addresses and spam

In response to an aside wondering how fast spam fell off for disused email addresses, Henry Spencer wrote me to mention that his older address (disused now for many years) gets a lot more spam than his current address. I've been thinking about this since then and I've realized that I implicitly divide disused addresses into at least two different categories. Let us call these the old active addresses and everything else.

Put simply, the old active addresses were actively and widely used on the Internet in what was roughly the pre-spam era. Henry Spencer's old address is definitely one example of this, since Henry spent years being active (and famous) on Usenet. Old active addresses were visible to spammers in the era when spammers began accumulating address lists, and as a result they made it onto a huge number of such lists. These lists seem to still circulate and recombine today, even though an increasing number of their addresses are no longer valid; effectively they have an exceptionally long and, I suspect, atypical half-life.

(One of my old addresses seems to be like this, in fact, although not the address that prompted my earlier entry.)

Other addresses either weren't visible enough to make it onto those early spammer address lists or simply postdate them. These addresses are not so universal in spammer usage and so get hit less and, I assume, also fall out of usage faster and to a larger degree. These are the addresses where it's interesting to ask about the half-life of spam. Of course, what I think of as one general category here is probably several different ones that I can't tell apart, because I don't have enough exposure to information about how spammers harvest and pass around addresses today.

(My impression is that one reason old active addresses are so heavily spammed is that these old addresses have become pervasively and basically freely available to spammers via many paths. I assume that newer addresses are harder and more costly for spammers to get, so they are less pervasive. This is probably an incorrect assumption.)

The real thing this has made me realize is that I don't really know much about how modern spammers operate. Is there a modern equivalent of the old 'million addresses' CDs that spammers apparently used to sell and pass around a decade ago, for example? I have no idea.

(I'm not likely to find out, either, since doing so would take a bunch of work even to find reliable sources of information and I just don't care enough any more. My spam problems have been basically solved by us outsourcing the work to commercial software.)

spam/OldAndNewAddresses written at 22:47:36

Some things I've learned from transitioning a website to HTTPS

A while back I first added an HTTPS version of my personal site alongside the existing HTTP version and then decided that I was going to actively migrate it to HTTPS. The whole thing has been running for a few months now, so it seems about time to write up some things I've learned from it.

The first set of lessons I learned was about everything on my side, especially my own code. The first layer of problems was code (and related material) with explicit 'http:' bits in it; it was amazing and depressing how many places I had just automatically written that (you could call this 'HTTP blindness' if you wanted a trendy term for it). The more subtle problem areas were things like caches, where an HTTP version of a page might be different from an HTTPS version yet I was storing them under the same cache key. I also ran into a situation where I wanted to generate output for an HTTP URL request but use the 'canonical' HTTPS URLs for links embedded in the result; this required adding a feature to DWiki.
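
As an illustration of the cache issue, here is a hypothetical Python sketch (not DWiki's actual code; the names are made up) of keying a page cache on the request scheme as well as the path, so that the HTTP and HTTPS renderings of a page can't overwrite each other:

    # Hypothetical illustration, not DWiki's actual code: include the
    # request scheme in the cache key so the HTTP and HTTPS renderings
    # of a page are cached separately.

    page_cache = {}

    def cache_key(environ, path):
        # wsgi.url_scheme is 'http' or 'https', set per request.
        scheme = environ.get('wsgi.url_scheme', 'http')
        return "%s:%s" % (scheme, path)

    def cached_render(environ, path, render):
        # A cache keyed this way can't hand back a page full of http:
        # links in answer to an HTTPS request, or vice versa.
        key = cache_key(environ, path)
        if key not in page_cache:
            page_cache[key] = render(environ, path)
        return page_cache[key]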

(I also found a certain amount of other software that didn't cope well. For example, the Fedora 19 version of mod_wsgi doesn't seem to cope with a single WSGI application group that's served over both HTTP and HTTPS; the HTTPS environment value latches to one value and never changes.)
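
To make 'latches' concrete, here is a minimal Python sketch of that general failure mode at the application level; it only illustrates the symptom (the first request's scheme winning forever), is not mod_wsgi's actual code or a fix for its bug, and the helper names are invented:

    _SCHEME = None  # a process-wide value that gets stuck on first use

    def latched_scheme(environ):
        # The broken pattern: remember the scheme of the first request
        # seen and reuse it forever, even when later requests arrive
        # over the other scheme.
        global _SCHEME
        if _SCHEME is None:
            _SCHEME = environ.get('wsgi.url_scheme', 'http')
        return _SCHEME

    def per_request_scheme(environ):
        # The correct pattern: consult the WSGI environment every time.
        return environ.get('wsgi.url_scheme', 'http')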

Once I had my own code working I got to find out all sorts of depressing things about how other people's code deals with such a transition. In no particular order:

  • While search engines did eventually switch over to returning HTTPS results and to crawling only the HTTPS version of my site, it took a surprisingly long time (and the switch may not be complete even now; it's hard to tell).

  • Many syndication feed fetchers have not changed to the HTTPS version; they still request an HTTP URL and then get redirected. I will reluctantly concede that there are sensible reasons for this behavior. It does mean that the HTTP redirects will probably live on forever.

  • There are a certain number of syndication feed fetchers that still don't deal with HTTPS feeds, or at least with redirections to them. Yes, really, in 2013. Unfortunately two of these are FeedBurner and the common Planet software, both of which I at least sort of care about. This led to the 'generate the HTTP version but use the canonical HTTPS links' situation for my software (there's a sketch of the idea after this list).

  • Some web spiders don't follow redirects for robots.txt. I decided not to redirect that one URL rather than block the spiders outright in the server configuration, partly because the former was a bit easier than the latter.

    (I already totally ban the spiders in robots.txt, which is one reason I wanted them to see it.)
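
Since a couple of the items above come back to the 'generate over HTTP but embed canonical HTTPS links' feature, here is a hypothetical Python sketch of the idea; it is not DWiki's actual implementation, and the base URL and helper names are made up for illustration:

    CANONICAL_BASE = "https://example.org"  # assumed canonical site root

    def canonical_url(path):
        # Absolute links written into generated output always use the
        # canonical HTTPS base, regardless of the scheme the request
        # itself arrived on.
        return CANONICAL_BASE + path

    def feed_entry(title, path):
        # A feed entry served to an HTTP-only fetcher still points
        # readers at the canonical HTTPS URL for the page.
        return '<entry><title>%s</title><link href="%s"/></entry>' % (
            title, canonical_url(path))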

Despite all of this the process has been relatively straightforward and mostly without problems. To the extent that there were problems, I'm more or less glad to know about them (and to fix my code; it was always broken, I just didn't realize it).

web/HTTPSTransitionLessonsLearned written at 02:30:49

