Wandering Thoughts: Recent Entries For oldest/10

Categories: links, linux, programming, python, snark, solaris, spam, sysadmin, tech, unix, web.

2005-06-18

SMTP IP firewall stats at June 18th, 2005

We maintain a filter list of bad hosts and network areas that can't talk to our SMTP port at all; their SMTP packets are silently discarded. The filter list is reinitialized each time the server reboots, currently once a week. During the week we add various spam sources and high volume sources of other rejections to the filters on a dynamic basis.

As the server does its weekly reboot at 6 AM Sunday morning, right now is a great time to pull a top-N summary from the kernel's firewall statistics. So, here are the top 20 sources of rejected packets to this server over almost the past 7 days:

Host/Mask           Packets   Bytes   Notes
213.4.129.48           7768    356K   [a] [njabl]
192.35.251.3           4539    218K   [a] [bad-helo]
61.128.0.0/10          4356    215K
216.7.201.43           4169    200K   [a] [bad-helo]
220.160.0.0/11         3313    161K
195.46.148.28          2955    177K   [a] [baddns]
65.194.220.21          2696    129K   [a] [cbl]
24.156.64.52           2683    129K   [a] [dialup] [cbl]
218.0.0.0/11           2577    126K
213.29.7.174           2492    150K   [a] [njabl]
219.128.0.0/12         2435    123K
65.214.61.100          2425    116K
66.18.69.6             2359    142K   [a] [spews]
24.222.77.233          2088    125K   [a] [flushot]
62.219.46.43           1949   93552   [a] [dialup] [cbl]
193.70.192.0/24        1893   85360
212.47.15.29           1824    109K   [a] [flushot]
12.31.56.73            1719   82512   [a] [bad-helo]
212.216.176.0/24       1654   86576
221.216.0.0/13         1584   78068

The key:

  • [a]: entry was added during the week as a high-count rejection source.
  • [baddns]: IP lacks a good PTR record.
  • [bad-helo]: tried to say hi with a bad SMTP HELO name.
  • [cbl]: IP in cbl.abuseat.org.
  • [dialup]: IP seems to be in a dynamic/dialup address range.
  • [flushot]: IP address sent email to our spamtraps.
  • [njabl]: IP in dnsbl.njabl.org.
  • [spews]: IP in the SPEWS DNSbl.
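
A top-N summary like the table above takes only a few lines of Python. This is just a sketch: it assumes the firewall counters have already been dumped as whitespace-separated 'source packets bytes' lines, which is a hypothetical format and not what any particular firewall tool prints. The sample counters use documentation-only addresses and made-up numbers.

```python
def top_sources(lines, n=20):
    """Return the n (source, packets, bytes) rows with the most packets.

    Assumes each line is 'source packets bytes' (a hypothetical dump format).
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source, packets, nbytes = fields
        rows.append((source, int(packets), int(nbytes)))
    # Sort by packet count, largest first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


# Made-up sample counters, using documentation-only address ranges.
sample = [
    "192.0.2.10 120 5760",
    "198.51.100.0/24 300 14400",
    "203.0.113.7 45 2160",
]
for source, packets, nbytes in top_sources(sample, n=2):
    print("%-18s %8d %8d" % (source, packets, nbytes))
```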

This isn't a particularly active server for mail in general; we usually get about 1,000 to 2,000 incoming real mail messages a day (mostly from mailing lists).

I believe that 213.4.129.48 (smtpout.terra.es), 213.29.7.174 (mail1002.centrum.cz), and 66.18.69.6 (mailout06.infosat.net) are all involved in providing free email. And apparently doing a bad job of stopping spammers from using it. Both 213.29.7.174 and 66.18.69.6 would have been rejected by later blocks as well, blocks we set up due to them sending us spam.

Due to a long-term spam problem, we have a number of Chinese netblocks that we aren't interested in accepting email from. In this listing, that's 61.128.0.0/10, 220.160.0.0/11, 218.0.0.0/11, 219.128.0.0/12, and 221.216.0.0/13.

212.216.176.0/24 is tin.it, an Italian ISP that had yet to get HELO greetings correct by the time I gave up and firewalled them.

193.70.192.0/24 is liberato.it, another Italian ISP with a significant spam problem that we've just stopped talking to. (On a quick spot check it seems to also be iol.it; they may have merged, been bought out, or renamed since I put them in our filter list.)

65.214.61.100 kept trying to send us email from the blocked origin address of 'info@salesrecruits.imakenews.net', week after week after week. At some point I just put them in our core filter list instead of adding them every week. I don't consider their continued attempts to send us email despite everything bouncing for months to be a good sign.

Note: because we drop incoming packets from these IP addresses on the floor and don't reply to them in any way, this is not an accurate count of even SMTP connection attempts. (One SMTP connection attempt will produce a number of packets to our SMTP port, depending on how much their OS retries TCP connection attempts.)

Disclaimer

By the time you read this, some of these IP addresses may no longer be in the DNSbls listed. Because this is IP level firewalling, we can't say anything definite about whether what these places are trying to send us is spam; we've just decided that we don't want to talk to them at all.

(Some of the SMTP connection attempts are probably for bounce backscatter from spammers forging our domain as the MAIL FROM of their spam runs.)

spam/IPReject-2005-06-18 written at 22:27:45

2005-06-17

The problem with CPAN (and other similar systems)

At one level, CPAN is a great thing: people really like having a simple way of installing Perl packages (and a big archive of them). It's such a good idea that it's being copied over and over: Python's distutils, Ruby's RubyGems, the R statistics package's CRAN, and no doubt others.

But CPAN and things like it have a problem: they're a package management system. Or, to be more detailed, they're another package management system, on top of the one that our Unix systems already have.

Multiple package systems on a single computer means that no single package system has a full picture of the system. This causes various problems:

  1. to get a complete picture of what's on the system, I have to remember to use multiple tools (and remember how to use all of them).
  2. two different tools can both think they own or exclusively manage certain files (for example, index files of all the packages installed). The extreme case is installing the same thing through the OS package manager and a program's own package manager.
  3. missing cross packaging system relationships; for example, things installed through CPAN likely depend on the version of Perl installed by the OS's package manager. Does the OS package manager know enough to tell me that upgrading Perl because of a security fix is going to orphan all of those CPAN packages I need?
  4. satisfying dependencies: when I try to install a core OS package BazOrp, which requires Python package FooBar (version 1.6.1 to 1.7.8) how does the core OS package management system know that I installed FooBar 1.7.7 through Python's distutils and it's OK to go ahead?
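
The version check in the last item is not the hard part; the hard part is that the OS package manager has no inventory to ask. As a rough sketch of what it would need to do if it did know about FooBar, assuming versions are purely dotted numbers (real packaging systems have more elaborate version rules):

```python
def parse_version(text):
    """Turn '1.7.7' into (1, 7, 7) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, minimum, maximum):
    """True if minimum <= installed <= maximum, all dotted-number strings."""
    return parse_version(minimum) <= parse_version(installed) <= parse_version(maximum)


# FooBar 1.7.7 falls inside BazOrp's required range of 1.6.1 to 1.7.8.
print(satisfies("1.7.7", "1.6.1", "1.7.8"))
# Plain string comparison would get 1.10.0 versus 1.7.8 wrong; tuples do not.
print(satisfies("1.10.0", "1.6.1", "1.7.8"))
```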

And this simplifies the problems, because most of these CPAN-like things are not actually package management systems, they are package installation systems. All they do is install things; they don't keep a package inventory (especially with version numbers), they usually don't have much of an idea of package dependencies, and often they can't even remove what they just installed.

The situation is worse when I work in large-scale environments, with tens to hundreds of systems. Environments that large can't be looked after by hand; the machines have to be managed through automated systems.

In that sort of environment, every program with its own package system means that I would have to obtain or build an automated system to manage that package system. Since the package system itself is unlikely to provide the basic management tools (inventory, dependencies, etc), I would have to build those, too.

You may have guessed the punchline: as a result of all of this, we don't and can't use CPAN, distutils, RubyGems, CRAN, and so on. Of course this is sometimes difficult to explain to users, who are known to approach us to ask 'there is this CPAN module I need, can you please install it on the machines?' and then don't understand why I break down and twitch.

Solution: build real OS packages

I already have to deal with the OS's package management system, so the best way to make my life easier is to make your package installation system build OS packages for me, instead of directly installing files.

This shouldn't be too difficult, as your installation system already has most of the information necessary, such as what files are going to be installed and a package description. Don't worry too much about dependencies, as a decent OS packaging system will be capable of working them out for you.

On Linux systems, supporting building Debian .debs and RPMs will get you most of the way to making people entirely happy. (You don't have to decide which distributions to support; with generic building support, you support everyone using that packaging format.)

Existing support for this

Debian's dh-make-perl builds CPAN packages into Debian .debs.

The CPAN RPM::Specfile package and its cpanflute2 program will build RPMs from CPAN packages. (Getting it in RPM form to bootstrap this properly may be a pleasantly recursive exercise.) There's also the cpan-to-rpm.pl program from here, to do everything in one go. (I believe cpanflute2 has had some problems for us in the past, but I have blotted them out from my mind.)

Python distutils has a bdist_rpm command for building RPMs, but this doesn't work reliably for somewhat complicated packages in the versions I've tried. (Yes, I should file bug reports and produce patches to fix things. Someday, when I have enough time to fully investigate the situation.)

sysadmin/CPANProblem written at 22:05:47

2005-06-16

AJAX vs Dialups

AJAX is short for 'Asynchronous Javascript And XML', the common term for the technology behind highly interactive web sites like Google Maps and Google Mail. Given that the features AJAX enables (from the large to the small) are very appealing to designers, we're pretty much guaranteed to see more and more use of it on web sites.

But please don't reach for AJAX too fast, because there is such a thing as being too interactive.

AJAX's interactivity comes through communication, and communication takes bandwidth. While it'd be nice if everyone coming to your web site had lots of bandwidth, it's not true (unless you want to make it true by driving away everyone else).

Let's take an example: using AJAX to implement incremental searches. The search box on your web pages uses AJAX to notice when I start typing and does a callback to your web server so it can show me matching results; once I've typed enough to pull up what I want, I can just go there.

So I start typing, entering 'p'. Lightning-fast, your highly interactive AJAX wakes up and sends the request back to your web server. Of course there are a lot of pages that match such broad criteria, so the reply is not short (the RD light of my modem goes on solid). As I add a 'y' and a 't' the whole process repeats, possibly colliding with the data transfer for the initial 'p' in the process.

This hypothetical web site's great interactivity hasn't helped me, it's frustrated me. Search has turned into a laggy experience where I have to wait for the application to catch up to my typing. The slower a typist I am, the worse it may be; if I type fast I have at least a chance of outracing the AJAX over-interactivity.

So: don't be too interactive. If your AJAX needs results from your web server, you probably can't keep up with the user's interactions in real time. Don't try; wait a bit, let the user get a bit of a head start, give some feedback every so often, and reserve your big efforts for when the user has paused. (Pauses in user input are your big hint that the user is waiting for you now.)
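
The 'wait for a pause' logic is small. Here is a sketch of it in Python (a real page would do this in JavaScript, and the class and method names here are made up for illustration); the caller passes in the current time, so it isn't tied to any particular event loop:

```python
class PauseDetector:
    """Decide when to send a search request, based on pauses in typing.

    Times are plain numbers (seconds) supplied by the caller, so this
    sketch does not depend on any real event loop or browser API.
    """

    def __init__(self, pause=0.5):
        self.pause = pause        # how long the user must be idle
        self.text = ""            # latest text typed so far
        self.last_key = None      # time of the most recent keystroke
        self.sent = None          # last text we actually sent

    def keystroke(self, now, text):
        """Record that the user typed; never sends anything itself."""
        self.text = text
        self.last_key = now

    def poll(self, now):
        """Return the text to search for, or None if we should keep waiting."""
        if self.last_key is None or now - self.last_key < self.pause:
            return None
        if self.text == self.sent:
            return None           # nothing new since the last request
        self.sent = self.text
        return self.text


d = PauseDetector(pause=0.5)
d.keystroke(0.0, "p")
d.keystroke(0.2, "py")
print(d.poll(0.3))   # still typing, so no request yet
d.keystroke(0.4, "pyt")
print(d.poll(1.0))   # the user paused, so search for the latest text
```

With this, typing 'p', 'y', 't' in quick succession produces one request for 'pyt' instead of three overlapping ones.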

Google Suggest shows another solution to this: don't return interactive results until they're small enough to be useful. (In a search interface I do ask that you put up some feedback to the effect of 'searching for "py": too many results to show in the sidebar', so that I can tell the difference between lots of results and no results.)

Whichever you choose, people on dialups (like me at home on my poky 28.8K PPP link) will thank you for considering them. And you may discover that there are more of us than you thought, along with the people using your web site from halfway around the world, the unfortunates stuck behind choked up corporate Internet links, and so on.

Unsurprisingly I'm not the only person (or the first person) writing about this general issue; for example, Markus Baker's discussion is here.

You can read about other AJAX design issues here and here. (And this entirely neglects the collection of practical issues one faces when implementing AJAX in the presence of network delays.) Note to self: AJAX is complicated in practice.

You can read more about AJAX in the Wikipedia article.

web/AJAXvsDialups written at 23:03:47

Iterator & Generator Gotchas

Python iterators are objects (or functions, using some magic) that repeatedly produce values, one at a time, until they get exhausted. Python introduced this general feature to efficiently support things like:

for line in fp:
    ... do something with each line ...

Without iterators, this would have to be written with fp.readlines(), which reads the entire file into memory, splits it up into lines, and returns a huge list; with them, this code only has one line in memory at any given time, even if the file is tens or hundreds of megabytes.

Generators are functions that magically create iterators instead of just returning values (ignoring some technicalities). Generators are the most common gateway to iterators, and are thus the more commonly used term for the whole area.

When iterators were introduced, a number of standard things that had previously returned lists started returning iterators, and using a generator instead of just returning a list became part of the common Python programming idioms.

In many cases it can be tempting, and temptingly easy, to replace things that return lists with generators; it looks like it should just work, and it mostly does. It can be similarly tempting to just ignore the difference in the standard Python modules.

But there are some gotchas when you write code like this, and I have the stubbed toes to prove it. At one point or another, I've made all of these iterator-confusion mistakes in my code.

Iterators are always true

t = generate_list(some, inputs)
if not t:
   return
print "Header Line:"
for item in t:
   .....

If generate_list returns an iterator instead of a list, this code doesn't work right. Unless someone got quite fancy, iterator objects are always true, unlike lists, which are only true if they contain something.

There's really no way to see if an iterator contains anything except to try to get a value from it. And there's no 'push value back onto iterator' operation.
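
What you can do is take the first value and then build a new iterator with that value glued back on the front. A minimal sketch, using a helper name (nonempty) that I've made up for illustration:

```python
import itertools


def nonempty(iterable):
    """Return an equivalent iterator, or None if the iterable is empty."""
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        return None
    # Glue the value we consumed back onto the front of a new iterator.
    return itertools.chain([first], it)


def numbers(limit):
    """A small generator to test with."""
    for i in range(limit):
        yield i


print(nonempty(numbers(0)))
print(list(nonempty(numbers(3))))
```

The caller has to use the returned iterator and forget about the original one, since the original has already lost its first value.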

Iterators can't be saved

def cached_lookup(what):
  if what not in cache:
    cache[what] = real_lookup(what)
  return cache[what]

If real_lookup returns iterators, this code doesn't work. When an iterator's exhausted, it's exhausted; if you try to use it again (such as if cached_lookup found it as a cached result), it generates nothing.

(Technically I believe there are semi-magical ways to copy iterators. I suspect one is best off avoiding them unless you really have to save an iterator copy.)
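
The simple fix is to expand the iterator into a list before saving it. A sketch of the cache above done that way, with a stand-in real_lookup (hypothetical, just so the example runs):

```python
cache = {}


def real_lookup(what):
    """Stand-in for an expensive lookup that returns an iterator."""
    for i in range(3):
        yield "%s-%d" % (what, i)


def cached_lookup(what):
    if what not in cache:
        # Expand the iterator once; a list can be re-read any number of times.
        cache[what] = list(real_lookup(what))
    return cache[what]


print(cached_lookup("x"))
print(cached_lookup("x"))
```

This gives up the memory savings of the iterator, which is the price of being able to reuse the results.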

I can't use list methods on iterators

t = generate_list(some, inputs)
t.sort()
t = t[:firstN]
# ... admire the pretty explosions

Of course, iterators don't have list methods like .sort(), and they don't support len(), indexing, or slicing either. If you want to use those, you have to write:

t = list(generate_list(some, inputs))
t.sort(); t = t[:firstN]

Fortunately, list() will expand the iterator for you and is harmless to apply to real lists, so you can use it without having to care if the generate_list routine changes what it returns.

Writing recursive generators

Sometimes the most natural structure for a generator is a recursive one. This works, but you have to bear in mind a twist: you cannot simply return the results of the recursive calls. This is because the recursive results are themselves iterators, and if you return them straight your callers get iterators that produce a stream of iterators that produce a stream of iterators that someday, at some level, produce actual results. (But by that time the caller has given up in despair.)

Instead each time you recurse, you have to expand the resulting iterator and return each result, like so:

def treewalk(node):
  if not node:
    return
  yield node.value
  for val in treewalk(node.left):
    yield val
  for val in treewalk(node.right):
    yield val

This implies that significantly recursive generators can be quite inefficient, as they will spend a great deal of time trickling results up through all the levels involved.
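
Later versions of Python (3.3 and up) have 'yield from', which does the expand-and-re-yield loop for you. Here is a sketch of the same walk written that way, next to the broken version for contrast; the Node class is a made-up minimal tree node. I wouldn't assume that 'yield from' makes the per-level trickling free; it mostly tidies the code.

```python
class Node:
    """Minimal binary tree node, just enough for the example."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def treewalk(node):
    """Pre-order walk, like the version above but using 'yield from'."""
    if not node:
        return
    yield node.value
    yield from treewalk(node.left)
    yield from treewalk(node.right)


def broken_treewalk(node):
    """The mistake: yielding the recursive iterators themselves."""
    if not node:
        return
    yield node.value
    yield broken_treewalk(node.left)
    yield broken_treewalk(node.right)


tree = Node(1, Node(2, Node(3)), Node(4))
print(list(treewalk(tree)))
# The broken version yields one value and then two generator objects.
print(len(list(broken_treewalk(tree))))
```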

python/GeneratorGotchas written at 02:26:39

2005-06-14

Putting a pleasant Python surprise to use

Although I've been programming in Python for a few years now, it keeps surprising me with little bits and pieces. Here's a neat Python language feature that I recently used for the first time (discovered originally through Bram Cohen's LiveJournal).

A common programming pattern is 'search for something to work on, but stop if you don't find anything'. In Python one might write it something like this (taken more or less from DWiki's source):

found = False
for dir in utils.walk_to_root(curdir):
	page = dir.child("__readme")
	if page.exists():
		found = True
		break
if not found:
	return ''
# Go on to use the __readme file we found in some directory.

Python allows you to put 'else' clauses on loops (both for and while loops); the else clause is executed if the loop ran to completion instead of being break'd from. This lets us simplify this pattern down to:

for dir in utils.walk_to_root(curdir):
	page = dir.child("__readme")
	if page.exists():
		break
else:
	return ''

If there's no __readme file to be found from the current directory up to the root, we just return nothing; otherwise, we'll process it. This DWiki code is the first occasion I've had to use this feature since I discovered it, and I'm pleased to finally have been able to.

(As you can now see, not all the entries in this blog are going to be long and meandering.)

python/LoopElse written at 17:16:09

Pitfalls in generating Last-Modified:

Every HTTP reply from a web server can include a Last-Modified: header, which theoretically tells interested parties when the web page was last modified. This is really something that works best when the web server is just sending out static files; when it is generating dynamic content, like DWiki does, things get interesting.

The major use of Last-Modified: is to decide when a browser already has a current copy of the web page and doesn't need to fetch it again. Thus, with dynamic pages built from many pieces the Last-Modified: time needs to be the most recent modification time for all of the pieces. Then when any of the pieces that make a page are updated, changing the page's appearance, the page's Last-Modified: time will change and the browser will fetch a new copy.

This means DWiki can't just use the page's modification time (which is what gets shown in the 'Last Modified:' line at the bottom of most CSpace pages). DWiki pages are built from a cascade of templates and pieces, so as it builds a web page DWiki keeps track of the most recent modification time of all the files involved; change one template, and the updated time is automatically propagated through the system.
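
The bookkeeping amounts to taking the maximum over the modification times of every file that went into the page. This is not DWiki's actual code, just a minimal sketch of the idea with made-up function names; the two temporary files stand in for a template and a page.

```python
import email.utils
import os
import tempfile


def last_modified(paths):
    """Return the newest modification time (in seconds) among the given files."""
    return max(os.path.getmtime(path) for path in paths)


def http_date(timestamp):
    """Format a timestamp the way the Last-Modified: header expects."""
    return email.utils.formatdate(timestamp, usegmt=True)


# Demonstration with two throwaway files standing in for a template and a page.
with tempfile.TemporaryDirectory() as tmp:
    template = os.path.join(tmp, "template")
    page = os.path.join(tmp, "page")
    for path in (template, page):
        with open(path, "w") as f:
            f.write("x")
    # Give the files known, different modification times.
    os.utime(template, (1000000000, 1000000000))
    os.utime(page, (1100000000, 1100000000))
    newest = last_modified([template, page])
    print("Last-Modified:", http_date(newest))
```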

Or it would if there weren't some complications.

Authentication Soup

Being logged in to a DWiki, and who you're logged in as, not just can but will change the appearance of pages. It's not just big things, like being able to see a page's contents; it's everything from DWiki saying 'Welcome, <whoever>' in the top right corner down to whether you get a login form or a logout button. So if you log in or out and then refresh pages in your browser, the pages had better change to look right for your current status; otherwise users start wondering if their login or logout actually worked.

In order to support Last-Modified: with authentication, DWiki would have to somehow arrange to track the last time you logged in or out of the DWiki. While this is theoretically possible, it would be a bunch of work and would involve trying to send a cookie to every visiting browser (and I refuse to do the latter).

Instead DWiki mostly just punts when authentication is enabled; regular DWiki pages get served without any Last-Modified: header. Fortunately modern browsers have another, better header called ETag: that they can use instead of Last-Modified: to see if they need to refresh a page.

Page List Soup

The other complication is easy to state: what's the modification time of a list of files?

Lists of files come up in several places in DWiki, most importantly when generating Atom syndication feeds. Atom feeds also complicate life because of two factors:

  • the Atom feed format requires some kind of 'most recently updated' timestamp.
  • the ETag: header's value is typically some identifying hash of the HTTP response's contents, so if the contents keep changing (because one generates a 'right now' timestamp as the most recently updated time in an Atom feed), the ETag: header will keep changing and everything will keep re-fetching Atom feeds and pages even when nothing has changed.

(Also, the RSS/Atom feed reader I use doesn't use ETag:, only Last-Modified:, so I have been trying to support Last-Modified: in my Atom feeds.)

The simple approach is to make the Last-Modified: value be the modification time of the most recently modified file in the list. Unfortunately this doesn't change when files are added or removed from the middle of the list, which makes it useless for most of DWiki's purposes.

At the moment DWiki folds in the modification times of all the directories it scans when looking at files during Atom feed generation (thereby missing directories that currently have no files in them at all). At other times it just punts.

Summary For Client Authors

If you're thinking of writing a feed reader client or a web browser, I have this to say: please just use the ETag: header. Since it's usually some hash value of the HTTP response's data, it's easy to generate and always accurate about whether or not the response is the same. Last-Modified: is essentially an approximation in everything except relatively simple situations or programs that go to obsessive amounts of work.
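
To show how little the server side needs, here is a sketch of hashing a response body into an entity tag and answering a conditional request. It's simplified: it assumes the client sends back exactly one tag in If-None-Match, and the function names are made up for illustration.

```python
import hashlib


def make_etag(body):
    """Build a quoted entity tag from a hash of the response body (bytes)."""
    return '"%s"' % hashlib.sha1(body).hexdigest()


def respond(body, if_none_match=None):
    """Return (status, headers, body) for a GET, honoring If-None-Match.

    Simplification: only a single tag in If-None-Match is handled.
    """
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's copy is current; send no body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


status, headers, _ = respond(b"<p>hello</p>")
print(status)
status, _, _ = respond(b"<p>hello</p>", if_none_match=headers["ETag"])
print(status)
```

A client just has to remember the tag it was given and send it back on the next fetch.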

web/LastModifiedPitfalls written at 00:50:46

2005-06-12

Making a Python mountain out of a molehill

DWiki is the software that runs CSpace, including this blog. It's wound up much bigger than I expected and wanted it to be. This is sort of the story of how (or why) that happened.

[More available: DWikiGrowth]

python/DWikiGrowth written at 03:22:32

Why a Blog?

Honesty compels me to admit that part of why I am trying blogging is that I coded a bunch of blog features into DWiki, my wiki-thing, and I would feel kind of stupid if they never got used. (The blogdir code has a clear usage case, but the blog code does not quite so much.)

But that's not all of it. Web pages cry out for a linking structure, an organization, a taxonomy, just like we are supposed to sensibly organize directories on our computers. And HTML pages themselves are standalone objects, so it feels wrong to make one too small. (I have some HTML pages that are only a few lines long; I suspect that they irritate people who think 'I clicked on a link for this little?'.)

A blog is a structure for casualness, and its way of organizing things liberates you from all of those worries. Freed of the neurotic worry to organize, I can just write, jot, scribble, rant, or whatever. And since blog posts are all run together, I don't have to worry that one is too small for a HTML page (or just too much work once I add the necessary framing).

Blog writing doesn't have to be scribbles, and a lot of the blogs that I most enjoy reading are very carefully put together. But I'm not sure I have the time and energy for that; a good part of my purpose in trying out blogging is the theory that writing something is better than writing nothing.

(That's also the theory behind CSpace and DWiki as a whole.)

The question of why I'm not using a different piece of blog software for this is something for another time and another entry.

BlogGenesis written at 00:02:19

2005-06-11

Writing DWiki has been a very educational process. Mostly it has been educational about all sorts of irritations that I was previously happily ignorant of.

Take HTTP redirects, for example. (Please.)

To be fully specification-compliant, an HTTP redirect must be to a different URL than the current one, and must be to an absolute URL: it must redirect to http://host/some/where, not just /some/where. (Perhaps common browsers all accept relative redirects, but at least lynx complains about them.)

issue #1: when absolute URLs can't be

This presents a small problem for a program like DWiki: just what is the absolute URL of a DWiki page? The host is relatively easy, since modern HTTP requests include the host name (it's how name-based virtual hosts work).

But ... what about the port? Not every web server lives on port 80, especially a DWiki running in standalone test mode.

In theory the absolute URL should include the port (unless it's the default). In practice, every program I've tried gleefully adds the port itself if it is a non-standard port and you're referring to the same hostname. If you naively generate redirects of http://host:port/..., what most programs try to get is http://host:port:port/..., which doesn't work too well.

Presumably people who want to run two web servers on the same host on different ports just lose.

Maybe this is even documented somewhere. (I jest; I looked, and failed to find anything obvious in the RFCs.)

Update, much later: I was completely mistaken here; see HostMistake.
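
For reference, HTTP/1.1 defines the Host: request header as the host name plus an optional ':port', so a redirect built from that value verbatim already carries any non-default port. A minimal sketch of that, with a made-up function name and example.org as a placeholder host:

```python
def absolute_url(host_header, path, scheme="http"):
    """Build an absolute URL from a request's Host: header value and a path.

    The Host: value is used verbatim, so 'example.org:8080' keeps its port
    and no second port is ever appended.
    """
    if not path.startswith("/"):
        path = "/" + path
    return "%s://%s%s" % (scheme, host_header, path)


print(absolute_url("example.org:8080", "/some/where"))
print(absolute_url("example.org", "some/where"))
```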

issue #2: did you say different URL?

Why, yes. Different URL. Why is this irritating? Let's take logging in to this wiki as an example.

Login forms need to be POST forms, not GETs, because one does not want the password sitting in plaintext in URLs. The natural way to do it is to let the login form POST to the current page, which then just redisplays itself. Unfortunately if you then ask your browser to reload the resulting page (perhaps to see an updated edit of it), your browser warns you that you're about to resubmit a POST form and are you sure?

So: what we want to happen is to POST to the current URL, which instead of redisplaying itself in a POST context immediately redirects back to the GET version of itself.

Which is where it becomes very irritating that HTTP redirects have to go to a different URL.

DWiki 'solves' this issue by making up synthetic page names for processing logins (and logouts, which have the same problem). Fortunately it can guarantee that certain page names in its URL space will never be valid real DWiki pages, so it just uses some of them.

To get back to the page you were just reading, the login and logout forms add a hidden field to the form to say what the old page was. (Which means that the form has to be generated dynamically, because it's different on each page.)
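
The overall flow can be sketched in a few lines. This is not DWiki's code: the synthetic page name, the field names, and the choice of a 303 status are all assumptions made for illustration, and the actual password check is left out.

```python
import html
from urllib.parse import parse_qs, urlencode

LOGIN_PAGE = "/.login"   # hypothetical synthetic page name, never a real page


def login_form(current_page):
    """Generate the login form, remembering which page it was shown on."""
    return (
        '<form method="post" action="%s">'
        '<input type="hidden" name="page" value="%s">'
        '<input name="login"> <input type="password" name="password">'
        '</form>' % (LOGIN_PAGE, html.escape(current_page, quote=True))
    )


def handle_login_post(host, body):
    """Process a POST to LOGIN_PAGE; return (status, headers) for a redirect."""
    fields = parse_qs(body)
    page = fields.get("page", ["/"])[0]
    if not page.startswith("/") or page.startswith("//"):
        page = "/"   # only ever redirect within this site
    # ... the login and password fields would be checked here ...
    # Redirect to the GET version of the page the user came from.
    return 303, {"Location": "http://%s%s" % (host, page)}


body = urlencode({"page": "/blog/entry", "login": "user", "password": "secret"})
print(handle_login_post("example.org", body))
```

Because the redirect target is a different URL from the one that was POSTed to, reloading the resulting page is an ordinary GET and the browser has nothing to warn about.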

web/HTTPRedirects written at 19:44:27

I find myself quite irritated with CSS lately, because I have been trying to be a good modern web-boy and style this place with CSS.

The problem with CSS is what it leaves out. Take a very simple example: this blog. Like many blogs, I want a two-column layout: blog entries on the left, a small sidebar about the blog on the right.

No problem in CSS: use two <div>s, set a width: or a min-width:, and then set one <div> as float: left; or so. And this works, as long as neither column overflows. Unfortunately, a) I am guaranteed to someday write a blog entry with a long unbreakable line, since I am going to quote code periodically and b) the user can make their browser window pretty darn narrow.

What I want to happen in this situation is for CSS to shrug and enlarge the whole thing so that the user has to scroll sideways to see the entire page. What I can get is one of:

  • the over-long line is truncated and is undisplayable.
  • the thing containing the over-long line is truncated but grows a scrollbar, so at least I can theoretically read it.
  • the over-long line scribbles itself all over the sidebar.
  • my sidebar stops being a sidebar and suddenly becomes a top-bar or a bottom-bar.

No, blech, ptui, and 'ha, you jest' respectively.

Those evil bad table things that we aren't supposed to use for layout? Surprise. They get this right.

So if you look at the HTML source for the blog, you will see a giant table.

(Now, perhaps there is some magic way to do this in CSS that actually works right and that no one mentions or uses. If so, please tell me about it. I would like to use CSS if I can.)

web/CSSIrritation written at 19:15:44

These are my WanderingThoughts
(About the blog)


This is part of CSpace, and is written by ChrisSiebenmann.

* * *

Atom feeds are available; see the bottom of most pages.

This is a DWiki.
(Help)


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.