2011-07-15
Something to remember: HTML forms are anonymous
By and large, web programming frameworks have settled on a common
model of handling HTML forms. You have named (or typed) forms
with named and typed fields and you use the framework to render
them into HTML and extract them from POST responses. Django programmers, for example, have a
familiar, reflexive idiom:
    class MyForm(forms.Form):
        name = forms.CharField(...)
        ...

    def handle_my_url(request):
        if request.method == "POST":
            form = MyForm(request.POST)
            if form.is_valid():
                ...
This simple, clear approach is misleading. It's misleading because it
makes the whole process look sort of like storing objects, which means
that of course you're only going to get a valid MyForm back from the
HTTP POST if you actually put one there in the first place (or the user
made up the POST data themselves to fool you, which you can ignore).
The thing is, HTML forms are anonymous. In their natural state,
the only way you can tell different types of forms apart is by the URL
they are sent to and the names of the form fields that they have. There
is nowhere natural in a HTML form where you can say 'this is a MyForm
form'; in general, you have to infer that from the fact that it has
all the fields that MyForm has and is POST'd to a URL that expects a
MyForm form.
(Your web framework may be adding a hidden label field that it uses to be sure, but you have to check the generated HTML in order to know for sure.)
So suppose that you have two different forms with the same form fields; this means that the only way to tell these forms apart is by the URL that each of them uses. If they use the same URL (for example because there are alternate versions of the page, with forms that have a different meaning), you can't tell them apart at all. You can render a page with a 'MyForm1', have the user POST it back, and happily retrieve a 'MyForm2' from the POST response. Although these two forms looked like they were distinct and different objects in your code, in HTML they are actually the same thing.
(It's as if your programming language ignored the type of things when doing 'is-a' and equivalence checks and only checked that two instances had all of the same fields. There are languages that work this way; I believe the term of art for it is 'structural typing'.)
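To make this concrete, here is a minimal sketch in plain Python rather than Django. The `validate` helper, the two field sets, and the `email` field are hypothetical stand-ins for a framework's form classes, not anything Django provides; the point is only that the POST body carries field names and values and nothing else.

```python
from urllib.parse import parse_qs

# Hypothetical stand-ins for two form classes. Each is reduced to its set
# of required field names, which is all that survives the trip through HTML.
MYFORM1_FIELDS = {"name", "email"}
MYFORM2_FIELDS = {"name", "email"}


def validate(required_fields, post_body):
    """Return the submitted values if every required field is present, else None."""
    data = parse_qs(post_body)
    if not required_fields.issubset(data):
        return None
    return {field: data[field][0] for field in required_fields}


# The browser submits a page that was rendered from "MyForm1" ...
post_body = "name=Alice&email=alice%40example.org"

# ... and the body validates equally well as either form.
as_form1 = validate(MYFORM1_FIELDS, post_body)
as_form2 = validate(MYFORM2_FIELDS, post_body)
```

Nothing in `post_body` says which form it came from, so both calls succeed and return the same data.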
All of this is abstract sounding, so let me give a concrete example where I almost shot my foot off this way. Our account request system allows privileged users to do two very special operations to requests: if a request is marked as having been either accepted or rejected, you can reset it to 'pending', and if a request is pending you can immediately delete it. In both cases you need to confirm that you really do want to do this by ticking off a checkbox, and both operations are done from the same 'detailed information about this request' URL; which option the page gives you depends on the request's state.
So we can create two forms:
    class ReallyRevive(forms.Form):
        yes_really = forms.BooleanField(...)

    class ReallyDelete(forms.Form):
        yes_really = forms.BooleanField(...)
Then we write code that tries to get and validate a ReallyRevive form
from the POST response if the request is not pending, and do the same
with a ReallyDelete form instead if the request is pending. And we have
just created a dangerous race.
Suppose that two privileged users are both trying to revive the same request at the same time. Both see the page rendered with a ReallyRevive form, both tick the checkbox, and both submit the form, one somewhat after the other. In the first form submission, the code retrieves a valid ReallyRevive form and sets the request back to the pending state. In the second form submission, the code successfully retrieves a valid ReallyDelete form from the POST response despite the fact that it is actually a ReallyRevive response, and immediately deletes the just-revived request. Oops.
(You can see this as a REST violation if you want to. My view is that these things happen in practice so I should be aware of the bear traps waiting in the underbrush.)
The solution is to give your forms different field names; here we would
have a really_revive checkbox in one form and a really_delete
checkbox in the other.
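The race and the fix can be simulated without a web framework. This is a rough sketch under stated assumptions: `handle_post` and `two_revives` are hypothetical names, the request is reduced to a state string, and a ticked checkbox is assumed to arrive with the value "on" (what browsers send when the checkbox has no explicit value).

```python
def handle_post(state, post_data, revive_field, delete_field):
    """Simulate the detail-page handler and return the request's new state."""
    if state != "pending":
        # Accepted or rejected requests expect the "revive" form.
        if post_data.get(revive_field) == "on":
            return "pending"
    else:
        # Pending requests expect the "delete" form.
        if post_data.get(delete_field) == "on":
            return "deleted"
    return state  # the expected form did not validate; change nothing


def two_revives(revive_field, delete_field):
    """Two users submit the revive form for the same rejected request."""
    post_data = {revive_field: "on"}  # what both browsers send
    state = "rejected"
    state = handle_post(state, post_data, revive_field, delete_field)
    state = handle_post(state, post_data, revive_field, delete_field)
    return state


# Shared field name: the second revive is read as a delete.
shared = two_revives("yes_really", "yes_really")
# Distinct field names: the second revive fails validation and is harmless.
distinct = two_revives("really_revive", "really_delete")
```

With the shared name the request ends up deleted; with distinct names it stays pending, because the second submission has no really_delete field for the handler to find.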
2011-07-02
Dear Googlebot: SMTP is not HTTP
From the logs of a SMTP server here:
    32301# remote from [66.249.67.36]
    32301r GET /robots.txt HTTP/1.1
    32301w 550 Syntax error
    32301r Host: 128.100.3.51:25
    32301w 550 Unknown command 'Host'
    32301r Connection: Keep-alive
    32301w 550 Syntax error
    32301r Accept: text/plain,text/html
    32301w 550 Syntax error
    32301r From: googlebot(at)googlebot.com
    32301w 550 Unknown command 'From'
    32301# aborted: session terminated
(The abort is from my server, which drops connections after too many syntax errors.)
Then it immediately tries the same thing with 'GET / HTTP/1.1'
instead. Oh, and this is nowhere near the first time that Googlebot
has tried this; the first instance in my logs dates from 2007.
Yes, I'm sure that somewhere there is something that looks like a HTTP link to port 25 on this IP address (although Google doesn't know about it; I've tried the obvious web search). But this is still a failure on Google's part, because they should be much more careful than this with any 'url' that involves a port that is known to be used for another protocol. Sure, someone could be running a web server on port 25 against all expectations, but the odds are far better that someone has created a bad or malicious link. And certainly when Googlebot has been receiving SMTP replies for years, it should stop attempting to crawl entirely.
The other failure is that Googlebot should not have made the second
query for / after its attempt to retrieve robots.txt failed. This
was not a web server telling Googlebot 'there is no such file here';
this was the retrieval itself failing with protocol errors. Even if
Googlebot does not specifically have recognizers for SMTP responses (and
I maintain that it should), an odd port plus protocol failures should
mean 'this is probably not a web server, stop now'.
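As an illustration of how cheap such a recognizer could be, here is a small sketch that classifies the first line a server sends back. The function name and the regular expressions are my own assumptions and say nothing about how Googlebot actually works; the sketch also only covers textual HTTP/1.x status lines.

```python
import re

# An HTTP/1.x response starts with a status line such as "HTTP/1.1 200 OK".
HTTP_STATUS_LINE = re.compile(r"^HTTP/\d\.\d \d{3}( |$)")
# An SMTP reply line starts with a three-digit code (2xx to 5xx), normally
# followed by a space or, for multi-line replies, a hyphen.
SMTP_REPLY_LINE = re.compile(r"^[2-5]\d{2}([ -]|$)")


def classify_first_line(line):
    """Guess which protocol produced the first line of a server's reply."""
    line = line.rstrip("\r\n")
    if HTTP_STATUS_LINE.match(line):
        return "http"
    if SMTP_REPLY_LINE.match(line):
        return "smtp"
    return "unknown"


from_web_server = classify_first_line("HTTP/1.1 404 Not Found\r\n")
from_mail_server = classify_first_line("550 Syntax error\r\n")
```

A crawler that got "smtp" (or "unknown") back for robots.txt on port 25 would have good reason to give up on that host and port rather than go on to request /.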
PS: I'm aware that part of the blame falls on my MTA for being so old that it doesn't immediately disconnect Googlebot for illegal pipelining (I assume that that's what's happening here).