How software makes reverse proxying hard
Our user-run web servers rely on the ability to run the various web applications that people want to use behind a reverse proxy. Well, the theoretical ability, because it turns out that there are a couple of things that programs do that make reverse proxying hard (and that they could do differently to make things easier).
The first is that they should be willing to use the HTTP proxy headers that Apache adds (such as X-Forwarded-For) to get certain bits of information about the request, most notably the originating IP address. Since any client can send these headers itself, applications should do this only when specifically configured to do so.
(Possibly there is an Apache setting for lying to CGIs and other applications about this sort of stuff, but if so we haven't stumbled across it.)
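As a rough illustration of the 'only when configured' part, here is a minimal sketch for a WSGI- or CGI-style application. The `client_ip` function and its `trusted_proxies` setting are hypothetical names made up for this example, not something any particular application provides; the sketch assumes the proxy passes the client address in X-Forwarded-For, which reaches the application as `HTTP_X_FORWARDED_FOR`.

```python
def client_ip(environ, trusted_proxies=()):
    """Return the originating client IP for a WSGI/CGI-style environ.

    trusted_proxies is a hypothetical configuration setting: the set of
    addresses we believe are our own reverse proxies. When it is empty
    (the default), proxy headers are ignored entirely.
    """
    remote = environ.get("REMOTE_ADDR", "")
    if remote not in trusted_proxies:
        # The request did not come through a proxy we trust, so any
        # X-Forwarded-For header could have been forged by the client.
        return remote

    forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
    hops = [h.strip() for h in forwarded.split(",") if h.strip()]
    # Each proxy appends the address it saw, so walk from the right and
    # take the first address that is not one of our own proxies.
    for hop in reversed(hops):
        if hop not in trusted_proxies:
            return hop
    return remote
```

With nothing configured, the function behaves exactly as if the headers did not exist, which is the safe default for an application that is directly exposed to the Internet.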
The less obvious thing is that applications need to distinguish between what I will call 'input' URLs (or at least an input URL prefix) and 'output' URLs. Input URLs are what the application sees on requests after they have been remapped by the proxying process; output URLs are the external, pre-proxying, public URLs that should appear in its output (in HTML, in redirects, in Atom feeds, etc).
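To make the distinction concrete, here is a minimal sketch of what an application could do if it took the two prefixes as separate settings. The `UrlMapper` class, its `input_prefix` and `output_base` parameters, and the example URLs are all made up for illustration; I'm not describing how any real application does this.

```python
class UrlMapper:
    """Keep 'input' and 'output' URL prefixes as two separate settings.

    input_prefix: the path prefix the application sees on requests,
                  after the proxy has remapped them (hypothetical setting).
    output_base:  the public, pre-proxying base URL that must appear in
                  generated HTML, redirects, and feeds (hypothetical setting).
    """

    def __init__(self, input_prefix, output_base):
        self.input_prefix = input_prefix.rstrip("/")
        self.output_base = output_base.rstrip("/")

    def app_path(self, request_path):
        """Turn a proxied request path into an application-relative path."""
        prefix = self.input_prefix
        if request_path == prefix or request_path.startswith(prefix + "/"):
            return request_path[len(prefix):] or "/"
        raise ValueError("request path is outside the input prefix")

    def public_url(self, app_path):
        """Build the public URL to emit for an application-relative path."""
        return self.output_base + "/" + app_path.lstrip("/")


# Example: the application is served at '/' on an internal port, but the
# public sees it under a user's directory on the main web server.
mapper = UrlMapper("/", "https://www.example.org/~user/wiki")
page = mapper.app_path("/page/1")
link = mapper.public_url(page)
```

The important property is that nothing in the output is ever derived from the request's own host, port, or path; everything the application emits goes through `public_url`.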
Applications with no such distinction are, unfortunately, very common. We've tried a couple of ways to hack around it:
- Apache's ProxyPassReverse directive is a very, very limited attempt
to patch up this problem. In my opinion, it actually does more harm
than good in most situations, since it papers over only part of the
problem; better to have no papering over at all, so that everything
- One can often make the absolute path on the user-run web server the same as it is on the real web server; this leaves you with just the port being different. If you're willing to do some hacking, you can configure Apache to lie about that too.
(This works even when the absolute path has a '/~user/' component.
If you disable UserDirs, Apache is perfectly happy to have a literal
~user/ directory in your document root and to serve things from it.)
I'm honestly surprised that more web applications don't make it easy to use them behind a reverse proxy; I had the impression that various forms of reverse proxies were relatively common in high-load environments. Maybe those reverse proxies are deliberately set up to be more transparent than ours is, so that they look more like load balancers than actual reverse proxies.