Wandering Thoughts archives

2025-02-02

Web spiders (or people) can invent unfortunate URLs for your website

Let's start with my Fediverse post:

Today in "spiders on the Internet do crazy things": my techblog lets you ask for a range of entries. Normally the range that people ask for is, say, ten entries (the default, which is what you normally get links for). Some deranged spider out there decided to ask for a thousand entries at once and my blog engine sighed, rolled up its sleeves, and delivered (slowly and at large volume).

In related news, my blog engine can now restrict how large a range people can ask for (although it's a hack).

DWiki is the general wiki engine that creates Wandering Thoughts. As part of its generality, it has a feature that shows a range of 'pages' (in Wandering Thoughts these are entries, in general these are files in a directory tree), through what I call virtual directories. As is usual with these things, the range of entries (pages, files) that you're asking for is specified in the URL, with syntax like '<whatever>/range/20-30'.
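
To make the URL syntax concrete, here's a minimal Python sketch of parsing such a range component. This is my illustration only, not DWiki's actual code; the function and names are invented.

    import re

    # Hypothetical sketch of parsing a 'range/20-30' style URL component.
    # DWiki's real parsing code isn't shown in this entry; names invented.
    _RANGE_RE = re.compile(r'^(\d+)-(\d+)$')

    def parse_range(component):
        """Turn a path component like '20-30' into a (start, end) tuple,
        or return None if it isn't a well-formed range."""
        m = _RANGE_RE.match(component)
        if m is None:
            return None
        start, end = int(m.group(1)), int(m.group(2))
        if start > end:
            return None
        return (start, end)

    # parse_range('20-30') -> (20, 30); parse_range('1-1000') -> (1, 1000)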

If you visit the blog's front page or similar things, the obvious and discoverable range links you get are for ten entries. In some situations you can get links for slightly bigger ranges, but not substantially larger ones. However, the engine didn't particularly restrict the size of these ranges, so if you wanted to construct URLs by hand, you could ask for very large ranges.

Today, I discovered that two IPs had asked for 1000-entry ranges, and the blog engine provided them. Based on some additional log information, it looks like this isn't the first time that giant ranges have been requested. One of those IPs was an AWS IP, and my default assumption is that anything from AWS is a web spider of some sort. Even if it's not a conventional web spider, I doubt anyone is asking for a thousand entries at once with the plan of reading them all; that's a huge amount of text, so most likely it's being done to harvest a lot of my entries at once for some purpose.

(Partly because of that and partly because such giant ranges put a big load on DWiki, I've now hacked in the feature mentioned in my Fediverse post to restrict how large a range you can request. Because it's a hack, too-large ranges get HTTP 404 responses instead of something more useful.)
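
As a sketch of what such a cap could look like (an illustration under assumptions, not DWiki's actual hack; the limit value here is invented):

    # Hypothetical sketch of capping range sizes. As noted above, the
    # real hack returns HTTP 404 for too-large ranges; the exact limit
    # here is invented.
    MAX_RANGE_SIZE = 50

    def range_status(start, end):
        """Return the HTTP status to use for a requested entry range."""
        if end - start + 1 > MAX_RANGE_SIZE:
            return 404  # a hack: 'not found' instead of something clearer
        return 200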

Sidebar: on the "virtual directories" name and feature

All of DWiki's blog parts are alternate views of a directory hierarchy full of files, where each file is a 'page'; in the context of Wandering Thoughts, almost all pages are blog entries (on the web, the 'See As Normal' link at the bottom will show you the actual directory view of something). A 'virtual directory' is a virtual version of the underlying real directory or directory hierarchy that only shows some pages, for example pages from 2025 or a range of pages based on how recent they are.
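
One way to picture the idea is as a filter over the real directory tree. The following is a toy Python sketch of such a filtered view, with invented names; it is not DWiki's implementation.

    import os

    # Toy sketch of a 'virtual directory': walk the real directory tree
    # but only yield pages that pass a filter, such as pages from 2025.
    # All names here are invented for illustration.
    def virtual_directory(root, keep):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if keep(path):
                    yield path

    # Example: a virtual view showing only 2025's pages.
    # pages_2025 = list(virtual_directory('blog', lambda p: '/2025/' in p))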

All of this is a collection of hacks built on top of other hacks, because that's what happens when you start with a file-based wiki engine and decide you can make it be a blog too with only a few little extra features (as a spoiler, it did not wind up requiring only a few extra things). For example, you might wonder how the blog's front page winds up being viewed as a chronological blog instead of a directory, and the answer is a hack.

web/SpidersInventUnfortunateURLs written at 19:55:57;

Build systems and their effects on versioning and API changes

In a comment on my entry on modern languages and bad packaging outcomes at scale, sapphirepaw wrote about backward and forward compatibility within language ecologies. It's good enough that I'm going to quote part of it here (but go read the whole comment):

I think there’s a social contract that has broken down somewhere.

[...]

If a library version did break things, it was generally considered a bug, and developers assumed it would be fixed in short order. Then, for the most part, only distributions had to worry about specific package/library-version incompatibilities.

This all falls apart if a developer, or the ecosystem of libraries/language they depend on, ends up discarding that compatibility-across-time. That was the part that made it feasible to build a distribution from a collection of projects that were, themselves, released across time.

I have a somewhat different view. I think that the way it was in the old days was less a social contract and more an effect of the environment that software was released into and built in, and now that the environment has changed, the effects have too.

C famously has a terrible story around its (lack of a) build system and dependency management, and for much of its life you couldn't assume pervasive and inexpensive Internet connectivity (well, you still can't assume the latter globally, but people have stopped caring about such places). This gave authors of open source software a strong incentive to be both backward and forward compatible. If you released a program that required the features of a very recent version of a library, you reduced your audience to people who already had the recent version (or better) or who were willing to go through the significant manual effort to get and build that version of the library, and then perhaps make all of their other programs work with it, since C environments often more or less forced global installation of libraries. If you were a library author releasing a new minor version or patch level that had incompatibilities, people would be very slow to actually install and adopt that version because of those incompatibilities; most of their programs using your libraries wouldn't update on the spot, and there was no good mechanism to use the old version of the library for some programs.

(Technically you could make this work with static linking, but static linking was out of favour for a long time.)

All of this creates a quite strong practical and social push toward stability. If you wanted your program or its new version to be used widely (and you usually did), it had better work with the old versions of libraries that people already had; requiring new APIs or new library behavior was dangerous. If you wanted the new version of your library to be used widely, it had better be compatible with old programs using the old API, and if you wanted a brand new library to be used by people in programs, it had better demonstrate that it was going to be stable.

Much of this spilled over into other languages like Perl and Python. Although both of these developed central package repositories and dependency management schemes, for a long time these mostly worked globally, just like the C library and header ecology, and so they faced similar pressures. Python only added fully supported virtual environments in 2012, for example (in Python 3.3).
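
As a concrete marker of that shift, Python 3.3's standard 'venv' module (PEP 405) is what provides those fully supported virtual environments. A minimal sketch of using it, noting that the with_pip option only appeared in Python 3.4:

    import venv

    # Python 3.3 added the standard 'venv' module (PEP 405). This creates
    # an isolated environment with its own site-packages, so one program's
    # dependency versions don't have to be the globally installed ones.
    venv.create('.venv', with_pip=True)  # with_pip needs Python 3.4+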

Modern languages like Go and Rust (and the Node.js/NPM ecosystem, and modern Python venv based operation) don't work like that. Modern languages mostly use static linking instead of shared libraries (or the equivalent of static linking for dynamic languages, such as Python venvs), and they have build systems that explicitly support automatically fetching and using specific versions of dependencies (or version ranges; most build systems are optimistic about forward compatibility). This has created an ecology where it's much easier to use a recent version of something than it was in C, and where API changes in dependencies often have much less effect because it's much easier (and sometimes even the default) to build old programs with old dependency versions.
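
As an illustration of what an optimistic version range looks like in practice, here's a small Python example using the third-party 'packaging' library (which pip itself uses for version specifiers); the specific range is invented:

    from packaging.specifiers import SpecifierSet
    from packaging.version import Version

    # A version range that is optimistic about forward compatibility:
    # it accepts any future 1.x release from 1.4 on, sight unseen.
    spec = SpecifierSet(">=1.4,<2.0")

    print(Version("1.9.3") in spec)  # True: a new 1.9.3 is accepted
    print(Version("2.0.0") in spec)  # False: a new major version is not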

(In some languages this has resulted in a lot of programs and packages implicitly requiring relatively recent versions of their dependencies, even if they don't say so and claim wide backward compatibility. This happens because people would have to take explicit steps to test with their stated minimum version requirements and often people don't, with predictable results. Go is an exception here because of its choice of 'minimum version selection' for dependencies over 'maximum version selection', but even then it's easy to drift into using new language features or new standard library APIs without specifically requiring that version of Go.)
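
To make that difference concrete, here's a toy Python sketch contrasting minimum version selection with 'pick the newest available' resolution. The module names and versions are invented, and real Go MVS works over a requirement graph rather than a flat list like this.

    # Toy contrast of Go-style minimum version selection (MVS) with
    # 'newest available' selection. Requirements are minimums, as in
    # go.mod; all data here is invented for illustration.
    requirements = {"app": "1.2.0", "depA": "1.4.0", "depB": "1.3.0"}
    available = ["1.2.0", "1.3.0", "1.4.0", "1.5.0", "1.6.0"]

    def vkey(v):
        return tuple(int(x) for x in v.split("."))

    # MVS picks the smallest version satisfying every stated minimum,
    # which is the maximum of the minimums. New releases don't move it.
    mvs = max(requirements.values(), key=vkey)

    # 'Maximum' selection picks the newest release that satisfies the
    # minimums, so the answer drifts as new versions are published.
    newest = max((v for v in available if vkey(v) >= vkey(mvs)), key=vkey)

    print(mvs, newest)  # 1.4.0 1.6.0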

One of the things about technology is that it absolutely affects social issues, so different technology creates different social expectations. I think that's what's happened with social expectations around modern languages. Because they have standard build systems that make it easy, people feel free to have their programs require specific version ranges of dependencies (modern as well as old), and package authors feel free to break things and then maybe fix them later, because programs can opt in or not and aren't stuck with the package's choices for a particular version. There are still forces pushing towards compatibility, but they're weaker than they used to be and more often violated.

Or to put it another way, there was a social contract of sorts for C libraries in the old days, but the social contract was a consequence of the restrictions of the technology. When the technology changed, the 'social contract' also changed, with unfortunate effects at scale, which most developers don't care about (most developers aren't operating at scale; they're scratching their own itch). The new technology and the new social expectations are probably better for the developers of programs, who can now easily use new features of dependencies (or alternately, not have to update their code to the latest upstream whims), and for the developers of libraries and packages, who can change things more easily and who generally see their new work being used faster than before.

(In one perspective, the entire 'semantic versioning' movement is a reaction to developers not following the expected compatibility that semver people want. If developers were already doing semver, there would be no need for a movement for it; the semver movement exists precisely because people weren't. We didn't have a 'semver' movement for C libraries in the 1990s because no one needed to ask for it; it simply happened.)

programming/BuildSystemsAndAPIChanges written at 16:52:44;


