My "time to full crawl" (vague) metric

September 17, 2024

This entry, along with all of Wandering Thoughts (this blog) and in fact the entire wiki-thing it's part of, is dynamically rendered from my wiki-text dialect to HTML. Well, in theory. In practice, one of the several layers of caching that make DWiki (this software) perform decently is a cache of the rendered HTML. Because DWiki is often running as an old-fashioned Apache CGI, this rendering cache lives on disk.

(DWiki runs in a complicated way that can see it operating as a CGI under low load or as a daemon with a fast CGI frontend under higher load; this entry has more details.)

Since there are only so many things to render to HTML, this on-disk cache has a maximum size that it stabilizes at; given enough time, everything gets visited and thus winds up in the disk cache of rendered HTML. The render disk cache lives in its own directory hierarchy, and so I can watch its size with a simple 'du -hs' command. Since I delete the entire cache every so often, this gives me an indicator that I can call either "time to full cache" or "time to full crawl". The time to full cache is how long it typically takes for the cache to reach its maximum size after I've deleted it, which is how long it takes for everything to be visited by something (or, more precisely, to be used in rendering a URL that something visited).

I haven't attempted to systematically track this measure, but when I've looked it usually takes less than a week for the render cache to reach its stable 'full' size. The cache stores everything in separate files, so if I was an energetic person I could scan through the cache's directory tree, look at the file modification times, and generate some nice graphs of how fast the crawling goes (based on either the accumulated file sizes or the accumulated number of files, depending on what I was interested in).
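As a rough illustration of what that scan could look like, here is a minimal Python sketch. It is not anything DWiki actually does; the function name cache_growth and the idea of passing in the cache's root directory are assumptions made for the example. It also assumes that a cache file's modification time is a good stand-in for when that entry was first rendered, which is only true if entries aren't rewritten after they are created.

```python
import os


def cache_growth(cache_root):
    """Return (mtime, cumulative_files, cumulative_bytes) tuples, oldest first.

    cache_root is a hypothetical path to the render cache's directory tree.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                # The file vanished (or is unreadable) mid-scan; skip it.
                continue
            entries.append((st.st_mtime, st.st_size))

    # Oldest cache entries first, so the running totals show growth over time.
    entries.sort()

    growth = []
    total_bytes = 0
    for count, (mtime, size) in enumerate(entries, start=1):
        total_bytes += size
        growth.append((mtime, count, total_bytes))
    return growth
```

Each tuple is one point on the growth curve, so plotting the second or third element against the first gives either the "number of files" or the "accumulated size" version of the graph. The timestamp of the last tuple, compared against when the cache was deleted, is a rough "time to full cache".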

(In theory I could do this from web server access logs. This would give me a somewhat different measure, since I'd be tracking what URLs had been accessed at least once instead of which bits of wikitext had been used in displaying URLs. At the same time, it might be a more interesting measure of how fast things are visited, and I do have a catalog of all page URLs here in the form of an automatically generated sitemap.)
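The access log version could be sketched along the following lines. This is an illustration under several assumptions, not a description of my actual setup: it assumes Apache common or combined log format, it assumes the sitemap has already been reduced to a collection of URL paths, and the names first_visits and full_crawl_time are made up for the example.

```python
import re
from datetime import datetime

# Matches the timestamp, request path, and status of a common/combined
# format log line. Only GET and HEAD requests are considered.
LOG_RE = re.compile(
    r'\[(?P<when>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)


def first_visits(log_lines, sitemap_paths):
    """Map each sitemap path to the earliest time it was served with a 200."""
    wanted = set(sitemap_paths)
    first_seen = {}
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m or m.group("status") != "200":
            continue
        # Ignore any query string when matching against the sitemap.
        path = m.group("path").split("?", 1)[0]
        if path not in wanted:
            continue
        # Note: %b expects English month abbreviations (the C locale).
        when = datetime.strptime(m.group("when"), "%d/%b/%Y:%H:%M:%S %z")
        if path not in first_seen or when < first_seen[path]:
            first_seen[path] = when
    return first_seen


def full_crawl_time(first_seen, sitemap_paths):
    """Return when the last sitemap path was first visited, or None if
    some paths have not been visited at all."""
    if any(path not in first_seen for path in set(sitemap_paths)):
        return None
    return max(first_seen.values())
```

Feeding this the log lines from some starting point onward and subtracting that starting time from the result of full_crawl_time gives a "time to full crawl" in the URL sense rather than the render cache sense.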

PS: I doubt this is a single crawler visiting all of Wandering Thoughts in a week or so. Instead I expect it's the combination of the assorted crawlers (most of them undesirable), plus some amount of human traffic.

Written on 17 September 2024.

Last modified: Tue Sep 17 22:43:03 2024