Wandering Thoughts

2017-06-15

Why I am not installing your app on my phone

For reasons beyond the scope of this entry, I spent a decent chunk of time today using my phone to amuse myself. Part of that time was reading Twitter, and part of that reading involved following links to interesting articles on various places. Quite a number of those places wanted me to install their iPhone app instead of reading things on their website, and some of them were quite obnoxious about it. For example, Medium sticks a prominent non-dismissable button in the middle of the bottom of the screen, effectively shrinking an already-too-small screen that much further.

Did I install any of the apps that these websites wanted me to? Of course not. This is not because my phone has only limited space, and it's not really because I prefer keeping my phone uncluttered. There's a much more fundamental reason: I don't trust your app. In fact, I assume that almost all apps that websites want me to use instead of reading the site are actually trojan horses.

By this, I don't mean that I expect any of these apps to quietly attack the security of my phone and attempt to compromise it (although I wouldn't take that bet on Android). I don't even necessarily expect all of these apps to demand intrusive device permissions, like constant access to location services (although I suspect that a lot of them will at least ask for lots of permissions, because maybe I'll be foolish enough to agree). I do definitely expect that all of these apps will put their app nature to 'good' use in order to spy on, track, and monetize my in-app activity to a much larger extent than their websites can. Any potential improvement in my user experience over just reading the website is incidental to their actual reason for existing, which is why they're trojan horses.

So no, I will not install your app, because I trust Apple and Safari to do a lot more to preserve my privacy against your website and your spying on me than I trust you. Safari allows various sorts of (ad)blockers, it allows me to totally discard cookies and other tracking identification that websites have foisted off on me, and in general I trust Apple to be pretty carefully restrictive of what it allows websites to do with JavaScript and what information they can get. Your app? I have no such trust; in fact I expect the exact reverse of trust.

There is nothing particularly surprising here, of course. This is simply the inevitable result of the business model of these websites. I'm not their customer, although they may pretend otherwise; instead, I am part of the product, to be packed up and sold off to advertisers. Trying to get me to accept the app is part of fattening me up for their actual customers.

(This is a bit of a grumpy rant, because I got sick and tired of all the 'install our app, really' badgering from various places, especially when it makes their websites less usable. Some of the time these nags encouraged me to close the page as not sufficiently fascinating, which may or may not have been a win for the websites in question.)

WhyNotInstallingYourApp written at 01:05:50; Add Comment

2017-06-06

The IPv6 address lookup problem (and brute force solution)

In Julia Evans' article Async IO on Linux: select, poll, and epoll, she mentioned in passing that she straced a Go program making an HTTP request and noticed something odd:

Then [the Go program] makes 2 DNS queries for example.com (why 2? I don’t know!), and uses epoll_wait to wait for replies [...]

It turns out that this is all due to IPv6 (and the DNS standards), and it (probably) happens in more than just Go programs (although I haven't straced anything else to be sure). So let's start with the problem.

Suppose that you have a 'dual-stack' machine, one with both IPv4 and IPv6 connectivity. You need to talk to a wide variety of other hosts; some of them are available over IPv4 only, some of them are available over IPv6 only, and some of them are available over both (in which case you traditionally want to use IPv6 instead of IPv4). How do you look up their addresses using DNS?

DNS currently has no way for a client to say 'give me whatever IPv4 and IPv6 addresses a host may have'. Instead you have to ask specifically for either IPv4 addresses (with a DNS A record query) or IPv6 addresses (with a DNS AAAA record query). The straightforward way for a dual-stack machine to find the IP addresses of a remote host would be to issue an AAAA query to get any IPv6 addresses, wait for it to complete (or error out or time out), and then issue an A query for IPv4 addresses if necessary. However, there are a lot of machines that have no IPv6 addresses, so a lot of the time you'd be adding the latency of an extra DNS query to your IP address lookups. Extra latency (and slower connections) doesn't make people happy, and DNS queries are not necessarily the fastest thing in the world in the first place for various reasons.

(Plus, there are probably some DNS servers and overall DNS systems that will simply time out for IPv6 AAAA queries instead of promptly giving you a 'no' answer. Waiting for a timeout adds substantial amounts of time. Properly operated DNS systems shouldn't do this, but there are plenty of DNS systems that don't operate properly.)

To deal with this, modern clients increasingly opt to send out their A and AAAA DNS queries in parallel. This is what Go is doing here and in general (in its all-Go resolver, which is what the Go runtime tries to use), although it's hard to see it in the net package's source code until you dig quite far down. Go waits for both queries to complete, but there are probably some languages, libraries, and environments that immediately start a connection attempt when they get an answer back, without waiting for the other protocol's query to finish too.
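
(For illustration, here is a minimal sketch of the parallel-query idea in Go, using the net package's resolver directly. This is just a sketch of the approach, not the Go runtime's actual resolver code.)

    // A minimal sketch of parallel A and AAAA lookups; an illustration of
    // the idea, not the Go runtime's actual resolver code.
    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    func lookupBoth(host string) []net.IP {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        type result struct {
            ips []net.IP
            err error
        }
        ch := make(chan result, 2)

        // Fire off the A (IPv4) and AAAA (IPv6) lookups concurrently
        // instead of waiting for one to finish before starting the other.
        for _, network := range []string{"ip4", "ip6"} {
            go func(network string) {
                ips, err := net.DefaultResolver.LookupIP(ctx, network, host)
                ch <- result{ips, err}
            }(network)
        }

        // Like Go, wait for both answers and collect whatever we got; a
        // host with no addresses of one family simply contributes nothing.
        var all []net.IP
        for i := 0; i < 2; i++ {
            if r := <-ch; r.err == nil {
                all = append(all, r.ips...)
            }
        }
        return all
    }

    func main() {
        fmt.Println(lookupBoth("example.com"))
    }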

(There is a related algorithm called Happy Eyeballs which is about trying to make IPv6 and IPv4 connections in parallel and using whichever completes first. And there is a complicated RFC on how you should select a destination address out of the collection that you may get from your AAAA and A DNS queries.)
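
(And a simplified sketch of the connection-racing side of Happy Eyeballs: dial both address families concurrently and use whichever connection succeeds first. Real implementations stagger the attempts, typically giving IPv6 a short head start; the addresses below are documentation-range placeholders.)

    // A simplified sketch of the connection racing behind Happy Eyeballs.
    // Real implementations give IPv6 a short head start rather than
    // starting both dials at exactly the same time.
    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    func dialFirst(ctx context.Context, addrs []string) (net.Conn, error) {
        type result struct {
            conn net.Conn
            err  error
        }
        ch := make(chan result, len(addrs)) // buffered so no dialer ever blocks
        var d net.Dialer
        for _, addr := range addrs {
            go func(addr string) {
                conn, err := d.DialContext(ctx, "tcp", addr)
                ch <- result{conn, err}
            }(addr)
        }

        var lastErr error
        for i := 0; i < len(addrs); i++ {
            r := <-ch
            if r.err == nil {
                // First success wins; quietly close any slower winners.
                go func(remaining int) {
                    for j := 0; j < remaining; j++ {
                        if late := <-ch; late.err == nil {
                            late.conn.Close()
                        }
                    }
                }(len(addrs) - i - 1)
                return r.conn, nil
            }
            lastErr = r.err
        }
        return nil, lastErr
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        // Placeholder IPv6 and IPv4 addresses for the same (hypothetical) host.
        conn, err := dialFirst(ctx, []string{"[2001:db8::1]:80", "192.0.2.1:80"})
        if err != nil {
            fmt.Println("dial failed:", err)
            return
        }
        fmt.Println("connected to", conn.RemoteAddr())
        conn.Close()
    }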

Sidebar: DNS's lack of an 'all types of IP address' query type

I don't know for sure why DNS doesn't have a standard query type for 'give me all IP addresses, either IPv4 or IPv6'. Per Wikipedia, DNS itself was created in the mid 1980s, well before IPv6 was designed. However, IPv6 itself is decades old at this point, which is lots of time to add such a query type to DNS and have people adopt it (although it might still not be universally supported, which would leave you falling back to explicit A queries at least). My best guess for why such a query type was never added is a combination of backwards compatibility worries (since initially not many DNS servers would support it, so clients would mostly be making an extra DNS query for nothing) and a general belief on the part of IPv6 people that IPv4 was going to just go away entirely any day soon, really.

(We know how that one turned out. It's 2017, and IPv4-only hosts and networks remain very significant.)

IPv6AddressLookupProblem written at 00:27:36; Add Comment

2017-06-02

My views on the JSON Feed syndication feed format

When I first read the JSON Feed version 1 specification, I came away feeling frustrated (and expressed it on Twitter) because my initial impression was that the JSON Feed people had not bothered to look at prior art and (painful) prior experiences. Then I read more, including things like Mapping RSS and Atom to JSON Feed, which made it clear that several things that I thought might be accidental omissions were in fact deliberate decisions. Now my current dominant feeling about JSON Feed is quiet sadness.

On a straightforward level I think that the current JSON Feed specification makes some bad suggestions about id elements (and also). I also think that the specification is at least loosely written overall, with imprecise language and important general qualifications that are mentioned only in one spot. I think that this is a bad idea given how I expect JSON Feed's specification to be read. Since people implementing JSON Feed seem to currently be coordinating with each other, JSON Feed may still avoid potential misunderstandings and being redefined by implementations.

Stepping beyond issues of how the specification is written, I'm sad that JSON Feed has chosen to drop various things that Atom allows. The thing that specifically matters to me is HTML in feed entry titles, because I use that quite frequently, usually for fonts. Resources like Mapping RSS and Atom to JSON Feed make it plain that this was a deliberate choice in creating the specification. I think that Atom encapsulates a lot of wisdom about what's important and useful in a syndication feed format and it would clearly be useful to have a JSON mapping of that, but that's not what JSON Feed is; it has deliberately chosen to be less than Atom, eliminating some features and some requirements outright.

(The whole thing leaves me with the feeling that JSON Feed is mostly crafted to be the minimum thing that more or less works, both in the actual content of the specification and how it's written. Some people will undoubtedly consider this praise for JSON Feed.)

As you might suspect from this, I have no plans to add JSON Feed generation to DWiki, the wacky Python-based wiki engine behind Wandering Thoughts. Among other issues, DWiki is simply not written in a way that would make generating JSON natively at all an easy process. Adding a JSON Feed is probably reasonably easy in most environments where you assemble your syndication feed as a complete data structure in memory and then serialize it in various formats, because JSON is just another format there (and these days, probably an easy one to serialize to). But for better or worse, DWiki uses a fundamentally different way of building feeds.
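
(For what it's worth, here is a minimal sketch of that 'assemble a data structure, then serialize it' approach, in Go purely for illustration. The struct fields mirror the JSON Feed v1 element names; the feed contents themselves are made up.)

    // A minimal sketch of the "build a data structure, then serialize it"
    // approach to feed generation, producing JSON Feed v1 with encoding/json.
    // A real feed would fill in more of the optional fields.
    package main

    import (
        "encoding/json"
        "fmt"
    )

    type JSONFeed struct {
        Version     string `json:"version"`
        Title       string `json:"title"`
        HomePageURL string `json:"home_page_url,omitempty"`
        FeedURL     string `json:"feed_url,omitempty"`
        Items       []Item `json:"items"`
    }

    type Item struct {
        ID            string `json:"id"`
        URL           string `json:"url,omitempty"`
        Title         string `json:"title,omitempty"`
        ContentHTML   string `json:"content_html,omitempty"`
        DatePublished string `json:"date_published,omitempty"`
    }

    func main() {
        feed := JSONFeed{
            Version: "https://jsonfeed.org/version/1",
            Title:   "Example Blog",
            Items: []Item{{
                ID:            "https://example.com/2017/05/an-entry",
                URL:           "https://example.com/2017/05/an-entry",
                Title:         "An entry",
                ContentHTML:   "<p>Hello.</p>",
                DatePublished: "2017-05-26T22:33:22-04:00",
            }},
        }
        out, err := json.MarshalIndent(feed, "", "  ")
        if err != nil {
            panic(err)
        }
        fmt.Println(string(out))
    }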

Should you provide a JSON Feed version of your syndication feed? I have no opinion either way. Do it if you want to, especially if it's easy. I do hope very much that we don't start seeing things that are JSON-Feed-only, because of course there are a lot of syndication feed consumers out there that certainly don't understand JSON Feed now and may never be updated to understand it.

(But then, maybe syndication feeds are on the way out in general. Certainly there have been rumbles about that in the past, although you couldn't prove it from my Atom feed fetch rates.)

JSONFeedMyViews written at 00:21:47; Add Comment

2017-05-31

How a lot of specifications are often read

In the minds of specification authors, I suspect that they have an 'ideal reader' of their specification. This ideal reader is a careful person; they read the specification all the way through, cross-referencing what they read with other sections and perhaps keeping notes. When there is ambiguity in one part, the ideal reader keeps it in mind as an unsettled issue and looks for things said in other parts that will resolve it, and when something of global importance is mentioned in one section, the reader remembers and applies it to the entire specification.

I'm sure that some specifications are read by some people in this way. If you're working on something of significant importance (especially commercial importance) and there's a core standard, probably you approach it with this degree of care and time, because there is a lot on the line. However, I don't think that this is common. In practice I believe that most people read most specifications rather differently; they read them as if they were references.

People very rarely read references front to back, much less taking notes and reconciling confusions. Instead, they perhaps skim your overview and then when they have a question they open up the reference (the specification), go to the specific section for their issue, and try to read as little as possible in order to get an answer. Perhaps they'll skim some amount of things around the section just in case. People do this for a straightforward reason; they don't want to spend the time to read the entire thing carefully, especially when they have a specific question.

(If it's not too long and is written decently well, people may read your entire specification once, casually and with some skimming, just to get a broad understanding of it. But they're unlikely to read it closely with lots of care, because that's too much work, and then when they wind up with further questions they're going to flip over to treating the specification as a reference and trying to read as little as possible.)

The corollary to this is that in a specification that you want to be implemented unambiguously, it's important that each part or section is either complete in itself or clearly incomplete in a way that actively forces people to go follow cross-references. If you write a section so that it looks complete but is actually modified in an important way by another section, you can probably expect a fair number of the specification's readers to not realize this; they will just assume that it's complete and then they won't remember, notice, or find your qualifications elsewhere.

(This includes sections that are quietly ambiguous as written but have that ambiguity resolved by another section. When this happens, readers are basically invited to assume that they know what you mean and to make up their own answers. This is a great way to wind up with implementations that don't do what you intended.)

SpecsHowAreOftenRead written at 23:07:12; Add Comment

2017-05-28

Specifications are ultimately defined by their implementations

In theory, the literal text of a specification is the final authority on defining what the specification means and requires. In practice, it generally doesn't work out this way; once a specification gets adopted, it ultimately becomes defined by its implementations. Regardless of what the actual text says, if everyone, or most people, or just dominant implementations do something or have some (mis-)interpretation of the specification, those things become the specification in practice. If your implementation doesn't conform to the wrong things that other implementations do, you can expect to have problems interoperating with those other implementations, and they almost always have more practical power than you do. You can appeal to the specification all you want, but it's not going to get you anywhere. People actually using the implementations generally care most that they interoperate, and they don't really care about why they do or don't. A new implementation that refuses to interoperate may or may not be 'correct' by the specification (many people are not well placed to know for sure), but it certainly isn't very useful to most people and it's not likely to get many users in the absence of other factors.

(Of course there can always be other factors. It's sometimes possible to give people no choice about using a particular (new) implementation or very strongly tilt them towards it, and if you do this with a big enough pool of people, your new implementation can rapidly become a dominant one. The browser wars in the late 90s are one example of this effect in action, as are browser engines on mobile platforms today.)

One corollary of this is that it's quite important to write a clear and good specification. Such a specification maximizes the chances that all implementations will do the same thing and that what they do will match what you wrote. Conversely, the more confusing and awkward the specification, the more initial chaos there will be in implementations and the more random and divergent from your intentions the eventual actual in-practice 'standard' is likely to be.

(If your specification is successful, enough of the various people involved will wind up implementing some common behavior so they can interoperate. This behavior does not necessarily have much relationship to what you intended; instead it's likely to be based on some combination of common misunderstandings, early implementations that set the stage for everyone else to copy, and what people settled on as the most useful way to behave.)

(I've sort of written about this before in the context of programming language specifications.)

SpecsEndUpDefinedByImplementations written at 00:18:02; Add Comment

2017-05-26

Why globally unique IDs are useful for syndication feed entries

Pretty much every syndication feed format ever has some sort of 'id' field for syndication feed entries (ie, the posts and so on) that is supposed to be unique for every entry in the feed and never repeat. Since basically everything else about a feed entry can change (the title, the text, and yes even the URL), this unique ID is used by feed readers and other consumers of the syndication feed in order to tell the difference between new entries and ones that have merely been updated. Often it is also used to find unchanged entries, rather than forcing a feed consumer to carefully compare all of the other fields against all of the current entries it knows about (either directly or by using them to derive a hash identity for every entry).

When used for this purpose alone it's sufficient for the ID field to be merely unique within this particular syndication feed, but potentially duplicated in other syndication feeds; in other words, you don't strictly speaking need an ID that is globally unique. However, this is thinking too small. In practice it's very useful to be able to recognize the same entry appearing across multiple feeds and to do things specially as a result of it. One obvious action is for a feed reader to mark such a cross-posted feed entry as 'read' in all feeds it appears in when you read it in one, so that you don't have to read and re-read and re-re-read it as it reappears repeatedly, but there are other tricks that can be done here as well.

You might wonder how feed entries can ever be cross-posted. There are two common cases: aggregation sites and feeds such as Planet Debian, and subset feeds from a site where a single entry appears in multiple ones (for example, if there are category feeds and a single entry is in several categories). The aggregation sites case definitely happens in the field and isn't even uncommon if you follow several Planets in the same general field (Planet Debian and Kernel Planet, for example; there are a number of people included in both).

The usefulness of truly globally unique feed identifiers is why the Atom syndication format goes out of its way to specify the atom:id element such that properly constructed Atom IDs will be globally unique, not merely locally unique within a single feed. That the standard explicitly says that they must be IRIs is neither an accident nor Atom being overly picky; properly formed IRIs are globally unique.

This is also why it makes me sad that the JSON Feed version 1 specification does not talk at all about making your items.id values globally unique. By failing to even request that people generate globally unique IDs, the JSON Feed people are ignoring a significant body of practical syndication feed experience.

(Sure, their suggestion of URLs will result in globally unique IDs, but URLs have problems as permanent IDs and there is no guidance to people who immediately see the problems with using URLs here and want to do something better. With no guidance, people will be tempted to do things like use the database primary key as the ID, or maybe generate a random and must-be-unique GUID for each entry. Things like global uniqueness of IDs are too important to be left as implicit side effects of one suggested implementation strategy.)
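
(As an illustration of one 'something better', here is a sketch of building RFC 4151 tag: URIs of the kind many Atom feeds use for atom:id. The domain and entry path are hypothetical; the point is that the same entry gets the same ID in every feed it appears in, and entries from different sites can never collide.)

    // A sketch of building globally unique entry IDs as RFC 4151 "tag:"
    // URIs. The domain, date, and entry path here are hypothetical.
    package main

    import "fmt"

    // tagURI combines a domain you controlled on the given date with a
    // site-local identifier for the entry.
    func tagURI(domain, date, entryPath string) string {
        return fmt.Sprintf("tag:%s,%s:%s", domain, date, entryPath)
    }

    func main() {
        // Prints: tag:example.com,2017-05-26:/blog/syndication/GlobalIDs
        fmt.Println(tagURI("example.com", "2017-05-26", "/blog/syndication/GlobalIDs"))
    }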

SyndicationFeedWhyGlobalUniqueIDs written at 22:33:22; Add Comment

2017-05-15

Thinking about how much asynchronous disk write buffering you want

Pretty much every modern system defaults to having data you write to filesystems be buffered by the operating system and only written out asynchronously; you have to take special steps either to make your write IO synchronous or to force it to disk (which can lead to design challenges). When the operating system is buffering data like this, one obvious issue is the maximum amount of data it should let you buffer up before you have to slow down or stop.
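
(As a concrete illustration of those special steps, here is a small Go sketch: an ordinary write normally just lands in the OS's write buffer, and the explicit Sync call, ie fsync, is what forces the data out to disk. The file name is made up.)

    // A small sketch of the difference between an ordinary buffered write
    // and explicitly forcing data to disk. The file name is just an example.
    package main

    import (
        "log"
        "os"
    )

    func main() {
        f, err := os.Create("example.dat")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        data := make([]byte, 1<<20) // 1 MB of zeroes

        // This normally returns as soon as the kernel has buffered the
        // data; the actual disk write happens asynchronously, later.
        if _, err := f.Write(data); err != nil {
            log.Fatal(err)
        }

        // This is the "special step": block until the data has been
        // pushed to the disk, ie fsync(2).
        if err := f.Sync(); err != nil {
            log.Fatal(err)
        }
    }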

Let's start with two obvious observations. First, if you write enough data, you will always have to eventually slow down to the sustained write speed of your disk system. The system only has so much RAM; even if the OS lets you use it all, eventually it will all be filled up with your pending data and at that point you can only put more data into the buffer when some earlier data has drained out. That data drains out at the sustained write speed of your disk system. The corollary to this is that if you're going to write enough data, there is very little benefit to letting you fill up a large write buffer; the operating system might as well limit you to the sustained disk write speed relatively early.

Second, RAM being used for write buffers is often space taken away from other productive uses for that RAM. Sometimes you will read back some of the written data and be able to get it from RAM, but if it is purely written (for now) then the RAM is otherwise wasted, apart from any benefits that write buffering may get you. By corollary with our first observation, buffering huge amounts of write-only data for a program that is going to be limited by disk write speed is not productive (because it can't even speed the program up).

So what are the advantages of having some amount of write buffering, and how much do we need to get them?

  • It speeds up programs that write occasionally or only once and don't force their data to be flushed to the physical disk. If their data fits into the write buffer, these programs can continue immediately (or exit immediately), possibly giving them a drastic performance boost. The OS can then write the data out in the background as other things happen.

    (Per our first observation, this doesn't help if the collection of programs involved write too much data too fast and overwhelm the disks and possibly your RAM with the speed and volume.)

  • It speeds up programs that write in bursts of high bandwidth. If your program writes a 1 GB burst every minute, a 1 GB or more write buffer means that it can push that GB into the OS very fast, instead of being limited to the (say) 100 MB/s of actual disk write bandwidth and taking ten seconds or more to push out its data burst. The OS can then write the data out in the background and clear the write buffer in time for your next burst.

  • It can eliminate writes entirely for temporary data. If you write data, possibly read it back, and then delete the data fast enough, the data never needs to be written to disk if it can all be kept in the write buffer. Explicitly forcing data to disk obviously defeats this, which leads to some tradeoffs in programs that create temporary files.

  • It allows the OS to aggregate writes together for better performance and improved data layout on disk. This is most useful when your program issues comparatively small writes to the OS, because otherwise there may not be much improvement to be had from aggregating big writes into really big writes. OSes generally have their own limits on how much they will aggregate together and how large a single IO they'll issue to disks, which clamps the benefit here.

    (Some of the aggregation benefit comes from the OS being able to do a bunch of metadata updates at once, for example to mark a whole bunch of disk blocks as now used.)

    More write buffer here may help if you're writing to multiple different files, because it allows the OS to hold back writes to some of those files to see if you'll write more data to them soon enough. The more files you write to, the more streams of write aggregation the OS may want to keep active and the more memory it may need for this.

    (With some filesystems, write aggregation will also lead to less disk space being used. Filesystems that compress data are one example, and ZFS in general can be another one, especially on RAIDZ vdevs (and also).)

  • If the OS starts writing out data in the background soon enough, a write buffer can reduce the amount of time a program takes to write a bunch of data and then wait for it to be flushed to disk. How much this helps depends partly on how fast the program can generate data to be written; for the best benefit, you want this to be faster than the disk write rate but not so fast that the program is done before much background write IO can be started and completed.

    (Effectively this converts apparently synchronous writes into asynchronous writes, where actual disk IO overlaps with generating more data to be written.)

Some of these benefits require the OS make choices that push against each other. For example, the faster the OS starts writing out buffered data in the background, the more it speeds up the overlapping write and compute case but the less chance it has to avoid flushing data to disk that's written but then rapidly deleted (or otherwise discarded).

How much write buffering you want for some of these benefits depends very much on what your programs do (individually and perhaps in the aggregate). If your programs write only in bursts or fall into the 'write and go on' pattern, you only need enough write buffer to soak up however much data they're going to write in a burst so you can smooth it out again. Buffering up huge amounts of data for them beyond that point doesn't help (and may hurt, both by stealing RAM from more productive uses and by leaving more data exposed in case of a system crash or power loss).

There is also somewhat of an inverse relationship between the useful size of your write buffer and the speed of your disk system. The faster your disk system can write data, the less write buffer you need in order to soak up medium sized bursts of writes because that write buffer clears faster. Under many circumstances you don't need the write buffer to store all of the data; you just need it to store the difference between what the disks can write over a given time and what your applications are going to produce over that time.
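
(Here is a worked version of that arithmetic with made-up numbers: a program that produces a 1 GB burst at 500 MB/s while the disks drain at 100 MB/s only needs a buffer for the difference.)

    // A worked example of the "store the difference" rule, with made-up
    // numbers for the burst size, production rate, and disk write rate.
    package main

    import "fmt"

    func main() {
        const (
            burstMB   = 1024.0 // total data in the burst, MB
            writeRate = 500.0  // how fast the program produces it, MB/s
            diskRate  = 100.0  // sustained disk write speed, MB/s
        )
        burstSeconds := burstMB / writeRate
        // While the burst is being produced, the disks drain some of it;
        // the buffer only has to hold what's produced minus what's drained.
        needed := burstMB - diskRate*burstSeconds
        fmt.Printf("buffer needed: about %.0f MB instead of %.0f MB\n", needed, burstMB)
        // Prints: buffer needed: about 819 MB instead of 1024 MB
    }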

(Conversely, very slow disks may in theory call for very big OS write buffers, but there are often practical downsides to that.)

WriteBufferingHowMuch written at 23:55:52; Add Comment

2017-05-14

People don't like changes (in computer stuff)

There are always some people who like to fiddle around with things. Some number of photographers are always shuffling camera settings or experimenting with different post-processing; some number of cyclists are always changing bits of their bikes; some car enthusiasts like fiddling with engines and so on. But most people are not really interested in this; they want to get something that works and then they want it to keep on just like that, because it works and it's what they know.

Computers are not an exception to this. For most people, a computer is merely a tool, like their car. What this means is that people don't like their computers to change, any more than they want other things in their life to change. Imagine how it would be if every time you took your car in for service, the mechanics changed something about how the dashboard and controls worked, and every few years during a big service call they would replace the dashboard entirely with a new one that maybe mostly looked and worked the same. Or not. You and many other people would find it infuriating, and pretty soon people would stop bringing their cars in for anything except essential service.

Unfortunately we computer people really love to change things in updates, and of course 'upgrade' is generally synonymous with 'changes' in practice. Against all available evidence we are convinced that people want the latest shiny things we come up with, so we have a terrible track record of forcing them down people's throats. This is not what people want. People want stuff to work, and once it works they want us to stop screwing with it because it works, thanks. People are well aware that us screwing with stuff could perhaps improve it, but that's what everyone claims about all changes; rarely do people push out a change that says 'we're making your life worse' and most changes are created with the belief that they're either necessary or an improvement. However, much of the time the changes don't particularly make people's lives clearly better, and when they do make people's lives better in the long run there is often a significant payoff period that makes the disruption not worth it in the short run.

(Rare and precious is a non-bugfix update that immediately makes people's lives better. And bugfix updates are just making things work the way they should have in the first place.)

In my opinion, this is a fundamental reason why forcing updates on people is not a particularly good answer to people not patching. Unless upgrades and updates magically stop changing things, forcing updates means forcing changes, which makes people unhappy because they generally very much do not want that.

(There is also the chance that an update will do harm. Every time that happens, people's trust in updates decays along with their willingness to take the risk. If your system works now, applying an update might keep it working or it might blow things up, so applying an update is always a risk.)

PeopleDislikeChanges written at 00:51:20; Add Comment

2017-05-13

People don't patch systems and that's all there is to it

Recently (ie, today) there has been all sorts of commotion in the news about various organizations getting badly hit by malware that exploits a vulnerability that was patched by Microsoft in MS17-010, a patch that was released March 14th. I'm sure that the usual suspects are out in force pointing their fingers at organizations for not patching. In response to this you might want to read, say, Steve Bellovin on the practical difficulties of patching. I agree with all of this, of course, but I have an additional perspective.

Although one may dress it up in various ways, real computer security ultimately requires understanding what people actually do and don't do. By now we have a huge amount of experience in this area about what happens when updates are released, and so we know absolutely for sure that people often don't apply updates, along with the extended version of this: people often still stick with things that aren't getting security updates. You can research why this happens and argue about how sensible they are in doing so and what the balance of risks is, but the ground truth is that this is what happens. Much as yelling at people has not magically managed to stop them from falling for phish and malware links in email (for all sorts of good reasons), yelling at people has not persuaded them to universally apply patches (and to update no longer supported systems) and it is not somehow magically going to do so in the future. If your strategy to deal with this is 'yell harder' (or 'threaten people more'), then it is a more or less guaranteed failure on day one.

(If we're lucky, people apply patches and updates sometime, just not right away.)

Since I don't know what the answers are, I will leave corollaries to this blunt fact as an exercise for the reader.

(I'm not throwing stones here, either. I have systems of my own that are out of date or even obsolete (my Linux laptop is 32-bit, and 32-bit Linux Chrome hasn't gotten updates for some years now). Some of the time I don't have any particularly good reason why I haven't updated; it's just that it's too much of a pain and disruption because it requires a reboot.)

PS: I'm pretty sure that forcing updates down people's throats is not the answer, at least not with the disruptive updates that are increasingly the rule. See, for example, people's anger at Microsoft forcing Windows reboots on them due to updates.

PeopleDontPatch written at 00:20:10; Add Comment

2017-05-11

The challenges of recovering when unpacking archives with damage

I wrote recently about how 'zfs receive' makes no attempt to recover from damaged input, which means that if you save 'zfs send' output somewhere and your saved file gets damaged, you are up the proverbial creek. It is worth mentioning that this is not an easy or simple problem to solve in general, and that doing a good job of this is likely going to affect a number of aspects of your archive file format and how it's processed. So let's talk a bit about what's needed here.

The first and most obvious thing you need is an archive format that makes it possible to detect and then recover from damage. Detection is in some sense easy; you checksum everything and then when a checksum fails, you know damage has started. More broadly, there are several sorts of damage you need to worry about: data that is corrupt in place, data that has been removed, and data that has been inserted. It would be nice if we could assume that data will only get corrupted in place, but my feelings are that this assumption is unwise.

(For instance, you may get 'removed data' if something reading a file off disk hits a corrupt spot and spits out only partial or no data for it when it continues on to read the rest of the file.)

In-place corruption can be detected and then skipped with checksums; you skip any block that fails its checksum, and you resume processing when the checksums start verifying again. Once data can be added or removed, you also need to be able to re-synchronize the data stream to do things like find the next start of a block; this implies that your data format should have markers, and perhaps some sort of escape or encoding scheme so that the markers can never appear in actual data. You want re-synchronization in your format in general anyway, because one of the things that can get corrupt is the 'start of file' marker; if it gets corrupted, you obviously need to be able to unambiguously find the start of the next file.

(If you prefer, call this a more general 'start of object' marker, or just metadata in general.)
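
(To make this concrete, here is a toy sketch of such a format: each block starts with a magic marker and carries a length and a CRC, and the reader skips damaged data by scanning forward to the next marker. The format and names are invented for illustration; a real format would also need escaping or framing so the marker can't appear inside data.)

    // A toy resilient block format: marker + length + CRC + data per block,
    // with a reader that resynchronizes on the next marker after damage.
    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "hash/crc32"
    )

    var marker = []byte("BLK1")

    // writeBlock appends one framed, checksummed block to buf.
    func writeBlock(buf *bytes.Buffer, data []byte) {
        buf.Write(marker)
        binary.Write(buf, binary.BigEndian, uint32(len(data)))
        binary.Write(buf, binary.BigEndian, crc32.ChecksumIEEE(data))
        buf.Write(data)
    }

    // readBlocks returns every block whose checksum verifies, resuming at
    // the next marker after any damage instead of giving up entirely.
    func readBlocks(raw []byte) [][]byte {
        var out [][]byte
        for {
            i := bytes.Index(raw, marker)
            if i < 0 || len(raw[i:]) < len(marker)+8 {
                return out
            }
            raw = raw[i+len(marker):]
            size := binary.BigEndian.Uint32(raw[0:4])
            sum := binary.BigEndian.Uint32(raw[4:8])
            body := raw[8:]
            if uint32(len(body)) < size || crc32.ChecksumIEEE(body[:size]) != sum {
                continue // damaged header or data; rescan from here
            }
            out = append(out, body[:size])
            raw = body[size:]
        }
    }

    func main() {
        var buf bytes.Buffer
        writeBlock(&buf, []byte("first"))
        writeBlock(&buf, []byte("second"))
        archive := buf.Bytes()
        archive[len(marker)+8] ^= 0xff // corrupt the first block's data in place

        // Only "second" survives; the reader skipped the damaged block.
        for _, b := range readBlocks(archive) {
            fmt.Printf("recovered: %q\n", b)
        }
    }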

So you have an archive file format that has internal markers for redundancy and where you can damage it and resynchronize with as little data lost and unusable as possible. But this is just the start. Now you need to look at the overall structure of your archive and ask what happens if you lost some chunk of metadata; how much of the archive is unable to be usefully processed? For example, suppose that data in the archive is identified by inode number, you have a table mapping inode numbers to filenames, and this table can only be understood with the aid of a header block. Then if you lose the header block to corruption, you lose all of the filenames for everything in the archive. The data in the archive may be readable in theory, but it's not useful in practice unless you're desperate (since you'd have to go through a sea of files identified only by inode number to figure out what they are and what directory structure they might go into).

Designing a resilient archive format, one that recovers as much as possible in the face of corruption, often means designing an inconvenient or at least inefficient one. If you want to avoid loss from corruption, both redundancy and distributing crucial information around the archive are your friends. Conversely, clever efficient formats full of highly compressed and optimized things are generally not good.

You can certainly create archive formats that are resilient this way. But it's unlikely to happen by accident or happenstance, which means that an archive format created without resilience in mind probably won't be all that resilient even if you try to make the software that processes it do its best to recover and continue in the face of damaged input.

ResilientArchivesChallenges written at 02:07:30; Add Comment
