My arrogance about Unicode and character encodings
Yesterday I described how I could get away with ignoring encoding issues and thus why being forced to deal with Unicode was and is irritating to me. However, there is a gotcha in my approach, one that hides behind a bit of arrogance. Let me repeat the core bit of how my programs typically work:
What they process is a mixture of ASCII (for keywords, directives, and so on, all of the things the program had to interpret) and uninterpreted bytestrings, which are simply regurgitated to the user as-is in appropriate situations.
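This processing model can be sketched in a few lines of Python. This is my own illustration, not code from any actual program of mine; the `echo` directive is a made-up example of an ASCII keyword the program interprets, while everything after it is an opaque bytestring:

```python
# A minimal sketch of processing input as ASCII directives plus
# uninterpreted bytestrings. The 'echo' directive is hypothetical.
def process(data: bytes) -> list[bytes]:
    results = []
    for line in data.splitlines():
        # Interpret only the ASCII directive; the rest of the line
        # is an opaque bytestring, passed through untouched.
        if line.startswith(b"echo "):
            results.append(line[len(b"echo "):])
    return results

# Non-ASCII bytes (here, UTF-8 encoded text) come back out verbatim,
# without the program ever decoding them.
print(process(b"echo caf\xc3\xa9\nignore this\n"))  # [b'caf\xc3\xa9']
```

Because everything is done on bytes, the program never needs to know (or guess) what encoding the non-keyword portions are in.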
This simple, reasonable description contains an assumption: this approach assumes that any encoding will be a superset of ASCII, because it assumes that code can extract plain text ASCII from a file without knowing the file's encoding. This works if and only if the file's actual encoding is implemented as ASCII plus other stuff hiding around the edges, which is true for many encodings including UTF-8 but not for all of them.
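The gap between "many encodings" and "all of them" is easy to demonstrate. UTF-16 is a convenient example of an encoding that is not a superset of ASCII at the byte level, so a byte-level search for an ASCII keyword fails on UTF-16 input (this is my illustration, not from the original):

```python
# The 'ASCII plus other stuff' assumption holds for UTF-8 but
# fails for UTF-16, where every ASCII character gains a 0x00 byte.
text = "keyword = value"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# In UTF-8, ASCII bytes survive unchanged, so a byte-level search works.
print(b"keyword" in utf8)   # True
# In UTF-16-LE, 'k' is encoded as b'k\x00', so the contiguous ASCII
# byte string 'keyword' never appears in the encoded data.
print(b"keyword" in utf16)  # False
```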
This is the arrogance of my blithe approach to ignoring character encoding issues. It assumes either that all character sets are supersets of ASCII or that any exceptions are sufficiently uncommon that I don't have to care about them. Of course, by assuming that my programs will never be used by people with such character sets, I've ensured that they never will be.
The conclusion that I draw from this is that I can't ignore character encoding unless I'm willing to be somewhat arrogant. The pain of dealing with decoding and encoding issues is simply the price of not being arrogant.
(On the other hand it's still very tempting to be arrogant this way, for reasons that boil down to 'I can get away with it because the environments where it matters are probably quite rare, and it's much easier'.)