2012-07-25
My dislike for what I call 'perverse Test Driven Development'
There is a particular style of TDD that I will call 'perverse TDD' for lack of a better name. In perverse TDD, you are supposed to very literally write the most minimal code possible to pass a new test, even if the code is completely useless and artificial. The ostensible justification and excuse for this is that it makes sure you have tests for all of your code (taken from this site on TDD Django development, because reading its writeup on this is what pushed me over the edge today).
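To make this concrete, here is a hypothetical sketch of the first 'perverse' step in Python (the function and test are my own invention for illustration, not taken from the TDD Django writeup):

    import unittest

    def add(a, b):
        # The deliberately minimal 'perverse' implementation: this is
        # the least code that makes the single test below pass.
        return 5

    class TestAdd(unittest.TestCase):
        def test_add_two_and_three(self):
            self.assertEqual(add(2, 3), 5)

    if __name__ == "__main__":
        unittest.main()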
I hate this approach and it drives me up the wall. I think it's a stupid and wasteful way to work, and besides it doesn't really achieve its stated goals. It's very hard (and wasteful) to test enough to defeat serious perversity in the code under test, and it's inefficient to repeatedly rewrite your code (always moving towards what you knew from the start that you were going to write) as you write and rewrite a series of slightly smarter and more specific tests. In fact it's more than merely inefficient; it's artificial make-work.
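Continuing the hypothetical sketch, each slightly smarter test only forces another minimal dodge until you finally write the obvious code:

    # A second test (say assertEqual(add(4, 5), 9)) defeats the
    # constant, but the most minimal fix is just another special case:
    def add(a, b):
        if (a, b) == (2, 3):
            return 5
        return 9

    # Only a third test finally forces the code you knew from the
    # start that you were going to write:
    def add(a, b):
        return a + b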
I see the appeal of having relatively complete tests, but I don't think perverse TDD is the way to get to it. If you need thorough tests, just write a bunch of tests at once and then write the real code that should satisfy them. If you're writing the real code and you find important aspects of it that aren't being tested, add more tests. Tests and code do not exist in an adversarial relationship; attempting to pretend otherwise is fooling yourself.
(Indeed, as the TDD Django writeup shows, at some point even people who do perverse TDD deviate from their strict and narrow path in order to write real code instead of the minimal code that passes the tests. This is because programmers are ultimately not stupid and they understand what the real goal is, and it is not 'pass tests'.)
I feel that perverse testing is in the end one sign of the fallacy that everything can or should be (unit) tested. There are plenty of things about your code that are not best established through testing, and one of them is whether you are testing for enough things and the right things.
2012-07-24
Unicode's two new problems
If you want to convert code from dealing with more or less uninterpreted strings of bytes into dealing properly with character encodings, ie from using raw bytes into using Unicode, you will have two new fundamental problems. (You will also have a number of practical problems like sorting, but these can be addressed with suitable Unicode libraries and perhaps a certain amount of handwaving.)
These two problems are what to do with invalid input when you decode from byte strings into Unicode, and what to do with Unicode code points that can't be represented in the output character encoding. Both are new failure points introduced by adding Unicode, and neither can really be handled for you by a library because what to do is generally dependent on the specific situation in your code.
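In Python 3, for example, these two failure points surface as two distinct exceptions (the specific byte sequence and code point here are just illustrations):

    # Problem 1: invalid input bytes when decoding to Unicode.
    # 0xC3 starts a two-byte UTF-8 sequence, but 0x28 is not a valid
    # continuation byte, so this is not well-formed UTF-8.
    try:
        b"\xc3\x28".decode("utf-8")
    except UnicodeDecodeError as e:
        print("decoding failed:", e)

    # Problem 2: a code point the output encoding can't represent.
    try:
        "snowman: \u2603".encode("ascii")
    except UnicodeEncodeError as e:
        print("encoding failed:", e)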
Having said that, there are various locales where one or the other problem does not apply. If the output locale is UTF-8 you will always be able to encode any Unicode code point (the same is true for less common locales like UTF-16). Many input locales have no such thing as invalid input; all of their bytes and byte sequences map to Unicode code points (this is true of pretty much all encodings that just use single bytes, for example). UTF-8 is actually uncommon in allowing plenty of invalid input sequences (so you win on output but you lose on input).
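A quick Python illustration of this asymmetry, with ISO-8859-1 standing in for the single-byte case:

    raw = bytes(range(256))

    # Every possible byte is a valid ISO-8859-1 character, so decoding
    # from this single-byte encoding can never fail...
    text = raw.decode("iso-8859-1")

    # ...and UTF-8 can encode every Unicode code point, so this can
    # never fail either.
    text.encode("utf-8")

    # But the same 256 bytes are not well-formed UTF-8:
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print("invalid UTF-8 input:", e)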
This creates potentially pernicious problems if you develop in a properly set up UTF-8 locale. Your tools will normally not generate invalid UTF-8 input and of course there's nothing you can't output; the result is that none of your error paths for input and output will get exercised. In fact you can get away without any error handling for decoding and encoding errors (this is easier in some languages than others).
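As a sketch of how easy this is, here is a Python 3 filter with no encoding error handling at all; in a UTF-8 locale it never fails, but feed it invalid bytes, or run it in a more restrictive locale with the wrong output, and it dies with an uncaught exception:

    import sys

    # sys.stdin and sys.stdout decode and encode using the locale's
    # character encoding; with no error handling here, a
    # UnicodeDecodeError or UnicodeEncodeError will simply propagate
    # and kill the program.
    for line in sys.stdin:
        sys.stdout.write(line.upper())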
If you care about handling both problems, you will need to test in a non-UTF-8 locale in order to provoke output encoding errors, and in a UTF-8 locale with deliberately broken input in order to create input decoding errors. Even if you only support using your code in a UTF-8 locale, you should test with invalid input because you will almost certainly see it sooner or later.
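One way to provoke both errors in tests without depending on the ambient locale is to pin the encodings explicitly; in Python, io.TextIOWrapper lets you simulate streams in an arbitrary encoding:

    import io

    # Provoke an input decoding error: deliberately broken bytes read
    # through a UTF-8 text stream.
    broken_input = io.TextIOWrapper(io.BytesIO(b"ok\n\xc3\x28\n"),
                                    encoding="utf-8")
    try:
        broken_input.read()
    except UnicodeDecodeError:
        print("provoked an input decoding error")

    # Provoke an output encoding error: a code point that a
    # single-byte output encoding can't represent.
    narrow_output = io.TextIOWrapper(io.BytesIO(), encoding="iso-8859-1")
    try:
        narrow_output.write("snowman: \u2603")
        narrow_output.flush()
    except UnicodeEncodeError:
        print("provoked an output encoding error")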
(I suspect that there is lots of code in lots of languages that doesn't make any attempt to handle either problem, precisely because everything runs fine normally (in a UTF-8 environment) even if you don't.)
(None of this is at all new. I just feel like writing it down myself in one place where I can find it.)