Monday 26 January 2015

Software Testing with Rigour

Previously, I've heard a lot about test-driven development (TDD) and why TDD is dead. Gauntlets have been thrown down, holy wars waged, and internet blood spilled.

This has resulted in widespread disaffection with TDD and, in part, some substantial disillusionment with testing as a whole. I know I suffer from this; some days, I just want to cast the spear and shield of TestNG and Cobertura aside, and take up the mighty bastard-sword of SPARK Ada.

I can sympathise with the feeling that testing is ineffective, that testing is a waste of time, that testing just doesn't work. When I get another bug report that the bit of code I worked so hard to test is "just broken," I want to find the user and shake them until they conform to the code.

Clearly, assaulting end-users is not the way forwards. Unfortunately, tool support for the formal verification of Java code is also lacking in the extreme, so that route is right out too.

My own testing had been... undisciplined. Every class had unit tests, and I even wrote quite a lot of integration tests. But it seemed lots of bugs were still getting through.
It seems to me that there are two things that really need to be said:
  1. Testing is extremely important.
  2. Creating tests up front was, at one point, basically common sense. Myers covers this in his 1979 book, "The Art of Software Testing".
Following in Myers' footsteps, I'd also like to make a big claim: most people who do TDD aren't doing testing right.

TDD is often used as a substitute for program requirements or a program specification. Unfortunately, since the tests are almost always code, when a test fails, how does one decide in a consistent way whether it's the test or the program that's broken? What if a new feature is to be worked into the code base that looks fine on the surface, but the test suite shows it to be mutually exclusive with another feature?

Agile purists, take note: a user story can work as a piece of a system's specification or requirements, depending on the level of detail in the story. If you're practicing "agile" but you don't have some way of specifying features, you're actually practicing the "ad hoc" software development methodology, and quality will likely fall.

Testing is "done right" when it is done with the intent of showing that the system is defective.

Designing Test Cases

A good test case is one which stands a high chance of exposing a defect in the system.

Unlike Myers, I side with Regehr on this one: randomised testing of a system is a net good, if you have:
  1. A medium or high strength oracle.
  2. The time to tweak your test case generator to your system.
  3. A relatively well-specified system.
If you want to add even more strength to this method, combinatorial test case generation followed by randomised test case generation looks to be a very powerful combination.
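
To make that concrete, here's a minimal sketch of randomised testing against a reference-implementation oracle, written with TestNG. The mySort method is a hypothetical stand-in for whatever your own system under test is; the shape is what matters: a seeded generator you can tweak, lots of runs, and a strong oracle to compare against.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    import org.testng.Assert;
    import org.testng.annotations.Test;

    public class RandomisedSortTest {

        // Hypothetical system under test -- stands in for your own code.
        private static List<Integer> mySort(List<Integer> input) {
            List<Integer> copy = new ArrayList<>(input);
            Collections.sort(copy); // pretend this is a hand-rolled implementation
            return copy;
        }

        @Test
        public void randomisedInputsAgainstReferenceOracle() {
            Random rng = new Random(42); // fixed seed, so failures are reproducible
            for (int run = 0; run < 1000; run++) {
                // Test case generator: tweak sizes and value ranges to suit your system.
                int size = rng.nextInt(50);
                List<Integer> input = new ArrayList<>();
                for (int i = 0; i < size; i++) {
                    input.add(rng.nextInt(200) - 100);
                }

                List<Integer> actual = mySort(input);

                // High-strength oracle: an independent reference implementation.
                List<Integer> expected = new ArrayList<>(input);
                Collections.sort(expected);
                Assert.assertEquals(actual, expected, "failed on input: " + input);
            }
        }
    }

Swap the reference implementation for whatever oracle your system affords; the weaker the oracle, the less this technique buys you.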

However, I also feel strongly that time and effort need to be put into manually designing test cases. Specifically, designing test cases with a high degree of discipline, and with the honest intent to break your code.

Myers recommends using boundary value analysis to partition both the input domain and output range of the system, and designing test cases which exercise those boundaries, as well as representative values from each partition.
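
As a sketch of what that looks like in TestNG (discountedTotal is a made-up method, invented purely for this example: a 10% discount kicks in at a quantity of 100, and quantities must be positive), a data provider lets you hit both sides of each boundary plus a representative value from inside each partition:

    import static org.testng.Assert.assertEquals;

    import org.testng.annotations.DataProvider;
    import org.testng.annotations.Test;

    public class DiscountBoundaryTest {

        // Hypothetical system under test: 10% discount on quantities of 100 or more;
        // quantities must be strictly positive.
        private static double discountedTotal(int quantity, double unitPrice) {
            if (quantity <= 0) {
                throw new IllegalArgumentException("quantity must be positive");
            }
            double total = quantity * unitPrice;
            return quantity >= 100 ? total * 0.9 : total;
        }

        @DataProvider(name = "boundaries")
        public Object[][] boundaries() {
            return new Object[][] {
                // Boundaries of the partitions 1..99 and 100.., plus a
                // representative value from inside the no-discount partition.
                {1,   2.0, 2.0},    // lowest legal quantity
                {50,  2.0, 100.0},  // representative of the no-discount partition
                {99,  2.0, 198.0},  // just below the discount boundary
                {100, 2.0, 180.0},  // exactly on the boundary
                {101, 2.0, 181.8},  // just above the boundary
            };
        }

        @Test(dataProvider = "boundaries")
        public void exercisesPartitionBoundaries(int quantity, double unitPrice, double expected) {
            assertEquals(discountedTotal(quantity, unitPrice), expected, 1e-9);
        }

        @Test(expectedExceptions = IllegalArgumentException.class)
        public void rejectsQuantityBelowTheLowerBoundary() {
            discountedTotal(0, 2.0); // just outside the legal input domain
        }
    }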

He also discusses designing test cases which will raise most coverage metrics to close to 100%, which struck me as odd, although he tempers it by using boundary value analysis to inform those high-coverage tests. I'm not sure I can recommend that technique, as it destroys coverage metrics as a valid proxy for testing effectiveness, and Myers acknowledges this earlier in the book.

For a really good run-down of how to apply boundary value analysis (and lots of other really interesting techniques!), I can't do any better than to refer you to Myers' book.

Testing Oracles

Testing oracles are things which, given some system input and the system's corresponding output, decide whether that output is "wrong".

These can be hand-written expectations, or simply waiting for exceptions to appear.

The former is a form of strong oracle; the latter is a very weak one. If your system has a high density of non-trivial assertions, you've got yourself a medium-strength oracle, and can rely on that to some degree.
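
To illustrate the difference in strength -- normalise here is a deliberately trivial, hypothetical stand-in for your own code:

    import java.util.Arrays;

    import org.testng.Assert;
    import org.testng.annotations.Test;

    public class OracleStrengthExamples {

        // Hypothetical system under test -- a stand-in for your real code.
        private static int[] normalise(int[] readings) {
            int[] out = readings.clone();
            Arrays.sort(out);
            return out;
        }

        @Test
        public void weakOracle() {
            // Weak oracle: the only check is that no exception escapes.
            normalise(new int[] {3, 1, 2});
        }

        @Test
        public void mediumStrengthOracle() {
            // Medium-strength oracle: check non-trivial properties of the output
            // without knowing the exact expected value (assertions inside the
            // system itself play the same role).
            int[] result = normalise(new int[] {3, 1, 2});
            Assert.assertEquals(result.length, 3, "length must be preserved");
            for (int i = 1; i < result.length; i++) {
                Assert.assertTrue(result[i - 1] <= result[i], "output must be ordered");
            }
        }

        @Test
        public void strongOracle() {
            // Strong oracle: a hand-written expectation of the exact output.
            Assert.assertTrue(Arrays.equals(normalise(new int[] {3, 1, 2}), new int[] {1, 2, 3}),
                    "expected exactly [1, 2, 3]");
        }
    }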

What To Test

Myers and Regehr are in agreement: testing needs to be done at many levels, though this is less explicit in Myers' book.

Unit testing is a must on individual classes, but it is not enough. Components need to be tested with their collaborators to ensure that there are no miscommunications or misunderstandings of a unit's contract.

I guess the best way to put this across is to state it plainly: unit testing and integration testing are looking for different classes of errors. It is not valid to have one without the other and claim that you have found a reasonable proportion of the errors, as there are whole classes of errors you are simply not looking for.

I, personally, am a big fan of unit testing, then bottom-up integration testing. That is, test all the modules individually, with all collaborators mocked out, then plug units together from the lowest level upwards, culminating in the completed system.

Other approaches may be more effective for you; see Myers' book for alternatives.

This method allows you to look for logic errors in individual units, and when an integration test fails, you have a good idea of what the error is, and where it lies.
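
Here's a rough sketch of both levels using TestNG and Mockito; the UserRepository and ReportService types are invented purely for this example:

    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.verify;
    import static org.mockito.Mockito.when;
    import static org.testng.Assert.assertEquals;

    import org.testng.annotations.Test;

    public class ReportServiceTest {

        // Hypothetical collaborator and unit under test.
        public interface UserRepository {
            String findNameById(long id);
        }

        public static class ReportService {
            private final UserRepository users;

            public ReportService(UserRepository users) {
                this.users = users;
            }

            public String greetingFor(long id) {
                return "Hello, " + users.findNameById(id) + "!";
            }
        }

        @Test
        public void unitLevelWithMockedCollaborator() {
            // Unit level: the collaborator is mocked out, so only ReportService's
            // own logic is being exercised.
            UserRepository repo = mock(UserRepository.class);
            when(repo.findNameById(7L)).thenReturn("Ada");

            ReportService service = new ReportService(repo);

            assertEquals(service.greetingFor(7L), "Hello, Ada!");
            verify(repo).findNameById(7L); // the collaborator's contract was used as expected
        }

        @Test
        public void integrationLevelWithRealCollaborator() {
            // Integration level: plug in a real (here, trivially in-memory) collaborator
            // to catch misunderstandings of the contract that a mock would happily hide.
            UserRepository realRepo = id -> id == 7L ? "Ada" : "unknown";
            assertEquals(new ReportService(realRepo).greetingFor(7L), "Hello, Ada!");
        }
    }

The unit test pins down ReportService's own logic; the integration test then checks that a real collaborator actually honours the contract the mock assumed.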

How to Measure Testing Effectiveness

A test is effective if it has a high probability of finding errors. Measuring this is obviously very hard. One thing you may need to do is work out an estimate of how buggy your code is to begin with. Myers has a run-down of some useful techniques for this.
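
I won't reproduce those techniques here, but to give a flavour of the arithmetic involved, here's a sketch of one general estimation approach (defect seeding); the numbers are entirely made up, and I'm not claiming this particular calculation is taken from the book:

    public class DefectSeedingEstimate {
        public static void main(String[] args) {
            // Defect seeding: deliberately plant artificial defects, then see what
            // fraction of them your testing rediscovers.
            int seeded = 20;      // artificial defects planted
            int seededFound = 15; // planted defects rediscovered by testing
            int realFound = 45;   // genuine defects found by the same testing

            // If testing finds seeded and real defects at roughly the same rate,
            // the estimated total of real defects is realFound * seeded / seededFound.
            double estimatedTotal = (double) realFound * seeded / seededFound;
            System.out.printf("Estimated real defects: %.0f (roughly %.0f still undetected)%n",
                    estimatedTotal, estimatedTotal - realFound);
        }
    }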

Coverage is a reasonable proxy -- if and only if you have not produced test cases designed to maximise a coverage metric.

It would also seem that most coverage tools only measure a couple of very weak coverage metrics: statement coverage and branch coverage. I would like to see common coverage tools start offering condition coverage, multi-condition coverage, and so on.
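
The difference matters. In the hypothetical guard below, two test cases are enough for 100% statement and branch coverage, yet one of the conditions never even gets evaluated as true:

    public class CoverageExample {

        // A hypothetical guard with a compound, short-circuiting condition.
        static boolean shouldRetry(boolean transientError, boolean attemptsRemaining) {
            if (transientError || attemptsRemaining) {
                return true;
            }
            return false;
        }

        public static void main(String[] args) {
            // These two calls give 100% statement and branch coverage:
            System.out.println(shouldRetry(true, true));   // 'then' branch; attemptsRemaining is short-circuited away
            System.out.println(shouldRetry(false, false)); // 'else' branch

            // ...but condition coverage also demands that attemptsRemaining be
            // observed evaluating to true:
            System.out.println(shouldRetry(false, true));
            // Multi-condition coverage demands all four combinations, adding:
            System.out.println(shouldRetry(true, false));
        }
    }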

When to Stop Testing

When the rate at which you're finding defects becomes sufficiently low.

This is actually very hard, especially in an agile or TDD house, where tests are run constantly, defects are patched relatively quickly with little monitoring, and every part of development is expected to double as a testing phase.

If your methodology has a testing phase, and at the end of your (for example) four-week testing window you are still finding more and more defects every week, don't stop. Only stop when the defect-detection rate has dropped to an acceptable level.

If your methodology doesn't have a testing phase, this is a much harder question. You have to rely on other proxies for whether your testing is effective, and for whether you've discovered most of the defects your end users are likely to see. Good luck.

I, unfortunately, am in the latter category. I just test as effectively as I can as I go along, and hope that my personal biases aren't sinking me too badly.

Conclusion

Do testing with the intent of breaking your code; otherwise you're not testing -- you're just stroking your ego.

If possible, get a copy of Myers' book and have a read. The edition I have access to is quite slim, coming in at ~160 pages. I think that if you're serious about having a highly effective test suite, you need to read this book.

Regehr's Udacity course, "Software Testing", is also worth a plug: he turns out to be both a capable tester and an effective teacher of testing, which is a rare combination, so take advantage of it. The course also provides a nice, more modern view of many of Myers' techniques. His blog is also pretty darn good.