There are several validity criteria that let you quickly judge whether you should expect the same results as those reported in an experiment:
Internal validity – is there a true cause-and-effect relationship between the variables? For example, does pair programming improve code quality? If a group is pair programming, writing tests first, and taking longer to build the application, can we reasonably conclude that it is the pair programming that improved the quality, or are there other explanations? For example, could the extra time spent building the application have made the difference?
Construct validity – is there a correspondence between your measurements and the concepts (constructs) under study? Does the measure being used – for example, cyclomatic complexity – really indicate the quality of the concept being evaluated, in this case design?
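To make the construct-validity worry concrete, here is a rough sketch (not from the original text) of how cyclomatic complexity is typically approximated – one plus the number of decision points. The two sample functions are invented for illustration: they score identically, even though one is clearly the better design, which is exactly the gap between the metric and the construct "design quality".

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + the number of decision points."""
    tree = ast.parse(source)
    decisions = (ast.If, ast.For, ast.While, ast.And, ast.Or, ast.ExceptHandler)
    return 1 + sum(isinstance(node, decisions) for node in ast.walk(tree))

# Hypothetical examples: same complexity score, very different readability.
well_named = "def is_adult(age):\n    if age >= 18:\n        return True\n    return False\n"
cryptic    = "def f(x):\n    if x >= 18:\n        return True\n    return False\n"

print(cyclomatic_complexity(well_named))  # 2
print(cyclomatic_complexity(cryptic))     # 2 -- identical score, worse design
```

The metric cannot distinguish the two, so any experiment using it as its sole proxy for design quality has a construct-validity problem.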
Statistical validity – is the sample size big enough, and are the results 'statistically significant'? If you read about an experiment in which real developers worked for a week and showed an improvement in design quality when using TDD, can you count on the results? In this case, no: one week of observation is far too small a sample from which to extrapolate the effectiveness of TDD on a multi-month or multi-year project.
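A minimal sketch of why small samples fail the significance test. The defect counts below are invented for illustration; with only five data points per group, even a sizable difference in means produces a t-statistic below the usual critical value, so the result could plausibly be noise.

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))

# Hypothetical defect counts from a one-week trial, five teams per group.
tdd_group    = [3, 5, 4, 6, 4]   # mean 4.4 defects
no_tdd_group = [5, 7, 4, 8, 6]   # mean 6.0 defects

t = welch_t(tdd_group, no_tdd_group)
print(round(abs(t), 2))  # ~1.84, below the ~2.3 critical value for p < 0.05
```

The TDD group looks 27% better on average, yet the test cannot rule out chance – the kind of underpowered result a week-long experiment tends to produce.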
An experiment that was run to evaluate the effectiveness of TDD (in speed and design quality) can be found here. This experiment was run with professional developers developing 200 lines of code. A reader who is aware of the different types of validity can easily see the error in assuming the results apply to projects of thousands (or millions) of lines.
A critical look at a pair programming report, which claimed that pair programming was 15% faster than the alternative, can be found on hacknot.
The fact is, an experiment rigorous enough to be generalized to real-world projects is prohibitively expensive to run. Experiments with students can only be generalized to other students. Experiments using professional developers for only a limited amount of time yield results that cannot be generalized to long-running development projects. If you have cited experimental results before, re-read the paper in this new light and share your thoughts.