There are several validity criteria that let you quickly judge whether you should expect the same results as those reported in an experiment:
Internal validity – is there a true cause-and-effect relationship between the variables? For example, does pair programming improve code quality? If a group is pair programming, writing tests first, and taking longer to build the application, can we reasonably conclude that it is the pair programming that improved the quality, or are there other explanations? For example, could the extra time spent building the application have made the difference?
Construct validity – is there a correspondence between your measurements and the concepts (constructs) under study? Does the measure being used – for example, cyclomatic complexity – really indicate the quality of the concept being evaluated, in this case design?
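To make the construct-validity worry concrete, here is a rough sketch (not from the original text) of how cyclomatic complexity is typically approximated – one plus the number of decision points. The two sample functions are invented for illustration: they score identically, even though one is clearly the better design, which is exactly the gap between the metric and the construct "design quality".

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + the number of decision points."""
    tree = ast.parse(source)
    decisions = (ast.If, ast.For, ast.While, ast.And, ast.Or, ast.ExceptHandler)
    return 1 + sum(isinstance(node, decisions) for node in ast.walk(tree))

# Hypothetical examples: same complexity score, very different readability.
well_named = "def is_adult(age):\n    if age >= 18:\n        return True\n    return False\n"
cryptic    = "def f(x):\n    if x >= 18:\n        return True\n    return False\n"

print(cyclomatic_complexity(well_named))  # 2
print(cyclomatic_complexity(cryptic))     # 2 -- identical score, worse design
```

The metric cannot distinguish the two, so any experiment using it as its sole proxy for design quality has a construct-validity problem.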
Statistical validity – is the sample size big enough, and are the results 'statistically significant'? If you read about an experiment in which real developers worked for a week and showed an improvement in design quality when using TDD, can you count on the results? In this case, no: one week of observation is far too small a sample from which to extrapolate the effectiveness of TDD on a multi-month or multi-year project.
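A minimal sketch of why small samples fail the significance test. The defect counts below are invented for illustration; with only five data points per group, even a sizable difference in means produces a t-statistic below the usual critical value, so the result could plausibly be noise.

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))

# Hypothetical defect counts from a one-week trial, five teams per group.
tdd_group    = [3, 5, 4, 6, 4]   # mean 4.4 defects
no_tdd_group = [5, 7, 4, 8, 6]   # mean 6.0 defects

t = welch_t(tdd_group, no_tdd_group)
print(round(abs(t), 2))  # ~1.84, below the ~2.3 critical value for p < 0.05
```

The TDD group looks 27% better on average, yet the test cannot rule out chance – the kind of underpowered result a week-long experiment tends to produce.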
An experiment that was run to evaluate the effectiveness of TDD (in speed and design quality) can be found here. This experiment was run with professional developers developing 200 lines of code. A reader who is aware of the different types of validity can easily see the error in assuming the results apply to projects of thousands (or millions) of lines.
A critical look at a pair programming report, which claimed that pair programming was 15% faster than the alternative, can be found on hacknot.
The fact is, an experiment rigorous enough to be generalized to real-world projects is prohibitively expensive to run. Experiments with students can only be generalized to other students. Experiments using professional developers for only a limited amount of time yield results that cannot be generalized to long-running development projects. If you have cited experimental results before, re-read the paper in this new light and share your thoughts.