Today, I give you the third post in a series about how unit tests affect codebases.
The first one wound up getting a lot of attention, which was fun. In it, I presented some analysis I’d done of about 100 codebases: I had formed hypotheses about how unit tests would affect codebases, and then I tested those hypotheses.
In the second post, I incorporated a lot of the feedback that I had requested in the first post. Specifically, I partnered with someone to do more rigorous statistical analysis on the raw data that I’d found. The result was much more clarity about not only the correlations among code properties but also how much confidence we could have in those relationships. Some had strong relationships while others were likely spurious.
In this post, though, I’m incorporating the single biggest piece of feedback. I’m analyzing more codebases.
Analysis of 500 (ish) C# Codebases
Performing static analysis on and recording information about 500 codebases isn’t especially easy. To facilitate this, I’ve done significant work automating ingestion of codebases (a simplified sketch follows this list):
- Enabling autonomous batch operation
- Logging which codebases fail and why
- Building in redundancy against accidentally analyzing the same codebase twice
- Executing not just builds but also NuGet package restores and other build steps
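Here’s what I mean, in heavily stripped-down form. This isn’t the actual tool; the corpus path, log file name, and build commands are placeholders, and the real pipeline hands the build output to NDepend afterward.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class IngestionSketch
{
    // Redundancy against accidentally analyzing the same codebase twice.
    static readonly HashSet<string> AlreadyAnalyzed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    static void Main()
    {
        foreach (var repoPath in Directory.GetDirectories(@"C:\corpus")) // placeholder corpus root
        {
            if (!AlreadyAnalyzed.Add(repoPath))
                continue;

            var solution = Directory
                .GetFiles(repoPath, "*.sln", SearchOption.AllDirectories)
                .FirstOrDefault();

            // NuGet restore plus build; log which codebases fail and why, then move on.
            if (solution == null ||
                !Run("nuget", $"restore \"{solution}\"") ||
                !Run("msbuild", $"\"{solution}\" /p:Configuration=Release"))
            {
                File.AppendAllText("failures.log",
                    $"{repoPath}: no solution found, or restore/build failed{Environment.NewLine}");
                continue;
            }

            // In the real pipeline, the built assemblies would go to NDepend for analysis here.
        }
    }

    static bool Run(string command, string arguments)
    {
        using (var process = Process.Start(new ProcessStartInfo(command, arguments)
        {
            UseShellExecute = false
        }))
        {
            process.WaitForExit();
            return process.ExitCode == 0;
        }
    }
}
```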
That’s been a big help, but there’s still the matter of finding these codebases. To do that, I mined a handful of “awesome codebase” lists, like this one. I pointed the analysis tool at something like 750 codebases, and it naturally filtered out any that didn’t compile or otherwise had trouble in the automated process.
This left me with 503 valid codebases. That number came down to 495 once adjusted for codebases that, for whatever reason, didn’t have any (non-third party) methods or types or that were otherwise somehow trivial.
So the results here are the results of using NDepend for static analysis on 495 C# codebases.
Stats About the Codebases
Alright. So what happened with the analysis? I’ll start with some stats that interested me and will hopefully interest you, just to offer some perspective.
- I analyzed a total of 6,273,547 logical lines of code (LLOC). That’s a lot of code! (Note: codebases have fewer LLOC than editor LOC, which is probably how you’re used to reasoning about lines of code; see the short example after this list.)
- The mean codebase size was 12,674 LLOC and the median was 3,652 LLOC, so a handful of relative monsters dragged the mean pretty high.
- The maximum codebase size was 451,706 LLOC.
- The percent of codebases with 50% or more test methods was 3.4%, which is pretty consistent with the 100 codebase analysis.
- The percent of codebases with 40% or more test methods, however, was 9.1%, which was up from about 7% in the last analysis. So, it appears that we’re getting a little more test-heavy codebase representation here.
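To make that LLOC distinction concrete: as I understand NDepend’s metric, logical lines roughly count statements, so braces, blank lines, comments, and the method signature itself don’t contribute. A contrived example:

```csharp
using System.Collections.Generic;
using System.Linq;

public class Item
{
    public bool IsValid { get; set; }
}

public static class LlocExample
{
    // A dozen or so "editor" lines, but only about three logical lines of code.
    public static int CountValidItems(IEnumerable<Item> items)
    {
        // Braces, blank lines, and comments like this one don't count.
        var valid = items.Where(i => i.IsValid); // logical line 1
        var count = valid.Count();               // logical line 2

        return count;                            // logical line 3
    }
}
```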
Findings From Last Time
Here’s a quick recap of some of the findings from last time around.
- Average method cyclomatic complexity seemed unrelated to prevalence of unit tests.
- Average method nesting depth also seemed unrelated to the prevalence of unit tests.
- More lines of code per method probably correlated with more unit tests, surprisingly.
- Lines of code per constructor decreased with an increase in unit tests, but the p-value was a little iffy.
- Parameters per method decreased as unit tests increased and with pretty definitive p-value.
- Number of overloads per method possibly had a negative relationship with unit test prevalence.
- More inheritance correlated with fewer unit tests, fairly decisively.
- Type cohesion correlated strongly with an increase in the unit test percentage.
I’ve omitted a few of the things I studied in the previous posts, both for the sake of brevity and in order to focus on what I think of as properties of clean codebases. Generally speaking, you want code with fewer lines, less complexity, fewer parameters, fewer overloads, and less nesting per method. In terms of types, you want a flat inheritance hierarchy and more cohesion.
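For anyone curious what “correlated” means mechanically here: the study boils each codebase down to a handful of numbers (its unit test method percentage, its average parameters per method, and so on) and then looks at relationships across codebases. The real modeling was done by my statistics partner in dedicated software; the snippet below is only a minimal illustration of the simplest form of that idea, with made-up numbers standing in for real codebases.

```csharp
using System;
using System.Linq;

class CorrelationSketch
{
    // Pearson correlation coefficient between two equal-length series.
    static double Pearson(double[] x, double[] y)
    {
        double meanX = x.Average(), meanY = y.Average();
        double covariance = x.Zip(y, (a, b) => (a - meanX) * (b - meanY)).Sum();
        double spreadX = Math.Sqrt(x.Sum(a => (a - meanX) * (a - meanX)));
        double spreadY = Math.Sqrt(y.Sum(b => (b - meanY) * (b - meanY)));
        return covariance / (spreadX * spreadY);
    }

    static void Main()
    {
        // Hypothetical data: one point per codebase.
        double[] testMethodPercent = { 2, 5, 10, 25, 40, 55 };
        double[] paramsPerMethod   = { 2.8, 2.6, 2.5, 2.1, 1.9, 1.7 };

        // A negative r here would echo the finding that parameters per method
        // tend to fall as unit test prevalence rises.
        Console.WriteLine($"r = {Pearson(testMethodPercent, paramsPerMethod):F3}");
    }
}
```

The p-values quoted throughout came from the statistical models built on top of relationships like this one, not from anything this simple.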
What a Difference 400 Codebases Makes
So, let’s take a look at what happens now that we’ve substantially increased the sample size. I’ll summarize here and add a couple of screenshots below that.
- Average method cyclomatic complexity has a strong negative correlation with prevalence of unit tests!
- Average method nesting depth now correlates negatively with more unit tests, though p-value isn’t perfect.
- Lines of code per method flattened out and saw a p-value spike, changing this to “probably no relationship.”
- Lines of code per constructor decreased even more with more unit tests, and p-value became bulletproof.
- Parameters per method, like lines of code per constructor, became an even more bulletproof negative correlation.
- Number of overloads per method became flatter and its p-value worse, so I’m going to say a relationship here isn’t especially likely.
- More inheritance still correlated with fewer unit tests, but p-value is now non-trivial and the relationship flattened.
- Type cohesion’s relationship didn’t change very much at all.
[Screenshots: Average Cyclomatic Complexity Per Method, Average Method Nesting Depth, Lines of Code Per Method, and Number of Overloads Per Method]
Unit Tests and Clean Code
If I circle back to my original hypotheses, it seems they’re faring better as I add more codebases to the study.
With 500 codebases in the mix, the results line up with those hypotheses considerably better, though I’m not entirely sure why. Perhaps some outliers skewed the original study a bit more, or perhaps this resulted from the codebase corpus on the whole becoming more “unit-test heavy.” But whatever the reason, five times the sample size is starting to show some pretty definitive results.
The properties that we associate with clean code — cohesion, minimal complexity, and overall thematic simplicity — seem to show up more as unit tests show up more.
The only exception that truly surprises me was and remains lines of code per method. I wonder if this might be the result of a higher prevalence of properties in non-test-heavy codebases or some other confounding relationship. In any case, though, it’s interesting.
But with 500 codebases analyzed automatically and the results synthesized with statistical modeling software, I feel pretty good about where this study is. And while it doesn’t paint a “unit tests make everything rainbows and unicorns” picture, this study now demonstrates, pretty definitively, that codebases with unit tests also tend to have other desirable properties.
What’s Next?
I’m going to keep working, in conjunction with the person doing the statistical models, to study more properties of codebases. And I think, for now, I’m going to wrap this unit test study and move on to other things, satisfied that we’ve given it a pretty good treatment.
One thing that occurs to me is the somewhat important difference between the 100-codebase results and the 500-codebase results. Maybe I should grow the corpus to 1,000 or even 2,500 to make sure I don’t see a similar reversal. But the thing is, that’s a lot of codebases, and I’ve already nearly exhausted the “awesome lists,” so I’m worried about diminishing returns. Rest assured, though: I’ll keep slurping down codebases, and if I find myself with significantly more at some point, we’ll redo the analysis.
So what’s next? I’ve had a few ideas and am brainstorming more of them.
- See what effect having CQRS seems to have on a codebase.
- Skinny or fat domain objects? Which is better?
- How does functional style programming impact codebases?
- How does a lot of async logic affect codebases?
- Or what about lots of static methods? (A rough sketch of how one might measure that follows below.)
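As a taste of how that last one might be measured: NDepend can answer this sort of question directly with its query language, but purely as a hypothetical illustration, here is a Roslyn-based sketch (it assumes the Microsoft.CodeAnalysis.CSharp NuGet package and a source file path passed as an argument) that counts how many of a file’s methods are static.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

class StaticMethodCounter
{
    static void Main(string[] args)
    {
        // Hypothetical usage: StaticMethodCounter.exe SomeFile.cs
        var source = File.ReadAllText(args[0]);
        var root = CSharpSyntaxTree.ParseText(source).GetRoot();

        var methods = root.DescendantNodes()
                          .OfType<MethodDeclarationSyntax>()
                          .ToList();

        var staticMethods = methods.Count(m => m.Modifiers.Any(SyntaxKind.StaticKeyword));

        Console.WriteLine($"{staticMethods} of {methods.Count} methods are static.");
    }
}
```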
These are just some ideas. Weigh in below in the comments with your own, if you’d like. Hopefully you all find this stuff as interesting as I do!
Your analysis isn’t as statistically rigorous as it could be. I think you’ve misunderstood what a p-value is, which is a common mistake. It’s not a measure of how likely a hypothesis is false given the evidence. The p-value measures how likely it is that an experiment will produce a result that is at least as strong as the observed result, ASSUMING THERE IS NO ACTUAL RELATIONSHIP BETWEEN THE VARIABLES. For example, if you run an experiment with two completely unrelated variables, there is a 50% chance it will have a p-value less than 0.5!
The typical p-value cutoff is 0.05 for this reason, although some researchers believe that is still too high (search for John Ioannidis). Also note that doing multiple comparisons will increase your probability of obtaining a false positive result. In order to keep the p-value cutoff at 0.05 for all comparisons, you will have to use one of these techniques: https://en.wikipedia.org/wiki/Multiple_comparisons_problem
Using the Bonferroni correction, a p-value cutoff of 0.05 for 11 comparisons requires a 0.004545… per-comparison p-value cutoff. At this level, the only significant results in your previous article are for parameters per method and inheritance depth. None of the listed p-values in this article are significant at that level.
Alternatively you could do a Bayesian analysis. Bayesian methods can obtain what you are really looking for, the probability of the hypothesis given the evidence.
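(For reference, the Bonferroni arithmetic mentioned above, spelled out as a quick check:)

```csharp
using System;

class BonferroniCheck
{
    static void Main()
    {
        const double familywiseAlpha = 0.05;
        const int comparisons = 11;

        // Per-comparison cutoff needed to keep the overall false-positive
        // rate at roughly 0.05 across all 11 comparisons.
        double perComparisonCutoff = familywiseAlpha / comparisons;

        Console.WriteLine($"Per-comparison cutoff: {perComparisonCutoff:F6}"); // 0.004545
    }
}
```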