I’ve just wrapped up another study. (The last one was about singletons, if you’re interested.) This time, I looked at unit testing and the impact it has on codebases.
It didn’t turn out the way I expected.
I’m willing to bet that these results won’t match your expectations, either. Unit testing had a profound effect on certain unexpected codebase properties while having minimal effect on some that seem like no-brainers. But I’ll get to the specifics of that shortly.
First, let me explain a bit about expectations and methodology.
Unit Testing: The Expected Effect on Code
Let’s be clear for a moment here. I’m not talking about the expected effect of unit tests on outcomes. You might say, “We believe in unit testing because it reduces defects” or, “We unit test to document our design intentions,” and I didn’t (and probably can’t) measure those things.
When I talk about the effect of unit testing, I’m talking about the effect it has on the codebase itself. Let’s consider a concrete example to make it clear what I mean. You can’t (without a lot of chicanery) unit test private methods, and you can’t easily unit test internal methods. This creates a natural incentive to make more methods public, so we might expect heavily unit-tested codebases to feature more public methods.
This actually turns out to be true.
I’ll talk more about the axes later in the post. But for now, check out the plot and the trend line. More unit testing means a higher percentage of public methods. Score one for that hypothesis!
With a win there, let’s think of some other hypotheses that seem plausible. Methods with more “going on” tend to be harder to test. So you’d expect relatively simple methods in a highly tested codebase. To get specific, here was what I anticipated in heavily tested codebases:
- Less cyclomatic complexity per method.
- Fewer lines of code per method.
- Lower nesting depth.
- Fewer method overloads.
- Fewer parameters.
I also had some thoughts about the impact on types:
- More interfaces (this makes testing easier).
- Less inheritance (makes testing harder).
- More cohesion.
- Fewer lines of code.
- Fewer comments.
Methodology: Find the Answers
Hypotheses are no fun unless you test them. (Unless one is a conjecture buff, I suppose.) So that’s what I did. I looked to measure the effect of unit testing on these things. Here’s how I did it.
Just like the singleton challenge, I used 100 codebases (the same 100, actually). These I pulled from GitHub. I automated a process whereby I download the codebase, restore its NuGet packages, build it, analyze it with NDepend, and store the results. I then use those results for these studies I’m having fun with.
So I have 100 codebases and information down to the method level of granularity. This lets me see which methods have test attributes decorating them. For our purposes here, a test method is one decorated with the attributes corresponding to NUnit, MSTest or XUnit. So if any of these codebases are using a different framework, they’ll be counted as non-test codebases (which I recognize as a threat to validity).
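To make the "decorated with a test attribute" rule concrete, here is a minimal sketch of that kind of check. It is not the actual NDepend query behind the study. The attribute names are the common test markers for each framework, and the exact set (along with the function names) is an assumption for illustration. Matching on bare attribute names is also a simplification.

```python
# Common test-marking attributes for the three frameworks the post counts.
# The exact set is an assumption for illustration, not the study's query.
TEST_ATTRIBUTES = {
    "NUnit": {"Test", "TestCase", "TestCaseSource"},
    "MSTest": {"TestMethod", "DataTestMethod"},
    "xUnit": {"Fact", "Theory"},
}


def detect_framework(method_attributes):
    """Return the framework whose test attribute decorates the method, or None."""
    names = set(method_attributes)
    for framework, markers in TEST_ATTRIBUTES.items():
        if names & markers:
            return framework
    return None


def is_test_method(method_attributes):
    """A method counts as a test if any known test attribute decorates it."""
    return detect_framework(method_attributes) is not None
```

A method decorated only with an attribute from some other framework returns `None` here, which is exactly the threat to validity described above: its codebase would be counted as having no tests.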
For this study, I decided to quantify test prevalence rather than settle for the oversimplification of a binary “tests” or “no tests.” In other words, a codebase that you’ve developed using TDD is a far cry from a legacy codebase that you just yesterday added a single test to. And I wanted to capture that nuance to some extent.
So I created conceptual buckets of unit testing:
- No tests.
- Less than 10% of methods in the codebase are test methods.
- Between 10% and 25% of the methods are test methods.
- Between 25% and 50%.
- More than 50%.
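The buckets above can be sketched as a small classification function. The post doesn't say which bucket a codebase sitting exactly on a boundary (10%, 25% or 50%) falls into, so the boundary handling below is an assumption, as is the function name `bucket_for`.

```python
def bucket_for(test_methods, total_methods):
    """Map a codebase to a bucket by the share of its methods that are tests.

    Assumption: a value exactly on a boundary goes to the higher bucket,
    except exactly 50%, which stays in "25% to 50%".
    """
    if total_methods <= 0:
        raise ValueError("total_methods must be positive")
    if not 0 <= test_methods <= total_methods:
        raise ValueError("test_methods must be between 0 and total_methods")
    if test_methods == 0:
        return "No tests"
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    percent_times_total = test_methods * 100
    if percent_times_total < 10 * total_methods:
        return "Less than 10%"
    if percent_times_total < 25 * total_methods:
        return "10% to 25%"
    if percent_times_total <= 50 * total_methods:
        return "25% to 50%"
    return "More than 50%"
```

Note that the denominator is all methods in the codebase, test methods included, which matches how the buckets are worded.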
Why Percentage of Methods That Are Test Methods?
I don’t (yet) have the capability to measure test coverage. And I haven’t yet fleshed out my automated assessment of unit test quality. So I thought the best way to approximate maturity of a codebase vis-à-vis testing would be to use percentage of methods that are test methods.
Why? Well, if you practice TDD, you’re going to drive the creation of any given method with a series of test methods. While you may wind up refactoring to extract methods out, you’ll also drive with a lot of test methods. So I reasoned that test methods making up 50% or more of a codebase’s methods indicated a likely TDD approach. (A check of one known TDD codebase bore this quick hypothesis out, albeit for a tiny, anecdotal sample size.)
From there, the other buckets were somewhat arbitrary, but they gave me a reasonable distribution. Let’s look at that.
Overview of Basic Data
Here’s the quick-hitting data in bullet format:
- Of the 100 codebases, 70 had unit tests, while 30 did not.
- Of the 70 codebases with tests, 24 used XUnit, 36 used NUnit, and 10 used MSTest.
- In the “buckets”:
  - 30 had no tests.
  - 30 had 0% to 10% of total methods as tests.
  - 17 had 10% to 25%.
  - 19 had 25% to 50%.
  - Four had more than 50%.
But that’s just the basic demographics of their profiles as they relate to unit testing. Let’s start looking at the results. I captured the code metrics that I mentioned (and more besides) for all of the codebases and looked at each bucket’s average. So, for instance, I looked at things like “average method cyclomatic complexity” on a per-bucket basis.
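The per-bucket averaging can be sketched like this. The record layout (a dict with a `bucket` key plus one key per metric) and the function name are hypothetical, chosen only to show the grouping step.

```python
from collections import defaultdict
from statistics import mean


def average_by_bucket(codebases, metric):
    """Average one codebase-level metric within each testing bucket.

    `codebases` is a list of dicts, each with a "bucket" key and a numeric
    value under `metric`. This record layout is a hypothetical one.
    """
    grouped = defaultdict(list)
    for codebase in codebases:
        grouped[codebase["bucket"]].append(codebase[metric])
    return {bucket: mean(values) for bucket, values in grouped.items()}


# Placeholder records with made-up values, NOT data from the study.
sample = [
    {"bucket": "No tests", "avg_cyclomatic_complexity": 2.0},
    {"bucket": "No tests", "avg_cyclomatic_complexity": 4.0},
    {"bucket": "More than 50%", "avg_cyclomatic_complexity": 1.5},
]
averages = average_by_bucket(sample, "avg_cyclomatic_complexity")
```

One design consequence worth noting: this is an average of per-codebase averages, so a tiny codebase counts as much as a huge one within its bucket.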
Unit Testing Appears to Simplify Method Signatures
First of all, let’s look at a couple of unambiguous wins for unit testing. The data clearly supports that unit testing correlates inversely with method overloads and number of parameters. In other words, in unit tested codebases, methods have fewer overloads and fewer parameters.
In both cases, the trend line is clear. With overloads, untested codebases really seem to skew. The distribution is more even with parameters. But in both cases, having more tests pushes your code toward simplicity.
Unit Testing Appears to Make Complexity and Method Length Worse
This pains me to type (and I do recognize that correlation does not mean causation), but the data here is unambiguous. Look at these trend lines.
In both sets of data, the 0-10% bucket actually does REALLY well. This may skew the trend line some. But even if you allow for that, your best case scenario would be a flat trend line. The codebases with the most tests are the same or worse in both stats than the ones with the fewest tests.
But taking the data at face value, the trend lines speak clearly. More test methods in codebases predict more cyclomatic complexity per method and more lines of code per method. I find that deeply weird and plan to investigate to see if I can figure out why.
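For readers who want to reproduce a trend line like the ones discussed here, this is a minimal sketch of an ordinary least-squares fit. The post doesn't say how its trend lines were fitted, so treat this as one reasonable way to do it rather than the method used. Putting the buckets on the x-axis as evenly spaced positions 0 through 4 is also an assumption.

```python
def trend_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means all x values are identical.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A positive slope means the metric rises as the share of test methods rises. To see how much a single bucket (such as 0-10%) pulls the line, fit once with all five points and once with that point left out, then compare the slopes.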
Looking at the Rest of the Results
For the sake of completeness, let’s check in on the rest of the hypotheses. Here they are for methods (with results):
- Less cyclomatic complexity per method: wrong (there’s more!)
- Fewer lines of code per method: wrong (there’s more!)
- Lower nesting depth: wrong (the trend line is flat)
- Fewer method overloads: correct.
- Fewer parameters: correct.
And here are the ones for types (with results):
- More interfaces (this makes testing easier): wrong (there are fewer!)
- Less inheritance (makes testing harder): correct.
- More cohesion: correct.
- Fewer lines of code: wrong (the trend line is flat)
- Fewer comments: correct.
Of the hypotheses I documented here, I did okay, I suppose. Five out of ten correct, with another two wrong but not contradicted outright (the trend lines were flat). But the three that I was dead wrong about blew me away. And that’s why you experiment instead of assume.
Is That the Final Word on the Matter?
Let me throw several things out there that could most certainly have a mitigating effect on the research I’ve done here. This is by no means the final word on anything. It’s the start of a conversation.
- It could be that the heavy test codebases here are bad representatives of testing.
- Percent of total methods that are test methods might be the wrong way to reason about testing value/impact.
- It might be that a few outliers dramatically skew results.
- I might need to include more test frameworks.
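On the outlier question in particular, one cheap check is to compare each bucket's mean against its median. This is a sketch of that idea, not something the study did, and the function name is hypothetical.

```python
from statistics import mean, median


def mean_median_gap(values):
    """Return (mean, median, mean - median) for one bucket's metric values.

    A gap that is large relative to the median suggests that a few extreme
    codebases are pulling the bucket's average.
    """
    bucket_mean = mean(values)
    bucket_median = median(values)
    return bucket_mean, bucket_median, bucket_mean - bucket_median
```

If a bucket's mean and median tell different stories, rerunning the per-bucket comparison with medians would show whether the surprising trends survive.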
Anyway, you get the idea. I’ve created more data here than I’ve ever seen anywhere else on this subject. (In fact, I’m not aware of any significant data gathered on this.) But there is plenty of work left to do.
I’m planning to continue this series of posts that I think of as “Moneyball for code” or “Freakonomics of code” or something similarly derivative. Let’s keep digging. So please, weigh in and tell me what you think I should investigate next. More codebases for a broader sample? Run this same experiment but with buckets based on another factor around testing? What do you think?