I’ve just wrapped up another study. (The last one was about singletons, if you’re interested.) This time, I looked at unit testing and the impact it has on codebases.
It didn’t turn out the way I expected.
I’m willing to bet the results won’t turn out the way you expected, either. Unit testing had a profound effect on certain unexpected codebase properties while having minimal effect on some that seem like no-brainers. But I’ll get to the specifics of that shortly.
First, let me explain a bit about expectations and methodology.
Unit Testing: The Expected Effect on Code
Let’s be clear for a moment here. I’m not talking about the expected effect of unit tests on outcomes. You might say, “We believe in unit testing because it reduces defects” or, “We unit test to document our design intentions,” and I didn’t (and probably can’t) measure those things.
When I talk about the effect of unit testing, I’m talking about the effect it has on the codebase itself. Let’s consider a concrete example to make it clear what I mean. You can’t (without a lot of chicanery) unit test private methods, and you can’t easily unit test internal methods. This creates a natural incentive to make more methods public, so we might expect heavily unit-tested codebases to feature more public methods.
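To make that incentive concrete, here’s a minimal sketch. The class and the figures are hypothetical, and I’m using NUnit syntax since it’s one of the frameworks that shows up in the study:

```csharp
using NUnit.Framework;

// Hypothetical production type: the public method is easy to target from a
// test, while the private helper can only be exercised indirectly through it.
public class PriceCalculator
{
    public decimal Total(decimal subtotal) => subtotal + Tax(subtotal);

    // Not directly reachable from a test without reflection chicanery.
    private decimal Tax(decimal subtotal) => subtotal * 0.08m;
}

[TestFixture]
public class PriceCalculatorTests
{
    [Test]
    public void Total_AddsTax()
    {
        Assert.That(new PriceCalculator().Total(100m), Is.EqualTo(108m));
    }
}
```

If Tax ever grows complex enough to deserve tests of its own, the path of least resistance is to promote it to public (or pull it into its own class), which is exactly the pressure the hypothesis describes.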
This actually turns out to be true.
I’ll talk more about the axes later in the post. But for now, check out the plot and the trend line. More unit testing means a higher percentage of public methods. Score one for that hypothesis!
With a win there, let’s think of some other hypotheses that seem plausible. Methods with more “going on” tend to be harder to test. So you’d expect relatively simple methods in a highly tested codebase. To get specific, here was what I anticipated in heavily tested codebases:
- Less cyclomatic complexity per method.
- Fewer lines of code per method.
- Lower nesting depth.
- Fewer method overloads.
- Fewer parameters.
I also had some thoughts about the impact on types:
- More interfaces (this makes testing easier).
- Less inheritance (makes testing harder).
- More cohesion.
- Fewer lines of code.
- Fewer comments.
Methodology: Find the Answers
Hypotheses are no fun unless you test them. (Unless one is a conjecture buff, I suppose.) So that’s what I did. I looked to measure the effect of unit testing on these things. Here’s how I did it.
Just like the singleton challenge, I used 100 codebases (the same 100, actually). These I pulled from GitHub. I automated a process whereby I download the codebase, restore its NuGet packages, build it, analyze it with NDepend, and store the results. I then use those results for these studies I’m having fun with.
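In case you’re curious about the shape of that automation, here’s a rough, simplified sketch. The tool names and paths are placeholders, and the real pipeline drives the NDepend API rather than shelling out to a console runner:

```csharp
using System.Diagnostics;

static class CodebasePipeline
{
    // Rough shape of the per-codebase automation; every argument here is a placeholder.
    public static void Analyze(string gitUrl, string workDir)
    {
        Run("git", $"clone {gitUrl} {workDir}");
        Run("nuget", $"restore {workDir}");
        Run("msbuild", workDir);
        Run("NDepend.Console.exe", $"{workDir}\\analysis.ndproj"); // hypothetical project file
        // ...then read the analysis results and store the per-method metrics.
    }

    private static void Run(string tool, string args) =>
        Process.Start(tool, args)?.WaitForExit();
}
```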
So I have 100 codebases and information down to the method level of granularity. This lets me see which methods have test attributes decorating them. For our purposes here, a test method is one decorated with the attributes corresponding to NUnit, MSTest or XUnit. So if any of these codebases are using a different framework, they’ll be counted as non-test codebases (which I recognize as a threat to validity).
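To make the classification rule concrete, it amounts to something like the following reflection-based check. This isn’t the actual NDepend query I used, and the exact attribute set may differ slightly, but the attribute list is the key idea:

```csharp
using System.Linq;
using System.Reflection;

static class TestMethodClassifier
{
    // Attribute types treated as "this is a test method" markers.
    private static readonly string[] TestAttributeNames =
    {
        "NUnit.Framework.TestAttribute",                                    // NUnit
        "NUnit.Framework.TestCaseAttribute",                                // NUnit (parameterized)
        "Microsoft.VisualStudio.TestTools.UnitTesting.TestMethodAttribute", // MSTest
        "Xunit.FactAttribute",                                              // XUnit
        "Xunit.TheoryAttribute"                                             // XUnit (parameterized)
    };

    public static bool IsTestMethod(MethodInfo method) =>
        method.GetCustomAttributes(inherit: false)
              .Any(a => TestAttributeNames.Contains(a.GetType().FullName));
}
```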
For this study, I decided to quantify test prevalence rather than settle for the oversimplified binary of “tests” or “no tests.” After all, a codebase that you’ve developed using TDD is a far cry from a legacy codebase that you added a single test to yesterday. And I wanted to capture that nuance to some extent.
So I created conceptual buckets of unit testing (a rough sketch of the bucketing logic follows the list):
- No tests.
- Less than 10% of methods in the codebase are test methods.
- Between 10% and 25% of the methods are test methods.
- Between 25% and 50%.
- More than 50%.
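In code terms, the bucketing amounts to something like this (the enum names are just for illustration):

```csharp
enum TestBucket { NoTests, UnderTenPercent, TenToTwentyFive, TwentyFiveToFifty, OverFifty }

static class TestBuckets
{
    // Assign a codebase to a bucket based on its ratio of test methods to total methods.
    public static TestBucket For(int testMethodCount, int totalMethodCount)
    {
        if (testMethodCount == 0) return TestBucket.NoTests;

        double ratio = (double)testMethodCount / totalMethodCount;
        if (ratio < 0.10) return TestBucket.UnderTenPercent;
        if (ratio < 0.25) return TestBucket.TenToTwentyFive;
        if (ratio < 0.50) return TestBucket.TwentyFiveToFifty;
        return TestBucket.OverFifty;
    }
}
```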
Why Percentage of Methods That Are Test Methods?
I don’t (yet) have the capability to measure test coverage. And I haven’t yet fleshed out my automated assessment of unit test quality. So I thought the best way to approximate maturity of a codebase vis-à-vis testing would be to use percentage of methods that are test methods.
Why? Well, if you practice TDD, you’re going to drive the creation of any given method with a series of test methods. While you may wind up refactoring to extract methods out, you’ll also drive with a lot of test methods. So I reasoned that a 50% or higher ratio of test methods to production methods indicated a likely TDD approach. (A check of one known TDD codebase bore this quick hypothesis out, albeit for a tiny, anecdotal sample size.)
From there, the other buckets were somewhat arbitrary, but they gave me a reasonable distribution. Let’s look at that.
Overview of Basic Data
Here’s the quick-hitting data in bullet format:
- Of the 100 codebases, 70 had unit tests, while 30 did not.
- Of the 70 codebases with tests, 24 used XUnit, 36 used NUnit, and 10 used MSTest.
- Breaking that down into the buckets:
- 30 had no tests.
- 30 had more than 0% but less than 10% of total methods as test methods.
- 17 had 10% to 25%.
- 19 had 25% to 50%.
- 4 had more than 50%.
But that’s just the basic demographics of their profiles as they relate to unit testing. Let’s start looking at the results. I captured the code metrics that I mentioned (and more besides) for all of the codebases and looked at each bucket’s average. So, for instance, I looked at things like “average method cyclomatic complexity” on a per-bucket basis.
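If you want a mental model for that aggregation, it’s essentially a group-by over the per-codebase stats. Here’s a hedged sketch; the record shape is hypothetical, and the underlying metrics come out of NDepend:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for the per-codebase measurements.
record CodebaseStats(string Bucket, double AvgCyclomaticComplexity, double AvgMethodLoc);

static class BucketReport
{
    // Average cyclomatic complexity per method, computed per bucket.
    public static IEnumerable<(string Bucket, double AvgCc)> AverageCcByBucket(
        IEnumerable<CodebaseStats> codebases) =>
        codebases
            .GroupBy(c => c.Bucket)
            .Select(g => (g.Key, g.Average(c => c.AvgCyclomaticComplexity)));
}
```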
Unit Testing Appears to Simplify Method Signatures
First of all, let’s look at a couple of unambiguous wins for unit testing. The data clearly supports that unit testing correlates inversely with method overloads and number of parameters. In other words, in unit tested codebases, methods have fewer overloads and fewer parameters.
In both cases, the trend line is clear. With overloads, the untested codebases skew noticeably high; the distribution is more even with parameters. But in both cases, more tests go hand in hand with simpler method signatures.
Unit Testing Appears to Make Complexity and Method Length Worse
This pains me to type (and I do recognize that correlation does not mean causation), but the data here is unambiguous. Look at these trend lines.
In both sets of data, the 0-10% bucket actually does REALLY well. This may skew the trend line some. But even if you allow for that, your best case scenario would be a flat trend line. The codebases with the most tests are the same or worse in both stats than the ones with the fewest tests.
But taking the data at face value, the trend lines speak clearly. More test methods in codebases predict more cyclomatic complexity per method and more lines of code per method. I find that deeply weird and plan to investigate to see if I can figure out why.
Looking at the Rest of the Results
For the sake of completeness, let’s check in on the rest of the hypotheses. Here they are for methods (with results):
- Less cyclomatic complexity per method: wrong (there’s more!)
- Fewer lines of code per method: wrong (there’s more!)
- Lower nesting depth: wrong (the trend line is flat)
- Fewer method overloads: correct.
- Fewer parameters: correct.
I also had some thoughts about the impact on types:
- More interfaces (this makes testing easier): wrong (there are fewer!)
- Less inheritance (makes testing harder): correct.
- More cohesion: correct.
- Fewer lines of code: wrong (the trend line is flat)
- Fewer comments: correct.
Of the hypotheses I documented here, I did okay, I suppose. Five out of ten correct, with another two wrong but not contradicted outright (flat trend lines). But the three that I was dead wrong about blew me away. And that’s why you experiment instead of assuming.
Is That the Final Word on the Matter?
Let me throw several things out there that could most certainly have a mitigating effect on the research I’ve done here. This is by no means the final word on anything. It’s the start of a conversation.
- It could be that the heavy test codebases here are bad representatives of testing.
- Percent of total methods that are test methods might be the wrong way to reason about testing value/impact.
- It might be that a few outliers dramatically skew results.
- I might need to include more test frameworks.
Anyway, you get the idea. I’ve created more data here than I’ve ever seen anywhere else on this subject. (In fact, I’m not aware of any significant data gathered on this.) But there is plenty of work left to do.
I’m planning to continue this series of posts that I think of as “Moneyball for code” or “Freakonomics of code” or something similarly derivative. Let’s keep digging. So please, weigh in and tell me what you think I should investigate next. More codebases for a broader sample? Run this same experiment but with buckets based on another factor around testing? What do you think?
Comments
This was really interesting!
However the small bucket sizes make me wonder how well these results reflect reality. The last bucket is just 4 code bases, which in statistical terms is next to nothing. I wonder if the results would still hold with a larger number of code bases.
The buckets also hide a lot of detail. Why not just show all the 100 data points in a scatter plot? That would give a much more nuanced view of the data.
Hi Helen — thanks for the thoughts. I think one of the first things that I’m going to do is see if I can 2x, 5x, or even 10x the overall sample sizes. Once I do that, I might also experiment with smaller (or no) buckets as well.
Stay tuned!
You might also want to consider choosing your bucket sizes to include approximately equal samples – your last three buckets combined have only a bit more than either of your first two buckets, which makes projecting a trend off of them as three points instead of one suspect.
Hi Erik!
This was an amazing post, and I’m really surprised and puzzled by some of the results. I’m looking forward to seeing the next articles in the series!
Just a couple of things:
– Is there any chance of you open sourcing the code that you used to perform the analysis?
– I think I’ve got a typo here: “Of the 66 codebases with tests, 24 used XUnit, 36 used NUnit, and 10 used MSTest.” Shouldn’t it be 70?
Got the typo fixed — thanks for pointing that out. I’d been using a different set of figures initially and forgot to change that one. As for open sourcing the code, most of it was using the NDepend API. The parts that I automated (which are part of my own IP for my consulting) are mainly the automated ingestion and analysis of codebases en masse.
You might want to consider, when it comes to complexity and lines of code, that the causality goes the other way around: given a fixed coverage, methods with higher complexity or more lines need more test cases to achieve that coverage than simpler, smaller methods.
That’s a fair point. Do you have any thoughts off the cuff for how to test a hypothesis like that?
Unfortunately I can’t think of a way to test it, but an attempt may have already been made by McCabe in his paper on cyclomatic complexity (1976) and his “basis path” testing strategy. What I remember of his conclusions is that following that test strategy (which seems quite similar to unit testing strategies), you invariably need at least as many tests as the value of the cyclomatic complexity of a given body of code.
“need at least as many tests”… To achieve 100% statement coverage. I forgot to add that to the previous comment
Another hypothetical mechanism for the higher CC and LOC in tested code: testing drives developers to consider input values and resulting edge conditions that they otherwise would have neglected, and the extra code and complexity is created in handling those cases.
Phil,
I had a similar thought from the opposite direction, although your theory sounds just as plausible if not more so. Could testing in some ways enable more complex code? Especially over time as features are added, cruft might build up but as long as the tests pass this is accepted. Meanwhile the code bases with less testing need to be easier for the programmer to understand analytically, so they tend to get refactored to enable that.
Adam
I love this approach of trying to rigorously analyze the effects of unit testing. One threat to validity I possibly see is error bounds: if, for example, your average LOC for the >50% bucket was 5 but the uncertainty was ±2 lines, you couldn’t say that it’s clearly different from the 0% bucket (3.5 lines).
Another example of this: your “#parameters” ranges from 1.2 to 0.8. Since you can’t have fractional parameters, my instinct is to wonder if they’re both clustered around 1: essentially no difference.
Man, this brings me back to my physics days. “Did you get clear results? Then the equipment’s probably broken.”
It would be really interesting to get code coverage metrics included in this. Another interesting test quality evaluation is mutation testing. For those not familiar, the idea is to (automatically) introduce bugs into the code and then check to see if these mutations cause test failures. Good tests kill mutants. It’s a simple idea, but it takes some sophistication to efficiently mutate the code base as doing so exhaustively is computationally prohibitive.
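For anyone who wants a feel for it, here’s a toy sketch of the idea. Real tools (Stryker.NET, for example) generate and run mutants automatically against your whole suite; this just illustrates the principle that a surviving mutant means a gap in your tests:

```csharp
using System;

static class MutationDemo
{
    static int Original(int a, int b) => a + b;
    static int Mutant(int a, int b) => a - b;   // the injected bug ("mutant")

    // Stand-in for an existing unit test of the addition behavior.
    static bool TestPasses(Func<int, int, int> add) => add(2, 3) == 5;

    static void Main()
    {
        bool killed = TestPasses(Original) && !TestPasses(Mutant);
        Console.WriteLine(killed
            ? "Mutant killed: the test caught the injected bug."
            : "Mutant survived: the test missed it.");
    }
}
```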
The conclusion about cyclomatic complexity reminds me of a joke from auto (car) engineering: what makes a car go fast? The brakes! It’s possible that the reason tested codebases have higher levels of internal complexity is that the engineers are confident they can change the code without breaking it (i.e., for more features). What you really want to compare is the lifetime development of the same codebase with varying levels of unit tests.
If you studied Go codebases it would be easy to get test coverage stats, as the standard library unit test framework everyone uses provides the stat by default.
Also, I’ve noticed that in my code with extensive unit testing, I frequently get linter warnings about cyclomatic complexity in the unit tests themselves, for whatever that’s worth.
Thank you so much for attempting to bring some scientific methodology to bear on this topic! I’m amazed that stuff like this is so rare.
I’d like to echo the thoughts of some of the other posters:
– The sample size definitely seems on the small side.
– It would be great if you could share the code you used for this (or at least give some more detail on the methodology). If you’re going to embrace the scientific process, letting people pick apart your methodology is a big component.
I’d also be interested in knowing:
– How did you select the projects, and what sort of size are they?
– Were the test methods themselves excluded from the analysis? I would argue they should be. If they weren’t though, that might explain the longer average method length for the code with more unit tests.
Good finds – and not that surprising.
Tests are (a) executable documentation and (b) a safety net. With a safety net you would jump from much higher up than without one; hence the complexity goes up. About the good brakes: yes.
Number of interfaces – it depends on who writes the tests. If it is the same person who architects and writes the code under test, well, fewer interfaces means fewer things to test (thoroughly) = less work to do. Kind of backfiring… (having all the context/load on the same head may cause quite some limping here and there, and seemingly absurd dependencies appearing where there should be none).
Add some (social? psychological?) characteristic/variable of the code vs. the tests – what is done by whom (same person/people, other people), resources applied (free time, paid time, etc.), time pressure/deadlines vs. unlimited time, etc. Maybe things will get more in tune then, and more rocks will surface.
You know, making software is quite a social affair 🙂 – even when alone, you’re not: the you-from-Friday is also there… and the whole other bunch.
Your results are compatible with “Risk homeostasis” or “risk compensation” theory. People who use unit tests have an increased sense of safety, so are more comfortable living with less-safe designs and code-bases.
“Risk compensation is a theory which suggests that people typically adjust their behavior in response to the perceived level of risk, becoming more careful where they sense greater risk and less careful if they feel more protected. Although usually small in comparison to the fundamental benefits of safety interventions, it may result in a lower net benefit than expected”
— https://en.wikipedia.org/wiki/Risk_compensation
http://injuryprevention.bmj.com/content/4/2/89
Did you do any significance testing on your findings? And did you correct for multiple measurements (Bonferroni)?
I’m more surprised by your expectations than by what you found. Private methods are mere helpers. When we unit test code, it isn’t obvious that these helpers should be moved to the class interface, but tests don’t give any incentive to create them, either. Hence, when you find fewer private methods, I wouldn’t expect that they were moved to public methods, but merely that they don’t exist at all: they were just inlined. That single change explains at once why we get fewer private methods, why we get more cyclomatic complexity in public methods, and why they are longer.
Of course, this could also be seen as an artefact of languages that don’t provide any easy way to test private methods. Comparing these results with unit tests written in a language without that restriction would make it possible to check. Python could probably be used for that purpose, because even though some kind of private method is possible (the __ prefix for external name mangling), most programmers don’t use it.
Did the lines-of-code-per-method numbers include the test code? Or were the complexity/LOC stats generated on just the non-test code?
My thoughts:
– Definitely, it needs a broader sample in order to be significant.
- I would suggest that you consider statistical measures such as the standard deviation. I’m far from being an expert in statistics, but I think it would be very helpful: with it, you could filter out the possible outliers that could “have a mitigating effect on the research” (as you said) and get more reliable results.
- Anyway, I agree with svil, Rishi Ramraj, and Kevin Greer about the length of methods: tests give you a greater sense of safety, so you take more risks by writing longer methods.
I think the two main points that you proved with this research are:
1) that you are really bad at doing research
2) that you don’t understand TDD
Let me explain
1) Your sample pool is really small, especially when you look at the teams that supposedly did TDD. The data is anecdotal at best. Furthermore, you make assumptions that you make no effort to substantiate, and those assumptions have a huge effect on the outcome of your study. For instance, “I reasoned that a 50% or higher ratio of test methods to production methods indicated a likely TDD approach”. This actually proves nothing more than that the code had unit tests. Whether those tests were written first or last cannot be deduced from this metric. Which brings me to my second point.
2) You don’t understand TDD. The fact that your title talks about unit testing and the body of your piece talks about TDD as if the two terms are interchangeable strengthens my view on this. Having unit tests is not the same as doing TDD. TDD is a process of writing tests first, making them pass, and then refactoring. Any other way of obtaining unit tests has the value of having unit tests, but it does not have the same value as TDD.
a. Test-first assures loose coupling between test and production code. That in turn ensures that there is loose coupling between your production code, and other consumers of your production code. It affects the design.
b. Refactoring ensures that the code is cleaned up continuously. In other words, it helps improve your code quality. If you have unit tests, even if you wrote them first, and you still have code smells, it means you didn’t do refactoring (clean-up) therefore you didn’t do TDD. You just wrote production code and unit test code.
c. The ratio between test methods and production methods is also influenced by the method-scope argument that you made. There are two schools of thought on this matter. Some people say testability trumps scope modifiers, so they will make methods public in order to test them. Others argue that a private method can only be executed from some method with greater-than-private scope, so if those methods are fully tested, the implementation detail of calling a private method is not visible to the test. I would go as far as saying that if your private method is complex enough to warrant a test, perhaps it should have been a class with its own set of tests. And if that private method has huge cyclomatic complexity, I wonder whether it was ever refactored. I think not.
I think the main two points that you proved with this comment are:
1) that you are tactless and have no intention to be tactful
2) that your purpose with this comment was not at all to help improve this interesting research idea, but only to say, “Hey! What are you doing, dude? You have no idea what you are talking about! Shut up! Listen to me and maybe you will learn something. I am really good.”
You are correct on point 1. I see no point in dancing around the facts. Articles like this, which incorrectly use TDD and unit testing interchangeably, do damage to the efforts of the people who are working hard to bring a level of professionalism to our industry. It damages the industry.
On point 2, you are mistaken. If I wished to draw attention to myself, I would have used my full name with links to my twitter handle and blog etc. I don’t wish to draw attention to me, but only to the flaws in this article. I wish to create a voice of reason. When a developer is trying to sell TDD to his management, I don’t want this flawed article to be the thing that convinces his managers to say no to TDD.
This article didn’t look at whether the tests, or rather the process of TDD, were given the opportunity to improve the code quality. The tests could have been written well after the production code and only moments before the “research,” in which case the presence of tests has had no opportunity to influence the code quality. Tests in themselves can’t influence code quality; the practice of TDD can, and will. With such low test coverage, there is no chance that the code was TDD’d.
Any chance you might be including analysis of the test methods themselves in your complexity and LOC-per-method stats?
Fabulous! We need to do more of this type of empirical analysis as the field of software matures. A few lines of thought:
1) Since these are all Git repos, it would be really cool to look at the time axis, i.e., how does a codebase with tests mature when you look at it at various points in time? (I’m not sure of the best time slice to choose here; maybe look at the commit patch size and inspect after a cumulative 1 kLOC has changed.) I’d expect this to tease apart early tests (an indicator of TDD or some early test discipline) vs. late tests.
I have a hypothesis about the increased method LOC and method CC (at least as I’ve anecdotally observed in Java codebases over the years). The safety net of tests means that methods accrete complexity over time. Find a bug or want to add a small feature flag? Just add a conditional and a single test method. I.e., the security of tests doesn’t always translate into the behaviour of refactoring (just that you can add to a codebase with a guarantee that you haven’t regressed existing code paths).
It would also be really cool to just compare test early vs tests bolted on later. Do the coding practices in code bases that have tests added to them later start to resemble those with early tests?
2) Having tests (and all of this “good discipline” stuff) is really about allowing changes to be made (I’m mostly taking this idea from Kent Beck). So the idea of looking at code the way you’ve done, but at the patch/commit level, would be a really cool tool. Insight into questions like: does having tests encourage more/better testing? How often do existing tests get refactoring love, or does the suite tend to just grow? Etc.
3) I know this is a .NET tool, but it would be cool to compare results against a set of Java codebases. Does language/framework matter to any of these effects?
Thank you for taking the time to do this analysis. I have one serious critique of your methodology: the use of averages for metrics for which the Central Limit Theorem doesn’t hold. All else being equal, the average cyclomatic complexity of a 10,000 line program will be smaller than the average cyclomatic complexity of a 100,000 line program because cyclomatic complexity is power law distributed. Analyzing the slope of the distribution is meaningful, for example http://s3-eu-west-1.amazonaws.com/presentations2012/5_presentation.pdf.
Thanks — that’s a great point.
I’m making notes of the feedback here in general for what to pursue next, but one of the first things I’m going to do is examine a plot of the relationship between codebase size and cyclomatic complexity. I’d also be interested to explore that same relationship, but also factoring in module level cohesion and coupling (meaning, if you had a 1 MLOC codebase consisting of dozens of modules with no interdependence, I would expect the non-linear growth of complexity with size to be muted).
Thanks a lot for this study, it’s very interesting. I believe we should do a lot more of these, especially given all the data we generate to get out of the cult and into science …
Anyway, I wanted to say that I’m not so surprised that the number of interfaces is lower in tested codebases. I think I use fewer interfaces since I started using TDD. Before, I used to add “speculative” interfaces in order to add extra flexibility to my design, “just in case I need it down the road.” Now that I’ve been sticking with TDD for quite some years, I lean toward more Spartan programming. If there is a single implementation of something, I won’t use an interface; I know I can safely extract an interface down the road if needed.
Hello, everybody. Thanks for the comments and feedback! I don’t have the time to reply to everyone directly, but please know that I’m reading through, making note of suggestions and recording/prioritizing them to guide subsequent experimentation here. All of your input is extremely helpful.
I would agree with others about the direction of causality between tests and complexity.
When I get given a legacy spaghetti method I will try to write a lot of unit tests for it.
I would think that the way to check this is to examine the timeline. If the tests were created long after the code, that would be a clear indicator. A confounding factor would be if a legacy codebase was transferred into Git, in which case the tests might appear to be about the same age. Another indicator of causality would be the ratio of changes to the code versus changes to the unit tests. This all assumes that you can relate tests to the code they test, which is generally hard but might be doable by name matching in some codebases.
Just a small but very decisive remark: I am not convinced that the existence of unit tests in general affects any codebase. I am convinced that test-driven development (the process of development) is what makes the difference.
I am glad someone has at least started this conversation.