That’s quite a coincidence that a few days after promoting the joy of immutability in the post C#9 records: immutable classes we stumbled on a bug due to a mutable object accessed from several threads.
This is a regression bug: per definition a new bug that broke a system that was working until now.
Let’s have a look at the regression bug and also at the strategies that could have helped preventing it.
Introduction to Mono.Cecil
NDepend relies on the OSS project Mono.Cecil to read compiled IL code within DLLs. Cecil has been created by Jean-Baptiste Evain (Jb). By any mean Cecil is an outstanding OSS project:
- It is improved and maintained relentlessly for more than a decade
- Jb is very careful about performance, code structure, API and tests
- Several dozens of other OSS and commercial projects rely on Cecil see the list here.
- Cecil is not just for IL reading but also does IL writing. Many tools harness this capability to modify assemblies after compilation.
Recently we planned to release the new version 2020.2 of NDepend with the latest Cecil source. We update the Cecil code regularly to keep up with new benefits from Cecil like performance improvements and bug fixes. However our integration tests suite quickly showed that something was wrong:
- The number of lines of code computed was always a bit less than expected. The negative delta was random between 1% and 2% less.
- Some assertions in code were randomly broken both at DEBUG integration-test-time and at smoke-test-time.
As most .NET tools – especially coverage tools – NDepend counts the number of logical lines of code. Simply put, a logical line of code is debugging sequence point:
This code query executed against the baseline analysis result showed that a random small sub-set of methods had their number of line of code set to zero or even Not Available.
Per default NDepend analyzes Visual Studio projects concurrently. An option can force to analyze all projects sequentially on a single thread. With this sequential mode enabled the bug didn’t occur. At this point it was clear that we were facing a concurrency bug, either in our code or in Cecil code.
Bug investigation and fix
Since the bug appeared suddenly after upgrading to latest Cecil our best bet was that the problem came from this upgrade. We tested the last 4 versions released and found out that the version 0.11 was working fine but the version 0.11.1 reproduced the bug. Thus it made sense to review changes between both version: https://github.com/jbevain/cecil/compare/0.11…0.11.1
Since concurrency bugs often come from mutable states accessed simultaneously by several threads we searched for the usage of the keyword static in this diff to find such state. Bingo! We found this field:
Clearly the class PdbFunction is mutable.
It means that we have a single mutable PdbFunction object potentially accessed by several threads. A quick investigation reveals that the field match is only used by FindFunction(). And indeed, declaring match as a variable in the method fixes the problem. Notice that v0.11 actually contained this code!
Oddly enough I remembered we stumbled on the exact same bug in Cecil 10 years ago. This PDB reading part of Cecil is some code imported from Microsoft source to read PDB that became open source at that time. Originally the bug was in this Microsoft source. This bugs came back 10 years later spontaneously in Cecil. Somehow a copy/paste has been made too quickly! At this point I don’t have more details on the whole story.
Efficient strategies to avoid such regression bug
The bug was discovered on our side before reaching production thanks to these good practices:
- Automatic integration testing. Such concurrency bug is hard to catch. The fact that all concerned code is 100% covered by the Cecil test suite wasn’t not enough. This bug introduced with Cecil v0.11.1 is there since the 5th November 2019, almost a year ago! It means that all tools consuming Cecil v0.11.1 (or later) through multi-threaded code potentially have it. Apparently nobody noticed it until now. Maybe some users of tools using Cecil v11.1.1 or later experienced corrupted state or crashes at runtime? Our integration test suite caught it thanks to multi-threaded analysis of VS projects.
- Massive usage of assertions in code: everything that can be asserted must be asserted. Our code uses Debug.Assert(…) but it is just a technical details that we might change in the future. What matters is that the team must be automatically notified when an assertion gets violated at test-time. Most of the developers believe that assertions is for test code only. But this case shows well that assertions in the code itself also bring value.
Another good practice that can help finding such bug before it reaches production is static analysis. Tests and assertions require code execution. On the other hand static analysis considers the code as data. Its job is to spot suspicious patterns on this data. NDepend is a static analyzer and it proposes the rule Avoid static fields with a mutable field type. This is clearly an error-prone pattern, a code-smell. Interestingly enough this rule spots the root cause of this bug because the field match is typed with a mutable class.
One problem with static analyzers is that such tools emit a tsunami of issues on any real-world legacy code base. How to know which ones to look at first? It is recommended to focus on new issues introduced since a baseline. Typically the baseline is the last version released. By deciding to write clean code from now most code-smells introduced recently can be spotted automatically before they provoke bugs that reach production.
Conclusion: The 4 strategies to spot regression bugs early
Spotting and fixing regression bugs is difficult. I wrote this post Toward Bug Free Software: Lines of Defense a while back to explain our strategies to spot new bugs early, before they reach production. This way we can better spend our development resources on new features and users satisfaction.
With no surprise we invested a lot in:
- 1) Integration and unit-testing to automatically exercise as much code as possible (86.5% actually).
- 2) Asserting everything that can be asserted in code. This way both assertions declared in tests and assertions declared in code work together to detect automatically corrupted states and protect our products from regression
- 3) dogfooding NDepend on itself to detect new code-smells early.
These 3 strategies saved us countless times from releasing bugs. Nevertheless rare bugs still reach production because there strategies are not a mathematical proof of the correctness. This is why they must be completed with this pragmatic strategy:
- 4) A way to monitor what happens in production as explained in the referenced post Toward Bug Free Software: Lines of Defense. Production logs don’t lie.
The cover photo is a tortoise golden bug (Charidotella sexpunctata) seen on our TV couch during the April 2020 lockdown. Kids fixed this bug by releasing it in the wild.