Static analysis, as a concept, seems to earn itself a certain reputation. The general population may regard programming as a technocratic, geeky pursuit. But inside the world of programmers, static analysis has an equivalent rap. It’s a geeky subject even among geeks.
I suspect this arises from the academic flavor of static analysis. You hear terms like “halting problem,” “satisfiability,” and “correctness proofs,” and you find yourself transported back to some 400-level discrete math course from your undergrad. And that’s assuming you did a CS undergrad. If not, your eyes might glaze over. Oh, and googling “static analysis” only to see things like this probably doesn’t help:
I have two CS degrees, concentrated heavily on the math side of things, and I specialize in static analysis. And that featured image makes my eyes glaze over. So let’s hit the reset button here. Let’s make the subject at least approachable and maybe, just maybe, even interesting.
Defining Static Analysis Dead Simply
Whether you’re a grizzled programming veteran, fresh out of a bootcamp, or can’t program a lick, you can understand the concept. I’ll use an analogy first, to ease into things.
When you write software, you write instructions in a format that you and other programmers understand. A program called the compiler (in many cases) then translates these into terms that computers understand, and the computer eventually turns them into automated output. So think of programming as writing a grocery list for a personal shopper. You write down what you want, in easily understood terms. The personal shopper then maps this list to his knowledge of the grocery store’s layout and eventually produces output in the form of food that he brings you.
What, then, is static analysis in this world? Well, it’s analyzing the written grocery list itself and using it to speak to what the grocery shopping and groceries will be like. For instance, you might say, “Wow, 140 watermelons, huh? We’re going to need to rent a truck, so that’s going to cost you extra.”
When it comes to writing code, people usually reason about it by running it and seeing what happens. In our world, that means the shopper simply takes the list, goes on the shopping trip, and sees how things go. “Wow, this is a lot of watermelon,” he says as he fills the 15th cart full of the things. Only then does he start to understand the ramifications of this.
Static analysis capitalizes on the fact that you can understand things about the upcoming grocery run without actually executing it.
Static Analysis Back in the World of Programming
With programming, you think of two fairly common activities: testing and debugging. With the former, you write some code and then somehow verify that it does what you expect. This may happen via automated unit test, running the application and observing it, or both. With the latter, debugging, programmers run their code, looking to reproduce and then fix some reported defect.
Both of these activities are forms of executing the grocery shopping to see what happens. To get the equivalent of reasoning about the list, you need to consider a different activity. I’m talking about the code review.
Yes, code reviews are a form of static analysis. During code reviews (whether you’re looking at your own code or someone else’s), you look at the code itself, reasoning about its form and predicting its runtime behavior. You’re asking the analog of “are we trying to load too many watermelons into our car?”
Of course, when you think of static analysis, you generally think of automated tools. And you should definitely incorporate those into your approach. But anyone looking at code and reasoning about it, be they human or be they automated tool, is engaging in static analysis.
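To make the automated side of that concrete, here is a minimal sketch in Python, using the standard library's `ast` module. The `SOURCE` snippet and the `summarize_functions` helper are made up for illustration. The point is only that the tool reads the code and reports on it without ever running it, so the deliberate crash inside `load_car` never fires.

```python
import ast

# Hypothetical source under review. It is only ever parsed, never executed,
# so the deliberate crash inside load_car() cannot fire.
SOURCE = '''
def load_car(watermelons, capacity):
    raise RuntimeError("the car is full")

def shop(grocery_list):
    return len(grocery_list)
'''


def summarize_functions(source):
    """Map each function name to its number of positional parameters."""
    tree = ast.parse(source)  # turns the text into a syntax tree
    return {
        node.name: len(node.args.args)
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }


summary = summarize_functions(SOURCE)
print(summary)  # {'load_car': 2, 'shop': 1}
```

A human reviewer would reach the same conclusion by reading the two functions. The tool just does it faster and never gets tired.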
Kinds of Static Analysis
With that in mind, let’s look at some of the different flavors of this analysis. I’m dividing them up this way in order to make the universe of static analyzers a little less formidable. So think of these less as perfectly mutually exclusive, official categories and more as a loose taxonomy.
For each one, I’ll also draw back on our grocery shopping analogy to help clarify.
- Cosmetic. Does the code conform to team/language style standards? (Is the handwriting on the list legible, with only one grocery per line?)
- Design properties of code. Is the code minimally complex and maintainable? (Does the list have insane things on it, like 140 watermelons? Or an elephant?)
- Heuristic/error checking. Does the code have these “violations” that we look for? (Does the grocery list contain items that people commonly order, mistaking them for something else?)
- Predictive. How will this code behave at runtime? (Will this list result in the shopper walking the store efficiently, or will it take hours?)
- Formal proof. Is this code correct? (Can we guarantee that the shopper will complete the mission on schedule, no matter what?)
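The first two flavors in the list above are the easiest to sketch. What follows is a rough illustration, not how any particular tool works: the thresholds `MAX_LINE_LENGTH` and `MAX_BRANCHES` are assumptions I picked for the example, and counting `if`/`for`/`while` nodes is only a crude stand-in for a real complexity metric.

```python
import ast

MAX_LINE_LENGTH = 79  # assumed style limit, chosen for illustration
MAX_BRANCHES = 2      # assumed, arbitrary complexity threshold

# Hypothetical code under review.
SOURCE = '''
def pick(item, cart):
    if item == "watermelon":
        for _ in range(140):
            if len(cart) < 10:
                cart.append(item)
    return cart
'''


def long_lines(source, limit=MAX_LINE_LENGTH):
    """Cosmetic check: 1-based numbers of lines longer than the limit."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]


def branch_counts(source):
    """Design check: count if/for/while nodes inside each function."""
    counts = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            counts[node.name] = sum(
                isinstance(child, (ast.If, ast.For, ast.While))
                for child in ast.walk(node)
            )
    return counts


too_long = long_lines(SOURCE)
too_complex = [
    name for name, count in branch_counts(SOURCE).items() if count > MAX_BRANCHES
]
print(too_long)     # []
print(too_complex)  # ['pick']
```

Neither check says anything about what the code does at runtime. They only look at its shape, which is exactly why they are cheap and reliable.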
Why Is Static Analysis Hard?
Perhaps the last bullet in the list gives you an inkling as to the difficulty presented to people trading in static analysis. That last one dives into the realm of determining program correctness, which puts us squarely in the realm of academia. Proving programs correct is the holy grail of the static analysis and programming world. It also turns out to be insanely difficult for anything beyond the most trivial of programs.
After all, programs run in an uncertain world, depending on erratic users and unpredictable inputs. Will your program run correctly? What if the user types his name into the date field? Or what if she tries to paste a JPG image in there? What if he unplugs the computer? To understand the real difficulty of this prospect, think of taking a grocery list from someone and guaranteeing that you could produce every item on it within an hour. Maybe you can. But maybe you get into a fender bender on the way over. Maybe a meteor hits the grocery store.
Reasoning about runtime behavior is really, really hard. For this reason, most static analyzers avoid the idea of formal proof and instead make heuristic-based predictions. You can’t definitively say what will happen at runtime, but you can look at a possible null dereference or a division by an integer that might be zero and say, “Hey, with the wrong kind of inputs, this code could give you trouble.” And you can definitely say things about the structure of the code or the cosmetic style.
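Here is a minimal sketch of that kind of heuristic, again in Python with the standard `ast` module. The `risky_divisions` helper and the sample `SOURCE` are hypothetical. The rule is deliberately simple-minded: flag any `/` or `//` whose divisor is not a nonzero numeric literal. It proves nothing about what will happen at runtime. It only points at code that could give you trouble with the wrong inputs.

```python
import ast

# Hypothetical code under review.
SOURCE = '''
def average(total, count):
    return total / count

def half(total):
    return total / 2
'''


def risky_divisions(source):
    """Return line numbers of divisions whose divisor might be zero."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Div, ast.FloorDiv)):
            divisor = node.right
            # Only a nonzero numeric literal is treated as obviously safe.
            is_safe_literal = (
                isinstance(divisor, ast.Constant)
                and isinstance(divisor.value, (int, float))
                and divisor.value != 0
            )
            if not is_safe_literal:
                findings.append(node.lineno)
    return sorted(findings)


print(risky_divisions(SOURCE))  # [3]
```

Line 3, the division by `count`, gets flagged, while the division by the literal `2` does not. The check will also flag plenty of divisions that are perfectly safe in practice, and that trade-off between false alarms and missed problems comes with any heuristic approach.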
Static analysis is hard in direct proportion to how much it tries to predict about runtime behavior.
How to Leverage Static Analysis
So that brings up the question for any pragmatic soul, programmer or not: how should you use this? Once again, we can look to the analogy for some understanding. If you can catch an ill-fated grocery run before it ever gets started, then great. You don’t need to go anywhere to figure out that you won’t fit 140 watermelons into your Ford Fusion. But you can also fall prey to analysis paralysis. It helps nobody to spend four hours analyzing your grocery list when the trip would take an hour in the worst case.
The answer, then, lies in automation. Having humans review code and reason about its runtime behavior is expensive. It’s worthwhile, but if you can find a tool that reduces both the imperfection and the time cost of human review, then go for it. Get these tools, hand them every check you can automate, and operationalize them into your process. That way, you get the benefit of their insight for very little time cost.
Static analysis may seem uber technocratic, academic, and inscrutable. And it can be, particularly when you tilt at the windmill of trying to prove your code correct, mathematically. But at its core, static analysis really just turns your code into data and analyzes that data. In this day and age, data analysis isn’t some geeky, abstract concept. It’s the backbone of making you and your business competitive.