InfoQ Homepage Presentations Coccinelle: 10 Years of Automated Evolution in the Linux Kernel

Coccinelle: 10 Years of Automated Evolution in the Linux Kernel

Bookmarks

View Presentation

Speed:

Download

49:22

Summary

Julia Lawall gives an introduction to the use of Coccinelle and gives an overview of its impact on the Linux kernel. Over the years, Coccinelle has been extensively used in Linux kernel development, resulting in over 7000 commits to the Linux kernel, and has found its place as part of the Linux kernel development process.

Bio

Julia Lawall is a Senior Research Scientist at Inria. Her research is at the intersection of programming languages and operating systems. She develops the tool Coccinelle and has over 2000 patches in the Linux kernel based on this work.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Lawall: I'm going to present a tool named Coccinelle that we've been working on, actually at this point, maybe for the last 15 years. The focus of our work is the Linux kernel. If you don't work on the Linux kernel, it's quite all right. I think there will still be interesting things that you can take away. I'm going to use the Linux kernel as an example because that's what I'm most familiar with. The Linux kernel is an open source operating system. It was first released in 1991. It's been going strong since then. Today, it's used in all of the top 500 supercomputers. If you're using Android, it's used on the little square box in your pocket. It's used on battleships, stock exchange, in the cloud, everywhere.

An interesting aspect of the Linux kernel is that it has been growing continually over time. In the very beginning it was 1,000 lines of code, very small. If you look here, 2006, so 14 years ago, it was 5 million lines of code. Today, it's 18.5 million lines of code. It's basically just a straight line. It keeps getting bigger all the time. Fortunately, this increase in the size of the code has corresponded to an increase in the number of people contributing to each release. In 2006, there were about 500 people contributing to each release, now there's around 2000 people contributing to each release. There's a developer base that's growing, but not all of these developers contribute to the same degree. Here we have just how many contributions each person has made over the entire history of the Linux kernel. You can notice that this is a log scale here. Basically, almost everyone has contributed only between 1 in 10 patches. These are not really experts in the code base. We have a few experts over here who've contributed hundreds or thousands of patches. There's a big disparity in the knowledge of the set of contributors.

Key Challenge

We have this key challenge, our code is continually growing. We would like to continue to make changes in it. There are all sorts of reasons for making changes, new features, security risks, performance issues, and so on. How can we keep the development process from falling under its own weight? The software gets bigger. It's easy to make some localized changes, maybe to change the interface of some function, the organization of some data structure, but then you're going to think all of the users of that function are going to have to be updated as well. That's the problem that we get when the software is only a few hundred thousand lines of code, maybe that's possible. When it's supposed to be 18 million lines of code, there are some changes that you think, it would really be useful to do this thing. Actually, think about all of the work that I'm going to have to do. It's not just all the work that I'm going to have to do on the code as it exists, we have all these people that are contributing new code all the time. Those people are not going to necessarily be up to date with the changes that we've just made. Whenever you make these changes, it's in danger of regression as people introduce more new code, which is in the old way. I think this is a problem that affects any large software project.

Example Change: init_timer → setup_timer

We're going to look at some of these issues, a couple examples of these issues in the context of the Linux kernel. These are large scale changes that have actually been performed. The person decided it was worth making the change even despite all of these potential difficulties. The first one is related to timers. A timer is a pervasive concept in an operating system. Basically, you want to do something at some point of time in the future. You set up a timer that says after 10 seconds, or something like that, I want to do this certain thing. Initializing the timer requires describing what thing we want to do, and what data that thing we're going to do is going to operate on. You might think initializing the timer requires knowing the time at which all of that should happen. That's not part of the basic timer initialization process. In general, you have a timer, it's going to do things in the future. Then you're going to want to set it for different times over the lifetime of the kernel.

The initialization strategy, this is quite ancient code, I would say 1.2, 1990s or somewhere. We have a data structure representing the timer here. It's called ns_timer. We parse that to a function called init_timer. Then we have two pieces of information we need to provide the data that's supposed to be used when the timer expires, and the function that should be executed when the timer expires. This seems like pretty simple code. It doesn't look particularly offensive. Still, we have three statements that are going to be repeated all across our code base to do just one thing. It's not really ideal in some sense. In 2006, a new approach to setting up timers was introduced, which was called setup_timer. Setup_timer is basically doing the same thing. It's just one function call instead of three operations. It takes those three arguments, the timer, the function that we're interested, and the data structure, and manages the issue all by itself. This is not a great revolution, but it has some advantages. It's more concise, we just have one line of code, instead of three lines of code. It's more uniform. An interesting thing about those three lines of code is there's no particular order in which they actually have to appear. Some people might initialize the fields first. Some people might do some other things, initialize the fields much later. It could actually be hard to find the code that was actually responsible for the timer initialization process. This new function has some nicer security properties, because, in general, function pointers are a source of security issues. This makes it more apparent exactly which function pointers are associated with this timer. It allows you to better control what's going on in the software.

Here we can see how these different functions have been used. The red line is the init_timer, the old way of doing things. The green line is setup_timer, which is the new way of doing things. From this graph, you can infer a story. Here we have the place where setup_timer is introduced. We see some growth here, and we see some declining here. It seems at least one person was motivated to go round and clean up at least part of the Linux kernel to use the new function and to get rid of the old uses. Then maybe that person ran out of steam. The code base was already quite big at that point. We were at 5 million lines of code or so. At some point, this declining drops off. Now we have an interesting phenomenon, which is for a certain amount of time, new timers are being introduced. Some of them are being introduced in the old way, and some of them are being introduced in the new way. Somehow, there's a gap in communication, people are not really aware of what they're supposed to be doing. After a while, the old way levels off. Maybe there started to be more awareness that we should be using this one which is still growing. We basically have these two different ways of doing the same thing in our code base, which is not really ideal for people who want to understand what they should do and how they should evolve the code in the future.

Then eventually, somebody was more motivated here. We have this inverse graph again, where a bunch of cases got fixed and a bunch of the old cases got removed. Finally, we come to this point. At this point in 2017, somebody focused in more on the security issue and decided that this init_timer thing was just not going to work, and made a concerted effort to go across the entire Linux kernel, and convert everything to setup_timer. Now we have just one way of setting up timers. Then he changed all of those one ways of setting up init_timers into something that was simpler and more secure. In the end, after 11 years, the whole amount of work got carried out. It was an extremely slow process. It took a long time for people to be aware of what they were actually supposed to be doing. The first example, the code, it doesn't change the functionality. We had a way of doing the thing and we found a nicer way to do the thing. The old way continues to work. The old way was actually functionally correct.

Example Bug: Missing of_node_puts

The second example I'm going to look at is actually a bug. This is a bug that you can become aware of at one place in your code. It's a common functionality, it's a common mistake that people can make. Actually, the bug occurs at many places in the Linux kernel. The bug is related to reference counts. This is where we can manage memory in C, which doesn't have any garbage collection. Basically, when you want to access the structure, you should call a special function to say I'm accessing the structure so we can increase the reference count. When you're not using the structure anymore, you can call another special function. The gap is, though, I want to use the structure of put, is I don't want to use it anymore. If everybody has done a put, then we can perhaps free this structure when no one else is using it.

These are quite simple functions, but they also make the code cluttered and tedious, and so on. There's been made some effort to hide the uses of these functions in other things. In particular, there was a nice effort to create a bunch of macros which form iterators. Then these iterators allow you to iterate over these structures without ever seeing the reference counts at all. You get access to an element, and then you drop the reference count on the previous one, and you update the reference count on the current one, so that you can go over the structure like this. Each time you visit a new element, you reduce the reference count of the element you were visiting previously. That's very nice, because you can make an entire loop which never shows any uses of these reference count operations. On the other hand, the thing that you don't see is the thing that you easily forget. Then there's a problem because if we jump out of loop in some way with a break, or a return, or a goto, then we're not going to benefit from this next iteration that will drop the reference count on the previous value. Then we will end up with a memory leak. Actually, in these loops, we're often searching for things and so these jumps are actually quite common. That's a problem.

Here, we can see how often that problem has occurred over time and how often people have actually fixed the problem. Basically, we have different zones that have different kinds of development properties. In the beginning, actually, there were very few of these iterators, it was quite an uncommon thing. The few people who were using them understood how to use them properly. Actually, we had more good code than we had bad code. Somehow in here, in 2011, the use of these data structures started to really take off. People found these iterators. They were useful. They need to iterate over things, but they didn't really understand what they were supposed to be doing. We get lots of new uses. There's a huge increase here. There's basically a flat line over here. Very few people were actually using this new construct properly. Then here, there's some effort to clean things up. In the previous case, people make an effort to clean things up, but it doesn't go as far as it needs to somehow, and then more incorrect uses start to get introduced. Then in the last few months, there was another big effort to clean things up. This actually divides the number of these incorrect uses by about two. You see this little extra thing at the end, people are introducing more incorrect situations. This seems to be a challenge in communicating how to do things to different developers.

Assessment

These examples illustrate a number of different issues. One of them is that these kinds of changes are actually a bit difficult to make. We're not looking for one isolated piece of code. It's the interaction between different kinds of code that make up the problem. In particular, in the reference count case, we have the iterator that might be on one line. Then we have a breaker continue buried somewhere in the body of that iteration loop. It might be 10 lines away. It might be 30 lines away. It really depends on the structure of the code. It's very hard to connect the two together. If you were going to search for all occurrences of the loop, you would be finding lots of things that don't have the problem. Certainly, if you were searching for all jumps, then you would also find things that are completely irrelevant. It's hard to use basic tools like grep to actually find the cases where this problem occurs.

Afterwards, these changes occur all across the Linux kernel. Both of these concepts are things that actually affect the entire code base. There may be hundreds, and we saw how many different occurrences there were. There may be hundreds of different files that we have to go and visit. Each file, we have to find the issue and then we have to update the issue. Every time the human is updating something, there's a chance that the human will do something incorrect. I've seen cases where somebody was supposed to remove one line and they just accidentally remove two lines. Something can always go wrong, even for simple changes. There are also several variants. In the case of the timer, those instructions can appear in different orders. In the case of the reference count situation, there are actually about eight of those different iterators. Then we have to take into account break, and goto, and return. There's a bunch of different things that one has to think about. It's easy to overlook some of them, or not be aware of them. Then, the thing I try to emphasize is in an open source setting, it's hard to propagate the information to all of the developers about what should be done. Some developers may have out of date information or they may be copying from out of date code. Then even if you try to fix a problem, people will keep reintroducing it into the code base.

What Is Coccinelle?

We propose our tool Coccinelle to try to address these issues. What is Coccinelle? Coccinelle is a pattern based program matching transformation tool for C code. We have also worked a bit on C++ and on Java. It's been developed since 2005. It's been open source since 2008. It's very widely used in Linux kernel community. You can download it. It's got some moderate amount of documentation. It works pretty well. The basic idea is to allow code changes to be expressed using patterns which are like patches. Patches are basically the thing you're familiar with, with diff. It looks like the code. It expresses what change should happen. The goal of our work is to allow developers to automate large scale code changes in their software, but in a way that fits with the existing habits of the Linux kernel developer, or of maybe software developers more generally.

Starting Point: A Patch

The starting point of our approach is the notion of a patch. If you use tools like Git, you're familiar with this notation. Basically, we have a description of a change. We have some information about where that change should occur. We have the name of a file, the lines on which the change should happen. Then we have the exact description of the change. The description of this change is just an extract of code. Some of the lines are going to be annotated with minus indicating those lines should be removed, and some other lines are going to be annotated with plus, indicating that those lines should be added. This is a really nice notation, a really powerful notation because it expresses exactly what change is supposed to occur. You can take these five lines here, and you can give them to someone else who has a copy of the same code base. They can apply these five lines automatically to their code. Then they will have exactly the new functionality that you've proposed. It's also extremely readable. In the Linux kernel, people are always sending around patches, asking for comments. Maintainers are looking at the patches and studying them, and seeing if everything looks appropriate before they apply them to their code. It's a notation that developers are familiar with looking at. It is really tightly connected to the code base. On the other hand, it's too tightly connected. It's connected to one specific place in one specific file, or one specific set of lines. It's not in itself going to be a good solution if we want some way to automate these kinds of changes across the entire code base.

Semantic Patches

We started with the notion of a patch and we try to generalize it a bit to make what we call a semantic patch. Basically, a semantic patch is a patch but where you can abstract over things in a somewhat controlled way, so that the original patch can apply more generally to the code base. Here's an example. We'll start again with a patch that I showed previously. This patch has basically the idea of the thing that we want to do. We want to change the init_timer function into the set_timer function. Then there are some other lines that we want to remove because they become redundant by using setup_timer. Then there's some other information we don't care about. This ns_timer is the timer which is going to be specific to this particular driver. Other drivers are going to have other timers. We don't want to always require ns_timer to be here, or ns_poll, or this value, and so on. Also, this initialization of the, expires field, is just noise with respect to the thing that we're doing. We want to keep the things that we're interested in and get rid of the other stuff. The first thing we can do is get rid of the, expires initialization, because that's completely irrelevant to what we're interested in. We just take that line that we don't want and we replace it by ... The idea is if you were in a restaurant and you wanted to explain to your colleague, we have to make this new transformation, this is how it works. This might be how you too might describe it to someone else. First, we have init_timer, and then we turn it into a setup_timer. Then there's some other stuff so we just put ... and then we're going to remove some other things afterwards. It's supposed to be natural to how people might think about doing the change in a general way.

We still have some details here, ns_timer, ns_poll, and so on. We need to abstract over those as well. This is information, we can't just throw it away, because we're going to need to remember that the argument init_timer is going to become the first argument to setup_timer. What we're going to do instead in these cases is introduce what we call metavariables. Metavariables, they're just going to be names that are going to get bound to various expressions in various places. Here we have an init_timer call. It has some argument, and then whatever its argument is, timer. That's going to become the first argument of the setup_timer call that we're trying to construct. Whatever is the value which is stored in the data field, that's going to become the third argument to setup_timer. Whatever is the value stored in the function field is going to become the argument to setup_timer. This describes how you should transform the code. That can be any code no matter where it occurs in the Linux kernel.

This is a complete semantic patch. Now you can take this specification and you can run it over your code base, and it will update all of the init_timer, or hopefully, we'll see. It will update the init_timers to setup_timers. In thinking more about this, though, I want to make one small generalization. Here we say init_timer and the data initialization don't actually have to be right next to each other. There can be arbitrary stuff between them. There's no real reason for data and function to be right next to each other. Either we can allow arbitrary stuff between them as well, so I just threw in another ... You can work from your example and then use some issue intuition about how the other places where this problem occurs in your code base, and come up with a specification that expresses what you want to set up some time.

Results

To test this, I made an artificial dataset. I actually took all of the occurrences of this change that had ever happened in the Linux kernel. I collected around 600 files that at some point had an init_timer call in them. This amounted to 828 calls. Our semantic patch updated around 300 of them. On the one hand, 300 is really great. There's a lot of work that we've saved. These 300 instances were probably in around 300 files. Visiting all 300 files to fix all these things would be very time consuming and boring. This is a really huge change, so there's still 500 of them left. It would be nice to see what went wrong. Why didn't our rule actually update everything? After we've run our rule, after we've changed our code, we can just use grep, find are there some other init_timers, and what do the contexts in which they're used actually look like. Here's one of them. It's quite easy to find. We have an init_timer. This one initializes the function field before the data field. In the previous case, we initialized the data field before the function field. On the one hand, the strong point of our approach is we're very close to the code. On the other hand, sometimes being close to the code is a liability. We have these small variations that need to be addressed as well. Fortunately, since we are close to the code, it's not very hard to figure out how to change the specification and make it more general so that it can do this case as well.

Basically, we had this rule before. We had init_timer first, and then the data field and then the function field. Now we've found a need for treating the function field before the data field. We can just make another rule, which has these operations and the other order. You can see that just by copy pasting this, and moving some lines around, we get a huge benefit, because now we're up to 656 out of the 828 calls. Then you can iterate this process a bit more. The idea of the tool, it's fairly easy to run and the specifications are quite readable. You can iterate it over again until you get the thing that does exactly what you want it to do.

There's some other issues about code ordering that are in the remaining cases. If we end up with a semantic patch that has 6 rules, so previously, we saw 2 rules. If you have 4 more rules and 68 lines of code, then you can do 99% of the init_timer and setup_timer calls. The ones that don't work are the ones that were the motivation for the whole security issue, for example, in the beginning. They're quite strange because they init the timer in one function, and maybe they init the function field in another function, they init the data field in another function. Those things that are probably better just for the human to try to understand the problem and update the situation by yourself. That's the approach that we advocate. We want to help you do automatically the things that are easily automatable and then leave your time for doing by hand the things that are actually harder to express.

How to Know When the Calls Are Done

Participant: When you say it covers 808 of the 828 calls, in reality is there not a scenario where you actually don't know the total number of calls to that function because that's exactly what you're trying to cover? You're trying to get to every single one. How do you know when you're done?

Lawall: I agree with you. In general, you don't know when you're done. In this particular case, you do know when you're done. Our goal is to get rid of all the calls to init_timer, regardless of in what context they're being used. In this case, you just grep for init_timer, and then you're done. We can look at the next example, which is the reference counting. In this case, it'd be much harder to know when you're done. Because the lines of code can be widely separated. There's nothing obvious that we can grep for. When we search for these issues, we might want to look for these different iterators but not all of the iterators have the problem. Unless you actually want to spend all the time to manually audit all of the remaining iterators, it'd be hard to know when you're done. It's a trade-off between how much do you trust the tool and how much do you trust your patience, something like that?

Optional Semantics

Participant: [inaudible 00:26:08] the ones that are like defaults in your function, like .aw.b defined, in some cases. Why not support optional semantics, like some optionality. It could be there or not be there because reasonable defaults are covered by the function you implemented.

Lawall: This is the reference count issue. This is one of those possible iterators. There's actually a whole bunch of other of them. This is one possible way that we can jump out of the function. Here, basically, we have an iterator and then we go down through the body of the iteration. That's what the ... is expressing. If along that iteration, if we don't find any occurrence of this of_node_put, and if we do reach a break, then we're going to need to put an of_node_put in front of the break. If you think about the body of the function along some execution paths, you may have a break, and along some execution paths you may not have a break. That's why we have this question mark, which is the optionality. You might not find a break everywhere. It's like when you do find a break, then you need to make the change. The only way that you could actually be 100% sure that you have done everything is look at all the occurrences of all the iterators and see what's going on.

False Positives

Participant: Did you have false positives?

Lawall: Yes. This is a programming language, it does whatever you tell it to do, and you can tell it to do incorrect things and it's not our problem.

Participant: I was thinking about where you have the init call and then you have an arbitrary line of code, and you separate it in front of the field [inaudible 00:28:10].

Lawall: I didn't really write this rule in a very defensive way. One can do what is shown here with this when stuff. This allows you to control the things that can happen along the way. For example, the initializer is the thing that you're searching for to something else along the way, if that's a concern, then you can put a when, and you can say, we shouldn't have that along the way. One could ask, why didn't we make the tool do that automatically? Why doesn't the tool do some clever program analysis so that when we have ... we're sure that things are not redefined, or whatever. Our point of view was that, in general, program analysis have some limitations. If we had made it do some program analysis, then we would have to explain to you what limitations it has. Since we say it does no program analysis, everybody can understand exactly what it means, it doesn't do anything. If you wanted to do something, then you have to express yourself, what should be checked for. I think it's more clear.

Participant: Also, I believe in ordering because for example you have the node_put (Child) here also. That seems every time it's like n number of lines, you've had m squared variations for the ordering, let's say, for the function and data.

Lawall: For function and data you have to take into account the different ordering. When we initially designed the system, we have ... here, and so we thought maybe we'll have circle, circle, circle, and circle will do things in all permutations. That was a feature that was never implemented, and we never felt any compelling desire to have it. I think it's much easier to just make the 2 instances and the 2 variations, we've never seen 10 variations due to this issue. Here, there's no ordering problem. The break has to appear inside of the body of the, for_each. There's only one thing in there.

Participant: That's where everything [inaudible 00:30:22].

Lawall: Yes.

Participant: Are you sure that there isn't already an of_node_put there, because it seems to be adding an init.

Lawall: Because I said this here.

Participant: That when != is part of the DSL, it's not in the code.

Lawall: When we have this ... this is part of the language. We wanted to stay as close to C as possible. We had to make a few compromises. The ... is one compromise, it's anything from here to here. The when, is another compromise. With when, you can say along these ... certain things shouldn't occur.

Participant: When does the ... hit?

Lawall: When you hit the thing that is after the ... This ... here is between a brace and a break. The brace is unique, because the brace is all associated with matching levels, but it can't contain a break. When you have a...b, that's the closest a to the closest b, there's no a or b in between them. Obviously, the ... will also end if you encounter one of these things that you're not supposed to have along the way, then it's going to give up.

Participant: There are a bunch of rules you can just do ... in?

Lawall: Yes, but I've just said them all. It's ok.

Participant: [inaudible 00:32:02] or goto, or exception, because you have to handle the three.

Lawall: I actually removed it from the talk, but there's a or notation. In general, you can have any code you want, but column zero in a patch is special. We have adding something. We have question mark in column zero. We can also put a bar in column zero that's a or, and it's got parentheses around all of the stuff that's being disrupted.

Participant: [inaudible 00:32:33].

Lawall: The Linux kernel is 18 million lines of code, if we were really parsing all those 18 million lines, and if we were really looking through all of them to find every occurrence of this pattern, it would probably be quite expensive. There's a lot of hacks in advance to pre-filter things. The only way this is going to work is, for example, if the file contains a for_each_child_of_node, and if it contains a break. If it doesn't have those things, then there's no point to parse the file. Maybe there are a few hundred occurrences of this operator, so we're probably only going to look at actually a few hundred files. Then it does the same thing at the function level. It orients itself. The performance, I don't know what it is, but seconds, minutes, certainly not hours.

Participant: This looks like it can be done in parallel easily?

Lawall: Yes. The parallelism works best at the file level. It gives it a good granularity to work on. If you have a big machine and you have thousands of files, it will just do things in the right way.

Participant: Does it look at header file? [inaudible 00:33:57]?

Lawall: There's a big question about header files. In general, we try to avoid header files completely. There's a bunch of different reasons for that. In the context of the Linux kernel, we're not really sure which header files we're supposed to be using. The Linux kernel has a very complex build system. Specific files will have certain minus I directives on them. We don't want to interpret the makefile. Those header files are extremely large, you can take over a 1000-line C file and turn it into a 10,000 line file by expanding all the header files. I think there may be 15,000, 20,000 C files, we don't want to take each of them and multiply its size by 10. It will be a huge performance cost to actually process all of those header files. Another issue is that we can't just expand all the header files, so run GCC minus I, or something like that, to get the expanded files. Because we want to do source to source transformation. We want the code to come out exactly like it looked on the way in. Basically, we have our own parser, and our parser tries to negotiate around the macros and other strange things that it encounters. If you need information in the header file, save some options, and then you can cause it to get type information from the header files, or actually include all of those header files, if that's really what you want to do. That's a huge performance hit if you're recursively including all the header files. For this, there's absolutely no reason. There's nothing in a header file that we need in order to be able to do this. We can get good performance in that way.

Participant: How sensitive is the matching to various different styles [inaudible 00:35:54] or spacing around comments?

Lawall: It completely ignores the spacing, and comments, and all these things. You might have, for example, a comment in here in your code. It really doesn't matter. On the other hand, when it generates code, like here we are generating some code, and Coccinelle by default, it follows the Linux kernel coding style more or less for generating code. For example, if it generates a function call, which is more than 80 characters, it tries to insert some new lines. For indentation, it looks around in the function and sees how the other lines are indented. If you're indenting by two spaces, it will try to indent by two spaces. If you're indenting with tabs, it will indent with tabs, and so on. Here, for example, some people perhaps like to put spaces after their parentheses and before their close parentheses. By default, it's not going to do that. Whatever you write, it's going to do it in a Linux kernel way. There is an option which says a simple spacing that will cause it to follow whatever spacing you put here. It tries to adapt to the coding style. Pretty printing in general is a really hard problem. If you have something different than what's done in the Linux kernel, you might not be completely satisfied in that respect.

Participant: I suppose it generates the patches to be applied in the codes for real people to [inaudible 00:37:27].

Lawall: There's different options, the default option is just going to generate a patch. Then you can look at the patch, you can see if you like the patch. You can attach, it's just a text file. There's no checksums or anything like that in it, so you can remove some parts if you don't like them, or if you feel unsure, or whatever. That's the default option. There is another option that will cause it to just overwrite the source code if that's what you prefer.

Participant: Since there's no code analysis, does that mean if those letters are simple to port it to Java, which makes it a bit complicated in that process?

Lawall: Actually, I didn't do the port to Java, so I can't completely answer the question. Obviously, there's some syntactic differences. The challenging thing about Java is the type system inheritance, and stuff like that. The type system, you have to decide how much you want the tool to know about the type system. Following our point of view that we don't want to include all the header files, we decided that we actually don't want to know anything about the type system. Things have the type that they have. There's one exception, which is if you have a type, like tx equals, and then the expression over here has some other type. If it's able to figure out the other type, it actually remembers both pieces of information. The other thing is catch and throw exceptions. It means with exceptions, we don't really know what the possible control flow is. I'm not sure what the strategy was. The two options seem to be just to think that anything can fail, or to think that nothing can fail. Probably, neither would be satisfactory in all situations. It's a limitation for these languages.

Participant: Apart from the [inaudible 00:39:34] not taking into account the semantics of the language. Do you see any downside in using a tool like Clang, the actual classes of the files [inaudible 00:39:49]?

Lawall: The downside that we see, I would say in some sense, the tools are complementary. With Clang, there's a lot of things you have to learn. You have to understand the abstract syntax tree internal representation that is being used. The benefit, though, is that you can do much more precise analysis. The analysis that exists off-the-shelf, it probably has some analysis, and so on. The benefit of our approach is it's much closer to the code. If you want to write a rule that replaces the function foo by the function, ABC. You can do that in one second. The cost to get into using the tool is much lower. I think these are completely different. We're targeting different situations. I think Clang is a very useful tool for doing what it's doing.

We have the scattered code. We've seen how the ... allows us to separate code fragments and describe the relationships between them. The changes occur widely across the code base. Coccinelle makes that a non-issue. It will do everything for you. Of course, afterwards, you have to go and look at all of the things that it's done. The problem has been shifted a little bit. We haven't actually removed all problems, but we have gotten rid of an initial bottleneck. The changes occur in many variants. We've seen how that can be addressed. We can just make a few different rules, and then we can cover the entire space. We can also use this iterative process where we apply the rule, and see what has been done, and see what has not been done, in order to motivate what should be done in the future. Developers are unaware of changes that affect their code. There's two solutions that we can propose based on our unambiguous notation. One of them is that when people make changes, they can put the semantic patch. It's an easy-to-read artifact that they can put in their commit logs. When developers are trying to understand what they should do, they should look back and see what other people have done and see how the change should be performed. Another approach is we can also take these semantic patches and we can put them in a library of rules that need to be checked. Then a continuous integration system can apply it to the code base. Every commit can be checked that it's using setup_timer, for example, instead of using the old version.

Impact: Patches in the Linux Kernel

I'm going to talk a bit about the impact of Coccinelle on the Linux kernel. This is the different years. This is when we started to have a tool that was actually useful. 2008 is when it became open source, up to today. In the beginning, this dotted line is the Coccinelle developers. We made our tool. We tried to use it on a Linux kernel to see if it was doing what it was supposed to do, and to try to interest other people into using it. Then we have some lines here. There's a red line which are maintainers. Maintainers are people who are responsible for a particular part of a Linux kernel. Then the orange line are just some other Linux developers. Their use varies over time. There's a big jump here for the maintainers. It's nice to see that in the last year, there's been a big jump both for maintainers and other Linux developers. Then the remaining line that's interesting are interns. This shows that people who don't have much experience either with the Linux kernel or even with the C language, can actually use the tool to contribute to the code base.

Impact: Cleanup vs. Bug Fix Changes among Maintainer Patches Using Coccinelle

We're also interested in what do people use the tool for? In the two examples I showed, one of them is a cleanup and one of them is fixing a bug. These are the two options I have here. We looked at only just what the maintainers had done just to have a restricted view so we didn't have to look at too much code. People do cleanups much more with a huge jump in 2015. There's actually a bunch of different tools for finding bugs in a Linux kernel that are used regularly that is part of the continuous integration system. Whereas Coccinelle is unique in allowing you to do these more complicated transformations.

Impact: Maintainer Use Examples

Here are some examples of what people have done, removing an unused function argument. If it's unused, that's perhaps simple. All you have to do is go and find all the uses of the function and remove that argument. Still, it affects 11 different files. It takes some time to go and do all of that work. Similarly, redundant field in a data structure. This was effecting even more files. Here's an example, this actually seems a bit similar to the others, but it's much more complicated. For interrupt handlers, there was an argument that was found to be redundant. The idea was to remove that argument everywhere it occurs. The problem is sometimes that argument was actually being used. Then code had to be changed inside the body of the function in order to compensate for the fact that the information was not being provided. It's a core feature of an operating system. This affected almost 200 files. It's really across the entire kernel, ARM, x86, PowerPC, all different architectures and so on.

Impact: 0-day Reports Mentioning Coccinelle per Year

The final one is this zero day build testing service. This is a service from Intel. Intel subscribes to Git Trees of hundreds of developers. Every time people commit something into the tree, then this testing service will check their commits using a bunch of different tools. There are a bunch of semantic patches, which are actually a part of the Linux kernel distribution. Those are getting run on all of the commits as they're integrated into the kernel. You can see what has been done over different years. Some of these semantic patches are able to actually find and fix bugs. Some of them only just report a message about a bug because the change is actually not very systematic. There's been a lot of impact. Hundreds of bugs that only stay at the level of a developer, they never actually get up to affect any possible users. An interesting one is locks. People often forget to unlock their locks.

Conclusion

Coccinelle brings automatic matching and transformation to a system software developer. The idea is to make something that's easy for developers to use, express in ways they're familiar with working. That will enable the needed evolution so that we can take our huge code base, and we can say, this would be a useful thing to do. Let's just go ahead and do it. Even if it is going to affect hundreds of lines of code, and hundreds of files, still, that shouldn't be a hindrance to keeping the code base moving forward in a good way. We have almost 8,000 commits in the Linux kernel that have been based on Coccinelle. We want to make this evolution as easy as possible, so we're working currently on how we can infer these semantic patch rules from examples so that a human doesn't even have to write the rules. We are also working on support for C++, and we have a prototype implementation for Java.

Since Linux is used everywhere, and since Coccinelle has really been used across the entire Linux kernel. Probably everybody in this room uses Coccinelle modified code in some way. Maybe you're not running Linux on your machine, but maybe you have an Android phone, or you use Google. All these different big companies are using Linux for collecting all their information. Everything is publicly available. We have a webpage with some moderate amount of documentation. We always keep our GitHub version up to date with the latest changes and the Java version is also available on GitHub.

See more presentations with transcripts

Recorded at:

Jun 26, 2020

Julia Lawall

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?