Peddle the Pedal to the Metal

Summary

Howard Chu gives tips and techniques for writing highly efficient and scalable software drawn from decades of experience. The guiding principle is a simple one, and can be applied nearly everywhere. The talk is focused on programming in C.

Bio

Howard Chu founded Symas Corp. with 5 other partners and serves as its CTO. His work has spanned a wide range of computing topics, including most of the GNU utilities, networking protocols and tools, and kernel and filesystem drivers, with a focus on maximizing the useful work a system can do. His current focus is database-oriented, covering LDAP, LMDB, and other non-relational database technologies.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Chu: I'll give you an idea of what we're going to cover. First, I'll set some context: what we're trying to do and why, some tools to help you in your quest for performance, my own experience working on OpenLDAP, the problems we ran into or discovered along the way, and the solutions we used to fix them. One of the things you find as you improve a code base is that once you fix one problem, several more become apparent. As you see more problems, you need more tools. And while you're refining and improving, sometimes you'll find that incremental changes aren't good enough to get you where you want to be. Sometimes you need more drastic measures.

Context, Philosophy & Impact

This is experience from working on the OpenLDAP project for almost 20 years. From the very first version of the code to the version we're running today, it has accelerated by over a factor of 100. Now, we're not quite programming right on the metal; I'm not using assembly language anywhere, this is all portable C. This chart shows where search performance started, on the very left with the very first version of OpenLDAP, and on the very right, the current version. As it says up here, we went from 60 milliseconds per operation to 0.6 milliseconds, and 0.6 milliseconds also happens to be the ping time on that network. So we're running at network speed.

It's interesting to look at what famous computer scientists have written about performance and how their perspectives changed over time. Donald Knuth wrote this in 1967, which happens to be the year I was born. The idea then was that you needed to understand something about the machines you're working with. And then a few years later, it's almost like he changed his perspective. He said, "We're focusing too much on optimization. Maybe we should back off of that a little bit." Now, this quote has probably circulated through all of your consciousnesses: optimization is the root of all evil. But that's the misquote that gets handed around a lot of the time. You have to take it in its full context, premature optimization is the root of all evil, and take those numbers with a huge grain of salt. His point there is that 97% of the time you don't need to worry about optimization.

And I think, in the decades from 1974 into the 1990s, everybody took that a little too much to heart because you shouldn't be able to get a 100 to 1 performance boost on a well-written piece of code. You shouldn't be able to do that. But I believe people have shied a little too far away from thinking about optimization. So we need to bring this back into focus and think about what's good, what makes code run efficiently on a machine and be a little more aware.

Of course, the choices you make in writing your code can differ quite a bit depending on whether you're starting a new project from scratch or refactoring existing code. How many of you have been faced with optimizing an existing project? Yes, pretty much everybody. And how many of you have started a brand new project from scratch and thought about optimization in your design process? Very healthy. So, the phrase "premature optimization" is kind of tricky too. There's context to it. Some things are so well-established in the literature already that thinking about them at the beginning of a project still isn't premature. If you're writing a project today, you don't use bubble sort. That's just an easy one. So there are some things you can think about, even at the very beginning of a project, that would not be premature.

Of course, as you go along, if you're doing very well, eventually you hit a limit. You get to diminishing returns, where every decision that gains you something also loses you something else, and the tradeoffs start to add up. But I think if you look across the industry, most code is nowhere near that limit. Most of the code you see could easily slash a huge amount of what it's doing without losing anything. Also, sometimes it's really clear that the solution in front of you is the best it's going to get, and there's no need to do any more work on it.

For example, if you just want to add up the values of all the numbers in an array, the simplest and most straightforward solution is probably the best one. You've got an array of elements; just do a simple for loop through it, and you're done. That's great. You don't need fancy algorithms, you don't need divide and conquer, or anything else like that. In fact, if you try to use a fancy algorithm here, all you're doing is making your code more complicated. The basic number of add operations involved is the same, but you've added a whole bunch of extraneous overhead with your more complex algorithm. So one of the guiding principles should be: simpler is better.
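As a concrete, if trivial, illustration, here's roughly what that looks like in C; the function name and the explicit length parameter are just for the example:

#include <stddef.h>

/* The simple, obvious loop is already the right answer here. */
long sum(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}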

A couple of the speakers mentioned this earlier today: your code has to be correct first, always. When you have correct code, it's fairly easy to make it faster. That's a lot easier than the opposite, where you have fast but broken code and you're trying to fix it to do the right thing. So yes, you always have to get correct code first; optimization is a secondary consideration. But the other thing you really do want to try, as much as you can, is to get everything right the first time around. Because if you look at this and say, "Well, I don't have time to think about this that hard, I can't get it right the first time around," then ask yourself: if you don't have time now, when are you going to have time to come back and fix it? It's not going to happen.

Then a final consideration, which I always inject in here: computers are supposed to be fast. These processors are running at 3 gigahertz or whatever. So even if your code is correct, if it gives you the right answer too late, that code is broken, at least from my perspective.

Profiling Tools

How do you even know that you have a problem? If you've been following the performance track today, the answer is fairly obvious: you've got to profile your code. You have to measure what it's doing to see if there's a problem and where the problems are. Now, there are a bunch of different approaches to profiling, and they each have different strengths and weaknesses. Today we can use Linux perf, which is a pretty wonderful tool. It can tell you all kinds of things very easily and very non-invasively. Its predecessor was OProfile; that's what I was using first. It uses statistical sampling, as was mentioned in the previous talk, Richard's talk. And because it uses very low-overhead sampling, you can run it on a live system with fairly little impact. But one of the things I've noticed is that it can still miss details. The call graphs you get out of it can be missing entire hierarchies of the call tree. So you can't rely on it alone.

Other tools I've also used: one of these is called FunctionCheck, and it uses the GCC compiler's instrument-functions feature. If you compile a project with the flag -finstrument-functions, GCC will emit custom profiling calls at the entry and exit of every function. Then, if you provide your own shared library that fills in the stubs GCC expects, you can do whatever you want in the prolog and epilog of every function in your code. FunctionCheck takes advantage of this and lets you measure the time spent, and as a side benefit, it also tracks memory allocations.
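For illustration, here's a minimal sketch of the hooks GCC expects when you build with -finstrument-functions. This is not FunctionCheck's actual implementation, just the shape of the mechanism it builds on:

#include <stdio.h>

/* Build the code under test with -finstrument-functions, then link in (or
   LD_PRELOAD) a library providing these two hooks. The attribute keeps the
   hooks themselves from being instrumented recursively. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    /* A real profiler records a timestamp here; this stub just logs. */
    fprintf(stderr, "enter %p (called from %p)\n", this_fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    fprintf(stderr, "exit  %p (called from %p)\n", this_fn, call_site);
}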

Now, it's not the easiest thing to use, because you need a specially compiled binary, and anything you link to that hasn't also been compiled with instrumentation isn't going to give you any information. You also get a fairly coarse profile: it only tells you how much time was spent in a function, not where in that function. But the advantage is that the call graphs are always accurate. There are no elements missing from the call tree.

Then the other option is one of the sub-tools of Valgrind, called callgrind. This is really pretty complete. It can give you instruction-level profiling and complete call graphs. The only problem is that it runs your program through an emulated machine, so it's at least 100 times slower than running in real time. How many of you have used Valgrind? If you haven't used it, you should definitely check it out, because it's got a lot of cool features. Again, the only big problem with it is that it runs so much more slowly, so you don't always get an accurate view of the overall allocation of time.

One of the other tricks I've learned: if you're using a sampling profiler like perf, run it on the slowest CPU you can, because it makes all the time samples a lot easier to capture. I used to run it on a Pentium M laptop set down to its battery-save mode, so it would run at 600 megahertz, and I would get the best profiles that way. Perf is so easy to use that it should always be your first go-to tool. The results can be fairly obvious. In the very first version of OpenLDAP, which was inherited from the University of Michigan, we found 95% of our execution time was in the C library. Basically only 5% of the time was spent executing the LDAP protocol. So this code was pretty horrible to begin with. I would say it was representative of code written in the mid-1990s. This was written by people who did not think about memory allocation because they weren't taught to think about it; I think they were taught not to think about it. And they used the standard C string library, because that's what every C programmer does, even though it's really not that great.

Problems and Effective Solutions

First of all, you never actually know how bad things are until you look. So you always have to take at least that first look. Of course, you can miss some details with perf, with a sampler, so it's always a good idea to use more than one tool. How many of you have seen DRY, don't repeat yourself? Normally it's applied to writing source code: you don't want to copy and paste functions, you try to minimize repetition in your source base. I would say: also don't repeat yourself at execution time. You don't want to compute the same information over and over again, throwing it away each time, if you're going to use it a lot.

On the flip side, you don't want to cache information if you don't use it that often, or if it's very easy to retrieve again. So, looking into our problem, with 40% of execution time in the string library, we found that 25% of our time was just in strlen(), counting the length of strings: strings we received over the network, transmitted in a binary protocol that always sends us the length. This was completely unnecessary work, and the fix was already at hand. The code base already used structured strings, a string pointer with an explicit length field. It was already there; why wasn't it being taken advantage of? Well, now it is. After addressing this, strlen() disappeared completely from the runtime profile. So right away, we gained 25% in runtime performance.
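For reference, a counted string in the spirit of OpenLDAP's struct berval looks something like this; the real definition lives in lber.h and uses ber_len_t rather than size_t for the length, and the helper below is just an illustration:

#include <stddef.h>
#include <string.h>

/* A counted string: the length travels with the data, so strlen() is
   never needed. */
struct berval {
    size_t  bv_len;   /* explicit length */
    char   *bv_val;   /* pointer to the bytes */
};

/* With lengths to hand, equality checks can also bail out early. */
int bv_equal(const struct berval *a, const struct berval *b)
{
    return a->bv_len == b->bv_len &&
           memcmp(a->bv_val, b->bv_val, a->bv_len) == 0;
}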

The next thing the code was doing was dissecting these strings that come in, tearing them apart, and then putting them back together again using strcat(), which is what the C string library gives you. How many of you have seen the Shlemiel the painter problem? The idea is, you've got Shlemiel, whose job is to paint the lines down a roadway. Every day he paints some lines, and the following day he gets less done, and every day he gets less and less work done. Someone asks him, "What's going on, Shlemiel? Why are you slowing down so much?" He says, "Well, every day I have to walk further back to the paint bucket." The Shlemiel the painter problem with strcat() is that it always starts scanning at the beginning of the destination string to find its end. So as you keep concatenating to it, the string gets longer and longer, and the concatenation gets slower and slower. strcat() is a horrible function.

We fixed this in OpenLDAP by changing the copy function: instead of returning a pointer to the destination, it returns a pointer to the tail of the copy. Since you have the tail of the copy, you can just keep appending immediately from where you are. Now, this was 2001, and this was a custom function of ours. Today you can find exactly this feature in a modern C library; it's called stpcpy(). But here's something that still isn't in modern C libraries and really needs to be: a string copy function that will never overflow its buffer. It's a lot easier to use than strcpy() or strcpy_s() or whatever, because you don't have to re-compute the remaining space in the buffer every time you use it. All you do is say, here's the end of the buffer; I don't care how much I put into it, I don't have to compute any lengths at all, I just keep stuffing into it until nothing else fits. It also has the same feature of returning a pointer to the end of what it copied. Again, you can concatenate things together very quickly and completely safely.
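Here's a sketch of both ideas: concatenation via a tail-returning copy (stpcpy, POSIX.1-2008), and an end-bounded copy along the lines described above. The name str_ecopy is made up for the example; it is not the actual OpenLDAP routine.

#include <string.h>

/* Concatenation with stpcpy(): each call returns the tail, so nothing is
   ever rescanned the way repeated strcat() would rescan it. The caller
   must ensure dst is large enough. */
void join3(char *dst, const char *a, const char *b, const char *c)
{
    char *p = dst;
    p = stpcpy(p, a);
    p = stpcpy(p, b);
    p = stpcpy(p, c);
}

/* End-bounded copy: pass the end of the buffer once, append as much as
   fits, always NUL-terminate, and return the new tail for chaining. */
char *str_ecopy(char *dst, const char *src, const char *end)
{
    if (dst >= end)                   /* no room at all */
        return dst;
    while (dst < end - 1 && *src)
        *dst++ = *src++;
    *dst = '\0';
    return dst;
}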

The standard C string library is kind of horrible, and it's so easy to do a better job of what it does. When you have the opportunity to fix problems like this, you should take it every chance you get. If you've got an application that does a lot of string handling and you've been using the standard C string library, you can probably do a much better job. If you need to pass strings around, you probably want something like struct berval, where you keep an explicit length. It will save you a lot of work.

Malloc was 50-some percent of our profile. Again, it was horrible. These days we have super high-performance malloc libraries we can use, like jemalloc or Google's tcmalloc. But that should not be your first reaction. You should not just say, "Oh, let me put a better malloc library in there." If you don't fix anything else, the most you're going to get out of them is maybe 10%. You really want to examine where your memory allocations are occurring and see whether they're actually doing something necessary.

The most common occurrence we found was in functions that return an object to the caller, where the object is filled in with certain bits of data. I think this is a horrible pattern, but it's one that C++ uses all the time: you get constructors that say, "Here's an object for you." And it's so easy to fix. You say, "Hey, pass in the container." Most of the time the container can be a local variable on the caller's stack. So you go from stupid mallocs that are going to be immediately freed as soon as you've consumed the data, to no mallocs. So, C++ is horrible. Don't do things the way C++ does them.
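A sketch of the before and after, with made-up types; the point is that the caller can supply a stack variable, so the malloc/free pair disappears entirely:

#include <stdlib.h>

struct attr {          /* illustrative type, not OpenLDAP's */
    int   type;
    long  value;
};

/* Before: a heap object is created just to hand results back, and the
   caller usually frees it as soon as the data has been consumed. */
struct attr *parse_attr_alloc(const char *input)
{
    struct attr *a = malloc(sizeof(*a));
    if (a) {
        a->type  = 0;
        a->value = 0;   /* ... parse input into *a ... */
    }
    (void)input;
    return a;
}

/* After: the caller passes the container in, typically a local variable
   on its own stack. */
int parse_attr(struct attr *out, const char *input)
{
    out->type  = 0;
    out->value = 0;     /* ... parse input into *out ... */
    (void)input;
    return 0;
}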

These were very easy wins that, again, got us at least another 25% boost in performance. The next few items are a little more involved. Another frequent issue: we've parsed some data, we're going to build a structure out of it, and we start with a known size and just keep realloc'ing as we add elements. Again, this is horrible; it gets slower the more you use it. If you do things like this, the smart thing to do is iterate through all the elements you think you're going to add, count up their sizes, do a single malloc of the right size, and then fill in the fields. So instead of doing a series of malloc, realloc, realloc, realloc, you're just doing one allocation. That saves quite a bit of time.
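A hypothetical sketch of that pattern, joining string pieces: measure everything first, allocate once, then fill.

#include <stdlib.h>
#include <string.h>

char *join_pieces(const char **pieces, size_t n)
{
    size_t total = 1;                 /* room for the trailing NUL */
    for (size_t i = 0; i < n; i++)
        total += strlen(pieces[i]);

    char *buf = malloc(total);        /* one allocation, no reallocs */
    if (!buf)
        return NULL;

    char *p = buf;
    for (size_t i = 0; i < n; i++)
        p = stpcpy(p, pieces[i]);     /* tail-returning copy, as above */
    return buf;
}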

Then, we're talking about a network service that takes incoming requests over the wire, parses them, does something to process them, and then sends a reply back. Again, this is a binary protocol where every message has its length already encoded in it. We know the size of what we're dealing with, we've got a buffer that's big enough to hold it, and then we parse it into individual fields. The old code would malloc a new copy for every field it parsed out. So again, that's a whole bunch of mallocs happening for no good reason, because the original network buffer is still there. We changed it so that wherever we parse a value, we just set a pointer into the original network buffer. Instead of lots of mallocs and memcpys, we now have zero memcpys.
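A sketch of the zero-copy idea, with an invented length-prefixed encoding standing in for the real wire protocol: the parsed field just points into the request buffer, which stays alive for the lifetime of the request.

#include <stddef.h>

struct field {
    size_t      len;
    const char *val;   /* points into the network buffer, not a copy */
};

/* Parse one field, assuming one length byte followed by that many bytes.
   Returns a pointer to the next unparsed byte, or NULL on malformed input. */
const char *parse_field(const char *buf, const char *end, struct field *f)
{
    if (buf >= end)
        return NULL;
    size_t len = (unsigned char)*buf++;
    if ((size_t)(end - buf) < len)
        return NULL;
    f->len = len;
    f->val = buf;          /* no malloc, no memcpy */
    return buf + len;
}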

This one should be simpler. If everything you're doing is request-based, if you have individual requests that don't affect anything else and don't leave any global state, then it's really effective to have an arena associated with each request. You can allocate a whole bunch of stuff as you need it to process that request, and once the request is done, you can throw the entire thing away. You just reset it back to its virgin state without having to go through the trouble of freeing one element at a time. The funny thing is that this is the dynamic memory model Pascal used back in the 1980s. It's really simple and it's really fast.
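A minimal bump-arena sketch of that per-request model; real code would add overflow handling and a sizing policy, and the names here are illustrative:

#include <stdlib.h>

struct arena {
    char  *base;
    size_t size;
    size_t used;
};

int arena_init(struct arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base ? 0 : -1;
}

void *arena_alloc(struct arena *a, size_t n)
{
    n = (n + 15) & ~(size_t)15;       /* keep returned pointers aligned */
    if (a->used + n > a->size)
        return NULL;                  /* real code might grow or spill over */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

void arena_reset(struct arena *a)
{
    a->used = 0;                      /* "frees" everything at once */
}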

Some of the other more obvious techniques: if you know you're going to need a bunch of request structures, pre-allocate a whole bunch of them at startup time. It's also a good idea to keep a small number of them in reserve. For example, if your server gets overloaded, you always have a couple of request structures available that you can use to send replies back saying, "Hey, too busy," or something. So you never get an unresponsive server.
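Here's a sketch of pre-allocation with a small reserve held back for error replies. The names and sizes are made up, and in a real multithreaded server this free list would be protected by a lock or kept per-thread:

#define NREQS     1024
#define NRESERVED 8

struct request {
    struct request *next;
    /* ... request fields ... */
};

static struct request  pool[NREQS];
static struct request *free_list;
static int             nfree;

void req_pool_init(void)
{
    for (int i = 0; i < NREQS; i++) {
        pool[i].next = free_list;
        free_list = &pool[i];
    }
    nfree = NREQS;
}

/* Ordinary work stops when only the reserve is left, but a "too busy"
   reply can still always get a structure. */
struct request *req_get(int for_error_reply)
{
    if (!free_list || (!for_error_reply && nfree <= NRESERVED))
        return NULL;
    struct request *r = free_list;
    free_list = r->next;
    nfree--;
    return r;
}

void req_put(struct request *r)
{
    r->next = free_list;
    free_list = r;
    nfree++;
}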

With these techniques, malloc doesn't even show up in our runtime profiles anymore, at least not in the top 100. Just another note: mistakes happen along the way, and you might discover you've got some memory leaks in your code. I encourage you to check out a leak tracer I wrote, which is on GitHub; it's at least eight times faster than the malloc tracer in tcmalloc and six times faster than the one in glibc. So it'll save you a lot of time, and it won't heavily impact your runtime performance.

After you've tackled all the big hot spots in your call profile, you might see that everything's kind of flat and there are no longer any big obstacles apparent. If the performance is where you want it to be, that's cool, you're done. If you still need more performance, you've got some harder thinking to do. There are also a lot of interesting overheads that won't show up in a source-level call graph. For example, threads are supposed to be cheap, easy to spawn and throw away as much as you want, but the reality is that there's a cost to starting up and shutting down a thread. If you're going to do a lot of thread work over and over again, you really don't want to incur that cost repeatedly. So again, don't repeat yourself; don't do the same processing steps more than necessary. If you're using threads and you're just spawning them on demand, you really should switch to a thread pool, where you keep reusing the same threads over and over again.

I don't know if this is something you think about very often, but do you typically have debug or log functions scattered throughout your code? Yes, no? In OpenLDAP, they're littered all over the code base. The basic idea is you say, "Here's a message; output it if the log level is set to this value." So the debug function always has this check: does the provided level match the current system debug level? If it does, we do the message processing; if not, we just return and do nothing. It turns out that if this is scattered through your code enough, just the overhead of calling into a function and returning immediately has a significant impact on your throughput. On a modern CPU it pollutes your branch target buffer, it pollutes your return stack; it's just not good.
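To make that concrete, here's a sketch of the costly pattern and the macro-based fix described next; the names are illustrative, not OpenLDAP's actual logging macros.

#include <stdio.h>

extern int debug_level;

/* Costly version: every call site pays for a call and return even when
   the message is filtered out inside the function. */
void debug_fn(int level, const char *msg)
{
    if (!(debug_level & level))
        return;
    fputs(msg, stderr);
}

/* Macro version: the level check happens inline at the call site, so a
   disabled message costs one predictable branch and no function call. */
#define DEBUG(level, msg) \
    do { if (debug_level & (level)) debug_fn((level), (msg)); } while (0)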

The simple thing to do here is to use a macro to invoke your DEBUG function. In the macro, you put the "if" statement that checks your debug level, and then you're golden. In situations like this, it's cheaper to do a single if than it is to make a function call to a no-op function, so try to avoid that. Here's another issue, probably more common in languages like C++. The LDAP search operation has something like 12 parameters: base, scope, filter, attributes only, values only, and so on. We take this in from the network, pass it from the front end into the back end, do syntax checking and all these things, and every function we hand it off to has to receive all 12 parameters. It was kind of horrible.

As Alan Perlis said, having a large number of parameters to a function tends to mean you've done something wrong. But even if you get it correct every single time, you're wasting time, because every one of those parameters has to be pushed onto the stack, then popped off the stack by the callee, and when the callee calls something else, it has to push them all back onto the stack again. It's kind of horrible.

In our code, we replaced all of that with a single structure. The structure still has 12 fields or whatever it is, but when you pass it around, it only costs one pointer in the parameter list. Surprisingly, this can be up to 8% of your execution time, and it doesn't show up in a source-level profile. You can't see it there.
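An illustrative sketch of the change; the field names are made up rather than taken from OpenLDAP's actual operation structure:

/* Before: every call in the chain pushes a dozen arguments.
   int do_search(const char *base, int scope, const char *filter,
                 char **attrs, int attrsonly, int sizelimit, ...);  */

/* After: the fields live in one structure and only a pointer is passed. */
struct search_args {
    const char  *base;
    int          scope;
    const char  *filter;
    char       **attrs;
    int          attrsonly;
    int          sizelimit;
    int          timelimit;
    /* ... the rest of the dozen parameters ... */
};

int do_search(struct search_args *op);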

If you were in Nitsan's talk earlier today, you might have caught his mention of false sharing. Basically, if you have shared data that's accessed from multiple CPUs at the same time, and fields used by different CPUs share a cache line, then every CPU that writes to it causes cache invalidation on every other CPU that accesses it. It's kind of horrible. It's not a problem on a single-processor machine, but it's something you run into more and more, especially these days with 16-core, 32-core, whatever boxes. So it's something you have to be more aware of now.

The advice here about ordering elements within a structure is actually something I was taught in the 1980s when I was learning FORTRAN, but it still applies today. If you can avoid it, you don't really want hidden padding in your structures. If you've got data structures that are going to be shared between threads, you do want them to be CPU cache-line aligned. If you can't just declare them that way, the portable methods are posix_memalign, or mmap, or valloc, whatever is available to you. If you want to see the impact of what's going on, again, the Linux perf command can show you this, because it can show you cache hit and miss ratios and cache timing delays.
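A sketch of both techniques: padding and aligning a shared structure to the cache-line size, and using posix_memalign for heap allocations. The 64-byte line size is an assumption; check your target hardware.

#include <stdlib.h>

#define CACHE_LINE 64   /* typical on x86; verify for your hardware */

/* Each counter gets its own cache line, so one thread's writes don't
   invalidate the line another thread is using (no false sharing). */
struct padded_counter {
    unsigned long value;
    char pad[CACHE_LINE - sizeof(unsigned long)];
} __attribute__((aligned(CACHE_LINE)));   /* GCC/Clang; C11 also offers _Alignas */

/* Cache-line-aligned heap allocation via posix_memalign. */
void *alloc_lines(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, size) != 0)
        return NULL;
    return p;
}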

Here's another thing that tends not to show up in a source-level call graph. By default, these profilers show you CPU time used, and when you're sitting waiting on a lock, you're using zero CPU time. So mutexes typically don't show up in a normal call trace. There's a cool little tool called mutrace whose only job is to measure mutex times; it covers mutexes and condition variables. In OpenLDAP, for example, we found high contention on the single mutex for our thread pool's work queue, which kind of makes sense if there's only one of them. The interesting thing is that, testing on a quad-core box, when we split this up into four queues, we actually got a better than 4x performance boost. It's nonlinear. So this is again something to be aware of: it won't show up in a normal call trace, you have to go looking for it explicitly.
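A sketch of the queue-splitting idea, with illustrative names: stripe the work across several queues, each with its own mutex, so threads rarely contend on the same lock.

#include <pthread.h>

#define NQUEUES 4

struct work;                      /* opaque work item */

struct work_queue {
    pthread_mutex_t  lock;
    struct work     *head;
};

static struct work_queue queues[NQUEUES];

void queues_init(void)
{
    for (int i = 0; i < NQUEUES; i++) {
        pthread_mutex_init(&queues[i].lock, NULL);
        queues[i].head = NULL;
    }
}

/* Pick a queue by some cheap hash, e.g. of the thread or connection id,
   so contention is spread instead of funneling through one mutex. */
struct work_queue *pick_queue(unsigned hint)
{
    return &queues[hint % NQUEUES];
}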

You discover these things over the course of time; I'm talking about a development process where we discovered them over intervals of months. You eliminate one problem and something else pops up. Sometimes it's amazing to us: how did this big problem hide from us for so long? But it's just a matter of changing the dynamics of your program. As you fix certain problems, other timing relationships change, so things that weren't a problem before become more of one. The way to survive this, from a morale standpoint, is to keep good notes and show that you have been making progress over time. Those notes should include test parameters, profile results, all of these things you can use as documentation that says, "Yes, we changed this function, we've eliminated this bottleneck, so this is something new."

Then at the end of all this refinement and iteration, you may still discover it's not where it needs to be. Then the only solution is to start over. How many of you have used Berkeley DB in your projects? What's your experience with it? Positive? In OpenLDAP, we used it for quite a long time. It was always a massive source of complexity for us, because Berkeley DB itself is too slow for the request rates we want to handle, so we had our own layer of caching on top of it. And that turns out to be massive overhead, because it means data can appear in three places at once: in the filesystem cache (it's almost always in the filesystem cache anyway), in Berkeley DB's own cache, and then in our own entry cache. So the techniques we used to make Berkeley's performance acceptable were hugely wasteful, just in terms of memory.

The other thing is that Berkeley has its own transaction management system and its own locking system. Even if you use it properly, you still get deadlocks in the operations you're doing; even when it's being used correctly, you have to detect deadlocks, abort, back off, and retry. So it's pretty poor for throughput. And then you have to coordinate its locking system with the locks we had to use to protect our own cache. We're now talking about two levels of locks and three levels of caches. It was pretty much horrible, and over the course of years it was always the cause of new bug reports.

Sometimes you have to step back and say, "Gosh, we keep having problems with this thing. Is this particular thing what we should be doing?" It's like, "If Berkeley DB is so slow that we have to build all these band-aids around it, why are we using it?" Well, it took us nine years to realize maybe that's not the thing to do. With LMDB, we looked at all the things we hated about Berkeley, lock management and cache management, and said, "Get rid of all of those." Now, making such a large change, tossing out the entire local data store and putting in a new one, takes some flexibility. We were kind of lucky that we already had a modular back-end interface, so plugging in a new data store isn't too traumatic for us. But it could be a much larger problem, and you could get lost in the muck along the way. So you have to make sure you have the original design goals well in mind, and make sure you're actually solving them with the new solution you bring in.

So it turns out that with LMDB, since we're using MVCC, we can do reads without any locks at all. Because we use a read-only memory map, we can return pointers to database data without doing any memcpys or mallocs. In fact, when you run OpenLDAP today with back-mdb and put a pure read search load on it, the read threads do no blocking calls at all. The only system calls they invoke are the ones to write packets back to the client. Also, because of the way we implemented MVCC and the read-only memory map, the database structure is completely immune to corruption from system crashes. You can kick the disk drives, you can pull the power plug, whatever; the database will come up instantly, perfectly, every time.

Just a recap. You always have to aim for correctness first. Yes, you can tolerate slow code at the very beginning of a project, but at least from my perspective, getting the right answer too late is still wrong. Fixing these kinds of problems is an iterative process: you're always going to discover a new bottleneck after you fix the last one. There are lots of tools out there that can help you with this process, and you probably need to use several of them, because the more perspectives you get on the problem, the clearer the solution becomes. And sometimes, after all this iteration, you might just have to throw things out and start over. The ultimate goal is to do just what's necessary. An efficient program does what it needs to do to get your work done, and then gets out of the way, and that's really what you want.

 


 

Recorded at:

Apr 23, 2019
