Memory and Native Code Performance
In a deep dive on performance, Eric Brumer explained why memory is often the most critical component of application performance. While this session was on C++ development, much of what he said applies to managed code as well.
A single core in an Intel Sandy Bridge CPU has six pipelines, meaning that it can process up to six instructions per cycle. Each core also has 64 kB of L1 cache and 256 kB of L2 cache usable for instructions and/or data. The L3 cache, which is shared across all four cores, ranges from 8 to 30 MB depending on the model.
Reading a single 32-bit integer from the L1 cache takes 4 cycles. The L2 cache takes 12 cycles, the same amount of time it would take to calculate the square root of a number. The L3 cache takes more than twice as long, 26 cycles. Reading from main memory takes far, far longer.
Data Access Patterns
To begin the discussion on data access patterns, Eric offered the classic example of poor spatial locality in matrix multiplication and its solution.
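The code itself is not reproduced here, but the transformation he described can be sketched as follows (the function names are hypothetical; the loop bodies are the standard textbook kernel):

```cpp
#include <vector>
#include <cstddef>

// Naive multiply: the innermost loop reads b[k][j] with k varying, walking
// down a column of B. Each read lands on a different cache line, so spatial
// locality is poor.
void multiply_naive(const std::vector<std::vector<double>>& a,
                    const std::vector<std::vector<double>>& b,
                    std::vector<std::vector<double>>& c) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

// Same arithmetic with the j and k loops swapped: the innermost loop now
// walks b[k] and c[i] left to right, so every element of a fetched cache
// line is used before it is evicted.
void multiply_interchanged(const std::vector<std::vector<double>>& a,
                           const std::vector<std::vector<double>>& b,
                           std::vector<std::vector<double>>& c) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] += a[i][k] * b[k][j];
}
```

Both versions perform exactly the same additions and multiplications; only the order of memory accesses differs.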
On a powerful system, swapping the j and k for loops can result in a 10x improvement. On a low-power desktop machine that improvement can be as high as 18x. And to reiterate, the calculation itself was not changed; the loops were merely reordered.
Side note: The release version of Visual Studio 2013 is expected to be able to recognize and correct the above example. This requires an improved data dependency analysis engine that was not ready for the preview version released last week.
Imagine that you have two arrays of four million floats and want to add them element by element into a third array (c[i] = a[i] + b[i]; i = 0..3,999,999). In theory this would be perfectly scalable, and parallelizing it across 5 cores would result in a 5x speedup, but Eric found that he was only getting a 1.6x speedup (32% scalability ratio). Increasing that to 10 cores bumped it to a 2.4x speedup for a dismal 24% scalability ratio.
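The kernel in question can be sketched like this (the helper name and the use of std::thread are assumptions; the talk used the compiler's parallelization support):

```cpp
#include <thread>
#include <vector>
#include <cstddef>

// Split the element-wise add c[i] = a[i] + b[i] across n_threads workers.
// The arithmetic is trivially parallel; as the measurements showed, the
// real limit is whether the three arrays fit in the available L3 cache.
void parallel_add(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, unsigned n_threads) {
    std::vector<std::thread> workers;
    const std::size_t chunk = a.size() / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = (t + 1 == n_threads) ? a.size() : begin + chunk;
        workers.emplace_back([&a, &b, &c, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                c[i] = a[i] + b[i];  // memory-bound, not compute-bound
        });
    }
    for (auto& w : workers)
        w.join();
}
```

With three arrays of four million 4-byte floats, the working set is 48 MB regardless of how many threads run the loop, which is why adding cores on the same CPU helped so little.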
When reviewing the results, he found that all of the time was spent on loading memory. The three arrays totaled 48 MB, larger than the 30 MB of L3 cache his 10 core CPU offers. Upping the parallelization to 20 cores spread the load over 2 CPUs and doubled his cache allowance to 60 MB. This in turn resulted in an 18x speedup (90% scalability ratio).
Eric went on to say that the number of cores doesn't really matter; it is all about the amount of cache available. He noted that this is somewhat of an edge case. Most of the time programs will run faster when related threads are running on the same CPU, as they can more easily share cache. But in this example that is the exact opposite of what you want, and the standard parallelization libraries are actively working against you.
In his next example, Eric showed a particle physics problem where 4096 elements were being acted upon by each other using the standard gravitational formulas. Using a normal loop he was seeing 8 frames per second. He was able to cut the number of iterations in half and increase the performance to 18 frames per second using Visual C++'s automatic vectorization and the 128-bit SSE2 instructions. The 256-bit AVX instructions cut the number of iterations in half again, but only resulted in 23 fps.
The problem turned out to be how the data was structured. For each particle the position data, which doesn't change in a given frame, was commingled with the acceleration data being calculated. To load one 256-bit YMM register with the X coordinates of 8 particles, 9 separate instructions need to be executed. Another 18 instructions are needed for the Y and Z coordinates.
Rewriting this code so that the X, Y, and Z coordinates each ‘lived’ in their own contiguous arrays was beyond the scope of the research. So instead Eric simply copied them into three temporary holding arrays, ran a modified version of the main loop, and then copied them back out.
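A minimal sketch of that copy-in/copy-out approach might look like this (the struct layout and function names are hypothetical; the actual force loop is elided):

```cpp
#include <vector>
#include <cstddef>

// Array-of-structures layout: position and acceleration data commingled,
// which defeats efficient vector loads.
struct Particle {
    float x, y, z;     // position (constant within a frame)
    float ax, ay, az;  // acceleration (computed each frame)
};

// Gather each coordinate into its own contiguous array so the vectorizer
// can fill a 256-bit register with 8 floats in a single load, run the
// (elided) force loop, then scatter the results back.
void update_frame(std::vector<Particle>& particles) {
    const std::size_t n = particles.size();
    std::vector<float> xs(n), ys(n), zs(n);
    for (std::size_t i = 0; i < n; ++i) {  // copy in
        xs[i] = particles[i].x;
        ys[i] = particles[i].y;
        zs[i] = particles[i].z;
    }
    // ... vectorized force/position loop over xs, ys, zs goes here ...
    for (std::size_t i = 0; i < n; ++i) {  // copy out
        particles[i].x = xs[i];
        particles[i].y = ys[i];
        particles[i].z = zs[i];
    }
}
```

The extra copies look wasteful, but they convert 27 gather instructions per vector load into 3 plain loads inside the hot loop.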
Still using the 256-bit AVX instructions, these extra memory copies actually improved performance to 42 fps. It should be noted that this improvement, 8 fps to 42 fps, was achieved on a single core of a laptop.
The next issue is memory alignment. Each line in an L1 cache is 64 bytes wide. If the size of each element in your data structure evenly divides the cache line size (e.g. 32, 16, 8, or 4 bytes), then the elements fit in a cache line with no wasted space. But that can only happen if the elements are aligned along the 64-byte boundaries.
Turning back to the previous example, Eric forced unaligned access by changing the code to skip the first element in the array. This resulted in an 8% performance penalty versus starting the j for-loop at 0.
According to the C standard, malloc is only required to return memory that is 8-byte aligned. To get something that is 32- or 64-byte aligned you have to use _aligned_malloc. Work is currently being done in the compiler to automatically call _aligned_malloc when the data is being used in a loop body, but there is a risk of increasing fragmentation. The goal is to have something available in VS 2013 Update 1.
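A portable sketch of cache-line-aligned allocation, using the C11/C++17 std::aligned_alloc rather than the MSVC-specific _aligned_malloc (the helper name is hypothetical):

```cpp
#include <cstdlib>
#include <cstdint>
#include <cstddef>

// Allocate a float array aligned to a 64-byte cache line. On MSVC the
// equivalent call is _aligned_malloc(bytes, alignment), which must be
// released with _aligned_free rather than free.
float* alloc_cacheline_floats(std::size_t count) {
    constexpr std::size_t alignment = 64;
    // aligned_alloc requires the size to be a multiple of the alignment,
    // so round the byte count up.
    const std::size_t bytes =
        ((count * sizeof(float) + alignment - 1) / alignment) * alignment;
    return static_cast<float*>(std::aligned_alloc(alignment, bytes));
}
```

Memory from std::aligned_alloc is released with the ordinary std::free; the mismatch between free and _aligned_free is one reason the MSVC variants cannot simply be dropped in behind existing malloc calls.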
To learn more about these and other memory-related performance issues watch the Build 2013 video Native Code Performance and Memory: The Elephant in the CPU.