I recently started playing with GCC’s prefetch builtin, which allows the programmer to explicitly tell the processor to load given memory locations in cache. You can optionally inform the compiler of the locality of the data (i.e. how much priority the CPU should give to keep that piece of data around for later use) as well as whether or not the memory location will be written to. Remarkably, the extension is very straighforward to use (if not to use correctly) and simply requires calling the __builtin_prefetch
function with a pointer to the memory location to be loaded.
It turns out that in certain situations, tremendous speed-ups of several factors can be obtained with this facility. In fact, I’m amazed that I haven’t read more about this. In particular, when memory is being loaded “out of sequence” in a memory bandwidth-constrained loop, you can often benefit a great deal from explicit prefetch instructions. For example, I am currently working on a program which has has two inners loops in sequence. First, an array is traversed one way, and then it is traversed in reverse. The details of why this is done aren’t important (it’s an optical transfer matrix computation, if you’re interested) but the salient aspect of the code is that the computation at each iteration is not that great, and so memory bandwidth is the main issue. Here is the relevent section of code where the arrays are accessed in reverse:
/*
* Step backward through structure, calculating reverse matrices.
*/
for (dx = n-1; dx > 0; dx--)
{
Trev1[dx] = Trev1[dx+1]*Tlay1[dx] + Trev2[dx+1]*conj(Tlay2[dx]);
Trev2[dx] = Trev1[dx+1]*Tlay2[dx] + Trev2[dx+1]*conj(Tlay1[dx]);
dTrev1[dx] = dTrev1[dx+1]*Tlay1[dx] + dTrev2[dx+1]*conj(Tlay2[dx]) +
Trev1[dx+1]*dTlay1[dx] + Trev2[dx+1]*conj(dTlay2[dx]);
dTrev2[dx] = dTrev1[dx+1]*Tlay2[dx] + dTrev2[dx+1]*conj(Tlay1[dx]) +
Trev1[dx+1]*dTlay2[dx] + Trev2[dx+1]*conj(dTlay1[dx]);
}
Despite having exactly same number of operations in the forward and reverse loops, it turns out that the vast majority of time was being spend in this second (reverse) loop!
Why? Well, I can’t be entirely certain, but I assume that when memory is accessed, the chip loads not just the single floating point double being requested, but an entire cache line starting at that address. Thus, the data for the next couple of iterations is always loaded into L1 cache ahead of time when you’re iterating forward in address space. However, in the reverse loop, the chip isn’t smart enough to notice that I’m going backwards (nor should it be) and so it has to wait for the data to come from either L2 or main memory every single iteration. By adding a few simple prefetch statements to the second loop, however, the time spent in this section of code went way down. Here is the new code for the second loop:
/*
* Step backward through structure, calculating reverse matrices.
*/
for (dx = n-1; dx > 0; dx--)
{
Trev1[dx] = Trev1[dx+1]*Tlay1[dx] + Trev2[dx+1]*conj(Tlay2[dx]);
Trev2[dx] = Trev1[dx+1]*Tlay2[dx] + Trev2[dx+1]*conj(Tlay1[dx]);
__builtin_prefetch(Trev1+dx-1,1);
__builtin_prefetch(Trev2+dx-1,1);
__builtin_prefetch(Tlay1+dx-1);
__builtin_prefetch(Tlay2+dx-1);
dTrev1[dx] = dTrev1[dx+1]*Tlay1[dx] + dTrev2[dx+1]*conj(Tlay2[dx]) +
Trev1[dx+1]*dTlay1[dx] + Trev2[dx+1]*conj(dTlay2[dx]);
dTrev2[dx] = dTrev1[dx+1]*Tlay2[dx] + dTrev2[dx+1]*conj(Tlay1[dx]) +
Trev1[dx+1]*dTlay2[dx] + Trev2[dx+1]*conj(dTlay1[dx]);
}
The prefetch instructions tell the processor to request the next loop’s data, so that the data is making its way through the memory bus while the current computation is being done in parallel. In this case, this section of code ran over three times as fast with the prefetch instructions! About the easiest optimization you’ll ever make. (The second argument given to the prefetch instruction indicates that the memory in question will be written to.)
When playing around with prefetch, you just have to experiment with how much to fetch and how far in advance you need to issue the fetch. Too far in advance and you increase overhead and run the risk of having the data drop out of cache before you need it (L1 cache is very small). Too late and the data won’t have arrived on the bus by the time you need it.
Why did I not prefetch the dTrev1 and dTrev2 memory locations? Well, I tried and it didn’t help. I really have no idea why. Maybe I exceeded the memory bandwidth, and so there was no point in loading it in. I then tried loading it in even earlier (two loops ahead) and that didn’t help. Perhaps in that case the cache got overloaded. Who knows? Cache optimization is a black art. But when it works, the payoff can be significant. It’s a technique that is worth exploring whenever you are accessing memory in a loop, especially out of order.