I recently started playing with GCC’s prefetch builtin, which allows the programmer to explicitly tell the processor to load given memory locations in cache. You can optionally inform the compiler of the locality of the data (i.e. how much priority the CPU should give to keep that piece of data around for later use) as well as whether or not the memory location will be written to. Remarkably, the extension is very straighforward to use (if not to use correctly) and simply requires calling the __builtin_prefetch function with a pointer to the memory location to be loaded.

It turns out that in certain situations, tremendous speed-ups of several factors can be obtained with this facility. In fact, I’m amazed that I haven’t read more about this. In particular, when memory is being loaded “out of sequence” in a memory bandwidth-constrained loop, you can often benefit a great deal from explicit prefetch instructions. For example, I am currently working on a program which has has two inners loops in sequence. First, an array is traversed one way, and then it is traversed in reverse. The details of why this is done aren’t important (it’s an optical transfer matrix computation, if you’re interested) but the salient aspect of the code is that the computation at each iteration is not that great, and so memory bandwidth is the main issue. Here is the relevent section of code where the arrays are accessed in reverse:

/*
* Step backward through structure, calculating reverse matrices.
*/
for (dx = n-1; dx > 0; dx--)
{
Trev1[dx] = Trev1[dx+1]*Tlay1[dx] + Trev2[dx+1]*conj(Tlay2[dx]);
Trev2[dx] = Trev1[dx+1]*Tlay2[dx] + Trev2[dx+1]*conj(Tlay1[dx]);
dTrev1[dx] = dTrev1[dx+1]*Tlay1[dx] + dTrev2[dx+1]*conj(Tlay2[dx]) +
Trev1[dx+1]*dTlay1[dx] + Trev2[dx+1]*conj(dTlay2[dx]);
dTrev2[dx] = dTrev1[dx+1]*Tlay2[dx] + dTrev2[dx+1]*conj(Tlay1[dx]) +
Trev1[dx+1]*dTlay2[dx] + Trev2[dx+1]*conj(dTlay1[dx]);
}

Despite having exactly same number of operations in the forward and reverse loops, it turns out that the vast majority of time was being spend in this second (reverse) loop!

Why? Well, I can’t be entirely certain, but I assume that when memory is accessed, the chip loads not just the single floating point double being requested, but an entire cache line starting at that address. Thus, the data for the next couple of iterations is always loaded into L1 cache ahead of time when you’re iterating forward in address space. However, in the reverse loop, the chip isn’t smart enough to notice that I’m going backwards (nor should it be) and so it has to wait for the data to come from either L2 or main memory every single iteration. By adding a few simple prefetch statements to the second loop, however, the time spent in this section of code went way down. Here is the new code for the second loop:

/*
* Step backward through structure, calculating reverse matrices.
*/
for (dx = n-1; dx > 0; dx--)
{
Trev1[dx] = Trev1[dx+1]*Tlay1[dx] + Trev2[dx+1]*conj(Tlay2[dx]);
Trev2[dx] = Trev1[dx+1]*Tlay2[dx] + Trev2[dx+1]*conj(Tlay1[dx]);
__builtin_prefetch(Trev1+dx-1,1);
__builtin_prefetch(Trev2+dx-1,1);
__builtin_prefetch(Tlay1+dx-1);
__builtin_prefetch(Tlay2+dx-1);
dTrev1[dx] = dTrev1[dx+1]*Tlay1[dx] + dTrev2[dx+1]*conj(Tlay2[dx]) +
Trev1[dx+1]*dTlay1[dx] + Trev2[dx+1]*conj(dTlay2[dx]);
dTrev2[dx] = dTrev1[dx+1]*Tlay2[dx] + dTrev2[dx+1]*conj(Tlay1[dx]) +
Trev1[dx+1]*dTlay2[dx] + Trev2[dx+1]*conj(dTlay1[dx]);
}

The prefetch instructions tell the processor to request the next loop’s data, so that the data is making its way through the memory bus while the current computation is being done in parallel. In this case, this section of code ran over three times as fast with the prefetch instructions! About the easiest optimization you’ll ever make. (The second argument given to the prefetch instruction indicates that the memory in question will be written to.)

When playing around with prefetch, you just have to experiment with how much to fetch and how far in advance you need to issue the fetch. Too far in advance and you increase overhead and run the risk of having the data drop out of cache before you need it (L1 cache is very small). Too late and the data won’t have arrived on the bus by the time you need it.

Why did I not prefetch the dTrev1 and dTrev2 memory locations? Well, I tried and it didn’t help. I really have no idea why. Maybe I exceeded the memory bandwidth, and so there was no point in loading it in. I then tried loading it in even earlier (two loops ahead) and that didn’t help. Perhaps in that case the cache got overloaded. Who knows? Cache optimization is a black art. But when it works, the payoff can be significant. It’s a technique that is worth exploring whenever you are accessing memory in a loop, especially out of order.

Tagged with:
 

12 Responses to Accelerating code using GCC’s prefetch extension

  1. [...] Jonathan Birge » Blog Archive » Accelerating code using GCC’s … [...]

  2. Jared says:

    Great article. I hadnt heard of this builtin. Just FYI: the second __builtin_prefetch should probably be for Trev2 not Trev1.

  3. Thank your this article. This will have tremendous effects on my work and this gcc feature was not known to me. Again, thanks.

  4. Alfons Laarman says:

    That’s interesting. And you are right, I also couldn’t find anything really elaborative on this built-in (That’s why I’m replying your year-old article).

    How does your data structure look? maybe this has something to do with why dTrev1 and dTrev2 need not to be prefetched. I cannot say I agree with the explanation of the worse performance of the reverse loop. Cache lines are mapped at fixed aligned memory locations, so a line is not loaded exactly from the memory location that you are accessing. What other prefeteches did you try and what where the results?

    • Jonathan says:

      I don’t have good notes for the other prefetches that I tried, other than the ones I mentioned in the article. While you clearly know a lot about CPU architecture, I disagree with your disagreement. Even though cache lines cannot be loaded from arbitrary addresses, that only matters for the first few memory accesses of the loop (how many being a function of the size of a cache line and the random luck of where the array ended up in memory. Regardless of where in a cache line the first address of the array falls, eventually we’ll be in a situation where the loop is running through an entire cache line, and then my argument applies in full. For long enough loops, the granularity of cache line mapping will be negligible. Did I misunderstand your argument?

      The data are all just arrays of doubles, at least in the snippet I showed. I don’t know why some prefetches worked and other didn’t. I’m not sure it is easily predictable, either. There’s a lot going on, I would imagine, and my guess is that predicting cache hits is like predicting the weather.

    • Alfons laarman, congratulations.

  5. Sexy post! Oh yea, here’s a tip: SAEIE and WMNS!

  6. basur says:

    excellent post,
    thanks

  7. buy gold says:

    What was the most important thing Thomas Jefferson did during his political career