Accelerating code using GCC’s prefetch extension

I recently started playing with GCC's prefetch builtin, which allows the programmer to explicitly tell the processor to load given memory locations into cache. You can optionally inform the compiler of the locality of the data (i.e. how much priority the CPU should give to keeping that piece of data around for later use) as well as whether or not the memory location will be written to. Remarkably, the extension is very straightforward to use (if not to use correctly) and simply requires calling the __builtin_prefetch function with a pointer to the memory location to be loaded.
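
For reference, the builtin takes up to two optional integer arguments after the address: a read/write flag (0 for read, 1 for write) and a locality hint (from 0, meaning no temporal locality, up to 3, meaning high temporal locality, the default). Both must be compile-time constants. A minimal illustration of the three call forms (the array here is just a stand-in):

/*
 * __builtin_prefetch(addr);                addr is a pointer to the data
 * __builtin_prefetch(addr, rw);            rw: 0 = read (default), 1 = write
 * __builtin_prefetch(addr, rw, locality);  locality: 0 (evict soon) to 3 (keep cached)
 */
double x[256];

__builtin_prefetch(&x[0]);        /* prefetch for a read, default locality 3 */
__builtin_prefetch(&x[0], 1);     /* prefetch for a write */
__builtin_prefetch(&x[0], 0, 1);  /* prefetch for a read, low temporal locality */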

It turns out that in certain situations, tremendous speed-ups of several factors can be obtained with this facility. In fact, I'm amazed that I haven't read more about this. In particular, when memory is being loaded "out of sequence" in a memory-bandwidth-constrained loop, you can often benefit a great deal from explicit prefetch instructions. For example, I am currently working on a program which has two inner loops in sequence. First, an array is traversed one way, and then it is traversed in reverse. The details of why this is done aren't important (it's an optical transfer matrix computation, if you're interested), but the salient aspect of the code is that the amount of computation per iteration is small, and so memory bandwidth is the main issue. Here is the relevant section of code where the arrays are accessed in reverse:

/*
 * Step backward through structure, calculating reverse matrices.
 */
for (dx = n-1; dx > 0; dx--)
{
    Trev1[dx] = Trev1[dx+1]*Tlay1[dx] + Trev2[dx+1]*conj(Tlay2[dx]);
    Trev2[dx] = Trev1[dx+1]*Tlay2[dx] + Trev2[dx+1]*conj(Tlay1[dx]);
    dTrev1[dx] = dTrev1[dx+1]*Tlay1[dx] + dTrev2[dx+1]*conj(Tlay2[dx]) +
                 Trev1[dx+1]*dTlay1[dx] + Trev2[dx+1]*conj(dTlay2[dx]);
    dTrev2[dx] = dTrev1[dx+1]*Tlay2[dx] + dTrev2[dx+1]*conj(Tlay1[dx]) +
                 Trev1[dx+1]*dTlay2[dx] + Trev2[dx+1]*conj(dTlay1[dx]);
}

Despite the forward and reverse loops having exactly the same number of operations, it turned out that the vast majority of the time was being spent in this second (reverse) loop!

Why? Well, I can’t be entirely certain, but I assume that when memory is accessed, the chip loads not just the single floating point double being requested, but an entire cache line starting at that address. Thus, the data for the next couple of iterations is always loaded into L1 cache ahead of time when you’re iterating forward in address space. However, in the reverse loop, the chip isn’t smart enough to notice that I’m going backwards (nor should it be) and so it has to wait for the data to come from either L2 or main memory every single iteration. By adding a few simple prefetch statements to the second loop, however, the time spent in this section of code went way down. Here is the new code for the second loop:

/*
 * Step backward through structure, calculating reverse matrices.
 */
for (dx = n-1; dx > 0; dx--)
{
    Trev1[dx] = Trev1[dx+1]*Tlay1[dx] + Trev2[dx+1]*conj(Tlay2[dx]);
    Trev2[dx] = Trev1[dx+1]*Tlay2[dx] + Trev2[dx+1]*conj(Tlay1[dx]);
    __builtin_prefetch(Trev1+dx-1,1);
    __builtin_prefetch(Trev2+dx-1,1);
    __builtin_prefetch(Tlay1+dx-1);
    __builtin_prefetch(Tlay2+dx-1);
    dTrev1[dx] = dTrev1[dx+1]*Tlay1[dx] + dTrev2[dx+1]*conj(Tlay2[dx]) +
                 Trev1[dx+1]*dTlay1[dx] + Trev2[dx+1]*conj(dTlay2[dx]);
    dTrev2[dx] = dTrev1[dx+1]*Tlay2[dx] + dTrev2[dx+1]*conj(Tlay1[dx]) +
                 Trev1[dx+1]*dTlay2[dx] + Trev2[dx+1]*conj(dTlay1[dx]);
}

The prefetch instructions tell the processor to request the next iteration's data, so that the data makes its way across the memory bus in parallel with the current computation. In this case, this section of code ran more than three times as fast with the prefetch instructions! About the easiest optimization you'll ever make. (The second argument given to the prefetch instruction indicates that the memory in question will be written to.)
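
To see the effect in isolation, here is a minimal, self-contained sketch of my own (not taken from the program above) that walks two large arrays backwards and prefetches the next iteration's operands; compile with and without -DUSE_PREFETCH to compare. Note that a modern hardware prefetcher may already detect the descending stride, so the difference can be much smaller than on 2009-era hardware (a point raised in the comments below).

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)  /* 16M doubles per array: far larger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double sum = 0.0;
    clock_t t0, t1;
    long i;

    if (!a || !b)
        return 1;
    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 0.5; }

    t0 = clock();
    for (i = N - 2; i > 0; i--)
    {
#ifdef USE_PREFETCH
        /* Request the next iteration's operands while this one computes. */
        __builtin_prefetch(a + i - 1);
        __builtin_prefetch(b + i - 1);
#endif
        sum += a[i] * b[i + 1];
    }
    t1 = clock();

    printf("sum = %g, time = %.3f s\n", sum,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    free(a);
    free(b);
    return 0;
}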

When playing around with prefetch, you just have to experiment with how much to fetch and how far in advance to issue the fetch. Too far in advance and you increase overhead and run the risk of the data being evicted from cache before you need it (L1 cache is very small). Too late and the data won't have arrived by the time you need it.
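
One way to make that experiment systematic is to hoist the prefetch distance into a single constant and sweep it. Here is a sketch against the loop above, where PREFETCH_DIST is a hypothetical tuning knob of my own (the code above hard-codes a distance of one iteration):

#define PREFETCH_DIST 4  /* iterations ahead; tune empirically per machine */

for (dx = n-1; dx > 0; dx--)
{
    if (dx >= PREFETCH_DIST)
    {
        __builtin_prefetch(Trev1 + dx - PREFETCH_DIST, 1);
        __builtin_prefetch(Trev2 + dx - PREFETCH_DIST, 1);
        __builtin_prefetch(Tlay1 + dx - PREFETCH_DIST);
        __builtin_prefetch(Tlay2 + dx - PREFETCH_DIST);
    }
    /* ... same matrix computation as in the loop above ... */
}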

Why did I not prefetch the dTrev1 and dTrev2 memory locations? Well, I tried, and it didn't help. I really have no idea why. Maybe I had already saturated the memory bandwidth, so there was no point in requesting them early. I then tried loading them in even earlier (two iterations ahead), and that didn't help either. Perhaps in that case the cache got overloaded. Who knows? Cache optimization is a black art. But when it works, the payoff can be significant. It's a technique worth exploring whenever you are accessing memory in a loop, especially out of order.

Comments on "Accelerating code using GCC's prefetch extension"

  1. Jared

    Great article. I hadn't heard of this builtin. Just FYI: the second __builtin_prefetch should probably be for Trev2, not Trev1.

  2. Alfons Laarman

    That's interesting. And you are right, I also couldn't find anything really detailed on this built-in (which is why I'm replying to your year-old article).

    What does your data structure look like? Maybe this has something to do with why dTrev1 and dTrev2 don't need to be prefetched. I cannot say I agree with the explanation of the worse performance of the reverse loop. Cache lines are mapped at fixed, aligned memory locations, so a line is not loaded starting exactly at the memory location you are accessing. What other prefetches did you try, and what were the results?

    1. Jonathan Post author

      I don't have good notes for the other prefetches that I tried, other than the ones I mentioned in the article. While you clearly know a lot about CPU architecture, I disagree with your disagreement. Even though cache lines cannot be loaded from arbitrary addresses, that only matters for the first few memory accesses of the loop (how many being a function of the size of a cache line and the random luck of where the array ended up in memory). Regardless of where in a cache line the first address of the array falls, eventually we'll be in a situation where the loop is running through an entire cache line, and then my argument applies in full. For long enough loops, the granularity of cache line mapping will be negligible. Did I misunderstand your argument?

      The data are all just arrays of doubles, at least in the snippet I showed. I don't know why some prefetches worked and others didn't. I'm not sure it's easily predictable, either. There's a lot going on, I would imagine, and my guess is that predicting cache hits is like predicting the weather.

  3. jp

    Hi,

    really nice post! Thanks.

    About this:
    “You can optionally inform the compiler of the locality of the data (i.e. how much priority the CPU should give to keep that piece of data around for later use) as well as whether or not the memory location will be written to”.

    How can you do that?
    If I understood correctly, the function __builtin_prefetch loads the given data into cache. Right? Is there any function to inform the compiler about the locality?

    Thanks

    1. Jonathan Post author

      Sorry for the delay in replying. Perhaps I was using the wrong terminology, but by "locality" I simply meant the probability of the data being needed in a register. So, by telling the compiler to pull in a set of data ahead of time, you are implicitly informing it that those pieces of data will need to be used at the same place in the code later, which I was (perhaps wrongly) referring to as locality. Keep in mind I'm not a computer scientist!

  4. Certainly

    This is an old post, but it’s nonetheless interesting.

    However, I'm unable to replicate such a performance boost (in fact, no performance boost at all) with a very similar computation loop at hand.

    I'm starting to wonder if it could be related to differences in hardware architectures since 2009. Maybe modern CPUs have become better at predicting backward access patterns? Maybe compilers have made some changes too?

    Anyway, I’m simply wondering if the conclusion is still applicable today.
