stm32 (or arm in general), usage of prefetch buffer?

Thread Starter

bug13

Joined Feb 13, 2012
1,954
Hi team

I am playing with the prefetch buffer on my stm32 today, and I have found no performance difference with it enabled or disabled (in both tests my code finished at 190 ms). I thought it could speed up instruction execution since it prefetches instructions. Did I use it wrong, or did I misunderstand it?

This is the register I am talking about:
C++:
/* enable prefetch buffer */
FLASH->ACR |= FLASH_ACR_PRFTBE;

/* disable prefetch buffer */
FLASH->ACR &= ~FLASH_ACR_PRFTBE;
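On the F1/F3-class parts the ACR also has a read-only status bit, PRFTBS, that reflects whether the buffer is actually on. A hedged sketch (assuming ST's device headers, which define FLASH_ACR_PRFTBS on these parts) to confirm the enable took effect:

```c
/* sketch, assuming the ST device header defines FLASH_ACR_PRFTBS:
 * enable the prefetch buffer and confirm via the read-only status bit */
FLASH->ACR |= FLASH_ACR_PRFTBE;
while ((FLASH->ACR & FLASH_ACR_PRFTBS) == 0) {
    /* wait for the prefetch buffer to report enabled */
}
```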
Here is how I test it:
C:
volatile uint32_t counter = 0;
volatile uint32_t ticks_start = 0;
volatile uint32_t ticks_diff = 0;

uint8_t test_data1[] = {0x9A, 0xEB, 0x05, 0x8A, 0x02, 0x8A,      /* header */
                    0x81, 0x5A, 0x80, 0x00,                     /* payload */
                    0x33, 0x30, 0x20, 0x4E,
                    0x6F, 0x76, 0x20, 0x30,
                    0x30, 0x3A, 0x30, 0x30,
                    0xFF, 0x00, 0xFF, 0xFF,
                    0x1A, 0xED};   /* packet CRC */

/* calculate CCITT CRC16 */
uint16_t calculateCRC16(uint8_t *data_ptr, int size);

void mainr(){

    for(;;){

        /* get ticks at start */
        ticks_start = HAL_GetTick();

        /* do it 1000 times */
        uint32_t dummyVal = 0;
        for(uint32_t i = 0; i < 1000; i++){
            dummyVal = dummyVal + calculateCRC16(test_data1, sizeof(test_data1));
        }
     
        /* Calculate ticks difference
         * prefetch buffer enable,  finished at 190ms
         * prefetch buffer disable, finished at 190ms
         */
        ticks_diff = HAL_GetTick() - ticks_start;
     
        /* number of iteration */
        counter++;

    }
}
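The thread never shows calculateCRC16() itself; bug13 later says it uses bit shifting. Here is a minimal bit-shifting sketch of CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF) matching the declaration above; the exact polynomial and initial value used in the original are assumptions:

```c
#include <stdint.h>

/* Hypothetical bit-shifting CRC-16/CCITT-FALSE implementation
 * (poly 0x1021, init 0xFFFF); the thread's actual parameters are unknown. */
uint16_t calculateCRC16(uint8_t *data_ptr, int size)
{
    uint16_t crc = 0xFFFF;                     /* initial value */
    for (int i = 0; i < size; i++) {
        crc ^= (uint16_t)data_ptr[i] << 8;     /* feed next byte into the top */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);  /* shift and reduce */
            else
                crc <<= 1;
        }
    }
    return crc;
}
```

A quick sanity check for this variant: the well-known check value over the ASCII string "123456789" is 0x29B1.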
 

402DF855

Joined Feb 9, 2013
271
Seems like you expect test_data1 to reside in flash when the CRC routine runs, but it may be in RAM. Check the map to see where it was placed. Declaring it const may leave it in flash, depending on your build tools and the part (which I have no knowledge of).
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
Seems like you expect test_data1 to reside in flash when the CRC routine runs, but it may be in RAM. Check the map to see where it was placed. Declaring it const may leave it in flash, depending on your build tools and the part (which I have no knowledge of).
So I have tried const uint8_t and static const uint8_t, here are the map file information:

const uint8_t
.rodata.test_data1
0x0000000008001930 0x1c Core/Src/mainUser.o
0x0000000008001930 test_data1
static const uint8_t
.rodata.test_data1
0x0000000008001930 0x1c Core/Src/mainUser.o
Not sure what the difference is, and I don't know what .rodata means (I googled, but couldn't find anything meaningful). I do understand .data. So by that logic, is .rodata read-only data in RAM?

Regardless, performance is about 191 ms (instead of the 190 ms in the earlier test), and still no difference with the prefetch buffer enabled or disabled.

I am using STMCubeIDE v1.5.x, arm-none-eabi-gcc v7.3.1
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
You may want to look into STM32 CCM (core coupled memory) in order to enhance memory access speed.
I am just trying to understand what the instruction prefetch buffer [FLASH->ACR |= FLASH_ACR_PRFTBE] does; will play with CCM next :)
 

nsaspook

Joined Aug 27, 2009
8,394
In general about cached architectures.

Usually the prefetch buffer is the last thing checked during an instruction or data miss if there are I-cache and/or D-cache buffers also in play. If the program loop is cached then there won't be additional prefetch buffer accesses unless there is a miss outside of the cached I/D range due to a branch or thread change of execution. It all depends on how the entire pipeline is designed.
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
In general about cached architectures.

Usually the prefetch buffer is the last thing checked during an instruction or data miss if there are I-cache and/or D-cache buffers also in play. If the program loop is cached then there won't be additional prefetch buffer accesses unless there is a miss outside of the cached I/D range due to a branch or thread change of execution. It all depends on how the entire pipeline is designed.
So does it mean I can't really write code to test the performance until I dig down into the details of how ARM implemented it? I guess I will always enable it then.

But why give you the option to disable it if it never needs to be disabled? It doesn't make sense.
 

402DF855

Joined Feb 9, 2013
271
.rodata is read only, so if the array is declared const, that makes sense. It'd be useful to know if 0x0000000008001930 maps to flash or RAM. Even for const data some architectures might copy the const data out of flash and into RAM. If the CRC is computed in RAM then timing would likely be impacted by data cache performance, and your flash prefetch setting wouldn't be a factor.

You might be able to replace the array address with a hard coded location in flash. You probably don't care what values are CRCed, usually the time to compute is a factor of length not content. I'd use a large array length to make flash access time enough of a factor to impact total computation time.
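The "hard coded location in flash" idea can be sketched like this (hypothetical: 0x08000000 is where STM32 flash is mapped, and the 1 KiB length is arbitrary):

```c
/* hypothetical sketch: CRC data read straight from flash, so the
 * data fetches definitely hit flash rather than RAM */
const uint8_t *flash_base = (const uint8_t *)0x08000000;    /* STM32 flash base */
uint16_t crc = calculateCRC16((uint8_t *)flash_base, 1024); /* arbitrary 1 KiB */
```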
 

mckenney

Joined Nov 10, 2018
116
So does it mean I can't really write code to test the performance until I dig down to the details of how ARM implemented? I guess I will always enable it then.

But why give you the option to disable it if it doesn't need to be disable? It doesn't make sense.
You should probably say what STM32 model you're using. The Prefetch mechanism (whatever they're calling it now) has gone through considerable evolution over the years. I haven't yet encountered a case where I wanted to turn (leave) it off, though I suppose such a case exists. I have this vague idea that in later series (H7?) it's always-on.

1) The Buffer (really any cache) depends on locality of reference. -Osize might help you here.
2) In earlier implementations of the Prefetch the Buffer was pretty small.
3) I think that in earlier implementations the Prefetch was connected only to the I-Bus, not the D-bus. In this case you would want your data in RAM (not "const") and global so the copy happens during C initialization, before you start measuring.
4) As you speed up your CPU, the Prefetch will come up against Flash wait-states. I suggest you do your measurements in CPU clocks, rather than time. The crossover point will be visible.
5) If your MCU has I/D-caches, turn (leave) them off, and measure their effects separately.
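Point 4's "measure in CPU clocks" can be done with the DWT cycle counter on a Cortex-M4 such as the F303. A sketch assuming the CMSIS core headers:

```c
/* sketch: cycle-accurate timing via the DWT cycle counter (CMSIS names) */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the DWT block */
DWT->CYCCNT = 0;
DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start the cycle counter */

uint32_t start = DWT->CYCCNT;
/* ... code under test ... */
uint32_t cycles = DWT->CYCCNT - start;   /* unsigned subtraction handles wrap */
```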
 
I haven't yet encountered a case where I wanted to turn (leave) it off, though I suppose such a case exists
If timing needed to be rock solid and consistent.

When I think about pre-fetch and not necessarily ARM, I think of the instruction set, memory and pre-fetching. The idea here is that most code runs linearly with few branches. So, grabbing the next "anticipated" instruction makes sense.
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
.rodata is read only, so if the array is declared const, that makes sense. It'd be useful to know if 0x0000000008001930 maps to flash or RAM. Even for const data some architectures might copy the const data out of flash and into RAM.
I should have checked the address; I don't usually look at map files, so my brain didn't click when you asked me to check it. I think .rodata is in flash. According to the linker script:
Code:
  /* Constant data into "FLASH" Rom type memory */
  .rodata :
  {
    . = ALIGN(4);
    *(.rodata)         /* .rodata sections (constants, strings, etc.) */
    *(.rodata*)        /* .rodata* sections (constants, strings, etc.) */
    . = ALIGN(4);
  } >FLASH
You should probably say what STM32 model you're using.
It's a stm32f303k8, so it's not a H7 or some high end one you are thinking about.


Anyway, here is my latest test; it still shows no difference with the prefetch buffer enabled or disabled. I upped the clock to 64 MHz (was 32 MHz). If I have done my code correctly, the data should be in CCMRAM, RAM and FLASH.

I calculated the same data 25,000 times (was 1,000); the unit is ms.

On a side note, there's not much difference between CCMRAM and RAM; is that to be expected?

Screenshot 2020-12-18 075822-result.png
 

mckenney

Joined Nov 10, 2018
116
It's a stm32f303k8, so it's not a H7 or some high end one you are thinking about.
Per reference manual (RM0316, Rev 6) Sec 4.2.2, prefetch only happens for instructions (ICode bus).

Also I was reminded that PRFTBE is initially set (=1). Do you explicitly set/clear it before each trial?

Also, I wonder how calculateCRC16() works. Is it table driven (const .data again) or does it use bit-shifting?
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
Per reference manual (RM0316, Rev 6) Sec 4.2.2, prefetch only happens for instructions (ICode bus).

Also I was reminded that PRFTBE is initially set (=1). Do you explicitly set/clear it before each trial?

Also, I wonder how calculateCRC16() works. Is it table driven (const .data again) or does it use bit-shifting?
Yes, silly me. Should have read the datasheet more carefully. I removed the code to enable the instruction prefetch buffer, but didn't explicitly clear the bit. Now the test works better.

Here is the new test, running at 64Mhz
Screenshot 2020-12-18 120510-com.png

PS:
my calculateCRC16() uses bit shifting.
 

BobaMosfet

Joined Jul 1, 2009
1,780
Prefetching is not about variables; it's about instructions and pipelining. In short, the concept is to read instructions in a non-linear way. Intel originally came up with this idea because they wanted to read instructions out of order, so they could analyze code before execution and predetermine how best to execute it (in parallel, in different orders, etc.) to run the overall program more efficiently and quickly.

Prefetching isn't guaranteed to make things faster, particularly in small code bases, depending on what logic the code is executing. And again, prefetch is for instructions; that's why prefetch caches are relatively small.

Sadly, prefetching meant the end of self-modifying code on the fly in large ways.
 