stm32 (or arm in general), usage of prefetch buffer?

Thread Starter

bug13

Joined Feb 13, 2012
1,954
Hi team

I am playing with the prefetch buffer on my stm32 today, and I have found no performance difference with it enabled or disabled (in both tests my code finished at 190 ms). I thought it could speed up instruction execution since it prefetches instructions. Did I use it wrong, or did I misunderstand it?

This is the register I am talking about:
C++:
/* enable prefetch buffer */
FLASH->ACR |= FLASH_ACR_PRFTBE;

/* disable prefetch buffer */
FLASH->ACR &= ~FLASH_ACR_PRFTBE;
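On the F1/F3-class parts the ACR also has a read-only status bit, PRFTBS, that reflects whether the buffer is actually on. A hedged sketch (assuming ST's device headers, which define FLASH_ACR_PRFTBS on these parts) to confirm the enable took effect:

```c
/* sketch, assuming the ST device header defines FLASH_ACR_PRFTBS:
 * enable the prefetch buffer and confirm via the read-only status bit */
FLASH->ACR |= FLASH_ACR_PRFTBE;
while ((FLASH->ACR & FLASH_ACR_PRFTBS) == 0) {
    /* wait for the prefetch buffer to report enabled */
}
```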
Here is how I test it:
C:
volatile uint32_t counter = 0;
volatile uint32_t ticks_start = 0;
volatile uint32_t ticks_diff = 0;

uint8_t test_data1[] = {0x9A, 0xEB, 0x05, 0x8A, 0x02, 0x8A,      /* header */
                    0x81, 0x5A, 0x80, 0x00,                     /* payload */
                    0x33, 0x30, 0x20, 0x4E,
                    0x6F, 0x76, 0x20, 0x30,
                    0x30, 0x3A, 0x30, 0x30,
                    0xFF, 0x00, 0xFF, 0xFF,
                    0x1A, 0xED};   /* packet CRC */

/* calculate CCITT CRC16 */
uint16_t calculateCRC16(uint8_t *data_ptr, int size);

void mainr(){

    for(;;){

        /* get ticks at start */
        ticks_start = HAL_GetTick();

        /* do it 1000 times */
        uint32_t dummyVal = 0;
        for(uint32_t i = 0; i < 1000; i++){
            dummyVal = dummyVal + calculateCRC16(test_data1, sizeof(test_data1));
        }
     
        /* Calculate ticks difference
         * prefetch buffer enable,  finished at 190ms
         * prefetch buffer disable, finished at 190ms
         */
        ticks_diff = HAL_GetTick() - ticks_start;
     
        /* number of iteration */
        counter++;

    }
}
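The thread never shows calculateCRC16() itself; bug13 later says it uses bit shifting. Here is a minimal bit-shifting sketch of CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF) matching the declaration above; the exact polynomial and initial value used in the original are assumptions:

```c
#include <stdint.h>

/* Hypothetical bit-shifting CRC-16/CCITT-FALSE implementation
 * (poly 0x1021, init 0xFFFF); the thread's actual parameters are unknown. */
uint16_t calculateCRC16(uint8_t *data_ptr, int size)
{
    uint16_t crc = 0xFFFF;                     /* initial value */
    for (int i = 0; i < size; i++) {
        crc ^= (uint16_t)data_ptr[i] << 8;     /* feed next byte into the top */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);  /* shift and reduce */
            else
                crc <<= 1;
        }
    }
    return crc;
}
```

A quick sanity check for this variant: the well-known check value over the ASCII string "123456789" is 0x29B1.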
 

402DF855

Joined Feb 9, 2013
271
Seems like you expect test_data1 to reside in flash when the CRC routine runs, but it may be in RAM. Check the map to see where it was placed. Declaring it const may leave it in flash, depending on your build tools and the part (which I have no knowledge of).
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
Seems like you expect test_data1 to reside in flash when the CRC routine runs, but it may be in RAM. Check the map to see where it was placed. Declaring it const may leave it in flash, depending on your build tools and the part (which I have no knowledge of).
So I have tried const uint8_t and static const uint8_t, here are the map file information:

const uint8_t
.rodata.test_data1
0x0000000008001930 0x1c Core/Src/mainUser.o
0x0000000008001930 test_data1
static const uint8_t
.rodata.test_data1
0x0000000008001930 0x1c Core/Src/mainUser.o
Not sure what the difference is, and I don't know what .rodata means (I googled, but couldn't find anything meaningful). I do understand .data. So by that logic, is .rodata read-only data in RAM?

Regardless, performance is about 191 ms (instead of the 190 ms in the earlier test), and still no difference with the prefetch buffer enabled or disabled.

I am using STMCubeIDE v1.5.x, arm-none-eabi-gcc v7.3.1
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
You may want to look into STM32 CCM (core coupled memory) in order to enhance memory access speed.
I am just trying to understand what the instruction prefetch buffer [FLASH->ACR |= FLASH_ACR_PRFTBE] does; will play with CCM next :)
 

nsaspook

Joined Aug 27, 2009
8,394
In general about cached architectures.

Usually the prefetch buffer is the last thing checked during an instruction or data miss if there are I-cache and/or D-cache buffers also in play. If the program loop is cached then there won't be additional prefetch buffer accesses unless there is a miss outside of the cached I/D range due to a branch or thread change of execution. It all depends on how the entire pipeline is designed.
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
In general about cached architectures.

Usually the prefetch buffer is the last thing checked during an instruction or data miss if there are I-cache and/or D-cache buffers also in play. If the program loop is cached then there won't be additional prefetch buffer accesses unless there is a miss outside of the cached I/D range due to a branch or thread change of execution. It all depends on how the entire pipeline is designed.
So does it mean I can't really write code to test the performance until I dig down into the details of how ARM implemented it? I guess I will always enable it then.

But why give you the option to disable it if it never needs to be disabled? It doesn't make sense.
 

402DF855

Joined Feb 9, 2013
271
.rodata is read only, so if the array is declared const, that makes sense. It'd be useful to know if 0x0000000008001930 maps to flash or RAM. Even for const data some architectures might copy the const data out of flash and into RAM. If the CRC is computed in RAM then timing would likely be impacted by data cache performance, and your flash prefetch setting wouldn't be a factor.

You might be able to replace the array address with a hard coded location in flash. You probably don't care what values are CRCed, usually the time to compute is a factor of length not content. I'd use a large array length to make flash access time enough of a factor to impact total computation time.
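The "hard coded location in flash" idea can be sketched like this (hypothetical: 0x08000000 is where STM32 flash is mapped, and the 1 KiB length is arbitrary):

```c
/* hypothetical sketch: CRC data read straight from flash, so the
 * data fetches definitely hit flash rather than RAM */
const uint8_t *flash_base = (const uint8_t *)0x08000000;    /* STM32 flash base */
uint16_t crc = calculateCRC16((uint8_t *)flash_base, 1024); /* arbitrary 1 KiB */
```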
 

mckenney

Joined Nov 10, 2018
116
So does it mean I can't really write code to test the performance until I dig down to the details of how ARM implemented? I guess I will always enable it then.

But why give you the option to disable it if it doesn't need to be disable? It doesn't make sense.
You should probably say what STM32 model you're using. The Prefetch mechanism (whatever they're calling it now) has gone through considerable evolution over the years. I haven't yet encountered a case where I wanted to turn (leave) it off, though I suppose such a case exists. I have this vague idea that in later series (H7?) it's always-on.

1) The Buffer (really any cache) depends on locality of reference. -Osize might help you here.
2) In earlier implementations of the Prefetch the Buffer was pretty small.
3) I think that in earlier implementations the Prefetch was connected only to the I-Bus, not the D-bus. In this case you would want your data in RAM (not "const") and global so the copy happens during C initialization, before you start measuring.
4) As you speed up your CPU, the Prefetch will come up against Flash wait-states. I suggest you do your measurements in CPU clocks, rather than time. The crossover point will be visible.
5) If your MCU has I/D-caches, turn (leave) them off, and measure their effects separately.
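Point 4's "measure in CPU clocks" can be done with the DWT cycle counter on a Cortex-M4 such as the F303. A sketch assuming the CMSIS core headers:

```c
/* sketch: cycle-accurate timing via the DWT cycle counter (CMSIS names) */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the DWT block */
DWT->CYCCNT = 0;
DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start the cycle counter */

uint32_t start = DWT->CYCCNT;
/* ... code under test ... */
uint32_t cycles = DWT->CYCCNT - start;   /* unsigned subtraction handles wrap */
```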
 
I haven't yet encountered a case where I wanted to turn (leave) it off, though I suppose such a case exists
If timing needed to be rock solid and consistent.

When I think about pre-fetch and not necessarily ARM, I think of the instruction set, memory and pre-fetching. The idea here is that most code runs linearly with few branches. So, grabbing the next "anticipated" instruction makes sense.
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
.rodata is read only, so if the array is declared const, that makes sense. It'd be useful to know if 0x0000000008001930 maps to flash or RAM. Even for const data some architectures might copy the const data out of flash and into RAM.
I should have checked the address; I don't usually look at map files, so my brain didn't click when you asked me to check it. I think .rodata is in flash. According to the linker script:
Code:
  /* Constant data into "FLASH" Rom type memory */
  .rodata :
  {
    . = ALIGN(4);
    *(.rodata)         /* .rodata sections (constants, strings, etc.) */
    *(.rodata*)        /* .rodata* sections (constants, strings, etc.) */
    . = ALIGN(4);
  } >FLASH
You should probably say what STM32 model you're using.
It's a stm32f303k8, so it's not a H7 or some high end one you are thinking about.


Anyway, here is my latest test; it still shows no difference with the prefetch buffer enabled or disabled. I upped the clock to 64 MHz (was 32 MHz). If I have done my code correctly, the data should be in CCMRAM, RAM and FLASH.

I calculated the same data 25,000 times (was 1,000); the unit is ms.

On a side note, there's not much difference between CCMRAM and RAM; is that to be expected?

Screenshot 2020-12-18 075822-result.png
 

mckenney

Joined Nov 10, 2018
116
It's a stm32f303k8, so it's not a H7 or some high end one you are thinking about.
Per reference manual (RM0316, Rev 6) Sec 4.2.2, prefetch only happens for instructions (ICode bus).

Also I was reminded that PRFTBE is initially set (=1). Do you explicitly set/clear it before each trial?

Also, I wonder how calculateCRC16() works. Is it table driven (const .data again) or does it use bit-shifting?
 

Thread Starter

bug13

Joined Feb 13, 2012
1,954
Per reference manual (RM0316, Rev 6) Sec 4.2.2, prefetch only happens for instructions (ICode bus).

Also I was reminded that PRFTBE is initially set (=1). Do you explicitly set/clear it before each trial?

Also, I wonder how calculateCRC16() works. Is it table driven (const .data again) or does it use bit-shifting?
Yes, silly me. Should have read the datasheet more carefully. I removed the code to enable the instruction prefetch buffer, but didn't explicitly clear the bit. Now the test works better.

Here is the new test, running at 64Mhz
Screenshot 2020-12-18 120510-com.png

PS:
my calculateCRC16() uses bit shifting.
 

BobaMosfet

Joined Jul 1, 2009
1,780
Prefetching is not about variables; it's about instructions and pipelining. In short, the concept is to read instructions in a non-linear way. Intel originally came up with this idea because they wanted to read instructions out of order, so they could analyze code before execution and predetermine how best to execute it (in parallel, in different orders, etc.) to run the overall program more efficiently and quickly.

Prefetching isn't guaranteed to make things faster, particularly in small code bases, depending on what logic the code is executing. And again, prefetch is for instructions; that's why prefetch caches are relatively small.

Sadly, prefetching meant the end of self-modifying code on the fly in large ways.
 