Cortex-M cache coherence using ChibiOS/HAL

Modern microcontrollers are becoming increasingly complex. Recent Cortex-M cores can be equipped with a data cache because of their increasing core frequencies. Unfortunately, cache coherence is not handled in hardware when multiple bus masters are present, so software must take care of it.

Examples of additional bus masters:

  • DMA engines.
  • Multiple cores.
  • Other kinds of bus masters.

Cache Organization

There are several parameters to consider for cache memories.

Cache Total Size

It is the amount of cache RAM. A bigger cache generally gives better performance. This parameter does not affect SW cache handling.

Cache Line Size

It is the smallest amount of cache RAM that can be mapped onto a range of physical addresses, and it is always a power of two. On Cortex-M devices the cache line size is always 32 bytes. This information is important for software handling.

Cache Associativity

Caches have a number of “ways”, so we can have 2-way caches, 4-way caches and so on. More ways mean that the cache can be associated more flexibly with physical memory, so higher associativity makes for more efficient caches. This parameter does not affect SW cache handling.

Cache Type

There are two kinds of cache memories, which need slightly different handling.

  • Write Through (WT): only read accesses are cached, write operations are immediately performed on both the cache and the underlying RAM. This is not the best for performance but it is simple to implement and easier to handle.
  • Write Back (WB): both read and write operations are cached, so the cache can contain “dirty” lines, that is, CPU-written data that has not yet been written to the underlying RAM. The write operations are delayed and their timing is not predictable. This gives the best performance because multiple writes can hit the same cache line before the (slow) write to RAM is performed.

The issue

Essentially there are two variants of the problem:

  1. Bus masters reading from a cached RAM.
  2. Bus masters writing to a cached RAM.

The problem is that, because of the cache, the CPU and other bus masters could “see” different data at the same address. Accesses must be synchronized in order to enforce coherence between bus masters.

Possible solutions

  1. Place DMA-accessible buffers in non-cached RAM. On the STM32F7, for example, the TCM memory is DMA-accessible and not cached. It can be conveniently used for DMA buffers without constraints.
  2. Make some of the RAM non-cached by using the MPU. The Cortex-M MPU allows memory attributes, including cacheability, to be enforced for defined regions.
  3. Handle cache coherence in software; this is what we will discuss in this article.
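
As a rough illustration of the first two options, the sketch below places a buffer in a dedicated linker section using the GCC `section` attribute. The section name `.nocache` and the buffer name `nocache_rxbuf` are hypothetical: the name must match an output section that your linker script maps to non-cached memory (for example the TCM on the STM32F7) or to a region that the MPU marks as non-cacheable.

```c
#include <stdint.h>

/* Hypothetical section name: it must match a section that the linker
   script places in non-cached RAM (TCM or an MPU non-cacheable region). */
#define NOCACHE_SECTION ".nocache"

/* Buffer placed in the non-cached section. Under that assumption no flush
   or invalidate operation is needed when a DMA engine accesses it. */
__attribute__((section(NOCACHE_SECTION), aligned(4)))
uint8_t nocache_rxbuf[64];
```

The attribute alone does not make the memory non-cached, it only decides where the buffer is placed; the memory attributes still come from the memory map or the MPU setup.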

Coherence Operations

Let's define two kinds of cache operations:

  • Cache Invalidate invalidates the cache entries associated with a range of addresses. Any unwritten data in WB caches is lost and not written to RAM.
  • Cache Flush makes sure that any unwritten cached data is written to the underlying RAM. Write Through caches do not need this operation because there is no unwritten data to be handled.

Alignment Issues

Note that, because cache memories are organized in lines, invalidating or flushing a memory area can also affect adjacent locations. For example, if the cache line size is 32 (0x20) then invalidating the cache between addresses 0x00001003 and 0x00001047 would cause invalidation of all addresses between 0x00001000 and 0x0000105F. Invalidating a buffer can therefore involuntarily invalidate adjacent variables, causing hard-to-debug software errors.
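
The arithmetic in this example can be checked with a small sketch. The helper names `line_first` and `line_last` are hypothetical and not part of ChibiOS; the line size is assumed to be a power of two so that masking works.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of ChibiOS. */

/* First address of the cache line containing addr. */
uint32_t line_first(uint32_t addr, uint32_t line_size) {
  /* The mask trick below is only valid for power-of-two line sizes. */
  assert((line_size != 0U) && ((line_size & (line_size - 1U)) == 0U));
  return addr & ~(line_size - 1U);
}

/* Last address of the cache line containing addr. */
uint32_t line_last(uint32_t addr, uint32_t line_size) {
  return line_first(addr, line_size) + (line_size - 1U);
}
```

With a line size of 32, `line_first(0x00001003, 32)` gives 0x00001000 and `line_last(0x00001047, 32)` gives 0x0000105F, which is the range quoted above.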

Because of this, buffers accessible by multiple bus masters must always be aligned to the cache line size: both the start address and the buffer size must be multiples of it.
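
This requirement can be written as a simple predicate, useful for example in a debug assertion. It is only a sketch: `EXAMPLE_LINE_SIZE`, `is_cache_safe` and `example_buf` are hypothetical names, and the 32-byte line size is an assumption taken from the text above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_LINE_SIZE 32U   /* assumed cache line size in bytes */

/* Returns true if both the start address and the size of the buffer are
   multiples of the cache line size. */
bool is_cache_safe(const void *p, size_t size) {
  return (((uintptr_t)p % EXAMPLE_LINE_SIZE) == 0U) &&
         ((size % EXAMPLE_LINE_SIZE) == 0U);
}

/* A buffer that satisfies both conditions: aligned start, 2 lines long. */
_Alignas(EXAMPLE_LINE_SIZE) uint8_t example_buf[2U * EXAMPLE_LINE_SIZE];
```

A buffer that starts on a line boundary but is only 36 bytes long fails the check, because its last line would be shared with whatever the linker places after it.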

ChibiOS provides address alignment macros in the compiler abstraction module and cache handling functions in all Cortex-M ports.

Declaring DMA Buffers

This is an example of DMA buffer declarations.

#include "hal.h"
#include "cc_portab.h"
 
#define BUFFERS_SIZE 36
 
CC_ALIGN(CACHE_LINE_SIZE) static uint8_t txbuf[CACHE_SIZE_ALIGN(uint8_t, BUFFERS_SIZE)];
CC_ALIGN(CACHE_LINE_SIZE) static uint8_t rxbuf[CACHE_SIZE_ALIGN(uint8_t, BUFFERS_SIZE)];

The two declared buffers are guaranteed to start at a cache line boundary and to have a size of at least 36 bytes, rounded up to a whole number of cache lines. Note that if the device has no cache then the macros do nothing, so it is possible to write portable code.
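
As a worked check of the size rounding, the sketch below reproduces a round-up-to-a-line-multiple computation in plain C. It is not the ChibiOS macro itself: `EXAMPLE_LINE_BYTES` and `round_up_to_line` are hypothetical names, and the 32-byte line size is an assumption.

```c
#include <stddef.h>

#define EXAMPLE_LINE_BYTES 32U   /* assumed cache line size in bytes */

/* Rounds a size in bytes up to the next multiple of the line size. */
size_t round_up_to_line(size_t n) {
  return ((n + (EXAMPLE_LINE_BYTES - 1U)) / EXAMPLE_LINE_BYTES)
         * EXAMPLE_LINE_BYTES;
}
```

Under these assumptions a 36-byte request occupies two full lines, so `round_up_to_line(36)` is 64, while a request that is already a multiple of the line size is left unchanged.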

Operations on Buffers

The following code shows how to flush/invalidate buffers.

Buffer Invalidation

This is an example of cache invalidation before a DMA engine writes into a buffer.

  /* Invalidating the buffer before letting the DMA write in it.*/
  cacheBufferInvalidate(rxbuf, BUFFERS_SIZE);
 
  /* Receiving data from SPI using DMA.*/
  spiReceive(&SPID2, BUFFERS_SIZE, rxbuf);

Note that there is a reason why the invalidation is performed before the DMA operation and not after. It might seem to make sense to invalidate the cache after the DMA operation, but consider this scenario:

  • The cache (WB type) has unwritten data spanning the buffer; the write to RAM can happen at any moment.
  • The DMA starts writing in the buffer.
  • The cache write operation happens and overwrites some DMA-written data.
  • The DMA operation ends.
  • The CPU invalidates the cache over the buffer, but it is too late.
  • The CPU reads corrupted data from the buffer.

Invalidating the cache over the buffer before the operation ensures that the above scenario cannot happen.
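
The scenario can be replayed on a host PC with a toy model of a single write-back cache line. This is only an illustrative sketch, not ChibiOS code: all the names (`toy_t`, `cpu_write`, `cache_evict` and so on) are hypothetical, and the eviction is triggered by hand at the worst possible moment.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_LINE 32U

/* Toy model: one line of RAM and the single cache line mapped over it. */
typedef struct {
  uint8_t ram[TOY_LINE];    /* what the DMA engine sees                 */
  uint8_t line[TOY_LINE];   /* what the CPU sees while the line is valid */
  bool    valid;
  bool    dirty;            /* line holds data not yet written to RAM   */
} toy_t;

/* Loads the line from RAM if it is not already in the cache. */
static void toy_fill(toy_t *t) {
  if (!t->valid) {
    memcpy(t->line, t->ram, TOY_LINE);
    t->valid = true;
    t->dirty = false;
  }
}

void cpu_write(toy_t *t, unsigned i, uint8_t v) {
  toy_fill(t);
  t->line[i] = v;
  t->dirty = true;          /* write-back: RAM is not updated yet */
}

uint8_t cpu_read(toy_t *t, unsigned i) {
  toy_fill(t);
  return t->line[i];
}

/* Hardware eviction: a dirty line is written back at an arbitrary time. */
void cache_evict(toy_t *t) {
  if (t->valid && t->dirty) {
    memcpy(t->ram, t->line, TOY_LINE);
  }
  t->valid = false;
  t->dirty = false;
}

/* Invalidate: the line is dropped, unwritten data is lost. */
void cache_invalidate(toy_t *t) {
  t->valid = false;
  t->dirty = false;
}

/* The DMA engine writes straight into RAM, bypassing the cache. */
void dma_write(toy_t *t, uint8_t v) {
  memset(t->ram, v, TOY_LINE);
}

/* Invalidate AFTER the DMA transfer: returns what the CPU reads. */
uint8_t run_invalidate_after(void) {
  toy_t t;
  memset(&t, 0, sizeof t);
  cpu_write(&t, 0U, 0x11U);   /* old CPU data leaves the line dirty   */
  dma_write(&t, 0xAAU);       /* DMA fills the buffer in RAM          */
  cache_evict(&t);            /* write-back overwrites the DMA data   */
  cache_invalidate(&t);       /* too late                             */
  return cpu_read(&t, 0U);
}

/* Invalidate BEFORE the DMA transfer: returns what the CPU reads. */
uint8_t run_invalidate_before(void) {
  toy_t t;
  memset(&t, 0, sizeof t);
  cpu_write(&t, 0U, 0x11U);
  cache_invalidate(&t);       /* dirty data is discarded up front     */
  dma_write(&t, 0xAAU);
  cache_evict(&t);            /* nothing dirty, RAM is left untouched */
  return cpu_read(&t, 0U);
}
```

In this model `run_invalidate_after()` returns the stale 0x11 while `run_invalidate_before()` returns the DMA-written 0xAA.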

Buffer Flushing

Before letting a bus master read data from a buffer we need to make sure that the data has actually reached the RAM. Flushing is only required for WB caches.

  /* Flushing cache before letting DMA read from the buffer.*/
  cacheBufferFlush(txbuf, BUFFERS_SIZE);
 
  /* Sending data to SPI using DMA.*/
  spiSend(&SPID2, BUFFERS_SIZE, txbuf);
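
The effect of a missing flush can be shown with a minimal host-side model of one byte behind a write-back cache. As before this is a hedged sketch with hypothetical names (`wb_byte_t`, `wb_cpu_write`, `wb_flush`, `wb_dma_read`), not ChibiOS code.

```c
#include <stdbool.h>
#include <stdint.h>

/* One byte of memory as seen through a write-back cache. */
typedef struct {
  uint8_t ram;      /* value stored in RAM, this is what the DMA reads */
  uint8_t cached;   /* value held in the cache, what the CPU sees      */
  bool    dirty;    /* cached value has not been written to RAM yet    */
} wb_byte_t;

/* CPU write: only the cache is updated, the RAM write is postponed. */
void wb_cpu_write(wb_byte_t *b, uint8_t v) {
  b->cached = v;
  b->dirty  = true;
}

/* Flush: any unwritten cached data is written to RAM. */
void wb_flush(wb_byte_t *b) {
  if (b->dirty) {
    b->ram   = b->cached;
    b->dirty = false;
  }
}

/* The DMA engine reads RAM directly and never looks at the cache. */
uint8_t wb_dma_read(const wb_byte_t *b) {
  return b->ram;
}

/* Returns the value transmitted when the flush is forgotten. */
uint8_t send_without_flush(void) {
  wb_byte_t b = { 0x00U, 0x00U, false };
  wb_cpu_write(&b, 0x5AU);
  return wb_dma_read(&b);     /* stale RAM content */
}

/* Returns the value transmitted when the buffer is flushed first. */
uint8_t send_with_flush(void) {
  wb_byte_t b = { 0x00U, 0x00U, false };
  wb_cpu_write(&b, 0x5AU);
  wb_flush(&b);
  return wb_dma_read(&b);     /* the data written by the CPU */
}
```

Here `send_without_flush()` transmits the old 0x00 while `send_with_flush()` transmits 0x5A, the value the CPU actually wrote.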