While porting the ChibiOS HAL to the new STM32F7xx, cache coherency issues inevitably popped up. Unfortunately, the DMAs do not update or invalidate the cache in hardware, so the burden of coherency falls on software developers.
The issue has two aspects: DMA engines reading from RAM and DMA engines writing to RAM.
The data cache in Cortex-M7 devices uses a write-back mechanism: data written by the CPU to RAM does not necessarily reach the RAM immediately but can remain parked in the cache indefinitely. As a result, a DMA engine can read from RAM data that is not an exact copy of what the CPU wrote.
On the other hand, data written by DMA engines to RAM does not invalidate the corresponding cache lines, so the CPU could read cache content that is no longer an exact copy of the data in RAM.
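The two failure modes can be reproduced with a toy model that runs on any host. This is only an illustrative sketch, not how the Cortex-M7 cache is implemented: it assumes a single 32-byte line, and every name in it (`toy_mem_t`, `cpu_write`, `dma_read`, and so on) is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32U            /* Cortex-M7 data cache line size in bytes */

/* Toy model: one line of RAM plus one cache line that shadows it. */
typedef struct {
  uint8_t ram[LINE_SIZE];        /* what a DMA engine sees                 */
  uint8_t cache[LINE_SIZE];      /* what the CPU sees when the line is valid */
  bool    valid;                 /* cache line holds a copy of RAM         */
  bool    dirty;                 /* cache line is newer than RAM           */
} toy_mem_t;

static void toy_init(toy_mem_t *m) {
  memset(m, 0, sizeof *m);
}

/* Fills the cache line from RAM on a miss. */
static void toy_fill(toy_mem_t *m) {
  if (!m->valid) {
    memcpy(m->cache, m->ram, LINE_SIZE);
    m->valid = true;
  }
}

/* CPU write with write-back policy: the data stays in the cache line. */
static void cpu_write(toy_mem_t *m, unsigned i, uint8_t v) {
  toy_fill(m);
  m->cache[i] = v;
  m->dirty = true;
}

/* CPU read: served from the cache whenever the line is valid. */
static uint8_t cpu_read(toy_mem_t *m, unsigned i) {
  toy_fill(m);
  return m->cache[i];
}

/* DMA accesses bypass the cache entirely. */
static uint8_t dma_read(const toy_mem_t *m, unsigned i) { return m->ram[i]; }
static void dma_write(toy_mem_t *m, unsigned i, uint8_t v) { m->ram[i] = v; }

/* Flush (clean): write a dirty line back to RAM. */
static void cache_flush(toy_mem_t *m) {
  if (m->valid && m->dirty) {
    memcpy(m->ram, m->cache, LINE_SIZE);
    m->dirty = false;
  }
}

/* Invalidate: drop the cached copy so the next CPU read refetches RAM. */
static void cache_invalidate(toy_mem_t *m) {
  m->valid = false;
  m->dirty = false;
}
```

In this model a `dma_read()` issued right after a `cpu_write()` still returns the old RAM content until `cache_flush()` is called, and a `cpu_read()` issued after a `dma_write()` returns the old cached value until `cache_invalidate()` is called.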
This is a list of solutions we considered but discarded for various reasons.
Disabling the data cache over the whole RAM array would resolve all problems. The rationale for trying this is that the STM32F7xx Reference Manual states that the RAM is accessible without wait states. Zero wait states would mean that caching RAM is unnecessary. Unfortunately this is not true: the device offers zero-wait-state-like performance only when the data cache is enabled. Disabling the data cache reduces the device performance to about 1/3 of its potential.
We saw this solution in some of ST's STM32Cube-F7 demos. Putting the cache in write-through mode is, unfortunately, an incomplete solution: it fixes the problem for DMA transmission buffers but does nothing for DMA receive buffers. In addition, it reduces system performance by about 10% to 20% because this is a less efficient caching mode. This solution also requires the use of the MPU, which adds extra complexity.
The following solutions can be adopted for efficient handling.
Probably this is the most efficient solution: dedicate a portion of RAM to DMA buffers and make it non-cacheable using the MPU, or place the buffers in DTCM RAM, which is never cached.
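As a rough sketch of the buffer side of this approach, the declarations below place DMA buffers in a dedicated section using a GCC attribute. The section name `.nocache` and all the `dma_`/`DMA_` names are hypothetical. The sketch assumes the linker script maps that section to DTCM RAM or to a region the MPU marks as non-cacheable, and that the startup code initializes it; neither part is shown here.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_BUFFER_SIZE 128U

/* ".nocache" is a hypothetical section name. The linker script is assumed
   to map it to DTCM RAM or to an MPU region configured as non-cacheable. */
#if defined(__GNUC__) && defined(__ELF__)
#define DMA_NOCACHE __attribute__((section(".nocache")))
#else
#define DMA_NOCACHE
#endif

static DMA_NOCACHE uint8_t dma_txbuf[DMA_BUFFER_SIZE];
static DMA_NOCACHE uint8_t dma_rxbuf[DMA_BUFFER_SIZE];

/* Fills the transmit buffer. No flush is needed because CPU writes to
   uncached memory reach the RAM directly. */
static uint8_t *dma_prepare_tx(void) {
  for (size_t i = 0; i < DMA_BUFFER_SIZE; i++) {
    dma_txbuf[i] = (uint8_t)i;
  }
  return dma_txbuf;
}

/* Returns the receive buffer. No invalidate is needed before reading it
   because CPU reads from uncached memory always come from the RAM. */
static uint8_t *dma_get_rx(void) {
  return dma_rxbuf;
}
```

With buffers placed this way the application code needs no cache maintenance calls at all; the cost is that the placement has to be kept in sync with the linker script and the MPU setup.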
This solution simply requires the application to handle invalidating and/or flushing the cache over DMA buffers. The HAL offers two functions, dmaBufferFlush() and dmaBufferInvalidate(), that make it easy to secure buffers for use with the DMA.
Buffer declarations. Note that the buffers must be aligned to a cache line boundary (32 bytes on the Cortex-M7).
#define SPI_BUFFERS_SIZE    128U

#if defined(__GNUC__)
__attribute__((aligned (32)))
#endif
static uint8_t txbuf[SPI_BUFFERS_SIZE];

#if defined(__GNUC__)
__attribute__((aligned (32)))
#endif
static uint8_t rxbuf[SPI_BUFFERS_SIZE];
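The alignment matters because cache maintenance works on whole 32-byte lines: if a buffer shares a line with an unrelated variable, invalidating the buffer can also discard a pending CPU write to that variable. It is therefore safest for a DMA buffer to start on a line boundary and to occupy a whole number of lines, as the 128-byte buffers above do. A minimal sketch of such a check follows; the helper names are hypothetical and not part of the HAL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32U      /* Cortex-M7 data cache line size in bytes */

/* Rounds a size up to a whole number of cache lines. */
#define CACHE_SIZE_ALIGN(n) \
  ((((n) + CACHE_LINE_SIZE - 1U) / CACHE_LINE_SIZE) * CACHE_LINE_SIZE)

/* Returns true if the buffer starts on a cache line boundary and covers
   whole lines only, so it shares no cache line with other data. */
static bool buffer_is_cache_safe(const void *buf, size_t size) {
  return (((uintptr_t)buf % CACHE_LINE_SIZE) == 0U) &&
         ((size % CACHE_LINE_SIZE) == 0U);
}
```

A check like this could be placed in a debug assertion before starting a transfer, and `CACHE_SIZE_ALIGN()` could be used to size buffers whose natural length is not a multiple of 32.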
The following code exchanges data over the SPI using a transmission buffer and a receive buffer. MISO and MOSI are connected together so the data is looped back. You can see that the cache handling is not particularly difficult.
/* Bus acquisition and SPI reprogramming.*/
spiAcquireBus(&SPID2);
spiStart(&SPID2, &hs_spicfg);

/* Preparing data buffer and flushing cache.*/
for (i = 0; i < SPI_BUFFERS_SIZE; i++)
  txbuf[i] = (uint8_t)i;
dmaBufferFlush(txbuf, SPI_BUFFERS_SIZE);

/* Slave selection and data exchange.*/
spiSelect(&SPID2);
spiExchange(&SPID2, SPI_BUFFERS_SIZE, txbuf, rxbuf);
spiUnselect(&SPID2);

/* Invalidating cache over the buffer then checking the loopback result.*/
dmaBufferInvalidate(rxbuf, SPI_BUFFERS_SIZE);
if (memcmp(txbuf, rxbuf, SPI_BUFFERS_SIZE) != 0)
  chSysHalt("loopback failure");

/* Releasing the bus.*/
spiReleaseBus(&SPID2);
Note that dmaBufferFlush() and dmaBufferInvalidate() are also present on devices without cache, where they do nothing. This preserves software compatibility across all devices.
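One way such a compatibility layer could be structured is sketched below. This is not the ChibiOS implementation: `DEMO_HAS_DCACHE`, `demo_buffer_flush()`, `demo_buffer_invalidate()` and `demo_process_rx()` are hypothetical names, and the cached branch assumes that a device header providing the CMSIS functions `SCB_CleanDCache_by_Addr()` and `SCB_InvalidateDCache_by_Addr()` has already been included.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical configuration switch: 1 on devices with a data cache. */
#if !defined(DEMO_HAS_DCACHE)
#define DEMO_HAS_DCACHE 0
#endif

#if DEMO_HAS_DCACHE
/* Assumption: the CMSIS core header for the Cortex-M7 is already included. */
#define demo_buffer_flush(addr, size) \
  SCB_CleanDCache_by_Addr((uint32_t *)(addr), (int32_t)(size))
#define demo_buffer_invalidate(addr, size) \
  SCB_InvalidateDCache_by_Addr((uint32_t *)(addr), (int32_t)(size))
#else
/* No data cache: the operations compile to nothing but still evaluate
   their arguments, so callers build without unused-variable warnings. */
#define demo_buffer_flush(addr, size) \
  do { (void)(addr); (void)(size); } while (0)
#define demo_buffer_invalidate(addr, size) \
  do { (void)(addr); (void)(size); } while (0)
#endif

/* Portable caller: the same source works with or without a cache. */
static unsigned demo_process_rx(uint8_t *buf, size_t n) {
  unsigned sum = 0U;
  demo_buffer_invalidate(buf, n);   /* discard stale cached copy, if any */
  for (size_t i = 0; i < n; i++) {
    sum += buf[i];
  }
  return sum;
}
```

Application code written against these macros stays identical on every device, and on cacheless parts the compiler removes the calls entirely.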