@jpm On my STM32H735 I've achieved nearly 3400 MB/s of memory bandwidth to the DTCM.
Embedded Linux also comes with a massive overhead compared to bare metal. One of my upcoming projects is to take a dual-core Cortex-A35 SoC intended to run Linux and see if I can bring it up bare metal and do stupid things with it.