What's confusing me is that pings at low rate work fine.
So this seems to be something related to, maybe, sending a lot of data in a short time window or something. I haven't caught it in the act yet.
Starting to wonder if it's the TCP/UDP checksum offload block, as nothing else is jumping out at me. Bypassing that now and we'll see if things work better.
Yeeep it was the offload. Bypassed the offload and a *debug* binary is now pushing 704 Mbps of iperf traffic.
I can only imagine how fast a release build with less buggy checksum offload will be.
Very interesting.
It's *not* the offload per se. The offload is just triggering it.
All of my frames have four trailing 0x00 bytes, and the offload's calculated checksum is off by 0x04, consistently.
I suspect that somewhere further up the chain the 32 vs 64 bit path is adding trailing data to frames that shouldn't be there.
Waiting for an LA to compile into that part of the design, but at a high level I can see the bug now. The MDMA is rounding the frame up to a 64-bit boundary while writing to the TX FIFO.
This is OK in itself: the existing datapath rounds up to a 32-bit boundary already, and I have a register in the TX buffer where I write the *actual* frame length so I know to ignore the trailing padding.
Something there probably didn't account for a >= vs == case or something, so once you add more than one word of padding it probably breaks.
aaaaand there it is. Expected frame length (set by writing to FETHTX.LENGTH) is 5d4, we send 5d8 bytes of data, *and those extra bytes got pushed into the fifo*.
Hmmm.
There might be two bugs, not just one.
Iperf now works (at 766 Mbps!) but SSH doesn't.
Looks like the last byte is being truncated from at least some SSH TCP segments. Probably my fix is incomplete...
Aaand fixed. SSH works again.
Iperf: 766 Mbps
SFTP /dev/zero: 32.6 MB/s (260.8 Mbps)
SFTP QSPI flash: 14.6 MB/s (116.8 Mbps)
This is using 64-bit transfers for MCU -> FPGA transactions on Ethernet transmit only, everything else still 32 bit.
@azonenberg is it related to the MTU size?
Try increasing the ping packet size and see if you get the same results
@madkiwi I'm 99% sure this line in the checksum offload is at fault
//TODO: backpressure when fifo is full
buf_tx_ready <= 1;
I bet now that I've increased the MCU-FPGA bandwidth I actually can throw enough data at the FPGA to fill this buffer.
@azonenberg so, you’re close to saturating a gigabit link from a 500MHz-class MCU?!
@jpm Yep! With a bit of help from the FPGA, but still.
@jpm It can SFTP faster than a raspberry pi 4 too. (I don't have a 5 handy to benchmark against)
@azonenberg I’m always surprised by just how slow the RasPi really is. The Pi4 was a big jump in performance mostly because of the giant increase in memory bandwidth
@jpm On my STM32H735 I've achieved nearly 3400 MB/s of memory bandwidth to the DTCM.
Embedded Linux also comes with a massive overhead compared to bare metal. One of my upcoming projects will take a dual core Cortex-A35 SoC intended to run Linux, but see if I can bring it up bare metal and do stupid things with it.
@azonenberg yeah the Pi4 tops out at about 3200MB/sec from its main memory, you can see the effect from folks who wedge a NVMe SSD into the PCIe bus. Plus no hardware crypto acceleration, it’s not a great system for doing real work on
@jpm There's a reason I'm on the H735.
@jpm Oh and my firmware is like 120 kB.
@azonenberg The thing I find most interesting here is how I *know* there are FPGA engineers out there that just drop in the vendor gateware and call it a day.
@ChuckMcManis There are lots of engineers who throw embedded Linux at a problem and call it a day.
I like running bare metal.
@azonenberg Nice work!!
@whitequark I continue to be cranky that the STM32H735 AHB2 (which as far as I can see is only used by crypto IPs) is not reachable by any DMA that can access the DTCM.
So there's no way to do a cache coherent DMA of data from the TCM to crypto or back.
Even with cache, DTCM is faster than AXI RAM by enough of a margin that all of my firmware is keeping Ethernet frame data in it.
Long term I think I'm going to end up pushing more and more stuff to FPGA offload and having the MCU not actually handling a lot of datapath.
But as you can see here, it's not half bad at doing the datapath either.
@azonenberg I do yeah!
needless to say i'm team all-on-FPGA, your MCU suffering for weeks really drives the point in
@whitequark I mean, my goal here is mostly to maximize the MCU-FPGA bandwidth. No matter how I partition the workload, improved performance of the link is better all around.
And hey, at least I didn't use a Zynq.
@whitequark The whole reason I'm doing this is to lay groundwork for future projects like my Ethernet switch.
Implementing an SSH server on an FPGA or a softcore sounds like a nightmare (in particular without a hard TRNG IP you can use for session key generation).
I absolutely don't need this much bandwidth between the MCU and FPGA for that project, but optimization is fun. If you're not counting clock cycles and instructions, are you even enjoying your day?
@azonenberg dunno, i would race two async clocks against each other
@whitequark I haven't trusted FPGA RNGs since I ran a test a while back of on die ring oscillators and found them injection locking to each other through power rails.
@azonenberg sure, i mean two async clocks from two PLLs
@whitequark I just know enough about crypto to not trust home grown RNGs.
@azonenberg i don't trust ST to build a good one either
@whitequark Fair enough, but ARM or Cadence or Synopsys or whoever they licensed the IP from probably didn't do the worst job.
Long term I want to build a randomness pool that's seeded by the TRNG over time and hashes itself to accumulate entropy even if the TRNG output is somewhat biased.
@azonenberg i've seen enough atrocious MCU RNG implementations that i don't even trust them to do that
what was the last thing to be broken, pico's PRNG? at least for the FPGA one, i know the ways in which it will be bad. the ST one is opaque
@whitequark Fair enough.
@azonenberg i def should build a design for this and then try and break it
@azonenberg @whitequark Have you seen https://harmoninstruments.com/posts/trng.html ? I still don't fully trust it given there's been no 3rd party review. I'll probably mix in data from the management microcontroller TRNG if I use it in future designs.
Should be applicable to US+ as well. There's also the option of using the finer resolutions of the delay blocks there to sample a jittery edge.
@dlharmon @azonenberg re: metastability, i've wondered about it too but it seems that on recent logic the window is absurdly small
@whitequark @dlharmon Yeah modern nodes have very short windows which makes catching metastability related bugs tricky.
~5 years ago I caught one in a UART core I had been using for quite a long time that would occasionally drop a byte when I was sending a lot of data. Turns out I wasn't synchronizing the input properly (i.e. at all).
@azonenberg @dlharmon how did you know it's metastability specifically and not just jitter resulting in capturing the wrong thing sometimes (i.e. a normal race)?
@whitequark @dlharmon Because I was getting failure states that were logically impossible and could only be explained by a flop being both high and low while sampled by other logic.
@whitequark @dlharmon And adding a 2-flop sync between the rx pin and the existing uart core completely eliminated the problem.
@azonenberg @dlharmon oh, that's really cool
I've never caught metastability in the wild, and I've even tried to cause it on purpose