What's confusing me is that pings at low rate work fine.
So this seems to be something related to, maybe, sending a lot of data in a short time window or something. I haven't caught it in the act yet.
Starting to wonder if it's the TCP/UDP checksum offload block as nothing else is jumping out at me. Bypassing that and we'll see if things work better.
Yeeep it was the offload. Bypassed the offload and a *debug* binary is now pushing 704 Mbps of iperf traffic.
I can only imagine how fast a release build with less buggy checksum offload will be.
Very interesting.
It's *not* the offload per se. The offload is just triggering it.
All of my frames have four trailing 0x00 bytes, and the offload's calculated checksum is off by 0x04, consistently.
I suspect that somewhere further up the chain the 32 vs 64 bit path is adding trailing data to frames that shouldn't be there.
Waiting for an LA to compile into that part of the design, but at a high level I can see the bug now. The MDMA is rounding the frame up to a 64 bit boundary while writing to the TX FIFO.
This is OK: the existing datapath rounds up to a 32 bit boundary already, and I have a register in the TX buffer where I write the *actual* frame length so I know to ignore the trailing padding.
Something there probably didn't account for a >= vs == case, so once you add more than one word of padding it breaks.
aaaaand there it is. Expected frame length (set by writing to FETHTX.LENGTH) is 5d4, we send 5d8 bytes of data, *and those extra bytes got pushed into the fifo*.
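Roughly the failure mode, as a toy Python model (the function and the runaway cap are made up for illustration, not the actual RTL):

```python
def bytes_pushed(expected_len, word_bytes, use_exact_compare):
    """Count bytes pushed into the FIFO word-by-word until the
    end-of-frame test fires.

    use_exact_compare=True models the suspected bug: an end-of-frame
    test of `count == expected_len` never fires when padding steps the
    counter *past* the length, so extra words leak into the FIFO.
    A >= test fires at the first word boundary at or past the length.
    """
    count = 0
    while True:
        count += word_bytes
        done = (count == expected_len) if use_exact_compare else (count >= expected_len)
        # Cap the runaway so the buggy case terminates in this demo
        if done or count >= expected_len + 2 * word_bytes:
            break
    return count

# 32-bit words: 0x5d4 is word-aligned, so even == happens to work
assert bytes_pushed(0x5d4, 4, True) == 0x5d4
# 64-bit words with the fix: stops at 0x5d8, matching the observed padding
assert bytes_pushed(0x5d4, 8, False) == 0x5d8
# 64-bit words with ==: counter steps past 0x5d4 and overshoots
assert bytes_pushed(0x5d4, 8, True) > 0x5d8
```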
Hmmm.
There might be two bugs, not just one.
Iperf now works (at 766 Mbps!) but SSH doesn't.
Looks like the last byte is being truncated from at least some SSH TCP segments. Probably my fix is incomplete...
Aaand fixed. SSH works again.
Iperf: 766 Mbps
SFTP /dev/zero: 32.6 MB/s (260.8 Mbps)
SFTP QSPI flash: 14.6 MB/s (116.8 Mbps)
This is using 64-bit transfers for MCU -> FPGA transactions on Ethernet transmit only, everything else still 32 bit.
@whitequark I continue to be cranky that the STM32H735 AHB2 (which as far as I can see is only used by crypto IPs) is not reachable by any DMA that can access the DTCM.
So there's no way to do a cache coherent DMA of data from the TCM to crypto or back.
Even with cache, DTCM is faster than AXI RAM by enough of a margin that all of my firmware is keeping Ethernet frame data in it.
Long term I think I'm going to end up pushing more and more stuff to FPGA offload and having the MCU not actually handling a lot of datapath.
But as you can see here, it's not half bad at doing the datapath either.
@azonenberg I do yeah!
needless to say i'm team all-on-FPGA, your MCU suffering for weeks really drives the point home
@whitequark I mean, my goal here is mostly to maximize the MCU-FPGA bandwidth. No matter how I partition the workload, improved performance of the link is better all around.
And hey, at least I didn't use a Zynq.
@whitequark The whole reason I'm doing this is to lay groundwork for future projects like my Ethernet switch.
Implementing a SSH server on an FPGA or a softcore sounds like a nightmare (in particular without a hard TRNG IP you can use for session key generation).
I absolutely don't need this much bandwidth between the MCU and FPGA for that project, but optimization is fun. If you're not counting clock cycles and instructions, are you even enjoying your day?
@azonenberg dunno, i would race two async clocks against each other
@whitequark I haven't trusted FPGA RNGs since I ran a test a while back of on die ring oscillators and found them injection locking to each other through power rails.
@azonenberg sure, i mean two async clocks from two PLLs
@whitequark I just know enough about crypto to not trust home grown RNGs.
@azonenberg i don't trust ST to build a good one either
@whitequark Fair enough, but ARM or Cadence or Synopsys or whoever they licensed the IP from probably didn't do the worst job.
Long term I want to build a randomness pool that's seeded by the TRNG over time and hashes itself to accumulate entropy even if the TRNG output is somewhat biased.
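Something in the spirit of what I mean, as a toy Python sketch (SHA-256 via hashlib; the mixing construction here is illustrative, not the planned firmware design):

```python
import hashlib

class EntropyPool:
    """Toy accumulator: the pool state is repeatedly hashed together
    with new TRNG output, so even biased samples can only ever add
    entropy to the state, never remove it."""

    def __init__(self):
        self._state = bytes(32)

    def stir(self, trng_sample: bytes) -> None:
        # Mix new (possibly biased) TRNG output into the pool state
        self._state = hashlib.sha256(self._state + trng_sample).digest()

    def read(self, n: int) -> bytes:
        # Derive output from the state, then ratchet the state forward
        # so earlier output can't be reconstructed from a later state
        assert n <= 32
        out = hashlib.sha256(b"out" + self._state).digest()[:n]
        self._state = hashlib.sha256(b"next" + self._state).digest()
        return out
```

Usage would be to `stir()` in raw TRNG words continuously in the background and `read()` only when key material is needed.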
@azonenberg i've seen enough atrocious MCU RNG implementations that i don't even trust them to do that
what was the last thing to be broken, pico's PRNG? at least for the FPGA one, i know the ways in which it will be bad. the ST one is opaque
@whitequark Fair enough.
@azonenberg i def should build a design for this and then try and break it
@azonenberg @whitequark Have you seen https://harmoninstruments.com/posts/trng.html ? I still don't fully trust it given there's been no 3rd party review. I'll probably mix in data from the management microcontroller TRNG if I use it in future designs.
Should be applicable to US+ as well. There's also the option of using the finer resolutions of the delay blocks there to sample a jittery edge.
@dlharmon @azonenberg re: metastability, i've wondered about it too but it seems that on recent logic the window is absurdly small
@whitequark @dlharmon Yeah modern nodes have very short windows which makes catching metastability related bugs tricky.
~5 years ago I caught one in a UART core I had been using for quite a long time that would occasionally drop a byte when I was sending a lot of data. Turns out I wasn't synchronizing the input properly (i.e. at all).
@azonenberg @dlharmon how did you know it's metastability specifically and not just jitter resulting in capturing the wrong thing sometimes (i.e. a normal race)?
@whitequark @dlharmon Because I was getting failure states that were logically impossible and could only be explained by a flop being both high and low while sampled by other logic.
@whitequark @dlharmon And adding a 2-flop sync between the rx pin and the existing uart core completely eliminated the problem.
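The usual back-of-envelope for why the extra flop helps: synchronizer MTBF grows exponentially with the settling time given to the first stage. A Python sketch with made-up device constants (tau and t_w here are placeholders, not measured Spartan-6 values):

```python
import math

def synchronizer_mtbf(f_clk, f_data, t_settle, tau, t_w):
    """Classic MTBF estimate for a flop sampling an async input:

        MTBF = e^(t_settle / tau) / (t_w * f_clk * f_data)

    Adding a second flop extends t_settle by nearly a full clock
    period, multiplying MTBF by roughly e^(T_clk / tau)."""
    return math.exp(t_settle / tau) / (t_w * f_clk * f_data)

# Hypothetical numbers: 100 MHz clock, 1 Mbps async input,
# tau = 50 ps regeneration time constant, 20 ps metastability window
one_ff = synchronizer_mtbf(100e6, 1e6, 2e-9, 50e-12, 20e-12)
two_ff = synchronizer_mtbf(100e6, 1e6, 12e-9, 50e-12, 20e-12)
```

With these (invented) constants the second flop improves MTBF by a factor of e^200, which is why a 2FF sync makes the failure effectively unobservable even when a single flop fails every few days.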
@azonenberg @dlharmon oh, that's really cool
I've never caught metastability in the wild, and I've even tried to cause it on purpose
@whitequark @dlharmon I wrote this uart core on an xc3s50a as one of my first FPGA projects ever.
Kept on using it for years afterward and always just assumed that my random data drops were EMI or signal integrity problems or something.
Then one day I took a closer look at my RTL from 2010 and realized what was happening.
@azonenberg @dlharmon we really need to add a built-in CDC checker to Amaranth...
@whitequark @dlharmon CDC checks won't help if the input is coming from off chip and inherently async.
@azonenberg @dlharmon sure they would, all toplevel inputs would be considered asynchronous to any clock in the design unless marked otherwise
as a result you will have a violation without a 2FF
@azonenberg @dlharmon I suppose more accurately what I want to build would be described as "timing check during synthesis"
rather than merely considering CDCs it uses a more abstract model than P&R to conservatively avoid violations
@whitequark @azonenberg Same, I've tried and failed to create metastability, or at least something I could say for certain was. Perhaps the ~18 ps resolution on 7-series MMCM fine phase shift wasn't enough. I'd have thought jitter would have taken care of the rest. Maybe my clock was too low and the metastability resolved before the 2nd stage FFs could sample it.
I should try it again with a DDS (~fs res.), like 700 MHz clocks.
@dlharmon @whitequark This was on xc6s and with logic directly sampling a LVCMOS33 input buffer driven by a slow signal.
I think the slow slew rate of the input led to a much larger metastability window, to the point that one in a couple million edges would make the input flop go metastable.
@azonenberg @whitequark My attempt was purely with internal signals, the output of a LUT in the same CLB as the FF.
@dlharmon @whitequark Yeah I've never seen internal metastability only garbage data from timing violations.
First stage flops fed by a low slew input is a worst case scenario.
@whitequark @azonenberg @dlharmon that's odd. my first circuit had metastability issues. it was a SPI clock recovery circuit on the smallest ice40, an oversampled counter that was reset by data edges of the input signal.
from this i concluded that metastability is not rare at all, but my case might have been exceptional: clocks of the sender and receiver were just slightly different, so the design could drift into regimes that hit the timing violation more reliably