
Andrew Zonenberg

FPGA/ASIC people: when you have a complex state machine fetching data from a memory with unpredictable latency (e.g. AXI bus that may have contention, or external DRAM) how do you figure out what to do when the data comes in?

For my application there's no reordering possible within a single stream of requests, so I'm thinking of having a FIFO of context objects that will "jog my memory": once data comes back, I know why it was read and how to handle it.
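A minimal sketch of that context-FIFO idea in SystemVerilog, assuming an in-order memory that returns exactly one data beat per accepted request after an arbitrary number of cycles. All module and signal names are hypothetical, not the actual design:

```systemverilog
// Context FIFO sketch: each accepted read pushes a small "why I asked"
// record; because completions are in order, the head of the FIFO always
// describes the data beat that just arrived. DEPTH must be a power of two
// and at least the worst-case number of outstanding reads.
module read_context_fifo #(
    parameter ADDR_WIDTH = 16,
    parameter DATA_WIDTH = 64,
    parameter CTX_WIDTH  = 8,
    parameter DEPTH      = 32
) (
    input  logic                  clk,
    input  logic                  rst,

    // Request side: address plus a context tag describing what to do later
    input  logic                  req_valid,
    output logic                  req_ready,
    input  logic [ADDR_WIDTH-1:0] req_addr,
    input  logic [CTX_WIDTH-1:0]  req_ctx,

    // Memory side: native SRAM-style read port, variable latency, in-order
    output logic                  mem_rd_en,
    output logic [ADDR_WIDTH-1:0] mem_rd_addr,
    input  logic                  mem_rd_valid,
    input  logic [DATA_WIDTH-1:0] mem_rd_data,

    // Completion side: data reunited with its context
    output logic                  done_valid,
    output logic [CTX_WIDTH-1:0]  done_ctx,
    output logic [DATA_WIDTH-1:0] done_data
);

    localparam PTR_BITS = $clog2(DEPTH);

    logic [CTX_WIDTH-1:0] ctx_mem [DEPTH];
    logic [PTR_BITS:0]    wr_ptr, rd_ptr;

    // Back-pressure requests once DEPTH reads are outstanding
    assign req_ready   = ((wr_ptr - rd_ptr) < DEPTH);
    assign mem_rd_en   = req_valid && req_ready;
    assign mem_rd_addr = req_addr;

    always_ff @(posedge clk) begin
        if (rst) begin
            wr_ptr     <= '0;
            rd_ptr     <= '0;
            done_valid <= 1'b0;
        end
        else begin
            done_valid <= 1'b0;

            // Push the context when a read request is accepted
            if (req_valid && req_ready) begin
                ctx_mem[wr_ptr[PTR_BITS-1:0]] <= req_ctx;
                wr_ptr <= wr_ptr + 1'b1;
            end

            // Pop it when the (in-order) read data comes back
            if (mem_rd_valid) begin
                done_ctx   <= ctx_mem[rd_ptr[PTR_BITS-1:0]];
                done_data  <= mem_rd_data;
                done_valid <= 1'b1;
                rd_ptr     <= rd_ptr + 1'b1;
            end
        end
    end

endmodule
```

The requester pushes req_ctx in the same cycle it issues the read; since completions arrive in request order, no tag matching is needed.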

@whitequark This almost feels like it's going to turn into a task-flow architecture.

Instead of reading data and then processing the response, you issue a read request with a context object attached, and wherever the response shows up, it arrives along with instructions on what to do with that data.

@azonenberg That's exactly how I designed the Glasgow (in future, probably amaranth-stdio) I/O core. You give it the data to put in the output IOB registers plus metadata, and some cycles later (depending on the platform and configuration), it gives you back metadata and input values.

This makes it very easy to write things like high-performance JTAG probes without worrying much about the implementation.

All stream-based, like AXI4-Stream but in Amaranth.

@whitequark As of now I have a native SRAM interface to the memory (Xilinx UltraRAM), but the latency is variable, both because I may add arbitration in front of it and because I will likely be adding and removing pipeline stages for timing closure as the design evolves.

So I don't want my read-side logic to make any assumptions about latency.

@whitequark and then I'm using AXI4-Stream wrapped in SV interfaces for all of the actual Ethernet traffic across module boundaries.

The block I'm writing right now is essentially 24 separate per-port ingress FIFOs bolted onto a single 64-bit 25 Gbps crossbar port.
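For reference, an AXI4-Stream link wrapped in a SystemVerilog interface might look roughly like this; the widths and the use of TDEST as a per-port bitmask are assumptions for illustration, not the project's actual interface definition:

```systemverilog
// Minimal AXI4-Stream interface sketch. TUSER/TSTRB/TID omitted for brevity.
interface AXIStream #(
    parameter DATA_WIDTH = 64,
    parameter DEST_WIDTH = 24    // e.g. one bit per destination switch port
);
    logic                    tvalid;
    logic                    tready;
    logic [DATA_WIDTH-1:0]   tdata;
    logic [DATA_WIDTH/8-1:0] tkeep;
    logic                    tlast;
    logic [DEST_WIDTH-1:0]   tdest;

    // A beat transfers on any cycle where tvalid && tready are both high
    modport transmitter (output tvalid, tdata, tkeep, tlast, tdest, input tready);
    modport receiver    (input  tvalid, tdata, tkeep, tlast, tdest, output tready);
endinterface
```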

@azonenberg Yeah, that's exactly why I picked this design: SiliconBlue devices have 1 cycle of latency for DDR I/O and Lattice devices have 3 (I think Xilinx does too?). I don't want to be tied by design to one specific platform.
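Not the actual Glasgow/Amaranth code, but the same idea sketched in SystemVerilog: the platform-specific I/O latency becomes a parameter, and the metadata rides a delay line sized to match it, so user logic never hard-codes the cycle count:

```systemverilog
// Delay the request metadata by exactly the I/O core's latency so it pops
// out aligned with the captured input values. IO_LATENCY is per-platform.
module io_metadata_delay #(
    parameter META_WIDTH = 8,
    parameter IO_LATENCY = 3     // e.g. 1 cycle on some parts, 3 on others
) (
    input  logic                  clk,
    input  logic                  in_valid,
    input  logic [META_WIDTH-1:0] in_meta,
    output logic                  out_valid,
    output logic [META_WIDTH-1:0] out_meta
);
    logic [IO_LATENCY-1:0]                 valid_pipe;
    logic [IO_LATENCY-1:0][META_WIDTH-1:0] meta_pipe;

    always_ff @(posedge clk) begin
        valid_pipe[0] <= in_valid;
        meta_pipe[0]  <= in_meta;
        for (int i = 1; i < IO_LATENCY; i++) begin
            valid_pipe[i] <= valid_pipe[i-1];
            meta_pipe[i]  <= meta_pipe[i-1];
        end
    end

    assign out_valid = valid_pipe[IO_LATENCY-1];
    assign out_meta  = meta_pipe[IO_LATENCY-1];
endmodule
```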

@azonenberg @whitequark I used this approach for a JTAG/I2C adapter. For efficiency, I packed a number of transfers into a single USB packet and needed a way to distribute all the received data and response codes. The API recorded the pointers, then you'd call a "flush" function that would generate the actual transfer and handle the results, after which the data at those pointers became valid.

@tubetime @azonenberg This is exactly how my ARM7TDMI debugger works! It's a very effective approach.

@azonenberg If I can, I separate the fetching and processing into separate state machines.

Have the fetching state machine write commands to the processing state machine through a FIFO (assuming in-order, infallible completion; otherwise this becomes a memory indexed by the AXI IDs and things get more complicated) to drive its state transitions and start the next operation.

AXI is nice because it lets you split the read and write channels, and even more granularly the address/data/response channels.
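A rough skeleton of that split, under the stated assumption of in-order, infallible completion; the command encoding and the sync_fifo primitive are placeholders rather than anyone's real code:

```systemverilog
// Commands the fetch FSM attaches to each read it issues
typedef enum logic [1:0] {
    CMD_LOOKUP_HEADER,    // data beat is a frame header: start a MAC lookup
    CMD_FORWARD_PAYLOAD,  // data beat is payload: push it downstream
    CMD_DISCARD           // data beat belongs to a dropped frame: ignore it
} mem_cmd_t;

module fetch_process_split #(
    parameter DATA_WIDTH = 64
) (
    input  logic                  clk,
    input  logic                  rst,

    // From the fetch FSM: one command per read request it issues
    input  logic                  fetch_cmd_valid,
    input  mem_cmd_t              fetch_cmd,

    // Read data returning from the memory, in request order
    input  logic                  mem_rd_valid,
    input  logic [DATA_WIDTH-1:0] mem_rd_data,

    // Actions taken by the processing FSM
    output logic                  lookup_start,
    output logic [DATA_WIDTH-1:0] lookup_data,
    output logic                  out_valid,
    output logic [DATA_WIDTH-1:0] out_data
);

    // Command FIFO between the two state machines (placeholder primitive,
    // assumed first-word-fall-through: rd_data valid whenever !empty)
    logic [$bits(mem_cmd_t)-1:0] cur_cmd_raw;
    mem_cmd_t                    cur_cmd;
    logic                        cmd_fifo_empty;
    logic                        cmd_pop;

    sync_fifo #(.WIDTH($bits(mem_cmd_t)), .DEPTH(32)) cmd_fifo (
        .clk    (clk),
        .rst    (rst),
        .wr_en  (fetch_cmd_valid),
        .wr_data(fetch_cmd),
        .rd_en  (cmd_pop),
        .rd_data(cur_cmd_raw),
        .empty  (cmd_fifo_empty)
    );

    assign cur_cmd = mem_cmd_t'(cur_cmd_raw);

    // Each returning data beat retires the command at the head of the FIFO
    assign cmd_pop = mem_rd_valid && !cmd_fifo_empty;

    always_ff @(posedge clk) begin
        lookup_start <= 1'b0;
        out_valid    <= 1'b0;
        if (!rst && cmd_pop) begin
            unique case (cur_cmd)
                CMD_LOOKUP_HEADER: begin
                    lookup_start <= 1'b1;
                    lookup_data  <= mem_rd_data;
                end
                CMD_FORWARD_PAYLOAD: begin
                    out_valid <= 1'b1;
                    out_data  <= mem_rd_data;
                end
                CMD_DISCARD: ;   // nothing to do, drop the beat
            endcase
        end
    end

endmodule
```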

@azonenberg This is basically what you're describing, but I personally find that thinking of requesting and processing as distinct cores leads me to a simpler design and makes it easier to pipeline things.

@ldcd So in this case I control enough of the system to guarantee that operations will always complete successfully and in-order. There's just a high, potentially variable latency.

@ldcd I'm planning to implement requesting and processing as separate state machines within a single RTL module, with a "back channel" where the requester also writes a "here's what you do when this data shows up" message into the processing block.

@azonenberg It obviously depends on the sort of processing you need to do, but if you fully separate them you can sometimes get an easy throughput increase by doubling up on processing cores and using a round-robin arbiter.
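A hypothetical round-robin dispatcher for that "double up on processing cores" idea: the first ready core after the previous winner takes the next work item. Purely illustrative, not part of the design being discussed:

```systemverilog
// One-hot round-robin grant: search the cores starting just after the
// previous winner and pick the first one that is ready.
module rr_dispatch #(
    parameter N = 2
) (
    input  logic         clk,
    input  logic         rst,
    input  logic         item_valid,   // a completion is waiting for a core
    input  logic [N-1:0] core_ready,   // which cores can accept work
    output logic [N-1:0] grant         // one-hot: core that takes this item
);
    logic [$clog2(N)-1:0] last;

    always_comb begin
        grant = '0;
        if (item_valid) begin
            for (int i = 1; i <= N; i++) begin
                if (grant == '0 && core_ready[(last + i) % N])
                    grant[(last + i) % N] = 1'b1;
            end
        end
    end

    // Remember the most recent winner so the next search starts after it
    always_ff @(posedge clk) begin
        if (rst)
            last <= '0;
        else begin
            for (int i = 0; i < N; i++)
                if (grant[i])
                    last <= i[$clog2(N)-1:0];
        end
    end
endmodule
```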

@azonenberg It's very GPU-brained, but I tend to organize these things as distinct accelerators with simple command queues.

@ldcd The output is ultimately single stream.

I'm fetching Ethernet frame headers from 24 different logical FIFOs within a single physical SRAM bank, looking them up in the MAC table, then ultimately outputting a single AXI4-Stream of Ethernet frames (with a destination switch port bitmask in TDEST) into a crossbar.

The goal is mostly to have agility to refactor the SRAM physical design and change pipeline depth as needed for timing closure (i.e. not tying the RTL design to a specific FPGA speed grade and pipeline depth).
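For illustration, the TDEST bitmask might be formed from the MAC table lookup result along these lines (generic learning-switch behavior, not the project's actual MAC table block): a unicast hit targets exactly one port, anything else floods to every port except the one the frame arrived on.

```systemverilog
// Build the destination port bitmask carried in TDEST. Assumes dst_mac[47:40]
// is the first octet on the wire, so dst_mac[40] is the multicast/broadcast
// (I/G) bit.
module dest_mask_gen #(
    parameter NUM_PORTS = 24
) (
    input  logic                          lookup_valid,
    input  logic                          lookup_hit,   // MAC table hit?
    input  logic [47:0]                   dst_mac,
    input  logic [$clog2(NUM_PORTS)-1:0]  hit_port,     // port from the table
    input  logic [$clog2(NUM_PORTS)-1:0]  src_port,     // ingress port
    output logic                          mask_valid,
    output logic [NUM_PORTS-1:0]          dest_mask
);
    wire is_multicast = dst_mac[40];

    always_comb begin
        mask_valid = lookup_valid;
        if (lookup_hit && !is_multicast)
            dest_mask = NUM_PORTS'(1) << hit_port;                        // exactly one port
        else
            dest_mask = {NUM_PORTS{1'b1}} & ~(NUM_PORTS'(1) << src_port); // flood, minus source
    end
endmodule
```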

@ldcd I'm probably going to refactor the MAC table to be a bit more explicitly stream-oriented too, but that's a separate block of hierarchy I'll worry about on its own (it exists and is validated but needs some cleanup).

@ldcd First-iteration block diagram before actually trying to build it: serd.es/assets/latentred-block

I ended up combining the 24:1 arbiter, 32:64 expansion, and URAM ingress FIFOs into a single logical subsystem made of 24 URAM288s.

The write side is just 24 separate FIFO controllers writing to each URAM in parallel.

The read side is cascaded and has to implement all of the arbitration etc.
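Structurally, the write side described above is just a generate loop of per-port FIFO controllers, roughly like the following; uram_fifo_ctrl is a stand-in for whatever per-port controller the design actually uses, and the cascaded read-side arbitration is left out:

```systemverilog
// 24 independent write-side FIFO controllers, one per ingress port, each
// owning its own URAM. Read-side arbitration/cascading is omitted here.
module ingress_fifo_bank #(
    parameter NUM_PORTS  = 24,
    parameter DATA_WIDTH = 64
) (
    input  logic                                 clk,
    input  logic                                 rst,

    // One write stream per ingress port
    input  logic [NUM_PORTS-1:0]                 port_wr_valid,
    input  logic [NUM_PORTS-1:0][DATA_WIDTH-1:0] port_wr_data,
    output logic [NUM_PORTS-1:0]                 port_wr_ready,

    // Per-port "data available" flags for the read-side arbiter
    output logic [NUM_PORTS-1:0]                 port_nonempty
);

    for (genvar p = 0; p < NUM_PORTS; p++) begin : g_port
        uram_fifo_ctrl #(.DATA_WIDTH(DATA_WIDTH)) fifo (
            .clk      (clk),
            .rst      (rst),
            .wr_valid (port_wr_valid[p]),
            .wr_data  (port_wr_data[p]),
            .wr_ready (port_wr_ready[p]),
            .nonempty (port_nonempty[p])
        );
    end

endmodule
```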

@azonenberg An AXI4 memory interface is secretly 5 AXI streams in a trench coat, so in a sense, if you have an AXI4 memory interface you already have a stream-oriented interface. That said, if you're bottlenecked on read/write throughput rather than processing, that kind of architecture isn't needed.

@azonenberg Also, love URAMs; they make so many things much more straightforward.

@azonenberg I assume having the FSM just wait for the memory read to complete is out of the question?

@quantumdude836 Correct, I want to be issuing one memory read per clock with a potentially double-digit cycle latency (the immediate application is a fully pipelined but large SRAM array with an arbiter in front of it, and this is just one of several blocks accessing it).
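One way to keep that agility, sketched under the assumption of a simple inferred memory (write port omitted for brevity): wrap the SRAM with a parameterized number of output register stages and carry the read-valid flag down the same pipe, so changing the stage count for timing closure never touches the consumer logic.

```systemverilog
// SRAM read path with a tunable number of extra output registers. The total
// latency is 2 + EXTRA_PIPE cycles; consumers only look at rd_valid.
module pipelined_sram_read #(
    parameter ADDR_WIDTH = 15,
    parameter DATA_WIDTH = 64,
    parameter EXTRA_PIPE = 2     // adjust for timing closure
) (
    input  logic                  clk,
    input  logic                  rd_en,
    input  logic [ADDR_WIDTH-1:0] rd_addr,
    output logic                  rd_valid,
    output logic [DATA_WIDTH-1:0] rd_data
);
    // The memory array itself (write port omitted to keep the sketch short)
    logic [DATA_WIDTH-1:0] mem [2**ADDR_WIDTH];

    logic [DATA_WIDTH-1:0] ram_q;
    logic                  ram_q_valid;

    always_ff @(posedge clk) begin
        ram_q       <= mem[rd_addr];
        ram_q_valid <= rd_en;
    end

    // Optional output pipeline; data and valid move in lockstep
    logic [EXTRA_PIPE:0][DATA_WIDTH-1:0] data_pipe;
    logic [EXTRA_PIPE:0]                 valid_pipe;

    always_ff @(posedge clk) begin
        data_pipe[0]  <= ram_q;
        valid_pipe[0] <= ram_q_valid;
        for (int i = 1; i <= EXTRA_PIPE; i++) begin
            data_pipe[i]  <= data_pipe[i-1];
            valid_pipe[i] <= valid_pipe[i-1];
        end
    end

    assign rd_data  = data_pipe[EXTRA_PIPE];
    assign rd_valid = valid_pipe[EXTRA_PIPE];
endmodule
```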

@azonenberg that sounds about right. I generally question any design if I end up with a complex state machine or variable latency. "What can we remove to make this better?"

But complex problems require complex solutions. So long as the solution complexity is appropriate to the problem, then yeah: track your state with a FIFO if that's sufficient.

Don't make it more complex than necessary, but definitely don't make it less complex than necessary!

@poleguy Yeah unfortunately there's inherently variable latency in anything dealing with external user input from a network.

I can *bound* the latency, assuming frames on other ports show up with exact worst-case timing of competing MAC lookups, etc.

@azonenberg

LLMs these days review lots of ASICs; here are the common DRAM patterns:

In-order? One constant ID + depth-N FIFO. Simple, rock-solid.

Out-of-order per ID? Enable multiple IDs; resolve with tag-indexed RAM or small CAM.

Need strict original order downstream? Add a reorder queue or let a micro-controller drain in order.

Always track an error bit per request so completions can propagate faults precisely.

Dimension depth = (max latency ÷ average request issue interval) + margin.
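That sizing rule can be enforced with a trivial check at elaboration/simulation start, like the one below; the numbers are invented for illustration.

```systemverilog
// Sanity check: the context/tag FIFO must be able to hold the worst-case
// number of outstanding reads, plus some margin.
module outstanding_depth_check #(
    parameter int FIFO_DEPTH = 64            // depth actually instantiated
);
    localparam int MAX_LATENCY_CYCLES = 40;  // worst-case read latency (made up)
    localparam int MIN_ISSUE_INTERVAL = 1;   // one request per clock
    localparam int MARGIN             = 8;

    // depth >= (max latency / min issue interval) + margin
    localparam int REQUIRED_DEPTH =
        (MAX_LATENCY_CYCLES / MIN_ISSUE_INTERVAL) + MARGIN;

    initial begin
        if (FIFO_DEPTH < REQUIRED_DEPTH)
            $fatal(1, "FIFO depth %0d < required %0d",
                   FIFO_DEPTH, REQUIRED_DEPTH);
    end
endmodule
```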

@azonenberg remember error cases:

Most examples assume normal forward progress. In practice you eventually hit reset, partial-reconfiguration, link retrain or ECC-poison events that invalidate some—but not all—outstanding reads. A single-depth “in-order” FIFO trivialises the flush problem (drop its valid flag and you’re clean), but once you introduce multi-ID scoreboards you need a replay/abort bit per entry plus a way to drain residual data beats still in flight.

@dragosr I already have TRESET on the AXI stream hooked to the link-down flag so all of the FIFO pointers and such propagate resets.

Getting that reset propagated into this block is something I'm still working on: resets won't be able to abort an in-flight memory read, but I need to make a note that the data should be discarded once it shows up.

@dragosr (And since this block is shared by the entire line card, I can't reset when a single link drops; I just need to wipe any state tied to that port.)
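A sketch of that "mark it for discard" bookkeeping, extended to per-port flush (all names hypothetical): a link-down strobe poisons that port's outstanding completions rather than resetting the whole shared block.

```systemverilog
// Per-port flush flags: completions belonging to a downed port are dropped
// instead of being forwarded. In a real design the flag would be cleared
// once that port's outstanding reads have drained; here it stays set until
// a global reset, just to keep the sketch short.
module per_port_discard #(
    parameter NUM_PORTS = 24,
    parameter CTX_WIDTH = 8
) (
    input  logic                          clk,
    input  logic                          rst,

    // Link-down strobe per port (from the MAC/PCS layer)
    input  logic [NUM_PORTS-1:0]          link_down,

    // Completion popped from the context FIFO, tagged with its source port
    input  logic                          done_valid,
    input  logic [$clog2(NUM_PORTS)-1:0]  done_port,
    input  logic [CTX_WIDTH-1:0]          done_ctx,

    // Forwarded downstream only if the port hasn't been flushed
    output logic                          fwd_valid,
    output logic [CTX_WIDTH-1:0]          fwd_ctx
);
    logic [NUM_PORTS-1:0] flush_pending;

    always_ff @(posedge clk) begin
        if (rst)
            flush_pending <= '0;
        else
            flush_pending <= flush_pending | link_down;
    end

    assign fwd_valid = done_valid && !flush_pending[done_port];
    assign fwd_ctx   = done_ctx;
endmodule
```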