**Andrew Zonenberg** @azonenberg · Jul 22, 2023

Bootup delay from RST# (blue) going high to MDC (yellow) beginning to toggle. 867 us, datasheet only requires 195 us.

And plenty of toggles on MDC before activity on MDIO (green).

Note that the actual PHY I/O signals are LVCMOS18; I'm probing MDIO at the PHY pins but the FPGA mirrors MDC and RST# to a 3.3V GPIO connector since it's tricky to get too many probes on a little QFN.

I did probe separately to confirm that MDC is reaching the actual PHY pins, and since it's linking up RST# is obviously clearing OK.

ngscopeclient screenshot showing RST# going high for a while before clock begins toggling

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

Hmmmm, interesting. The VSC8512 isn't responding over MDIO either.

I wonder if it's something about my FPGA-side MDIO controller (weird timing or something) and the KSZ9031 is more forgiving? It's the only PHY I recall having used with it in the past.

Anybody have ideas?

ngscopeclient screenshot showing relative timing of MDIO and MDC

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

Every time I look there's more spaghetti.

Prototype board on lab bench surrounded by probes and cables

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

OK, scratch that theory. I just found an old ngscopeclient dataset from my previous experiments with the DP83867.

I had MDIO working successfully using this same controller on it. In that particular case I was running at LVCMOS25 levels rather than the LVCMOS18 I'm using here, but it was the same FPGA IP.

So clearly the DP83867 *is* able to work with my controller. Which makes me lean back towards a hardware issue again.

But I'm still at a loss as to what could make MDIO fail but literally everything else work.

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

Actually no, the PHY *was* running at LVCMOS18 levels. The FPGA was using LVCMOS25 and I had a level shifter.

So almost every single thing was the same on that board vs here. What changed??

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

Awake for the day and troubleshooting more.

The remainder of the PHY is working fine, for sure. The fabric CDR block is locked and I'm getting valid 8b10b symbols.

After letting it run overnight, I have 5.1e12 symbols received without error on g13 (baseT link down) and 5.12e12 symbols with 71 errors on g12 (baseT link up). This is unsurprising as with the link up there's more power consumption, noise, variability to the data, etc.

But this gives a real world symbol error rate of roughly 1.4e-11 (71 errors in 5.12e12 symbols). Given that symbols are 10 bits long, we extrapolate a BER of 1.4e-12. This is a slight underestimate since it doesn't catch errors that turn one valid 8b10b symbol into another valid one with correct disparity, but it's good enough as an OOM level approximation, and e-12 BER sounds plausible for a fairly short link on the same PCB.
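
The estimate can be reproduced from the raw overnight counts. This is only a sketch of the arithmetic: the function name is made up for illustration, and the divide-by-ten step assumes each bad symbol contains exactly one flipped bit.

```python
def error_rates(symbol_errors, symbols, bits_per_symbol=10):
    """Return (symbol error rate, estimated bit error rate)."""
    ser = symbol_errors / symbols
    # ASSUMPTION: one bit error per corrupted 8b10b symbol
    ber = ser / bits_per_symbol
    return ser, ber


# Counts for g12 (baseT link up) quoted in the post above
ser, ber = error_rates(71, 5.12e12)
print(f"SER ~ {ser:.2e}, BER ~ {ber:.2e}")
```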

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

Can't find anything wrong with the pinout.

FPGA constraints showing MDIO on Y12 and MDC on AB12

Datasheet screenshot showing MDC on 16 and MDIO on 17

PCB screenshot showing MDIO on Y12 and MDC on AB12

PCB screenshot showing MDC on 16 and MDIO on 17

**Andrew Zonenberg** @azonenberg · Jul 22, 2023 *

Resetting the PHY and probing strap pins one at a time to verify actual voltage during reset:

RX_CTRL = 441 mV = 0.245x VCCIO, comfortably in the middle of Mode 3 (autonegotiation enabled).

GPIO_0 = 10 mV = 0.005x VCCIO, very much mode 1 (RX0 clock skew = 0)

GPIO_1 = 10 mV, RX2/RX1 clock skew = 0

1V8_IO = 1.784V, all good

1V0 = 1.002V, all good

RST# is a nice clean rising edge from 0 to 1.78V, looks good there.

PWRDN# / INT# is 1.77V, no concerns there. (Even if the PHY was in power-down mode I'm pretty sure the MDIO interface would be up)

LED2 is at 4 mV, mode 1, RGMII TX1/TX0 clock skew = 0

LED1 is at 14 mV, mode 1, ANEG_SEL=0 (advertise all modes including 10baseT), TX2 clock skew = 0

LED0 is at 308 mV, 0.17x VDD, mode 2. Mirror disabled, SGMII enabled.

This is correct config for g12 which is what I'm probing; g13 is wired identically but should have mirror mode enabled (but I can also configure this via MDIO so not a big deal).

A1V8 is 1.79V, happy.

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

So that's the entire south side of the PHY verified correct levels.

Now let's check the west side where MDIO addressing is configured.

RX_D2/SGMII_RX_P = 561 mV = 0.31x Vdd. That's wrong, it's between the mode 3 and mode 4 strap ranges.

I have no strap resistors on this pin and it's AC coupled to the FPGA (so any biasing coming from the FPGA shouldn't affect it, I'm probing at the PHY side of the coupling cap).

Per datasheet it's supposed to have a 9 kΩ pulldown in strap mode, and be max 0.098x VDDIO if left floating (strap config for mode 1, which is what I want).

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

Scoping with a longer time scale: it looks like after reset is asserted the pulldown starts discharging the coupling capacitor, but it takes a long time to do so.

So I either need to add an external pulldown (not relying on "open circuit = mode 1") or just assert reset longer so the on-die pulldown has time to do its thing. The latter seems to be easy enough, let's try that...

ngscopeclient screenshot showing reset being asserted and RX_D2 strap slowly falling during the reset period, but not reaching the Mode 1 strap range yet
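
A rough way to size the reset pulse is to treat the strap pin as a single RC node discharging through the on-die pulldown. The sketch below assumes an ideal exponential decay. The 9 kΩ figure and the 0.098x threshold come from the thread, but the capacitor value and starting voltage are hypothetical placeholders, not measurements from this board.

```python
import math


def discharge_time(r_ohms, c_farads, v_start, v_target):
    """Seconds for an RC node to decay from v_start to v_target."""
    return r_ohms * c_farads * math.log(v_start / v_target)


R_PULLDOWN = 9e3            # on-die pulldown quoted from the datasheet
C_COUPLING = 100e-9         # ASSUMPTION: hypothetical AC coupling cap value
VDDIO = 1.784               # measured 1V8_IO rail
V_START = 0.9               # ASSUMPTION: hypothetical bias before reset
V_TARGET = 0.098 * VDDIO    # top of the lowest strap range

t = discharge_time(R_PULLDOWN, C_COUPLING, V_START, V_TARGET)
print(f"reset must be held for at least {t * 1e3:.2f} ms")
```

With real component values plugged in, this gives a lower bound for the reset pulse. Adding margin on top seems wise since the pulldown resistance will have some tolerance.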

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

Well *that* took a lot longer to find than I expected!

With a 4x increased reset pulse duration, the AC coupling caps on the SGMII have enough time to fully discharge and we get correct strap values on all of the SGMII pins.

I was barking up the wrong tree for a while assuming that incorrect MDIO address straps would lead to the device coming up at a different, unintended address, but it would always respond *somewhere*. Since it didn't show up at any address I assumed the problem was elsewhere.

Turns out the ranges don't overlap (in particular mode 3 is up to 0.284x VDDIO and mode 4 is above 0.694x) and I guess if you're in that middle ground it won't work at all, vs coming up in one mode or the other.

Now it comes up and is detected with a valid PHY ID so I can continue the full bringup cycle.

The VSC8512 isn't responding over MDIO either but I'll address that issue separately once I get to it. Probably totally unrelated problem.

Console screenshot showing DP83867 PHY detected with PHY ID = 0x2000 a231 which seems right

ngscopeclient screenshot showing R-C falloff of strap / RXD pin (green) after reset (blue) is asserted
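
The gap between strap ranges is easy to check for when probing. This sketch uses only the two thresholds quoted above (mode 3 tops out at 0.284x, mode 4 starts at 0.694x). The function name is invented for illustration, and the other mode boundaries are deliberately left out since they aren't given in the thread.

```python
MODE3_MAX = 0.284   # upper edge of the mode 3 range, per the thread
MODE4_MIN = 0.694   # lower edge of the mode 4 range, per the thread


def in_mode3_mode4_gap(v_pin, v_io):
    """True if the strap voltage lands in the undefined region."""
    ratio = v_pin / v_io
    return MODE3_MAX < ratio < MODE4_MIN


print(in_mode3_mode4_gap(0.561, 1.784))   # RX_D2 before the fix -> True
print(in_mode3_mode4_gap(0.441, 1.784))   # RX_CTRL -> False
```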

**Andrew Zonenberg** @azonenberg · Jul 22, 2023

Anyway, I guess I'll go back to the original plan of getting SSH up and building out more CLI commands (things like printing out low level phy debug info).

That will probably take the rest of the evening since I have some errands and family weekend stuff to do too.

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

Incoming Ethernet frames are now buffered in the FPGA and read out by the MCU for processing. I might tune the buffer size, it's pretty small for now, but it's usable.

Here's some broadcast traffic on my sandbox network. So far it's just being printed to the UART and not actually being processed by the IP stack, that's the next step.

Then on to transmits.

Serial console screenshot showing hexdumps of incoming raw Ethernet frames

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

Hooked up the IP stack and added some logging hooks to indicate when it tries to transmit.

For now, all outbound frames are dropped because there's no code on the FPGA to actually operate the transmit path (and I haven't even defined the registers for the MCU to send a frame yet).

So that's next.

Console screenshot showing sizes of incoming and outbound Ethernet frames

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

And it's now pingable!

When I try to SSH to it, I hang after sending the SSH2_MSG_KEXINIT message. Unsure if this is an IP stack issue, a crypto driver issue on the STM32, or something else. Will troubleshoot tomorrow.

There seem to be some bugs where it'll enter a bad state and stop responding to pings as well.

Screenshot of the management processor responding to ping

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

The high latency, if you're curious, is mostly because of some blocking UART debug prints in the packet processing path. It doesn't take a whole lot of text at 115.2 kbps to add 10 ms of latency.

So that will speed up a lot in the future.
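
To put a number on "not a whole lot of text", here is a quick sketch of blocking UART transmit time. It assumes 8N1 framing (10 bits on the wire per character), which the thread doesn't state, and the function name is made up.

```python
def uart_tx_time(n_chars, baud=115200, bits_per_char=10):
    """Seconds spent blocking while n_chars are shifted out.

    ASSUMPTION: 8N1 framing, i.e. start + 8 data + stop = 10 bits.
    """
    return n_chars * bits_per_char / baud


# Roughly one and a half lines of an 80-column terminal
print(f"{uart_tx_time(115) * 1e3:.2f} ms")
```

So about 115 characters of debug output per packet is already enough to account for the latency.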

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

Found one of the problems: I cleared the FPGA IRQ line latch after reading the interrupt status register. But I only read a single Ethernet frame per IRQ assertion.

So if two frames show up before I've read the first one, the second one won't get read until a third one shows up, etc.

Eventually enough frames will get forgotten that the buffer fills up and all traffic stops flowing.

Got a trivial fix (don't latch IRQ, it's asserted nonstop if there's data in the buffer) and am building a new bitstream with it.

But soooomeone wanted to go to the park so it's gonna be a while before I'll know if the fix worked...
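
The failure mode is easy to reproduce in a toy model. This is not the actual firmware or RTL, just a sketch of the behavior described above: a latched IRQ that is cleared on read with one frame handled per assertion, versus a level IRQ that stays asserted while the buffer holds data. All names are hypothetical.

```python
from collections import deque


def run(arrivals, level_irq):
    """Simulate the RX path; return frames left stuck in the FIFO.

    arrivals[i] is the number of frames received during time step i.
    """
    fifo = deque()
    latched = False
    for n in arrivals:
        for _ in range(n):
            fifo.append("frame")
            latched = True          # any new frame sets the latch
        irq = bool(fifo) if level_irq else latched
        if irq:
            latched = False         # MCU reads the status register...
            fifo.popleft()          # ...then reads exactly one frame
    return len(fifo)


bursty = [2, 0, 0, 1, 0, 0]
print("latched IRQ leaves", run(bursty, level_irq=False), "frame(s) behind")
print("level IRQ leaves", run(bursty, level_irq=True), "frame(s) behind")
```

An alternative fix would be to keep the latch and drain the buffer in a loop on each IRQ. The level-sensitive approach described in the post needs no firmware change to the read path.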

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

So at this point I've done at least preliminary "hardware isn't catastrophically busticated" bringup on just about everything other than the QSGMII PHY, which was unresponsive on MDIO during a 30 second first attempt but I didn't do any more extensive debug.

Power: voltages all in range, but no ripple measurements yet

Supervisor: working fine, firmware basically done except for watchdog/warm reboot functionality

Main MCU: no issues

Thermal: everything working fine with fans at max RPM. Haven't tried to PWM them yet but they're so quiet at max RPM I may not even bother

10G SFP+: links up and can read EEPROM and DOM

RGMII PHY: fully functional and passing traffic

SGMII PHYs: SGMII links up with e-12 BER and seems to work fine. MDIO alive. BaseT side seems fine on g12, but g13 only links up at 100mbit speeds. Suspecting solder issue but have not investigated yet.

QDR-II+ SRAM: fully functional

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

Took out the debug prints and fixed the IRQ issue, as well as a hardware crypto engine issue.

Now I can ping it with 250us RTT (through a router and two or three switch hops).

And I get further when trying to SSH to it. Now the device kicks me off after the SSH2_MSG_NEWKEYS message. Wonder why?

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

Hmmm. I'm seeing AES-GCM decryption fail, but when I dump out the calculated keys I get the same keys that OpenSSH is trying to use clientside.

Which makes me think it's some subtle difference in behavior between the STM32F7 crypto engine (which I originally wrote this code for) and the STM32H7 crypto engine (which I'm now using). This is going to be fuuuun.

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

Looking at the datasheets side by side it seems key endianness is different. But that wasn't enough to make it work so there's something else going on too.

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

Hexdumping more stuff it seems I'm getting valid decryption of the message now (a SSH_MSG_SERVICE_REQUEST with service of "ssh-userauth").

But the GCM tag is way off.

This is the frustrating thing about debugging crypto. If you get any of the magic incantations even slightly wrong, you get random gibberish with no clue about what went wrong.

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

Progress: it seems I had to endian-swap the length fields in the last GCM block vs the STM32F7. Now it's getting all the way up to the point of the client seeing a SSH_MSG_CHANNEL_SUCCESS that I sent after successful password authentication, but the contents of the packet seem garbled so it aborts.

**Andrew Zonenberg** @azonenberg · Jul 23, 2023

It lives!

Somehow I was sending replies to SSH_MSG_CHANNEL_REQUEST packets by writing *into the incoming packet buffer*, not the reply.

And by dumb luck I guess whatever uninitialized garbage was in the reply buffer happened to resemble a valid SSH_MSG_CHANNEL_SUCCESS message before, but not now? Lol.

Anyway, this is a great success. Kid is up from her nap so that's it for a while, tonight after bedtime I'll hook up the Curve25519 accelerator on the FPGA to speed session creation a bit, then work on a bunch of CLI commands to dump PHY information and such.

screenshot of a successful SSH connection established with the STM32

**Andrew Zonenberg** @azonenberg · Jul 24, 2023

Bumped the optimization level on my firmware up from -O0 to -O2 because creating a SSH session was too slow.

But the FPGA curve25519 accelerator is still over 48x faster than the software implementation. Pretty happy with that :)

Console log showing software curve25519 computation taking 219 ms and hardware accelerated version taking 4.5 ms

**Andrew Zonenberg** @azonenberg · Jul 24, 2023

Ephemeral ECDH key generation and shared secret calculation now use the FPGA accelerator and SSH session creation now feels about as fast as it does when logging into a regular PC.

I can probably extend the same accelerator block (with some minimal tweaks) to also support the public key side of signing, but for now crypto_sign() is still being done entirely in software and only the two crypto_scalarmult() calls in the SSH session creation are accelerated.

Still a massive improvement in responsiveness, it cut about 400ms of latency off session creation.

**Andrew Zonenberg** @azonenberg · Jul 24, 2023

Added some code to poll PHY MDIO register state (not using irq pins yet) and the SGMII PHYs seem less happy. One refuses to report link up over MDIO (even if the LEDs are on and the link partner says it's up), the other reports link up but flaps.

Just when I thought that was working...

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

So I guess the first question is if I'm actually addressing the correct PHY and if it's in fact reporting link flaps. And how the MDIO link state compares to the SGMII autonegotiation state.

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

After some soldering, I think I'm ready to start debugging!

A very crowded lab bench with two differential probes hooked up to a prototype PCB

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

So, here's the basic setup.

Blue and green wires go to the MDIO bus, which is slow enough (2.5 Mbps with very low drive strength) that I'm not worried about reflections off a few inches of flying wire. Standard 10x passive probes clip to the other end of each.

The two black probes are Teledyne LeCroy QuickLink solder in probe tips. One is going to a D420-A and the other to a D1330; both have way more bandwidth than I need to see SGMII clearly.

I'd use my own AKL-PT5 probes for this measurement (well within their capabilities) except that I'd need to AC couple the measurement and somehow I only have one SMA DC block on the shelf right now. That will be rectified by the end of the week.

Photo of a PCB with oscilloscope probes soldered around a QFN packaged Ethernet transceiver

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

Initial observations: The SGMII RX waveform looks decent enough and passes the eye mask required for the FPGA to decode it. I've seen better, but I'm in no hurry to rework the board because of this.

Swing and drive strength on the TX seem a bit excessive, I should probably turn it down. The eye is wide open but the PHY could probably hear the FPGA from the next room over!

Valid MDIO traffic is present. This particular waveform has two packets at the start, then four, then two more.

The MCU reads four registers per polling cycle: basic control and status of PHY 0, then of PHY 1. After polling each PHY, it checks for a link state change and logs a message to the UART.

The long delay after the last packet suggests this is being detected as a link state change.

ngscopeclient filter graph showing the data analysis.

TX and RX pairs are identically processed: recover clock and threshold, eye pattern, 8B/10B decode, then SGMII and autonegotiation decodes on the 8B/10B.

In addition, the MDIO/MDC lines are thresholded and fed into an MDIO protocol decode.

ngscopeclient screenshot showing bursts of activity on the MDIO bus and SGMII eye patterns

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

The SGMII link appears to be up the whole time, and at no point does it fall back into the negotiation state. So the link is probably *not* actually flapping; this smells like a bug on the microcontroller side.

We expect g13 (phy address 1) to be down, and g12 (phy address 0) to be up at 1 Gbps.

Looking at the actual MDIO bus traffic in this capture, we have:

* PHY 1 ctl: 1G/full
* PHY 1 stat: Down
* PHY 0 ctl: 1G/full
* PHY 0 stat: Up
* PHY 1 ctl: 1G/full
* PHY 1 stat: Down
* PHY 0 ctl: 1G/full
* PHY 0 stat: Up

Nothing seems obviously wrong here.

ngscopeclient screenshot showing the first MDIO read: register 0x00 of PHY 0x01 has value 0x1140

ngscopeclient protocol decode dump showing MDIO transactions (see main toot for discussion)

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

OK, this is starting to smell like an FPGA issue.

We know the actual MDIO traffic on the wire is fine, but sometimes we're reading 0x7949 for the basic control register on port 12.

Interestingly, this is the same value we just read from the basic status register on port 13.

And then at 22.420, we read the basic control register for port 13 as 0x116d. 0x6d is the low byte of the value (0x796d) we just read from the basic status register on port 12.

So I think there's some kind of bug in the FPGA MDIO-to-QSPI bridge where sometimes it will return a previous value instead of what was actually read.

Console screenshot of register values being read from PHYs.

The Basic Status register for the active port (12) is 0x796d and the inactive port (13) is 0x7949.

The Basic Control register for both is almost always 0x1140, but is occasionally misread as 0x7949.
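
For anyone following along, these raw values can be decoded by hand using the standard IEEE 802.3 Clause 22 bit layout for registers 0 and 1. This is a minimal sketch that only pulls out the fields discussed here. The function names are illustrative and not taken from the switch firmware.

```python
def decode_basic_control(value):
    """Decode selected fields of Clause 22 register 0."""
    # Speed select is split across bit 6 (MSB) and bit 13 (LSB)
    speed_bits = (((value >> 6) & 1) << 1) | ((value >> 13) & 1)
    speed = {0b00: "10M", 0b01: "100M", 0b10: "1G"}.get(speed_bits, "reserved")
    return {
        "speed": speed,
        "full_duplex": bool(value & (1 << 8)),
        "autoneg_enable": bool(value & (1 << 12)),
    }


def decode_basic_status(value):
    """Decode selected fields of Clause 22 register 1."""
    return {
        "link_up": bool(value & (1 << 2)),
        "autoneg_complete": bool(value & (1 << 5)),
    }


print(decode_basic_control(0x1140))   # both ports
print(decode_basic_status(0x796D))    # port 12
print(decode_basic_status(0x7949))    # port 13
```

Running 0x7949 through the control decoder instead gives nonsense, which is how a misread shows up as a bogus link state change.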

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

New FPGA bitstream with some additional debug logic, as well as changing the FPGA output buffer from DIFF_HSTL_I_DCI_18 to LVDS.

TX data eye measured at the PHY side is still reasonably open, but way lower amplitude than before. I'll double check the spec later but this should be plenty open enough.

ngscopeclient screenshot showing new, lower swing TX eye pattern

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

And here's the bug, caught red handed.

We start with the MDIO transceiver being busy with a read of address 0x00. The read data register is still 0x7949, the previous value, because the read is still in progress.

At T=7862 the MCU begins a 4-word burst read of REG_DP_MDIO (0x004c). This is a 32-bit little endian register with the read value in the low 16 bits, a bunch of write-only configuration, and a busy flag in the MSB.

By T=7887 when we read SPI address 0x004f (where the busy flag is) the read has just finished.

So the MCU thinks it's successfully read the whole register.

The fix is pretty simple: latch the busy flag when address 0x4c is read (the entire 32-bit register has to be read in one go, byte access is not supported). The MCU will then read {busy, 0x7949} just like it did on the previous poll, then read the correct value on the subsequent polling cycle.

Vivado ILA screenshot showing the bug in action. See toot text for description
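
The race can be modeled in a few lines. This is a toy simulation, not the real bridge RTL: the class and function names are invented, and the register is reduced to a 16-bit data field plus a busy flag in bit 31, read one byte at a time.

```python
class MdioReg:
    """Toy model of a 32-bit register read bytewise over SPI."""

    def __init__(self, stale_data, latch_busy):
        self.data = stale_data          # low 16 bits: last MDIO read result
        self.busy = True                # bit 31: transaction in progress
        self.latch_busy = latch_busy    # True = model with the fix applied
        self.busy_snapshot = True

    def finish(self, new_data):
        """The MDIO read completes: data updates and busy clears."""
        self.data = new_data
        self.busy = False

    def read_byte(self, offset):
        if offset == 0:
            self.busy_snapshot = self.busy   # sampled as the burst starts
        busy = self.busy_snapshot if self.latch_busy else self.busy
        word = (int(busy) << 31) | self.data
        return (word >> (8 * offset)) & 0xFF


def burst_read(reg, new_data, finish_after):
    """Read bytes 0..3; the MDIO read ends right after byte finish_after."""
    word = 0
    for offset in range(4):
        word |= reg.read_byte(offset) << (8 * offset)
        if offset == finish_after:
            reg.finish(new_data)
    return word >> 31, word & 0xFFFF


for latch in (False, True):
    busy, data = burst_read(MdioReg(0x7949, latch), 0x1140, finish_after=2)
    print(f"latch_busy={latch}: busy={busy} data={data:#06x}")
```

Without the latch the MCU sees busy=0 alongside the stale 0x7949 and accepts it. With the latch it sees busy=1 and knows to poll again. Moving the completion to just after byte 0 with stale data of 0x796d yields 0x116d in this model, the same kind of torn value seen on the console earlier.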

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

Yay, no more flapping!

Tomorrow's problem: while g12 links up fine at gigabit speed, last time I tried g13 would struggle a bit then come up at 100 Mbps (verified by link partner).

That's probably a hardware issue of some sort since g12 and g13 are supposed to be identically configured. The RJ45 pinout is mirrored because of the tab-up vs tab-down jacks, but that should (famous last words) be fine because the DP83867 has a register to enable ABCD -> DCBA mirroring, which I think I've set correctly.

I made a quick pass over the schematic and nothing seemed otherwise different, it was largely copy-pasted other than the PHYADDR strap pins.

SSH console screenshot showing ports mgmt0 and g12 up at 1 Gbps

Serial console screenshot showing g12 and mgmt0 linking up

**Andrew Zonenberg** @azonenberg · Jul 25, 2023

Yet another command that I wish "real" switches had.

There will of course be fancy commands that include nice detailed decodes of port state. But sometimes there's no substitute for getting close to the metal.

admin@switch# int mgmt0
admin@switch(int-mgmt0)# show reg 0
Register 0x00 = 0x1140
admin@switch(int-mgmt0)# show mmd 0 reg 4
MMD 00 register 0x0004 = 0x0003

**Andrew Zonenberg** @azonenberg · Jul 26, 2023

Well, that was a slightly larger yak than I originally expected but it's been thoroughly shaved.

SSH clients on the switch can now see log messages. For now this is enabled by default, although long term I might have this controlled by a per-unit configuration setting or off by default with a Cisco-style "terminal monitor" command to start seeing log messages.

During development I want ALL the logs so I'll leave it like this for now.

SSH console log showing a command being executed followed by a "not implemented" log message

**Andrew Zonenberg** @azonenberg · Jul 26, 2023

Next step will be to implement some of the commands I copied over (commented out) from the Ethernet tap board, and make any tweaks needed to support the additional PHY chipsets on the board.

In particular, I want to be able to send test patterns out both DP83867's to check for soldering issues before I debug the 100mbit-only link issue further.

**Andrew Zonenberg** @azonenberg · Jul 26, 2023

Ok, I should sleep...

But on the plus side, I have the code to send test patterns working (including the three special test patterns that the DP83867 specifies in addition to the IEEE-defined ones).

Won't be able to actually debug the g13 100mbit issue until tomorrow after work but I should have all the groundwork laid now.

SSH console screenshot showing commands to enable and disable test patterns on a switch port

**Andrew Zonenberg** @azonenberg@ioc.exchange · Jul 26, 2023

Oh I'm sorry you wanted *less* cable spaghetti? I swear you said you wanted *more*. I even bought a new roll of ESD tape to wrangle it all.

Got the baseT test fixture cabled up so I can troubleshoot g13's link issues after work, but didn't have time to collect any data yet.

If you haven't seen it before, this is a handy dandy little gizmo consisting of two RJ45 jacks connected back to back by dual directional couplers.

This gives me 16 SMA outputs with 10 dB attenuated views of each of the 8 wires in the twisted pair cable, seen from both directions. I'm using an 8 channel PicoScope 6824E to look at all 8 lines coming out of the DP83867, ignoring the inbound data from the other side.

Very crowded lab bench with the switch prototype and a sea of probes, cables, and fixtures attached to it

Test fixture with two RJ45 connectors and eight blue coaxial cables coming off one side and feeding into an oscilloscope

Closeup of the test fixture. Two RJ45 jacks connect to each other with bare copper PCB traces. At each end of the path are eight directional couplers to tap off the signal going to the oscilloscope

Closeup of the bench setup showing the switch prototype surrounded by dozens of wires carefully taped to the bench to prevent anything from moving

Jul 26, 2023, 02:51 PM

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

Hmmm.

Set up a test pattern on g13 and expected to see it coming out all pairs of the link, but only seeing it on pair A.

Thought this pointed to a soldering issue, except I'm seeing it on g12 as well (which links up just fine).

So I guess I need to read the datasheet and see if there's a test pattern mux register or something I'm missing...

Screenshot of ngscopeclient showing a 125 MHz sinewave on pair A but nothing on pairs B, C, or D

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

Yep, there is. MMD 0x1f register 0x25, TMCH_CTRL, defaults to only sending the test pattern out pair A.

With that fixed, on g12 I'm seeing the test pattern on all pairs. So we know our register config is correct there.

ngscopeclient screenshot showing jitter test waveforms

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

And now here's what we see on g13. One of these is not like the other.

Probably a solder defect but I'll need to pull the board to investigate. Decabling this will take a while...

ngscopeclient screenshot showing jitter test waveform on all four pairs. Three of the four look nice and uniform, but pair D is weak and distorted.

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

I wish it was a solder defect. The truth is worse.

Not sure how this got through design review...

Schematic screenshot showing D+ on pin 9 and center tap on pin 8

Datasheet screenshot showing D+ on pin 8 and center tap on pin 9

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

Looking at the layout, bodging this is going to be fuuuun.

g10, g8, and g4 have pair D routed on layer 6 of 8. Getting to them (assuming I come from the back of the PCB to avoid desoldering the connector) will mean drilling down 250 μm - annoying but not too bad.

g13, g6, g2, and g0 all have pair D routed on layer 3 of 8. Getting to *this* from the back side will mean drilling down almost 1.3 mm. That will be decidedly less fun.

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

The good news is that I have almost 1mm of width and as much length as I need to play with. There's basically nothing on other layers that I'm likely to hit.

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

And worst case, this isn't a fatal issue for a prototype. Having half the ports only run in 100baseTX mode, or even not work at all, would surely be annoying. But it wouldn't prevent me from using the board as a development platform for the full scale 24 port switch, which was the real goal.

But I'd like to make it fully functional if I can.

Not happening tonight, though. I've got too much else on my plate with time constraints.

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

One good thing about this bodge is that it's going to be hard (I know better than to say impossible) to screw the board up more.

There's no other signals in that area that I could potentially damage, so as long as I don't drop the board while I'm fixturing it or something, it will be no less broken post bodge than before I started. And if all goes well, it'll work better.

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

Actually I might try some fixturing work and a preliminary cut while waiting for stuff to run on another project.

My microscope ring light was too fat to clear so I bodged up an LED headlamp with some tape.

Headlamp precariously taped to the side of a microscope

View through microscope showing area of the planned rework with an 0.25mm endmill hovering just above the board

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

First test cut. Through layers 8 (back) and 7 (ground plane). There's an LED trace on layer 6 we might get close to, but if it's damaged not a huge deal, plenty of other places to reconnect if required. Layers 5 and 4 are power planes we need to not short, then 3 is where the actual bodge will happen.

Microscope view of a rectangular cavity being cut into a PCB between two rows of pins

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

Down to layer 5.

Microscope photo of milled cavity showing copper layer at the floor of the hole

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

First connector (on the DP83867s) bodged. Not attempting the rest (on the VSC8512) until I've brought it up.

Ended up milling all the way down and cutting the track then reconnecting on the surface. There's a small stub off a via which isn't great but it'll probably be fine on a prototype.

I'll save the other six for later. If the phy doesn't work, no point spending time reworking the RJ45s.

Underside of an RJ45 connector with a yellow jumper wire reconnecting a misrouted trace

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

Looks like that fixed it at least.

Console screenshot showing port g13 is working

Jitter test waveform showing good signal on g13

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

Initial signs of life out of the QSGMII PHY!

It's responding to MDIO with the correct address, but twice (?) and at 8 addresses (this is a 12 port PHY). Suspecting a timing issue related to the level shifters on the MDIO bus, but not sure yet. Dropping the MDIO clock frequency by 10x from 2.5 MHz to 250 kHz didn't fix it.

The actual PHY side seems OK, it links up with my laptop on every port I've tried (aside from the known pair D issue on the upper row of ports).

Console screenshot showing VSC8512 device ID detected at PHY addresses 0-7 and 16-23, rather than 0-11 as expected

**Andrew Zonenberg** @azonenberg · Jul 27, 2023

Also whoops I misspoke. The Ethernet test fixture is 16 dB couplers not 10. The directional coupler I use for TDR stuff is 10 dB and I mixed them up.

Too much RF hardware :p

**Andrew Zonenberg** @azonenberg · Jul 28, 2023

Reading the programming guide in the VSC8512 datasheet.

Why??? IEEE has a perfectly well defined way to access up to 2^16 extended registers. You don't need to roll your own way to do it.

Datasheet screenshot showing five pages of PHY registers, muxed by writing a selector value to register 31
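
For comparison, the IEEE-defined route to extended registers from a Clause 22 interface goes through registers 13 and 14 (802.3 Annex 22D). The sketch below shows the access sequence. The `bus` object and its `read`/`write` methods are hypothetical stand-ins for whatever MDIO master is available, and `FakePhyBus` is a toy PHY model that exists only so the example runs. The register contents are placeholders.

```python
MMD_CTRL = 13        # MMD access control register
MMD_DATA = 14        # MMD access address/data register
FUNC_ADDR = 0x0000   # bits 15:14 = 00: register 14 holds the address
FUNC_DATA = 0x4000   # bits 15:14 = 01: register 14 holds data, no increment


def mmd_read(bus, phy, devad, reg):
    """Read a 16-bit MMD register through Clause 22 registers 13/14."""
    bus.write(phy, MMD_CTRL, FUNC_ADDR | devad)
    bus.write(phy, MMD_DATA, reg)
    bus.write(phy, MMD_CTRL, FUNC_DATA | devad)
    return bus.read(phy, MMD_DATA)


class FakePhyBus:
    """Toy single-PHY model, for demonstration only."""

    def __init__(self, mmd_regs):
        self.mmd_regs = dict(mmd_regs)
        self.ctrl = 0
        self.addr = 0

    def write(self, phy, reg, value):
        if reg == MMD_CTRL:
            self.ctrl = value
        elif reg == MMD_DATA:
            if (self.ctrl & 0xC000) == FUNC_ADDR:
                self.addr = value
            else:
                self.mmd_regs[(self.ctrl & 0x1F, self.addr)] = value

    def read(self, phy, reg):
        if reg != MMD_DATA:
            return 0
        if (self.ctrl & 0xC000) == FUNC_ADDR:
            return self.addr
        return self.mmd_regs.get((self.ctrl & 0x1F, self.addr), 0)


bus = FakePhyBus({(0x1F, 0x0025): 0xBEEF})   # placeholder contents
print(hex(mmd_read(bus, phy=0, devad=0x1F, reg=0x0025)))
```

Each MMD gets a full 16-bit address space this way, with no vendor-specific page register needed.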

**Andrew Zonenberg** @azonenberg · Jul 28, 2023

Loaded an FPGA bitstream that instantiates the QSGMII transceivers on the FPGA.

Power consumption climbed to 12.7W and the FPGA die temperature is up to 48.5C.

The 1V0 rail for the GTXes is sagging to 975.5 mV under load, since it's just pi filtered off of the main FPGA 1V0 rail without an independent remote sense. This is within spec... barely. But definitely something I will want to work on in the future. The full LATENTRED switch (with eight transceivers) will definitely need a dedicated SERDES power rail with independent regulation.

The FPGA 1V0 rail is doing just fine, 1.0015V at the test point and 0.996V measured by the on die ADC.

ngscopeclient screenshot showing GTX_1V0 rail sagging to 975.5 mV under load

**Andrew Zonenberg** @azonenberg · Jul 28, 2023

The thermal pad and heatsink pressure seem fine. Heatsink surface temperature is only 5C below die temperature so not much of a gradient there.

Thermal image of FPGA heatsink showing 44C surface temperature

Thermal image of FPGA heatsink showing 43.6C surface temperature

**Andrew Zonenberg** @azonenberg · Jul 28, 2023

FPGA logic reports none of the QSGMII links are up.

Not entirely surprising since I've never actually tested the QSGMII block in hardware, but still a bit annoying.

I think that's it for today. Tomorrow I'll decable the whole setup (again), and probably try to bodge one or more of the VSC8512 RJ45s as long as I have it off the bench.

Then get test leads on the VSC8512 MDIO bus (to see if anything funky is happening with timing there, I still can only talk to 8 of the 12 PHYs... might be a register misconfiguration too though), and probably land a high BW probe on one or more of the QSGMII lanes to see what's happening with that.

**Andrew Zonenberg** @azonenberg · Jul 28, 2023

Quick handheld probe measurement off the QSGMII TX line from the FPGA.

Definitely some logic bugs, we're supposed to have K28.1 in lane 0 and all I'm seeing is K28.5.

The eye (measured at the PHY side of the coupling capacitor) is pretty wide open, but I will definitely want to tweak driver settings given the closure in the right half. Need to check this against the QSGMII eye mask but I don't have the specs for that in ngscopeclient yet (also a job for tomorrow).