Resolution
Thanks for all the help!
After a lot of improvements to the system, I was able to get the SPI bus throughput test on the RK3588 to a consistent 21.5 Mbit/s.
Wrote 233600 bytes in 86 ms
Estimated IO upper bound: 21728 kbps
Wrote 233600 bytes in 86 ms
Estimated IO upper bound: 21728 kbps
Wrote 233600 bytes in 87 ms
Estimated IO upper bound: 21480 kbps
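For anyone checking these numbers: the reported figures are consistent with truncated bytes-per-millisecond multiplied by 8. That is an inference from the output, not something confirmed from the test's source, so treat this as a sketch. The function name is made up for illustration; the packet size, batch count and round count come from the full test output further down.

```python
def estimated_kbps(bytes_written: int, elapsed_ms: int) -> int:
    """Assumed formula: integer bytes per millisecond, times 8 bits, gives kbit/s."""
    return (bytes_written // elapsed_ms) * 8


# 1460-byte packets x 16 batches x 10 rounds, as listed in the full test output
total_bytes = 1460 * 16 * 10

print(total_bytes)                      # 233600
print(estimated_kbps(total_bytes, 86))  # 21728
print(estimated_kbps(total_bytes, 87))  # 21480
```

A single millisecond of difference in the measured time moves the estimate by about 250 kbps, so the 21728 vs 21480 spread is just timer granularity.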
Mostly documenting this for anyone who runs into similar issues in the future.
The trace shared by @ajudge above was hugely valuable as a baseline for me to know where I should be looking for optimizations.
I should note that in my earlier testing there was a lot of variability in the throughput reported by the SPI bus test, ranging from 10 to 15 Mbit/s, with most of the results around 12 Mbit/s.
UDP iperf3 tests were getting 10-15 Mbit/s on TX, with RX being particularly bad (around 6-8 Mbit/s).
First tranche of changes
| system | improvement |
|---|---|
| SPI driver | disable all power management functionality |
| SPI driver | enable the “rockchip,rt” DT parameter |
| DMAC driver | disable all power management functionality |
| Morse driver | align SPI transactions that use DMA |
| CPU frequency governor | use “performance” instead of “schedutil” (duh) |
At this point I was seeing anywhere between 16 and 18 Mbit/s, with the occasional 20 Mbit/s.
The iperf UDP tests were showing similar results, with both RX and TX around 17-18 Mbit/s.
Most of the problems here were spikes in transaction delays, plus unnecessary delays between CS going active and the clock starting, between the clock stopping and CS going inactive, and between CS active periods (scheduling related).
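The governor change is the easiest one in that table to reproduce. A minimal sketch, assuming the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); the function name and the `cpu_root` parameter are only there for illustration and so it can be tried against a scratch directory first.

```python
from pathlib import Path


def set_governor(governor: str = "performance",
                 cpu_root: Path = Path("/sys/devices/system/cpu")) -> list[Path]:
    """Write `governor` to every per-CPU scaling_governor node found under cpu_root."""
    changed = []
    for node in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        node.write_text(governor)
        changed.append(node)
    return changed


# On the target this needs root, e.g.:
#   for node in set_governor("performance"):
#       print(node, "->", node.read_text().strip())
```

This does not persist across reboots, so it would have to be rerun at boot or replaced by whatever mechanism the distro provides.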
Second tranche of changes
| system | improvement |
|---|---|
| IRQ pinning | move the DMA and SPI IRQs to a dedicated big core each |
| IRQ pinning | move other IRQs off cores 4 and 5 |
| CPU states | disable sleep states on CPUs 4 and 5 |
This got me to a consistent 21 Mbit/s on the SPI bus tests and 20 Mbit/s TX and RX on the iperf tests.
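The pinning and idle-state changes can be sketched roughly as below. This assumes the usual `/proc/irq/<n>/smp_affinity_list` and `/sys/devices/system/cpu/cpuN/cpuidle/stateM/disable` interfaces. The helper names are hypothetical, and so are the `"spi"` / `"dma"` match strings in the usage comment: check `/proc/interrupts` on your own board for the real device names.

```python
from pathlib import Path


def find_irqs(interrupts_text: str, needle: str) -> list[int]:
    """Return IRQ numbers from /proc/interrupts-style text whose line mentions `needle`."""
    irqs = []
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        if sep and head.strip().isdigit() and needle in rest:
            irqs.append(int(head.strip()))
    return irqs


def pin_irq(irq: int, cpu: int, proc_root: Path = Path("/proc")) -> None:
    """Restrict one IRQ to a single CPU via smp_affinity_list."""
    (proc_root / "irq" / str(irq) / "smp_affinity_list").write_text(str(cpu))


def disable_idle_states(cpu: int,
                        cpu_root: Path = Path("/sys/devices/system/cpu")) -> int:
    """Write 1 to every cpuidle state's `disable` node for one CPU; return how many."""
    count = 0
    for node in sorted((cpu_root / f"cpu{cpu}" / "cpuidle").glob("state[0-9]*/disable")):
        node.write_text("1")
        count += 1
    return count


# On the target (as root), something like:
#   text = Path("/proc/interrupts").read_text()
#   for irq in find_irqs(text, "spi"): pin_irq(irq, 4)   # "spi" is a placeholder name
#   for irq in find_irqs(text, "dma"): pin_irq(irq, 5)   # "dma" is a placeholder name
#   for cpu in (4, 5): disable_idle_states(cpu)
```

Moving the remaining IRQs off cores 4 and 5 is the same `pin_irq` call with a different target CPU; some IRQs refuse to be moved and the write will fail for those.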
Last improvement: up to 21.5 Mbit/s in iperf
All the 16-byte transactions for anything above 1536 bytes waste a lot of time on the bus.
This may be cheating, but changing the MTU from 1500 to 1486 removes an entire SPI transaction after the block transfer, which helps a little. (I'm not sure how well that holds up in real-world scenarios.)
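A rough way to see why 14 bytes can matter, under assumptions that are not confirmed anywhere in this thread: bulk data moves in 512-byte blocks (inferred only from 1536 = 3 x 512), and whatever is left over goes out in 16-byte transactions. The 1564-byte figure is the SPI test's 1460 + 102 + 2 and is used purely as an illustrative on-bus size, not the real size of a 1500-MTU frame.

```python
BLOCK_BYTES = 512  # assumption: bulk block size, inferred from 1536 = 3 * 512
TAIL_BYTES = 16    # trailing transaction size mentioned in the post


def split_transfer(frame_bytes: int) -> tuple[int, int]:
    """Return (full blocks, trailing 16-byte transactions) for one frame."""
    blocks, remainder = divmod(frame_bytes, BLOCK_BYTES)
    tails = -(-remainder // TAIL_BYTES)  # ceiling division
    return blocks, tails


illustrative_frame = 1460 + 102 + 2  # packet + overhead + padding from the SPI test

print(split_transfer(illustrative_frame))       # (3, 2)
print(split_transfer(illustrative_frame - 14))  # (3, 1): MTU lowered by 1500 - 1486 = 14
```

With these numbers the tail drops from two trailing transactions to one, which matches the "one fewer transaction" observation, but the real framing may differ.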
Our final iperf3 test now shows:
$ iperf3 -c 192.168.69.1 -u -b 30M -t 30 -p 5202
Connecting to host 192.168.69.1, port 5202
[ 5] local 192.168.69.205 port 39012 connected to 192.168.69.1 port 5202
[ ID] Interval Transfer Bitrate Total Datagrams
[ 5] 0.00-1.00 sec 2.56 MBytes 21.5 Mbits/sec 1874
[ 5] 1.00-2.00 sec 2.60 MBytes 21.8 Mbits/sec 1898
[ 5] 2.00-3.00 sec 2.52 MBytes 21.2 Mbits/sec 1844
[ 5] 3.00-4.00 sec 2.52 MBytes 21.1 Mbits/sec 1841
[ 5] 4.00-5.00 sec 2.55 MBytes 21.4 Mbits/sec 1866
[ 5] 5.00-6.00 sec 2.42 MBytes 20.3 Mbits/sec 1772
[ 5] 6.00-7.00 sec 2.42 MBytes 20.3 Mbits/sec 1768
[ 5] 7.00-8.00 sec 2.59 MBytes 21.7 Mbits/sec 1893
[ 5] 8.00-9.00 sec 2.52 MBytes 21.1 Mbits/sec 1843
[ 5] 9.00-10.00 sec 2.49 MBytes 20.9 Mbits/sec 1821
[ 5] 10.00-11.00 sec 2.50 MBytes 21.0 Mbits/sec 1827
[ 5] 11.00-12.00 sec 2.54 MBytes 21.3 Mbits/sec 1855
[ 5] 12.00-13.00 sec 2.53 MBytes 21.2 Mbits/sec 1848
[ 5] 13.00-14.00 sec 2.58 MBytes 21.7 Mbits/sec 1888
[ 5] 14.00-15.00 sec 2.52 MBytes 21.1 Mbits/sec 1841
[ 5] 15.00-16.00 sec 2.60 MBytes 21.8 Mbits/sec 1899
[ 5] 16.00-17.00 sec 2.49 MBytes 20.9 Mbits/sec 1822
[ 5] 17.00-18.00 sec 2.51 MBytes 21.1 Mbits/sec 1838
[ 5] 18.00-19.00 sec 2.52 MBytes 21.1 Mbits/sec 1842
[ 5] 19.00-20.00 sec 2.56 MBytes 21.5 Mbits/sec 1873
[ 5] 20.00-21.00 sec 2.52 MBytes 21.2 Mbits/sec 1844
[ 5] 21.00-22.00 sec 2.50 MBytes 20.9 Mbits/sec 1825
[ 5] 22.00-23.00 sec 2.50 MBytes 21.0 Mbits/sec 1829
[ 5] 23.00-24.00 sec 2.52 MBytes 21.1 Mbits/sec 1840
[ 5] 24.00-25.00 sec 2.58 MBytes 21.6 Mbits/sec 1884
[ 5] 25.00-26.00 sec 2.59 MBytes 21.8 Mbits/sec 1897
[ 5] 26.00-27.00 sec 2.50 MBytes 21.0 Mbits/sec 1829
[ 5] 27.00-28.00 sec 2.51 MBytes 21.1 Mbits/sec 1835
[ 5] 28.00-29.00 sec 2.50 MBytes 21.0 Mbits/sec 1831
[ 5] 29.00-30.00 sec 2.49 MBytes 20.9 Mbits/sec 1821
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[ 5] 0.00-30.00 sec 75.7 MBytes 21.2 Mbits/sec 0.000 ms 0/55388 (0%) sender
[ 5] 0.00-30.04 sec 75.7 MBytes 21.2 Mbits/sec 0.578 ms 0/55388 (0%) receiver
Further improvements
As mentioned above, the inter-block delay could be reduced, but that likely won't make a huge impact.
Full SPI test:
Bus IO write estimator
packet size (bytes): 1460
overhead (bytes): 102
padding (bytes): 2
batch(es): 16
rounds: 10
Wrote 233600 bytes in 86 ms
Estimated IO upper bound: 21728 kbps
Bus timing profiler
packet size (bytes): 1460
overhead (bytes): 102
padding (bytes): 2
rounds: 16
timing (us)
bus claim : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
bus release: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
read 32 : 26 27 26 26 26 26 24 26 26 26 26 26 26 26 25 26
read bulk : 458 509 462 456 453 454 453 510 503 455 459 455 455 454 463 455
write 32 : 28 27 26 27 26 27 26 26 26 26 26 26 39 593 36 48
write bulk : 458 454 552 457 455 452 451 451 452 451 457 496 544 456 457 507
SKB allocation profiler (100 skbs w/ 1562 bytes)
alloc: 44 us
free: 45 us
Bus IO write estimator
packet size (bytes): 1460
overhead (bytes): 102
padding (bytes): 2
batch(es): 16
rounds: 10
Wrote 233600 bytes in 86 ms
Estimated IO upper bound: 21728 kbps
Bus timing profiler
packet size (bytes): 1460
overhead (bytes): 102
padding (bytes): 2
rounds: 16
timing (us)
bus claim : 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
bus release: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
read 32 : 31 27 26 26 26 26 26 26 25 26 26 26 26 26 22 26
read bulk : 459 459 469 506 475 518 459 508 458 486 464 461 514 458 473 454
write 32 : 29 29 29 43 29 29 29 29 43 30 29 29 29 71 27 27
write bulk : 455 452 468 463 452 540 455 499 455 456 455 457 464 466 497 456
SKB allocation profiler (100 skbs w/ 1562 bytes)
alloc: 49 us
free: 45 us
Bus IO write estimator
packet size (bytes): 1460
overhead (bytes): 102
padding (bytes): 2
batch(es): 16
rounds: 10
Wrote 233600 bytes in 87 ms
Estimated IO upper bound: 21480 kbps
Bus timing profiler
packet size (bytes): 1460
overhead (bytes): 102
padding (bytes): 2
rounds: 16
timing (us)
bus claim : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
bus release: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
read 32 : 27 27 26 26 26 26 26 26 26 26 26 26 26 26 25 26
read bulk : 457 454 454 453 454 452 453 454 453 454 454 454 453 454 506 503
write 32 : 27 26 26 26 26 26 26 26 26 27 26 26 26 26 26 40
write bulk : 455 451 452 452 451 452 451 452 451 451 452 451 451 451 452 451
SKB allocation profiler (100 skbs w/ 1562 bytes)
alloc: 49 us
free: 43 us
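To make the remaining spikes in the profiler rows easier to spot, here is a small sketch that summarizes a row by median and flags outliers. The two sample rows are copied from the output above (the "write 32" row of the first run and the "write bulk" row of the third run); the 1.5x-median spike threshold is an arbitrary choice for illustration.

```python
from statistics import median

# "write 32" row from the first run, in microseconds
write_32_us = [28, 27, 26, 27, 26, 27, 26, 26, 26, 26, 26, 26, 39, 593, 36, 48]
# "write bulk" row from the third run, in microseconds
write_bulk_us = [455, 451, 452, 452, 451, 452, 451, 452,
                 451, 451, 452, 451, 451, 451, 452, 451]


def summarize(samples: list[int], spike_factor: float = 1.5) -> dict:
    """Median, max, and the samples exceeding spike_factor times the median."""
    mid = median(samples)
    return {
        "median": mid,
        "max": max(samples),
        "spikes": [s for s in samples if s > spike_factor * mid],
    }


print(summarize(write_32_us))    # median 26.5, max 593, spikes [593, 48]
print(summarize(write_bulk_us))  # median 451.0, max 455, no spikes
```

The third run's bulk writes are essentially flat, while the first run still shows a single 593 us outlier on a 32-bit write.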

