- Matrix multiplication latency: 1 µs; end-to-end latency (write to FPGA memory, move to edge registers, perform the matmul, read back): 0.042003 seconds.
This output-stationary 8x8 systolic array is implemented in Verilog/SystemVerilog using Xilinx Vivado. The firmware is written in C using Xilinx Vitis and issues write, load, perform, and read instructions to the FPGA. This is the first release; further improvements and optimizations to both the hardware and the software are planned for the near future.
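To illustrate what "output stationary" means, here is a minimal software sketch of the dataflow, not the project's RTL or firmware. Each processing element (PE) keeps its accumulator in place while operands from A move left to right and operands from B move top to bottom, entering the edges with a one-cycle skew per row or column. The function name `systolic_matmul_os`, the `int32_t` element type, the row-major flat buffers, and the `3*N - 2` cycle count are assumptions of this model; the actual design may use a different data width and pipeline depth.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N 8 /* array dimension, matching the 8x8 array described above */

/*
 * Hypothetical cycle-by-cycle model of an output-stationary systolic array.
 * a, b, c are row-major N*N buffers (assumed layout). PE(i,j) accumulates
 * c[i][j] in place; A operands flow right, B operands flow down.
 * Returns the number of simulated cycles. Products are assumed to fit in
 * int32_t (no overflow handling in this sketch).
 */
int systolic_matmul_os(const int32_t *a, const int32_t *b, int32_t *c)
{
    int32_t a_reg[N][N]; /* A operand currently held by each PE */
    int32_t b_reg[N][N]; /* B operand currently held by each PE */
    int cycles = 3 * N - 2; /* last useful MAC happens at t = 3N - 3 */

    assert(a != NULL && b != NULL && c != NULL);
    memset(a_reg, 0, sizeof a_reg);
    memset(b_reg, 0, sizeof b_reg);
    memset(c, 0, sizeof(int32_t) * N * N);

    for (int t = 0; t < cycles; ++t) {
        /* Shift phase: walk from the far corner so every PE reads its
         * neighbour's value from the previous cycle. */
        for (int i = N - 1; i >= 0; --i) {
            for (int j = N - 1; j >= 0; --j) {
                if (j == 0) {
                    /* Left edge: row i is delayed by i cycles. */
                    int k = t - i;
                    a_reg[i][j] = (k >= 0 && k < N) ? a[i * N + k] : 0;
                } else {
                    a_reg[i][j] = a_reg[i][j - 1];
                }
                if (i == 0) {
                    /* Top edge: column j is delayed by j cycles. */
                    int k = t - j;
                    b_reg[i][j] = (k >= 0 && k < N) ? b[k * N + j] : 0;
                } else {
                    b_reg[i][j] = b_reg[i - 1][j];
                }
            }
        }
        /* MAC phase: the partial sum never leaves its PE. */
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                c[i * N + j] += a_reg[i][j] * b_reg[i][j];
    }
    return cycles;
}
```

Because of the skew, PE(i,j) sees `a[i][k]` and `b[k][j]` together at cycle `t = i + j + k`, so after the final cycle `c` holds the full product. A model like this can serve as a golden reference when checking the values read back from the FPGA.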