docs/todo.txt

- migrate memory controller to AXI4
- learn how an axi4 interface for pcie might look
- migrate gpu controller to have an AXI4 interface to a pcie placeholder
- get working with pytorch, at least for something simple, like adding 1 to an int tensor
- migrate to use llvm assembler instead of assembler.py?
- make a general single-source 'compile' script (?)
- split `global_mem_controller.sv` into `global_mem_controller.sv` and `global_mem.sv`
- rename 'gpu_runtime' to 'gpu_runtime_and_simulator'
- add simulated latency to global_mem_controller for calls from gpu_controller
- check if my coding guidelines talk about `n_` notation
- check what does 'zero' mean in riscv32?
- update coding guidelines now we are using verilator
- update timings for generated 32bit mul, in int mul .md
- maybe eliminate the pos + 1 in multiplication code, by various heuristics?
- reduce int division unit dpt
- parallel instruction execution
- floats:  div, modulus, sub, beq, glt
- add piccie for gate level simulation to readme/docs
- create register file
- fuse LUI and ADDI
- fuse LUI and JALR
- fuse AUIPC and JALR
- fuse DIV and MOD
- implement fmad?
- instruction cache
- level 1 cache
- memory can have byte access (via cache)
- check endian-ness relative to risc-v spec
- check spir-v vs risc-v
- gpu structure
- change bits of address to 64-bit, to allow 16GB+ of memory
- (figure out a way to have asserts displayed over if it is the last run through the comb block before the flip-flops flip?)
- (maybe create a lint checker that tests taht if statements have)
- (add tests to bheavioral tests, that feed in `x`, and check caught somehow?)
- (out of order execution?)
- (use actual memory module?)

things I need that not thinking of implementing myself in short-term
(i.e. opportunity to implement yourself if you are interested :) )
- PCIe4 interface, to receive commands from computer's CPU
- DDR4 controller, to control the global GPU off-chip memory
- Network on a chip (NoC)

done:
- add `default_nettype none to each module
- create prototype for memory controller for axi4
- learn AXI4
- create new docker image for latest branch of llvm
- move runtime out of prot into src (do this once tests are passing in current location first...)
- fix single source runtime, so we only assemble each kernel once, and then cache that
- added single source matrix_mul
- added single source mul_float
- added single source sum_float
- get floats working in single-source
- added mul_ints
- fix tests for single source. need to figure out llvm installation in container
- update diagrams with new file names
- make launch_kernel.cpp able to create li an, [number] code
- look into compilng and running gpu kernel
- look into running verilator without using cmake
- add CI check for broken doc links ,and fix the broken doc links
- be able to compile a simple kernel, in single source, to riscv32 assembler
- be able to pass data backwards and forwards between c++ and hardware
- [older stuff truncated]

things for gpu:
- get_local_id(), get_global_id()
- barrier
- global mem
- local mem

things to ponder:
- how to read/write registry locking vector?
- use arbiter to request write access to registry file one cycle before writing?
- cases where we want to write:
   - LUI. have the data immediately, no need to wait for anything...
      - mind you, it's usually associated with an ADDI next. can we fuse them? (or bypass?)
   - ADDI. need a cycle to load rs
   - LOAD. need lots of cycles to retrieve from memory. we dont know in advance which cycle the data will arrive, but it
     takes so long, so an extra cycle, who cares?
     - for reading from cache, it's faster, but we probably know better when the data will arrive? (?)
   - DIV: we know when the data will arrive, since we count down the pos
   - various other immediate ops: need at least one cycle to load rs1, rs2