Processor:
✔ Add exceptions @done (20-02-25 23:57)
✘ CSR for unaligned info? @cancelled (20-02-25 23:57)
☐ Non-returning hardfault exception for e.g. exception during exception
☐ Sweep up all the unhandled exception cases (e.g. proper speculation of bus faults through from instruction fetch)
☐ Add and test timer IRQ
☐ Check CSR compliance
☐ Need to populate ID fields?
☐ Instruction retire counter!
☐ Compliant debug support (must run with upstream openocd)
☐ Support for M extension
✔ Create divide/multiply circuit @done (20-02-26 00:00)
✔ Integrate into pipeline, pass compliance tests @done (20-02-28 21:05)
✔ Kill ongoing div/mul on IRQ raise @done (20-03-02 00:45)
☐ Fuse MULH/MUL and DIV/REM sequences
✔ Get riscv-formal M-extension checks passing with ALTOPS enabled @done (20-03-02 00:44)
☐ Get all riscv-formal rv32imc checks passing with muldiv instantiated
☐ Standalone formal test for mul/div results
☐ Support two-bus-master variant
☐ Write separate top levels for 1mst and 2mst
☐ Bring up compliance tests on 2mst
☐ Bring up riscv-formal on 2mst
☐ Yosys synth check for through-paths between the two busmasters
☐ Bring up riscv-formal CSR checks
☐ Integrate riscv-formal into regressions, perhaps optionally.
☐ Iterate on decode synths to try and pack gates down
Still an open issue. Have investigated the instruction decompressor and it's no larger than the crazy hand-Espresso'd one in SWeRV. Rest of decode potentially still has some savings in the non-RV-specific parts.
☐ Port more advanced/larger C testcases
☐ Identify other sources of high gate count (general golfing)
Graphics:
DMA:
✘ Spec it! @cancelled (20-02-26 00:06)
Not going to have room for a DMA. Processor memcpy is reasonably fast. Realistically we are already bottlenecked on external SRAM.
Other Chip:
☐ More aggressive busfabric verification (UVM? Formal?)
☐ Audio
Initially, some FIFOs and simple PWM.
Can then add e.g. interpolator and improved modulation. Not critical.
PCB:
☐ Suitable power domain crossing to allow CP2102 to be unpowered while RISCBoy is running from battery (sucks down many mA)
___________________
Archive:
✘ Support bursts in AHB-lite fabric @cancelled (20-02-26 00:07) @project(Other Chip)
Useful for later projects with caches + SDRAM, not for this one.
✘ GPIO: multiple-pad peripheral inputs @cancelled (20-02-26 00:07) @project(Other Chip)
(not totally critical for this project)
✔ Support partial crossbars in ahbl_crossbar @done (20-02-26 00:07) @project(Other Chip)
✔ Bootloader supporting flash erase/program via UART @done (20-02-26 00:05) @project(PCB)
✘ Consider hardware unaligned load/store support @cancelled (20-02-25 23:58) @project(Processor)
(e.g. perform just byte accesses; just has to be better than software).
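Sketch of the "just byte accesses" idea, for reference: a behavioural Python model only, not RTL and not anything that was implemented. It assumes little-endian byte order (as RISC-V uses) and models memory as a flat bytearray; the function names are made up for illustration.

```python
def load_word_bytewise(mem, addr):
    """Assemble a 32-bit little-endian word from four single-byte reads.

    Works for any addr, aligned or not, because no access is wider than a byte.
    """
    value = 0
    for i in range(4):
        value |= mem[addr + i] << (8 * i)
    return value


def store_word_bytewise(mem, addr, value):
    """Split a 32-bit word into four single-byte writes, least significant first."""
    for i in range(4):
        mem[addr + i] = (value >> (8 * i)) & 0xFF
```

Cost is four bus accesses per unaligned word, which is the bar a hardware version would have had to beat relative to a software trap handler.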
✘ Consider a generic extension interface which could also be used by M extension @cancelled (20-02-25 23:58) @project(Processor)
YAGNI
✔ Add CSR support, with parameters to disable. @low @done (20-02-25 23:58) @project(Processor)
✔ Why does float printf still not work :D @done (20-02-25 23:57) @project(Processor)
Everything else I try seems to pass!
There are some questionable things in the code, e.g. performing a
sh on some address, and an lw immediately afterward, which
causes Xs to get written to register file. Definitely needs investigation.
✘ Investigate negedge read for register file to improve timing @cancelled (19-04-29 18:57) @project(Processor)
On hold at the moment; this is not on the critical path on iCE40.
✔ Work through riscv-formal tests and fix all the bugs that crop up. Do a detailed writeup. @done (19-04-29 18:55) @project(Processor)
✔ Implement RVFI for use with riscv-formal @done (19-04-29 18:55) @project(Processor)
✔ Is the jump target adder in X really necessary? For JALR we can use the ALU. For branches it's PC-relative, so could be computed in D (and we are probably already computing this.) @done (19-04-19 03:55) @project(Processor)
Did this optimisation. It's not a *huge* win due to the extra muxes inserted, but still a win.
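For reference, the two target calculations being separated here, as a behavioural Python sketch (not the RTL; function names are hypothetical). JALR needs a register operand, so it suits the ALU in X; a branch target depends only on PC and the immediate, so it can be formed in D.

```python
MASK32 = 0xFFFFFFFF


def jalr_target(rs1, imm):
    """JALR target: rs1 + sign-extended imm, with bit 0 cleared (per the RISC-V spec)."""
    return (rs1 + imm) & MASK32 & ~1


def branch_target(pc, imm):
    """Branch/JAL target: PC-relative, so it needs no register read."""
    return (pc + imm) & MASK32
```

imm is passed as an already sign-extended Python int, so negative offsets work directly.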
✔ Stick some traffic generation into the execution tb so that stall logic can be stressed a little. Separately re-run the compliance tests with this enabled. @done (19-04-19 03:54) @project(Processor)
✔ Fix this: @done (19-04-19 03:53) @project(Processor)
Currently early jumps are performed whenever the CIR contains a valid JAL or backward branch. This interacts poorly with X's stall logic: if X is stalled AND the jump request goes through (jump requests can be accepted whilst hready is low; just goes into fetch counter and subsequent cycles perform normal fetch attempts from the now registered jump target) then the CIR will be clobbered. This is an issue for JALR, which will lose its side effects, and is particularly an issue for branch mispredicts.
Trying to hold off the jump request using X's stall signal seems a no-go, because this is a function of hready, and we've just created a path from this to haddr etc.
A different solution is to somehow protect the CIR so that the instruction currently in D will eventually progress down the pipeline and have its intended side effect. Should use this same protection signal to gate the jump request. The frontend can still be fetching during this time, but the data will just stack up in the FIFO.
Note this also enables an optimisation floated a while back which saves a cycle when a load/store is followed by a branch/JAL: boost the priority of early jump accesses over load/store accesses. Jumps create a fetch bubble due to pipelined bus operation, but if the jump access occurs before that of a load/store occupying X, the load/store address phase will be aligned with the fetch bubble, and fill it. When this happens the jump is still sitting in D, but its jumpiness has been defused, and it will meekly follow the load/store down the pipeline to produce its secondary side effect.
Fixed by CIR locking mechanism.
✘ SD controller @cancelled (19-04-19 03:49) @project(Other Chip)
Can be a simple SPI, or more sophisticated e.g. some kind of XIP with cache,
or manual paging into internal buffers which can then be random-accessed,
or protocol-oriented but with FIFO interface rather than internal buffers
(i.e. optimised for streaming into main memory)
update: removing SD card; want to keep board simple and low-BoM-cost. Will use a large (Q)SPI flash, and use sw bootloader on the FPGA to update this over UART shell, kind of like Arduino. Bitstream and user programs stored in same flash.
✔ Remove wait states from sync SRAM controller @done (19-04-19 03:48) @project(Other Chip)
✔ Proper async SRAM controller @done (19-04-19 03:48) @project(Other Chip)
After some thinking, this seems to break down into two parts:
✘ A generic AHB-lite narrower that translates wide transactions on one bus into multiple transactions on another bus @cancelled (19-04-19 03:47) @project(Other Chip)
Decided not to do this as 2:1 ratio has special properties (e.g. ability to double-pump reads) so not much use in a fully generic solution at the moment
✔ A simple async sram controller where the SRAM width is the same as the data bus width @done (19-04-19 03:47) @project(Other Chip)
✔ Factor out decode to separate module @done (18-11-30 08:31) @project(Processor)
✔ ALU: shared adder for add sub, and LT/LTU changes this necessitates @done (18-11-30 08:31) @project(Processor)
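A minimal behavioural sketch of what a shared add/sub adder with derived LT/LTU can look like. This is an assumption about the general technique, written in Python for checking the logic, and is not taken from the actual ALU source; the op names and function names are invented for illustration.

```python
MASK32 = 0xFFFFFFFF


def shared_adder(a, b, sub):
    """Single adder for ADD and SUB: a - b is computed as a + ~b + 1."""
    b_in = (b ^ MASK32) if sub else b
    total = a + b_in + (1 if sub else 0)
    return total & MASK32, (total >> 32) & 1  # (32-bit sum, carry out)


def alu(op, a, b):
    a &= MASK32
    b &= MASK32
    if op == "add":
        return shared_adder(a, b, False)[0]
    # sub, lt and ltu all reuse the same subtraction
    diff, carry = shared_adder(a, b, True)
    if op == "sub":
        return diff
    if op == "ltu":
        # No carry out of a + ~b + 1 means a borrow, i.e. a < b unsigned
        return 1 - carry
    if op == "lt":
        sign_a, sign_b = a >> 31, b >> 31
        # Signs differ: the negative operand is smaller.
        # Signs equal: the subtraction cannot overflow, so use its sign bit.
        return sign_a if sign_a != sign_b else diff >> 31
    raise ValueError(op)
```

The point of the arrangement is that the comparisons cost only a few gates on top of the adder rather than a separate comparator.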