Vector Extension Support Design #89
-
Hi @aarongchan
-
Current design/thoughts: For out-of-order: Assumptions: Need to rename/track the following:
These have just been some of my thoughts. Additionally, @sequencer posted about https://github.com/chipsalliance/t1 in the issue for this, so we could model our implementation after theirs. I would want to spend some time walking through their design first, but if we follow it, we have a nice hardware implementation to model after.
-
Basically, from my point of view, when designing a Cray-like vector machine, renaming might be the wrong direction:
In T1, for
-
Great set of requirements/initial ideas! @aarongchan: small correction on your notes above. The following statement is more representative:
Regarding shortened AVL: to keep things simple, we can assume that vector operations are always tail-agnostic (i.e. filled with 1s, like you mention). If you do decide to support undisturbed, however, remember that you will need the original source PRF of the destination as one of the sources. So to extend your example

Some other considerations: those vector-scalar operations will need their source scalar value satisfied before progressing in any design you pick. Because those scalar values come OoO from either the middle machine (IEX/FEX) or load/store, register read ports for float/int will be required before the vector operation can progress. Fortunately, for LMUL > 1 cases, it's the same scalar value. 😉

Good catch on the gather/reductions. They are ... tricky.

Looking forward to your design/idea overview next meeting.
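As a rough illustration of the undisturbed case described above, here is a minimal Python sketch of which registers a renamer would need to treat as sources. The `VectorOp` class and `sources_to_track` function are hypothetical names made up for this example; they are not Olympia or T1 code, and the sketch ignores masks and LMUL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorOp:
    """Hypothetical, simplified view of a decoded vector instruction."""
    dest: str             # architectural destination, e.g. "v1"
    srcs: tuple           # architectural sources, e.g. ("v2", "v3")
    tail_agnostic: bool   # True = agnostic, False = undisturbed


def sources_to_track(op):
    """Return the registers that must be renamed as sources for this op."""
    tracked = list(op.srcs)
    if not op.tail_agnostic and op.dest not in tracked:
        # Undisturbed: the old destination value is merged into the result,
        # so the previous mapping of the destination is read as a source.
        tracked.append(op.dest)
    return tracked
```

For a `vadd.vv v1, v2, v3`, the agnostic form tracks only `v2` and `v3`, while the undisturbed form also tracks `v1`, which is the extra dependency the comment above warns about.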
-
Document with minutes from the vector meetings, as well as the current discussion and design: Current Vector Design Direction
-
There are no rules for ensuring consecutive physical registers for vector operations. From the SW's POV,
You are correct in your thinking -- let as many micro-ops dispatch/execute independently as possible. From the middle machine's POV, these are separate instructions anyway, and it has no idea (nor needs to know, unless you're considering gathers/reductions) that they are related. The ROB, however, does need to know so it can complete the entire vector operation at once. You are correct. I'll let @kathlenemagnus poke on your reductions requirements. She's better at that than I am. 😄
-
Current Vector Design

The current vector design is to support vector execution units in the Olympia design. This means no separate VPU or vector coprocessor, as seen in designs such as Ocelot or other state-of-the-art vector processor implementations. The goal of the current work is to add basic vector support to Olympia; further work could then design a vector coprocessor to work alongside Olympia's scalar processor.

Design

The design is thus a unified system, where both vector and scalar instructions pass through most stages together. The only place they diverge is at the issue queue: there will be separate, user-defined issue queues for vector instructions. At this time, masking is ignored and AVL is always assumed to be the vector length.

VSET

VSET instructions are used to set the vector length multiplier (LMUL), vector length (VL), and selected element width (SEW). The main design concern is that a VSET is needed before any vector instruction can be processed, so this scenario presents a problem:
In the above code scenario, the first VSET must block vector processing so that no vector instructions are processed ahead of it. This is because the VADD needs to know what the VSET is setting the CSRs to before it can be processed. For this reason, we block at fetch and decode on vset in our design. We only block subsequent vector instructions in the decode group. So if we have:
we can process the two scalar adds, because the VSET doesn't affect them, but once we reach the VADD we have to block both scalar and vector instructions. Thus, we cannot process the 3rd scalar add, because we don't know whether the VADD.vv could affect subsequent scalar instructions, and so on. Stalling on vset is only needed for vset instructions where VL or vtype is being set from a register (vsetvl).
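The stall rule described above can be sketched as a small Python model. Everything here is an assumption for illustration: the `Inst` class, the `vset_blocks` and `decodable_prefix` names, and the instruction sequence in the usage note are hypothetical and are not the original example or Olympia's decode code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    """Hypothetical, simplified decoded instruction."""
    mnemonic: str
    is_vector: bool = False       # a vector instruction that consumes vl/vtype
    is_vset: bool = False
    avl_from_reg: bool = False    # VL requested through a register
    vtype_from_reg: bool = False  # vtype supplied through a register (vsetvl)


def vset_blocks(inst):
    """A vset only needs to stall when VL or vtype comes from a register."""
    return inst.is_vset and (inst.avl_from_reg or inst.vtype_from_reg)


def decodable_prefix(group):
    """Return the instructions in a decode group that may proceed this cycle."""
    allowed = []
    vset_pending = False
    for inst in group:
        if vset_pending and inst.is_vector:
            # The first vector instruction behind an unresolved vset stalls,
            # and so does everything younger than it, scalar or vector.
            break
        allowed.append(inst)
        if vset_blocks(inst):
            vset_pending = True
    return allowed
```

With an assumed group of `vsetvl` (register operands), `add`, `add`, `vadd.vv`, `add`, this model lets the `vsetvl` and the first two scalar adds proceed and holds back the `vadd.vv` and the third `add`. If the vset takes only immediates, nothing is held back.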
LMUL > 1

In a scenario where one has a

VALU Length

By default in core/executepipe.cpp, the VALU bit length (

VALU Length for Multiple Passes

If the vector length is greater than the VALU length, e.g. a 512-bit vector length but only a 128-bit vector adder, multiple passes will be needed. In the design, multiple passes are done at the execution stage, but a bubble is inserted between each add. So, based on the example, 4 passes on the 128-bit adder will be needed to process the entire 512-bit vector length. So, the flow would be:

Vector Tail (VTA)

RISC-V Vector supports two settings, undisturbed and agnostic, which apply to tail elements (VTA) and, when a mask is applied, to masked-off elements (VMA). Undisturbed means the original values of the destination register are unchanged. Agnostic means the original values don't matter, so they can be set to 1s or retain their original value. From an OOO perspective, undisturbed adds an extra source that needs to be tracked. So an example instruction of vadd.vv v1, v2, v3 with VTA=0 (undisturbed) would have v2 and v3 as source registers, but v1 would also be a source register, resulting in 3 total source registers that need to be tracked. Although the current implementation doesn't support masks yet, proper renaming tracking is implemented for the example stated above.

Timeline of Work

The current effort is to support basic

Ongoing Work

This part is just to clarify that this is an ongoing effort, so designs are not final, and feedback and suggestions are always appreciated.
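To make the multiple-pass arithmetic from the "VALU Length for Multiple Passes" part concrete, here is a small Python sketch. The function names are hypothetical. The timing is an assumption too: the text only says a bubble is inserted between adds, so the sketch assumes each pass occupies the adder for one cycle and that the bubble is one cycle by default.

```python
def num_passes(vector_bits, valu_bits):
    """Number of passes needed to cover vector_bits on a valu_bits-wide unit."""
    if vector_bits <= 0 or valu_bits <= 0:
        raise ValueError("bit widths must be positive")
    return -(-vector_bits // valu_bits)  # ceiling division


def pass_schedule(vector_bits, valu_bits, start_cycle=0, bubble_cycles=1):
    """Return (issue_cycle, low_bit, high_bit) for each pass.

    Assumes one cycle per pass plus bubble_cycles idle cycles between passes.
    """
    schedule = []
    for i in range(num_passes(vector_bits, valu_bits)):
        low = i * valu_bits
        high = min(low + valu_bits, vector_bits) - 1
        cycle = start_cycle + i * (1 + bubble_cycles)
        schedule.append((cycle, low, high))
    return schedule
```

For the 512-bit vector on a 128-bit adder, `num_passes(512, 128)` gives 4, and under the assumed timing the passes issue on cycles 0, 2, 4, and 6, each covering one 128-bit slice.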
-
Keep in mind that not all
-
I found a paper describing the design of the “New Ara”, an open-source vector processing unit, and was thinking of modeling the vector processor support after it.
Key Design areas of the Ara design, with some questions for Olympia support:
High Level Overview of Vector Processor for New Ara: