Vector Extension Support Design #89

aarongchan · 2023-09-11T23:57:56Z

aarongchan
Sep 11, 2023
Collaborator

I found a paper describing the design of the “New Ara”, an open-source vector processing unit, and was thinking of modeling Olympia's vector processor support after it.
Key design areas of the Ara design, with some questions for Olympia support:

For lanes, should we essentially just model a lane as an entire “vector unit” in the execution pipe in Olympia? For example, if we were modeling for 8 lanes, we would just initialize 8 vector units, which each have a vector alu (VALU) and vector floating point unit (VFMPU).
In the Ara design, a mask unit manages mask bits across all of the lanes, because each lane holds a portion of the VRF. Should we model this splitting of the VRF into chunks? Or, for simplicity, should we just model every lane as having access to the entire VRF?
Slide unit would be modeled as its own unit. In the Ara design, there is only one slide unit that is shared among all lanes, so if there is a reduction we would need to have arbitration for coordinating it and broadcasting the reduction back to each lane accordingly.
In their design, they use scalar fetch/decode/issue, then dispatch from scalar to the vector co-processor, and then return results from the vector coprocessor for commit.
Memory coherency between the scalar and vector processors will need to be modeled and tracked. Their rules for issuing are as follows:
- Scalar loads issue only if no vector stores are in flight
- Scalar stores issue only if no vector loads/stores are in flight
- Vector loads/stores issue only if no scalar stores are pending
Logic will be needed to handle cases where the number of lanes does not match the number of elements in a vector
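
The three issue rules above can be sketched as a small predicate. This is only an illustration: the `InFlight` counters and the `can_issue` name are hypothetical and do not come from Ara or Olympia.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    """Hypothetical counters of memory operations issued but not yet completed."""
    scalar_stores: int = 0
    vector_loads: int = 0
    vector_stores: int = 0


def can_issue(kind: str, state: InFlight) -> bool:
    """Apply the scalar/vector memory ordering rules listed above."""
    if kind == "scalar_load":
        # Scalar loads only need to wait for vector stores.
        return state.vector_stores == 0
    if kind == "scalar_store":
        # Scalar stores wait for all vector memory traffic.
        return state.vector_loads == 0 and state.vector_stores == 0
    if kind in ("vector_load", "vector_store"):
        # Vector memory operations wait for pending scalar stores.
        return state.scalar_stores == 0
    raise ValueError(f"unknown memory operation kind: {kind}")


print(can_issue("scalar_load", InFlight(vector_loads=2)))
print(can_issue("scalar_store", InFlight(vector_loads=2)))
```

A model would update the counters at issue and completion of each memory operation; that bookkeeping is left out here.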
High Level Overview of Vector Processor for New Ara:

gglangg · 2023-10-06T16:02:01Z

gglangg
Oct 6, 2023

Hi @aarongchan
Awesome, thank you! I'd like to know the current status and direction of the VPU design. If possible, could you push your commits to your fork (https://github.com/AaronGChan/riscv-perf-model) so I can take a look at your changes? If not, do you mind if I start trying to design the VPU?

3 replies

aarongchan Oct 6, 2023
Collaborator Author

Hey @gglangg,
I don't have any code currently, as we have been discussing the uarch implementation and wanted to finalize that before starting the modeling. We were going to discuss the VPU design in more depth at last week's SIG meeting but ran out of time, so it'll most likely get discussed at the upcoming meeting. I have some thoughts and questions about my design that I was hoping to run by the SIG, and I'll post them below. If you want to start trying to design, feel free to! I've been pretty busy with school, so I haven't been able to work on this as much as I'd like, but it's been interesting learning about vector designs, so I was looking forward to modeling it. If you're able to attend the meetings, it might be good to share your design with the SIG as you develop it, as I'm sure people will have feedback and input. I posted my current notes below as well, so if you have follow-up thoughts or your own separate design, we can discuss it and flesh it out.

gglangg Oct 6, 2023

@aarongchan Great! Where can I join the SIG meeting?

aarongchan Oct 6, 2023
Collaborator Author

Here’s the link to the calendar. The meeting should be on Wednesday and is labeled “Performance Modeling SIG”:
https://calendar.google.com/calendar/u/0/embed?src=tech.meetings@riscv.org&pli=1

aarongchan · 2023-10-06T16:47:39Z

aarongchan
Oct 6, 2023
Collaborator Author

Current design/thoughts:
One thought I had: it might be simpler to implement in-order first, get basic functionality working for all of RISC-V Vector 1.0, and implement out-of-order later. But since Olympia's scalar core is OOO, that might present some challenges.

For out-of-order:

Assumptions:
VL=VLMAX (Always)
Ignoring masks for now

Need to rename/track the following:

Source operands
Destination operand
VTYPE
LMUL
SEW
VL (if we assume VL=VLMAX we don’t need to track)
Previous destination (case explained below, if we set AVL < VLMAX)
Sequencer
Strip-mining logic: calculate how many cycles it will take at full lane utilization, plus a remainder cycle
LMUL > 1
- For add/sub element wise we could break up the following:
- LMUL = 4
- vadd.vv v0, v5, v10 is really vadd.vv v0:v4, v5:v9, v10:v14
- so for rename, would we want to rename contiguous groups of registers, turning the above into
- vadd.vv pv0:pv4, pv5:pv9, pv10:pv14
- or we could break the operation into 4 micro-ops, i.e. vadd.vv v0, v5, v10 / vadd.vv v1, v6, v11 and so on
- This works well for add/sub/mul/div operations where we can split it nicely, but for vrgather or reductions we can't really do that
- If we do a vrgather with, let's say, LMUL=4, we can't split it up nicely into micro-ops like the vadd above, because the element selected by the first gather index could live in the 4th vector register of the source group, so there are dependencies
If we have AVL set to 5 (maybe ignore this if we keep the assumption of VL=VLMAX)
- we only add the first 5 elements
- The spec then lets the remaining (tail) elements be filled with 1s (tail-agnostic), but the programmer can instead specify that the old values remain (tail-undisturbed). From an OOO perspective we have to track this and basically keep the old copy, because we don’t know whether a subsequent instruction will need it, how far back, and so on
- Ex: vadd.vv v8, v7, v5
  - Let's assume 8 elements in each vector register and we only add 5 elements; we need to either fill the top 3 elements with 1s or hold the previous value of v8 in those top 3 elements
- For us it would be easier to just assume the user doesn’t request undisturbed, so we just fill with 1s
- An approach to implement it could be
  - We mark that v8 has another instruction using its value, so we don’t retire/free that register.
  - Then on retirement of this instruction, a boolean marks whether AVL was less than VLMAX; if so, we copy the original values in and free the original
  - This would also add an operand dependency to check, because if the previous v8 value is still being calculated, we can’t run this instruction: we need to be able to set the correct values when we retire this one
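
To make the two tail behaviors concrete, here is a minimal element-level sketch of the vadd.vv v8, v7, v5 example with 8 elements and VL = 5. The function name and the 8-bit SEW are assumptions for illustration only; note that tail-agnostic hardware may also leave the old values in place, and this sketch picks the all-1s option.

```python
def vadd_with_tail(old_dest, src_a, src_b, vl, sew_bits=8, tail_agnostic=True):
    """Element-wise add of the first `vl` elements; tail handled per policy."""
    all_ones = (1 << sew_bits) - 1
    result = []
    for i in range(len(old_dest)):
        if i < vl:
            # Body element: wrap the sum to the element width.
            result.append((src_a[i] + src_b[i]) & all_ones)
        elif tail_agnostic:
            # Tail element, agnostic: this sketch writes all 1s.
            result.append(all_ones)
        else:
            # Tail element, undisturbed: keep the previous destination value,
            # which is why the old v8 becomes an extra source operand.
            result.append(old_dest[i])
    return result


old_v8 = [9] * 8
v7 = [1, 2, 3, 4, 5, 6, 7, 8]
v5 = [10] * 8
print(vadd_with_tail(old_v8, v7, v5, vl=5, tail_agnostic=True))
print(vadd_with_tail(old_v8, v7, v5, vl=5, tail_agnostic=False))
```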

These have just been some of my thoughts. Additionally, @sequencer had posted about https://github.com/chipsalliance/t1 in the issue for this, so we could just model after their implementation. I would want to spend some time walking through their design, but if we model after theirs, then we have a nice hardware implementation to model after.

0 replies

sequencer · 2023-10-07T01:09:11Z

sequencer
Oct 7, 2023

Basically, from my point of view, when designing a Cray-like vector machine, renaming might be the wrong direction:

At the architecture level, the instructions already provide enough parallelism. From the experience of designing the uArch of T1, it was easy to hit the maximum bandwidth of the SRAM, so T1 also supports banked SRAM in a lane to provide enough throughput for the VRF.
Creating a ROB might have two benefits that I can think of:
- support for widening and reduction instructions.
- exception handling.

In T1, for widening and reduction instructions, the problem is that the datapath of each lane reads from/writes to nearby lanes, which causes problems for physical design. Thus we use a ring bus to make it possible.
For exceptions, we may follow the Swappable Traps approach by adding two custom instructions to "pause and play" the vector part; this will save a lot of the effort of loading and saving context in place.

0 replies

ghost · 2023-10-07T20:02:11Z

ghost
Oct 7, 2023

Great set of requirements/initial ideas!

@aarongchan: small correction on your notes above. The following statement is more representative:

LMUL = 4
vadd.vv v0, v4, v8 is really vadd.vv v0:v3, v4:v7, v8:v11
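
A minimal sketch of this register-group expansion, assuming integer LMUL and an element-wise operation that can be split one micro-op per register. The function name is hypothetical; the alignment check reflects the spec rule that a register group's base register number must be a multiple of LMUL.

```python
def split_into_uops(opcode, vd, vs2, vs1, lmul):
    """Expand one LMUL > 1 element-wise instruction into per-register micro-ops."""
    for reg in (vd, vs2, vs1):
        if reg % lmul != 0:
            # Register groups must start at a multiple of LMUL.
            raise ValueError(f"v{reg} is not aligned to LMUL={lmul}")
    return [f"{opcode} v{vd + i}, v{vs2 + i}, v{vs1 + i}" for i in range(lmul)]


for uop in split_into_uops("vadd.vv", vd=0, vs2=4, vs1=8, lmul=4):
    print(uop)
```

Calling it with v0, v5, v10 raises an error, since v5 and v10 cannot be the base of an LMUL=4 group.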

Regarding a shortened AVL: to keep things simple, we can make the assumption that vector operations are always tail-agnostic (i.e. filled with 1s, like you mention). If you do decide, however, to support undisturbed, remember that you will need the original source PRF of the destination as one of the sources. So, to extend your example vadd.vv v8, v7, v5: after renaming, you will have an additional source register (whatever v8 pointed to before this instruction was renamed).

Some other considerations: those vector-scalar operations will need their source scalar value satisfied before progressing in any design you pick. Being that those scalar values are coming OoO from either middle machine (IEX/FEX) or load/store, register read ports for float/int will be required before the vector operation can progress. Fortunately, for LMUL > 1 cases, it's the same scalar value. 😉

Good catch on the gather/reductions. They are ... tricky.

Looking forward to your design/idea overview next meeting.

0 replies

aarongchan · 2023-10-23T15:19:40Z

aarongchan
Oct 23, 2023
Collaborator Author

Document with minutes from the vector meetings, as well as the current discussion and design:
https://docs.google.com/document/d/1KkXHv2iHbZJb1ZUDi2UxO-_HeS1zELTSBTiAKCQhShA/edit?usp=sharing

Current Vector Design Direction

Instead of a vector coprocessor, we will just be having vector execution units with separate scoreboard, freelist, registers, and ROB for vector instructions.
The design might look something like this:

3 replies

sequencer Oct 23, 2023

I'd like to join the next meeting for this, can I have the meeting invitation?

aarongchan Oct 23, 2023
Collaborator Author

Yeah it's on the riscv calendar:
https://calendar.google.com/calendar/u/0/embed?src=tech.meetings@riscv.org

Look for "RISC-V Performance Modeling SIG Vector Support Working Group" on Mondays.

sequencer Oct 23, 2023

Thanks! I will join next time.

aarongchan · 2023-10-23T15:51:52Z

aarongchan
Oct 23, 2023
Collaborator Author

Questions:
Let's say we have the following instruction:

Vadd.vv v0:v1 v2:v3 v4:v5
Regarding freelist register allocation, do we want to wait for consecutive freelist registers to become available?
For example, if PVRF 1 and PVRF 31 are the only physical registers available and we have LMUL=2, do we go ahead and just map v0 -> PVRF1 and v1 -> PVRF31? Or should we wait until PVRF0 or PVRF2 becomes available?

For uop breakup:
V0 → PVRF 32
V1 → PVRF 33
Split into uops:
Vadd.vv PVRF32 PVRF2 PVRF4
Vadd.vv PVRF33 PVRF3 PVRF5

So if there aren't enough dispatch credits to dispatch both uops of a single instruction, should we stall? Or should we allow one to run ahead? From a performance standpoint, we should let as many as we can run ahead, and I think the syncing logic will be the same for both run-ahead and stalling, as we need the ROB to wait for both uops to finish. I think dispatching as many uops as we have credits for should be the logic, but please let me know if there is an edge case I am missing or a different direction I could take.

Reductions

Need a temporary register or data structure to hold the intermediate calculation in a VALU
As an optimization step, we may want to fill the VALU to the fullest in the case of LMUL > 1
Let's say we have 8-element vectors, and a VALU has 8 ALUs, so it can process 8 calculations at once
In the above, the most optimal would be the 1st one, where we fit both calculations together in the same operation; the bottom 2 would be if we processed them separately
For the reduction VALU, should we worry about supporting this early on? Or should we just process them separately, without the optimization step that reduces the cycle count?
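
One possible way to picture the "fill the VALU" idea for a sum reduction with LMUL > 1 is sketched below. This is an assumed scheme, not the one from the meeting notes: it first accumulates the registers of the group element-wise into a temporary register (each pass uses all 8 ALUs), then tree-reduces that temporary. The function name is hypothetical, the element count is assumed to be a power of two, and the scalar start value that a real vredsum takes is ignored.

```python
def reduce_sum_group(registers):
    """Sum-reduce a register group; returns (sum, number of VALU passes)."""
    passes = 0
    # Temporary register holding the intermediate result.
    acc = list(registers[0])
    # Step 1: fold the remaining registers in, one full-width pass each.
    for reg in registers[1:]:
        acc = [a + b for a, b in zip(acc, reg)]
        passes += 1
    # Step 2: tree-reduce the temporary, halving the active width each pass.
    width = len(acc)
    while width > 1:
        half = width // 2
        acc = [acc[i] + acc[i + half] for i in range(half)]
        width = half
        passes += 1
    return acc[0], passes


v0 = [1, 2, 3, 4, 5, 6, 7, 8]
v1 = [9, 10, 11, 12, 13, 14, 15, 16]
print(reduce_sum_group([v0, v1]))
```

Reducing each register on its own and then combining the two partial sums would take more passes under the same assumptions, which is the trade-off the question above is about.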

0 replies

ghost · 2023-10-23T16:03:33Z

ghost
Oct 23, 2023

Questions: Let's say we have the following instruction:

Vadd.vv v0:v1 v2:v3 v4:v5
Regarding freelist register allocation, do we want to wait for consecutive freelist registers to become available?

For example, if PVRF 1 and PVRF 31 are the only physical registers available and we have LMUL=2, do we go ahead and just map v0 -> PVRF1 and v1 -> PVRF31? Or should we wait until PVRF0 or PVRF2 becomes available?

There are no rules requiring consecutive physical registers for vector operations. From the SW's POV, v0 and v1 still remain consecutive. The physical register file will maintain the mapping. There is no way for software to "walk" from v0 into v1 -- that would be a pretty big security hole. 😉
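
As a rough illustration of that point, the sketch below renames a register group by taking whatever physical registers the free list offers, with no contiguity requirement. The names `rename_group`, `free_list`, and `rename_map` are hypothetical and are not Olympia's actual rename structures.

```python
def rename_group(arch_base, lmul, free_list, rename_map):
    """Map each architectural register of a group to any free physical register."""
    if len(free_list) < lmul:
        # Not enough free physical registers for the whole group: stall.
        return None
    allocated = []
    for i in range(lmul):
        phys = free_list.pop(0)
        # The rename map alone preserves the v0/v1 adjacency seen by software.
        rename_map[arch_base + i] = phys
        allocated.append(phys)
    return allocated


free_list = [1, 31]
rename_map = {}
print(rename_group(0, 2, free_list, rename_map))
print(rename_map)
```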

For uop breakup:
V0 → PVRF 32
V1 → PVRF 33
Split into uops:
Vadd.vv PVRF32 PVRF2 PVRF4
Vadd.vv PVRF33 PVRF3 PVRF5

So if there aren't enough dispatch credits to dispatch both uops of a single instruction, should we stall? Or should we allow one to run ahead? From a performance standpoint, we should let as many as we can run ahead, and I think the syncing logic will be the same for both run-ahead and stalling, as we need the ROB to wait for both uops to finish. I think dispatching as many uops as we have credits for should be the logic, but please let me know if there is an edge case I am missing or a different direction I could take.

You are correct in your thinking -- let as many micro-ops dispatch/execute independently as possible. From the middle machine's POV, these are separate instructions anyway and it has no idea (nor needs to know unless you're considering gathers/reductions) that they are related.

The ROB, however, does need to know so it can complete the entire vector operation at once. You are correct.

I'll let @kathlenemagnus poke on your reductions requirements. She's better at that than I am. 😄

0 replies

aarongchan · 2024-05-06T06:20:09Z

aarongchan
May 6, 2024
Collaborator Author

Current Vector Design

The current vector design is to support vector execution units in the Olympia design. This means no separate VPU or vector coprocessor as seen in designs such as Ocelot or other state of the art implementations of vector processors. The goal of the current work is to add basic vector support functionality to Olympia and further work could be to then design a vector co-processor to work with the scalar processor of Olympia.

Design

The design is thus a unified system, where vector and scalar instructions pass through most stages together. The only point where they diverge is the issue queue: there will be separate issue queues for vector instructions, which are user-defined. At this time, masking is ignored and AVL is always assumed to equal the maximum vector length (VL = VLMAX).

VSET

VSET instructions set the vector length multiplier (LMUL), the vector length (VL), and the selected element width (SEW). The main design issue around VSET is that a VSET must be resolved before any subsequent vector instruction can be processed, so this scenario presents a problem:

VSET
VADD
VADD
VSET
VADD

In the above code scenario, the first VSET must block the vector instructions that follow it, preventing them from being processed. This is because the VADD needs to know what the VSET is setting the CSRs to before it can be processed. For this reason, we block at fetch and decode on a VSET in our design. We only block subsequent vector instructions in the decode group. So if we have:

VSET
SCALAR ADD
SCALAR ADD
VADD.VV
SCALAR ADD

we can process the first two scalar adds, because the VSET doesn't affect them, but at the VADD we have to block both scalar and vector instructions. Thus we cannot process the third scalar add, because we don't know whether the VADD.VV could affect subsequent scalar instructions, and so on. Stalling is only needed for vset instructions where VL or vtype is being set from a register (e.g. vsetvl).
In this design, a dedicated sparta::Port is used to forward an instruction back from execution to decode, so that the CSRs can be read off the instruction and allow decoding of instructions to resume. This forwarding is done after the instruction has been executed.
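
The blocking behavior described above can be sketched as a walk over one decode group. This is a simplified illustration with hypothetical names: each instruction is tagged as "scalar", "vector", or "vset" (meaning a vset variant that has to stall), and the forwarding of the executed VSET back to decode is not modeled.

```python
def decode_group(instructions):
    """Return (decoded, stalled) instruction names for one decode group."""
    decoded = []
    waiting_on_vset = False
    for index, (name, kind) in enumerate(instructions):
        if waiting_on_vset and kind in ("vector", "vset"):
            # First vector instruction after an unresolved VSET: it and
            # everything behind it (scalar or vector) must wait.
            return decoded, [n for n, _ in instructions[index:]]
        decoded.append(name)
        if kind == "vset":
            waiting_on_vset = True
    return decoded, []


group = [
    ("vset", "vset"),
    ("add_1", "scalar"),
    ("add_2", "scalar"),
    ("vadd.vv", "vector"),
    ("add_3", "scalar"),
]
print(decode_group(group))
```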

Block flow of VSET dependency blocking at which stages

LMUL > 1

In a scenario where one has a vadd with LMUL = 4, the efficient way to handle it is to split the instruction into 4 UOPs. This allows the work to be parallelized across multiple VALUs because, for VADDs, the UOPs are independent. This is not the case for instructions such as vrgather, which will be addressed in later PRs. Although multiple UOPs are created, only one ROB entry is created, which keeps the ROB slots from being polluted with all the UOPs. In the ROB, the parent instruction holds a list of all of its UOPs and, at retirement, checks that all of them are done before retiring. All UOPs have the same UniqueID and ProgramID as the parent, but carry a separate UOPID field to differentiate them.
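
A minimal sketch of the single-ROB-entry idea, using a hypothetical `RobEntry` class rather than Olympia's actual ROB code: the parent tracks one completion flag per UOPID and may retire only once every flag is set.

```python
class RobEntry:
    """One ROB entry for a parent vector instruction and all of its UOPs."""

    def __init__(self, unique_id, num_uops):
        self.unique_id = unique_id          # shared by the parent and its UOPs
        self.uop_done = [False] * num_uops  # indexed by UOPID

    def mark_uop_done(self, uop_id):
        self.uop_done[uop_id] = True

    def can_retire(self):
        # The parent retires only when every UOP has completed.
        return all(self.uop_done)


entry = RobEntry(unique_id=7, num_uops=4)
for uop_id in (2, 0, 3):   # UOPs may complete out of order
    entry.mark_uop_done(uop_id)
print(entry.can_retire())
entry.mark_uop_done(1)
print(entry.can_retire())
```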

VALU Length

By default in core/executepipe.cpp, the VALU bit length (valu_len) is set to 128 bits. This is a user-specified parameter, so it can be increased. The current assumption is that the VALU/VFMU is built from 8-bit adders that can be combined based on SEW with no added delay. Thus a 128-bit vector ALU with an SEW of 8 would result in 16 8-bit adders computing in parallel. Likewise, a 128-bit vector ALU with an SEW of 32 would result in 4 32-bit adders, where 4 8-bit adders combine to form each 32-bit adder. In the current code we assume combining adds no latency, but this can definitely change to match hardware more realistically.
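
The relationship between the VALU width and SEW is a simple division, sketched below. The function name is hypothetical; only the valu_len parameter name comes from the post.

```python
def parallel_adders(valu_len_bits, sew_bits):
    """Number of SEW-wide adders a VALU of the given bit length provides."""
    if sew_bits not in (8, 16, 32, 64):
        raise ValueError(f"unsupported SEW: {sew_bits}")
    if valu_len_bits % sew_bits != 0:
        raise ValueError("VALU length must be a multiple of SEW")
    # Each SEW-wide adder is formed from sew_bits // 8 of the 8-bit adders.
    return valu_len_bits // sew_bits


for sew in (8, 16, 32, 64):
    print(sew, parallel_adders(128, sew))
```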

VALU Length for Multiple Passes

If the vector length is greater than the VALU length, e.g. a 512-bit vector but only a 128-bit vector adder, multiple passes will be needed. In the design, multiple passes are done at the execution stage, with a bubble inserted between each add. So in this example, 4 passes through the 128-bit adder are needed to process the entire 512-bit vector, and the flow would be:
ALU executes -> bubble -> ALU executes -> bubble -> ALU executes -> bubble -> ALU executes
In the current code, this is done by calling insertInst and executeInst in sequence, which is currently how execution of an instruction is handled in the design.
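
The pass count and the bubble pattern can be sketched as follows. This only mirrors the flow written above; the function name is hypothetical and it does not model insertInst or executeInst themselves.

```python
def execution_schedule(vector_bits, valu_len_bits=128):
    """List the execute/bubble slots needed to process one vector."""
    # Ceiling division: a partial final chunk still needs a full pass.
    passes = -(-vector_bits // valu_len_bits)
    schedule = []
    for i in range(passes):
        if i > 0:
            schedule.append("bubble")  # one bubble between consecutive passes
        schedule.append("execute")
    return schedule


print(" -> ".join(execution_schedule(512)))
```

With these assumptions, N passes occupy 2N - 1 slots, so removing the bubble later would roughly halve the occupancy for long vectors.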

Vector Tail (VTA)

RISC-V Vector supports two policies, undisturbed and agnostic, for tail elements, i.e. those past VL (a matching pair of policies, selected by VMA, covers masked-off elements). Undisturbed means the original values of the destination register are unchanged. Agnostic means the original values don't matter, so they can be set to 1s or retain their original value. From an OOO perspective, undisturbed adds an extra source operand that needs to be tracked. So an example instruction of vadd.vv v1, v2, v3 with VTA=0 (undisturbed) would have v2 and v3 as source registers, but v1 would also be a source register, resulting in 3 total source registers that need to be tracked. Although the current implementation doesn't support masks yet, proper renaming tracking is implemented for the example stated above.
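
A minimal sketch of how renaming could pick up the extra source for the vadd.vv v1, v2, v3 example. The names `rename_vadd`, `rename_map`, and `free_list` are hypothetical and this is not Olympia's rename code; it only shows that the previous physical mapping of the destination becomes a third source when VTA=0.

```python
def rename_vadd(vd, vs2, vs1, vta, rename_map, free_list):
    """Return (new destination physical register, source physical registers)."""
    sources = [rename_map[vs2], rename_map[vs1]]
    if vta == 0:
        # Undisturbed: the old value of vd must be read to fill the tail,
        # so its current physical register is an additional source.
        sources.append(rename_map[vd])
    new_dest = free_list.pop(0)
    rename_map[vd] = new_dest
    return new_dest, sources


rename_map = {1: 11, 2: 12, 3: 13}
print(rename_vadd(1, 2, 3, vta=0, rename_map=rename_map, free_list=[40]))
```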

Timeline of Work

The current effort is to support basic VADD operations with varying vector CSRs being set. After that, VALU and VFMPU operations will be added. Finally, special instructions such as VRGather and VReduction will require state machines and more complex work, so this will be saved for the end.

Ongoing Work

This part is just to clarify that this is an ongoing effort, so designs are not final, and feedback and suggestions are always appreciated.

0 replies

klingaard · 2024-05-06T14:14:22Z

klingaard
May 6, 2024
Maintainer

Keep in mind that not all vset variants will cause decode to stall. Only those vset variants where decode cannot ascertain the value of VL/VTYPE should cause decode to stall. Specifically, those variants where an integer register input is required.
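
A rough sketch of that classification, based on the operand forms in the RISC-V Vector 1.0 spec. The function name is hypothetical, and treating the rs1 = x0 forms of vsetvli as non-stalling is an assumption about what decode could do, not a statement of what Olympia implements.

```python
def vset_stalls_decode(mnemonic, rs1=0):
    """True if decode must wait for an integer register to learn VL/VTYPE."""
    if mnemonic == "vsetivli":
        # AVL and vtype are both immediates: known at decode.
        return False
    if mnemonic == "vsetvl":
        # vtype comes from register rs2 (and AVL from rs1): must wait.
        return True
    if mnemonic == "vsetvli":
        # vtype is an immediate. AVL comes from rs1 unless rs1 is x0, in which
        # case VL is either VLMAX or left unchanged (assumed trackable here).
        return rs1 != 0
    raise ValueError(f"not a vset instruction: {mnemonic}")


for mnemonic, rs1 in (("vsetivli", 0), ("vsetvli", 0), ("vsetvli", 10), ("vsetvl", 10)):
    print(mnemonic, rs1, vset_stalls_decode(mnemonic, rs1))
```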

0 replies
