operation fusion/morphing #121
Replies: 21 comments 3 replies
-
Start of a discussion on instruction/op fusion. Issue raised originally by zxc12523 and discussion suggested by Knute in #113. This is the start of that discussion, as suggested the outcome would be a description of the API for instructions transformed in decode, and a discussion of implementation choices, and a sample implementation with unit test. The original interest was in fusion but I think of this as a more general need for instruction transforms, where fusion is one transform and fracture is another. The former has more interest at this point of the RV ISA, but the allowance of custom instructions and as yet unknown future extensions suggests fracturing may also be of interest at some point. I would like an API that could accommodate both transformations. (if someone can remove the initial post, tygaribay-gh, I was logged into a test account, and submitted before I realized it.) |
Beta Was this translation helpful? Give feedback.
-
Attached is a rough draft of a design document. Posted for guidance on general direction but I welcome any feedback. In the current state of the API document, I have scoped the definition of the problem and created a set of basic methods that let me explore mechanics of legalization and transformation. I am creating a fusion unit and a test bench for it. I am currently leaning toward user registered callbacks because they make the API inherently independent of file format specifications of fusion/fracture operations. For the general case I expect there to be many ways to specify the transform process. In the first example I am implementing I am using callbacks with the fusion specification relying on Mavis/Json. I also expect there to be an STF utility function for transformation of trace records. The goal is to propose no modification to either standard. |
Beta Was this translation helpful? Give feedback.
-
Abandoning the PDF for now. Below is a long-ish summary. Suggestions and opinions invited. As given: Fusion compares an incoming sequence of instructions to a set of known sequences, constraints and transforms. Incoming sequences that match known sequences and meet the constraints have the fusion transform performed. The incoming sequence is replaced with the fusion result and passed downstream. A simplified sequence from dhrystone, chosen at random not because it warrants fusion.
The instruction Assume the pre-registered sequences are expressed as transforms (Tx), constraints (Cx) and sequence tuples [...];
The arrays of UIDS are replaced with a hash in the model. The abstract transforms take the form:
The transform results in a re-encoding to a custom instruction understood by the hardware. The abstract contraints take the form:
The transform and constraints are implemented in callbacks. The transform process uses greedy matching with a minimum tuple size of N and a maximum tuple size of M. N and M are parameters. If the incoming sequence is re-written as A B C D..., the ordered combinations of tuples are:
Beginning with the longest sequence the HASHes are compared to the HASH of the preregistered tuples and the contraints are tested. The first combination that meets the contraints is also the longest valid sequence. The transform of the matching preregistered sequence is applied to the operands of the combination tuple. This is another example of supporting generality in the model. The hardware would perform comparison in parallel and priority select the longest match. The matching incoming sequences are replaced with the transform result in the output and the process continues until the incoming sequence is exhausted. Of course in cases where no valid transform is found the incoming sequence is copied to the output. In this simple case the result is two complex instructions, the syntax is here for reference. I have not done anything for a working disassembler.
The preregistered sequences are represented in json and use the mavis UID as the reference key. The values of constraint_set and transform are the names of the callbacks.
|
Beta Was this translation helpful? Give feedback.
-
Dave Murrell (@dbmurrell) since you're an expert on fusion as well, do you want to provide any ideas/feedback to this discussion? |
Beta Was this translation helpful? Give feedback.
-
Thank you Jeff for putting this together. |
Beta Was this translation helpful? Give feedback.
-
Below is a sample showing more of the direction. I have an LALR parser, there are no shift/reduce problems. The domain context has eliminated the need for some common syntax. Not shown but the parser grammar also a supports flattened form and a form which improves reuse. I have example transform syntax below which I did not have before, work in progress. This case is shorter than the previous, it uses operand positions as a constraint, e.g. all g1 are the same. Making the assumption the input instruction was properly encoded saves a lot of value checking statements. I've added one contrived constraint, g1 != g2 Next is to try an implementation and prove the matching and constraints checking.
|
Beta Was this translation helpful? Give feedback.
-
I have reviewed the proposal and it looks good. I have just one comment, and it is more in line of how this will be integrated into Olympia. I propose that we should define a instruction fusion interface, and the proposal above will be one implementation of the interface. This will allow the users to write their own fusion model, if the proposal is not sufficient for their microarchitecture needs. |
Beta Was this translation helpful? Give feedback.
-
Attached is a PDF of the doxygen latex for the current snapshot of the C++ API. This is from a testbench. Work remains but the callback idiom is presented. There is also a riscv-perf-model Decoder unit test in progress, it is incomplete. I've made some assumptions about the capabilities of the support code:
The top level methods in a use case are: searchSequence also registerSequence is a startup/initialization process, which builds the fusion sequences at run time. I'm focusing on getting something functional in the Decoder testbench. I'll revisit something more in-line with the tree construction methods used in the model There is work remaining. Feedback is always welcome. |
Beta Was this translation helpful? Give feedback.
-
This is a working function object for checking constraints for a fusion group. The API handles field value queries of the input using mavis and access to machine implementation details. These implementation details, such as read and write port limits, will affect performance correlation with fusion. In the current implementation I use parameters added to the Decode unit. If there is time in the next call I'd like to discuss if there is a more general way to handle these implementation details. pdf had Sequence this is now FusionGroup.
|
Beta Was this translation helpful? Give feedback.
-
Still digesting your fusion proposal... but I did find some time to build a small "instruction dumper" that illustrates the use of Mavis to find pairings. Checkout the branch
Try these examples:
|
Beta Was this translation helpful? Give feedback.
-
I appreciate this, thank you. I would hold off further review until the draft PR. I am re-structuring the code to conform to existing style/idioms in riscv-perf-model. What I previously showed helped me explain the thought process without regard to conventions, I want to align the interfaces and style to what's in the codebase. I appreciate the guidance on mavis. The constraints example above uses mavis to extract fields and convert between representations. At the moment I'm looking at radix-trie alternatives to the Mavis d-trie, since the search is primarily uint32_t/uint64_t. I will post here when I've submitted the draft. |
Beta Was this translation helpful? Give feedback.
-
Draft PR posted. fusion draft PR commit #135 |
Beta Was this translation helpful? Give feedback.
-
thanks for your review knute, I've pushed the requested changes. I have a set of changes that begins to create the AST for the language. I'll hold these for now. I'd like to have a discussion on the dsl (calling it FSL for now). There are a number of ways to go, It would be interesting to solicit ideas before the syntax/grammar solidifies. |
Beta Was this translation helpful? Give feedback.
-
Longer than I hoped because of the explanation. From the last call, speaking to why a new syntax makes sense. Here's one view of what the dsl (FSL) could look like in Python. This is pseudo code, I have not implemented any of the Python wrappers. Below is an FSL example, a 5-instruction group with constraints. For simplicity, it omits the 'req' and 'opt' placeholders and transform syntax. In use you could imagine the model starts up a python interpreter reads these files and populates the Fusion API structures at run time. This is similar to how the FSL interpreter works. FSL interpreter is smaller than a Python interpreter and will have a smaller run time footprint.
Below is what I think it looks like to reproduce the FSL features in python. Again this is pseudo code.
Once all the python files have been loaded the model proceeds normally. The above presumes support in Python/C++ via something like Boost::Python. This is overhead not present in a native FSL implementation. What is implemented assumes speed is important so rather than native Python there is an assumption of support in C++ instead, e.g. the FSL::sequence and FSL::constraints. Wrapping these classes is overhead in the API. FSL uses the concept of an instruction with abstract operands. This is native to the FSL parser to support it through python requires writing the support methods in C++ and the wrappers. In python you would build your fusion group from a list of strings, a reference to the Mavis Facade, and the input container. Creating the constraints is problematic. Unless you reuse the expressions found in the FSL parser, which diminishes the value of python somewhat, you would need to create a way to describe constraint expressions or bring more of the C++ API into a native python implementation so you could use normal python statements for the constraint expression. This would be more Python overhead than what I show above and below. These are the constraints in FSL:
A possible python version:
These constraints are simple, it is easy to see how cumbersome this could become with more complicated constraints. There is overhead in the API to support this Python based expression of constraints. There are more topics on the syntax justification, such as re-use of fusion groups and constraints that would need additional support through a combination of Python syntax and Fusion API wrappers. |
Beta Was this translation helpful? Give feedback.
-
@klingaard I have pushed all the review changes I have on the fusion PR. thank you everyone. |
Beta Was this translation helpful? Give feedback.
-
From the last call I captured these ARs for Fusion.
|
Beta Was this translation helpful? Give feedback.
-
This is the oly decoder with fusion support using UIDs. |
Beta Was this translation helpful? Give feedback.
-
I plan to add a report def file for reporting IPC with and without fusion. I am looking for some help on the report mechanism. The dhry_report.yaml is similar to what I'd like. I am unable to get it to find the top node. I modified a copy of reports/core_report.def (as ./dhry_report.def) to contain this:
From release I executed
Can anyone tell me what I am doing wrong? |
Beta Was this translation helpful? Give feedback.
-
Repost the question as @arupc suggested. riscv-perf-model/fusion/fusion/HCache.hpp Line 78 in a0f965f riscv-perf-model/fusion/fusion/HCache.hpp Line 161 in a0f965f Won't different fgPairs have the same grpSize? If they are, will this line of code be wrong? |
Beta Was this translation helpful? Give feedback.
-
Thanks for your question. I believe the code is correct. buildHashCacheEntry() inserts the length permutations of inputUids and their jenkins hash into hcache_. The conditions where fgPairs have the same size is what gives the hcache it's benefit. I have a local branch with an enhancement which adds an early exit buildHashCacheEntry() plus some other clean up. |
Beta Was this translation helpful? Give feedback.
-
Our crunch is still on going, but I found some time to finish a draft of the parser syntax. If you like these sorts of things this is the BNF, there is broader, ideally more symmetrical, support for expressions. Next steps: I am now creating a more extensive document, including use cases from public benchmarks.
|
Beta Was this translation helpful? Give feedback.
-
Beta Was this translation helpful? Give feedback.
All reactions