Reducing redundant type information by (mostly) storing type only at root of Value IR #213
Replies: 1 comment
-
If the main motivation is to decrease the size of the IR then I think there are easier ways to achieve that without putting too much burden on downstream tooling to reconstruct the missing types. One very easy solution would be to zip the JSON which works very efficiently with repetitive data. Or we could use a dictionary to store the actual type information and only include small keys on each node. Those are all approaches that make it very easy for the consumer to get the types for any node. You mentioned that the goal is not to rerun any difficult inferencing but the approach you are suggesting is actually moving a significant part of the type inference algorithm on the consumer side. And from what I can tell it's not even clear if we can come up with a set of rules that will make it work in every situation. Let me go through what the type inference algorithm does on a high-level:
What you are suggesting is to remove some of those concrete variables, then as needed reapply some of the constraints but in a reverse order, sometimes injecting some of the missing information. A much easier thing to do would be to take the result of the solve step (3) which only contains the unique types and store that on a value definition level and only keep type variables on the expression tree nodes. But even that is a lot of complexity when the main goal is simply to decrease size of storage/memory. |
Beta Was this translation helpful? Give feedback.
-
tl;dr
The 11-16 project meeting ended with a discussion of whether it would be worth it to drop some of the redundant type information from the IR for various reasons, and how that might be done.
The core of this idea is that, in serialization and memory, typed values would not necessarily store typed values as their AST children, but instead untyped values. For example:
(0 : Int, True : Bool) : (Int, Bool)
you would just have:
(0, True) : (Int, Bool)
Access to child values would then do (minimal, easy) processing to recover the type information when necessary. So in the above case, if you wanted to get the tuple's child at the 2nd position, you could take the raw value stored at that position (
True
), the type stored at the same position (Bool
), and thus reconstruct the typed valueTrue : Bool
.To be clear, the goal is not to rerun any difficult inferencing, and in some cases (described below) it would still be necessary to keep type information on certain AST children. It may help to think of it as "Do the difficult work of type inference once, then store enough hints to make reconstructing everything easy."
Motivation
One reason to do this is simply to reduce the size of the IR, in memory and on disk. I believe this reduction will be several-fold. If you consider stacking tuples as a stand in for many AST trees, suppose you have:
The size grows as
2^d
whered
is the depth of the tree. However, if you include type information at every level, this becomes:Which grows as
d * 2^d
. Inspection of any Morphir IR shows this happening - enormous amounts of redundant type information take up the vast majority of the IR files.A further consideration is that any time you have redundant information, you introduce the possibility of conflicts. Suppose you need to reverify untrusted IR for any reason. Take the case of a tuple. In the reduced case you have
(left, right) : (T0, T1)
. You need to check thatleft
is of typeT0
orright
is of typeT1
. In the unreduced case you have(left : T0, right : T1) : (T2, T3)
. To be sure this is correct, you additionally need to also check thatT0 == T2
andT1 == T3
. If these have any sort of deep IR trees beneath them, this is itself a non-trivial tree comparison - and parts of that comparison will need be repeated at lower levels.Proposed Implementation
Below is a list of IR values, how the type information of their children (if any) could be trivially recovered, and when child type information might need to be retained.
This assumes that the type of the parent has been fully dealiased. Note that minimal work is done in any case - you never have to look more than one layer down.
Literal
- Trivial (leaf node).Constructor
- Trivial (leaf node).Tuple
- The type of any child is stored on the tuple's type at the same index.List
- The type of a list is a Reference with an argument for the element type; that element type is then the type of any list element.Record
- A Record Value has a map from names to Values, while the Record Type has a list of fields. That list could be converted to a map, and then the type of a given element retrieved from it. (I'm unclear why the Value and the Type have different models - Map vs List - btw.) I'm not sure if Records are ever statically stored as ExtensibleRecord Types, but if so they can retrieve child information in the same way.Variable
- Trivial (leaf node).Reference
- Trivial (leaf node).Field
- Field nodes need to store some additional information, as theRecord
holds type information beyond what's in teh field.someRecord.name : String
does not tell you the type ofsomeRecord
. So, for this case, theRecord
would need to be stored with its type.FieldFunction
- Trivial (leaf node).Apply
- Apply nodes would also need to store additional information, as only the return type is present on the Apply node. However, if the argument type is stored, then the function's type will beArgType -> ReturnType
. To make this clearer, if you haveApply(f, arg : T1) : T2
, thenf : T1 -> T2
Lambda
- A Lambda's type will be a Function; the lambda's pattern will be the Function's argument type, and the lambda's body will be the function's return type.LetDefinition
- Again not all child type information can be easily elided, as the Definition's type needs to be stored to accurately type that definition's body. You may also need to do some currying/uncurrying transformations, to convert the flat list of input types to a nested Function type. However, the "in" bodyof the LetDefinition will always have the same type as theLetDefinition
.LetRecursion
- same asLetDefinition
, except with multiple Definitions, each of which needs to keep its type information.Destructure
- Destructure nodes need to store the type of either their pattern or theirtoDestructure
child (either will do, they should have the same type.) The "in" body will have the same type as the destructure node.IfThenElse
- The "condition" Value must always be boolean-typed; both the if and else branch will be the same as the parent.PatternMatch
- Type information must be retained on the Value to be matched against; this type should be the same as every pattern branch. Every branch's body's type will be the same as the pattern's overall.UpdateRecord
- This will be of a Record (or extensible record) type. The type of the Record to be updated will be the same, and the type of every field to be updated can be recovered from that Record type.Unit
- Trivial (leaf node).The
Pattern
AST is mostly trivial, with the exception of theConstructor
pattern. I believe this type information could be recovered by looking up that particular constructor's arguments, but it is probably easier to just retain the type information explicitly in this case.Caveats
Reviewing the above, I admit this wouldn't be perfect - there are more cases than I'd thought where some type information needs to get stored.
Field
,Apply
,LetDefinition
,LetRecursion
,Destructure
,PatternMatch
and theConstructor
Pattern would all need to keep some type information of their children. However, this would still greatly reduce the quantity of redundant information in the IR, even for some of those nodes - a twenty-branchPatternMatch
node would only need to store two types, rather than forty-two.I'll also admit this wouldn't be trivial to implement. While nothing described above is on the scale of true inference, you'd need some code to support accessing and modifying children. Additionally, you couldn't just have a straight TypedValue/RawValue dichotomy - for instance, in:
IfThenElse(Apply(<function>, <argument>), <thenBranch>, <elseBranch>)
You would have a somewhat weird state; following the above pattern and storing minimal information, you would have:
IfThenElse(Apply(<function>, <argument> : T0), <thenBranch>, <elseBranch>) : T1
Which results in the slightly weird construction of an
Apply
node that stores its argument type, but not its own. I believe it would still work - if you accessed the condition, you would know it was of typeBool
so you would retrieve:Apply(<function>, <argument> : T0) : Bool
and thus be able to reconstruct
<function>
asT0 -> Bool
.Type Information on Leaves
One other proposal that came up was to go the other direction, and store type information on every leaf rather than the parent. A quick review of the non-trivial cases there for completeness.
Tuple
- The type may be easily reconstructed from the children.List
- Again the type may be easily reconstructed, tho note that in this case, you would need to store the type information many more times. Special handling would also be required for empty lists, which would need their type explicitly known as they have no children to ask.Record
- The type of a record can be reconstructed from its fields. However, if I'm wrong and aRecord
value CAN be typed asExtensibleRecord
, the children would not tell you this information.Field
- The type will be the type of the Record child at theName
'd field.Apply
- The type would be the return type of their function.Lambda
- The type would be a function from the pattern's type to the body's type.LetDefinition
- The type would be the type of the inBodyLetRecursion
- Same as LetDefinitionIfThenElse
- The type would be the same as either thethen
orelse
branch.PatternMatch
- The type would be the same as any case's body.UpdateRecord
- The type would be the same as that of the record to be updated.A couple notes on this approach:
List
,Apply
,IfThenElse
,PatternMatch
andUpdateRecord
cases, all of which would still have redundant type information stored beneath them.Beta Was this translation helpful? Give feedback.
All reactions