Reducing redundant type information by (mostly) storing type only at root of Value IR #213

edwardpeters · 2023-11-17T16:07:43Z

edwardpeters
Nov 17, 2023

tl;dr

The 11-16 project meeting ended with a discussion of whether it would be worth it to drop some of the redundant type information from the IR for various reasons, and how that might be done.

The core of this idea is that, in serialization and memory, typed values would not necessarily store typed values as their AST children, but instead untyped values. For example:
(0 : Int, True : Bool) : (Int, Bool)
you would just have:
(0, True) : (Int, Bool)
Access to child values would then do (minimal, easy) processing to recover the type information when necessary. So in the above case, if you wanted to get the tuple's child at the 2nd position, you could take the raw value stored at that position (True), the type stored at the same position (Bool), and thus reconstruct the typed value True : Bool.

To be clear, the goal is not to rerun any difficult inferencing, and in some cases (described below) it would still be necessary to keep type information on certain AST children. It may help to think of it as "Do the difficult work of type inference once, then store enough hints to make reconstructing everything easy."

Motivation

One reason to do this is simply to reduce the size of the IR, in memory and on disk. I believe this reduction will be several-fold. If you consider stacking tuples as a stand in for many AST trees, suppose you have:

0 : Int
(0, 1) : (Int, Int)
((0, 1), (2, 3)) : ((Int, Int), (Int, Int))

The size grows as 2^d where d is the depth of the tree. However, if you include type information at every level, this becomes:

0 : Int
(0 : Int, 1 : Int) : (Int, Int)
((0 : Int, 1 : Int) : (Int, Int), (2 : Int, 3 : Int) : (Int, Int)) : ((Int, Int), (Int, Int))

Which grows as d * 2^d. Inspection of any Morphir IR shows this happening - enormous amounts of redundant type information take up the vast majority of the IR files.

A further consideration is that any time you have redundant information, you introduce the possibility of conflicts. Suppose you need to reverify untrusted IR for any reason. Take the case of a tuple. In the reduced case you have (left, right) : (T0, T1). You need to check that left is of type T0 or right is of type T1. In the unreduced case you have (left : T0, right : T1) : (T2, T3). To be sure this is correct, you additionally need to also check that T0 == T2 and T1 == T3. If these have any sort of deep IR trees beneath them, this is itself a non-trivial tree comparison - and parts of that comparison will need be repeated at lower levels.

Proposed Implementation

Below is a list of IR values, how the type information of their children (if any) could be trivially recovered, and when child type information might need to be retained.

This assumes that the type of the parent has been fully dealiased. Note that minimal work is done in any case - you never have to look more than one layer down.

Literal - Trivial (leaf node).
Constructor - Trivial (leaf node).
Tuple - The type of any child is stored on the tuple's type at the same index.
List - The type of a list is a Reference with an argument for the element type; that element type is then the type of any list element.
Record - A Record Value has a map from names to Values, while the Record Type has a list of fields. That list could be converted to a map, and then the type of a given element retrieved from it. (I'm unclear why the Value and the Type have different models - Map vs List - btw.) I'm not sure if Records are ever statically stored as ExtensibleRecord Types, but if so they can retrieve child information in the same way.
Variable - Trivial (leaf node).
Reference - Trivial (leaf node).
Field - Field nodes need to store some additional information, as the Record holds type information beyond what's in teh field. someRecord.name : String does not tell you the type of someRecord. So, for this case, the Record would need to be stored with its type.
FieldFunction- Trivial (leaf node).
Apply - Apply nodes would also need to store additional information, as only the return type is present on the Apply node. However, if the argument type is stored, then the function's type will be ArgType -> ReturnType. To make this clearer, if you have Apply(f, arg : T1) : T2, then f : T1 -> T2
Lambda - A Lambda's type will be a Function; the lambda's pattern will be the Function's argument type, and the lambda's body will be the function's return type.
LetDefinition - Again not all child type information can be easily elided, as the Definition's type needs to be stored to accurately type that definition's body. You may also need to do some currying/uncurrying transformations, to convert the flat list of input types to a nested Function type. However, the "in" bodyof the LetDefinition will always have the same type as the LetDefinition.
LetRecursion - same as LetDefinition, except with multiple Definitions, each of which needs to keep its type information.
Destructure - Destructure nodes need to store the type of either their pattern or their toDestructure child (either will do, they should have the same type.) The "in" body will have the same type as the destructure node.
IfThenElse - The "condition" Value must always be boolean-typed; both the if and else branch will be the same as the parent.
PatternMatch - Type information must be retained on the Value to be matched against; this type should be the same as every pattern branch. Every branch's body's type will be the same as the pattern's overall.
UpdateRecord - This will be of a Record (or extensible record) type. The type of the Record to be updated will be the same, and the type of every field to be updated can be recovered from that Record type.
Unit- Trivial (leaf node).

The Pattern AST is mostly trivial, with the exception of the Constructor pattern. I believe this type information could be recovered by looking up that particular constructor's arguments, but it is probably easier to just retain the type information explicitly in this case.

Caveats

Reviewing the above, I admit this wouldn't be perfect - there are more cases than I'd thought where some type information needs to get stored. Field, Apply, LetDefinition, LetRecursion, Destructure, PatternMatch and the Constructor Pattern would all need to keep some type information of their children. However, this would still greatly reduce the quantity of redundant information in the IR, even for some of those nodes - a twenty-branch PatternMatch node would only need to store two types, rather than forty-two.

I'll also admit this wouldn't be trivial to implement. While nothing described above is on the scale of true inference, you'd need some code to support accessing and modifying children. Additionally, you couldn't just have a straight TypedValue/RawValue dichotomy - for instance, in:
IfThenElse(Apply(<function>, <argument>), <thenBranch>, <elseBranch>)
You would have a somewhat weird state; following the above pattern and storing minimal information, you would have:
IfThenElse(Apply(<function>, <argument> : T0), <thenBranch>, <elseBranch>) : T1
Which results in the slightly weird construction of an Apply node that stores its argument type, but not its own. I believe it would still work - if you accessed the condition, you would know it was of type Bool so you would retrieve:
Apply(<function>, <argument> : T0) : Bool
and thus be able to reconstruct <function> as T0 -> Bool.

Type Information on Leaves

One other proposal that came up was to go the other direction, and store type information on every leaf rather than the parent. A quick review of the non-trivial cases there for completeness.

Tuple - The type may be easily reconstructed from the children.
List - Again the type may be easily reconstructed, tho note that in this case, you would need to store the type information many more times. Special handling would also be required for empty lists, which would need their type explicitly known as they have no children to ask.
Record - The type of a record can be reconstructed from its fields. However, if I'm wrong and a Record value CAN be typed as ExtensibleRecord, the children would not tell you this information.
Field - The type will be the type of the Record child at the Name'd field.
Apply - The type would be the return type of their function.
Lambda - The type would be a function from the pattern's type to the body's type.
LetDefinition - The type would be the type of the inBody
LetRecursion - Same as LetDefinition
IfThenElse - The type would be the same as either the then or else branch.
PatternMatch - The type would be the same as any case's body.
UpdateRecord - The type would be the same as that of the record to be updated.

A couple notes on this approach:

If the goal is to reduce the amount of redundant information, several of the above nodes would require storing more total types; if type information is held at every leaf, then we cannot take advantages of cases where several of those leafs must be the same. This comes up in the List, Apply, IfThenElse, PatternMatch and UpdateRecord cases, all of which would still have redundant type information stored beneath them.
In most access patterns you'll have a top-level thing available; if you need the type, in this model you have to do significant work to reconstruct it (walking the entire tree). In the other model, you only do one step of work each time you access one level deeper within the tree. This is a bit fuzzy, but I think that the top-down model would be easier to encapsulate and more performant in most situations.

AttilaMihaly · 2023-11-23T13:09:50Z

AttilaMihaly
Nov 23, 2023
Maintainer

If the main motivation is to decrease the size of the IR then I think there are easier ways to achieve that without putting too much burden on downstream tooling to reconstruct the missing types.

One very easy solution would be to zip the JSON which works very efficiently with repetitive data. Or we could use a dictionary to store the actual type information and only include small keys on each node. Those are all approaches that make it very easy for the consumer to get the types for any node.

You mentioned that the goal is not to rerun any difficult inferencing but the approach you are suggesting is actually moving a significant part of the type inference algorithm on the consumer side. And from what I can tell it's not even clear if we can come up with a set of rules that will make it work in every situation. Let me go through what the type inference algorithm does on a high-level:

It annotates every node in the expression tree with a unique type name (such as T42).
It traverses the tree again and generates constraints such as "T23 is Bool", "T34 is the same as T57".
It solves the constraints and ends up with concrete types for each type variable (and some generic types).
Finally it traverses the tree to attach the concrete types based on a lookup using the type variables.

What you are suggesting is to remove some of those concrete variables, then as needed reapply some of the constraints but in a reverse order, sometimes injecting some of the missing information.

A much easier thing to do would be to take the result of the solve step (3) which only contains the unique types and store that on a value definition level and only keep type variables on the expression tree nodes. But even that is a lot of complexity when the main goal is simply to decrease size of storage/memory.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reducing redundant type information by (mostly) storing type only at root of Value IR #213

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Reducing redundant type information by (mostly) storing type only at root of Value IR #213

edwardpeters Nov 17, 2023

tl;dr

Motivation

Proposed Implementation

Caveats

Type Information on Leaves

Replies: 1 comment

AttilaMihaly Nov 23, 2023 Maintainer

edwardpeters
Nov 17, 2023

AttilaMihaly
Nov 23, 2023
Maintainer