Implementing weights in Madara #968

apoorvsadana · 2023-08-05T10:37:45Z

apoorvsadana
Aug 5, 2023
Maintainer

Hey everyone, I am opening this thread as per my discussion with @EvolveArt and @tdelabro regarding the implementation of weights in Madara. I have tried to add all the details and links needed to understand the complications involved. The main purpose of this discussion is to evaluate different possible methods of implementing weights (or any other solution) that address the attack vector described below and, at the same time, keep block production efficient and within limits.

About weights

The main purpose of weights in Substrate is to ensure that the chain can create blocks within the targeted execution time. For a block time of 6 seconds, the recommended execution time is around 2 seconds. With the right weights, the node can recognise which transactions would push the execution time past 2 seconds, skip them, and include them in the next block. For example, if the node has already been executing transactions for 1.5 seconds and a new transaction will take 1 second to run (we know this because of its weight), the node skips it and adds it to the next block.
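
The skip-or-include decision can be sketched as follows. This is a minimal illustration and not Substrate's actual block builder: `PendingTx`, `fill_block` and `MAX_BLOCK_WEIGHT` are hypothetical names, and the sketch assumes each transaction already carries a weight equal to its estimated execution time in picoseconds.

```rust
/// Execution budget per block: 2 seconds, expressed in picoseconds.
const MAX_BLOCK_WEIGHT: u64 = 2_000_000_000_000;

/// A pending transaction with a pre-assigned weight
/// (estimated execution time in picoseconds). Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
struct PendingTx {
    id: u32,
    weight: u64,
}

/// Splits the queue into (included, deferred) without executing anything.
/// A transaction is deferred to the next block when its weight does not
/// fit in the budget that remains after `already_used` picoseconds.
fn fill_block(queue: &[PendingTx], already_used: u64) -> (Vec<PendingTx>, Vec<PendingTx>) {
    let mut used = already_used;
    let mut included = Vec::new();
    let mut deferred = Vec::new();
    for tx in queue {
        match used.checked_add(tx.weight) {
            // The transaction fits: account for its weight and include it.
            Some(total) if total <= MAX_BLOCK_WEIGHT => {
                used = total;
                included.push(tx.clone());
            }
            // Too heavy for this block (or the sum overflowed): try again next block.
            _ => deferred.push(tx.clone()),
        }
    }
    (included, deferred)
}
```

With 1.5 seconds already used, a transaction weighing 1 second is deferred, while one weighing 0.4 seconds still fits.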

Assigning weights to a transaction

The weight of a transaction is the time it takes to execute the transaction, measured in picoseconds. This is tricky because the only way to know this number is to execute the transaction itself. But the purpose of weights is to tell us whether to include a transaction in a block before executing it, so we need to give Substrate a way to estimate the execution time.

On Frontier (the Ethereum compatibility layer for Substrate), the gas of the transaction is used to calculate the weight. This makes sense, as gas on EVM chains is directly related to the computational resources needed to execute the transaction and hence to the execution time. Once you know you want to use gas to estimate the weight, the next question is how much weight to assign per unit of gas. You assume the worst case, i.e. that a block consumes its entire gas limit. Since you know the total weight of the block (2 seconds expressed in picoseconds) and the maximum gas it can consume, you can find the weight per unit of gas by dividing the former by the latter. We assume the worst case to be sure we don't exceed the 2 second execution time.
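
To make the arithmetic concrete, here is a small sketch of the worst-case calculation. The function names and the 50,000,000 gas limit used in the comment are assumptions chosen for illustration; they are not Frontier's actual constants.

```rust
/// Picoseconds in one second.
const PICOS_PER_SECOND: u64 = 1_000_000_000_000;

/// Weight assigned to each unit of gas, assuming a full block consumes
/// `block_gas_limit` gas and must execute within `execution_secs` seconds.
/// Integer division rounds down, so a full block never exceeds the budget.
fn weight_per_gas(execution_secs: u64, block_gas_limit: u64) -> u64 {
    (execution_secs * PICOS_PER_SECOND) / block_gas_limit
}

/// Estimated weight of a transaction, derived from its gas.
fn gas_to_weight(gas: u64, weight_per_gas: u64) -> u64 {
    gas.saturating_mul(weight_per_gas)
}

// Example with a hypothetical limit: a 2 s budget and 50_000_000 gas per block
// give 40_000 ps per unit of gas, so a 21_000 gas transaction is assigned
// 840_000_000 ps (0.84 ms) of weight.
```

Because the ratio is derived from the full block gas limit, a block that uses all of its gas is assigned exactly the 2 second budget and no more.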

Now, on chains like Starknet, the math isn't very straightforward because gas on Starknet is a measure of how long it takes to prove the transaction and isn't directly related to the execution time. However, at the same time, the gas is the best variable we have in the transaction input which can be used to estimate the complexity of the transaction. Hence, we have to use this to estimate the weight (this does make benchmarking a little complicated as explained below).

Benchmarking

Starknet has its own limits on the total steps in a block or transaction and on the total gas of a block. However, these limits were created with the architecture of Starkware's sequencer in mind. We need to recalculate them for our architecture and, specifically, for the block time we choose in Substrate.

As of writing this, we have an execution time of 2 seconds. This means our BLOCK_GAS_LIMIT is the maximum gas we can include in a block if we run transactions for 2 seconds. This is clearly very subjective, and the number would vary widely across machines. Hence, to arrive at the correct number, we define BLOCK_GAS_LIMIT as:

the maximum gas we can include in a block if we run transactions for 2 seconds on a node running on the minimum hardware spec required by our chain.

While this still isn't a deterministic number and the exact BLOCK_GAS_LIMIT would differ slightly over executions, the difference would be small and negligible as long as it's run on the same specific machine.

As for how to calculate this number in practice, we can use the benchmarking tools provided by Substrate. On EVM chains, you can simply benchmark against a few contracts and check the time it takes to execute 1 unit of gas. Since gas is directly related to execution time, we should get fairly consistent results and should be able to calculate our BLOCK_GAS_LIMIT for 2 seconds.
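
As a rough sketch of that calculation, assuming a benchmark run on the minimum-spec machine reports the gas used and the elapsed time in picoseconds (the function names and the numbers in the comment are hypothetical, not output from Substrate's benchmarking tools):

```rust
/// Execution time of one unit of gas, from a single benchmark run.
/// Returns `None` when no gas was used, since the ratio is undefined.
fn picos_per_gas(elapsed_picos: u64, gas_used: u64) -> Option<u64> {
    elapsed_picos.checked_div(gas_used)
}

/// Maximum gas that fits in the execution budget at the measured rate.
/// Returns `None` for a zero measurement, which would make the limit unbounded.
fn block_gas_limit(budget_picos: u64, picos_per_gas: u64) -> Option<u64> {
    budget_picos.checked_div(picos_per_gas)
}

// Example with made-up measurements: 1_000_000 gas executed in 25 ms
// (25_000_000_000 ps) is 25_000 ps per gas, so a 2 s budget
// (2_000_000_000_000 ps) allows 80_000_000 gas per block.
```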

On Starknet, however, the time taken to execute 1 unit of gas (the weight per gas) can vary drastically across contracts, depending on what the contract is doing. As mentioned above, this is mainly because gas is not a direct measure of execution time. Consequently, we need to come up with techniques to calculate an approximate weight per unit gas. Some possible ways to do this are:

1. Benchmark all different operations you can do on a Starknet contract (hashing, storage access etc.) and find out the weight per unit gas for each operation. Take the highest weight per unit gas achieved as the worst case and use it for estimating the weight of the transaction.
   - Pro: The block execution time should ideally never exceed the targeted time.
   - Con: The block production will be inefficient. For example, assume the node had already taken 1.8s in executing transactions. Now if a new transaction comes up which actually takes 0.1s to execute, we will assume the worst case and say that it will take 0.3s to execute the transaction and hence not include it in the current block.
2. Take the top contracts on Starknet and benchmark them to get an average estimate of weight per unit gas.
   - Pro: Since we are taking the top and most used contracts, the con mentioned in the previous case is less likely to happen. Hence, block production will be more efficient.
   - Con: It's possible to exceed the block execution time of 2 seconds.
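
The trade-off between the two approaches can be sketched with made-up benchmark numbers. Everything here is hypothetical: the per-operation measurements, the function names and the transaction size are chosen only to reproduce the 1.8s / 0.1s / 0.3s example above.

```rust
/// Worst-case estimate: the slowest measured time per unit of gas (picoseconds).
fn worst_case(samples: &[u64]) -> Option<u64> {
    samples.iter().copied().max()
}

/// Average estimate over the benchmarked contracts (picoseconds per gas).
fn average(samples: &[u64]) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    // Sum in u128 so that many large samples cannot overflow.
    let total: u128 = samples.iter().map(|&s| s as u128).sum();
    Some((total / samples.len() as u128) as u64)
}

/// Whether a transaction of `gas` units fits in the remaining budget
/// when weighted at `picos_per_gas`.
fn fits(used_picos: u64, gas: u64, picos_per_gas: u64, budget_picos: u64) -> bool {
    used_picos.saturating_add(gas.saturating_mul(picos_per_gas)) <= budget_picos
}

// Hypothetical measurements: 10_000 ps/gas for one operation, 30_000 ps/gas for another.
// A 10_000_000 gas transaction that really runs at 10_000 ps/gas takes 0.1 s.
// Worst case (30_000 ps/gas) estimates 0.3 s: 1.8 s + 0.3 s > 2 s, so it is skipped.
// Average (20_000 ps/gas) estimates 0.2 s: 1.8 s + 0.2 s <= 2 s, so it is included.
```

The same numbers also show the risk of averaging: a transaction that really runs at 30,000 ps per gas would be admitted on a 0.2s estimate and then take 0.3s.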

These are just 2 methods but there is scope for improvement here and we might be able to come up with better approaches that can be more efficient.

Current behaviour

Right now, we assign a weight of 0 to all our transactions. Hence, there's no way for Substrate to estimate how long a transaction will take before actually executing it. As a result, our node currently keeps adding transactions to the block until it exceeds the block execution time, after which it stops. A few points to note here are:

- If we submit a very big transaction (40 million steps, for example), the node fails to propose a block in its slot and logs the error "discarding proposal for slot 281845487; block production took too long". However, in the very next slot, the node includes the transaction (which was already executed) and successfully proposes a block. Note that the node shouldn't be allowed to execute 40 million steps with a 2 second execution time, so the fact that we can do this is itself a bug. Even with the bug, the chain doesn't seem to halt, as we are able to add the transaction in the next block.
- In the case above, if we reduce the block time by 3x, we see the error 3 times, after which the transaction is finally included in a block. So the logic is basically to skip every slot during which execution is still in progress and, once execution is complete, include the transaction in the next slot.
- The two cases above were tested with a single node, so every slot was assigned to that node. It would be interesting to see how the chain behaves when multiple nodes interact with each other.
- If we keep the transaction size within reasonable bounds (less than a million steps, for example, as on Starknet), block production always happens correctly without any error. This is mostly because, with transactions within these bounds, it's not possible to exceed the execution time by a large margin, and hence Substrate is able to keep the execution time around the targeted time.

Attack vector in the current setup

Our current setup doesn't have any limits on the total steps or gas allowed in a block. The only limits that all nodes agree on are a 6 second block time and a 2 second budget to execute transactions or import a block. However, time, as mentioned before, is a non-deterministic measure. What takes 2s to execute on one machine might take a minute on another. Hence, a possible attack vector in this case is:

1. A node runs on a supercomputer and manages to run some transactions within 2 seconds.
2. All other nodes, running on recommended hardware, take a minute to import this block.
3. As a result, all nodes miss their slot to propose a block.
4. The supercomputer node once again gets the slot and the process repeats.
5. Only the supercomputer node is able to propose blocks.
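
A toy model of this scenario, under loud assumptions: a node is treated as unable to author while it is still importing a block, import time is assumed to scale inversely with a hypothetical `speedup` factor relative to recommended hardware, and all names are made up for illustration. Real consensus behaviour is more involved than this.

```rust
/// Slot duration in seconds (the 6 second block time discussed above).
const SLOT_SECS: u64 = 6;

/// Time a node needs to import a block, given the block's execution time on
/// recommended hardware and how many times faster than that hardware the node is.
fn import_secs(reference_secs: u64, speedup: u64) -> u64 {
    // Treat a speedup of 0 as 1 to avoid dividing by zero.
    reference_secs / speedup.max(1)
}

/// Whole slots that pass while the node is still busy importing (toy model).
fn slots_missed(import_secs: u64) -> u64 {
    import_secs / SLOT_SECS
}

// Example: a block that takes 60 s on recommended hardware.
// A node that is 30x faster imports it in 2 s and misses no slot,
// while a node on recommended hardware is busy for 10 slots.
```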

Steps from here

The following are the possible actions we can take from here:

- Evaluate if we can use anything apart from gas to decide the weight of a transaction.
- Decide a way to benchmark weight per unit gas.
- Implement the benchmarking using Substrate's benchmarking tools (an example can be found in this PR).
- Decide on a BLOCK_GAS_LIMIT for the chain (and make it configurable, as app chains would want to set their own limits).
- Decide a way to write test cases for the BLOCK_GAS_LIMIT.

rphmeier · 2023-08-05T21:41:59Z

rphmeier
Aug 5, 2023

On Starknet, however, the time taken to execute 1 unit of gas (weight per gas), can vary drastically across contracts depending upon what the contract is doing

This sounds like a gas mispricing issue, like the one that enabled the Shanghai attacks on Ethereum in 2016.

The main purpose of Weights in Substrate is to assign costs for coarse blocks of code and thereby gain efficiency over metering after every addition, multiplication, storage read, memory access, etc. Substrate provides infrastructure for benchmarking for those code blocks in order to meter them correctly, but I don't believe this is a hard requirement.

Since it seems, based on the description and link you've provided, that Cairo already limits gas or steps within a contract (please correct me if I'm wrong), it might be reasonable to simply treat the Weight limit as a gas limit (plus some extra for non-Cairo logic, such as pallet initialization and finalization).

However, highly varying time costs per gas are a major issue, which likely requires either setting the overall gas limit conservatively or repricing certain instructions; otherwise the chain is vulnerable to gas mispricing attacks.

1 reply

rphmeier Aug 5, 2023

i.e. sidestepping weight-per-unit-gas, the thing we really care about is time-per-unit-gas, as that's what opens up the "supercomputer attack" described above. Any trapdoor (algorithmic or financial) where one node can compute an instruction significantly faster than others makes that attack viable by spamming contracts which execute only that instruction.

EvolveArt · 2023-12-04T17:03:36Z

EvolveArt
Dec 4, 2023
Maintainer

Update

An update on this issue after internal discussion and extensive testing using gomu gomu.

Quoting @apoorvsadana

The safest thing we can do in the current setup is to calculate the worst-case time for 1 million steps and use that as the weight. This would be inefficient, but it would prevent any validator from skipping a slot.
This still doesn't solve the problem that users can spam the network with low-fee, slow transactions.

And @tdelabro

If the node picks transactions according to the max fee, we are already better off.

In conclusion, Madara's transaction ordering policy needs to be clearly defined and configurable (GPA, FCFS or more complex ones).
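
As an illustration of what a configurable ordering policy could look like, here is a minimal sketch with two policies: first come, first served, and highest max fee first. The `PoolTx` and `OrderingPolicy` types are hypothetical and are not part of Madara's transaction pool.

```rust
/// A transaction waiting in the pool (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct PoolTx {
    id: u32,
    /// Arrival order in the pool; lower means earlier.
    arrival: u64,
    /// Maximum fee the sender is willing to pay.
    max_fee: u128,
}

/// Ordering policies a chain could choose between.
#[derive(Debug, Clone, Copy)]
enum OrderingPolicy {
    /// First come, first served.
    Fcfs,
    /// Highest max fee first; ties fall back to arrival order.
    MaxFee,
}

/// Returns the transactions in the order the block builder should try them.
fn order(mut txs: Vec<PoolTx>, policy: OrderingPolicy) -> Vec<PoolTx> {
    match policy {
        OrderingPolicy::Fcfs => txs.sort_by_key(|tx| tx.arrival),
        OrderingPolicy::MaxFee => txs.sort_by(|a, b| {
            b.max_fee
                .cmp(&a.max_fee)
                .then(a.arrival.cmp(&b.arrival))
        }),
    }
    txs
}
```

Under `MaxFee`, a flood of low-fee, slow transactions is pushed behind better-paying ones, which is the improvement described in the quote above; it does not by itself bound execution time.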

Related Discussion

https://substrate.stackexchange.com/questions/10517/how-are-the-node-threads-and-tasks-managed

0 replies
