Model for miner actor cron cost #761
ZenGround0 started this conversation in General
-
This is awesome, thank you. Do you have any initial recommendations for where concrete high bang-for-buck changes might be here? Is it worth optimising parts of this before attempting to move this work out of cron?
-
In order to understand how Filecoin scales, I've been investigating the cost of miner cron. After some basic data gathering and analysis I have a simple linear model for the gas cost of a cron job.
Model
This model predicts the gas cost of a cron job based on a few key features of a miner's proving-deadline cron job:
$x_1$: live partitions, the number of partitions with at least one live sector
$x_2$: continued-fault partitions, the number of partitions that have a fault before the job runs
$x_3$: newly faulted partitions, the number of partitions that detect a fault during the cron job
$x_4$: precommit expiry queue count
$x_5$: number of entries in the locked funds (vesting) table
$x_6$: killed sectors, the sectors either expired or terminated during the cron job
$x_7$: live deadline indicator, 0 if there are no live partitions, else 1
$x_8$: faulty deadline indicator, 0 if there are no continued-fault partitions, else 1
The model is
$\text{gas} = 24{,}813{,}522 + 746{,}632\,x_1 + 12{,}769{,}347\,x_2 + 39{,}824{,}541\,x_3 + 2{,}254{,}195\,x_4 + 79{,}361\,x_5 + 24{,}809\,x_6 + 12{,}192{,}578\,x_7 + 18{,}868{,}998\,x_8$
or succinctly, in rounded millions of gas,
$24 + 0.7x_1 + 12x_2 + 40x_3 + 2.3x_4 + 0.08x_5 + 0.025x_6 + 12x_7 + 19x_8$
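To make the model concrete, here is a minimal sketch of it as a Python function. The feature names and the `predict_cron_gas` helper are mine, not anything in the actors code; the coefficients are the full-precision ones above.

```python
# Sketch of the fitted model as a function; feature names are illustrative.
COEFFS = {
    "live_partitions": 746_632,                # x1
    "continued_fault_partitions": 12_769_347,  # x2
    "new_fault_partitions": 39_824_541,        # x3
    "precommit_expiries": 2_254_195,           # x4
    "locked_funds_entries": 79_361,            # x5
    "killed_sectors": 24_809,                  # x6
    "live_deadline": 12_192_578,               # x7, 0 or 1
    "faulty_deadline": 18_868_998,             # x8, 0 or 1
}
INTERCEPT = 24_813_522  # fixed overhead of any cron job

def predict_cron_gas(**features: int) -> int:
    """Predict gas for one proving-deadline cron job; omitted features default to 0."""
    return INTERCEPT + sum(COEFFS[name] * value for name, value in features.items())

# Example: an empty deadline with a full 360-entry vesting table costs ~53M gas.
print(predict_cron_gas(locked_funds_entries=360))
```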
Methods, evaluation and caveats
I gathered 1000 epochs of state-compute gas traces, separated out the cron jobs, and processed each into a vector of the feature components plus epoch and address identifiers and total gas used. This yielded about 70k vectors.
I then ran a multilinear regression over these variables. I discovered several of the model variables after extensive investigation and inspection of outliers. I chose a multilinear model because gas is additive and the variables all reference independent pieces of cron job work.
The model above is actually the result of two linear regressions. The precommit expiry term comes from a regression with all variables together; the rest come from a regression that filters out the ~3000 jobs with nonzero precommit expiries.
The all-variables regression coefficients are similar to those of the filtered regression. In millions of gas they are (25, 0.6, 7, 41, 2.3, 0.08, 0.025, 12, 27). However, the variance of this model is much higher, with $R^2 = 0.75$ and a mean absolute error of 2.7M gas.
The regression on the data with the nonzero precommit expiry jobs filtered out has $R^2 = 0.93$ and a mean absolute error of 1.6M gas.
To evaluate these statistics I used a held-out set of 10% of the data.
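For reproducibility, the fitting procedure is roughly the following sketch, assuming the ~70k job vectors live in a CSV with one row per cron job; the file name and column names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

FEATURES = ["live_parts", "cont_fault_parts", "new_fault_parts",
            "precommit_expiries", "locked_entries", "killed_sectors",
            "live_deadline", "faulty_deadline"]

df = pd.read_csv("cron_jobs.csv")  # hypothetical file: one row per cron job

# Second regression: drop the ~3000 jobs with nonzero precommit expiries.
filtered = df[df["precommit_expiries"] == 0]

# Hold out 10% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    filtered[FEATURES], filtered["gas_used"], test_size=0.1, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("R^2:", r2_score(y_test, pred))
print("MAE (gas):", mean_absolute_error(y_test, pred))
```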
Observations, intuition, explanations
The following are observations from looking through the data of many cron jobs, understanding the proving-deadline code, and thinking through the results provided by the model.
High overhead
An immediate observation is that overhead is relatively high, putting cron jobs for deadlines with 0 partitions at a similar gas cost to jobs covering many partitions.
There is a ~24M gas overhead for running a noop job and rescheduling the next one. Assuming a full vesting table with 360 entries (the steady-state and most common condition), that's another 360 × 0.08M ≈ 29M, for a total of ~53M gas. Adding in 3 full live partitions (~70 TiB of sector storage each) adds only another 12 + 2 = 14M gas, about 1/5 of the total cost.
This indicates there is an opportunity to significantly reduce cron costs by skipping overhead operations within jobs, or by skipping unneeded cron jobs entirely.
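Reusing the hypothetical `predict_cron_gas` sketch from above, the overhead share works out as follows.

```python
# Overhead share of an otherwise-busy deadline, per the model.
noop = predict_cron_gas(locked_funds_entries=360)                  # ~53M gas
busy = predict_cron_gas(locked_funds_entries=360,
                        live_partitions=3, live_deadline=1)        # ~68M gas
print(f"overhead share with 3 live partitions: {noop / busy:.0%}")  # ~79%
```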
Dense non-empty deadlines
Consistently, about half of cron jobs in each epoch have live or dead partitions, and ~38% of cron jobs have live partitions. This is denser than I had expected and indicates healthy load balancing.
Cheap Happy Path
Related to the high overhead observation is the observation that it is relatively inexpensive for cron to maintain a miner actor with many sectors proven in its deadline. Imagining for a moment that our model doesn't break down at unreasonably high partition counts, we can calculate the gas cost of proving all the system's storage in a single deadline. Assuming 32GiB sectors, there are about 14k partitions per EiB. Per EiB that's 65M overhead (24M base + 29M vesting table + 12M live-deadline indicator) + 14k × 0.7M ≈ 9.8B, call it 10B. So at maximum efficiency we roughly hit the block gas limit with cron jobs if one EiB is proven per epoch. With 2880 epochs in a day and all sectors proven daily, that puts a hard happy-path bound on Filecoin scaling at a massive 2.8 ZiB. Of course, with realistic assumptions like uneven deadline assignment and multiple miner actors (each adding overhead), this limit will be hit much earlier. But it is so high that it doesn't look likely to be the first thing that breaks.
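Spelled out in code, under the stated assumptions (32 GiB sectors, 2349 sectors per window PoSt partition, a 10B block gas limit):

```python
# Happy-path scaling arithmetic.
SECTORS_PER_PARTITION = 2349                                 # 32 GiB sectors
partitions_per_eib = (2**30 // 32) // SECTORS_PER_PARTITION  # ~14.3k

overhead = 65e6  # base + full vesting table + live-deadline indicator
gas_per_eib = overhead + partitions_per_eib * 0.7e6          # ~10B gas

BLOCK_GAS_LIMIT = 10e9
EPOCHS_PER_DAY = 2880
eib_per_epoch = BLOCK_GAS_LIMIT / gas_per_eib                # ~1 EiB per epoch
print(f"happy-path bound: {eib_per_epoch * EPOCHS_PER_DAY / 1024:.1f} ZiB")  # ~2.8
```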
Expensive PreCommits
The biggest surprise in conducting this investigation was the dominance of precommit expiry in cron cost. Outliers with high costs consistently had expiring precommits; I found at least one job costing 1B gas (it expired 184 precommits). In retrospect this is not that surprising. Almost all precommits are satisfied before expiring, and the expiry operation does lookups in the miner's PreCommittedSectors HAMT. Since these precommits are almost never found, we end up traversing the HAMT all the way to its leaves each time. For poor code reasons we also get no caching benefit, because we reload the map every time.
Precommit expiry is almost always a noop apart from clearing the expiry queue. These noop cases should probably be moved into the prove-commit methods, at the point where precommits are consumed, where they will be much cheaper: one AMT lookup for the expiry queue epoch, one bitfield operation, and one rewrite of the AMT queue, instead of all of this plus a failing precommitted-sectors HAMT lookup.
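A rough sketch of that proposed change, using plain Python dicts and sets as stand-ins for the on-chain HAMT/AMT/bitfield structures; the function and field names here are hypothetical, not the actual actor methods.

```python
# Illustrative only: dicts and sets stand in for HAMT/AMT/bitfield state.
precommits: dict[int, dict] = {}        # sector number -> precommit info (HAMT)
expiry_queue: dict[int, set[int]] = {}  # expiry epoch -> sector numbers (AMT of bitfields)

def on_prove_commit(sector: int, expiry_epoch: int) -> None:
    """Proposed: drop the queue entry when the precommit is consumed."""
    precommits.pop(sector, None)              # this delete already happens today
    entries = expiry_queue.get(expiry_epoch)  # one AMT lookup
    if entries is not None:
        entries.discard(sector)               # one bitfield operation
        if not entries:
            expiry_queue.pop(expiry_epoch)    # one AMT queue rewrite

def on_cron_expiry(epoch: int) -> None:
    """Today: every queued sector costs a (usually failing) HAMT lookup."""
    for sector in expiry_queue.pop(epoch, set()):
        info = precommits.get(sector)         # traverses to a HAMT leaf, almost always a miss
        if info is not None:
            pass                              # penalize and delete the expired precommit
```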
The observed large variance in gas per expired precommit is probably explained by this operation's dependence on the size of the precommitted-sectors HAMT, which is expected to vary.
Faults
Fault analysis is similar to the happy-path analysis, but over an order of magnitude more expensive: per partition, it's ~17x more expensive for continued faults and ~57x more expensive for new faults.
Looking only at per-partition costs, and taking the happy-path calculation of 10B gas to process 1 EiB (hence ~2.8 ZiB at saturation), we get 2.8 ZiB / 17 ≈ 165 EiB of network storage before processing continued faults for a total network fault exceeds the gas limit, assuming a uniform distribution of cron jobs. The corresponding threshold for a new total fault of the whole network is 49 EiB. Once these network sizes are crossed, there exist faulting events that cannot be processed within the block gas limit. Of course, as in the happy-path case, these are very loose upper bounds; in reality, with multiple miner actors, uneven distribution of cron jobs, and variance not accounted for by the model, the network size at which a total fault exceeds the gas limit in cron is much smaller.
Stated another way, neglecting miner actor overhead and focusing on per-partition costs, the fault gas paid in one epoch during a faulting event is:
(EiBs faulted) x (14k) x [12M for continued | 40M for new] / (epochs of fault)
Today, a 10% fault spread over the period of a deadline (60 epochs) is (1 EiB) × 14k × [12M | 40M] / 60 = 2.8B gas per epoch for continued faults, or 9.3B gas per epoch for new faults. At 10 gas per ns, that's 0.3 and 0.9 seconds per epoch for faulting cron.
If we do the same analysis at a 130 EiB network, that's roughly 3.6 and 12 seconds per epoch added for faulting cron.
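The same back-of-the-envelope in code, using the per-partition formula above and the assumed 10 gas/ns execution rate:

```python
# Fault-processing gas per epoch, per the formula above.
PARTITIONS_PER_EIB = 14_000
GAS_PER_NS = 10  # assumed execution rate from above

def fault_gas_per_epoch(eib_faulted: float, per_partition_gas: float,
                        fault_epochs: int = 60) -> float:
    return eib_faulted * PARTITIONS_PER_EIB * per_partition_gas / fault_epochs

for eib in (1, 13):  # 10% faults on a ~10 EiB (today) and a 130 EiB network
    cont_gas = fault_gas_per_epoch(eib, 12e6)  # continued faults
    new_gas = fault_gas_per_epoch(eib, 40e6)   # new faults
    print(f"{eib:>2} EiB faulted: {cont_gas/1e9:.1f}B gas "
          f"({cont_gas/GAS_PER_NS/1e9:.2f}s) continued, "
          f"{new_gas/1e9:.1f}B gas ({new_gas/GAS_PER_NS/1e9:.2f}s) new")
```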
It is interesting to note that as the network grows, each fault becomes less expensive for the SP but remains a constant expense for the network, since at a higher network size the same amount of power earns a smaller reward. And faults take no fee because of the window post grace period, only a power freeze. For this reason it might be worth revisiting the window post grace period at higher network sizes.