D-Day Governance #5588

Open
kianenigma opened this issue Sep 4, 2024 · 20 comments
Labels
T1-FRAME This PR/Issue is related to core FRAME, the framework.

Comments

@kianenigma
Contributor

kianenigma commented Sep 4, 2024

Write a new governance pallet that should reside in the relay chain, while the main governance apparatus resides on Asset Hub.

The main usage of this pallet is when AH and/or Collectives are not producing blocks, and therefore can no longer access Root on the relay chain.

The assumption of this pallet is that it can have access to the latest state root of both Collectives and AH, and also has some notion of "soft metadata" of Collectives and AH. As in, it knows that a state proof corresponding to a specific hard-coded key prefix is associated with e.g. the balance of a user in AH.

A few key properties of this pallet:

Proposal creation:

  • Either of:
    • the Collectives parachain (fellowship) can always create one
    • membership proof in the fellowship (should collectives be offline)
      • signed origin sends a proof of fellowshipCollective::members(who) -> rank, then ensure that the origin was who, and rank is high enough.
      • Fellowship members should retain some DOT in RC for this.
  • AND, the pallet knows that AH has not produced blocks for some period of time.
  • Should AH/collective resume producing blocks, the pallet should ignore any ongoing proposals.
    • Any half-finished voting/proposal data should be removed lazily.
    • This entails the importance of not being able to "trick" the pallet to think that AH is stalled.
  • Should be implemented as an instance of pallet-referenda.
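
The proposal-creation rules above can be sketched as a plain predicate. This is only an illustration under assumptions: `Proposer`, `StallConfig`, `is_stalled`, `can_propose` and `proposal_is_live` are hypothetical names, the fellowship membership proof is assumed to have been verified already (yielding `rank`), and "stalled" is approximated as "the AH head has not changed for more than a configured number of relay chain blocks".

```rust
// Hypothetical sketch: none of these names come from pallet-referenda or the demo branch.
pub type BlockNumber = u32;
pub type Rank = u16;

/// Who is trying to open a D-Day proposal.
pub enum Proposer {
    /// The Collectives parachain itself, which can always propose.
    Collectives,
    /// A signed relay chain account whose membership proof was already
    /// checked against the last known Collectives state root.
    ProvenMember { rank: Rank },
}

/// Assumed configuration values.
pub struct StallConfig {
    /// Minimum fellowship rank allowed to propose via a proof.
    pub min_rank: Rank,
    /// Number of relay chain blocks without a new AH head before AH counts as stalled.
    pub stall_threshold: BlockNumber,
}

/// AH counts as stalled when its head has not advanced for more than the threshold.
pub fn is_stalled(now: BlockNumber, last_ah_head_update: BlockNumber, cfg: &StallConfig) -> bool {
    now.saturating_sub(last_ah_head_update) > cfg.stall_threshold
}

/// (Collectives OR proven member with high enough rank) AND AH is stalled.
pub fn can_propose(
    proposer: &Proposer,
    now: BlockNumber,
    last_ah_head_update: BlockNumber,
    cfg: &StallConfig,
) -> bool {
    let authorized = match proposer {
        Proposer::Collectives => true,
        Proposer::ProvenMember { rank } => *rank >= cfg.min_rank,
    };
    authorized && is_stalled(now, last_ah_head_update, cfg)
}

/// Ongoing proposals are ignored as soon as AH resumes producing blocks.
pub fn proposal_is_live(now: BlockNumber, last_ah_head_update: BlockNumber, cfg: &StallConfig) -> bool {
    is_stalled(now, last_ah_head_update, cfg)
}
```

Because `proposal_is_live` reuses the same stall check, a resumed AH invalidates ongoing proposals without any eager cleanup, which matches the lazy-removal idea above.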

Voting

  • Should be implemented as a new pallet, implementing Tally and then linked to pallet-referenda.
  • Voters don't have balance on RC, but we can assume that some "power users", such as foundations, teams and fellowship do.
  • Voting can be done through submitting a proof of one's balance on AH, and a preference (aye/nay).
  • Voting info is stored in a child-tree per referendum, to enable a better lazy cleanup.
  • Concerns for the RC:
    • Not get flooded by (free) transactions.
    • first process high-value-bearing votes, and then the smaller ones.

Option 1: Simple

  • Per referendum, each user gets 1 (fully) free immutable vote.
  • One simple de-sybil mitigation might be to introduce a type MinimumVotingPower.
  • Prioritizing votes in this case is hard; the only way to do it is to add logic to the
    transaction pool validation step, which will require at least reading one storage item.
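
A minimal sketch of Option 1, assuming the balance proof has already been verified and `proven_balance` is its result. `SimpleTally`, `VoteError` and the `min_power` argument (standing in for `MinimumVotingPower`) are hypothetical names, and a `HashMap` stands in for the per-referendum child-tree.

```rust
use std::collections::HashMap;

// Hypothetical types for illustration only.
pub type AccountId = u64;
pub type Balance = u128;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Vote {
    Aye,
    Nay,
}

#[derive(Debug, PartialEq, Eq)]
pub enum VoteError {
    /// Proven balance is below the anti-sybil minimum.
    BelowMinimum,
    /// Votes are immutable: one per account per referendum.
    AlreadyVoted,
}

/// Tally for a single referendum.
#[derive(Default)]
pub struct SimpleTally {
    pub ayes: Balance,
    pub nays: Balance,
    voted: HashMap<AccountId, Vote>,
}

impl SimpleTally {
    /// Record the single free vote of `who`, weighted by the proven AH balance.
    pub fn vote(
        &mut self,
        who: AccountId,
        proven_balance: Balance,
        vote: Vote,
        min_power: Balance,
    ) -> Result<(), VoteError> {
        if proven_balance < min_power {
            return Err(VoteError::BelowMinimum);
        }
        if self.voted.contains_key(&who) {
            return Err(VoteError::AlreadyVoted);
        }
        match vote {
            Vote::Aye => self.ayes = self.ayes.saturating_add(proven_balance),
            Vote::Nay => self.nays = self.nays.saturating_add(proven_balance),
        }
        self.voted.insert(who, vote);
        Ok(())
    }
}
```

Both checks need only one storage read (whether `who` has voted), which is the kind of check the transaction pool validation step mentioned above could afford.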

Option 2: Meta Transaction Style

  • Allow any signed origin to hand over a signed statement from who regarding the vote (signed(aye/nay)), allowing origin to vote on behalf of who.
    • transaction should still be free for origin if it is the first valid vote of who
    • subsequent votes (who changed their mind) must pay
    • providing invalid proof will lead to slash of a deposit from origin proportional to the claimed voting power.
  • This allows voting to happen through funded "power users" as a proxy, which we can assume to have balance.
  • Consequently, we can implement prioritization with less risk; transactions will be sorted based on "claimed voting power", and origin is slashed if invalid.
    • slash amount proportional to voting power.
      • to submit a vote on behalf of a whale, you need more free balance on RC to pay for the slash deposit.
  • Might lead to censorship, but not a feasible issue as long as one honest actor is willing to vote on your behalf.
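
A rough sketch of the Option 2 economics, assuming signature checking of the statement from who happens elsewhere. `RelayedVote`, `SLASH_PPM`, `slash_deposit`, `validate` and `settle` are hypothetical names, and the 1% slash ratio is an arbitrary placeholder, not a proposed value.

```rust
// Hypothetical sketch of fee, priority and slashing rules for relayed votes.
pub type Balance = u128;

/// A vote submitted by `origin` on behalf of `who`.
pub struct RelayedVote {
    /// Voting power of `who` as claimed by `origin` (to be proven later).
    pub claimed_power: Balance,
    /// Whether this is the first valid vote of `who` on this referendum.
    pub first_vote_of_who: bool,
}

/// Slash ratio in parts per million of the claimed power (placeholder: 1%).
pub const SLASH_PPM: Balance = 10_000;

/// Deposit at risk for `origin`, proportional to the claimed voting power.
pub fn slash_deposit(claimed_power: Balance) -> Balance {
    claimed_power.saturating_mul(SLASH_PPM) / 1_000_000
}

/// First vote of `who` is free; changing one's mind pays the base fee.
pub fn fee(vote: &RelayedVote, base_fee: Balance) -> Balance {
    if vote.first_vote_of_who { 0 } else { base_fee }
}

/// Pool-side validation: `origin` must be able to cover slash plus fee.
/// On success, returns a priority equal to the claimed power (capped at u64::MAX).
pub fn validate(vote: &RelayedVote, origin_free_balance: Balance, base_fee: Balance) -> Option<u64> {
    let required = slash_deposit(vote.claimed_power).saturating_add(fee(vote, base_fee));
    if origin_free_balance < required {
        return None;
    }
    Some(vote.claimed_power.min(u64::MAX as Balance) as u64)
}

/// Amount taken from `origin` once the proof has been checked on-chain.
pub fn settle(vote: &RelayedVote, proof_valid: bool, base_fee: Balance) -> Balance {
    if proof_valid {
        fee(vote, base_fee)
    } else {
        slash_deposit(vote.claimed_power)
    }
}
```

Sorting by claimed power is only safe because lying about it is costly: relaying a whale's vote requires proportionally more free balance on RC.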

Relies on #5400. @shawntabrizi would you like to work on this after your current work? It seems to fit your aptitude very well.


Demo branch: https://github.com/paritytech/polkadot-sdk/compare/kiz-dday-demo?expand=1

@kianenigma kianenigma added the T1-FRAME This PR/Issue is related to core FRAME, the framework. label Sep 4, 2024
@shawntabrizi
Member

Acknowledging the issue.

Not sure how much availability I have, but I can def mentor someone. Depends on urgency. If not urgent, I could probably get small pieces of this story done over the weeks.

@kianenigma
Contributor Author

This falls into the more long term requirements of Asset Hub, not being needed until the very final days. In that sense, I was going to suggest you start working on it after your current project is done and roughly by end of DevCon?

@bkchr
Member

bkchr commented Sep 5, 2024

> Relies on #5400. @shawntabrizi would you like to work on this after your current work? It seems to fit your aptitude very well.

This uses a binary Merkle tree, whereas the chain uses a base-16 Patricia Merkle trie. They are not compatible. We already have other code in historical session that checks proofs.

Generally, with the development of JAM, we will not have this luxury of having an extra governance sitting on the relay chain. So, when in JAM all chains stop, we don't have governance as well. So, a little bit questionable if we need this pallet at all. Or do you just want it for the period where governance switches over to AH and we are afraid of it not working properly on AH?

@burdges

burdges commented Sep 5, 2024

Afaik XCMP need real state proofs into other parachain's state, so one could abstract that somewhat.

We'll avoid starving "true system parachains" ala #4632 (comment). We've not concretely defined that term yet, maybe audited like polkadot itself and no flexible execution aka no smart contracts. Also maybe no advanced collator communication, which maybe forbids elastic scaling. We've discussed reverting code upgrades automagically too, but afaik nothing currently in progress, and maybe imposes design restrictions.

We've more ways individual parachains can brick of course. Also JAM should bring much new brickage, but "true system parachain" could forbid non-trivial accumulation, which again maybe forbids elastic scaling.

Anyways, if collectives were kept relatively simple, then maybe collectives alone could provide this? Or maybe some simpler multi-sig derived from collectives? AH doing governance directly may be a design mistake too, because doing so adds tension between different concerns.

@shawntabrizi
Member

@bkchr I actually switched to a compact base 16 trie because the binary tree libraries were unusable in the runtime currently

@kianenigma
Contributor Author

> Or do you just want it for the period where governance switches over to AH and we are afraid of it not working properly on AH?

Exactly for this period.

@kianenigma
Contributor Author

kianenigma commented Sep 13, 2024

> @bkchr I actually switched to a compact base 16 trie because the binary tree libraries were unusable in the runtime currently

I assume with this comment, there is no blocker to implement this, right?

It would be great to get a prototype of a pallet that tightly couples with the parachain pallets (e.g. can only work in RC), and can request to read the state of a parachain based on its latest state root: as in, have an extrinsic where anyone can provide a state proof of a parachain, and it would verify it based on the last known state root of the given para.

/// Provide the state `proof` for `id` at `block`, or the latest block if not provided
fn poc_read_para_state(id: ParaId, proof: Vec<Vec<u8>>, block: Option<BlockNumber>)

@bkchr do you know if this exists anywhere?

If this can be built, I will have no doubts that the rest of this issue can also be done.
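
To make the shape of such a call concrete, here is a self-contained sketch. It is not the demo branch code: `ParaRoots`, `ProofVerifier`, `ReadError` and the extra `key` argument are assumptions for illustration, and the actual trie proof check is left behind a trait so that a real runtime could plug in its own base-16 trie verifier.

```rust
use std::collections::BTreeMap;

// Hypothetical types for illustration only.
pub type ParaId = u32;
pub type BlockNumber = u32;
pub type Hash = [u8; 32];

/// Pluggable proof checker; a real implementation would verify trie nodes.
pub trait ProofVerifier {
    /// Return the value stored under `key` if `proof` is valid against `root`.
    fn verify(root: &Hash, proof: &[Vec<u8>], key: &[u8]) -> Option<Vec<u8>>;
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReadError {
    UnknownPara,
    UnknownBlock,
    InvalidProof,
}

/// State roots the relay chain remembers per parachain and block.
#[derive(Default)]
pub struct ParaRoots {
    roots: BTreeMap<ParaId, BTreeMap<BlockNumber, Hash>>,
}

impl ParaRoots {
    /// Record the state root of para `id` at `block`.
    pub fn note_root(&mut self, id: ParaId, block: BlockNumber, root: Hash) {
        self.roots.entry(id).or_default().insert(block, root);
    }

    /// Read `key` from the state of `id` at `block`, or at the latest known
    /// block if `block` is `None`, by checking `proof` against the stored root.
    pub fn poc_read_para_state<V: ProofVerifier>(
        &self,
        id: ParaId,
        key: &[u8],
        proof: &[Vec<u8>],
        block: Option<BlockNumber>,
    ) -> Result<Vec<u8>, ReadError> {
        let per_para = self.roots.get(&id).ok_or(ReadError::UnknownPara)?;
        let root = match block {
            Some(b) => per_para.get(&b),
            // BTreeMap iterates in key order, so the last entry is the latest block.
            None => per_para.values().next_back(),
        }
        .ok_or(ReadError::UnknownBlock)?;
        V::verify(root, proof, key).ok_or(ReadError::InvalidProof)
    }
}
```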

@bkchr
Member

bkchr commented Sep 23, 2024

It doesn't exist yet. However, building it should be straightforward, but it would also not support every parachain. Parachains are not required to use any specific state layout. But for the system chains we can make it work.

@kianenigma
Contributor Author

Some code to demonstrate:

  • How to detect if AH is stalled
  • How to receive proofs for a given key in it

https://github.com/paritytech/polkadot-sdk/compare/kiz-dday-demo?expand=1

@burdges

burdges commented Oct 25, 2024

We'll want parachains that never stall for PJR tests and DKGs, but they'd avoid censorship vectors like smart contracts, and never make too many blocks either, aka no elastic scaling.

In principle, relay chain governance could always take place on some non-stallable parachain, so not AssetHub, but using proofs into AssetHub state.

@shawntabrizi
Member

> Some code to demonstrate:
>
> * How to detect if AH is stalled
> * How to receive proofs for a given key in it
>
> https://github.com/paritytech/polkadot-sdk/compare/kiz-dday-demo?expand=1

I guess what is missing there maybe is a double map:

balanceAtHead = head hash -> account -> balance info

then people call the frozen_balance_of call once, and we store that balance for the frozen head.

Then we should be able to do all local operations on that Head.
We will need pallets designed to do votes, but checking the correct frozen head data is used and not changed and all that.

We also probably want a way to migrate the total issuance number over for things like the voting curves, so we know when we reach certain levels of voter thresholds.
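
A small sketch of that double map, with total issuance stored alongside it. Everything here is hypothetical (`FrozenBalances`, `note_balance`, `note_issuance`, `support_ppm`); it assumes the balance and issuance values come from already-verified proofs against the frozen head, and uses a `HashMap` where a runtime would use a storage double map.

```rust
use std::collections::HashMap;

// Hypothetical types for illustration only.
pub type HeadHash = [u8; 32];
pub type AccountId = u64;
pub type Balance = u128;

/// Balances and issuance pinned to a specific frozen AH head.
#[derive(Default)]
pub struct FrozenBalances {
    /// balanceAtHead = head hash -> account -> balance info
    balance_at_head: HashMap<(HeadHash, AccountId), Balance>,
    /// Total issuance proven at the same head, for voting curves.
    issuance_at_head: HashMap<HeadHash, Balance>,
}

impl FrozenBalances {
    /// Store a proven balance once; later calls for the same (head, who)
    /// keep the first value and return it.
    pub fn note_balance(&mut self, head: HeadHash, who: AccountId, proven: Balance) -> Balance {
        *self.balance_at_head.entry((head, who)).or_insert(proven)
    }

    /// Store the proven total issuance for `head` once.
    pub fn note_issuance(&mut self, head: HeadHash, proven: Balance) -> Balance {
        *self.issuance_at_head.entry(head).or_insert(proven)
    }

    /// Look up a previously stored balance without needing a new proof.
    pub fn balance_of(&self, head: &HeadHash, who: AccountId) -> Option<Balance> {
        self.balance_at_head.get(&(*head, who)).copied()
    }

    /// Share of issuance that has voted, in parts per million, capped at 100%.
    pub fn support_ppm(&self, head: &HeadHash, voted: Balance) -> Option<u32> {
        let issuance = *self.issuance_at_head.get(head)?;
        if issuance == 0 {
            return None;
        }
        Some((voted.saturating_mul(1_000_000) / issuance).min(1_000_000) as u32)
    }
}
```

Keying by head hash means that data for an older frozen head is never mixed with a newer one; entries for stale heads can be dropped lazily.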

@kianenigma
Contributor Author

> We'll want parachains that never stall for PJR tests and DKGs, but they'd avoid censorship vectors like smart contracts, and never make too many blocks either, aka no elastic scaling.

What is non-stall-able? It has no bugs + gets infinite POV limit? I am not sure if we have such a thing or can build it fast enough.

Although, if this is easier to build, I agree that we should still build the pallet I said above, but instead of RC, put it in this special parachain, and let it work on-demand: it will only start working when it detects AH is in trouble. This is more JAM-compatible. @eskimor any comments from you?

@kianenigma
Contributor Author

> then people call the frozen_balance_of call once, and we store that balance for the frozen head.

This is only relevant if we want to do multiple votes on the same frozen AH, right? I hadn't thought of this, as I assumed the only voting will be for something that will un-block AH. It is a good optimization.

> We also probably want a way to migrate the total issuance number over for things like the voting curves, so we know when we reach certain levels of voter thresholds.

Indeed, it can be provided with the same mechanism quite trivially.

@bkchr
Member

bkchr commented Nov 5, 2024

> What is non-stall-able? It has no bugs + gets infinite POV limit? I am not sure if we have such a thing or can build it fast enough.

We don't have anything like that. However, if a separate parachain has only the rescue pallet, the failure surface is quite small. The chain also would not really need any kind of state, only the one proposal that would need to be executed there.

@burdges

burdges commented Nov 6, 2024

Yes, stall-able is a metric, not a yes or no. At a high level, fewer features means harder to stall.

You could make an almost-impossible-to-stall PJR check chain, by replacing the parachain state root by just the score, and allowing another block that improves the score. This means a staking miner could advance the state of the PJR check chain only by knowing the relay chain state, not the previous PJR check results. This is removing the feature of having state to make the PJR check chain harder to stall. It's harder to make DKG chains similarly hard to stall, but somewhat possible.

Fully utilized chains would permit partial functionality stalls, because being fully utilized means not reserving anything. Smart contracts would typically open attack vectors that partially stall chains, because adversaries could find tricks that consume all the resources. Elastic scaling would often permit chain takeovers by not giving other collators enough sync time. We should expect AH can be stalled more easily because AH shall have all three.

All that is why you're proposing d-day governance, but..

Fallbacks suck. Why not always do RC governance on some parachain that's harder to stall than AH? We could leave treasury on AH, because treasury stalling doesn't break anything, but do system code upgrades and parameters somewhere safer.

@kianenigma
Contributor Author

> > What is non-stall-able? It has no bugs + gets infinite POV limit? I am not sure if we have such a thing or can build it fast enough.
>
> We don't have anything like that. However, if a separate parachain has only the rescue pallet, the failure surface is quite small. The chain also would not really need any kind of state, only the one proposal that would need to be executed there.

I see. Let's first discuss the failure-surface. Note, the relay chain will have some code in its runtime that handles parachains (para-runtime). I assume there is in principle the possibility to also have a bug in this, in which case all parachains could stop working, no matter their code.

  1. AH itself is buggy, but the relay chain is fine.
  2. para-runtime is buggy.

Putting the rescue pallet in another parachain has the benefit that it is more JAM-compatible, but it does not help with the second failure.

Putting it in the RC is not JAM-compatible, but handles both failures.

I might be paranoid by thinking the second failure is actually a feasible one. @eskimor implied in conversation off-band that I might be wrong to worry about this. In this case, having a similar rescue system in a separate on-demand parachain makes more sense. Also cc @ordian

@eskimor
Member

eskimor commented Nov 6, 2024

While we could break the relay chain runtime in a way that only parachain consensus is entirely broken, I would doubt that the risk is much higher than messing up the relay chain runtime in some other way (preventing relay chain governance from working). If this happened, we would need a hardfork to fix it, just as if we messed up a relay chain upgrade right now.

Asset Hub no longer making progress is disastrous enough that we should work hard to make this as unlikely as possible.

Also purely hypothetical: If all of parachain consensus broke, then we would want to have this fixed as quickly as possible and not do some governance dance, but instead indeed likely a hard fork will be demanded by pretty much everybody. Same is likely true if asset hub breaks.

@burdges

burdges commented Nov 6, 2024

We've fixed bad upgrades before using on-chain governance, and not hardforks, although sometimes only barely, and maybe we no longer make those mistakes.

I'm assuming the RC continues running correctly, including elves/approvals and grandpa. I suppose AH might continue running correctly-ish too. Yet, we have problems backing honest AH parachain blocks, maybe because of malicious actors, or maybe unintentionally like from high or weird usage.

In particular, we'll seemingly want AH to push a high tps for bragging rights, but this requires full AH blocks get used by transactions, meaning no reserved space for the election. That's problematic.

It's not bugs per se, but parachain choices that trade away resilience for throughput and flexibility. In theory, a parachain project could always run "better" infrastructure, and that may be how you land insane tps, but we're the L1 so their "better" might feel centralized to us.

Also..

There may be similar robustness arguments going the opposite way, like the governance chain needing reliable infrastructure. If that's the case, then maybe a separate d-day chain makes sense? It's unclear if AH failures could be detected though, so maybe activating the d-day chain should be the d-day chain's first act?

Anyways I worried mostly that we were going to have a fallback that barely worked, or required double the debugging time, when we should be doing it right in one place, but maybe that's not an easy choice to make right away.

@seadanda
Contributor

After a conversation with @burdges and @eskimor at the retreat about this and specifically where the functionality could live, we wondered about the option of having a track on the collectives chain which can achieve root in specific scenarios, or some subset of operations with root perms, or at the very least restart AH if it stalls.

If we have the ability to restart AH from Collectives and vice versa, and then add a constraint that these chains need to be upgraded at different times then we solve part of the problem, missing only some problem which takes a long time to show itself on both chains. This is not along the lines of "minimal system upgrades chain" but it's a clean solution to part of the problem.

The backup option, if we want a dedicated chain, is fairly straightforward: a system parachain which is registered but dormant and can be spun up with on-demand coretime in the case that we need to recover one or both of AH/collectives if we bork them.

This means the hard fork sledgehammer is only needed for relay chain problems or where all parachains are not making progress (which is likely also a relay chain problem).

I think my takeaway from this conversation is that the functionality does not need to be on the relay chain

@burdges

burdges commented Dec 17, 2024

Yeah, I'd worried about maintenance costs of having two different governance systems, but actually maintenance need not be problematic if we debug and run the same code in both places. We could turn off conviction for code upgrades maybe, so then only dot ownership and fellowship status matter, which simplifies using remote dot ownership proofs.

As a first step, we could re-engineer the storage interface, so that remote storage proofs can be first class citizens, alongside local storage proofs. That's a huge win for polkadot overall regardless. It doesn't matter if this re-engineering cannot work within the macro DSL, because we could port the governance code to the new storage interface.

I suppose the macro DSL could differ in different build units, but afaik the fellowship lives in collectives, so every governance vote needs proofs into both collectives and AH, with one remote and one local. That's amazingly cool, but that's also some scary complexity. I suppose folk here envision this being fellowship or collectives, instead of fellowship and collectives. That's fine, but that's a bigger governance change than merely dropping conviction, no?

Anyways, we should break down the single chain stall conditions:

  • Only AH stalls:
    • Broken code update,
    • all bad or DoSed collator set, or
    • semi-bad AH collator set + AH or collectives exploit/weakness
  • Only collectives stalls:
    • Broken code update,
    • all bad or DoSed collator set, or
    • semi-bad collective collator set + collectives exploit/weakness

In other words, AH has far more functionality than collectives, like contracts, so any exploit or weakness of collectives implies an exploit of AH, but not conversely.

Projects
Status: Backlog

6 participants