Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Allow to respawn node with different set of arguments #190

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

s0me0ne-unkn0wn
Copy link
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn commented Mar 17, 2024

I made these changes while researching a quite specific use case. Not all of them may be useful for the general public, but I believe some of them may.

Everything is implemented for the native provider only at the moment. I currently lack the expertise to implement it for Kubernetes.

It also introduces changes in publicly exposed interfaces, so that should be considered a breaking change.

An approximate list of what's been done and why:

  • Got rid of nodes_by_name hashmap in the orchestrator::Network. It didn't make sense from the POV of performance (as a zombienet is unlikely to be of size exceeding a couple of hundreds of nodes), but cloning structures around was effectively sealing the network at the moment it was spawned, not allowing to change it dynamically;
  • Fixed a bug which resulted in double-adding nodes to network::network::Relaychain and network::network::Parachain when a new validator or collator is added;
  • Introduced a call allowing to kill a node without restarting it;
  • Introduced a call allowing to re-spawn a previously killed node, altering its spec between the kill and the re-spawn;
  • Implemented means of dynamically replacing the node inside a network.

Thoughts and ideas:

  • Narrowing down the use case to changing node command line arguments during the restart is a decision dictated by my specific research. Probably, it may be generalized to like "change everything you want while the node is dead", but the scope of the allowed changes has to be discussed;
  • The overall approach is somewhat brittle and not that foolproof. It would be better to have something like pub async fn restart_with(&mut self, mutate: impl FnMut(&mut NetworkNode)) that would kill the node, run the closure to alter its configuration, and then re-spawn the node, but that gives less leverage to the developer. This needs to be discussed as well.

I'll add a usage example a bit later. Feel free to ask questions, provide suggestions, add your commits, etc.

@s0me0ne-unkn0wn
Copy link
Contributor Author

@pepoviola pepoviola requested review from l0r1s and pepoviola March 17, 2024 22:45
@pepoviola
Copy link
Collaborator

pepoviola commented Mar 17, 2024

Hi @s0me0ne-unkn0wn, thanks for your feedback and contribution!!

Related to individual changes:

Got rid of nodes_by_name hashmap in the orchestrator::Network. It didn't make sense from the POV of performance (as a zombienet is unlikely to be of size exceeding a couple of hundreds of nodes), but cloning structures around was effectively sealing the network at the moment it was spawned, not allowing to change it dynamically;

Fixed a bug which resulted in double-adding nodes to network::network::Relaychain and network::network::Parachain when a new validator or collator is added;

This looks great!

Introduced a call allowing to kill a node without restarting it;

Did you think make sense to expose this method? Looks like will be always using followed by a re-spawn call since only kill the node will leave the node in the network but the process/pod inaccessible (maybe we can add a state flag). Maybe we can call it internally in the respawn method.

Introduced a call allowing to re-spawn a previously killed node, altering its spec between the kill and the re-spawn;

I think this is a great, we made some hack in v1 to allow to restart the node with a modified cmd and this is a more clean approach. We can check which other fields make sense to allow to modify in the spec to cover other use-cases. Also, one thing to notice is that key/peer_id is pretty much bounded to the node name by default and if we allow to change those we are breaking that pattern. I think that will not cause issues in most of the cases but we should document so users know that naming/key/peer_id can be changed.

Implemented means of dynamically replacing the node inside a network.

This looks great!!

Thoughts and ideas:

Narrowing down the use case to changing node command line arguments during the restart is a decision dictated by my specific research. Probably, it may be generalized to like "change everything you want while the node is dead", but the scope of the allowed changes has to be discussed;

Sounds good, command, subcommand, args and env are other good candidates to evaluate.

The overall approach is somewhat brittle and not that foolproof. It would be better to have something like pub async fn restart_with(&mut self, mutate: impl FnMut(&mut NetworkNode)) that would kill the node, run the closure to alter its configuration, and then re-spawn the node, but that gives less leverage to the developer. This needs to be discussed as well.

Yes, restart_with sounds good but as you said we should think what is the better ux for the developers.

Again, thanks for your feedback and contribution!! If you have time I will ping you to talks about this and implement the needed changes to support this feature also in kubernetes.

Thx!

--
Side note, I think the wait_for_metric fn from your example could be also a good candidate to implement and very useful for developers.

@s0me0ne-unkn0wn
Copy link
Contributor Author

If you have time I will ping you to talks about this

Sure! It's a great tool we're actively using so I'm interested in making it better :)

pepoviola added a commit that referenced this pull request Mar 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants