One of the fundamental benefits of owning your blockchain is that you can understand the scope for improvement in much greater detail and optimize accordingly. On a general-purpose chain this is hard, because an application cannot know which parts of the codebase are the bottleneck. The second benefit is that several low-hanging optimizations to the stack become possible and straightforward to ship, which is rarely the case on a general-purpose chain.
To take advantage of this, Spicenet will feature a full rewrite of its core runtime and auxiliary components. We call this the Velocity stack: a collection of low-level improvements to the Spicenet stack aimed at improving performance and scalability. The improvements we are planning include a complete rewrite of the networking stack for faster packet propagation, a rework of the core runtime and processing engine for better resource efficiency and horizontal scalability, and a fresh state commitment protocol built on top of the original NOMT.
In this section, we discuss where the Spicenet stack has room for improvement and how Velocity addresses it.
The networking stack of Spicenet is defined as the transaction ingestion pipeline of nodes (including the sequencer) and the gossip protocol that facilitates Node -> Node and Sequencer -> Node communication. To optimize the networking stack, we first need to know where bottlenecks may arise, and the stack's role is best understood by looking at the transaction flow on Spicenet.
When a transaction is sent on Spicenet, it first reaches the sequencer; this is the entry point of the networking stack. Upon executing transactions, the sequencer offers pre-confirmations to users and forwards the transactions to full nodes via the gossip protocol. Nodes may also communicate with each other to patch incomplete or lost data. In other words, the networking stack comes into play both when the sequencer receives inbound traffic and when the sequencer and nodes communicate with one another.
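A minimal sketch of this flow, using hypothetical `Transaction`, `PreConfirmation`, `Gossip`, and `Sequencer` types (these are illustrative only, not the actual Spicenet interfaces):

```rust
// Hypothetical types; the actual Spicenet interfaces will differ.
struct Transaction;
struct PreConfirmation;

trait Gossip {
    /// Broadcast an executed transaction to full nodes.
    fn broadcast(&self, tx: &Transaction);
}

struct Sequencer<G: Gossip> {
    gossip: G,
}

impl<G: Gossip> Sequencer<G> {
    /// Entry point of the networking stack: ingest, execute,
    /// pre-confirm, then propagate to full nodes.
    fn handle(&mut self, tx: Transaction) -> PreConfirmation {
        self.execute(&tx);          // run the state transition
        self.gossip.broadcast(&tx); // Sequencer -> Node propagation
        PreConfirmation             // handed back to the user as a pre-confirmation
    }

    fn execute(&mut self, _tx: &Transaction) {
        // state transition elided
    }
}

fn main() {
    struct NoopGossip;
    impl Gossip for NoopGossip {
        fn broadcast(&self, _tx: &Transaction) {}
    }
    let mut seq = Sequencer { gossip: NoopGossip };
    let _preconf = seq.handle(Transaction);
}
```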
The transaction ingestion pipeline may bottleneck when the sequencer receives excess load at once. The gossip protocol, in turn, may introduce performance issues if too many nodes communicate at the same time, or if the gossip library in use is heavyweight and unable to transport large volumes of data concurrently.
We introduce a complete rework of the node software that allows nodes to ingest traffic more efficiently by making better use of available computational resources. Through a technique called "tiling", we spin up multiple worker processes on the node's machine, with each tile consuming a share of the incoming traffic. Tiles can be distributed across cores, improving resource usage and increasing computational capacity, and they can be pipelined so that one tile does not have to wait for another tile's work to complete. By speeding up transaction ingestion, we cut down on what we call Time To Pre-Confirmation, or TTPC.
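As a rough illustration of the idea (not the actual Velocity implementation), incoming traffic can be sharded across a fixed set of tiles, each draining its own queue on its own thread:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical transaction type; the real ingestion path differs.
#[derive(Debug)]
struct Transaction {
    id: u64,
}

fn main() {
    const TILES: usize = 4;

    // One channel per tile; each tile consumes its own share of traffic.
    let mut senders = Vec::new();
    let mut handles = Vec::new();
    for tile_id in 0..TILES {
        let (tx, rx) = mpsc::channel::<Transaction>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            for txn in rx {
                // Each tile ingests only the transactions routed to it.
                println!("tile {tile_id} ingested tx {}", txn.id);
            }
        }));
    }

    // Shard inbound traffic across tiles (here by a trivial modulo rule).
    for id in 0..16u64 {
        let tile = (id as usize) % TILES;
        senders[tile].send(Transaction { id }).unwrap();
    }

    drop(senders); // close the channels so the tiles exit
    for h in handles {
        h.join().unwrap();
    }
}
```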
Secondly, we introduce a novel gossip protocol that addresses the shortcomings of traditional p2p libraries such as libp2p, which are often heavy, complex, and hard to scale; their convoluted design makes it difficult to push large volumes of data through them. The Spicenet gossip protocol aims to be maximally simple while remaining sufficiently fast and scalable. Currently, we use a p2p communication protocol built on QUIC, which handles connection establishment, fault tolerance, connection pooling, and more.
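A sketch of the kind of deliberately small surface such a protocol might expose. The `QuicConnection` wrapper here is hypothetical (a real implementation would sit on top of a QUIC library and handle handshakes, streams, and retries); the point is simply a pool of live connections and a fan-out broadcast:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Hypothetical QUIC wrapper, standing in for a real transport.
struct QuicConnection {
    peer: SocketAddr,
}

impl QuicConnection {
    fn send(&self, payload: &[u8]) {
        // A real transport would open a stream and write the bytes.
        println!("sent {} bytes to {}", payload.len(), self.peer);
    }
}

/// Minimal gossip surface: a pool of live connections plus broadcast.
struct GossipNode {
    pool: HashMap<SocketAddr, QuicConnection>,
}

impl GossipNode {
    fn connect(&mut self, peer: SocketAddr) {
        self.pool.entry(peer).or_insert(QuicConnection { peer });
    }

    fn broadcast(&self, payload: &[u8]) {
        for conn in self.pool.values() {
            conn.send(payload);
        }
    }
}

fn main() {
    let mut node = GossipNode { pool: HashMap::new() };
    node.connect("127.0.0.1:9000".parse().unwrap());
    node.broadcast(b"block data");
}
```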
As noted earlier, the main goal of the runtime, and of the Sovereign SDK as a whole, was to tackle scalability issues with state commitment before improving the actual execution engine. Velocity brings a complete rewrite of the runtime on top of the NOMT state database, allowing the runtime to exploit NOMT's scalability through features such as pipelining, concurrency control, and tiling. Velocity applies tiling in two major areas: the networking stack and the runtime. As discussed above, tiling helps the networking stack ingest incoming load scalably; in the runtime, it horizontally scales execution by splitting the work across tiles, with each tile executing a portion of the transactions.
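A rough sketch of execution tiling (hypothetical `Transaction` and `Receipt` types, not the Velocity runtime): a block's transactions are split into chunks, and each tile executes its chunk on a separate thread. Conflict handling between tiles is deliberately ignored here; that is what the concurrency control discussed next is for.

```rust
use std::thread;

// Hypothetical transaction and receipt types.
#[derive(Clone)]
struct Transaction {
    payload: u64,
}
struct Receipt {
    result: u64,
}

fn execute(tx: &Transaction) -> Receipt {
    // Stand-in for the real state transition.
    Receipt { result: tx.payload * 2 }
}

fn main() {
    let block: Vec<Transaction> = (0..16).map(|payload| Transaction { payload }).collect();
    const TILES: usize = 4;
    let chunk_size = (block.len() + TILES - 1) / TILES;

    // Each tile executes its own slice of the block in parallel.
    let receipts: Vec<Receipt> = thread::scope(|s| {
        let handles: Vec<_> = block
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(execute).collect::<Vec<_>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });

    println!("executed {} transactions across {TILES} tiles", receipts.len());
}
```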
This means the Velocity runtime is inherently parallelized, with tile processes split and run across the machine's available cores. For concurrency control and fault tolerance, we are currently looking at Block-STM. Block-STM was originally proposed by the research team at Aptos Labs, and variants of the protocol are used in projects such as Monad, Sei, and Polygon PoS. We believe Block-STM is the parallelization protocol best suited to our needs. At its core is optimistic concurrency control: transactions are executed without immediate validation, which allows for higher parallelism because only conflicting transactions need to be re-executed, rather than locking all resources upfront as static parallelism methods do.
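A heavily simplified illustration of the optimistic idea (not the actual Block-STM algorithm, which also uses a multi-version data structure and a collaborative scheduler): execute every transaction against a snapshot while recording its read set, then commit in preset order and re-execute only the transactions whose reads were invalidated by an earlier writer.

```rust
use std::collections::{HashMap, HashSet};

type Key = String;
type Value = u64;

// Hypothetical transaction shape: a read set and a single write.
struct Tx {
    reads: Vec<Key>,
    write: (Key, Value),
}

fn main() {
    let mut state: HashMap<Key, Value> = HashMap::from([("a".into(), 1), ("b".into(), 2)]);
    let txs = vec![
        Tx { reads: vec!["a".into()], write: ("b".into(), 10) },
        Tx { reads: vec!["b".into()], write: ("c".into(), 20) }, // reads a key tx 0 writes
    ];

    // Phase 1: execute everything optimistically against the same snapshot,
    // recording each transaction's read set.
    let snapshot = state.clone();
    let mut read_sets: Vec<HashSet<Key>> = Vec::new();
    for tx in &txs {
        let _observed: Vec<_> = tx.reads.iter().map(|k| snapshot.get(k)).collect();
        read_sets.push(tx.reads.iter().cloned().collect());
    }

    // Phase 2: validate and commit in preset order. Only a transaction whose
    // reads overlap an earlier transaction's writes must be re-executed.
    let mut dirty: HashSet<Key> = HashSet::new();
    for (i, tx) in txs.iter().enumerate() {
        if read_sets[i].intersection(&dirty).next().is_some() {
            println!("tx {i} conflicted; re-executing against latest state");
            // A real implementation would re-run the transaction reading from `state`.
        }
        let (k, v) = tx.write.clone();
        dirty.insert(k.clone());
        state.insert(k, v);
    }

    println!("final state: {state:?}");
}
```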
Second, Block-STM features a scheduler that coordinates the execution and validation of transactions. Instead of relying on complex priority queues, it uses a simple atomic counter to track the current state of execution, which allows resources to be freed promptly after execution and mitigates the problem of upfront locking. Third, through the scheduler, the protocol dynamically estimates transaction dependencies, allowing it to adapt to different types of workloads. Lastly, Block-STM guarantees deterministic execution by committing transactions in the preset order enforced by the scheduler, which makes conflict resolution much simpler.
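A minimal sketch of the atomic-counter dispatch pattern (an illustration only; the full Block-STM scheduler also tracks validation and dependency state): workers claim the next transaction index with a single fetch_add rather than coordinating through a queue.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    const NUM_TXS: usize = 12;
    const WORKERS: usize = 4;

    // Shared cursor over the block: each fetch_add hands out the next tx index.
    let next_tx = AtomicUsize::new(0);

    thread::scope(|s| {
        for worker in 0..WORKERS {
            let next_tx = &next_tx;
            s.spawn(move || loop {
                let idx = next_tx.fetch_add(1, Ordering::Relaxed);
                if idx >= NUM_TXS {
                    break; // no more work
                }
                // Execute (or validate) transaction `idx` here.
                println!("worker {worker} picked up tx {idx}");
            });
        }
    });
}
```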
While each tile optimistically executes transactions, we also introduce pipelining between tiles, so that one tile does not have to wait for another tile's work to complete. With an intelligent scheduling algorithm, we can achieve faster workload completion times and better resource efficiency. To understand the mechanism a bit better: computation is shared between tiles, making better use of CPU resources, and when pipelined, tile A does not need to wait for tiles B, C, ..., N to finish before immediately picking up another workload. Resources are therefore constantly in use with little idle time, leading to better overall performance.
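A rough sketch of the pipelining pattern (hypothetical stages, not the actual Velocity scheduler): execution tiles hand finished batches to a downstream stage over a channel and immediately move on to their next workload, rather than synchronizing at a barrier.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Channel between the execution stage and a downstream (e.g. commit) stage.
    let (done_tx, done_rx) = mpsc::channel::<(usize, usize)>(); // (tile, batch)

    thread::scope(|s| {
        // Execution tiles: finish a batch, hand it off, and move straight on.
        for tile in 0..3usize {
            let done_tx = done_tx.clone();
            s.spawn(move || {
                for batch in 0..4usize {
                    // ... execute the batch here ...
                    done_tx.send((tile, batch)).unwrap(); // no waiting on other tiles
                }
            });
        }
        drop(done_tx); // close the channel once all tiles finish

        // Downstream stage runs concurrently, consuming batches as they arrive.
        s.spawn(move || {
            for (tile, batch) in done_rx {
                println!("committing batch {batch} from tile {tile}");
            }
        });
    });
}
```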
Although Spicenet already has a scalable state commitment scheme in NOMT, we plan to scale it further. We have not yet decided exactly which changes will be made, but we have identified several candidate approaches to scaling state commitment, which we discuss below.
One of the low-level (OS-level) improvements we want to make is introducing asynchronous I/O for interacting with NOMT. Asynchronous I/O allows a workload to be managed and scheduled efficiently, minimizing delays and idle time. As an example: when a transaction needs to read state from disk, the execution engine should not block on the operation. Instead, it should initiate the read and, in the meantime, start preparing resources for another transaction. Waiting on expensive operations then no longer introduces unnecessary delays, because resources are made ready to be deployed the moment currently running operations complete.
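A small sketch of the pattern, assuming the tokio crate with its default runtime features (chosen purely for illustration; the actual I/O layer used with NOMT may look different): the state read is started as a task, other preparation proceeds concurrently, and the result is awaited only when it is actually needed.

```rust
use std::time::Duration;

// Stand-in for an asynchronous state read from disk.
async fn read_state(key: &'static str) -> u64 {
    tokio::time::sleep(Duration::from_millis(50)).await; // simulate disk latency
    println!("finished reading key {key}");
    42
}

// Stand-in for CPU-side work that does not depend on the read.
fn prepare_next_transaction() {
    println!("preparing resources for the next transaction");
}

#[tokio::main]
async fn main() {
    // Initiate the read without blocking on it.
    let read = tokio::spawn(read_state("balance/alice"));

    // Useful work proceeds while the read is in flight.
    prepare_next_transaction();

    // Only now do we wait for the value we actually need.
    let value = read.await.unwrap();
    println!("state value = {value}");
}
```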
Note that this is not the same as pipelining. In pipelining, we divide a larger task into smaller tasks so that, when one task is completed, resources can be diverted to another. With asynchronous I/O, we instead ensure resource readiness, so that resources can be deployed immediately without significant warm-up time.
The second improvement is namespacing the Merkle tree, which allows the tree root to be computed in parallel. Namespacing lets each sub-tree be identified separately, which speeds up insertion because only the root of the sub-tree containing the leaf needs to be recomputed. This means leaves can be inserted in parallel across multiple sub-trees, since sub-tree roots can be computed independently and in parallel. However, we have not fully evaluated the drawbacks of this approach and are therefore not yet certain of its inclusion in Velocity.
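A toy sketch of the idea, using std's non-cryptographic DefaultHasher purely as a stand-in for a real hash function and a flat two-level layout rather than NOMT's actual structure: sub-tree roots are computed in parallel and then combined into the overall root.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

// Stand-in hash; a real tree would use a cryptographic hash function.
fn hash_all<T: Hash>(items: &[T]) -> u64 {
    let mut h = DefaultHasher::new();
    for item in items {
        item.hash(&mut h);
    }
    h.finish()
}

fn main() {
    // Leaves grouped into namespaces (sub-trees).
    let namespaces: Vec<Vec<&str>> = vec![
        vec!["acct:alice", "acct:bob"],
        vec!["orders:1", "orders:2", "orders:3"],
        vec!["vaults:x"],
    ];

    // Compute each sub-tree root on its own thread; an insert into one
    // namespace only requires recomputing that namespace's root.
    let subtree_roots: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = namespaces
            .iter()
            .map(|leaves| s.spawn(move || hash_all(leaves)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // The overall root commits to all sub-tree roots.
    let root = hash_all(&subtree_roots);
    println!("sub-tree roots: {subtree_roots:?}, root: {root}");
}
```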