Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
Showing 3 changed files with 81 additions and 3 deletions.
Add the design for node-agent concurrency
# Node-agent Concurrency Design

## Glossary & Abbreviation

**Velero Generic Data Path (VGDP)**: VGDP is the collective of modules introduced in the [Unified Repository design][1]. Velero uses these modules to finish data transfer for various purposes (i.e., PodVolume backup/restore, Volume Snapshot Data Movement). VGDP modules include uploaders and the backup repository.

## Background

Velero node-agent is a daemonset hosting the controllers and VGDP modules that carry out the concrete work of backups/restores, i.e., PodVolume backup/restore and Volume Snapshot Data Movement backup/restore.
For example, node-agent runs DataUpload controllers to watch DataUpload CRs for Volume Snapshot Data Movement backups, so there is one controller instance in each node. A controller instance takes a DataUpload CR and then launches a VGDP instance, which initializes an uploader instance and the backup repository connection, to finish the data transfer. The VGDP instance runs inside the node-agent pod or in a pod associated with the node-agent pod on the same node.

Depending on the data size, data complexity, and resource availability, VGDP may take a long time and consume considerable resources (CPU, memory, network bandwidth, etc.).
Technically, VGDP instances are able to run concurrently regardless of the requesters. For example, a VGDP instance for a PodVolume backup could run in parallel with another VGDP instance for a DataUpload. The two VGDP instances then share the same resources if they are running on the same node.

Therefore, in order to get optimal performance with the limited resources, it is worthwhile to make the number of concurrent VGDP instances per node configurable. When the resources in the nodes are sufficient, users can set a large concurrent number so as to reduce the backup/restore time; otherwise, the concurrency should be reduced or prohibited, or the backup/restore may encounter problems such as time lagging, hangs, or OOM kills.

## Goals

- Define the behaviors of concurrent VGDP instances in node-agent
- Create a mechanism for users to specify the concurrent number of VGDP instances per node

## Non-Goals

- VGDP instances from different nodes always run concurrently, since in most common cases the resources are isolated per node. Special cases where some resources are shared across nodes are not supported at present
- In practice, restores run in prioritized scenarios, e.g., disaster recovery. However, the current design doesn't consider this difference: a VGDP instance for a restore is blocked once it reaches the concurrency limit, even when the instances blocking it are for backups. If users do meet problems here, they should consider stopping the backups first

## Solution

### Global concurrent number

We add a node-agent server parameter, ```data-path-concurrent-num```, to accept a concurrent number that is applied to all nodes for which a per-node number is not specified.
The number starts from 1, which means no concurrency: only one VGDP instance is allowed at a time. There is no upper limit.
If the node-agent server parameter is not specified, a hard-coded default value of 1 is used.

### Per-node concurrent number

We allow users to specify a different concurrent number per node. For example, users can allow 3 concurrent instances on Node-1, 2 instances on Node-2 and 1 instance on Node-3. This supports the following considerations:
- Resources may differ among nodes. Users can specify a smaller concurrent number for nodes with fewer resources and a larger number for nodes with more resources
- It helps users isolate critical environments. Users may run critical workloads on some specific nodes; since VGDP instances may consume significant resources, users may want to run them only on the other nodes, or run fewer instances on the nodes hosting critical workloads

The range of the per-node concurrent number is the same as that of the global concurrent number.
The per-node concurrent number is preferred over the global concurrent number, so it overwrites the global concurrent number for that node.
The per-node concurrent number is configured through a separate configMap named ```node-agent-configs```. This configMap is not created by Velero; users should create it manually on demand. The configMap should be created before the node-agent server starts; otherwise, the node-agent server must be restarted for the change to take effect.

The ```node-agent-configs``` configMap may also be used for other node-agent configuration purposes in the future; at present, there is only one kind of data, ```data-path-concurrency```. Its data structure is as below:

```go
type DataPathConcurrency struct {
	// MatchLabel specifies the label used to identify the nodes for which Configs holds configs
	MatchLabel string `json:"matchLabel"`

	// Configs specifies the configs by node identified by MatchLabel
	Configs map[string]int `json:"configs"`
}
```

Users are allowed to specify how a node is matched. The ```MatchLabel``` field accepts the key of a label associated with nodes. This label is then used to define the concurrent number in the ```Configs``` field.

For example, say ```MatchLabel``` is "foo", then:
- At least one node is expected to have a label with the key "foo" (e.g., "foo":"bar"). If no node has this label, the per-node configuration has no effect
- Inside the map of ```Configs```, if there is an element "bar":X, then X will be the per-node concurrent number for the nodes with the label "foo":"bar"

```MatchLabel``` is optional; if it is not set, the hard-coded label key ```kubernetes.io/hostname``` is used, which means the host name is used to match nodes.
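Putting the matching rules together, the per-node lookup could be sketched as below. This is an illustrative sketch only, not the actual implementation; the helper name ```resolveConcurrency``` and the fallback wiring are assumptions:

```go
package main

import "fmt"

// DataPathConcurrency mirrors the struct defined above.
type DataPathConcurrency struct {
	MatchLabel string         `json:"matchLabel"`
	Configs    map[string]int `json:"configs"`
}

// defaultMatchLabel is the hard-coded key used when MatchLabel is not set.
const defaultMatchLabel = "kubernetes.io/hostname"

// resolveConcurrency returns the concurrent number for a node with the given
// labels, falling back to globalNum when no per-node config matches.
// Hypothetical helper for illustration only.
func resolveConcurrency(cfg *DataPathConcurrency, nodeLabels map[string]string, globalNum int) int {
	if cfg == nil {
		return globalNum // no node-agent-configs configMap: use the global number
	}
	key := cfg.MatchLabel
	if key == "" {
		key = defaultMatchLabel
	}
	value, ok := nodeLabels[key]
	if !ok {
		return globalNum // node doesn't carry the matching label
	}
	if num, ok := cfg.Configs[value]; ok && num >= 1 {
		return num // per-node number overwrites the global number
	}
	return globalNum
}

func main() {
	cfg := &DataPathConcurrency{
		MatchLabel: "foo",
		Configs:    map[string]int{"bar": 3},
	}
	// A node labeled "foo":"bar" gets the per-node number 3.
	fmt.Println(resolveConcurrency(cfg, map[string]string{"foo": "bar"}, 1))
	// A node labeled "foo":"baz" has no matching config and falls back to 1.
	fmt.Println(resolveConcurrency(cfg, map[string]string{"foo": "baz"}, 1))
}
```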

### Global data path manager

In the code implementation, the data path manager maintains the total number of running VGDP instances and ensures the limit is not exceeded. At present, there is one data path manager instance per controller; as a result, the concurrent numbers are counted separately for each controller. This doesn't help to limit the concurrency among different requesters.
Therefore, we need to create one global data path manager instance server-wide and pass it to the different controllers. The instance will be created at node-agent server startup.
The concurrent number is required to initialize a data path manager; the number comes from either the per-node concurrent number or the global concurrent number.
Below are some prototypes related to the data path manager:

```go
func NewManager(concurrentNum int) *Manager
func (m *Manager) CreateFileSystemBR(jobName string, requestorType string, ctx context.Context, client client.Client, namespace string, callbacks Callbacks, log logrus.FieldLogger) (AsyncBR, error)
```
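How such a server-wide manager could enforce the limit can be sketched with a buffered channel acting as a semaphore. This is a minimal sketch under stated assumptions: ```TryAcquire``` and ```Release``` are illustrative names, not the actual Velero API, and the real manager also tracks jobs by name:

```go
package main

import "fmt"

// Manager caps the number of concurrently running VGDP instances.
// Illustrative sketch only.
type Manager struct {
	slots chan struct{}
}

// NewManager creates a manager allowing up to concurrentNum instances.
func NewManager(concurrentNum int) *Manager {
	if concurrentNum < 1 {
		concurrentNum = 1 // the number starts from 1: no concurrency
	}
	return &Manager{slots: make(chan struct{}, concurrentNum)}
}

// TryAcquire reserves a slot for a new VGDP instance; it returns false
// when the concurrent limit is reached, so the caller can requeue the CR.
func (m *Manager) TryAcquire() bool {
	select {
	case m.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot when a VGDP instance finishes.
func (m *Manager) Release() {
	<-m.slots
}

func main() {
	m := NewManager(2)
	fmt.Println(m.TryAcquire()) // true: first slot
	fmt.Println(m.TryAcquire()) // true: second slot
	fmt.Println(m.TryAcquire()) // false: limit of 2 reached
	m.Release()
	fmt.Println(m.TryAcquire()) // true: a slot was freed
}
```

Because one manager instance is shared by all controllers, a PodVolume backup and a DataUpload on the same node would draw from the same pool of slots, which is exactly the cross-requester limiting the design calls for.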

[1]: unified-repo-and-kopia-integration/unified-repo-and-kopia-integration.md