DTT1 - Design and develop PoC #4524

Closed
rauldpm opened this issue Sep 15, 2023 · 20 comments

Comments

@rauldpm
Member

rauldpm commented Sep 15, 2023

EPIC: #4495

Description

This issue aims to design and create an initial Proof of Concept (PoC) based on the analysis carried out in issue #4519.

In this way, the PoC will show the following functionalities on a single system:

This PoC will have the following bases:

  • Deployment of two Ubuntu 22 instances
  • Installation of a Wazuh agent and manager of the latest production version
  • The tests will be executed on the Wazuh agent and manager instances
  • The results will be observed through the build console
  • All development can be executed locally through scripts written in Python, as well as in Jenkins through modular pipelines

The composition of the tests will be as follows:

  • Install
  • Registration
  • Connection
  • Basic info
  • Restart
  • Stop
  • Uninstall
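
As a rough illustration of how the stages listed above could be chained, here is a minimal Python sketch. The stage names come from the list; the `run_stages` helper and the rule of skipping the remaining stages after the first failure are assumptions made for the example, not something decided in this issue.

```python
from typing import Callable, Dict, List, Tuple

# Stage order taken from the list above.
STAGES = ["install", "registration", "connection", "basic_info",
          "restart", "stop", "uninstall"]


def run_stages(checks: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Run each stage in order (hypothetical helper).

    Assumption: once a stage fails, later stages are reported as skipped,
    since they would run against a broken installation.
    """
    results = []
    failed = False
    for stage in STAGES:
        if failed:
            results.append((stage, "skipped"))
            continue
        ok = checks[stage]()
        results.append((stage, "passed" if ok else "failed"))
        failed = not ok
    return results
```

Each entry in `checks` would wrap the real test for that stage and return `True` on success.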
@rauldpm rauldpm changed the title DTT1 - Design and develop POC DTT1 - Design and develop PoC Sep 15, 2023
@rauldpm
Member Author

rauldpm commented Sep 18, 2023

Update report - Test


  • Created inventory for two nodes
all:
  hosts:
    Agent:
      ansible_host: 192.168.56.34
      ansible_port: 22
    Manager:
      ansible_host: 192.168.56.35
      ansible_port: 22
  vars:
    ansible_user: 'vagrant'
    ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
    ansible_ssh_private_key_file: './utils/key'
  • Created two Ansible playbooks to provision and launch the tests in the nodes with the command:
ansible-playbook playbooks/provision_test.yml -i ./inventory.yaml --limit Agent
ansible-playbook playbooks/test.yml -i ./inventory.yaml --limit Agent
  • Working on an Ansible class
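
For local runs driven from Python, the two commands above could be assembled programmatically. This is only a sketch: `playbook_command` is a hypothetical helper, and the paths and the `Agent` limit are copied from the commands in this comment.

```python
import shlex
from typing import List, Optional


def playbook_command(playbook: str, inventory: str,
                     limit: Optional[str] = None) -> List[str]:
    """Build the argument list for one ansible-playbook invocation."""
    cmd = ["ansible-playbook", playbook, "-i", inventory]
    if limit:
        cmd += ["--limit", limit]
    return cmd


commands = [
    playbook_command("playbooks/provision_test.yml", "./inventory.yaml", limit="Agent"),
    playbook_command("playbooks/test.yml", "./inventory.yaml", limit="Agent"),
]
for cmd in commands:
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting issues when it is later passed to `subprocess.run`.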

@rauldpm
Member Author

rauldpm commented Sep 19, 2023

Update report - Test

  • Created Ansible class
    • This class will use the ansible_runner module to run playbooks using an inventory
  • The main Python file will create an Ansible object and manage all the connections
  • Created a playbook to deploy a Wazuh agent and a Wazuh manager in two instances following the documentation process
    • This playbook will, sequentially, configure the repositories in all nodes, install the packages, and start the services
Playbook execution
╰─➤  python3 test.py

PLAY [all] *********************************************************************

TASK [Gathering Facts] *********************************************************
ok: [Agent1]
ok: [Manager]

TASK [Install GPG key] *********************************************************
changed: [Manager]
changed: [Agent1]

TASK [Add Wazuh repository] ****************************************************
changed: [Agent1]
changed: [Manager]

TASK [Update package information] **********************************************
changed: [Manager]
changed: [Agent1]

PLAY [Manager*] ****************************************************************

TASK [Gathering Facts] *********************************************************
ok: [Manager]

TASK [Install the Wazuh manager] ***********************************************
changed: [Manager]

TASK [Enable and start Wazuh service] ******************************************
changed: [Manager] => (item=systemctl daemon-reload)
changed: [Manager] => (item=systemctl enable wazuh-manager)
changed: [Manager] => (item=systemctl start wazuh-manager)

PLAY [Agent*] ******************************************************************

TASK [Gathering Facts] *********************************************************
ok: [Agent1]

TASK [Install the Wazuh agent with environment variables] **********************
changed: [Agent1]

TASK [Enable and start Wazuh service] ******************************************
changed: [Agent1] => (item=systemctl daemon-reload)
changed: [Agent1] => (item=systemctl enable wazuh-agent)
changed: [Agent1] => (item=systemctl start wazuh-agent)

PLAY RECAP *********************************************************************
Agent1                     : ok=7    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
Manager                    : ok=7    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0  

image
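
The Ansible class described in this report is not shown in the thread. A minimal sketch of what such a wrapper around `ansible_runner.run` might look like follows; the class layout, the injectable `runner` argument, and the boolean return value are assumptions for illustration only.

```python
from typing import Any, Callable, Optional


class Ansible:
    """Thin wrapper that runs playbooks against a single inventory (sketch)."""

    def __init__(self, inventory: str, private_data_dir: str = ".",
                 runner: Optional[Callable[..., Any]] = None):
        self.inventory = inventory
        self.private_data_dir = private_data_dir
        # Injectable so the class can be exercised without Ansible installed;
        # when left as None, ansible_runner.run is used.
        self._runner = runner

    def run_playbook(self, playbook: str, limit: Optional[str] = None) -> bool:
        runner = self._runner
        if runner is None:
            import ansible_runner  # lazy import: only needed for real runs
            runner = ansible_runner.run
        # ansible_runner resolves a relative playbook path inside
        # <private_data_dir>/project, so an absolute path is the safer choice.
        kwargs = {
            "private_data_dir": self.private_data_dir,
            "playbook": playbook,
            "inventory": self.inventory,
        }
        if limit:
            kwargs["limit"] = limit
        result = runner(**kwargs)
        return result.rc == 0  # rc is the ansible-playbook exit code
```

The main Python file would then create one `Ansible` object per inventory and call `run_playbook` for the provisioning and test playbooks in turn.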

@rauldpm
Member Author

rauldpm commented Sep 20, 2023

Update report

  • Worked on the Wazuh agent and manager tests
  • Migrated current tests to the new execution format
  • Working on new sequential tests between nodes

@jnasselle
Member

Update report

  • Developing infrastructure provisioning script
  • Approach:
  • Tool: infra.py
    • Input parameters: a file requesting resources, e.g.
     ---
     - composite-name: linux-ubuntu-22.04-amd64
       role: agent
       provider: aws
       alias: Agent
     - composite-name: linux-ubuntu-22.04-amd64
       role: manager
       provider: aws
       alias: Manager
  • Specs: define what is supported by the tool
    • Roles: Wazuh currently constrains roles based on OS resources. Defined on specs/roles.yaml
     - role: agent
       composite-name: 
         linux-*-amd64:
           aws:
             instance-type: t2.nano
         linux-*-arm64:
           aws:
             instance-type: c7gn.medium
     - role: manager
       composite-name: 
         linux-*-amd64:
           aws:
             instance-type: c5a.xlarge
         linux-*-arm64:
           aws:
             instance-type: c6g.xlarge
    • OS: Define the OS, its provider, and ID for each one. Defined in specs/os/*.yaml, grouped by OS family, e.g. ubuntu.yaml
    ---
    linux-ubuntu-22.04-amd64:
      specs:
        distro: ubuntu
        familly: linux
        version: 22.04
        codename: jammy
        arch: amd64
      providers:
        aws:
          ami: ami-0fc5d935ebf8bc3bc
          zone: us-east-1
        vagrant:
          box: ubuntu/focal64
    linux-ubuntu-22.04-arm64:
      specs:
        distro: ubuntu
        familly: linux
        version: 22.04
        codename: jammy
        arch: arm64
      providers:
        aws:
          ami: ami-016485166ec7fa705
          zone: us-east-1
    • Providers: Define provider default values, used when they are not explicitly defined. Defined in specs/providers/*.yaml, e.g. aws.yaml
       security-group: [ sg-0877d224f9c5b2708 ]
       zone: us-east-1
       key: 
         type: fixed
         name: jnasselle-dev
       name:
         type: dynamic
         prefix: dt-${tier}-${os}-${arch}-${type}
       tags:
         type: qa

@rauldpm
Member Author

rauldpm commented Sep 25, 2023

Update report

  • Started process of modularization of provision and testing based on the tasks that the process must contemplate
  • On hold due to 4.5.3 RC 1 testing (Release 4.5.3 - RC 1, wazuh#19111)

@rauldpm
Member Author

rauldpm commented Sep 27, 2023

Still on hold due to wazuh/wazuh#19166

@rauldpm
Member Author

rauldpm commented Oct 2, 2023

Still on hold due to wazuh/wazuh#19300 and wazuh/wazuh#19166

@rauldpm
Member Author

rauldpm commented Oct 4, 2023

Update report

  • Creating basic tests for each section

@rauldpm
Member Author

rauldpm commented Oct 5, 2023

@rauldpm
Member Author

rauldpm commented Oct 16, 2023

@fcaffieri
Member

fcaffieri commented Oct 18, 2023

Observability module

Proposed architecture

image

Working on the proposed architecture, only the part that will enter DTT1

@fcaffieri
Member

Update Observability

Jenkins:
image

Loki metrics:
image

Grafana dashboard:

image
image
image

Jenkins log:
image

Grafana Loki datasource
image

@fcaffieri
Member

Update

I am investigating Jenkins, Prometheus and Grafana integration. The objective is to obtain a graph like the following:

image

This could give us a lot of real-time information that we do not have today.

@fcaffieri
Member

Update

Configured Prometheus with Jenkins and Grafana:

image

This can give us a lot of information about Jenkins metrics, such as:

  • Number of nodes
  • Usage metrics on nodes and Jenkins
  • Number of jobs executed successfully vs. with errors
  • Several other metrics of this type; more example dashboards are still being generated for the PoC.

image

@fcaffieri
Member

fcaffieri commented Oct 30, 2023

Update

After analyzing Prometheus and Grafana, I was able to configure the following dashboards, which provide information about:

  • Jenkins metrics:

    • CPU
    • Memory
    • Disk usage
    • Network traffic
  • Pipeline metrics:

    • Number of Jobs
    • Number of Jobs executed OK
    • Number of Jobs executed with errors
    • Number of unstable executed Jobs
    • Number of aborted executed Jobs
    • Metrics on executor nodes
    • Job queue duration
    • Offline nodes

image
image
image
image

@fcaffieri fcaffieri added level/epic and removed Epic labels Oct 31, 2023
@QU3B1M QU3B1M self-assigned this Nov 8, 2023
@QU3B1M
Member

QU3B1M commented Nov 24, 2023

Update report

  • Merged test and provision modules
  • Working on fixes to make the modules work correctly together with the Jenkins launcher

@jnasselle
Member

Update

Based on PoC feedback (functional and non-functional) and weekly design meetings, the following tasks should be addressed:

  • Propose a task-driven architecture/framework that keeps the current modular approach while improving parallelism based on DAG designs
    • Rather than relying on our engineers to hand-design the best execution flow, design or use a tool that ingests a DAG (YAML as the desired format) and proposes an execution plan in which each task runs as soon as the tasks it depends on have completed
  • Evaluate current and projected pipelines/executions into the designed arch to detect design flaws
  • PoC of the new approach
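
One possible shape for the DAG-to-plan step is sketched below using Python's standard `graphlib` module. The task names are hypothetical, the DAG is written as a plain dictionary (the result one would get after loading a YAML file), and grouping ready tasks into batches is a simplification chosen for readability.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def execution_plan(dag: Dict[str, List[str]]) -> List[List[str]]:
    """Turn a task -> prerequisites mapping into batches of parallelizable tasks."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the DAG contains a cycle
    plan = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose prerequisites are done
        plan.append(ready)
        sorter.done(*ready)
    return plan


# Hypothetical DAG: each task lists the tasks it depends on.
dag = {
    "allocate-manager": [],
    "allocate-agent": [],
    "provision-manager": ["allocate-manager"],
    "provision-agent": ["allocate-agent", "provision-manager"],
    "test-manager": ["provision-manager"],
    "test-agent": ["provision-agent"],
}

for step, batch in enumerate(execution_plan(dag), start=1):
    print(step, batch)
```

A real executor would call `done()` for each task individually as it finishes instead of waiting for the whole batch, so that dependents start as early as possible.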

@jnasselle
Member

@rauldpm
Member Author

rauldpm commented Dec 11, 2023

Review notes

  • The PoC is not clear; we must establish a series of actions to validate it before continuing with the next iteration
  • To validate the PoC we should have at least:
    • Documentation of how to carry out the PoC (using individual modules as well as together)
    • Video that shows the basic operation of the PoC
    • List of used tools (documentation)
    • Each module can be easily executed without the need for a relationship with another module during its execution (total decoupling of dependencies)
  • Given the last meeting and the comment by @jnasselle, we must review the current status and ensure that the PoC meets the required standard. This implies making changes and revalidating them
  • I agree with @jnasselle that the Observability module is very invasive in the developed modules; we should review this and ensure independence between modules
    • Does it make local operations too complex?

Proposed plan

We must ensure the functionality of the main modules (Allocator/Provision/Test) before readapting the Observability module. Before working on each module, the previous one must be completed and validated. Each module must take into account the previous one

Modular

  1. Ensure the functionality of the Allocator module
    • Deployments are correct
    • Input/Output artifacts are correct
    • Parallelized deployment
    • Local/cloud deployment
  2. Ensure the functionality of the Provision module
    • Input artifacts are correct
    • Parallelized/sequential provisioning to each instance according to the provisioning case
  3. Ensure the functionality of the Testing module
    • Input/Output artifacts are correct
    • Tests are executed on each instance

Modular integration

  1. Ensure the operation of the Allocator/Provision/Testing modules in tandem
    • Jenkinsfile for cloud -> How is this executed locally? -> PoC environment deployment documentation
    • Modules are managed sequentially/parallel correctly
    • Correct inventories/artifacts (input/outputs) management
    • Will the Jenkinsfile use multithreading? How are we going to support provisioning an instance right after its allocation if other instances are still being allocated?
      • Instead of calling the Python script once with a composed inventory, the script must be called once for each inventory target, and it will only wait if the provision depends on other instances not yet provisioned. For example, an instance that is going to be provisioned with nano will not wait for other allocations, but an instance that will be provisioned with an agent will wait for the manager instance to be allocated
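
The per-target waiting rule described above can be sketched with one worker thread per target. Everything here is illustrative: `run_targets`, the `allocate`/`provision` callables, and the target names (`Editor` stands in for the instance provisioned with nano) are assumptions, and the demo only records events instead of touching real instances.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_targets(targets: Dict[str, List[str]],
                allocate: Callable[[str], None],
                provision: Callable[[str], None]) -> None:
    """Allocate all targets in parallel; provision each one as soon as it and
    the targets it depends on have been allocated."""
    allocated = {name: threading.Event() for name in targets}
    failed = set()

    def worker(name: str) -> None:
        try:
            allocate(name)
        except Exception:
            failed.add(name)
            raise
        finally:
            allocated[name].set()  # always unblock dependents, even on failure
        for dep in targets[name]:
            allocated[dep].wait()
            if dep in failed:
                raise RuntimeError(f"{name}: dependency {dep} was not allocated")
        provision(name)

    # One thread per target, so a waiting worker can never starve the others.
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        list(pool.map(worker, targets))  # re-raises the first worker error


# Demo with stand-in actions that only record what happened and when.
events, lock = [], threading.Lock()


def record(kind: str, delay: float = 0.0) -> Callable[[str], None]:
    def action(name: str) -> None:
        time.sleep(delay if name == "Manager" else 0.0)  # slow manager allocation
        with lock:
            events.append((kind, name))
    return action


run_targets({"Manager": [], "Agent": ["Manager"], "Editor": []},
            allocate=record("allocated", delay=0.05),
            provision=record("provisioned"))
print(events)
```

In this sketch the `Agent` worker blocks only on the `Manager` allocation event, while `Editor` proceeds to provisioning without waiting for anything else.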

Observability module

  1. Ensure the operation of the observability module
    • What dependencies currently exist between the observability module and the rest?
    • With the current PoC, can the modules be executed independently or together, both locally and in the cloud?
    • Should the modules know the tools necessary for the operation of the observability module, or should this module be the one that adapts to the rest? Example: the testing module uses a database (InfluxDB) that is consumed by Grafana

Before continuing with the final development of the modules, we should consider the changes and proposals discussed in the last meeting to evaluate the impact and carry out a second iteration of the PoC so that we validate these annotations.

@fcaffieri
Member

Review Notes

Referring to the comments made

The PoC is not clear; we must establish a series of actions to validate it before continuing with the next iteration

The documentation will be generated in the last iteration of the development because, as progress is made, problems are found and resolved. For more details, see issue https://github.com/wazuh/wazuh-qa/issues/4495
Regarding the modules and their ease of execution, this is one of the problems found after the PoC, and we are working on it in iteration 2.
Regarding the observability module, as built it is not invasive or intrusive in the modules: it only collects information from the nodes where the modules run and stores it in Loki. No configuration or development is required within each module for it to work.

We must ensure the functionality of the main modules (Allocator/Provision/Test) before readapting the Observability module. Before working on each module, the previous one must be completed and validated. Each module must take into account the previous one

Exactly, this is what was proposed in iteration 2 to be developed, in conjunction with the analysis and implementation of the orchestrator of said modules.


Referring to the aforementioned modular approach

Indeed, iteration 2 is where everything mentioned in @rauldpm's comment will be taken into account. In summary:

  • Correct functioning of each module.
  • Ease of use.
  • Error and validation management to avoid input or user errors.
  • Parallelization of executions; this is where the orchestrator comes into play.
  • Improvement in the structure and implementation of the project in general.
  • Jenkins will execute the orchestrator with the defined input (it must be versatile and easily cover all defined use cases).
  • For local execution, the developed library will be installed, and each module or the complete orchestrator can be executed using a simple launcher.
  • Parallelism will be handled by the Python orchestrator and not by Jenkins. Only tasks that depend on the completion of other tasks will run sequentially.
  • Dependency management will be carried out by the orchestrator, which will have the logic to define the dependencies.

Referring to the observability module:

  • There are no dependencies between the observability module and the rest. It only captures the information they leave and stores it in a bucket to be viewed later in the respective dashboards.
  • The observability module is not a module that is executed; rather, it is a service available to users who require it.
  • In the case of the test module that uses InfluxDB, it only uses a plugin that stores the pytest data in that database. The database is created as a service and would be part of the observability module; the test only connects to it to save the information. This test functionality is still being analyzed for feasibility. For the PoC and iterations 1 and 2, this functionality is not available, because whether to implement it still needs to be analyzed.

Conclusions

The PoC is considered finalized, since the objective proposed for it was completed in the branch https://github.com/wazuh/wazuh-qa/tree/enhancement/4495-deployability-tier-1 and presented with the acceptance of the stakeholders.
Another PoC will be scheduled at the end of iteration 2 of the DTT to address the points raised and the problems found in iteration 1.

Projects
No open projects
Status: Done
Development

No branches or pull requests

4 participants