DTT1 - PoC - Observability module #4668

Closed
fcaffieri opened this issue Oct 31, 2023 · 11 comments

@fcaffieri
Member

fcaffieri commented Oct 31, 2023

Description

This issue aims to design and create a PoC of the Observability module.

This module is responsible for centralizing all the information related to:

  • Jenkins states (metrics):
    • Memory
    • Disk
    • CPU
    • Jenkins nodes
    • Executor nodes
    • Job-status
    • Counts of successful jobs, failed jobs, etc.

Information exploited:

  • Collect the test output.
  • Keep a history of the tests.
  • Allow time tracking.
  • Estimate test duration.
  • See the status of the process in real time.
  • More accurate and faster diagnosis, which will facilitate troubleshooting.
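
As an illustration of the "estimate test duration" item, here is a minimal sketch that assumes the history of a test is available as a list of past durations in seconds. The function name, the use of the median, and the sample values are assumptions made for this example and are not part of the PoC.

```python
from statistics import median


def estimate_duration(history_seconds, default=None):
    """Return the median of past durations, or `default` when there is no history."""
    if not history_seconds:
        return default
    return median(history_seconds)


# Hypothetical durations (in seconds) of previous runs of one test job.
past_runs = [120.0, 135.0, 118.0, 400.0, 125.0]
print(estimate_duration(past_runs))
```

The median is used here instead of the mean so that a single unusually slow run (400 s in the sample) does not dominate the estimate.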

Alerts:

  • Generation of alerts, due to problems in any of the modules:
    • Anomalies in the tests
    • Failures in tests.
    • Failures in provisioning.
    • Failures in allocations.

All this information will be presented in Grafana dashboard format.

Architecture:

image

For this PoC, this module will only present metrics for the VMs, nodes and workers. It will also show metrics about the Jobs (per module) and finally logs.

Parent issue #4524.

@fcaffieri
Member Author

Update

Currently the following has been implemented locally:

  • VM 1:
    • Jenkins
    • Loki agent for log collection
    • Prometheus agent for collecting metrics from Jenkins and the VM
  • VM 2:
    • Grafana for data visualization and exploitation
  • VM 3:
    • Prometheus to obtain metrics
  • VM 4:
    • Loki for information storage
Configurations

Jenkins: standard installation and creation of example pipelines. The Loki agent is installed and exposed on a specific port. The Prometheus agent is also installed.
https://www.cherryservers.com/blog/how-to-install-jenkins-on-ubuntu-22-04

Grafana: standard installation and configuration of data sources; at the moment these are Loki and Prometheus.
https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/

Prometheus: standard installation. To obtain metrics, the integration with Jenkins uses the Prometheus agent installed on the Jenkins server. The Prometheus plugin was also installed in Jenkins to review which metrics it provides and how to exploit them. This integration is configured in Prometheus as a scrape target on a specific port.
https://www.cherryservers.com/blog/install-prometheus-ubuntu

root@prometheus:/home/vagrant# cat /etc/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'jenkins'
    metrics_path: /prometheus
    scheme: http
    static_configs:
      - targets: ['172.16.1.57:8080']
  - job_name: node-jenkins # Node exporter
    static_configs:
      - targets: ['172.16.1.57:9100'] 
root@prometheus:/home/vagrant#
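
To review which metrics a scrape target such as the Jenkins `/prometheus` endpoint provides, the text it returns can be parsed offline. The following is a minimal sketch, assuming the payload has already been downloaded as a string. It is a simplified parser, not a full implementation of the Prometheus exposition format, and the sample metric `example_builds_total` is made up for illustration.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, labels, value = match.groups()
        samples.append((name, labels or "", float(value)))
    return samples


# Hypothetical payload; real metric names depend on the exporter or plugin.
payload = """
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
example_builds_total{job="test",result="ok"} 7
"""

for name, labels, value in parse_metrics(payload):
    print(name, labels, value)
```

Labels are kept as the raw `{...}` string, which is enough to list metric names and values when deciding what to show in a dashboard.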

Loki agent: standard installation. It was configured to collect the system logs, the Jenkins logs and the logs of each executed job.
https://github.com/grafana/loki/releases
https://psujit775.medium.com/how-to-setup-loki-in-ubuntu-20-04-f7aab49910fc

Node exporter: https://prometheus.io/docs/guides/node-exporter/

Promtail:

root@jenkins:/home/vagrant# cat /etc/promtail/config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://172.16.1.59:3100/loki/api/v1/push

scrape_configs:
  - job_name: jenkins-job-2
    static_configs:
      - targets:
          - localhost
        labels:
          job: jenkins-jobs-2
          __path__: /var/lib/jenkins/jobs/*/builds/*/*log

  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: system-logs
          __path__: /var/log/*log

  - job_name: jenkins-jobs
    static_configs:
      - targets:
          - localhost
        labels:
          job: jenkins-jobs
          __path__: /var/lib/jenkins/jobs/test/builds/4/*log

#  - job_name: jenkins_pipelines
#    static_configs:
#      - targets:
#        - localhost
#        labels:
#          job: jenkins_pipelines
#    pipeline_stages:
#    - match:
#        expression: "^.*$"
#      pipeline_stage:
#        - decorative:
#            expr: line
root@jenkins:/home/vagrant#
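
The `__path__` glob `/var/lib/jenkins/jobs/*/builds/*/*log` encodes the job name and the build in the file path, which is useful when a dashboard needs to group logs by pipeline and build. The sketch below shows one way to recover both values from a path. It assumes build directories are numeric, and the function name is hypothetical.

```python
import re

# Mirrors the Promtail glob /var/lib/jenkins/jobs/*/builds/*/*log,
# with the assumption that the build directory is a number.
BUILD_LOG_RE = re.compile(
    r"^/var/lib/jenkins/jobs/(?P<job>[^/]+)/builds/(?P<build>\d+)/[^/]*log$"
)


def parse_build_log_path(path):
    """Return (job_name, build_number) for a Jenkins build log path, or None."""
    match = BUILD_LOG_RE.match(path)
    if match is None:
        return None
    return match.group("job"), int(match.group("build"))


print(parse_build_log_path("/var/lib/jenkins/jobs/test/builds/4/log"))
```

Paths that do not follow this layout, such as the system logs under `/var/log`, return `None`.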

@fcaffieri
Member Author

Moved to on hold due to the switch with #4665.

@fcaffieri
Member Author

Update

Working on generating the dashboard from the information obtained by the provision module.

@fcaffieri
Member Author

fcaffieri commented Nov 15, 2023

Update

I continue working on the generation of dashboards and on information management, in particular the formatting of the logs: Jenkins concentrates logs from Ansible, Python, pytest and Jenkins itself. I am researching the best way to format these logs and talking with the team to agree on a standard format for all modules. The goal is to generate a series of dynamic dashboards that allow the results of the executions to be visualized quickly.

@fcaffieri
Member Author

Update

Logs obtained with the Loki agent, used to generate a dashboard:

PLAY [Manager*] ****************************************************************

TASK [Gathering Facts] *********************************************************
ok: [Manager]

TASK [Install the Wazuh manager] ***********************************************
changed: [Manager] => {"changed": true, "cmd": ["apt-get", "-y", "install", "wazuh-manager"], "delta": "0:00:00.536879", "end": "2023-11-15 23:10:00.392475", "msg": "", "rc": 0, "start": "2023-11-15 23:09:59.855596", "stderr": "", "stderr_lines": [], "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nwazuh-manager is already the newest version (4.6.0-1).\n0 upgraded, 0 newly installed, 0 to remove and 58 not upgraded.", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "wazuh-manager is already the newest version (4.6.0-1).", "0 upgraded, 0 newly installed, 0 to remove and 58 not upgraded."]}

PLAY [Agent*] ******************************************************************

TASK [Gathering Facts] *********************************************************
ok: [Agent1]

TASK [Install the Wazuh agent] *************************************************
changed: [Agent1] => {"changed": true, "cmd": ["apt-get", "-y", "install", "wazuh-agent"], "delta": "0:00:00.529907", "end": "2023-11-15 23:10:02.385842", "msg": "", "rc": 0, "start": "2023-11-15 23:10:01.855935", "stderr": "", "stderr_lines": [], "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nwazuh-agent is already the newest version (4.6.0-1).\n0 upgraded, 0 newly installed, 0 to remove and 58 not upgraded.", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "wazuh-agent is already the newest version (4.6.0-1).", "0 upgraded, 0 newly installed, 0 to remove and 58 not upgraded."]}

PLAY RECAP *********************************************************************
Agent1                     : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
Manager                    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
No config file found; using defaults

PLAY [Agent*] ******************************************************************

TASK [Gathering Facts] *********************************************************
ok: [Agent1]

TASK [Modify Wazuh manager IP in Wazuh agent] **********************************
ok: [Agent1] => {"changed": false, "msg": "", "rc": 0}

PLAY RECAP *********************************************************************
Agent1                     : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
No config file found; using defaults

PLAY [Manager*] ****************************************************************

TASK [Gathering Facts] *********************************************************
ok: [Manager]

TASK [Enable and start Wazuh service] ******************************************
changed: [Manager] => (item=systemctl daemon-reload) => {"ansible_loop_var": "item", "changed": true, "cmd": ["systemctl", "daemon-reload"], "delta": "0:00:00.295270", "end": "2023-11-15 23:10:07.478110", "item": "systemctl daemon-reload", "msg": "", "rc": 0, "start": "2023-11-15 23:10:07.182840", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
changed: [Manager] => (item=systemctl enable wazuh-manager) => {"ansible_loop_var": "item", "changed": true, "cmd": ["systemctl", "enable", "wazuh-manager"], "delta": "0:00:00.311433", "end": "2023-11-15 23:10:08.038251", "item": "systemctl enable wazuh-manager", "msg": "", "rc": 0, "start": "2023-11-15 23:10:07.726818", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
changed: [Manager] => (item=systemctl start wazuh-manager) => {"ansible_loop_var": "item", "changed": true, "cmd": ["systemctl", "start", "wazuh-manager"], "delta": "0:00:00.005700", "end": "2023-11-15 23:10:08.285128", "item": "systemctl start wazuh-manager", "msg": "", "rc": 0, "start": "2023-11-15 23:10:08.279428", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY [Agent*] ******************************************************************

TASK [Gathering Facts] *********************************************************
ok: [Agent1]

TASK [Enable and start Wazuh service] ******************************************
changed: [Agent1] => (item=systemctl daemon-reload) => {"ansible_loop_var": "item", "changed": true, "cmd": ["systemctl", "daemon-reload"], "delta": "0:00:00.275375", "end": "2023-11-15 23:10:10.059105", "item": "systemctl daemon-reload", "msg": "", "rc": 0, "start": "2023-11-15 23:10:09.783730", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
changed: [Agent1] => (item=systemctl enable wazuh-agent) => {"ansible_loop_var": "item", "changed": true, "cmd": ["systemctl", "enable", "wazuh-agent"], "delta": "0:00:00.359611", "end": "2023-11-15 23:10:10.657072", "item": "systemctl enable wazuh-agent", "msg": "", "rc": 0, "start": "2023-11-15 23:10:10.297461", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
changed: [Agent1] => (item=systemctl start wazuh-agent) => {"ansible_loop_var": "item", "changed": true, "cmd": ["systemctl", "start", "wazuh-agent"], "delta": "0:00:00.005630", "end": "2023-11-15 23:10:10.904515", "item": "systemctl start wazuh-agent", "msg": "", "rc": 0, "start": "2023-11-15 23:10:10.898885", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************
Agent1                     : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
Manager                    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
No config file found; using defaults

PLAY [Install packages on Agent1] **********************************************

TASK [Gathering Facts] *********************************************************
ok: [Agent1]

TASK [Installcurl] *************************************************************
ok: [Agent1] => {"cache_update_time": 1694246393, "cache_updated": false, "changed": false}

PLAY RECAP *********************************************************************
Agent1                     : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
No config file found; using defaults

PLAY [Install packages on Agent1] **********************************************

TASK [Gathering Facts] *********************************************************
ok: [Agent1]

TASK [Installnano] *************************************************************
ok: [Agent1] => {"cache_update_time": 1694246393, "cache_updated": false, "changed": false}

PLAY RECAP *********************************************************************
Agent1                     : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
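
A dashboard that summarizes these runs needs the per-host counters from each `PLAY RECAP` block. The following is a minimal sketch of extracting them from the raw log text, assuming the recap lines keep the layout shown above. The function name is hypothetical and this is not the parsing used in the PoC.

```python
import re

# A recap line looks like: "Agent1   : ok=2    changed=1    unreachable=0 ..."
RECAP_RE = re.compile(r"^(?P<host>\S+)\s*:\s+(?P<stats>(?:\w+=\d+\s*)+)$")


def parse_recap(log_text):
    """Return a list of (host, counters) tuples, one per recap line, in log order."""
    results = []
    for line in log_text.splitlines():
        match = RECAP_RE.match(line.strip())
        if match is None:
            continue  # not a recap line (PLAY/TASK headers, task results, etc.)
        counters = {}
        for pair in match.group("stats").split():
            key, value = pair.split("=")
            counters[key] = int(value)
        results.append((match.group("host"), counters))
    return results


sample = """PLAY RECAP *********************************************************************
Agent1                     : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
Manager                    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
"""

for host, counters in parse_recap(sample):
    print(host, "failed" if counters["failed"] or counters["unreachable"] else "ok")
```

Because one log can contain several playbook runs, the result is a list rather than a dictionary keyed by host, so repeated hosts are not overwritten.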

Working on the integration of the provision and test modules, since both are almost finished.

@fcaffieri
Member Author

Update

I am trying to generate dynamic panels that allow selecting a pipeline. Based on this selection, a grid should show the latest builds of that pipeline with information such as status, build duration, link to the pipeline, etc. Then, when a build is selected from this list, another panel below it should show the execution logs of that build.
So far I have managed to apply the filters: the pipeline filter works and lists the existing pipelines, but I am still looking for a way to show the list of builds. I found some obstacles that I am solving.
The logs are not a problem, because the information is already in Grafana and can be shown once this missing filter is in place.
Additionally, I will add a text search filter to be able to filter within the selected log.
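
The log panel described here comes down to a Loki query built from the selected pipeline, the selected build and an optional search text. The sketch below builds such a LogQL string. It assumes the `job` label from the Promtail configuration and the `filename` label that Promtail attaches to file targets. The function name, the path layout and the restriction on pipeline names are assumptions made for this example, not the queries used in the PoC dashboard.

```python
import re

# Assumption: pipeline names contain only characters that need no regex escaping.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+$")


def build_logql(job_label, pipeline, build, search=""):
    """Build a LogQL query for one build log, optionally filtered by a search text."""
    if not SAFE_NAME.match(pipeline):
        raise ValueError(f"unsupported pipeline name: {pipeline!r}")
    path_regex = f"/var/lib/jenkins/jobs/{pipeline}/builds/{int(build)}/.*log"
    query = f'{{job="{job_label}", filename=~"{path_regex}"}}'
    if search:
        # Escape backslashes and double quotes for the quoted LogQL string.
        escaped = search.replace("\\", "\\\\").replace('"', '\\"')
        query += f' |= "{escaped}"'
    return query


print(build_logql("jenkins-jobs-2", "test", 4, "error"))
```

In Grafana, the pipeline, build and search text would normally come from dashboard variables; building the string in code only makes the structure of the query explicit.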

image

@fcaffieri
Member Author

Update

The following dashboard was generated with everything mentioned above:
1- Filter by job
2- List of builds with their status, execution date, execution duration, links to Jenkins and visualization of the logs.
These features can be seen in the following video.

Peek 22-11-2023 19-48

@fcaffieri
Member Author

fcaffieri commented Nov 23, 2023

Update

Finished the Jenkins Job details dashboard.

Peek 23-11-2023 19-57

Working on the integration with the test module.

@fcaffieri
Member Author

The PoC was finalized and presented to the interested parties. The DTT work will continue in other issues.

@rauldpm
Member

rauldpm commented Dec 11, 2023

Review: #4524 (comment)

@fcaffieri
Member Author

Reviews answered at #4524 (comment)
