Host Deployment Documentation

Zhenbo edited this page Dec 21, 2022 · 1 revision

Configuration

  • The *.yml files in the config directory handle configuration for both the Cloud and Site stacks. Under Hosts and switchData, ifName and vlan together are meant to be the unique identifier of each flow (currently only ifName is used as the identifier).
  • scrapeInterval and scrapeDuration do not change the scrape rate. Every scrape runs at the default interval of 15s, and no configuration option currently changes that. This applies only to the Site stack, since no pulling is allowed in the system.
  • communityString under switchData is unique to each network element and should be kept secret; do not publish it on GitHub.
  • Both the Cloud and Site stacks contain fill*.py files that read the configuration files and dynamically write the install and start scripts. This is hidden from users behind start.sh and install.sh.
  • config.yml is an example config file for a single-switch system.
  • multiconfig.yml is an example config file for a two-switch system. (Note: a two-switch network has never been tested, so problems may appear when deploying on a network with more than one switch.)

Site

Installation

  • Site installs the SNMP exporter from GitHub. Generating the snmp.yml file has many dependencies. Make sure Go is up to date and gcc is installed properly.
  • Install and start also create helper scripts that curl the metrics from a local port and push them to the Pushgateway server on the Cloud stack. For example, Node exporter runs on port 9100, and the script generated by the install script looks like this: curl -s ${MYIP}:9100/metrics | curl --data-binary @- $pushgateway_server/metrics/job/node-exporter/instance/$MYIP. The instance label indicates where the data comes from. This URL can be customized.
  • The ARP and SNMP scripts are similar but more complicated, because they use intermediate storage files.
  • The new scripts are added to the current crontab with a 15s cycle: every 15s, the result of the curl is pushed to Pushgateway.
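
The curl pipeline for Node exporter can also be written in Python, which may help when debugging a push by hand. This is a minimal sketch using only the standard library, not the project's own helper script; the function names are hypothetical, and the addresses in the usage comment are placeholders.

```python
import urllib.request


def pushgateway_url(server: str, job: str, instance: str) -> str:
    """Build the Pushgateway URL in the same shape as the generated curl script."""
    return f"{server.rstrip('/')}/metrics/job/{job}/instance/{instance}"


def push_metrics(exporter_url: str, target_url: str, timeout: float = 5.0) -> int:
    """Fetch the exporter's text metrics and POST them unchanged to Pushgateway."""
    with urllib.request.urlopen(exporter_url, timeout=timeout) as resp:
        body = resp.read()
    # Same effect as `curl --data-binary @-`: the body is sent as-is in a POST.
    request = urllib.request.Request(target_url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


# Usage (placeholder addresses, not taken from the project):
# target = pushgateway_url("http://pushgateway.example", "node-exporter", "198.51.100.10")
# push_metrics("http://198.51.100.10:9100/metrics", target)
```

Changing the job or instance argument corresponds to customizing the URL as described in the list.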

Running

  • ./start.sh uses fill_start.py to parse the configuration file, fills in ./dynamic_start.sh, and runs it.
  • It lets the user choose which containers to start and composes all the containers at the end.
  • It dynamically generates a push script for each exporter inside the crontabs subdirectory, based on user input.
  • To add additional switches and their corresponding SNMP exporters, run python3 add_switch.py. This script generates a Docker Compose file and an snmp.yml, and updates the script that is run from crontab to include the new SNMP exporter.
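
To illustrate how a push script per exporter could be generated from configuration values, here is a rough sketch. It is not the code in fill_start.py; the function names, the push_<job>.sh file naming, and the exporters mapping are assumptions made for the example.

```python
from pathlib import Path


def render_push_script(my_ip: str, port: int, pushgateway_server: str, job: str) -> str:
    """Return the text of a shell script that pushes one exporter's metrics."""
    lines = [
        "#!/bin/bash",
        f'MYIP="{my_ip}"',
        # Doubled braces keep the literal ${MYIP} shell variable in the output.
        f"curl -s ${{MYIP}}:{port}/metrics | "
        f"curl --data-binary @- {pushgateway_server}/metrics/job/{job}/instance/$MYIP",
    ]
    return "\n".join(lines) + "\n"


def write_push_scripts(out_dir, my_ip, pushgateway_server, exporters):
    """Write one executable script per exporter; `exporters` maps job name to port."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for job, port in exporters.items():
        path = out / f"push_{job}.sh"  # hypothetical naming scheme
        path.write_text(render_push_script(my_ip, port, pushgateway_server, job))
        path.chmod(0o755)
        paths.append(path)
    return paths
```

Keeping the rendering separate from the file writing makes the generated text easy to inspect before anything is added to crontab.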

Cleaning

  • clean.sh removes the metrics that this specific host sent to Pushgateway. erase_pushgateway.py cleans the data by deleting the exact URLs that are present on the Pushgateway site. Node and SNMP are easier to delete since each has only one URL. ARP is more complicated and is explained in the ARP section https://github.com/esnet/sense-rtmon/wiki#arp.
  • All containers are removed and the ARP image is deleted, so that a new configuration can be applied dynamically.
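
Deleting an exact URL on Pushgateway is an HTTP DELETE against the same path that was used for the push. The sketch below shows that idea with the standard library; it is an illustration rather than the contents of erase_pushgateway.py, and the function names are hypothetical.

```python
import urllib.request


def build_delete_request(group_url: str) -> urllib.request.Request:
    """Build a DELETE request for one Pushgateway URL (job/instance group)."""
    return urllib.request.Request(group_url, method="DELETE")


def erase_groups(group_urls, timeout: float = 5.0) -> dict:
    """Send one DELETE per URL and return a mapping of URL to HTTP status."""
    results = {}
    for url in group_urls:
        with urllib.request.urlopen(build_delete_request(url), timeout=timeout) as resp:
            results[url] = resp.status
    return results


# Usage (placeholder address):
# erase_groups(["http://pushgateway.example/metrics/job/node-exporter/instance/198.51.100.10"])
```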

Cloud

Cloud Installation

  • Installs Grafana and Nginx. Nginx can be used as a reverse proxy so that Grafana is served over HTTPS.
  • Script Exporter is pulled from GitHub.

Cloud Running

Cleaning

Exporter Details:

  • Nginx

    • Running as a container that enables HTTPS. Certificates and DNS from the host are required.
    • In the stack.yml file, the Nginx ports must match the ports of the other applications and the ports they are accessed from, e.g. port 443 forwarded to 3000. To serve Pushgateway and Prometheus over HTTPS, two additional ports need to be opened. (There may be other workarounds; using location / was tried, but CSS did not apply to Pushgateway.)
    • The path to the certificates currently needs to be changed manually.
    • To remove automatic HTTPS redirection in Chrome, open chrome://net-internals/#hsts
    • Many browsers automatically redirect to HTTPS. Ports that are still on HTTP are then hard to access because of the redirect. On that page, find Delete domain security policies to remove the automatic redirect.
    • Inside the container, configuration files are stored in /etc/nginx/conf.d/.
  • Script Exporter

    • Script Exporter enables layer 2 debugging. Under the examples directory, config.yaml tells Script Exporter which script to run. args.sh and multiDef.sh are used for single-switch and two-switch setups respectively. More than two switches is not implemented yet.
    • These files are configured by fill_config.py; the data comes from the configuration files.
    • The *.sh files echo metrics, and the Prometheus database stores the data. The dashboard looks for exactly what is sent, so every change made here must also be made in the Layer 2 dashboard templates.
    • e.g. echo "host1_arp_on{host=\"${host1}\"} 1": host1_arp_on is stored in Prometheus, where 1 represents on and 0 represents off.
    • The format is a metric string followed by a number. If a string is sent in place of the number, Prometheus cannot ingest the data, the whole script fails, and no data goes through.
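
The "metric string followed by a number" rule can be checked before a line is emitted. The helper below is a small sketch of that check, not part of the project; format_metric is a hypothetical name, and the host value in the usage comment is a placeholder.

```python
from numbers import Real


def _escape(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_metric(name: str, labels: dict, value) -> str:
    """Return one metric line such as: host1_arp_on{host="..."} 1"""
    # Reject strings (and booleans) so a bad value fails here, not at scrape time.
    if isinstance(value, bool) or not isinstance(value, Real):
        raise TypeError(f"metric value must be a number, got {value!r}")
    if not labels:
        return f"{name} {value}"
    parts = ['%s="%s"' % (key, _escape(str(val))) for key, val in labels.items()]
    return "%s{%s} %s" % (name, ",".join(parts), value)


# Usage (placeholder host):
# print(format_metric("host1_arp_on", {"host": "198.51.100.10"}, 1))
```

Failing early on a non-numeric value avoids the situation where one bad line causes the whole script output to be rejected.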
  • Grafana

    • dashboard directory has dynamic.py that generates two dashboards. One contains SNMP and Node exporter data and the other contains SNMP and ARP exporter data. The diagram and Prometheus queries are dynamically built based on the config file given.
    • If the SNMP exporter is not running on either host, dynamic.py will fail to generate any dashboard.
    • dynamic.py runs curl to find the ifIndex that corresponds to the ifName given in the config file. The ifIndex is used in the queries.
    • fill_API.py curls the API key automatically and is included in the generate.sh script. Curl the API keys before changing the Grafana admin user and password.
    • generate.sh reads the configuration files from users from config_flow folder and generates a dashboard accordingly. It includes an auto curl process that creates an API authentication key and stores it in the configuration file.
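
The ifName to ifIndex lookup can be pictured as a search through the SNMP exporter's text output. The sketch below assumes that each interface metric line carries both an ifName and an ifIndex label, which depends on how snmp.yml was generated; it is not the code in dynamic.py, and the sample data is invented.

```python
import re

# Matches label pairs such as ifName="Ethernet 1/1" inside the braces.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def find_if_index(metrics_text: str, if_name: str):
    """Return the ifIndex label of the first line whose ifName matches, else None."""
    for line in metrics_text.splitlines():
        if line.startswith("#") or "{" not in line or "}" not in line:
            continue
        label_block = line[line.index("{") + 1 : line.rindex("}")]
        labels = dict(_LABEL_RE.findall(label_block))
        if labels.get("ifName") == if_name and "ifIndex" in labels:
            return labels["ifIndex"]
    return None


# Invented sample in the Prometheus text format.
SAMPLE = """# HELP ifHCInOctets Total octets received on the interface
ifHCInOctets{ifDescr="Ethernet 1/1",ifIndex="17",ifName="Ethernet 1/1"} 1234
ifHCInOctets{ifDescr="Ethernet 1/2",ifIndex="18",ifName="Ethernet 1/2"} 99
"""
```

Returning None for an unknown ifName lets the caller report a clear configuration error instead of building a query with a missing index.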
  • Prometheus

    • prometheus.yml stores the configuration for Prometheus.
    • It targets port 9090 (Pushgateway), 9091 (Prometheus itself), and 9496 (Script Exporter), for both single-switch and two-switch setups (multi/default).
    • prometheus.yml is updated from the config file during install and start. The only part that changes is the IP address. To add more Script Exporter scripts, follow the current syntax.
  • SNMP

    • The SNMP exporter accesses the switch and MIB through this URL:
    • <host_ip_address>:9116/snmp?target=<switch_ip_address>&module=<module_names_e.g.: if_mib>
    • Curl stores the result of the query in an intermediate file and then pushes the content to Pushgateway.
    • For downloading MIBs, refer to: https://github.com/esnet/sense-rtmon/issues/17#issue-1330372320
    • export MIBDIRS=<mibs_directory>: the default directory is currently site/SNMPExporter/src/github.com/prometheus/snmp_exporter/generator/mibs. The mibs directory is generated by make mibs. MIB files need to be moved to a single directory. install_snmp.py installs private MIBs based on the network element brand entered by the user.
    • make mibs does not work consistently because of the URLs used for downloading. All MIBs are now imported from librenms: https://github.com/librenms/librenms/tree/master/mibs
    • Each SNMP exporter added later with add_switch.py runs on port 9115 + (number of existing SNMP exporters).
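
The query URL and the intermediate file step can be sketched as follows. This is an illustration under assumptions, not the project's script: the function names and the output path are hypothetical, and the addresses in the usage comment are placeholders.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlencode


def snmp_scrape_url(host_ip: str, switch_ip: str, module: str = "if_mib", port: int = 9116) -> str:
    """Build <host>:<port>/snmp?target=<switch>&module=<module> with proper encoding."""
    query = urlencode({"target": switch_ip, "module": module})
    return f"http://{host_ip}:{port}/snmp?{query}"


def scrape_to_file(url: str, out_path: str, timeout: float = 10.0) -> int:
    """Save the exporter response to an intermediate file; return the byte count."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    Path(out_path).write_bytes(data)
    return len(data)


# Usage (placeholder addresses and file name):
# url = snmp_scrape_url("198.51.100.10", "192.0.2.1")
# scrape_to_file(url, "snmp_temp.txt")
```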
  • ARP

    • ARP is more complicated because it needs to detect changes in the ARP table (arp -a).
    • ARP files are located under site/Metrics/ARPMetrics.
    • Important files:
      • arpOut.json stores the output of arp -a on the host system in JSON format. The plain output is converted to JSON by convertARP.py.
      • prev.json stores the previous arp -a output.
      • delete.json stores all URLs currently on Pushgateway, in a format that can be processed to erase Pushgateway data directly.
    • Put together: arpOut.json is updated every 15s. If it differs from prev.json, the ARP container deletes all current URLs listed in delete.json and pushes new URLs from arpOut.json.
    • ping_status and prev_ping_status work in a similar fashion. The host pings the other host, stores the result, and sends it to Pushgateway. If the two files differ, everything on Pushgateway is deleted and the URLs and ping status are resent.
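
The compare-then-replace cycle described for arpOut.json, prev.json, and delete.json can be summarized in a few lines. This sketch only models the decision; the JSON layout (a list of entries) and the function names are assumptions, and the real container logic may differ.

```python
import json
from pathlib import Path


def load_json(path, default):
    """Read a JSON file, returning `default` when the file does not exist yet."""
    file = Path(path)
    if not file.exists():
        return default
    return json.loads(file.read_text())


def plan_arp_update(current, previous, delete_urls):
    """Return (urls_to_delete, entries_to_push) for one 15s cycle.

    Nothing is deleted or pushed when the ARP table is unchanged.
    """
    if current == previous:
        return [], []
    return list(delete_urls), list(current)


# Usage with the file names from the list above:
# to_delete, to_push = plan_arp_update(
#     load_json("arpOut.json", []), load_json("prev.json", []), load_json("delete.json", [])
# )
```

The same comparison applies to ping_status and prev_ping_status, with the ping result taking the place of the ARP entries.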
  • TCP Exporter

    • Currently not functional and in development. It is similar to ARP and could send data to Pushgateway with small fixes in overwrite_json_exporter_tcp.py (less than 1 hour of work). However, the design might need to change.

Others:

  • Dashboard Diagram Generation:

    • mermaid.live is used to draw the diagram. The website has a good live drawing board for instant feedback.
    • Future: Local/Global Ports Unique Flow IDs
  • fill*.py files

  • site_functions.py and cloud_functions.py

    • They store the functions that are used more than once in other .py files.
    • site_functions.py only stores the functions that are used in Site Stack and vice versa.
    • This practice makes the project more modular and avoids repeated code.
    • Usage: import site_functions.
  • Cron jobs

    • crontab -e opens the current cron jobs for editing (crontab -l lists them).
    • Site installs 3 new jobs: Node, SNMP, and ARP. They are scheduled as * * * * *, which runs every minute, but a loop inside each job makes it run every 15s.
    • If Node-, SNMP-, and ARP-related cron jobs are already set up, the install file will not install them again. Make sure they are written correctly in crontab -e; they can only be deleted manually.
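
Cron cannot schedule anything more often than once per minute, which is why a loop is needed to reach a 15s cycle. The sketch below shows the idea in Python; the project's cron jobs are shell scripts, so this only illustrates the timing, and run_for_one_minute is a hypothetical name.

```python
import time


def run_for_one_minute(task, interval: int = 15, period: int = 60, sleep=time.sleep) -> int:
    """Call `task` every `interval` seconds within one cron period.

    With the defaults the task runs 4 times (at 0s, 15s, 30s and 45s), then the
    function returns so the next cron invocation can take over.
    """
    runs = period // interval
    for i in range(runs):
        task()
        if i < runs - 1:  # no sleep after the last run
            sleep(interval)
    return runs


# Usage: a job scheduled as * * * * * would call, for example,
# run_for_one_minute(lambda: print("push metrics"))
```

The sleep function is passed in as a parameter so that the timing can be checked without waiting a full minute.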