diff --git a/README.md b/README.md
index a5069ca..2b9ebe7 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,5 @@
 # OONI Devops
 
-This document outlines some of the best practices we follow when developing and
-deploying OONI services.
 
 ## Infrastructure Tiers
diff --git a/docs/DebianPackages.md b/docs/DebianPackages.md
new file mode 100644
index 0000000..53be5af
--- /dev/null
+++ b/docs/DebianPackages.md
@@ -0,0 +1,30 @@
+# Debian packages
+
+**NOTE** The direction we are taking with the new backend is to drop Debian packaging of all backend API components and move to a dockerized deployment approach.
+
+This section lists the Debian packages used to deploy backend
+components. They are built by [GitHub CI workflows](#github-ci-workflows) 💡
+and deployed using [The deployer tool](#the-deployer-tool) 🔧. See
+[Debian package build and publish](#debian-package-build-and-publish) 💡.
+
+
+#### ooni-api package
+Debian package for the [API](#api) ⚙
+
+
+#### fastpath package
+Debian package for the [Fastpath](#fastpath) ⚙
+
+
+#### detector package
+Debian package for the
+[Social media blocking event detector](#social-media-blocking-event-detector) ⚙
+
+
+#### analysis package
+The `analysis` Debian package contains various tools and runs various
+systemd timers, see [Systemd timers](#systemd-timers) 💡.
+
+
+#### Analysis deployment
+See [Backend component deployment](#backend-component-deployment) 📒
diff --git a/docs/DeprecatedDocs.md b/docs/DeprecatedDocs.md
new file mode 100644
index 0000000..113d91d
--- /dev/null
+++ b/docs/DeprecatedDocs.md
@@ -0,0 +1,141 @@
+## Test helper rotation runbook
+This runbook provides hints to troubleshoot the rotation of test
+helpers. In this scenario test helpers are not being rotated as expected
+and their TLS certificates might be at risk of expiring.
+
+Steps:
+
+1. Review [Test helpers](#comp:test_helpers), [Test helper rotation](#comp:test_helper_rotation) and [Test helpers notebook](#test-helpers-notebook) 📔
+
+2. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊.
+   Look at different timespans:
+
+   a. The uptime of the test helpers should be staggered by a week
+      depending on [Test helper rotation](#test-helper-rotation) ⚙.
+
+3. A summary of the live and last rotated test helpers can be obtained
+   with:
+
+```sql
+SELECT rdn, dns_zone, name, region, draining_at FROM test_helper_instances ORDER BY name DESC LIMIT 8
+```
+
+4. The rotation tool can be started manually. It will always pick the
+   oldest host for rotation. ⚠️ Due to the propagation time of changes
+   in the DNS, rotating many test helpers too quickly can impact the
+   probes.
+
+   a. Log on [backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥
+
+   b. Check the last run using
+      `sudo systemctl status ooni-rotation.timer`
+
+   c. Review the logs using `sudo journalctl -u ooni-rotation`
+
+   d. Run `sudo systemctl restart ooni-rotation` and monitor the logs.
+
+5. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊
+   during and after the rotation.
+
+
+### Test helpers failure runbook
+This runbook presents a scenario where a test helper is causing probes
+to fail their tests sporadically. It describes how to identify the
+affected host and mitigate the issue, but it can also be used to investigate
+other issues affecting the test helpers.
+
+It has been chosen because this kind of incident can impact the quality
+of measurements and can be relatively difficult to troubleshoot.
+
+For investigating glitches in the
+[test helper rotation](#test-helper-rotation) ⚙ see the
+[test helper rotation runbook](#test-helper-rotation-runbook) 📒.
+
+In this scenario either an alert has been sent to the
+[#ooni-bots](#topic:oonibots) [Slack](#slack) 🔧 channel by
+the [test helper failure rate notebook](#test-helper-failure-rate-notebook) 📔 or something
+else triggered the investigation.
+See [Alerting](#alerting) 💡 for details.
+
+Steps:
+
+1. Review [Test helpers](#test-helpers) ⚙
+
+2. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊.
+   Look at different timespans:
+
+   a. The uptime of the test helpers should be staggered by a week
+      depending on [Test helper rotation](#test-helper-rotation) ⚙.
+
+   b. The in-flight requests and requests per second should be
+      consistent across hosts, except for `0.th.ooni.org`. See
+      [Test helpers list](#test-helpers-list) 🐝 for details.
+
+   c. Review CPU load, memory usage and run duration percentiles.
+
+3. Review the [Test helper failure rate notebook](#test-helper-failure-rate-notebook) 📔
+
+4. For a more detailed investigation there is also a [test helper notebook](https://jupyter.ooni.org/notebooks/notebooks/2023%20%5Bfederico%5D%20test%20helper%20metadata%20in%20fastpath.ipynb)
+
+5. Log on the hosts using
+   `ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -Snone root@0.th.ooni.org`
+
+6. Run `journalctl --since '1 hour ago'` or review logs using the query
+   below.
+
+7. Run `top`, `strace`, `tcpdump` as needed.
+
+8. The rotation tool can be started at any time to rotate away failing
+   test helpers. The rotation script will always pick the oldest host
+   for rotation. ⚠️ Due to the propagation time of changes in the DNS,
+   rotating many test helpers too quickly can impact the probes.
+
+   a. Log on [backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥
+
+   b. Check the last run using
+      `sudo systemctl status ooni-rotation.timer`
+
+   c. Review the logs using `sudo journalctl -u ooni-rotation`
+
+   d. Run `sudo systemctl restart ooni-rotation` and monitor the logs.
+
+9. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊
+   during and after the rotation.
+
+10. Summarize traffic hitting a test helper using the following commands:
+
+    Top 10 miniooni probe IP addresses (warning: this is sensitive data):
+
+    `tail -n 100000 /var/log/nginx/access.log | grep miniooni | cut -d' ' -f1|sort|uniq -c|sort -nr|head`
+
+    Similarly, with anonymized IP addresses:
+
+    `grep POST /var/log/nginx/access.log | grep miniooni | cut -d'.' -f1-3 | head -n 10000 |sort|uniq -c|sort -nr|head`
+
+    Number of requests from miniooni probes in 10-minute buckets:
+
+    `grep POST /var/log/nginx/access.log | grep miniooni | cut -d' ' -f4 | cut -c1-17 | uniq -c`
+
+    Number of requests from miniooni probes in 1-minute buckets:
+
+    `grep POST /var/log/nginx/access.log | grep miniooni | cut -d' ' -f4 | cut -c1-18 | uniq -c`
+
+    Number of requests grouped by hour, cache HIT/MISS/etc, software name and version:
+
+    `head -n 100000 /var/log/nginx/access.log | awk '{print $4, $6, $13}' | cut -c1-15,22- | sort | uniq -c | sort -n`
+
+To extract data from the centralized log database
+on [monitoring.ooni.org](#monitoring.ooni.org) 🖥 you can use:
+
+``` sql
+SELECT message FROM logs
+WHERE SYSLOG_IDENTIFIER = 'oohelperd'
+ORDER BY __REALTIME_TIMESTAMP DESC
+LIMIT 10
+```
+
+> **note**
+> The table is indexed by `__REALTIME_TIMESTAMP`. Limiting the range by time can significantly improve query performance.
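+
+The same query can also be run non-interactively from a shell on the host. A minimal sketch, assuming `clickhouse-client` is installed locally on monitoring.ooni.org and that the `logs` table lives in the default database:
+
+```bash
+# run on monitoring.ooni.org; prints the 10 most recent oohelperd log lines
+clickhouse-client --query "SELECT message FROM logs WHERE SYSLOG_IDENTIFIER = 'oohelperd' ORDER BY __REALTIME_TIMESTAMP DESC LIMIT 10"
+```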
+
+
+See [Selecting test helper for rotation](#selecting-test-helper-for-rotation) 🐞
diff --git a/docs/IncidentResponse.md b/docs/IncidentResponse.md
new file mode 100644
index 0000000..6e98465
--- /dev/null
+++ b/docs/IncidentResponse.md
@@ -0,0 +1,362 @@
+# Incident response
+
+## On-call preparation
+Review [Alerting](#alerting) 💡 and check the
+[Grafana dashboards](#grafana-dashboards) 💡
+
+On Android devices the following apps can be used:
+
+ * The [Slack](#slack) 🔧 app with audible notifications from the
+   #ooni-bots channel
+
+ * The [Grafana](#grafana) 🔧 viewer
+
+
+## Tiers and severities
+
+**TODO** Consolidate the tiers outlined here with the other tiers listed in the top level readme.
+
+When designing the architecture of backend components or handling incidents it can be useful to have
+defined severities and tiers.
+
+A set of guidelines is described at .
+This section presents a simplified approach to prioritizing incident response.
+
+In this case there is no distinction between severity and priority, and impact and response time are connected.
+
+Incidents and alarms from monitoring can be classified by severity levels based on their impact:
+
+ - 1: Serious security breach or data loss; serious loss of privacy impacting users or team members; legal risks.
+ - 2: Downtime impacting service usability for a significant fraction of users; serious security vulnerability.
+      Example: probes being unable to submit measurements.
+ - 3: Downtime or poor performance impacting secondary services; anything that can cause a level 2 event if not addressed within 24h; outages of monitoring infrastructure.
+ - 4: Every other event that requires attention within 7 days.
+
+Based on the set of severities, components can be classified in tiers as follows:
+
+ - tier 1: Anything that can cause a severity 1 (or less severe) event.
+ - tier 2: Anything that can cause a severity 2 (or less severe) event but not a severity 1.
+ - tier 3: Anything that can cause a severity 3 (or less severe) event but not a severity 1 or 2.
+ - ...and so on
+
+### Relations and dependencies between services
+
+Tiers are useful during design and deployment as a way to minimize the risk of outages and avoid unexpected cascading failures.
+
+Having a low tier value should not be treated as a sign of "importance" for a component, but as a liability.
+
+Pre-production deployment stages (e.g. testbed) have tier level >= 5.
+
+In this context a component can be a service as a whole, a running process (daemon), a host, a hardware device, etc.
+A component can contain other components.
+
+A component "A" is said to "hard depend" on another component "B" if an outage of B triggers an outage of A.
+
+It can also "soft depend" on another component if an outage of the latter triggers only a failure of a subsystem, an ancillary feature, or a reasonably short downtime.
+
+Regardless of tiers, components at a higher stage (e.g. production) cannot depend on and/or receive data from lower stages. The opposite is acceptable.
+
+Components can only hard-depend on other components at the same tier or with lower values.
+E.g. a tier 2 component can depend on a tier 1 component but not the other way around.
+If that happens, the tier 2 component should be immediately re-classified as tier 1 and treated accordingly (see below).
+
+E.g. anything that handles real-time failover for a service should be treated at the same tier (or a lower value) as the service.
+
+Redundant components follow a special rule.
+For example, the "test helper" service provided to the probes, as a whole, should be considered at least tier 2,
+as it can impact all probes, preventing them from running tests successfully.
+Yet, individual test helper processes and VMs can be considered tier 3 or even 4 if they sit behind a load balancer that can move traffic away from a failing host reliably
+and with no significant downtime.
+
+Example: an active/standby database pair provides a tier 2 service. An automatic failover tool is triggered by a simple monitoring script.
+Both have to be labeled tier 2.
+
+
+### Handling incidents
+
+Depending on the severity of an event a different workflow can be followed.
+
+An example of an incident management workflow can be:
+
+| Severity | Response time | Requires conference call | Requires call leader | Requires postmortem | Sterile |
+| -------- | ------- | ------ | -------- | ------- | ------ |
+| 1 | 2h | Yes | Yes | Yes | Yes |
+| 2 | 8h | Yes | No | Yes | Yes |
+| 3 | 24h | No | No | No | Yes |
+| 4 | 7d | No | No | No | No |
+
+"Sterile" means that during the investigation the only priority should be to solve the issue at hand.
+Other investigations, discussions and meetings should be postponed.
+
+When in doubt about the severity of an event, always err on the safe side.
+
+### Regular operations
+
+Based on the tier of a component, development and operations can follow different rules.
+
+An example of a workflow for regular operations can be:
+
+| Tier | Require architecture review | Require code review | Require 3rd party security review | Require Change Management |
+| -------- | ------- | ------ | -------- | ------- |
+| 1 | Yes | Yes | Yes | Yes |
+| 2 | Yes | Yes | No | No |
+| 3 | No | Yes | No | No |
+| 4 | No | No | No | No |
+
+"Change Management" refers to planning operational changes in advance and having team members review the change to be deployed beforehand.
+
+E.g. scheduling a meeting to perform a probe release and having two people review the metrics before and after the change.
+
+
+## Redundant notifications
+If needed, a secondary channel for alert notification can be set up
+using
+
+Ntfy can host a push notification topic for free.
+
+For example is currently being used to
+notify the outcome of CI runs from
+
+
+An Android app is available:
+
+
+[Grafana](#grafana) 🔧 can be configured to send alerts to ntfy.sh
+using a webhook.
+
+### Measurement drop tutorial
+
+This tutorial provides examples of how to investigate a drop in measurements.
+It is based on an incident where a drop in measurements was detected and the cause was not immediately clear.
+
+It is not meant to be a step-by-step runbook but rather to give hints on what data to look for, how to generate charts and how to identify the root cause of an incident.
+
+A dedicated issue can be used to track the incident and the investigation effort and provide visibility:
+https://github.com/ooni/sysadmin/blob/master/.github/ISSUE_TEMPLATE/incident.md
+The issue can be filed during or after the incident depending on urgency.
+
+Some of the examples below come from
+https://jupyter.ooni.org/notebooks/notebooks/android_probe_release_msm_drop_investigation.ipynb
+During an investigation it can be good to create a dedicated Jupyter notebook.
+
+We started by reviewing:
+
+ * 
+   No issues detected, as the charts show a short timespan.
+ * The charts on the [Test helpers dashboard](#test-helpers-dashboard) 📊.
+   No issues detected here.
+ * The [API and fastpath](#api-and-fastpath) 📊 dashboard.
+   No issues detected here.
+ * The [Long term measurements prediction notebook](#long-term-measurements-prediction-notebook) 📔
+   The decrease was clearly visible.
+
+Everything looked OK in terms of backend health. We then generated the following charts.
+
+The chunks of Python code below are meant to be run in a
+[Jupyter Notebook](#jupyter-notebook) 🔧 and are mostly "self-contained".
+To use them you only need to import the
+[Ooniutils microlibrary](#ooniutils-microlibrary) 💡:
+
+``` python
+%run ooniutils.ipynb
+```
+
+The "t" label is commonly used in existing notebooks to refer to hour/day/week time slices.
+
+We want to plot how many measurements we are receiving from Ooniprobe Android in unattended runs, grouped by day and by `software_version`.
+
+The last line generates an area chart using Altair. Notice that the `x`, `y` and `color` parameters match the 3 columns extracted by the `SELECT`.
+
+The `GROUP BY` is performed on 2 of those 3 columns, while `COUNT(*)` counts how many measurements exist in each t/software_version "bucket".
+
+The output of the SQL query is just a dataframe with 3 columns. There is no need to pivot or reindex it, as Altair performs the required data transformation.
+
+> **note**
+> Altair refuses to process dataframes with more than 5000 rows.
+
+``` python
+x = click_query("""
+    SELECT
+      toStartOfDay(toStartOfWeek(measurement_start_time)) AS t,
+      software_version,
+      COUNT(*) AS msm_cnt
+    FROM fastpath
+    WHERE measurement_start_time > today() - interval 3 month
+    AND measurement_start_time < today()
+    AND software_name = 'ooniprobe-android-unattended'
+    GROUP BY t, software_version
+""")
+alt.Chart(x).mark_area().encode(x='t', y='msm_cnt', color='software_version').properties(width=1000, height=200, title="Android unattended msm cnt")
+```
+
+The generated chart was:
+
+![chart](../../../assets/images-backend/msm_drop_investigation_1.png)
+
+From the chart we concluded that the overall number of measurements had been decreasing since the release of a new version.
+We also re-ran the plot filtering on other `software_name` values and saw that no other type of probe was affected.
+
+> **note**
+> Due to a limitation in Altair, when grouping time by week use
+> `toStartOfDay(toStartOfWeek(measurement_start_time)) AS t`
+
+We then wanted to measure how many measurements are collected during each `web_connectivity` test run.
+This is to understand whether probes are collecting fewer measurements in each run.
+
+The following Python snippet uses nested SQL queries. The inner query groups measurements by time, `software_version` and `report_id`,
+and counts how many measurements are related to each `report_id`.
+The outer query "ignores" the `report_id` value and `quantile()` is used to extract the 50th percentile of `msm_cnt`.
+
+> **note**
+> The use of a double `%%` in `LIKE` is required to escape the `%` wildcard. The wildcard matches any number of characters.
+
+``` python
+x = click_query("""
+    SELECT
+        t,
+        quantile(0.5)(msm_cnt) AS msm_cnt_p50,
+        software_version
+    FROM (
+        SELECT
+            toStartOfDay(toStartOfWeek(measurement_start_time)) AS t,
+            software_version,
+            report_id,
+            COUNT(*) AS msm_cnt
+        FROM fastpath
+        WHERE measurement_start_time > today() - interval 3 month
+        AND test_name = 'web_connectivity'
+        AND measurement_start_time < today()
+        AND software_name = 'ooniprobe-android-unattended'
+        AND software_version LIKE '3.8%%'
+        GROUP BY t, software_version, report_id
+    ) GROUP BY t, software_version
+""")
+alt.Chart(x).mark_line().encode(x='t', y='msm_cnt_p50', color='software_version').properties(width=1000, height=200, title="Android unattended msmt count per report")
+```
+
+We also compared different version groups and different `software_name` values.
+The output shows that the number of measurements in each run is indeed significantly lower for the newly released versions.
+
+![chart](../../../assets/images-backend/msm_drop_investigation_4.png)
+
+To update the previous Python snippet to group measurements by a different field, change `software_version` into the new column name.
+For example use `probe_cc` to show a chart with a breakdown by probe country. You should change `software_version` once in each SELECT part,
+then in the last two `GROUP BY` clauses, and finally in the `color` line at the bottom.
+
+We made this change to confirm that all countries were impacted in the same way. (The output is not included here as it was unremarkable.)
+
+Also, `mark_line` on the bottom line is used to create line charts. Switch it to `mark_area` to generate *stacked* area charts.
+See the previous two charts as examples.
+
+We implemented a change to the API to improve logging of the list of tests returned at check-in
+and monitored the logs using `sudo journalctl -f -u ooni-api`.
+
+The output showed that the API is very often returning 100 URLs to probes.
+
+We then ran a similar query to extract the test run duration by calculating
+`MAX(measurement_start_time) - MIN(measurement_start_time) AS delta` for each `report_id` value:
+
+``` python
+x = click_query("""
+    SELECT t, quantile(0.5)(delta) AS deltaq, software_version
+    FROM (
+        SELECT
+            toStartOfDay(toStartOfWeek(measurement_start_time)) AS t,
+            software_version,
+            report_id,
+            MAX(measurement_start_time) - MIN(measurement_start_time) AS delta
+        FROM fastpath
+        WHERE measurement_start_time > today() - interval 3 month
+        AND test_name = 'web_connectivity'
+        AND measurement_start_time < today()
+        AND software_name = 'ooniprobe-android-unattended'
+        AND software_version LIKE '3.8%%'
+        GROUP BY t, software_version, report_id
+    ) GROUP BY t, software_version
+""")
+alt.Chart(x).mark_line().encode(x='t', y='deltaq', color='software_version').properties(width=1000, height=200, title="Android unattended test run time")
+```
+
+![chart](../../../assets/images-backend/msm_drop_investigation_2.png)
+
+The chart showed that the tests are indeed running for a shorter amount of time.
+
+> **note**
+> Percentiles can be more meaningful than averages.
+> To calculate quantiles in ClickHouse use `quantile()()`.
+
+Example:
+
+``` sql
+quantile(0.1)(delta) AS deltaq10
+```
+
+Wondering whether the slowdown was due to slower measurement execution or other issues, we also generated a table as follows.
+
+> **note**
+> Showing color bars allows visually inspecting tables more quickly.
+> Setting the axis value to `0`, `1` or `None` helps readability:
+> `y.style.bar(axis=None)`
+
+Notice the `delta / msmcnt AS seconds_per_msm` calculation:
+
+``` python
+y = click_query("""
+    SELECT
+        quantile(0.1)(delta) AS deltaq10,
+        quantile(0.3)(delta) AS deltaq30,
+        quantile(0.5)(delta) AS deltaq50,
+        quantile(0.7)(delta) AS deltaq70,
+        quantile(0.9)(delta) AS deltaq90,
+
+        quantile(0.5)(seconds_per_msm) AS seconds_per_msm_q50,
+        quantile(0.5)(msmcnt) AS msmcnt_q50,
+
+        software_version, software_name
+    FROM (
+        SELECT
+            software_version, software_name,
+            report_id,
+            MAX(measurement_start_time) - MIN(measurement_start_time) AS delta,
+            count(*) AS msmcnt,
+            delta / msmcnt AS seconds_per_msm
+        FROM fastpath
+        WHERE measurement_start_time > today() - interval 3 month
+        AND test_name = 'web_connectivity'
+        AND measurement_start_time < today()
+        AND software_name IN ['ooniprobe-android-unattended', 'ooniprobe-android']
+        AND software_version LIKE '3.8%%'
+        GROUP BY software_version, report_id, software_name
+    ) GROUP BY software_version, software_name
+    ORDER by software_version, software_name ASC
+""")
+y.style.bar(axis=None)
+```
+
+![chart](../../../assets/images-backend/msm_drop_investigation_3.png)
+
+In the table we looked at the `seconds_per_msm_q50` column: the median time for running each test did not change significantly.
+
+To summarize:
+ * The backend appears to deliver the same number of URLs to the probes as usual.
+ * The time required to run each test is roughly the same.
+ * Both the number of measurements per run and the run time decreased in the new releases.
+
+## Github issues
+
+### Selecting test helper for rotation
+See
+
+
+### Document Tor targets
+See
+
+
+### Disable unnecessary ClickHouse system tables
+See
+
+
+### Feed fastpath from JSONL
+See
+
+
+### Implement Grafana dashboard and alarms backup
+See
diff --git a/docs/Infrastructure.md b/docs/Infrastructure.md
new file mode 100644
index 0000000..8507c78
--- /dev/null
+++ b/docs/Infrastructure.md
@@ -0,0 +1,360 @@
+# Infrastructure
+
+Our infrastructure is primarily spread across the following providers:
+
+* Hetzner, for dedicated hosts
+* DigitalOcean, for VPSs which require IPv6 support
+* AWS, for most cloud-based infrastructure hosting
+
+We manage the deployment and configuration of hosts through a combination of Ansible and Terraform.
+
+### Hosts
+
+This section provides a summary of the backend hosts described in the
+rest of the document.
+
+A full list is available at
+ - also see [Ansible](#ansible) 🔧
+
+#### backend-fsn.ooni.org
+
+Public-facing production backend host, receiving the deployment of the
+packages:
+
+- [ooni-api](legacybackend/operations/#ooni-api-package) 📦
+
+- [fastpath](legacybackend/operations/#fastpath-package) 📦
+
+- [analysis](legacybackend/operations/#analysis-package) 📦
+
+- [detector](legacybackend/operations/#detector-package) 📦
+
+#### backend-hel.ooni.org
+
+Standby / pre-production backend host. Runs the same software stack as
+[backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥, plus the
+[OONI bridges](#ooni-bridges) ⚙
+
+#### ams-pg-test.ooni.org
+
+Testbed backend host. Runs the same software stack as
+[backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥. Database tables are not backed up and
+incoming measurements are not uploaded to S3. All data is considered
+ephemeral.
+
+#### monitoring.ooni.org
+
+Runs the internal monitoring stack, including
+[Jupyter Notebook](#tool:jupyter), [Prometheus](#prometheus) 🔧,
+[Vector](#vector) 🔧 and the
+[ClickHouse instance for logs](#clickhouse-instance-for-logs) ⚙
+
+### The Sysadmin repository
+
+This is a git repository living at
+for internal use. It primarily contains:
+
+- Playbooks for [Ansible](#ansible) 🔧
+
+- The [debops-ci tool](#debops-ci-tool) 🔧
+
+- Scripts and tools, including diagrams for
+  [DNS and Domains](#dns-and-domains) 💡
+
+### Ansible
+
+Ansible is used to configure the OSes on the backend hosts and manage
+the configuration of backend components. The playbooks are kept at
+
+
+This manual supersedes
+
+
+#### Installation and setup
+
+Install Ansible using OS packages or a Python virtualenv. Ensure the
+same major+minor version is used across the team.
+
+Secrets are stored in vaults using the `ansible/vault` script as a
+wrapper for `ansible-vault`. Store encrypted variables with a `vault_`
+prefix to make them searchable with grep,
+and link the location of each variable using the same name without the prefix in the
+corresponding `vars.yml`.
+
+In order to access secrets stored inside of the vault, you will need a
+copy of the vault password encrypted with your PGP key. This file should
+be stored inside of `~/.ssh/ooni-sysadmin.vaultpw.gpg`.
+
+The file should be provided by other teammates and GPG-encrypted for your own GPG key.
+
+#### SSH Configuration
+
+You should configure your `~/.ssh/config` with the following:
+
+```
+    IdentitiesOnly yes
+    ServerAliveInterval 120
+    UserKnownHostsFile ~/.ssh/known_hosts ~/REPLACE_ME/sysadmin/ext/known_hosts
+
+    host *.ooni.io
+      user YOUR_USERNAME
+
+    host *.ooni.nu
+      user YOUR_USERNAME
+
+    host *.ooni.org
+      user YOUR_USERNAME
+```
+
+Replace `~/REPLACE_ME/sysadmin/ext/known_hosts` with the path where you have cloned
+the `ooni/sysadmin` repo. This ensures you use the host key
+fingerprints from this repo instead of just relying on TOFU.
+
+You should replace `YOUR_USERNAME` with your username from `adm_login`.
+
+On MacOS you may want to also add:
+
+    host *
+      UseKeychain yes
+
+to use the Keychain to store passwords.
+
+### Ansible playbooks summary
+
+Usage:
+
+    ./play deploy-<playbook>.yml -l <host> --diff -C
+    ./play deploy-<playbook>.yml -l <host> --diff
+
+> **warning**
+> Any minor error in configuration files or Ansible playbooks can be
+> destructive for the backend infrastructure. Always test-run playbooks
+> with `--diff` and `-C` first and carefully verify the configuration
+> changes. After verification run the playbook without `-C` and verify
+> the applied changes again.
+
+> **note**
+> [Etckeeper](#etckeeper) 🔧 can be useful to verify configuration
+> changes from a different point of view.
+
+Some notable parts of the repository:
+
+A list of the backend hosts lives at
+
+
+The backend deployment playbook lives at
+
+
+Many playbooks depend on roles that configure the OS, named
+`base-<codename>`, for example:
+for Debian Bookworm and
+for Debian Bullseye.
+
+The nftables firewall is configured to read every `.nft` file under
+`/etc/ooni/nftables/`. This allows roles to
+create small files, each opening a single port, keeping the configuration as
+close as possible to the Ansible step that deploys a service. For
+example:
+
+
+> **note**
+> Ansible announces its runs on [ooni-bots](##ooni-bots) 💡 unless running with `-C`.
+
+#### The root account
+
+Runbooks use ssh to log on the hosts using your own account, leveraging `sudo` to act as root.
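+
+A quick way to verify that your account and `sudo` access work on a host is an ad-hoc command. A hypothetical sketch (the inventory path and target host are placeholders):
+
+```bash
+# runs `id` as root on the target host via your own account plus sudo
+ansible backend-hel.ooni.org -i inventory -m command -a id --become
+```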
+
+The only exception is when a new host is being deployed: in that case Ansible will log in as root to create
+individual accounts and lock out the root user.
+
+When running the entire runbook Ansible might try to run it as root.
+This can be avoided by selecting only the required tags using `-t <tag>`.
+
+Ideally the root user should be disabled after successfully creating the user accounts.
+
+#### Roles layout
+
+Ansible playbooks use multiple roles (see this
+[example](https://github.com/ooni/sysadmin/blob/master/ansible/deploy-backend.yml#L46))
+to deploy various components.
+
+A few roles use the `meta/main.yml` file to depend on other roles. See this
+[example](https://github.com/ooni/sysadmin/blob/master/ansible/roles/ooni-backend/meta/main.yml)
+
+> **note**
+> The latter method should be used sparingly, because Ansible does not
+> indicate where each task in a playbook is coming from.
+
+A diagram of the role dependencies for the deploy-backend.yml playbook:
+
+```mermaid
+
+flowchart LR
+    A(deploy-backend.yml) --> B(base-bullseye)
+    B -- meta --> G(adm)
+    A --> F(nftables)
+    A --> C(nginx-buster)
+    A --> D(dehydrated)
+    D -- meta --> C
+    E -- meta --> F
+    A --> E(ooni-backend)
+    style B fill:#eeffee
+    style C fill:#eeffee
+    style D fill:#eeffee
+    style E fill:#eeffee
+    style F fill:#eeffee
+    style G fill:#eeffee
+```
+
+A similar diagram for deploy-monitoring.yml:
+
+```mermaid
+
+flowchart LR
+    B -- meta --> G(adm)
+    M(deploy-monitoring.yml) --> B(base-bookworm)
+    M --> O(ooca-cert)
+    M --> F(nftables)
+    M --> D(dehydrated) -- meta --> N(nginx-buster)
+    M --> P(prometheus)
+    M --> X(blackbox-exporter)
+    M --> T(alertmanager)
+    style B fill:#eeffee
+    style D fill:#eeffee
+    style F fill:#eeffee
+    style G fill:#eeffee
+    style N fill:#eeffee
+    style O fill:#eeffee
+    style P fill:#eeffee
+    style T fill:#eeffee
+    style X fill:#eeffee
+```
+
+> **note**
+> When deploying files or updating files already existing on the hosts it can be useful to add a note, e.g. "Deployed by ansible, see ".
+> This helps track down how files on the host were modified and why.
+
+### Etckeeper
+
+Etckeeper is deployed on the backend
+hosts and keeps the `/etc` directory under git version control. It
+commits automatically on package deployment and on timed runs. It also
+allows doing commits manually.
+
+To check the history of the /etc directory:
+
+```bash
+sudo -i
+cd /etc
+git log --raw
+```
+
+And `git diff` for uncommitted changes.
+
+Use `etckeeper commit <message>` to commit changes.
+
+:::tip
+Etckeeper commits changes automatically when APT is used or on a daily basis, whichever comes first.
+:::
+
+### Team credential repository
+
+A private repository contains team
+credentials, including username/password tuples, GPG keys and more.
+
+> **warning**
+> The credential file is GPG-encrypted as `credentials.json.gpg`. Do not
+> commit the cleartext `credentials.json` file.
+
+> **note**
+> The credentials are stored in a JSON file to allow a flexible,
+> hierarchical layout. This allows storing metadata like descriptions of
+> account usage, dates of account creation, expiry, and credential
+> rotation time.
+
+The tool checks JSON syntax and sorts keys automatically.
+
+#### Listing file contents
+
+    git pull
+    make show
+
+#### Editing contents
+
+    git pull
+    make edit
+    git commit credentials.json.gpg -m "<message>"
+    git push
+
+#### Extracting a credential programmatically
+
+    git pull
+    ./extract 'grafana.username'
+
+> **note**
+> This can be used to automate credential retrieval from other tools, e.g.
+> [Ansible](#ansible) 🔧
+
+#### Updating users allowed to decrypt the credentials file
+
+Edit `makefile` to add or remove recipients (see `--recipient`).
+
+Then run:
+
+    git pull
+    make decrypt encrypt
+    git commit makefile credentials.json.gpg
+    git push
+
+### DNS diagrams
+
+#### A:
+
+See
+
+
+The image is not included here due to space constraints.
+
+#### CNAME:
+
+![CNAME](https://raw.githubusercontent.com/ooni/sysadmin/master/ext/dnsgraph.CNAME.svg)
+
+#### MX:
+
+![MX](https://raw.githubusercontent.com/ooni/sysadmin/master/ext/dnsgraph.MX.svg)
+
+#### NS:
+
+![NS](https://raw.githubusercontent.com/ooni/sysadmin/master/ext/dnsgraph.NS.svg)
+
+#### TXT:
+
+![TXT](https://raw.githubusercontent.com/ooni/sysadmin/master/ext/dnsgraph.TXT.svg)
+
+#### HTTP Moved Permanently (HTTP code 301):
+
+![URL301](https://raw.githubusercontent.com/ooni/sysadmin/master/ext/dnsgraph.URL301.svg)
+
+#### HTTP Redirects:
+
+![URL](https://raw.githubusercontent.com/ooni/sysadmin/master/ext/dnsgraph.URL.svg)
+
+#### Updating DNS diagrams
+
+To update the diagrams use the sysadmin repository.
+
+Update the `./ext/dns.json` file:
+
+    cd ansible
+    ./play ext-inventory.yml -t namecheap
+    cd ..
+
+Then run the following
+to generate the charts:
+
+    ./scripts/dnsgraph
+
+It will generate SVG files under the `./ext/` directory. Finally, commit
+and push the dns.json and SVG files.
diff --git a/docs/LegacyDocs.md b/docs/LegacyDocs.md
new file mode 100644
index 0000000..785ae2f
--- /dev/null
+++ b/docs/LegacyDocs.md
@@ -0,0 +1,182 @@
+# Legacy Docs
+
+**ATTENTION** this documentation covers topics that are still relevant, yet it may not be up to date with the currently defined best practices or infrastructure status.
+
+### Creating new playbooks runbook
+
+**TODO** this needs to be rewritten to conform to the new policies
+
+
+This runbook describes how to add new playbooks or modify existing playbooks to support new hosts.
+
+When adding a new host to an existing group, if no customization is required it is enough to modify `inventory`
+and insert the hostname in the same locations as its peers.
+
+If the host requires a small customization, e.g. a different configuration file for the <>:
+
+1. add the hostname to `inventory` as described above
+2. create "custom" blocks in `tasks/main.yml` to adapt the deployment steps to the new host using the `when:` syntax.
+
+For an example see:
+
+NOTE: Complex `when:` rules can lower the readability of `main.yml`
+
+When adding a new type of backend component that is different from anything already existing, a new dedicated role can be created:
+
+1. add the hostname to `inventory` as described above
+2. create a new playbook e.g. `ansible/deploy-newcomponent.yml`
+3. copy files from an existing role into a new `ansible/roles/newcomponent` directory:
+
+- `ansible/roles/newcomponent/meta/main.yml`
+- `ansible/roles/newcomponent/tasks/main.yml`
+- `ansible/roles/newcomponent/templates/example_config_file`
+
+4. run `./play deploy-newcomponent.yml -l newhost.ooni.org --diff -C` and review the output
+5. run `./play deploy-newcomponent.yml -l newhost.ooni.org --diff` and review the output
+
+Example:
+
+TIP: To ensure playbooks are robust and idempotent it can be beneficial to develop and test tasks incrementally by running the deployment commands often.
+
+
+## Test helper rotation runbook
+This runbook provides hints to troubleshoot the rotation of test
+helpers. In this scenario test helpers are not being rotated as expected
+and their TLS certificates might be at risk of expiring.
+
+Steps:
+
+1. Review [Test helpers](#comp:test_helpers), [Test helper rotation](#comp:test_helper_rotation) and [Test helpers notebook](#test-helpers-notebook) 📔
+
+2. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊.
+   Look at different timespans:
+
+   a. The uptime of the test helpers should be staggered by a week
+      depending on [Test helper rotation](#test-helper-rotation) ⚙.
+
+3. A summary of the live and last rotated test helpers can be obtained
+   with:
+
+```sql
+SELECT rdn, dns_zone, name, region, draining_at FROM test_helper_instances ORDER BY name DESC LIMIT 8
+```
+
+4. The rotation tool can be started manually. It will always pick the
+   oldest host for rotation. ⚠️ Due to the propagation time of changes
+   in the DNS, rotating many test helpers too quickly can impact the
+   probes.
+
+   a. Log on [backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥
+
+   b. Check the last run using
+      `sudo systemctl status ooni-rotation.timer`
+
+   c. Review the logs using `sudo journalctl -u ooni-rotation`
+
+   d. Run `sudo systemctl restart ooni-rotation` and monitor the logs.
+
+5. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊
+   during and after the rotation.
+
+
+### Test helpers failure runbook
+This runbook presents a scenario where a test helper is causing probes
+to fail their tests sporadically. It describes how to identify the
+affected host and mitigate the issue, but it can also be used to investigate
+other issues affecting the test helpers.
+
+It has been chosen because this kind of incident can impact the quality
+of measurements and can be relatively difficult to troubleshoot.
+
+For investigating glitches in the
+[test helper rotation](#test-helper-rotation) ⚙ see the
+[test helper rotation runbook](#test-helper-rotation-runbook) 📒.
+
+In this scenario either an alert has been sent to the
+[#ooni-bots](#topic:oonibots) [Slack](#slack) 🔧 channel by
+the [test helper failure rate notebook](#test-helper-failure-rate-notebook) 📔 or something
+else triggered the investigation.
+See [Alerting](#alerting) 💡 for details.
+
+Steps:
+
+1. Review [Test helpers](#test-helpers) ⚙
+
+2. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊.
+   Look at different timespans:
+
+   a. The uptime of the test helpers should be staggered by a week
+      depending on [Test helper rotation](#test-helper-rotation) ⚙.
+
+   b. The in-flight requests and requests per second should be
+      consistent across hosts, except for `0.th.ooni.org`. See
+      [Test helpers list](#test-helpers-list) 🐝 for details.
+
+   c. Review CPU load, memory usage and run duration percentiles.
+
+3. Review the [Test helper failure rate notebook](#test-helper-failure-rate-notebook) 📔
+
+4. For a more detailed investigation there is also a [test helper notebook](https://jupyter.ooni.org/notebooks/notebooks/2023%20%5Bfederico%5D%20test%20helper%20metadata%20in%20fastpath.ipynb)
+
+5. Log on the hosts using
+   `ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -Snone root@0.th.ooni.org`
+
+6. Run `journalctl --since '1 hour ago'` or review logs using the query
+   below.
+
+7. Run `top`, `strace`, `tcpdump` as needed.
+
+8. The rotation tool can be started at any time to rotate away failing
+   test helpers. The rotation script will always pick the oldest host
+   for rotation. ⚠️ Due to the propagation time of changes in the DNS,
+   rotating many test helpers too quickly can impact the probes.
+
+   a. Log on [backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥
+
+   b. Check the last run using
+      `sudo systemctl status ooni-rotation.timer`
+
+   c. Review the logs using `sudo journalctl -u ooni-rotation`
+
+   d. Run `sudo systemctl restart ooni-rotation` and monitor the logs.
+
+9. Review the charts on [Test helpers dashboard](#test-helpers-dashboard) 📊
+   during and after the rotation.
+
+10. Summarize traffic hitting a test helper using the following commands:
+
+    Top 10 miniooni probe IP addresses (warning: this is sensitive data):
+
+    `tail -n 100000 /var/log/nginx/access.log | grep miniooni | cut -d' ' -f1|sort|uniq -c|sort -nr|head`
+
+    Similarly, with anonymized IP addresses:
+
+    `grep POST /var/log/nginx/access.log | grep miniooni | cut -d'.' -f1-3 | head -n 10000 |sort|uniq -c|sort -nr|head`
+
+    Number of requests from miniooni probes in 10-minute buckets:
+
+    `grep POST /var/log/nginx/access.log | grep miniooni | cut -d' ' -f4 | cut -c1-17 | uniq -c`
+
+    Number of requests from miniooni probes in 1-minute buckets:
+
+    `grep POST /var/log/nginx/access.log | grep miniooni | cut -d' ' -f4 | cut -c1-18 | uniq -c`
+
+    Number of requests grouped by hour, cache HIT/MISS/etc, software name and version:
+
+    `head -n 100000 /var/log/nginx/access.log | awk '{print $4, $6, $13}' | cut -c1-15,22- | sort | uniq -c | sort -n`
+
+To extract data from the centralized log database
+on [monitoring.ooni.org](#monitoring.ooni.org) 🖥 you can use:
+
+``` sql
+SELECT message FROM logs
+WHERE SYSLOG_IDENTIFIER = 'oohelperd'
+ORDER BY __REALTIME_TIMESTAMP DESC
+LIMIT 10
+```
+
+> **note**
+> The table is indexed by `__REALTIME_TIMESTAMP`. Limiting the range by time can significantly improve query performance.
+
+
+See [Selecting test helper for rotation](#selecting-test-helper-for-rotation) 🐞
diff --git a/docs/MonitoringAlerts.md b/docs/MonitoringAlerts.md
new file mode 100644
index 0000000..c4fb3b0
--- /dev/null
+++ b/docs/MonitoringAlerts.md
@@ -0,0 +1,612 @@
+# Monitoring and Alerts
+
+## Application metrics
+All components of the backend are designed to output application
+metrics.
+
+Metrics are prefixed with the name of each application. The metrics are
+used in [Grafana](#grafana) 🔧 for charts, monitoring and alarming.
+
+They use the [StatsD](#statsd) 💡 protocol.
+
+Application metrics data flow:
+
+![Diagram](https://kroki.io/blockdiag/svg/eNq9kc1qAyEUhffzFDLZNnGf0EBX7SoEkl0p4arXUaJe8QcKpe9eZ9Imkz5AXHo-OcdzhCN5VhYG9tUxhRqqK6dsICJ7ZolqUKgEfW469hKjsxKKpcDeJTlKjegXWmM7_UcjdlgUFJiro6Z1_8RMQj3emFJiXnM-2GKqWEnynChYLkCeMailIlk9hjL5cOFIcA82_OmnO33l1SJcTKcA-0Qei8GaH5shXn2nGK8JNIQH9zBcTKcA86mW29suDgS60T23d1ndjda4eX1X9O143B_-t9vg309uuu8fUvvJ0Q==)
+
+Ellipses represent data; rectangles represent processes. Purple
+components belong to the backend. Click on the image and then click on
+each shape to see the related documentation.
+
+[Prometheus](#tool:prometheus) and [Grafana](#grafana) 🔧 provide
+historical charts for more than 90 days and are useful to investigate
+long-term trends.
+
+[Netdata](#netdata) 🔧 provides a web UI with real-time metrics. See
+the dedicated subchapter for details.
+
+
+### StatsD
+All backend components send StatsD metrics over UDP using localhost as the destination.
+
+This guarantees that applications never block on metric generation in
+case the receiver slows down. The StatsD messages are received by
+[Netdata](#netdata) 🔧. It automatically tracks any new metric,
+generates averages and summaries as needed and exposes them to
+[Prometheus](#prometheus) 🔧 for scraping.
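+
+For reference, each StatsD datapoint is just a short plaintext UDP datagram. A minimal sketch of emitting one by hand from a shell, assuming the receiver listens on the default StatsD port 8125 (the metric name is a made-up example):
+
+```bash
+# set a gauge named "ooni_api.example_gauge" to 42; "|g" marks a gauge
+# (/dev/udp is a bash feature, not available in plain sh)
+echo -n "ooni_api.example_gauge:42|g" > /dev/udp/127.0.0.1/8125
+```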
+In the codebase the statsd library is typically used as:
+
+```python
+from .metrics import setup_metrics
+
+setup_metrics(name="<component name>")
+metrics.gauge("<metric name>", <value>)
+```
+
+Because of this, a quick way to identify where metrics are being generated
+in the backend codebase is to search for e.g.:
+
+ * 
+ * 
+
+Where possible, timers have the same name as the function being timed.
+
+See [Conventions](#conventions) 💡 for patterns around component naming.
+
+
+#### Metrics list
+This subsection provides a list of the most important application metrics as they
+are shown in Grafana. The names are autogenerated by Netdata based on the
+metric name used in StatsD.
+
+For example a `@metrics.timer("generate_test_list")` Python decorator is used at: .
+Such a timer will be processed by Netdata and appear in Grafana as:
+```
+netdata_statsd_timer_ooni_api_generate_test_list_milliseconds_average
+```
+
+The metrics always start with `netdata_statsd` and end with:
+
+ * `_milliseconds_average`
+ * `_events_persec_average`
+ * `_value_average`
+
+Also see
+
+TIP: StatsD collectors (like Netdata and others) preprocess datapoints by calculating average/min/max values etc.
+
+Run this to locate where application metrics
+are being generated in the backend codebase:
+
+```bash
+find ~ -name '*.py' -exec grep 'metrics\.' -H "{}" \;
+```
+
+Metrics for the [ASN metadata updater](#asn-metadata-updater) ⚙.
+See the [ASN metadata updater dashboard](#asn-metadata-updater-dashboard) 📊:
+
+```
+netdata_statsd_asnmeta_updater_asnmeta_tmp_len_gauge_value_average
+netdata_statsd_asnmeta_updater_asnmeta_update_progress_gauge_value_average
+netdata_statsd_asnmeta_updater_fetch_data_timer_milliseconds_average
+netdata_statsd_gauge_asnmeta_updater_asnmeta_tmp_len_value_average
+netdata_statsd_gauge_asnmeta_updater_asnmeta_update_progress_value_average
+netdata_statsd_timer_asnmeta_updater_fetch_data_milliseconds_average
+```
+
+
+Metrics for the [CitizenLab test list updater](#citizenlab-test-list-updater) ⚙:
+
+```
+netdata_statsd_citizenlab_test_lists_updater_citizenlab_test_list_len_gauge_value_average
+netdata_statsd_citizenlab_test_lists_updater_fetch_citizen_lab_lists_timer_milliseconds_average
+netdata_statsd_citizenlab_test_lists_updater_update_citizenlab_table_timer_milliseconds_average
+netdata_statsd_gauge_citizenlab_test_lists_updater_citizenlab_test_list_len_value_average
+netdata_statsd_gauge_citizenlab_test_lists_updater_rowcount_value_average
+netdata_statsd_timer_citizenlab_test_lists_updater_fetch_citizen_lab_lists_milliseconds_average
+netdata_statsd_timer_citizenlab_test_lists_updater_rebuild_citizenlab_table_from_citizen_lab_lists_milliseconds_average
+netdata_statsd_timer_citizenlab_test_lists_updater_update_citizenlab_table_milliseconds_average
+```
+
+Metrics for the [Database backup tool](#database-backup-tool) ⚙.
+See the [Database backup dashboard](#database-backup-dashboard) 📊 on Grafana:
+
+```
+netdata_statsd_db_backup_run_export_timer_milliseconds_average
+netdata_statsd_db_backup_status_gauge_value_average
+netdata_statsd_db_backup_table_fastpath_backup_time_ms_gauge_value_average
+netdata_statsd_db_backup_table_jsonl_backup_time_ms_gauge_value_average
+netdata_statsd_db_backup_uploaded_bytes_tot_gauge_value_average
+netdata_statsd_db_backup_upload_to_s3_timer_milliseconds_average
+netdata_statsd_gauge_db_backup_status_value_average
+netdata_statsd_gauge_db_backup_table_citizenlab_byte_count_value_average
+netdata_statsd_gauge_db_backup_table_fastpath_backup_time_ms_value_average
+netdata_statsd_gauge_db_backup_table_fastpath_byte_count_value_average
+netdata_statsd_gauge_db_backup_table_jsonl_backup_time_ms_value_average
+netdata_statsd_gauge_db_backup_table_jsonl_byte_count_value_average
+netdata_statsd_gauge_db_backup_uploaded_bytes_tot_value_average
+netdata_statsd_timer_db_backup_backup_table_citizenlab_milliseconds_average
+netdata_statsd_timer_db_backup_backup_table_fastpath_milliseconds_average
+netdata_statsd_timer_db_backup_backup_table_jsonl_milliseconds_average
+netdata_statsd_timer_db_backup_run_backup_milliseconds_average
+netdata_statsd_timer_db_backup_run_export_milliseconds_average
+netdata_statsd_timer_db_backup_upload_to_s3_milliseconds_average
+```
+
+
+Metrics for the [social media blocking event detector](#social-media-blocking-event-detector) ⚙:
+
+```
+netdata_statsd_gauge_detector_blocking_events_tblsize_value_average
+netdata_statsd_gauge_detector_blocking_status_tblsize_value_average
+netdata_statsd_timer_detector_run_detection_milliseconds_average
+```
+
+
+Metrics for the [Fastpath](#fastpath) ⚙. Used in various dashboards,
+primarily the [API and fastpath](#api-and-fastpath) 📊 dashboard:
+
+```
+netdata_statsd_timer_fastpath_db_clickhouse_upsert_summary_milliseconds_average
+netdata_statsd_timer_fastpath_db_fetch_fingerprints_milliseconds_average
+netdata_statsd_timer_fastpath_full_run_milliseconds_average
+netdata_statsd_gauge_fastpath_recent_measurement_count_value_average
+```
+
+
+Metrics for the [Fingerprint updater](#fingerprint-updater) ⚙.
+See the [Fingerprint updater dashboard](#fingerprint-updater-dashboard) 📊 on Grafana:
+
+```
+netdata_statsd_timer_fingerprints_updater_fetch_csv_milliseconds_average
+netdata_statsd_gauge_fingerprints_updater_fingerprints_dns_tmp_len_value_average
+netdata_statsd_gauge_fingerprints_updater_fingerprints_http_tmp_len_value_average
+netdata_statsd_gauge_fingerprints_updater_fingerprints_update_progress_value_average
+```
+
+Metrics from Nginx caching of the aggregation API.
+See [Aggregation cache monitoring](#aggregation-cache-monitoring) 🐍:
+
+```
+netdata_statsd_gauge_nginx_aggregation_cache_EXPIRED_value_average
+netdata_statsd_gauge_nginx_aggregation_cache_HIT_value_average
+netdata_statsd_gauge_nginx_aggregation_cache_MISS_value_average
+netdata_statsd_gauge_nginx_aggregation_cache_UPDATING_value_average
+```
+
+Metrics for the [API](#api) ⚙:
+
+```
+netdata_statsd_counter_ooni_api_geoip_asn_differs_events_persec_average
+netdata_statsd_counter_ooni_api_geoip_cc_differs_events_persec_average
+netdata_statsd_counter_ooni_api_geoip_ipaddr_found_events_persec_average
+netdata_statsd_counter_ooni_api_geoip_ipaddr_not_found_events_persec_average
+netdata_statsd_counter_ooni_api_gunicorn_request_status_
+netdata_statsd_counter_ooni_api_probe_cc_asn_match_events_persec_average
+netdata_statsd_counter_ooni_api_probe_cc_asn_nomatch_events_persec_average
+netdata_statsd_counter_ooni_api_probe_legacy_login_successful_events_persec_average
+netdata_statsd_counter_ooni_api_probe_login_successful_events_persec_average
+netdata_statsd_counter_ooni_api_receive_measurement_count_events_persec_average
+netdata_statsd_counter_ooni_api_receive_measurement_discard_asn_
+netdata_statsd_counter_ooni_api_receive_measurement_discard_cc_zz_events_persec_average
+netdata_statsd_counter_ooni_api_uploader_msmt_count_events_persec_average
+netdata_statsd_counter_ooni_api_uploader_postcan_count_events_persec_average
+netdata_statsd_gauge_ooni_api_check_in_test_list_count_value_average
+netdata_statsd_gauge_ooni_api_spool_post_count_value_average
+netdata_statsd_gauge_ooni_api_test_list_urls_count_value_average
+netdata_statsd_timer_ooni_api_apicall___api__v
+netdata_statsd_timer_ooni_api_citizenlab_lock_time_milliseconds_average
+netdata_statsd_timer_ooni_api_citizenlab_repo_init_milliseconds_average
+netdata_statsd_timer_ooni_api_citizenlab_repo_pull_milliseconds_average
+netdata_statsd_timer_ooni_api_fetch_citizenlab_data_milliseconds_average
+netdata_statsd_timer_ooni_api_fetch_reactive_url_list_milliseconds_average
+netdata_statsd_timer_ooni_api_generate_test_list_milliseconds_average
+netdata_statsd_timer_ooni_api_get_aggregated_milliseconds_average
+netdata_statsd_timer_ooni_api_get_measurement_meta_clickhouse_milliseconds_average
+netdata_statsd_timer_ooni_api_get_measurement_meta_milliseconds_average
+netdata_statsd_timer_ooni_api_get_raw_measurement_milliseconds_average
+netdata_statsd_timer_ooni_api_get_torsf_stats_milliseconds_average
+netdata_statsd_timer_ooni_api_gunicorn_request_duration_milliseconds_average
+netdata_statsd_timer_ooni_api_open_report_milliseconds_average
+netdata_statsd_timer_ooni_api_receive_measurement_milliseconds_average
+netdata_statsd_timer_ooni_api_uploader_fill_jsonl_milliseconds_average
+netdata_statsd_timer_ooni_api_uploader_fill_postcan_milliseconds_average
+netdata_statsd_timer_ooni_api_uploader_total_run_time_milliseconds_average
+netdata_statsd_timer_ooni_api_uploader_update_db_table_milliseconds_average
+netdata_statsd_timer_ooni_api_uploader_upload_measurement_milliseconds_average
+```
+
+Metrics for the [GeoIP downloader](#geoip-downloader) ⚙:
+
+```
+netdata_statsd_gauge_ooni_download_geoip_geoip_asn_epoch_value_average
+netdata_statsd_gauge_ooni_download_geoip_geoip_asn_node_cnt_value_average
+netdata_statsd_gauge_ooni_download_geoip_geoip_cc_epoch_value_average
+netdata_statsd_gauge_ooni_download_geoip_geoip_cc_node_cnt_value_average
+netdata_statsd_timer_ooni_download_geoip_download_geoip_milliseconds_average
+```
+
+Metrics for the [test helper rotation](#test-helper-rotation) ⚙:
+
+```
+netdata_statsd_timer_rotation_create_le_do_ssl_cert_milliseconds_average
+netdata_statsd_timer_rotation_deploy_ssl_cert_milliseconds_average
+netdata_statsd_timer_rotation_destroy_drained_droplets_milliseconds_average
+netdata_statsd_timer_rotation_end_to_end_test_milliseconds_average
+netdata_statsd_timer_rotation_run_time_milliseconds_average
+netdata_statsd_timer_rotation_scp_file_milliseconds_average
+netdata_statsd_timer_rotation_setup_nginx_milliseconds_average
+netdata_statsd_timer_rotation_setup_vector_milliseconds_average
+netdata_statsd_timer_rotation_spawn_new_droplet_milliseconds_average
+netdata_statsd_timer_rotation_ssh_reload_nginx_milliseconds_average
+netdata_statsd_timer_rotation_ssh_restart_netdata_milliseconds_average
+netdata_statsd_timer_rotation_ssh_restart_nginx_milliseconds_average
+netdata_statsd_timer_rotation_ssh_restart_vector_milliseconds_average
+netdata_statsd_timer_rotation_ssh_wait_droplet_warmup_milliseconds_average
+netdata_statsd_timer_rotation_update_dns_records_milliseconds_average
+```
+
+
+### Prometheus
+Prometheus is a popular monitoring system and
+runs on [monitoring.ooni.org](#monitoring.ooni.org) 🖥
+
+It is deployed and configured by [Ansible](#ansible) 🔧 using the
+following playbook:
+
+
+Most of the metrics are collected by scraping Prometheus endpoints,
+Netdata, and the node exporter. The web UI is accessible at
+
+
+#### Blackbox exporter
+Blackbox exporter is part of the Prometheus ecosystem. It is a daemon that performs HTTP
+probing against other hosts without relying on local agents (hence the name Blackbox)
+and feeds the generated datapoints into Prometheus.
+
+See
+
+It is deployed by
+[Ansible](#tool:ansible) on the [monitoring.ooni.org](#monitoring.ooni.org) 🖥 host.
+
+See the
+[Updating Blackbox Exporter runbook](#updating-blackbox-exporter-runbook) 📒
+
+
+### Grafana dashboards
+There are a number of dashboards on [Grafana](#grafana) 🔧 at
+
+
+[Grafana](#grafana) 🔧 is deployed on the
+[monitoring.ooni.org](#monitoring.ooni.org) 🖥 host. See the
+[Monitoring deployment runbook](#monitoring-deployment-runbook) 📒 for deployment.
+
+The dashboards are used for:
+
+ * Routinely reviewing the general health of the backend infrastructure
+
+ * Predicting long-term scaling requirements, i.e.
+
+   * increasing disk space for the database
+
+   * increasing CPU and memory requirements
+
+ * Investigating alerts and troubleshooting incidents
+
+
+#### Alerting
+Alerts from [Grafana](#tool:grafana) and [Prometheus](#prometheus) 🔧
+are sent to the [#ooni-bots](#topic:oonibots) [Slack](#slack) 🔧
+channel by a bot.
+
+[Slack](#slack) 🔧 can be configured to provide desktop notifications
+from browsers and audible notifications on smartphones.
+
+Alert flow:
+
+![Diagram](https://kroki.io/blockdiag/svg/eNp1jUEKwjAQRfc9xTBd9wSioBtxV3ApIpNmYktjJiQpCuLdTbvQIDirP7zH_8pKN-qBrvCsQLOhyaZL7MkzrCHI5DRrJY9VBW2QG6eepwinTqyELGDN-YzBcxb2gQw5-kOxFnFDoyRFLBVjZmlRioVm86nLEY-WuhG27QGXt6z6YvIef4dmugtyjxwye70BaPFK1w==)
+
+The diagram does not explicitly include Alertmanager. It is part of Prometheus, and it receives alerts and routes them to Slack.
+
+A more detailed diagram:
+
+```mermaid
+flowchart LR
+    P(Prometheus) -- datapoints --> G(Grafana)
+    G --> A(Alertmanager)
+    A --> S(Slack API) --> O(#ooni-bots)
+    P --> A
+    O --> R(Browser / apps)
+    J(Jupyter notebook) --> A
+    classDef box fill:#eeffee,stroke:#777,stroke-width:2px;
+    class P,G,A,S,O,R,J box;
+```
+
+In the diagram Prometheus receives, stores and serves datapoints and has some alert rules to trigger alerts.
+Grafana acts as a UI for Prometheus and also triggers alerts based on alert rules configured in Grafana itself.
+
+Alertmanager is pretty simple: it receives alerts and sends notifications to Slack.
+
+The alert rules are listed at
+The list also shows which alerts are firing at the moment, if any. There
+is also a handful of alerts configured in [Prometheus](#prometheus) 🔧
+using [Ansible](#ansible) 🔧.
+
+The silences list shows whether any alert has been temporarily silenced:
+
+
+See [Grafana editing](#grafana-editing) 📒 and
+[Managing Grafana alert rules](#managing-grafana-alert-rules) 📒 for details.
+
+There are also many dashboards and alerts configured in
+[Jupyter Notebook](#jupyter-notebook) 🔧. These are meant for metrics that require more
+complex algorithms, predictions and SQL queries that cannot be
+implemented using [Grafana](#grafana) 🔧, e.g. when using machine learning or Pandas.
+See [Ooniutils microlibrary](#ooniutils-microlibrary) 💡 for details.
+
+On many dashboards you can set the averaging timespan and the target
+hostname using the fields on the top left.
+
+Here is an overview of the most useful dashboards:
+
+
+#### API and fastpath
+
+
+This is the most important dashboard, showing metrics of the
+[API](#comp:api) and the [Fastpath](#fastpath) ⚙.
+
+
+#### Test-list repository in the API
+
+
+This dashboard shows timings around the git repository checked out by the
+[API](#api) ⚙ that contains the test lists.
+
+
+#### Measurement uploader dashboard
+
+
+This dashboard shows metrics, timing and amounts of data transferred by the
+[Measurement uploader](#measurement-uploader) ⚙
+
+
+#### Fingerprint updater dashboard
+
+
+This dashboard shows metrics and timing from the
+[Fingerprint updater](#fingerprint-updater) ⚙
+
+
+#### ClickHouse dashboard
+
+
+This dashboard shows ClickHouse-specific performance metrics.
+It can be used for optimizations.
+
+For investigating slow queries also see the [ClickHouse queries notebook](#clickhouse-queries-notebook) 📔.
+
+
+#### HaProxy dashboard
+
+
+Basic metrics from the [HaProxy](#haproxy) ⚙ load balancers. Used for the
+[OONI bridges](#ooni-bridges) ⚙.
+
+
+#### TLS certificate dashboard
+
+
+Certificate expiration times. There are alerts configured in
+[Grafana](#grafana) 🔧 to alert on expiring certificates.
+
+
+#### Test helpers dashboard
+
+
+Status, uptime and load metrics from the
+[Test helpers](#test-helpers) ⚙.
+
+
+#### Database backup dashboard
+
+
+Metrics, timing and data transferred by the
+[Database backup tool](#database-backup-tool) ⚙
+
+By looking at the last 24 hours of runs you should be able to see the backup
+being run.
+
+The "Status" chart shows the running status.
+"Uploaded bytes in total" and "Backup time" should be self-explanatory.
+
+TIP: If the backup time or size grows too much it could be worth alerting on it and considering implementing incremental backups.
+
+
+#### Event detector dashboard
+
+
+Basic metrics from the
+[social media blocking event detector](#social-media-blocking-event-detector) ⚙
+
+
+#### GeoIP MMDB database dashboard
+
+
+Age and size of the GeoIP MMDB database. Also, a chart showing
+discrepancies between the lookup performed by the probes VS the one in
+the API, used to gauge the benefits of using a centralized solution.

Also see [Geolocation script](#geolocation-script) 🐍

See [GeoIP downloader](#geoip-downloader) ⚙


#### Host clock offset dashboard


Measures NTP clock sync and alarms on big offsets.


#### Netdata-specific dashboard


Shows all the metrics captured by [Netdata](#netdata) 🔧 - useful for
in-depth performance investigation.


#### ASN metadata updater dashboard


Progress, runtime and table size of the [ASN metadata updater](#asn-metadata-updater) ⚙

See [Metrics list](#metrics-list) 💡


### Netdata
Netdata is a monitoring agent that runs
locally on the backend servers. It exports host and
[Application metrics](#topic:appmetrics) to [Prometheus](#prometheus) 🔧.

It also provides a web UI that can be accessed on port 19999. It can be
useful during development, performance optimization and debugging as it
provides metrics with higher time granularity (1 second) and almost no
delay.

Netdata is not exposed on the Internet for security reasons and can be
accessed only when needed by setting up port forwarding using SSH. For
example:

```bash
ssh ams-pg-test.ooni.org -L 19998:127.0.0.1:19999
```

Netdata can also be run on a development desktop and be accessed locally
in order to explore application metrics without having to deploy
[Prometheus](#tool:prometheus) and [Grafana](#grafana) 🔧.

See [Netdata-specific dashboard](#netdata-specific-dashboard) 📊 for an example of native
Netdata metrics.


## Log management
All components of the backend are designed to output logs to Systemd's
journald. They usually log using the component name as Systemd unit
name.

Sometimes you might have to use `--identifier ` instead for
scripts that are not run as Systemd units.

Journald automatically indexes logs by time, unit name and other items.
This makes it possible to quickly filter logs during troubleshooting, for example:

```bash
sudo journalctl -u ooni-api --since '10 m ago'
```

Or follow live logs using e.g.:

```bash
sudo journalctl -u nginx -f
```

Sometimes it is useful to show milliseconds in the timestamps:

```bash
sudo journalctl -f -u ooni-api -o short-precise
```

The logger used in Python components also sets additional fields,
notably CODE_FUNC and CODE_LINE.

Available fields can be listed using:

```bash
sudo journalctl -f -u ooni-api -N | sort
```

It is possible to filter by those fields. This comes in very handy for
debugging, e.g.:

```bash
sudo journalctl -f -u ooni-api CODE_FUNC=open_report
```

Every host running backend services also sends its logs to
monitoring.ooni.org using [Vector](#vector) 🔧.

![Diagram](https://kroki.io/blockdiag/svg/eNrFks9qwzAMxu95CpNel_gYWOlgDEqfYJdRiv_IiYltBccphdF3n5yyNellt01H6ZO_nyxJh6rXVrTss2AajJhcOo2dGIDtWMQpaNASL9uCvQ6Ds0oki4FVL-wdVMJYjUCKuEhEUGDPt9QbNfQHnEZ46P9Q6DCSQ7kxBijKIynWTy40WWFM-cS6CGaXU11Kw_jMeWtTN8laoeeIwXIpVE_tlUY1eQhptuPSoeRe2PBdP63qtdeb8-y9xPgZ5N9A7t_3xwwqG3fZOHMUrKVDGPKBUCzWuF1vjIivD-LfboLCCQkuT-EJmcQ2tHWmrzG25U1yn71p9vumKWen6xdypu8x)

There is a dedicated ClickHouse instance on monitoring.ooni.org used to
collect logs. See the [ClickHouse instance for logs](#clickhouse-instance-for-logs) ⚙.
This is done to avoid adding unnecessary load to the production database
on FSN that contains measurements and also to keep a copy of FSN's logs on
a different host.
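To verify that a given host is actually shipping its journal, a quick check
on the sending side; this is a sketch and assumes the Vector service unit
is simply named `vector`:

```bash
# confirm the local Vector daemon is running and look for delivery errors
sudo systemctl status vector
sudo journalctl -u vector --since '10 min ago'
```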

The receiving [Vector](#vector) 🔧 instance and ClickHouse are
deployed and configured by [Ansible](#ansible) 🔧 using the following
playbook:


See [Logs from FSN notebook](#logs-from-fsn-notebook) 📔 and
[Logs investigation notebook](#logs-investigation-notebook) 📔


### Slack
[Slack](https://slack.com/) is used for team messaging and automated
alerts at the following instance:


#### #ooni-bots
`#ooni-bots` is a [Slack](#slack) 🔧 channel used for automated
alerts:
diff --git a/docs/Runbooks.md b/docs/Runbooks.md
new file mode 100644
index 0000000..550d697
--- /dev/null
+++ b/docs/Runbooks.md
@@ -0,0 +1,1155 @@
# Runbooks

Below you will find runbooks for common tasks and operations to manage our infra.

## Monitoring deployment runbook

The monitoring stack is deployed and configured by
[Ansible](#tool:ansible) on the [monitoring.ooni.org](#monitoring.ooni.org) 🖥
host using the following playbook:


It includes:

- [Grafana](#grafana) 🔧 at

- [Jupyter Notebook](#jupyter-notebook) 🔧 at

- [Vector](#tool:vector) (see [Log management](#log-management) 💡)

- local [Netdata](#tool:netdata), [Blackbox exporter](#blackbox-exporter) 🔧, etc

- [Prometheus](#prometheus) 🔧 at

It also configures the FQDNs:

- loghost.ooni.org

- monitoring.ooni.org

- netdata.ooni.org

This also includes the credentials to access the Web UIs. They are
deployed as `/etc/nginx/monitoring.htpasswd` from
`ansible/roles/monitoring/files/htpasswd`

**Warning**: the following steps are dangerously broken. Applying the changes
will either not work or, worse, break production.

If you must do something of this sort, you will unfortunately have to resort to
specifying the particular substeps you want to run using the `-t` tag filter
(e.g. `-t prometheus-conf` to update the Prometheus configuration).

Steps:

1. Review [Ansible playbooks summary](#ansible-playbooks-summary) 📒,
   [Deploying a new host](#run:newhost) and [Grafana dashboards](#grafana-dashboards) 💡.

2. Run `./play deploy-monitoring.yml -l monitoring.ooni.org --diff -C`
   and review the output

3. Run `./play deploy-monitoring.yml -l monitoring.ooni.org --diff` and
   review the output

## Updating Blackbox Exporter runbook

This runbook describes updating [Blackbox exporter](#blackbox-exporter) 🔧.

The `blackbox_exporter` role in Ansible is pulled in by the `deploy-monitoring.yml`
runbook.

The configuration file is at `roles/blackbox_exporter/templates/blackbox.yml.j2`
together with `host_vars/monitoring.ooni.org/vars.yml`.

To add a simple HTTP[S] check, for example, you can copy the "ooni website" block.

Edit it and run the deployment of the monitoring stack as described in the previous subchapter.

## Deploying a new host

To deploy a new host:

1. Choose a FQDN like $name.ooni.org based on the
   [DNS naming policy](#dns-naming-policy) 💡

2. Deploy the physical host or VM using Debian Stable

3. Create `A` and `AAAA` records for the FQDN in the Namecheap web UI

4. Follow [Updating DNS diagrams](#updating-dns-diagrams) 📒

5. Review the `inventory` file and git-commit it

6. Deploy the required stack. Run Ansible in test mode first. For
   example this would deploy a backend host:

       ./play deploy-backend.yml --diff -l .ooni.org -C
       ./play deploy-backend.yml --diff -l .ooni.org

7. Update [Prometheus](#prometheus) 🔧 by following
   [Monitoring deployment runbook](#monitoring-deployment-runbook) 📒

8. git-push the commits

Also see [Monitoring deployment runbook](#monitoring-deployment-runbook) 📒 for an
example of deployment.

## Deleting a host

1. Remove it from `inventory`

2. Update the monitoring deployment using:

```
./play deploy-monitoring.yml -t prometheus-conf -l monitoring.ooni.org --diff
```

## Weekly measurements review runbook

On a daily or weekly basis the following dashboards and Jupyter notebooks can be reviewed to detect unexpected patterns in measurements, focusing on measurement drops, slowdowns or any potential issue affecting the backend infrastructure.

When browsing the dashboards expand the time range to one year in order to spot long-term trends.
Also zoom in to the last month to spot small glitches that could otherwise go unnoticed.

Review the [API and fastpath](#api-and-fastpath) 📊 dashboard for the production backend host[s] for measurement flow, CPU and memory load,
timings of various API calls, disk usage.

Review the [Incoming measurements notebook](#incoming-measurements-notebook) 📔 for unexpected trends.

Quickly review the following dashboards for unexpected changes:

 * [Long term measurements prediction notebook](#long-term-measurements-prediction-notebook) 📔
 * [Test helpers dashboard](#test-helpers-dashboard) 📊
 * [Test helper failure rate notebook](#test-helper-failure-rate-notebook) 📔
 * [Database backup dashboard](#database-backup-dashboard) 📊
 * [GeoIP MMDB database dashboard](#geoip-mmdb-database-dashboard) 📊
 * [Fingerprint updater dashboard](#fingerprint-updater-dashboard) 📊
 * [ASN metadata updater dashboard](#asn-metadata-updater-dashboard) 📊

Also check for glitches like notebooks not being run etc.


## Grafana backup runbook
This runbook describes how to back up dashboards and alarms in Grafana.
It does not include backing up datapoints stored in
[Prometheus](#prometheus) 🔧.

The Grafana SQLite database can be dumped by running:

```bash
sqlite3 -line /var/lib/grafana/grafana.db '.dump' > grafana_dump.sql
```

Future implementation is tracked in:
[Implement Grafana dashboard and alarms backup](#implement-grafana-dashboard-and-alarms-backup) 🐞


## Grafana editing
This runbook describes adding new dashboards, panels and alerts in
[Grafana](#grafana) 🔧.

To add a new dashboard use this


To add a new panel to an existing dashboard load the dashboard and then
click the "Add" button on the top.

Many dashboards use variables. For example, on

the variables `$host` and `$avgspan` are set on the top left and used in
metrics like:

    avg_over_time(netdata_disk_backlog_milliseconds_average{instance="$host:19999"}[$avgspan])


### Managing Grafana alert rules
Alert rules can be listed at

> **note**
> The list also shows which alerts are currently alarming, if any.

Click the arrow on the left to expand each alerting rule.

The list shows:

![editing_alerts](../../../assets/images-backend/grafana_alerts_editing.png)

> **note**
> When creating alerts it can be useful to add full URLs linking to
> dashboards, runbooks etc.

To stop notifications create a "silence" either:

1. by further expanding an alert rule (see below) and clicking the
   "Silence" button

2. by inputting it in

Screenshot:

![adding_silence](../../../assets/images-backend/grafana_alerts_silence.png)

Additionally, the "Show state history" button is useful, especially
with flapping alerts.
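Silences can also be created from the CLI for alerts that go through
Alertmanager (see [Alerting](#alerting) 💡). A sketch using `amtool`,
assuming it is installed on the monitoring host, that Alertmanager listens
on its default port, and with a purely hypothetical alert name:

```bash
# silence a hypothetical alert for 2 hours while working on a host
amtool silence add alertname="TestHelperDown" \
  --duration=2h --author=ops --comment="test helper rotation in progress" \
  --alertmanager.url=http://127.0.0.1:9093
```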


### Adding new fingerprints
This is performed on

Updates are fetched automatically by
[Fingerprint updater](#fingerprint-updater) ⚙

Also see [Fingerprint updater dashboard](#fingerprint-updater-dashboard) 📊.


### Backend code changes
This runbook describes making changes to backend components and
deploying them.

Summary of the steps:

1. Check out the backend repository.

2. Create a dedicated branch.

3. Update `debian/changelog` in the component you want to modify. See
   [Package versioning](#package-versioning) 💡 for details.

4. Run unit/functional/integ tests as needed.

5. Create a pull request.

6. Ensure the CI workflows are successful.

7. Deploy the package on the testbed [ams-pg-test.ooni.org](#ams-pg-test.ooni.org) 🖥
   and verify the change works as intended.

8. Add a comment to the PR with the deployed version and stage.

9. Wait for the PR to be approved.

10. Deploy the package to production on
    [backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥. Ensure it is the same version
    that has been used on the testbed. See [API runbook](#api-runbook) 📒 for
    deployment steps.

11. Add a comment to the PR with the deployed version and stage, then merge
    the PR.

When introducing new metrics:

1. Create [Grafana](#grafana) 🔧 dashboards, alerts and
   [Jupyter Notebook](#jupyter-notebook) 🔧 and link them in the PR.

2. Collect and analyze metrics and logs from the testbed stages before
   deploying to production.

3. Test alarming by simulating incidents.


### Backend component deployment
This runbook provides general steps to deploy backend components on
production hosts.

Review the package changelog and the related pull request.

The amount of testing and monitoring required depends on:

1. the impact of possible bugs in terms of number of users affected and
   consequences

2. the level of risk involved in rolling back the change, if needed

3. the complexity of the change and the risk of unforeseen impact

Monitor the [API and fastpath](#api-and-fastpath) 📊 and any dedicated dashboards. Review past
weeks for any anomaly before starting a deployment.

Ensure that either the database schema is consistent with the new
deployment by creating tables and columns manually, or that the new
codebase is automatically updating the database.

Quickly check past logs.

Follow logs with:

``` bash
sudo journalctl -f --no-hostname
```

While monitoring the logs, deploy the package using
[The deployer tool](#the-deployer-tool) 🔧 (details in the subchapter on the tool).


### API runbook
This runbook describes making changes to the [API](#api) ⚙ and
deploying it.

Follow [Backend code changes](#backend-code-changes) 📒 and
[Backend component deployment](#backend-component-deployment) 📒.

In addition, monitor logs from Nginx and API focusing on HTTP errors and
failing SQL queries.

Manually check [Explorer](#explorer) 🖱 and other
[Public and private web UIs](#public-and-private-web-uis) 💡 as needed.


#### Managing feature flags
To change feature flags in the API a simple pull request like
 is enough.

Follow [Backend code changes](#backend-code-changes) 📒 and deploy it after
basic testing on [ams-pg-test.ooni.org](#ams-pg-test.ooni.org) 🖥.


### Running database queries
This subsection describes how to run queries against
[ClickHouse](#clickhouse) ⚙.
You can run queries from
[Jupyter Notebook](#jupyter-notebook) 🔧 or from the CLI:

```bash
  ssh 
  $ clickhouse-client
```

Prefer using the default user when possible. To log in as admin:

```bash
  $ clickhouse-client -u admin --password
```

> **note**
> Heavy queries can impact the production database. When in doubt run them
> from the CLI so that they can be terminated with CTRL-C if needed.

> **warning**
> ClickHouse is not transactional! Always test queries that mutate schemas
> or data on testbeds like [ams-pg-test.ooni.org](#ams-pg-test.ooni.org) 🖥

For long running queries see the use of timeouts in
[Fastpath deduplication](#fastpath-deduplication) 📒

Also see [Dropping tables](#dropping-tables) 📒,
[Investigating table sizes](#investigating-table-sizes) 📒


#### Modifying the fastpath table
This runbook shows an example of changing the contents of the
[fastpath table](#fastpath-table) ⛁ by running a "mutation" query.

> **warning**
> This method creates changes that cannot be reproduced by external
> researchers by [Reprocessing measurements](#reprocessing-measurements) 📒. See
> [Reproducibility](#reproducibility) 💡

In this example [Signal test](#signal-test) Ⓣ measurements are being
flagged as failed due to

Summarize affected measurements with:

``` sql
SELECT test_version, msm_failure, count()
FROM fastpath
WHERE test_name = 'signal' AND measurement_start_time > '2023-11-06T16:00:00'
GROUP BY msm_failure, test_version
ORDER BY test_version ASC
```

> **important**
> `ALTER TABLE … UPDATE` starts a
> [mutation](https://clickhouse.com/docs/en/sql-reference/statements/alter#mutations)
> that runs in the background.

Check for any running or stuck mutation:

``` sql
SELECT * FROM system.mutations WHERE is_done != 1
```

Start the mutation:

``` sql
ALTER TABLE fastpath
UPDATE
  msm_failure = 't',
  anomaly = 'f',
  scores = '{"blocking_general":0.0,"blocking_global":0.0,"blocking_country":0.0,"blocking_isp":0.0,"blocking_local":0.0,"accuracy":0.0,"msg":"bad test_version"}'
WHERE test_name = 'signal'
AND measurement_start_time > '2023-11-06T16:00:00'
AND msm_failure = 'f'
```

Run the previous `SELECT` queries to monitor the mutation and its
outcome.


### Updating tor targets
See [Tor targets](#tor-targets) 🐝 for a general description.

Review the [Ansible](#ansible) 🔧 chapter. Check out the repository and
update the file `ansible/roles/ooni-backend/templates/tor_targets.json`

Commit the changes and deploy as usual:

    ./play deploy-backend.yml --diff -l ams-pg-test.ooni.org -t api -C
    ./play deploy-backend.yml --diff -l ams-pg-test.ooni.org -t api

Test the updated configuration, then:

    ./play deploy-backend.yml --diff -l backend-fsn.ooni.org -t api -C
    ./play deploy-backend.yml --diff -l backend-fsn.ooni.org -t api

git-push the changes.

Implements [Document Tor targets](#document-tor-targets) 🐞


### Creating admin API accounts
See [Auth](#auth) 🐝 for a description of the API entry points related
to account management.

The API provides entry points to:

 * [get role](https://api.ooni.io/apidocs/#/default/get_api_v1_get_account_role__email_address_)

 * [set role](https://api.ooni.io/apidocs/#/default/post_api_v1_set_account_role).

The latter is implemented
[here](https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/api/ooniapi/auth.py#L437).

> **important**
> The default value for API accounts is `user`. For such accounts there is
> no need for a record in the `accounts` table.

To change roles it is required to be authenticated and have the
`admin` role.

It is also possible to create or update roles by running SQL queries
directly on [ClickHouse](#clickhouse) ⚙. This can be necessary to
create the initial `admin` account on a new deployment stage.

A quick way to identify the account ID of a user is to extract logs from
the [API](#api) ⚙ either from the backend host or using
[Logs from FSN notebook](#logs-from-fsn-notebook) 📔

```bash
sudo journalctl --since '5 min ago' -u ooni-api | grep 'SELECT role FROM accounts WHERE account_id' -C5
```

Example output:

    Nov 09 16:03:00 backend-fsn ooni-api[1763457]: DEBUG Query: SELECT role FROM accounts WHERE account_id = ''

Then on the database test host:

```bash
clickhouse-client
```

Then in the ClickHouse shell insert a record to give the `admin` role to
the user. See [Running database queries](#running-database-queries) 📒:

```sql
INSERT INTO accounts (account_id, role) VALUES ('', 'admin')
```

`accounts` is an EmbeddedRocksDB table with `account_id` as primary key.
No record deduplication is necessary.

To access the new role the user has to log out from web UIs and log in
again.

> **important**
> Account IDs are not the same across test and production instances.

This is due to the use of a configuration variable
`ACCOUNT_ID_HASHING_KEY` in the hashing of the email address. The
parameter is read from the API configuration file. The values are
different across deployment stages as a security feature.


### Fastpath runbook

#### Fastpath code changes and deployment
Review [Backend code changes](#backend-code-changes) 📒 and
[Backend component deployment](#backend-component-deployment) 📒 for changes and deployment of the
backend stack in general.

Also see [Modifying the fastpath table](#modifying-the-fastpath-table) 📒

In addition, monitor logs and [Grafana dashboards](#grafana-dashboards) 💡
focusing on changes in incoming measurements.

You can use [The deployer tool](#the-deployer-tool) 🔧 to perform
deployment and rollbacks of the [Fastpath](#fastpath) ⚙.

> **important**
> The fastpath is configured **not** to restart automatically during
> deployment.

Always monitor logs and restart it as needed:

```bash
sudo systemctl restart fastpath
```


#### Fastpath manual deployment
Sometimes it can be useful to run APT directly:

```bash
ssh 
sudo apt-get update
apt-cache show fastpath | grep Ver | head -n5
sudo apt-get install fastpath=
```


#### Reprocessing measurements
Reprocess old measurements by running the fastpath manually. This can be
done without shutting down the fastpath instance running on live
measurements.

You can run the fastpath as root or using the fastpath user. Both users
are able to read the configuration file under `/etc/ooni`. The fastpath
will download [Postcans](#postcans) 💡 into the local directory.

`fastpath -h` generates:

    usage:
    OONI Fastpath

    See README.adoc

    [-h] [--start-day START_DAY] [--end-day END_DAY]
    [--devel] [--noapi] [--stdout] [--debug]
    [--db-uri DB_URI]
    [--clickhouse-url CLICKHOUSE_URL] [--update]
    [--stop-after STOP_AFTER] [--no-write-to-db]
    [--keep-s3-cache] [--ccs CCS]
    [--testnames TESTNAMES]

    options:
    -h, --help show this help message and exit
    --start-day START_DAY
    --end-day END_DAY
    --devel Devel mode
    --noapi Process measurements from S3 and do not start API feeder
    --stdout Log to stdout
    --debug Log at debug level
    --clickhouse-url CLICKHOUSE_URL
    ClickHouse url
    --stop-after STOP_AFTER
    Stop after feeding N measurements from S3
    --no-write-to-db Do not insert measurement in database
    --ccs CCS Filter comma-separated CCs when feeding from S3
    --testnames TESTNAMES
    Filter comma-separated test names when feeding from S3 (without
    underscores)

To run the fastpath manually use:

    ssh 
    sudo sudo -u fastpath /bin/bash

    fastpath --help
    fastpath --start-day 2023-08-14 --end-day 2023-08-19 --noapi --stdout

The `--no-write-to-db` option can be useful for testing.

The `--ccs` and `--testnames` flags are useful to selectively reprocess
measurements.

After reprocessing measurements it's recommended to manually deduplicate
the contents of the `fastpath` table. See
[Fastpath deduplication](#fastpath-deduplication) 📒

> **note**
> It is possible to run multiple `fastpath` processes
> with different time ranges.
> Running the reprocessing under `byobu` is recommended.

The fastpath will pull [Postcans](#postcans) 💡 from S3. See
[Feed fastpath from JSONL](#feed-fastpath-from-jsonl) 🐞 for a possible speedup.


#### Fastpath monitoring
The fastpath pipeline can be monitored using the
[Fastpath dashboard](#dash:api_fp) and [API and fastpath](#api-and-fastpath) 📊.

Also follow the processing in real time using:

    sudo journalctl -f -u fastpath


### Android probe release runbook
This runbook is meant to help coordinate Android probe releases between
the probe and backend developers and public announcements. It does not
contain detailed instructions for individual components.

Also see the [Measurement drop runbook](#measurement-drop-tutorial) 📒.


Roles: \@probe, \@backend, \@media


#### Android pre-release
\@probe: drive the process involving the other teams as needed. Create
calendar events to track the next steps. Run the probe checklist


\@backend: review

and

for long-term trends


#### Android release
\@probe: release the probe for early adopters

\@backend: monitor

frequently during the first 24h and report any drop on
[Slack](#slack) 🔧

\@probe: wait at least 24h then release the probe for all users

\@backend: monitor

daily for 14 days and report any drop on [Slack](#slack) 🔧

\@probe: wait at least 24h then poke \@media to announce the release

(


### CLI probe release runbook
This runbook is meant to help coordinate CLI probe releases between the
probe and backend developers and public announcements. It does not
contain detailed instructions for individual components.

Roles: \@probe, \@backend, \@media


#### CLI pre-release
\@probe: drive the process involving the other teams as needed. Create
calendar events to track the next steps. Run the probe checklist and
review the CI.
+ +\@backend: review +\[jupyter\]() +and +\[grafana\]() +for long-term trends + + +#### CLI release +\@probe: release the probe for early adopters + +\@backend: monitor +\[jupyter\]() +frequently during the first 24h and report any drop on +[Slack](#slack) 🔧 + +\@probe: wait at least 24h then release the probe for all users + +\@backend: monitor +\[jupyter\]() +daily for 14 days and report any drop on [Slack](#slack) 🔧 + +\@probe: wait at least 24h then poke \@media to announce the release + + +### Investigating heavy aggregation queries runbook +In the following scenario the [Aggregation and MAT](#aggregation-and-mat) 🐝 API is +experiencing query timeouts impacting users. + +Reproduce the issue by setting a large enough time span on the MAT, +e.g.: + + +Click on the link to JSON, e.g. + + +Review the [backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥 metrics on + +(see [Netdata-specific dashboard](#netdata-specific-dashboard) 📊 for details) + +Also review the [API and fastpath](#api-and-fastpath) 📊 dashboard, looking at +CPU load, disk I/O, query time, measurement flow. + +Also see [Aggregation cache monitoring](#aggregation-cache-monitoring) 🐍 + +Refresh and review the charts on the [ClickHouse queries notebook](#clickhouse-queries-notebook) 📔. + +In this instance frequent calls to the aggregation API are found. + +Review the summary of the API quotas. See +[Calling the API manually](#calling-the-api-manually) 📒 for details: + + $ http https://api.ooni.io/api/_/quotas_summary Authorization:'Bearer ' + +Log on [backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥 and review the logs: + + backend-fsn:~$ sudo journalctl --since '5 min ago' + +Summarize the subnets calling the API: + + backend-fsn:~$ sudo journalctl --since '5 hour ago' -u ooni-api -u nginx | grep aggreg | cut -d' ' -f 8 | sort | uniq -c | sort -nr | head + + 807 + 112 + 92 + 38 + 16 + 15 + 11 + 11 + 10 + +To block IP addresses or subnets see [Nginx](#nginx) ⚙ or +[HaProxy](#haproxy) ⚙, then configure the required file in +[Ansible](#ansible) 🔧 and deploy. + +Also see [Limiting scraping](#limiting-scraping) 📒. + + +### Aggregation cache monitoring +To monitor cache hit/miss ratio using StatsD metrics the following +script can be run as needed. + +See [Metrics list](#metrics-list) 💡. + +``` python +import subprocess + +import statsd +metrics = statsd.StatsClient('localhost', 8125) + +def main(): + cmd = "sudo journalctl --since '5 min ago' -u nginx | grep 'GET /api/v1/aggregation' | cut -d ' ' -f 10 | sort | uniq -c" + out = subprocess.check_output(cmd, shell=True) + for line in out.splitlines(): + cnt, name = line.strip().split() + name = name.decode() + metrics.gauge(f"nginx_aggregation_cache_{name}", int(cnt)) + +if __name__ == '__main__': + main() +``` + + +### Limiting scraping +Aggressive bots and scrapers can be limited using a combination of +methods. Listed below ordered starting from the most user-friendly: + +1. Reduce the impact on the API (CPU, disk I/O, memory usage) by + caching the results. + +2. [Rate limiting and quotas](#rate-limiting-and-quotas) 🐝 already built in the API. It + might need lowering of the quotas. + +3. Adding API entry points to [Robots.txt](#robots.txt) 🐝 + +4. Adding specific `User-Agent` entries to [Robots.txt](#robots.txt) 🐝 + +5. Blocking IP addresses or subnets in the [Nginx](#nginx) ⚙ or + [HaProxy](#haproxy) ⚙ configuration files + +To add caching to the API or increase the expiration times: + +1. Identify API calls that cause significant load. 
[Nginx](#nginx) ⚙
   is configured to log timing information for each HTTP request. See
   [Logs investigation notebook](#logs-investigation-notebook) 📔 for examples. Also see
   [Logs from FSN notebook](#logs-from-fsn-notebook) 📔 and
   [ClickHouse instance for logs](#clickhouse-instance-for-logs) ⚙. Additionally,
   [Aggregation cache monitoring](#aggregation-cache-monitoring) 🐍 can be tweaked for the present use case.

2. Implement caching or increase expiration times across the API
   codebase. See [API cache](#api-cache) 💡 and
   [Purging Nginx cache](#purging-nginx-cache) 📒.

3. Monitor the improvement in terms of cache hit VS cache miss ratio.

> **important**
> Caching can be applied selectively for API requests that return rapidly
> changing data VS old, stable data. See [Aggregation and MAT](#aggregation-and-mat) 🐝
> for an example.

To update the quotas edit the API here

and deploy as usual.

To update the `robots.txt` entry point see [Robots.txt](#robots.txt) 🐝 and
edit the API here
`__init__.py#L124`
and deploy as usual.

To block IP addresses or subnets see [Nginx](#nginx) ⚙ or
[HaProxy](#haproxy) ⚙, then configure the required file in
[Ansible](#ansible) 🔧 and deploy.


### Calling the API manually
To make HTTP calls to the API manually you'll need to extract a JWT from
the browser, sometimes with admin rights.

In Firefox, authenticate against , then
open Inspect >> Storage >> Local Storage >> Find
`{"token": ""}`

Extract the ASCII-encoded token string, without braces or quotes.

Call the API using [httpie](https://httpie.io/) with:

    $ http https://api.ooni.io/ Authorization:'Bearer '

E.g.:

    $ http https://api.ooni.io/api/_/quotas_summary Authorization:'Bearer '

> **note**
> Do not leave whitespace after "Authorization:"


### Build, deploy, rollback

Host deployments are done with the
[sysadmin repo](https://github.com/ooni/sysadmin)

For component updates a deployment pipeline is used:

Look at the [Status dashboard]() - be aware
of badge image caching


### The deployer tool
Deployments can be performed with a tool that acts as a frontend for
APT. It implements a simple Continuous Delivery workflow from the CLI. It
does not require running a centralized CD pipeline server (e.g. like
)

The tool is hosted on the backend repository together with its
configuration file for simplicity:


At start time it traverses the path from the current working directory
back to root until it finds a configuration file named `deployer.ini`. This
allows using different deployment pipelines stored in configuration
files across different repositories and subdirectories.

The tool connects to the hosts to perform deployments and requires sudo
rights. It installs Debian packages from repositories already configured
on the hosts.

It runs `apt-get update` and then `apt-get install …` to update or
roll back packages. By design, it does not interfere with manual
execution of apt-get or through tools like [Ansible](#ansible) 🔧.
This means operators can log on a host to do manual upgrade or rollback
of packages without breaking the deployer tool.

The tool depends only on the `python3-apt` package.
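If it is missing, it can be installed with APT itself:

```bash
sudo apt-get install python3-apt
```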

Here is a configuration file example, with comments:

``` ini
[environment]
## Location on the path where SVG badges are stored
badges_path = /var/www/package_badges


## List of packages that are handled by the deployer, space separated
deb_packages = ooni-api fastpath analysis detector


## List of deployment stage names, space separated, from the least to the most critical
stages = test hel prod


## For each stage a block named stage: is required.
## The block lists the stage hosts.


## Example of an unused stage (not listed under stages)
[stage:alpha]
hosts = localhost

[stage:test]
hosts = ams-pg-test.ooni.org

[stage:hel]
hosts = backend-hel.ooni.org

[stage:prod]
hosts = backend-fsn.ooni.org
```

When run without any argument, the tool will connect to the hosts
from the configuration file and print a summary of the installed
packages, for example:

``` bash
$ deployer

  Package test prod
ooni-api 1.0.79~pr751-194 1.0.79~pr751-194
fastpath 0.81~pr748-191 ►► 0.77~pr705-119
analysis 1.9~pr659-61 ⚠ 1.10~pr692-102
detector 0.3~pr651-98 0.3~pr651-98
```

A green arrow between two package versions indicates that the version
on the left side is higher than the one on the right side. This means
that a rollout is pending. In the example the fastpath package on the
"prod" stage can be updated.

A red warning sign indicates that the version on the right side is
higher than the one on the left side. During a typical continuous
deployment workflow version numbers should always increment. The rollout
should go from left to right, that is, from the least critical stage to the
most critical stage.

Deploy or roll back a given version on the "test" stage:

``` bash
./deployer deploy ooni-api test 0.6~pr194-147
```

Deploy the latest build on the first stage:

``` bash
./deployer deploy ooni-api
```

Deploy the latest build on a given stage. This usage is not recommended as
it deploys the latest build regardless of what is currently running on
previous stages.

``` bash
./deployer deploy ooni-api prod
```

The deployer tool can also generate SVG badges that can then be served by
[Nginx](#nginx) ⚙ or copied elsewhere to create a status dashboard.

Example:

![badge](../../../assets/images-backend/badge.png)

Update all badges with:

``` bash
./deployer refresh_badges
```


### Adding new tests
This runbook describes how to add support for a new test in the
[Fastpath](#fastpath) ⚙.

Review [Backend code changes](#backend-code-changes) 📒, then update
[fastpath core](https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/fastpath/fastpath/core.py)
to add a scoring function.

See for example `def score_torsf(msm: dict) -> dict:`

Also add an `if` block to the `def score_measurement(msm: dict) -> dict:`
function to call the newly created function.

Finish by adding a new test to the `score_measurement` function and
adding relevant integration tests.

Run the integration tests locally.

Update the
[api](https://github.com/ooni/backend/blob/0ec9fba0eb9c4c440dcb7456f2aab529561104ae/api/ooniapi/measurements.py#L491)
if needed.

Deploy on [ams-pg-test.ooni.org](#ams-pg-test.ooni.org) 🖥 and run end-to-end tests
using real probes.


### Adding support for a new test key
This runbook describes how to modify the [Fastpath](#fastpath) ⚙
and the [API](#api) ⚙ to extract, process, store and publish a new measurement
field.
+ +Start with adding a new column to the [fastpath table](#fastpath-table) ⛁ +by following [Adding a new column to the fastpath](#adding-a-new-column-to-the-fastpath) 📒. + +Add the column to the local ClickHouse instance used for tests and +[ams-pg-test.ooni.org](#ams-pg-test.ooni.org) 🖥. + +Update as described in +[Continuous Deployment: Database schema changes](#continuous-deployment:-database-schema-changes) 💡 + +Add support for the new field in the fastpath `core.py` and `db.py` modules +and related tests. +See https://github.com/ooni/backend/pull/682 for a comprehensive example. + +Run tests locally, then open a draft pull request and ensure the CI tests are +running successfully. + +If needed, the current pull request can be reviewed and deployed without modifying the API to expose the new column. This allows processing data sooner while the API is still being worked on. + +Add support for the new column in the API. The change depends on where and how the +new value is to be published. +See for a generic example of updating an SQL query in the API and updating related tests. + +Deploy the changes on test and pre-production stages after creating the new column in the database. +See [The deployer tool](#the-deployer-tool) 🔧 for details. + +Perform end-to-end tests with real probes and [Public and private web UIs](#public-and-private-web-uis) 💡 as needed. + +Complete the pull request and deploy to production. + + +## Increasing the disk size on a dedicated host + +Below are some notes on how to resize the disks when a new drive is added to +our dedicated hosts: + +``` +fdisk /dev/nvme3n1 +# create gpt partition table and new RAID 5 (label 42) partition using the CLI +mdadm --manage /dev/md3 --add /dev/nvme3n1p1 +cat /proc/mdstat +# Take note of the volume count (4) and validate that nvme3n1p1 is marked as spare ("S") +mdadm --grow --raid-devices=4 /dev/md3 +``` + +``` +# resize2fs /dev/md3 +# df -h | grep md3 +/dev/md3 2.6T 1.2T 1.3T 48% / +``` + +## Replicating MergeTree tables + +Notes on how to go about converting a MergeTree family table to a replicated table, while minimizing downtime. + +See the following links for more information: + +- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-converting-mergetree-to-replicated/ +- https://clickhouse.com/docs/en/operations/system-tables/replicas +- https://clickhouse.com/docs/en/architecture/replication#verify-that-clickhouse-keeper-is-running +- https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication +- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings + +### Workflow + +You should first create the replicated database cluster following the +instructions at the [clickhouse docs](https://clickhouse.com/docs/en/architecture/replication). + +The ooni-devops repo has a role called `oonidata_clickhouse` that does that by using the [idealista.clickhouse_role](https://github.com/idealista/clickhouse_role). + +Once the cluster is created you can proceed with creating a DATABASE on the cluster by running: + +``` +CREATE DATABASE ooni ON CLUSTER oonidata_cluster +``` + +There are now a few options to go about doing this: + +1. You just create the new replicated tables and perform a copy into the destination database by running on the source database the following: + +``` +INSERT INTO FUNCTION +remote('destination-database.ooni.nu', 'obs_web', 'USER', 'PASSWORD') +SELECT * from obs_web +``` + +This will require duplicating the data and might not be feasible. + +2. 
If you already have all the data set up on one host and you just want to convert the database into a replicated one, you can do the following:

We assume there are 2 tables: `obs_web_bak` (which is the source table) and
`obs_web` (which is the destination table). We also assume a single shard and
multiple replicas.

First create the destination replicated table. To retrieve the table create query you can run:

```sql
select create_table_query
from system.tables
where database = 'default' and table = 'obs_web'
```

You should then modify the table to make use of the `ReplicatedReplacingMergeTree` engine:

```sql
CREATE TABLE ooni.obs_web (`measurement_uid` String, `observation_idx` UInt16, `input` Nullable(String), `report_id` String, `measurement_start_time` DateTime64(3, 'UTC'), `software_name` String, `software_version` String, `test_name` String, `test_version` String, `bucket_date` String, `probe_asn` UInt32, `probe_cc` String, `probe_as_org_name` String, `probe_as_cc` String, `probe_as_name` String, `network_type` String, `platform` String, `origin` String, `engine_name` String, `engine_version` String, `architecture` String, `resolver_ip` String, `resolver_asn` UInt32, `resolver_cc` String, `resolver_as_org_name` String, `resolver_as_cc` String, `resolver_is_scrubbed` UInt8, `resolver_asn_probe` UInt32, `resolver_as_org_name_probe` String, `created_at` Nullable(DateTime('UTC')), `target_id` Nullable(String), `hostname` Nullable(String), `transaction_id` Nullable(UInt16), `ip` Nullable(String), `port` Nullable(UInt16), `ip_asn` Nullable(UInt32), `ip_as_org_name` Nullable(String), `ip_as_cc` Nullable(String), `ip_cc` Nullable(String), `ip_is_bogon` Nullable(UInt8), `dns_query_type` Nullable(String), `dns_failure` Nullable(String), `dns_engine` Nullable(String), `dns_engine_resolver_address` Nullable(String), `dns_answer_type` Nullable(String), `dns_answer` Nullable(String), `dns_answer_asn` Nullable(UInt32), `dns_answer_as_org_name` Nullable(String), `dns_t` Nullable(Float64), `tcp_failure` Nullable(String), `tcp_success` Nullable(UInt8), `tcp_t` Nullable(Float64), `tls_failure` Nullable(String), `tls_server_name` Nullable(String), `tls_version` Nullable(String), `tls_cipher_suite` Nullable(String), `tls_is_certificate_valid` Nullable(UInt8), `tls_end_entity_certificate_fingerprint` Nullable(String), `tls_end_entity_certificate_subject` Nullable(String), `tls_end_entity_certificate_subject_common_name` Nullable(String), `tls_end_entity_certificate_issuer` Nullable(String), `tls_end_entity_certificate_issuer_common_name` Nullable(String), `tls_end_entity_certificate_san_list` Array(String), `tls_end_entity_certificate_not_valid_after` Nullable(DateTime64(3, 'UTC')), `tls_end_entity_certificate_not_valid_before` Nullable(DateTime64(3, 'UTC')), `tls_certificate_chain_length` Nullable(UInt16), `tls_certificate_chain_fingerprints` Array(String), `tls_handshake_read_count` Nullable(UInt16), `tls_handshake_write_count` Nullable(UInt16), `tls_handshake_read_bytes` Nullable(UInt32), `tls_handshake_write_bytes` Nullable(UInt32), `tls_handshake_last_operation` Nullable(String), `tls_handshake_time` Nullable(Float64), `tls_t` Nullable(Float64), `http_request_url` Nullable(String), `http_network` Nullable(String), `http_alpn` Nullable(String), `http_failure` Nullable(String), `http_request_body_length` Nullable(UInt32), `http_request_method` Nullable(String), `http_runtime` Nullable(Float64), `http_response_body_length` Nullable(Int32), `http_response_body_is_truncated` Nullable(UInt8),
`http_response_body_sha1` Nullable(String), `http_response_status_code` Nullable(UInt16), `http_response_header_location` Nullable(String), `http_response_header_server` Nullable(String), `http_request_redirect_from` Nullable(String), `http_request_body_is_truncated` Nullable(UInt8), `http_t` Nullable(Float64), `probe_analysis` Nullable(String))
ENGINE = ReplicatedReplacingMergeTree(
'/clickhouse/{cluster}/tables/{database}/{table}/{shard}',
'{replica}'
)
PARTITION BY concat(substring(bucket_date, 1, 4), substring(bucket_date, 6, 2))
PRIMARY KEY (measurement_uid, observation_idx)
ORDER BY (measurement_uid, observation_idx, measurement_start_time, probe_cc, probe_asn) SETTINGS index_granularity = 8192
```

Check all the partitions that exist for the source table and produce ALTER queries to map them from the source to the destination:

```sql
SELECT DISTINCT 'ALTER TABLE ooni.obs_web ATTACH PARTITION ID \'' || partition_id || '\' FROM obs_web_bak;' from system.parts WHERE table = 'obs_web_bak' AND active;
```

While you are running the resulting ATTACH queries, you should stop all merges by running:

```sql
SYSTEM STOP MERGES;
```

This can then be scripted like so:

```sh
clickhouse-client -q "SELECT DISTINCT 'ALTER TABLE ooni.obs_web ATTACH PARTITION ID \'' || partition_id || '\' FROM obs_web_bak;' from system.parts WHERE table = 'obs_web_bak' format TabSeparatedRaw" | clickhouse-client -u write --password XXXX -mn
```

You will now have a replicated table existing on one of the replicas.

Then, on each other replica in the set, manually create the table, this
time passing in the zookeeper path explicitly.

You can get the zookeeper path by running the following on the first replica you have set up:

```sql
SELECT zookeeper_path FROM system.replicas WHERE table = 'obs_web';
```

For each replica you will then have to create the tables like so:

```sql
CREATE TABLE ooni.obs_web (`measurement_uid` String, `observation_idx` UInt16, `input` Nullable(String), `report_id` String, `measurement_start_time` DateTime64(3, 'UTC'), `software_name` String, `software_version` String, `test_name` String, `test_version` String, `bucket_date` String, `probe_asn` UInt32, `probe_cc` String, `probe_as_org_name` String, `probe_as_cc` String, `probe_as_name` String, `network_type` String, `platform` String, `origin` String, `engine_name` String, `engine_version` String, `architecture` String, `resolver_ip` String, `resolver_asn` UInt32, `resolver_cc` String, `resolver_as_org_name` String, `resolver_as_cc` String, `resolver_is_scrubbed` UInt8, `resolver_asn_probe` UInt32, `resolver_as_org_name_probe` String, `created_at` Nullable(DateTime('UTC')), `target_id` Nullable(String), `hostname` Nullable(String), `transaction_id` Nullable(UInt16), `ip` Nullable(String), `port` Nullable(UInt16), `ip_asn` Nullable(UInt32), `ip_as_org_name` Nullable(String), `ip_as_cc` Nullable(String), `ip_cc` Nullable(String), `ip_is_bogon` Nullable(UInt8), `dns_query_type` Nullable(String), `dns_failure` Nullable(String), `dns_engine` Nullable(String), `dns_engine_resolver_address` Nullable(String), `dns_answer_type` Nullable(String), `dns_answer` Nullable(String), `dns_answer_asn` Nullable(UInt32), `dns_answer_as_org_name` Nullable(String), `dns_t` Nullable(Float64), `tcp_failure` Nullable(String), `tcp_success` Nullable(UInt8), `tcp_t` Nullable(Float64), `tls_failure` Nullable(String), `tls_server_name` Nullable(String), `tls_version` Nullable(String), `tls_cipher_suite` Nullable(String),
`tls_is_certificate_valid` Nullable(UInt8), `tls_end_entity_certificate_fingerprint` Nullable(String), `tls_end_entity_certificate_subject` Nullable(String), `tls_end_entity_certificate_subject_common_name` Nullable(String), `tls_end_entity_certificate_issuer` Nullable(String), `tls_end_entity_certificate_issuer_common_name` Nullable(String), `tls_end_entity_certificate_san_list` Array(String), `tls_end_entity_certificate_not_valid_after` Nullable(DateTime64(3, 'UTC')), `tls_end_entity_certificate_not_valid_before` Nullable(DateTime64(3, 'UTC')), `tls_certificate_chain_length` Nullable(UInt16), `tls_certificate_chain_fingerprints` Array(String), `tls_handshake_read_count` Nullable(UInt16), `tls_handshake_write_count` Nullable(UInt16), `tls_handshake_read_bytes` Nullable(UInt32), `tls_handshake_write_bytes` Nullable(UInt32), `tls_handshake_last_operation` Nullable(String), `tls_handshake_time` Nullable(Float64), `tls_t` Nullable(Float64), `http_request_url` Nullable(String), `http_network` Nullable(String), `http_alpn` Nullable(String), `http_failure` Nullable(String), `http_request_body_length` Nullable(UInt32), `http_request_method` Nullable(String), `http_runtime` Nullable(Float64), `http_response_body_length` Nullable(Int32), `http_response_body_is_truncated` Nullable(UInt8), `http_response_body_sha1` Nullable(String), `http_response_status_code` Nullable(UInt16), `http_response_header_location` Nullable(String), `http_response_header_server` Nullable(String), `http_request_redirect_from` Nullable(String), `http_request_body_is_truncated` Nullable(UInt8), `http_t` Nullable(Float64), `probe_analysis` Nullable(String)) +ENGINE = ReplicatedReplacingMergeTree( +'/clickhouse/oonidata_cluster/tables/ooni/obs_web/01', +'{replica}' +) +PARTITION BY concat(substring(bucket_date, 1, 4), substring(bucket_date, 6, 2)) +PRIMARY KEY (measurement_uid, observation_idx) +ORDER BY (measurement_uid, observation_idx, measurement_start_time, probe_cc, probe_asn) SETTINGS index_granularity = 8192 +``` + +You will then have to manually copy the data over to the destination replica from the source. 
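One way to do the copy, sketched below with a hypothetical source hostname;
it assumes root SSH access between the replicas and that the ClickHouse
server on the destination is stopped while the files are copied:

```sh
# pull the table data directory from the source replica (hostname is illustrative)
rsync -a root@source-replica.ooni.nu:/var/lib/clickhouse/data/ooni/obs_web/ \
      /var/lib/clickhouse/data/ooni/obs_web/
```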

The data lives inside of `/var/lib/clickhouse/data/{database_name}/{table_name}`

Once the data has been copied over, the data is replicated and you can resume merges on all databases by running:

```sql
SYSTEM START MERGES;
```

### Creating tables on clusters

```sql
CREATE TABLE ooni.obs_web_ctrl ON CLUSTER oonidata_cluster
(`measurement_uid` String, `observation_idx` UInt16, `input` Nullable(String), `report_id` String, `measurement_start_time` DateTime64(3, 'UTC'), `software_name` String, `software_version` String, `test_name` String, `test_version` String, `bucket_date` String, `hostname` String, `created_at` Nullable(DateTime64(3, 'UTC')), `ip` String, `port` Nullable(UInt16), `ip_asn` Nullable(UInt32), `ip_as_org_name` Nullable(String), `ip_as_cc` Nullable(String), `ip_cc` Nullable(String), `ip_is_bogon` Nullable(UInt8), `dns_failure` Nullable(String), `dns_success` Nullable(UInt8), `tcp_failure` Nullable(String), `tcp_success` Nullable(UInt8), `tls_failure` Nullable(String), `tls_success` Nullable(UInt8), `tls_server_name` Nullable(String), `http_request_url` Nullable(String), `http_failure` Nullable(String), `http_success` Nullable(UInt8), `http_response_body_length` Nullable(Int32))
ENGINE = ReplicatedReplacingMergeTree(
'/clickhouse/{cluster}/tables/{database}/{table}/{shard}',
'{replica}'
)
PARTITION BY concat(substring(bucket_date, 1, 4), substring(bucket_date, 6, 2))
PRIMARY KEY (measurement_uid, observation_idx) ORDER BY (measurement_uid, observation_idx, measurement_start_time, hostname) SETTINGS index_granularity = 8192
```
diff --git a/docs/Tools.md b/docs/Tools.md
new file mode 100644
index 0000000..73d9f07
--- /dev/null
+++ b/docs/Tools.md
@@ -0,0 +1,211 @@

### Geolocation script
The following script can be used to compare the geolocation reported by
the probes submitting measurements with the geolocation of the
`/24` subnet the probe is coming from. It is meant to be run on
[backend-fsn.ooni.org](#backend-fsn.ooni.org) 🖥.

``` python
#!/usr/bin/env python3

from time import sleep

import systemd.journal
import geoip2.database  # type: ignore

asnfn = "/var/lib/ooniapi/asn.mmdb"
ccfn = "/var/lib/ooniapi/cc.mmdb"
geoip_asn_reader = geoip2.database.Reader(asnfn)
geoip_cc_reader = geoip2.database.Reader(ccfn)


def follow_journal():
    journal = systemd.journal.Reader()
    #journal.seek_tail()
    journal.get_previous()
    journal.add_match(_SYSTEMD_UNIT="nginx.service")
    while True:
        try:
            event = journal.wait(-1)
            if event == systemd.journal.APPEND:
                for entry in journal:
                    yield entry["MESSAGE"]
        except Exception as e:
            print(e)
            sleep(0.1)


def geolookup(ipaddr: str):
    cc = geoip_cc_reader.country(ipaddr).country.iso_code
    asn = geoip_asn_reader.asn(ipaddr).autonomous_system_number
    return cc, asn


def process(rawmsg):
    if ' "POST /report/' not in rawmsg:
        return
    msg = rawmsg.strip().split()
    ipaddr = msg[2]
    ipaddr2 = msg[3]
    path = msg[8][8:]
    tsamp, tn, probe_cc, probe_asn, collector, rand = path.split("_")
    geo_cc, geo_asn = geolookup(ipaddr)
    proxied = 0
    probe_type = rawmsg.rsplit('"', 2)[-2]
    if "," in probe_type:
        return
    if ipaddr2 != "0.0.0.0":
        proxied = 1
        # Probably CloudFront, use second ipaddr
        geo_cc, geo_asn = geolookup(ipaddr2)

    print(f"{probe_cc},{geo_cc},{probe_asn},{geo_asn},{proxied},{probe_type}")


def main():
    for msg in follow_journal():
        if msg is None:
            break
        try:
            process(msg)
        except Exception as e:
            print(e)
            sleep(0.1)


if __name__ == "__main__":
    main()
```


### Test list prioritization monitoring
The following script monitors the prioritized test list for changes in URLs
for a set of countries. It outputs StatsD metrics.

> **note**
> The prioritization system has been modified to work on a granularity of
> probe_cc + probe_asn rather than whole countries.

Country-wise changes might be misleading. The script can be modified to
filter for a set of CCs+ASNs.

``` python
#!/usr/bin/env python3

from time import sleep
import urllib.request
import json

import statsd  # debdeps: python3-statsd

metrics = statsd.StatsClient("127.0.0.1", 8125, prefix="test-list-changes")

CCs = ["GE", "IT", "US"]
THRESH = 100


def peek(cc, listmap) -> None:
    url = f"https://api.ooni.io/api/v1/test-list/urls?country_code={cc}&debug=True"
    res = urllib.request.urlopen(url)
    j = json.load(res)
    top = j["results"][:THRESH]  # list of dicts
    top_urls = set(d["url"] for d in top)

    if cc in listmap:
        old = listmap[cc]
        changed = old.symmetric_difference(top_urls)
        tot_cnt = len(old.union(top_urls))
        changed_ratio = len(changed) / tot_cnt * 100
        metrics.gauge(f"-{cc}", changed_ratio)

    listmap[cc] = top_urls


def main() -> None:
    listmap = {}
    while True:
        for cc in CCs:
            try:
                peek(cc, listmap)
            except Exception as e:
                print(e)
                sleep(1)
        sleep(60 * 10)


if __name__ == "__main__":
    main()
```

### Recompressing postcans on S3
The following script can be used to compress `.tar.gz` files in the S3 data bucket
that were actually stored uncompressed.
It keeps a copy of the original files locally as a backup.
It terminates once a correctly compressed file is found.
Running the script on an AWS host close to the S3 bucket can significantly
speed up the process.

Tested with the packages:

 * python3-boto3 1.28.49+dfsg-1
 * python3-magic 2:0.4.27-2

Set the ACCESS_KEY and SECRET_KEY environment variables.
Update the PREFIX variable as needed.

```python
#!/usr/bin/env python3
from os import getenv, rename
from sys import exit
import boto3
import gzip
import magic

BUCKET_NAME = "ooni-data-eu-fra-test"
## BUCKET_NAME = "ooni-data-eu-fra"
PREFIX = "raw/2021"

def fetch_files():
    s3 = boto3.client(
        "s3",
        aws_access_key_id=getenv("ACCESS_KEY"),
        aws_secret_access_key=getenv("SECRET_KEY"),
    )
    cont_token = None
    while True:
        kw = {} if cont_token is None else dict(ContinuationToken=cont_token)
        r = s3.list_objects_v2(Bucket=BUCKET_NAME, Prefix=PREFIX, **kw)
        cont_token = r.get("NextContinuationToken", None)
        for i in r.get("Contents", []):
            k = i["Key"]
            if k.endswith(".tar.gz"):
                fn = k.rsplit("/", 1)[-1]
                s3.download_file(BUCKET_NAME, k, fn)
                yield k, fn
        if cont_token is None:
            return

def main():
    s3res = boto3.Session(
        aws_access_key_id=getenv("ACCESS_KEY"),
        aws_secret_access_key=getenv("SECRET_KEY"),
    ).resource("s3")
    for s3key, fn in fetch_files():
        ft = magic.from_file(fn)
        if "tar archive" not in ft:
            print(f"found {ft} at {s3key}")
            # continue  # simply ignore already compressed files
            exit()  # stop when compressed files are found
        tarfn = fn[:-3]
        rename(fn, tarfn)  # keep the local file as a backup
        with open(tarfn, "rb") as f:
            inp = f.read()
        comp = gzip.compress(inp, compresslevel=9)
        ratio = len(inp) / len(comp)
        del inp
        print(f"uploading {s3key} compression ratio {ratio}")
        obj = s3res.Object(BUCKET_NAME, s3key)
        obj.put(Body=comp)
        del comp

main()
```
diff --git a/docs/disk-increase.md b/docs/disk-increase.md
deleted file mode 100644
index b977c99..0000000
--- a/docs/disk-increase.md
+++ /dev/null
@@ -1,17 +0,0 @@
-Below are some notes on how to resize the disks when a new drive is added to
-our dedicated hosts:
-
-```
-fdisk /dev/nvme3n1
-# create gpt partition table and new RAID 5 (label 42) partition using the CLI
-mdadm --manage /dev/md3 --add /dev/nvme3n1p1
-cat /proc/mdstat
-# Take note of the volume count (4) and validate that nvme3n1p1 is marked as spare ("S")
-mdadm --grow --raid-devices=4 /dev/md3
-```
-
-```
-# resize2fs /dev/md3
-# df -h | grep md3
-/dev/md3 2.6T 1.2T 1.3T 48% /
-```
diff --git a/docs/merge-tree-replication.md b/docs/merge-tree-replication.md
deleted file mode 100644
index ac9e1e2..0000000
--- a/docs/merge-tree-replication.md
+++ /dev/null
@@ -1,127 +0,0 @@
-## Replicating MergeTree tables
-
-Notes on how to go about converting a MergeTree family table to a replicated table, while minimizing downtime.
-
-See the following links for more information:
-
-- https://kb.altinity.com/altinity-kb-setup-and-maintenance/altinity-kb-converting-mergetree-to-replicated/
-- https://clickhouse.com/docs/en/operations/system-tables/replicas
-- https://clickhouse.com/docs/en/architecture/replication#verify-that-clickhouse-keeper-is-running
-- https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replication
-- https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings
-
-### Workflow
-
-You should first create the replicated database cluster following the
-instructions at the [clickhouse docs](https://clickhouse.com/docs/en/architecture/replication).
-
-The ooni-devops repo has a role called `oonidata_clickhouse` that does that by using the [idealista.clickhouse_role](https://github.com/idealista/clickhouse_role).
-
-Once the cluster is created you can proceed with creating a DATABASE on the cluster by running:
-
-```
-CREATE DATABASE ooni ON CLUSTER oonidata_cluster
-```
-
-There are a few ways to go about this:
-
-1. Create the new replicated tables and copy the data into the destination database by running the following on the source database:
-
-```
-INSERT INTO FUNCTION
-remote('destination-database.ooni.nu', 'obs_web', 'USER', 'PASSWORD')
-SELECT * from obs_web
-```
-
-This requires duplicating the data and might not be feasible.
-
-2. If you already have all the data on one host and you just want to convert the database into a replicated one, you can do the following:
-
-We assume there are 2 tables: `obs_web_bak` (which is the source table) and
-`obs_web` which is the destination table. We also assume a single shard and
-multiple replicas.
-
-First create the destination replicated table. To retrieve the table create query you can run:
-
-```sql
-select create_table_query
-from system.tables
-where database = 'default' and table = 'obs_web'
-```
-
-You should then modify the create query to make use of the `ReplicatedReplacingMergeTree` engine:
-
-```sql
-CREATE TABLE ooni.obs_web (`measurement_uid` String, `observation_idx` UInt16, `input` Nullable(String), `report_id` String, `measurement_start_time` DateTime64(3, 'UTC'), `software_name` String, `software_version` String, `test_name` String, `test_version` String, `bucket_date` String, `probe_asn` UInt32, `probe_cc` String, `probe_as_org_name` String, `probe_as_cc` String, `probe_as_name` String, `network_type` String, `platform` String, `origin` String, `engine_name` String, `engine_version` String, `architecture` String, `resolver_ip` String, `resolver_asn` UInt32, `resolver_cc` String, `resolver_as_org_name` String, `resolver_as_cc` String, `resolver_is_scrubbed` UInt8, `resolver_asn_probe` UInt32, `resolver_as_org_name_probe` String, `created_at` Nullable(DateTime('UTC')), `target_id` Nullable(String), `hostname` Nullable(String), `transaction_id` Nullable(UInt16), `ip` Nullable(String), `port` Nullable(UInt16), `ip_asn` Nullable(UInt32), `ip_as_org_name` Nullable(String), `ip_as_cc` Nullable(String), `ip_cc` Nullable(String), `ip_is_bogon` Nullable(UInt8), `dns_query_type` Nullable(String), `dns_failure` Nullable(String), `dns_engine` Nullable(String), `dns_engine_resolver_address` Nullable(String), `dns_answer_type` Nullable(String), `dns_answer` Nullable(String), `dns_answer_asn` Nullable(UInt32), `dns_answer_as_org_name` Nullable(String), `dns_t` Nullable(Float64), `tcp_failure` Nullable(String), `tcp_success` Nullable(UInt8), `tcp_t` Nullable(Float64), `tls_failure` Nullable(String), `tls_server_name` Nullable(String), `tls_version` Nullable(String), `tls_cipher_suite` Nullable(String), `tls_is_certificate_valid` Nullable(UInt8), `tls_end_entity_certificate_fingerprint` Nullable(String), `tls_end_entity_certificate_subject` Nullable(String), `tls_end_entity_certificate_subject_common_name` Nullable(String), `tls_end_entity_certificate_issuer` Nullable(String), `tls_end_entity_certificate_issuer_common_name` Nullable(String), `tls_end_entity_certificate_san_list` Array(String), `tls_end_entity_certificate_not_valid_after` Nullable(DateTime64(3, 'UTC')), `tls_end_entity_certificate_not_valid_before` Nullable(DateTime64(3, 'UTC')), `tls_certificate_chain_length` Nullable(UInt16), `tls_certificate_chain_fingerprints` Array(String), `tls_handshake_read_count` Nullable(UInt16), `tls_handshake_write_count` Nullable(UInt16), `tls_handshake_read_bytes` Nullable(UInt32), `tls_handshake_write_bytes` Nullable(UInt32), `tls_handshake_last_operation` Nullable(String), `tls_handshake_time` Nullable(Float64), `tls_t` Nullable(Float64), `http_request_url` Nullable(String), `http_network` Nullable(String), `http_alpn` Nullable(String), `http_failure` Nullable(String), `http_request_body_length` Nullable(UInt32), `http_request_method` Nullable(String), `http_runtime` Nullable(Float64), `http_response_body_length` Nullable(Int32), `http_response_body_is_truncated` Nullable(UInt8), `http_response_body_sha1` Nullable(String), `http_response_status_code` Nullable(UInt16), `http_response_header_location` Nullable(String), `http_response_header_server` Nullable(String), `http_request_redirect_from` Nullable(String), `http_request_body_is_truncated` Nullable(UInt8), `http_t` Nullable(Float64), `probe_analysis` Nullable(String))
-ENGINE = ReplicatedReplacingMergeTree(
-'/clickhouse/{cluster}/tables/{database}/{table}/{shard}',
-'{replica}'
-)
-PARTITION BY concat(substring(bucket_date, 1, 4), substring(bucket_date, 6, 2))
-PRIMARY KEY (measurement_uid, observation_idx)
-ORDER BY (measurement_uid, observation_idx, measurement_start_time, probe_cc, probe_asn) SETTINGS index_granularity = 8192
-```
-
-Check all the partitions that exist for the source table and produce ALTER queries mapping them from the source to the destination:
-
-```sql
-SELECT DISTINCT 'ALTER TABLE ooni.obs_web ATTACH PARTITION ID \'' || partition_id || '\' FROM obs_web_bak;' from system.parts WHERE table = 'obs_web_bak' AND active;
-```
-
-While you run the generated ATTACH queries you should stop all merges:
-
-```sql
-SYSTEM STOP MERGES;
-```
-
-This can then be scripted like so:
-
-```sh
-clickhouse-client -q "SELECT DISTINCT 'ALTER TABLE ooni.obs_web ATTACH PARTITION ID \'' || partition_id || '\' FROM obs_web_bak;' from system.parts WHERE table = 'obs_web_bak' AND active format TabSeparatedRaw" | clickhouse-client -u write --password XXXX -mn
-```
-
-You will now have a replicated table existing on one of the replicas.
-
-Then, for each of the other replicas in the set, you have to create the table manually, this time passing the ZooKeeper path in explicitly.
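-
-As a sanity check, the freshly converted table should now show up in
-`system.replicas` on the first replica (see the system tables documentation
-linked above):
-
-```sql
--- replication state of the converted table on the first replica
-SELECT database, table, is_leader, is_readonly, total_replicas, active_replicas
-FROM system.replicas
-WHERE table = 'obs_web';
-```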
-
-You can get the ZooKeeper path by running the following on the first replica you set up:
-
-```sql
-SELECT zookeeper_path FROM system.replicas WHERE table = 'obs_web';
-```
-
-On each of the other replicas you will then have to create the table like so:
-
-```sql
-CREATE TABLE ooni.obs_web (`measurement_uid` String, `observation_idx` UInt16, `input` Nullable(String), `report_id` String, `measurement_start_time` DateTime64(3, 'UTC'), `software_name` String, `software_version` String, `test_name` String, `test_version` String, `bucket_date` String, `probe_asn` UInt32, `probe_cc` String, `probe_as_org_name` String, `probe_as_cc` String, `probe_as_name` String, `network_type` String, `platform` String, `origin` String, `engine_name` String, `engine_version` String, `architecture` String, `resolver_ip` String, `resolver_asn` UInt32, `resolver_cc` String, `resolver_as_org_name` String, `resolver_as_cc` String, `resolver_is_scrubbed` UInt8, `resolver_asn_probe` UInt32, `resolver_as_org_name_probe` String, `created_at` Nullable(DateTime('UTC')), `target_id` Nullable(String), `hostname` Nullable(String), `transaction_id` Nullable(UInt16), `ip` Nullable(String), `port` Nullable(UInt16), `ip_asn` Nullable(UInt32), `ip_as_org_name` Nullable(String), `ip_as_cc` Nullable(String), `ip_cc` Nullable(String), `ip_is_bogon` Nullable(UInt8), `dns_query_type` Nullable(String), `dns_failure` Nullable(String), `dns_engine` Nullable(String), `dns_engine_resolver_address` Nullable(String), `dns_answer_type` Nullable(String), `dns_answer` Nullable(String), `dns_answer_asn` Nullable(UInt32), `dns_answer_as_org_name` Nullable(String), `dns_t` Nullable(Float64), `tcp_failure` Nullable(String), `tcp_success` Nullable(UInt8), `tcp_t` Nullable(Float64), `tls_failure` Nullable(String), `tls_server_name` Nullable(String), `tls_version` Nullable(String), `tls_cipher_suite` Nullable(String), `tls_is_certificate_valid` Nullable(UInt8), `tls_end_entity_certificate_fingerprint` Nullable(String), `tls_end_entity_certificate_subject` Nullable(String), `tls_end_entity_certificate_subject_common_name` Nullable(String), `tls_end_entity_certificate_issuer` Nullable(String), `tls_end_entity_certificate_issuer_common_name` Nullable(String), `tls_end_entity_certificate_san_list` Array(String), `tls_end_entity_certificate_not_valid_after` Nullable(DateTime64(3, 'UTC')), `tls_end_entity_certificate_not_valid_before` Nullable(DateTime64(3, 'UTC')), `tls_certificate_chain_length` Nullable(UInt16), `tls_certificate_chain_fingerprints` Array(String), `tls_handshake_read_count` Nullable(UInt16), `tls_handshake_write_count` Nullable(UInt16), `tls_handshake_read_bytes` Nullable(UInt32), `tls_handshake_write_bytes` Nullable(UInt32), `tls_handshake_last_operation` Nullable(String), `tls_handshake_time` Nullable(Float64), `tls_t` Nullable(Float64), `http_request_url` Nullable(String), `http_network` Nullable(String), `http_alpn` Nullable(String), `http_failure` Nullable(String), `http_request_body_length` Nullable(UInt32), `http_request_method` Nullable(String), `http_runtime` Nullable(Float64), `http_response_body_length` Nullable(Int32), `http_response_body_is_truncated` Nullable(UInt8), `http_response_body_sha1` Nullable(String), `http_response_status_code` Nullable(UInt16), `http_response_header_location` Nullable(String), `http_response_header_server` Nullable(String), `http_request_redirect_from` Nullable(String), `http_request_body_is_truncated` Nullable(UInt8), `http_t` Nullable(Float64), `probe_analysis` Nullable(String))
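--- Note: unlike on the first replica, the ZooKeeper path below is spelled
--- out in full (the value returned by the zookeeper_path query above)
--- instead of using the {cluster}/{database}/{table}/{shard} macros, so
--- that this replica attaches to the existing replication log.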
-ENGINE = ReplicatedReplacingMergeTree(
-'/clickhouse/oonidata_cluster/tables/ooni/obs_web/01',
-'{replica}'
-)
-PARTITION BY concat(substring(bucket_date, 1, 4), substring(bucket_date, 6, 2))
-PRIMARY KEY (measurement_uid, observation_idx)
-ORDER BY (measurement_uid, observation_idx, measurement_start_time, probe_cc, probe_asn) SETTINGS index_granularity = 8192
-```
-
-You will then have to manually copy the data over from the source replica to the destination ones.
-
-The data lives in `/var/lib/clickhouse/data/{database_name}/{table_name}`.
-
-Once the data has been copied over, the table is replicated and you can resume merges on all databases by running:
-
-```sql
-SYSTEM START MERGES;
-```
-
-### Creating tables on clusters
-
-To create a new replicated table on every node of the cluster in one step, use `ON CLUSTER`:
-
-```sql
-CREATE TABLE ooni.obs_web_ctrl ON CLUSTER oonidata_cluster
-(`measurement_uid` String, `observation_idx` UInt16, `input` Nullable(String), `report_id` String, `measurement_start_time` DateTime64(3, 'UTC'), `software_name` String, `software_version` String, `test_name` String, `test_version` String, `bucket_date` String, `hostname` String, `created_at` Nullable(DateTime64(3, 'UTC')), `ip` String, `port` Nullable(UInt16), `ip_asn` Nullable(UInt32), `ip_as_org_name` Nullable(String), `ip_as_cc` Nullable(String), `ip_cc` Nullable(String), `ip_is_bogon` Nullable(UInt8), `dns_failure` Nullable(String), `dns_success` Nullable(UInt8), `tcp_failure` Nullable(String), `tcp_success` Nullable(UInt8), `tls_failure` Nullable(String), `tls_success` Nullable(UInt8), `tls_server_name` Nullable(String), `http_request_url` Nullable(String), `http_failure` Nullable(String), `http_success` Nullable(UInt8), `http_response_body_length` Nullable(Int32))
-ENGINE = ReplicatedReplacingMergeTree(
-'/clickhouse/{cluster}/tables/{database}/{table}/{shard}',
-'{replica}'
-)
-PARTITION BY concat(substring(bucket_date, 1, 4), substring(bucket_date, 6, 2))
-PRIMARY KEY (measurement_uid, observation_idx) ORDER BY (measurement_uid, observation_idx, measurement_start_time, hostname) SETTINGS index_granularity = 8192
-```
diff --git a/scripts/build-docs.sh b/scripts/build-docs.sh
index 0d04916..864b02a 100755
--- a/scripts/build-docs.sh
+++ b/scripts/build-docs.sh
@@ -1,6 +1,7 @@
#!/bin/bash
DOCS_ROOT=dist/docs/
REPO_NAME="ooni/devops"
+MAIN_BRANCH="main"
COMMIT_HASH=$(git rev-parse --short HEAD)

mkdir -p $DOCS_ROOT
@@ -12,38 +13,34 @@ strip_title() {
    cat $infile | awk 'BEGIN{p=1} /^#/{if(p){p=0; next}} {print}'
}

-cat <<EOF >$DOCS_ROOT/00-index.md
----
-# Do not edit! This file is automatically generated
-# to edit go to: https://github.com/$REPO_NAME/edit/main/README.md
-# version: $REPO_NAME:$COMMIT_HASH
-title: OONI Devops
-description: OONI Devops
-slug: devops
----
-EOF
-strip_title README.md >> $DOCS_ROOT/00-index.md
+generate_doc() {
+    local output_file="$1"
+    local title="$2"
+    local description="$3"
+    local slug="$4"
+    local input_file="$5"

-cat <<EOF >$DOCS_ROOT/01-iac.md
+    cat <<EOF >"$DOCS_ROOT/$output_file"
---
# Do not edit! This file is automatically generated
-# to edit go to: https://github.com/$REPO_NAME/edit/main/tf/README.md
-# version: $REPO_NAME:$COMMIT_HASH
-title: OONI Devops IaC
-description: OONI Devops IaC Documentation
-slug: devops/iac
+# version: $REPO_NAME/$input_file:$COMMIT_HASH
+title: $title
+description: $description
+slug: $slug
---
EOF
-strip_title tf/README.md >> $DOCS_ROOT/01-iac.md
+    echo "[edit file](https://github.com/$REPO_NAME/edit/$MAIN_BRANCH/$input_file)" >> "$DOCS_ROOT/$output_file"
+    strip_title "$input_file" >> "$DOCS_ROOT/$output_file"
+}

-cat <<EOF >$DOCS_ROOT/02-configuration-management.md
----
-# Do not edit! This file is automatically generated
-# to edit go to: https://github.com/$REPO_NAME/edit/main/ansible/README.md
-# version: $REPO_NAME:$COMMIT_HASH
-title: OONI Devops Configuration Management
-description: OONI Devops Configuration Management Documentation
-slug: devops/configuration-management
----
-EOF
-strip_title ansible/README.md >> $DOCS_ROOT/02-configuration-management.md
\ No newline at end of file
+
+generate_doc "00-index.md" "OONI Devops" "OONI Devops" "devops" "README.md"
+generate_doc "01-infrastructure.md" "Infrastructure" "Infrastructure documentation" "devops/infrastructure" "docs/Infrastructure.md"
+generate_doc "02-monitoring-alerts.md" "Monitoring" "Monitoring and Alerts documentation" "devops/monitoring" "docs/MonitoringAlerts.md"
+generate_doc "03-runbooks.md" "Runbooks" "Runbooks docs" "devops/runbooks" "docs/Runbooks.md"
+generate_doc "04-incident-response.md" "Incident response" "Incident response handling guidelines" "devops/incident-response" "docs/IncidentResponse.md"
+generate_doc "05-terraform.md" "Terraform setup" "Terraform setup" "devops/terraform" "tf/README.md"
+generate_doc "06-ansible.md" "Ansible setup" "Ansible setup" "devops/ansible" "ansible/README.md"
+generate_doc "07-tools.md" "Misc Tools" "Misc Tools" "devops/tools" "docs/Tools.md"
+generate_doc "08-debian-packages.md" "Debian Packages" "Debian Packages" "devops/debian-packages" "docs/DebianPackages.md"
+generate_doc "09-legacy-docs.md" "Legacy Documentation" "Legacy Documentation" "devops/legacy-docs" "docs/LegacyDocs.md"