Create a process for manually verifying GCP filestore backups have happened and old backups have been cleaned up #4628

Merged: 4 commits, Aug 15, 2024
Changes from all commits
13 changes: 8 additions & 5 deletions deployer/README.md
@@ -107,9 +107,10 @@ The `deployer.py` file is the main file, that contains all of the commands regis
│   │   ├── deploy_dashboards.py
│   │   ├── tokens.py
│   │   └── utils.py
│   ├── validate
│   │   ├── cluster.schema.yaml
│   │   └── config.py
│ └── verify_backups.py
```

### The `health_check_tests` directory
@@ -135,15 +136,16 @@ This section describes some of the subcommands the `deployer` can carry out.
**Command line usage:**

```bash
Usage: deployer [OPTIONS] COMMAND [ARGS]...
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the installation. │
│ --help Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ cilogon-client Manage cilogon clients for hubs' authentication. │
│ config Get refined information from the config folder. │
│ debug Debug issues by accessing different components and their logs │
│ decrypt-age Decrypt secrets sent to `support@2i2c.org` via `age` │
│ deploy Deploy one or more hubs in a given cluster │
@@ -156,6 +158,7 @@ This section describes some of the subcommands the `deployer` can carry out.
│ transform Programmatically transform datasets, such as cost tables for billing purposes. │
│ use-cluster-credentials Pop a new shell or execute a command after authenticating to the given cluster using the deployer's credentials │
│ validate Validate configuration files such as helm chart values and cluster.yaml files. │
│ verify-backups Verify backups of home directories have been successfully created, and old backups have been cleared out. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

1 change: 1 addition & 0 deletions deployer/__main__.py
@@ -22,6 +22,7 @@
import deployer.commands.grafana.tokens # noqa: F401
import deployer.commands.transform.cost_table # noqa: F401
import deployer.commands.validate.config # noqa: F401
import deployer.commands.verify_backups # noqa: F401
import deployer.keys.decrypt_age # noqa: F401

from .cli_app import app
6 changes: 6 additions & 0 deletions deployer/cli_app.py
@@ -19,6 +19,7 @@
grafana_app = typer.Typer(pretty_exceptions_show_locals=False)
validate_app = typer.Typer(pretty_exceptions_show_locals=False)
transform_app = typer.Typer(pretty_exceptions_show_locals=False)
verify_backups_app = typer.Typer(pretty_exceptions_show_locals=False)

app.add_typer(
generate_app,
@@ -57,3 +58,8 @@
name="transform",
help="Programmatically transform datasets, such as cost tables for billing purposes.",
)
app.add_typer(
verify_backups_app,
name="verify-backups",
help="Verify backups of home directories have been successfully created, and old backups have been cleared out.",
)
163 changes: 163 additions & 0 deletions deployer/commands/verify_backups.py
@@ -0,0 +1,163 @@
"""
Helper script to verify home directories are being backed up correctly

GCP
---
Wraps a gcloud command to list existing backups of a Fileshare
"""

import json
import subprocess
from datetime import datetime, timedelta

import jmespath
import typer

from deployer.cli_app import verify_backups_app
from deployer.utils.rendering import print_colour


def get_existing_gcp_backups(
project: str, region: str, filestore_name: str, filestore_share_name: str
):
"""List existing backups of a share on a filestore using the gcloud CLI.
We filter the backups based on:
- GCP project
- GCP region
- Filestore name
- Filestore share name

Args:
project (str): The GCP project the filestore is located in
region (str): The region the filestore is located in, e.g., us-central1
filestore_name (str): The name of the filestore instance
filestore_share_name (str): The name of the share on the filestore instance

Returns:
list(dict): A JSON-like object, where each dict-entry in the list describes
an existing backup of the filestore
"""
# Get all existing backups in the selected project and region
backups = subprocess.check_output(
[
"gcloud",
"filestore",
"backups",
"list",
"--format=json",
f"--project={project}",
f"--region={region}",
],
text=True,
)
backups = json.loads(backups)

# Filter returned backups by filestore and share names
backups = jmespath.search(
f"[?sourceFileShare == '{filestore_share_name}' && contains(sourceInstance, '{filestore_name}')]",
backups,
)

# Parse `createTime` property into a datetime object for comparison
backups = [
{
k: (
datetime.strptime(v.split(".")[0], "%Y-%m-%dT%H:%M:%S")
if k == "createTime"
else v
)
for k, v in backup.items()
}
for backup in backups
]

return backups


def filter_gcp_backups_into_recent_and_old(
backups: list, backup_freq_days: int, retention_days: int
):
"""Filter the list of backups into two groups:
- Recently created backups that were created within our backup window,
defined by backup_freq_days
- Out-of-date backups that are older than our retention window, defined by
retention_days

Args:
backups (list(dict)): A JSON-like object defining the existing backups
for the filestore and share we care about
backup_freq_days (int): The time period, in days, within which a backup
should have been created
retention_days (int): The number of days above which a backup is considered
to be out of date

Returns:
recent_backups (list(dict)): A JSON-like object containing all existing
backups with a `createTime` within our backup window
old_backups (list(dict)): A JSON-like object containing all existing
backups with a `createTime` older than our retention window
"""
# Generate a list of filestore backups that are younger than our backup window
recent_backups = [
backup
for backup in backups
if datetime.now() - backup["createTime"] < timedelta(days=backup_freq_days)
]

# Generate a list of filestore backups that are older than our set retention period
old_backups = [
backup
for backup in backups
if datetime.now() - backup["createTime"] > timedelta(days=retention_days)
]

return recent_backups, old_backups


@verify_backups_app.command()
def gcp(
project: str = typer.Argument(
..., help="The GCP project the filestore is located in"
),
region: str = typer.Argument(
..., help="The GCP region the filestore is located in, e.g., us-central1"
),
filestore_name: str = typer.Argument(
..., help="The name of the filestore instance to verify backups of"
),
filestore_share_name: str = typer.Option(
"homes", help="The name of the share on the filestore"
),
backup_freq_days: int = typer.Option(
1, help="How often, in days, backups should be created"
),
retention_days: int = typer.Option(
5, help="How old, in days, backups are allowed to become before being deleted"
),
):
filestore_backups = get_existing_gcp_backups(
project, region, filestore_name, filestore_share_name
)
recent_filestore_backups, old_filestore_backups = (
filter_gcp_backups_into_recent_and_old(
filestore_backups, backup_freq_days, retention_days
)
)

if len(recent_filestore_backups) > 0:
print_colour(
f"A backup has been made within the last {backup_freq_days} day(s)!"
)
else:
print_colour(
f"No backups have been made in the last {backup_freq_days} day(s)!",
colour="red",
)

if len(old_filestore_backups) > 0:
print_colour(
f"Filestore backups older than {retention_days} day(s) have been found!",
colour="red",
)
else:
        print_colour("No outdated backups have been found!")
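
The recent/old partition implemented above can be exercised on its own. Below is a minimal stdlib-only sketch: the record names and timestamps are synthetic, and the jmespath filtering step is skipped, so the helper here simply mirrors the comprehension logic of `filter_gcp_backups_into_recent_and_old`:

```python
from datetime import datetime, timedelta


def partition_backups(backups, backup_freq_days, retention_days):
    # Mirror the PR's filter logic:
    # "recent" = created within the backup window,
    # "old"    = created before the retention window.
    now = datetime.now()
    recent = [
        b for b in backups
        if now - b["createTime"] < timedelta(days=backup_freq_days)
    ]
    old = [
        b for b in backups
        if now - b["createTime"] > timedelta(days=retention_days)
    ]
    return recent, old


# Synthetic records: one backup made today, one past the retention window
backups = [
    {"name": "backup-today", "createTime": datetime.now() - timedelta(hours=6)},
    {"name": "backup-stale", "createTime": datetime.now() - timedelta(days=10)},
]
recent, old = partition_backups(backups, backup_freq_days=1, retention_days=5)
print([b["name"] for b in recent])  # ['backup-today']
print([b["name"] for b in old])     # ['backup-stale']
```

With the default options (daily backups, five-day retention), a healthy cluster should yield a non-empty `recent` list and an empty `old` list.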
15 changes: 15 additions & 0 deletions docs/howto/filesystem-backups/enable-backups.md
@@ -63,3 +63,18 @@ export CLUSTER_NAME=<cluster-name>

This will have successfully enabled automatic backups of GCP Filestores for this
cluster.

### Verify successful backups on GCP

We manually verify that backups are being successfully created and cleaned up on a regular schedule.

To verify that a backup has been recently created, and that no backups older than the retention period exist, we can use the following deployer command:

```bash
deployer verify-backups gcp <project-name> <region> <filestore-name>
```

where:
- `<project-name>` is the name of the GCP project the Filestore is located in
- `<region>` is the GCP region the Filestore is located in, e.g., `us-central1`
- `<filestore-name>` is the name of the Filestore instance
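
A fuller invocation can also spell out the optional flags with their defaults. The project and Filestore names below are placeholders, and the flag names follow from the typer option names defined in `verify_backups.py`:

```bash
deployer verify-backups gcp my-gcp-project us-central1 my-filestore \
  --filestore-share-name homes \
  --backup-freq-days 1 \
  --retention-days 5
```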
2 changes: 1 addition & 1 deletion docs/howto/upgrade-cluster/aws.md
@@ -69,7 +69,7 @@ kubectl get pod -A -l "app.kubernetes.io/component in (dask-scheduler, dask-work
Notify others in 2i2c that you are starting this cluster upgrade in the
`#maintenance-notices` Slack channel.

### 4. Upgrade the k8s control plane[^2]

#### 4.1. Upgrade the k8s control plane one minor version

5 changes: 4 additions & 1 deletion requirements.txt
@@ -42,4 +42,7 @@ requests==2.*
GitPython==3.*

# Used to parse units that kubernetes understands (like GiB)
kubernetes

# Used to run JMESPath search expressions against JSON objects
jmespath