Merge pull request #6 from dasmeta/DMVP-3291-metrics-fix
fix(DMVP-3291): Fixed alarms' metrics
viktoryathegreat authored Jan 9, 2024
2 parents 296bfef + d0dd30a commit 88e0ca5
Showing 9 changed files with 250 additions and 65 deletions.
43 changes: 40 additions & 3 deletions README.md
@@ -1,5 +1,42 @@
# service

## What

This module
- deploys the service using the `helm_release` Terraform resource:
  - deployment is conditional: set `deploy_service` to `false` if you don't want to deploy the service,
- creates these basic CloudWatch alarms for the service:
  - the service pod's received traffic is outside the anomaly band,
  - the service pod's transmitted traffic is outside the anomaly band,
  - the service pod has 2 or more restarts in 5 minutes,
  - the service has 0 available replicas,
  - the service HPA has been at its maximum for 5 minutes, i.e. the service is running its maximum number of pods.

## How
Alarms are configured by default and can be customized via the `alarms.custom_values` parameter.
All 5 alarms are enabled by default, and each of them can be disabled individually:
```
module "this" {
  # ...
  alarms = {
    sns_topic = "default"
    restarts = {
      enabled = false
    }
    network_out = {
      enabled = false
    }
  }
  # ...
}
```
In this case the `restarts` and `network_out` alarms will not be created. Only the `maximum_replicas_usage`, `replicas`, and `network_in` alarms will be created.
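
The HPA-maximum alarm uses `maximum_replicas` as its default threshold (3 according to the input type), so a service whose HPA allows more pods would likely need that value raised. A minimal sketch, assuming a hypothetical HPA maximum of 6 and the local source path and names used in the `examples` folder:

```hcl
module "this" {
  source = "../../" # assumption: module referenced locally, as in examples/

  name         = "api01"
  namespace    = "test"
  cluster_name = "eks-dev"

  alarms = {
    sns_topic = "default"

    maximum_replicas_usage = {
      # Hypothetical value: should match the maximum replica count of the service's HPA.
      maximum_replicas = 6
    }
  }
}
```

Because `enabled` is declared as `optional(bool, true)` inside `maximum_replicas_usage`, it can be left out here, while `maximum_replicas` has to be given whenever the object is passed.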

## Use Cases
Please check the `examples` folder for more detailed examples.

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

@@ -27,11 +64,11 @@ No requirements.

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_alarms"></a> [alarms](#input\_alarms) | Alarms enabled by default you need set sns topic name for send alarms for customize alarms threshold use custom\_values | <pre>object({<br> enabled = optional(bool, true)<br> sns_topic = string<br> custom_values = optional(any, {})<br> })</pre> | n/a | yes |
| <a name="input_alarms"></a> [alarms](#input\_alarms) | Alarms are enabled by default. You need to set SNS topic name to send alarms. Use custom\_values to customize alarms. | <pre>object({<br> enabled = optional(bool, true)<br> sns_topic = string<br> custom_values = optional(any, {})<br> restarts = optional(object({<br> enabled = bool<br> }), {<br> enabled = true<br> })<br> replicas = optional(object({<br> enabled = bool<br> }), {<br> enabled = true<br> })<br> network_in = optional(object({<br> enabled = bool<br> }), {<br> enabled = true<br> })<br> network_out = optional(object({<br> enabled = bool<br> }), {<br> enabled = true<br> })<br> maximum_replicas_usage = optional(object({<br> enabled = optional(bool, true)<br> maximum_replicas = number<br> }), {<br> enabled = true<br> maximum_replicas = 3 //The count of HPA maximum for a service. It will be used as a threshold for HPA maximum alarm.<br> })<br><br> })</pre> | n/a | yes |
| <a name="input_cluster_name"></a> [cluster\_name](#input\_cluster\_name) | Cluster name | `string` | n/a | yes |
| <a name="input_deploy_service"></a> [deploy\_service](#input\_deploy\_service) | Whether to deploy the service via helm or not. | `bool` | `true` | no |
| <a name="input_helm_values"></a> [helm\_values](#input\_helm\_values) | Values which is overwrite chart defaults | `any` | `null` | no |
| <a name="input_name"></a> [name](#input\_name) | Service names | `string` | n/a | yes |
| <a name="input_helm_values"></a> [helm\_values](#input\_helm\_values) | Values which overwrite chart defaults | `any` | `null` | no |
| <a name="input_name"></a> [name](#input\_name) | Service name. It's used as a helm release name and specified PodName in AWS CloudWatch metrics for which alarms will be created. | `string` | n/a | yes |
| <a name="input_namespace"></a> [namespace](#input\_namespace) | Namespace | `string` | `null` | no |
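
The new `alarms` type is hard to read inside the flattened `<pre>` cell. Unfolded, it would look roughly like the declaration below. This is a sketch reconstructed from the generated table, not a copy of `variables.tf` (which this diff does not show); the description text is taken from the table row. Optional attributes with defaults require Terraform 1.3 or later.

```hcl
variable "alarms" {
  description = "Alarms are enabled by default. You need to set SNS topic name to send alarms. Use custom_values to customize alarms."

  type = object({
    enabled       = optional(bool, true)
    sns_topic     = string
    custom_values = optional(any, {})

    # Per-alarm switches; each defaults to enabled when the attribute is omitted.
    restarts    = optional(object({ enabled = bool }), { enabled = true })
    replicas    = optional(object({ enabled = bool }), { enabled = true })
    network_in  = optional(object({ enabled = bool }), { enabled = true })
    network_out = optional(object({ enabled = bool }), { enabled = true })

    maximum_replicas_usage = optional(object({
      enabled          = optional(bool, true)
      maximum_replicas = number
      }), {
      enabled          = true
      maximum_replicas = 3 # HPA maximum for the service, used as the alarm threshold
    })
  })
}
```

Note that inside `restarts`, `replicas`, `network_in`, and `network_out` the `enabled` attribute is not optional, so a caller who passes one of these objects must set it explicitly.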

## Outputs
132 changes: 78 additions & 54 deletions alarms.tf
@@ -6,64 +6,88 @@ module "cw_alerts" {

sns_topic = var.alarms.sns_topic

alerts = [
alerts = concat(
// Restarts
{
name = "${var.name} has too many restarts in ${var.cluster_name}"
source = "ContainerInsights/pod_number_of_container_restarts"
filters = {
ClusterName = var.cluster_name,
Deployment = var.name,
Namespace = var.namespace
}
period = try(var.alarms.custom_values.restarts.period, 300),
statistic = try(var.alarms.custom_values.restarts.statistic, "max"),
threshold = try(var.alarms.custom_values.restarts.threshold, 2)
equation = try(var.alarms.custom_values.restarts.equation, "gte")
},
var.alarms.restarts.enabled ? [
{
name = "${var.name} has too many restarts in ${var.cluster_name}"
source = "ContainerInsights/pod_number_of_container_restarts"
filters = {
ClusterName = var.cluster_name,
Namespace = var.namespace
PodName = var.name,
}
period = try(var.alarms.custom_values.restarts.period, 300),
statistic = try(var.alarms.custom_values.restarts.statistic, "max"),
threshold = try(var.alarms.custom_values.restarts.threshold, 2)
equation = try(var.alarms.custom_values.restarts.equation, "gte")
},
] : [],
// Replicas
{
name = "${var.name} has 0 available replicas in ${var.cluster_name}"
source = "ContainerInsights/kube_deployment_spec_replicas"
filters = {
ClusterName = var.cluster_name,
Deployment = var.name,
Namespace = var.namespace
}
period = try(var.alarms.custom_values.replicas.period, 300),
statistic = try(var.alarms.custom_values.replicas.statistic, "avg"),
threshold = try(var.alarms.custom_values.replicas.threshold, 0),
equation = try(var.alarms.custom_values.replicas.equation, "lte")
},
// CPU
{
name = "${var.name} has cpu problem in ${var.cluster_name}",
source = "ContainerInsights/pod_cpu_utilization",
filters = {
PodName = var.name,
ClusterName = var.cluster_name,
Namespace = var.namespace
var.alarms.replicas.enabled ? [
{
name = "${var.name} has 0 available replicas in ${var.cluster_name}"
source = "ContainerInsights/service_number_of_running_pods"
filters = {
ClusterName = var.cluster_name,
Namespace = var.namespace
Service = var.name,
}
period = try(var.alarms.custom_values.replicas.period, 300),
statistic = try(var.alarms.custom_values.replicas.statistic, "avg"),
threshold = try(var.alarms.custom_values.replicas.threshold, 0),
equation = try(var.alarms.custom_values.replicas.equation, "lte")
},
period = try(var.alarms.custom_values.cpu.period, 300),
statistic = try(var.alarms.custom_values.cpu.statistic, "avg"),
threshold = try(var.alarms.custom_values.cpu.threshold, 90)
equation = try(var.alarms.custom_values.cpu.equation, "gte")
},
// MEMORY
{
name = "${var.name} has memory problem in ${var.cluster_name}",
source = "ContainerInsights/pod_memory_utilization",
filters = {
PodName = var.name,
ClusterName = var.cluster_name,
Namespace = var.namespace
] : [],
// Network In
var.alarms.network_in.enabled ? [
{
name = "${var.name} is outside of Network < In band in ${var.cluster_name}",
source = "ContainerInsights/pod_network_rx_bytes",
filters = {
ClusterName = var.cluster_name,
Namespace = var.namespace
PodName = var.name,
},
period = try(var.alarms.custom_values.network_in.period, 300),
statistic = try(var.alarms.custom_values.network_in.statistic, "avg"),
equation = try(var.alarms.custom_values.network_in.equation, "ltlgtu")
anomaly_detection = true
},
period = try(var.alarms.custom_values.memory.period, 300),
statistic = try(var.alarms.custom_values.memory.statistic, "avg"),
threshold = try(var.alarms.custom_values.memory.threshold, 90)
equation = try(var.alarms.custom_values.memory.equation, "gte")
},
]
] : [],
// Network Out
var.alarms.network_out.enabled ? [
{
name = "${var.name} is outside of Network > Out band ${var.cluster_name}",
source = "ContainerInsights/pod_network_tx_bytes",
filters = {
ClusterName = var.cluster_name,
Namespace = var.namespace
PodName = var.name,
},
period = try(var.alarms.custom_values.network_out.period, 300),
statistic = try(var.alarms.custom_values.network_out.statistic, "avg"),
equation = try(var.alarms.custom_values.network_out.equation, "ltlgtu")
anomaly_detection = true
},
] : [],
// HPA Maximum
var.alarms.maximum_replicas_usage.enabled ? [
{
name = "${var.name} has been on HPA maximum for 5 minutes in ${var.cluster_name}",
source = "ContainerInsights/kube_deployment_status_replicas_available",
filters = {
ClusterName = var.cluster_name,
Namespace = var.namespace
Deployment = var.name,
},
period = try(var.alarms.custom_values.maximum_replicas_usage.period, 300),
statistic = try(var.alarms.custom_values.maximum_replicas_usage.statistic, "avg"),
threshold = try(var.alarms.custom_values.maximum_replicas_usage.threshold, var.alarms.maximum_replicas_usage.maximum_replicas)
equation = try(var.alarms.custom_values.maximum_replicas_usage.equation, "gte")
}
] : [],
)

depends_on = [
helm_release.service
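
The new `alerts` expression builds the list with `concat(...)`, where each argument is a conditional that yields either a one-element list or an empty list. A stripped-down sketch of that pattern, using hypothetical variable names (`restarts_enabled`, `replicas_enabled`) and placeholder alert objects that are not part of the module:

```hcl
# Hypothetical switches standing in for var.alarms.<alarm>.enabled.
variable "restarts_enabled" {
  type    = bool
  default = true
}

variable "replicas_enabled" {
  type    = bool
  default = false
}

locals {
  # Each argument contributes one alert when its switch is true, nothing otherwise.
  alerts = concat(
    var.restarts_enabled ? [{ name = "restarts" }] : [],
    var.replicas_enabled ? [{ name = "replicas" }] : [],
  )
}

output "alert_names" {
  value = [for alert in local.alerts : alert.name]
}
```

With the defaults above, only the `restarts` entry should end up in `alert_names`. Keeping each alarm in its own conditional list means a disabled alarm simply contributes nothing, and the remaining alert definitions stay unchanged.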
2 changes: 1 addition & 1 deletion examples/basic-yaml/api.yaml
@@ -5,7 +5,7 @@ variables:
namespace: test
cluster_name: "eks-dev"
alarms:
- sns_topic: "Default"
sns_topic: "Default"
helm_values:
image:
repository: xxxxx.dkr.ecr.us-east-1.amazonaws.com/api
14 changes: 10 additions & 4 deletions examples/customized_alarms/main.tf
@@ -23,16 +23,16 @@ module "this" {
alarms = {
sns_topic = "Default"
custom_values = {
cpu = {
network_in = {
period = 300,
statistic = "avg",
threshold = 80
threshold = 80000
equation = "gte"
},
memory = {
network_out = {
period = 300,
statistic = "avg",
threshold = 80
threshold = 80000
equation = "gte"
},
restarts = {
Expand All @@ -47,6 +47,12 @@ module "this" {
threshold = 0
equation = "lte"
},
maximum_replicas_usage = {
period = 300,
statistic = "avg",
threshold = 6
equation = "gte"
},
}
}
}
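
Each key under `custom_values` is read in `alarms.tf` through `try(...)`, which returns the first argument that evaluates without error, so any key that is left out falls back to the module default. A small sketch of that lookup, assuming a hypothetical standalone `custom_values` variable rather than the module's `var.alarms.custom_values`:

```hcl
# Hypothetical standalone variable; in the module this is var.alarms.custom_values.
variable "custom_values" {
  type = any
  default = {
    restarts = {
      threshold = 5
    }
  }
}

locals {
  # Present in custom_values, so the custom value is used.
  restarts_threshold = try(var.custom_values.restarts.threshold, 2)

  # Not present in custom_values, so the fallback (300) is used.
  restarts_period = try(var.custom_values.restarts.period, 300)
}

output "restarts_alarm_settings" {
  value = {
    threshold = local.restarts_threshold
    period    = local.restarts_period
  }
}
```

Only keys that the module wraps in `try(...)` are read. In the `alarms.tf` hunk of this commit the network alarms read `period`, `statistic`, and `equation`, so a `threshold` set under `network_in` or `network_out` would appear to have no effect there.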
1 change: 1 addition & 0 deletions examples/deployment_disabled/main.tf
@@ -4,6 +4,7 @@ module "this" {
deploy_service = false

name = "api01"
namespace = "test"
cluster_name = "eks-dev"

alarms = {
29 changes: 29 additions & 0 deletions examples/some_alarms_disabled/README.md
@@ -0,0 +1,29 @@
# some_alarms_disabled

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

No requirements.

## Providers

No providers.

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_this"></a> [this](#module\_this) | ../../ | n/a |

## Resources

No resources.

## Inputs

No inputs.

## Outputs

No outputs.
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
19 changes: 19 additions & 0 deletions examples/some_alarms_disabled/main.tf
@@ -0,0 +1,19 @@
module "this" {
source = "../../"

deploy_service = false

name = "api01"
namespace = "test"
cluster_name = "eks-dev"

alarms = {
sns_topic = "default"
restarts = {
enabled = false
}
network_out = {
enabled = false
}
}
}
40 changes: 40 additions & 0 deletions examples/some_alarms_disabled/provider.tf
@@ -0,0 +1,40 @@
## This file and its content are generated based on config, please check README.md for more details
provider "aws" {
region = "us-east-1"
}

provider "kubernetes" {
cluster_ca_certificate = "cluster_ca_certificate"
host = "host"

exec {
api_version = "client.authentication.k8s.io/v1beta1"
args = [
"eks",
"--region",
"us-east-1",
"get-token",
"--cluster-name",
"eks-dev"]
command = "aws"
}
}

provider "helm" {
kubernetes {
cluster_ca_certificate = "cluster_ca_certificate"
host = "host"
exec {
api_version = "client.authentication.k8s.io/v1beta1"
args = [
"eks",
"--region",
"us-east-1",
"get-token",
"--cluster-name",
"eks-dev"
]
command = "aws"
}
}
}