CheckMate

CheckMate is a service monitoring tool written in Go that provides real-time health checks and metrics for infrastructure. It supports multiple protocols, customizable rules, and Prometheus integration.

DISCLAIMER: This is a personal project under heavy development. It is not feature complete, secure, or tested, and it is not meant to be used in a production environment.

Features

Core Features

Multi-protocol support (TCP, HTTP, HTTPS with cert validation, SMTP, DNS)
Hierarchical configuration (Sites → Groups → Hosts → Checks)
High availability monitoring with configurable modes
Configurable check intervals per service
Prometheus metrics integration
Simple rule-based monitoring with custom conditions
Flexible notification system
Service tagging system
TLS certificate expiration monitoring
Extensible design for easy protocol additions
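
As an illustration of the simplest check type in the list above, a TCP check can be sketched as a timed dial. This is a minimal sketch under assumptions, not CheckMate's actual implementation: `CheckResult`, `checkTCP`, and `defaultTimeout` are hypothetical names.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// defaultTimeout is a hypothetical dial timeout chosen for this sketch.
const defaultTimeout = time.Second

// CheckResult is a hypothetical result type, not CheckMate's own.
type CheckResult struct {
	Up      bool
	Latency time.Duration
	Err     error
}

// checkTCP dials addr over TCP and reports whether the connection
// succeeded and how long the dial took.
func checkTCP(addr string, timeout time.Duration) CheckResult {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, timeout)
	latency := time.Since(start)
	if err != nil {
		return CheckResult{Up: false, Latency: latency, Err: err}
	}
	conn.Close()
	return CheckResult{Up: true, Latency: latency}
}

func main() {
	// Start a throwaway local listener so the example has something to probe.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	res := checkTCP(ln.Addr().String(), defaultTimeout)
	fmt.Printf("up=%v latency=%s\n", res.Up, res.Latency)
}
```

Measuring the dial time alongside the up/down result is what lets a single check feed both a status metric and a latency metric.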

High Availability Monitoring

Groups support two monitoring modes that can be configured at different levels:

All Mode (Default)
- Group is considered "up" if any host is responding
- Rules only trigger when all hosts are down
- For redundant services where one available host is sufficient
Any Mode
- Group monitoring tracks all hosts individually
- Rules trigger when any host goes down
- Suitable for services where each host's availability is critical
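
The difference between the two modes comes down to one comparison on the number of hosts that are up. The following is a minimal sketch of that decision, not CheckMate's actual code: `RuleMode` and `shouldTrigger` are hypothetical names, and treating an empty group as never triggering is an assumption.

```go
package main

import "fmt"

// RuleMode mirrors the two documented modes; the type itself is hypothetical.
type RuleMode string

const (
	ModeAll RuleMode = "all"
	ModeAny RuleMode = "any"
)

// shouldTrigger reports whether a downtime rule fires for a group,
// given how many of its hosts are currently up.
func shouldTrigger(mode RuleMode, hostsUp, hostsTotal int) bool {
	if hostsTotal == 0 {
		return false // assumption: an empty group never triggers
	}
	switch mode {
	case ModeAny:
		// Any mode: a single host going down is enough.
		return hostsUp < hostsTotal
	default:
		// All mode (the documented default): only when every host is down.
		return hostsUp == 0
	}
}

func main() {
	fmt.Println(shouldTrigger(ModeAll, 1, 3)) // false
	fmt.Println(shouldTrigger(ModeAny, 1, 3)) // true
	fmt.Println(shouldTrigger(ModeAll, 0, 3)) // true
}
```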

Rule modes can be configured at three levels (in order of precedence):

Check level - Overrides group settings for specific checks
Group level - Default for all checks in the group
Default - Falls back to "all" mode if not specified
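
The precedence order above can be sketched as a simple fallback chain. `resolveRuleMode` is a hypothetical helper, and using the empty string to mean "not specified" is an assumption of this sketch.

```go
package main

import "fmt"

// resolveRuleMode returns the effective rule mode: the check-level
// setting wins, then the group-level setting, then the default "all".
// An empty string stands for "not specified".
func resolveRuleMode(checkMode, groupMode string) string {
	if checkMode != "" {
		return checkMode
	}
	if groupMode != "" {
		return groupMode
	}
	return "all"
}

func main() {
	fmt.Println(resolveRuleMode("any", "all")) // any
	fmt.Println(resolveRuleMode("", "any"))    // any
	fmt.Println(resolveRuleMode("", ""))       // all
}
```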

Configuration

Site Configuration

monitor_site: Name of the monitoring instance
sites: List of infrastructure sites to monitor
- name: Site identifier
- tags: Site-level tags
- groups: List of service groups

Group Configuration

name: Group identifier
tags: Group-level tags (combined with site tags)
hosts: List of hosts to monitor
- host: Hostname or IP
- tags: Host-specific tags
checks: Service checks applied to all hosts
- port: Port number
- protocol: TCP, HTTP, HTTPS, SMTP, or DNS
- interval: Check frequency (e.g., "30s", "1m")
- tags: Check-specific tags
- rule_mode: Override group's rule mode
- verify_cert: Enable certificate checking
rule_mode: Group-level rule mode ("all" or "any")
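
To make the hierarchy concrete, the fields above could map onto Go types roughly as follows. This is a sketch under assumptions: the struct names, the `yaml` struct tags, and `mergeTags` are hypothetical and not taken from CheckMate's source, and dropping duplicate tags when combining levels is an assumption. The interval strings in the examples ("30s", "1m") happen to be valid input for Go's `time.ParseDuration`.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical mirror of the documented configuration fields.
type Check struct {
	Port       int      `yaml:"port"`
	Protocol   string   `yaml:"protocol"`
	Interval   string   `yaml:"interval"`
	Tags       []string `yaml:"tags"`
	RuleMode   string   `yaml:"rule_mode"`
	VerifyCert bool     `yaml:"verify_cert"`
}

type Host struct {
	Host string   `yaml:"host"`
	Tags []string `yaml:"tags"`
}

type Group struct {
	Name     string   `yaml:"name"`
	Tags     []string `yaml:"tags"`
	Hosts    []Host   `yaml:"hosts"`
	Checks   []Check  `yaml:"checks"`
	RuleMode string   `yaml:"rule_mode"`
}

type Site struct {
	Name   string   `yaml:"name"`
	Tags   []string `yaml:"tags"`
	Groups []Group  `yaml:"groups"`
}

// mergeTags concatenates tag lists in order and drops duplicates.
func mergeTags(lists ...[]string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, list := range lists {
		for _, tag := range list {
			if !seen[tag] {
				seen[tag] = true
				out = append(out, tag)
			}
		}
	}
	return out
}

func main() {
	site := Site{Name: "mars-lab", Tags: []string{"prod"}}
	group := Group{Name: "api", Tags: []string{"api", "prod"}}
	check := Check{Port: 443, Protocol: "HTTPS", Interval: "30s"}

	every, err := time.ParseDuration(check.Interval)
	if err != nil {
		fmt.Println("invalid interval:", err)
		return
	}
	fmt.Println(group.Name, mergeTags(site.Tags, group.Tags, check.Tags), every)
}
```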

Rule Configuration

Rules define conditions for generating notifications. Each rule requires a type field:

# Standard Rule Example
- name: "prod_service_degraded"
  type: "standard"
  condition: "responseTime > 1000 || downtime > 0"
  tags: ["prod", "critical"]
  notifications: ["log"]
# Certificate Rule Example
- name: "cert_expiring_soon"
  type: "cert"
  min_days_validity: 30
  tags: ["https-api"]
  notifications: ["log"]

Common Fields:

name: Rule identifier
type: Either "standard" or "cert"
tags: Tags to match against groups/checks
notifications: Notification types to use

Type-specific Fields:

Standard Rules:
- condition: Expression using downtime and responseTime variables
Certificate Rules:
- min_days_validity: Days before expiration to trigger alert
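
The two rule types can be sketched as plain predicates. This is a minimal sketch, not CheckMate's rule engine: all function names are hypothetical, the example condition is hard-coded rather than parsed, and the units (milliseconds for responseTime, seconds for downtime) and the strict "fewer than" comparison for certificates are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// exampleNow is a fixed reference time so the example is deterministic.
var exampleNow = time.Date(2025, time.January, 1, 0, 0, 0, 0, time.UTC)

// daysUntilExpiry returns the number of days between now and notAfter,
// truncated toward zero.
func daysUntilExpiry(notAfter, now time.Time) int {
	return int(notAfter.Sub(now).Hours() / 24)
}

// certRuleTriggers fires when fewer than minDaysValidity days remain
// (assumption: strict comparison).
func certRuleTriggers(daysLeft, minDaysValidity int) bool {
	return daysLeft < minDaysValidity
}

// exampleConditionTriggers hard-codes the example condition
// "responseTime > 1000 || downtime > 0".
func exampleConditionTriggers(responseTimeMs, downtimeSeconds float64) bool {
	return responseTimeMs > 1000 || downtimeSeconds > 0
}

func main() {
	notAfter := exampleNow.AddDate(0, 0, 10) // certificate expires in 10 days
	daysLeft := daysUntilExpiry(notAfter, exampleNow)

	fmt.Println(daysLeft, certRuleTriggers(daysLeft, 30)) // 10 true
	fmt.Println(exampleConditionTriggers(1500, 0))        // true
	fmt.Println(exampleConditionTriggers(200, 0))         // false
}
```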

Notification Configuration

type: Notification type ("log", more coming soon)

Metrics

CheckMate exposes Prometheus metrics on port 9100 at /metrics.

Core Metrics

checkmate_host_check_status: Service availability (1 = up, 0 = down)
checkmate_host_check_latency_milliseconds: Response time in milliseconds
checkmate_check_latency_histogram_seconds: Response time distribution
checkmate_hosts_up: Number of hosts up in a group
checkmate_hosts_total: Total number of hosts in a group
checkmate_cert_expiry_days: Days until certificate expiration

Graph Visualization Metrics (In Development)

Note: These metrics are designed for Grafana's Node Graph visualization and are still subject to change.

checkmate_node_info: Node information for graph visualization
- Labels: id, type (site/group/host), name, tags, port, protocol
- Values: 1 for active nodes, 0 for inactive
checkmate_edge_info: Edge information with latency
- Labels: source, target, type, metric, port, protocol
- Values: latency in milliseconds

Example Prometheus queries:

# Filter checks by site
checkmate_host_check_status{site="mars-lab"}

# Average response time for production APIs
avg(checkmate_host_check_latency_milliseconds{tags=~".*prod.*", tags=~".*api.*"})

# 95th percentile latency by site
histogram_quantile(0.95, sum(rate(checkmate_check_latency_histogram_seconds_bucket[5m])) by (le, site))

# Host availability ratio per group
sum(checkmate_hosts_up) by (id) / sum(checkmate_hosts_total) by (id)

# Graph Visualization (In Development)
checkmate_node_info{type="host", port="443", protocol="HTTPS"}
avg(checkmate_edge_info{type="contains", metric="latency"}) by (source, target, port, protocol)
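
Queries like the ones above can also be sent to a Prometheus server programmatically through its HTTP API (`/api/v1/query`). The sketch below only builds the request URL. It assumes a Prometheus server that scrapes CheckMate and listens on its default address `http://localhost:9090`; `buildQueryURL` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL returns the instant-query URL for a PromQL expression.
// base is the Prometheus server address, for example "http://localhost:9090".
func buildQueryURL(base, promQL string) string {
	v := url.Values{}
	v.Set("query", promQL) // Encode escapes the expression for use in a URL
	return base + "/api/v1/query?" + v.Encode()
}

func main() {
	q := "sum(checkmate_hosts_up) by (id) / sum(checkmate_hosts_total) by (id)"
	fmt.Println(buildQueryURL("http://localhost:9090", q))
}
```

The resulting URL can be fetched with `http.Get`; the response is a JSON document whose `data.result` field holds the query result.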

Grafana Node Graph Setup (In Development)

To visualize your infrastructure in Grafana's Node Graph:

Create a new Node Graph panel
Configure the Node Query:
```
checkmate_node_info
```
Configure the Edge Query:
```
checkmate_edge_info{metric="latency"}
```
Set transformations:
- Nodes: Use 'id' for node ID, 'type' for node class
- Edges: Use 'source' and 'target' for connections

Note: Graph visualization features are in flux, and the query and configuration interface may change.

Health Checks

CheckMate provides Kubernetes-compatible health check endpoints:

/health/live - Liveness probe
- Returns 200 OK when the service is running
/health/ready - Readiness probe
- Returns 200 OK when ready to receive traffic
- Returns 503 Service Unavailable during initialization

All health check endpoints are served on port 9100 alongside metrics.
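
The probe behavior described above can be sketched with the standard library alone. This is an illustrative sketch, not CheckMate's actual handlers: `ready`, `newHealthMux`, and `probe` are hypothetical names, and the example exercises the handlers in memory instead of binding a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready flips to true once initialization has finished.
var ready atomic.Bool

// newHealthMux registers the liveness and readiness endpoints.
func newHealthMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
		// The process is running, so liveness always succeeds.
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe sends one in-memory GET to the mux and returns the status code.
func probe(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newHealthMux()
	fmt.Println(probe(mux, "/health/ready")) // 503
	ready.Store(true)                        // initialization done
	fmt.Println(probe(mux, "/health/ready")) // 200
	fmt.Println(probe(mux, "/health/live"))  // 200
}
```

In a real server the same mux would be passed to `http.ListenAndServe(":9100", mux)`, and a Kubernetes `livenessProbe` and `readinessProbe` would point at the two paths.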

Mini Roadmap

High Pri

Config Hot Reload
Notification system expansion (Slack, Email)
Configurable notification thresholds
- time between alerts
- service restoration notification
- configurable custom alert levels (example: insignificant, minor, critical, all hands on deck)
- etc.
Move alert logic to notifications (any/all)

Low Pri

Database support for historical data
Web UI for monitoring (MAYBE)

Completed

Env Variables for config
Dockerfile for dev
Additional protocol support (HTTPS, TLS verification)
Kubernetes readiness/liveness probe support
Multiple host monitoring
Multi-protocol per host
Service tagging system
Site-based infrastructure organization
High availability group monitoring

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Development

Prerequisites

Go 1.21 or higher
air for live reloading (optional)

Live Reloading

For development with automatic rebuilding on code changes:

Install Air:

go install github.com/air-verse/air@latest

Run with Air:

air
