CheckMate


CheckMate is a service monitoring tool written in Go that provides real-time health checks and metrics for infrastructure. It supports multiple protocols, customizable rules, and Prometheus integration.

DISCLAIMER: This is a personal project under heavy development. It is not feature complete, secure, or tested, and it is not meant to be used in a production environment.

Features

Core Features

  • Multi-protocol support (TCP, HTTP, HTTPS with cert validation, SMTP, DNS)
  • Hierarchical configuration (Sites → Groups → Hosts → Checks)
  • High availability monitoring with configurable modes
  • Configurable check intervals per service
  • Prometheus metrics integration
  • Simple rule-based monitoring with custom conditions
  • Flexible notification system
  • Service tagging system
  • TLS certificate expiration monitoring
  • Extensible design for easy protocol additions

High Availability Monitoring

Groups support two monitoring modes that can be configured at different levels:

  • All Mode (Default)

    • Group is considered "up" if any host is responding
    • Rules only trigger when all hosts are down
    • For redundant services where one available host is sufficient
  • Any Mode

    • Group monitoring tracks all hosts individually
    • Rules trigger when any host goes down
    • Suitable for services where each host's availability is critical
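
The difference between the two modes can be sketched in a few lines of Go. This is a minimal illustration, not CheckMate's actual code: the function name `ruleShouldFire` and the treatment of an empty group are assumptions.

```go
package main

import "fmt"

// ruleShouldFire reports whether a downtime rule would trigger for a group,
// given the rule mode and the up/down state of each host.
// Hypothetical helper for illustration only.
func ruleShouldFire(mode string, hostUp []bool) bool {
	if len(hostUp) == 0 {
		return false // assumption: a group without hosts never triggers
	}
	down := 0
	for _, up := range hostUp {
		if !up {
			down++
		}
	}
	switch mode {
	case "any":
		// Any mode: a single host going down is enough.
		return down > 0
	default:
		// All mode (the default): only trigger when every host is down.
		return down == len(hostUp)
	}
}

func main() {
	hosts := []bool{true, false, true} // one of three hosts is down

	fmt.Println(ruleShouldFire("all", hosts)) // false
	fmt.Println(ruleShouldFire("any", hosts)) // true
}
```

With one host out of three down, the group still counts as "up" in all mode, while any mode would already trigger.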

The effective rule mode for a check is resolved in the following order of precedence:

  1. Check level - Overrides group settings for specific checks
  2. Group level - Default for all checks in the group
  3. Default - Falls back to "all" mode if not specified
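
The precedence above amounts to a simple fallback chain. The sketch below assumes that an unset mode is represented by an empty string; `resolveRuleMode` is a hypothetical name, not a function from the CheckMate code base.

```go
package main

import "fmt"

// resolveRuleMode picks the effective rule mode for a check.
// An empty string means "not specified" (assumption).
func resolveRuleMode(checkMode, groupMode string) string {
	if checkMode != "" {
		return checkMode // 1. check level overrides the group
	}
	if groupMode != "" {
		return groupMode // 2. group level is the default for its checks
	}
	return "all" // 3. global default
}

func main() {
	fmt.Println(resolveRuleMode("any", "all")) // any
	fmt.Println(resolveRuleMode("", "any"))    // any
	fmt.Println(resolveRuleMode("", ""))       // all
}
```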

Configuration

Site Configuration

  • monitor_site: Name of the monitoring instance
  • sites: List of infrastructure sites to monitor
    • name: Site identifier
    • tags: Site-level tags
    • groups: List of service groups

Group Configuration

  • name: Group identifier
  • tags: Group-level tags (combined with site tags)
  • hosts: List of hosts to monitor
    • host: Hostname or IP
    • tags: Host-specific tags
  • checks: Service checks applied to all hosts
    • port: Port number
    • protocol: TCP, HTTP, HTTPS, SMTP, or DNS
    • interval: Check frequency (e.g., "30s", "1m")
    • tags: Check-specific tags
    • rule_mode: Override group's rule mode
    • verify_cert: Enable certificate checking
  • rule_mode: Group-level rule mode ("all" or "any")
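
To make the hierarchy concrete, here is one way the fields above could map onto Go types. The field keys come from this README; the struct names, the `yaml` struct tags, and the helper functions are assumptions for illustration. The README only states that group tags are combined with site tags; extending the merge to host tags is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical types mirroring the documented configuration keys.
type Check struct {
	Port       int      `yaml:"port"`
	Protocol   string   `yaml:"protocol"`
	Interval   string   `yaml:"interval"`
	Tags       []string `yaml:"tags"`
	RuleMode   string   `yaml:"rule_mode"`
	VerifyCert bool     `yaml:"verify_cert"`
}

type Host struct {
	Host string   `yaml:"host"`
	Tags []string `yaml:"tags"`
}

type Group struct {
	Name     string   `yaml:"name"`
	Tags     []string `yaml:"tags"`
	Hosts    []Host   `yaml:"hosts"`
	Checks   []Check  `yaml:"checks"`
	RuleMode string   `yaml:"rule_mode"`
}

type Site struct {
	Name   string   `yaml:"name"`
	Tags   []string `yaml:"tags"`
	Groups []Group  `yaml:"groups"`
}

type Config struct {
	MonitorSite string `yaml:"monitor_site"`
	Sites       []Site `yaml:"sites"`
}

// mergeTags joins tag lists in order and drops duplicates.
func mergeTags(lists ...[]string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, list := range lists {
		for _, tag := range list {
			if !seen[tag] {
				seen[tag] = true
				out = append(out, tag)
			}
		}
	}
	return out
}

// checkInterval parses an interval string such as "30s" or "1m".
func checkInterval(c Check) (time.Duration, error) {
	return time.ParseDuration(c.Interval)
}

func main() {
	cfg := Config{
		MonitorSite: "example-monitor",
		Sites: []Site{{
			Name: "mars-lab",
			Tags: []string{"prod"},
			Groups: []Group{{
				Name:     "web",
				Tags:     []string{"api", "prod"},
				RuleMode: "all",
				Hosts:    []Host{{Host: "web-1.example.com", Tags: []string{"primary"}}},
				Checks:   []Check{{Port: 443, Protocol: "HTTPS", Interval: "30s", VerifyCert: true}},
			}},
		}},
	}

	site := cfg.Sites[0]
	group := site.Groups[0]
	fmt.Println(mergeTags(site.Tags, group.Tags, group.Hosts[0].Tags)) // [prod api primary]

	d, err := checkInterval(group.Checks[0])
	fmt.Println(d, err) // 30s <nil>
}
```

The interval strings use the same syntax as Go's `time.ParseDuration`, which is why "30s" and "1m" parse directly.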

Rule Configuration

Rules define conditions for generating notifications. Each rule requires a type field:

# Standard Rule Example
- name: "prod_service_degraded"
  type: "standard"
  condition: "responseTime > 1000 || downtime > 0"
  tags: ["prod", "critical"]
  notifications: ["log"]

# Certificate Rule Example
- name: "cert_expiring_soon"
  type: "cert"
  min_days_validity: 30
  tags: ["https-api"]
  notifications: ["log"]

Common Fields:

  • name: Rule identifier
  • type: Either "standard" or "cert"
  • tags: Tags to match against groups/checks
  • notifications: Notification types to use

Type-specific Fields:

  • Standard Rules:
    • condition: Expression using downtime and responseTime variables
  • Certificate Rules:
    • min_days_validity: Days before expiration to trigger alert
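
As a rough sketch of how a certificate rule could be evaluated: compute the days left until the certificate's expiry date and compare them with `min_days_validity`. The helper names are hypothetical, and whether CheckMate triggers at "fewer than" or "at most" the configured number of days is an assumption here.

```go
package main

import (
	"fmt"
	"time"
)

// mustParseDate parses a YYYY-MM-DD date (UTC) and panics on bad input.
// It only exists to keep the example deterministic.
func mustParseDate(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

// daysUntil returns the number of whole days from now until notAfter,
// truncated toward zero. The result is negative once the certificate
// has been expired for at least a day.
func daysUntil(notAfter, now time.Time) int {
	return int(notAfter.Sub(now).Hours() / 24)
}

// certRuleTriggered reports whether a cert rule should alert.
// Assumption: the rule fires when fewer than minDaysValidity days remain.
func certRuleTriggered(daysLeft, minDaysValidity int) bool {
	return daysLeft < minDaysValidity
}

func main() {
	now := mustParseDate("2025-01-01")
	notAfter := mustParseDate("2025-01-21") // certificate expiry date

	left := daysUntil(notAfter, now)
	fmt.Println(left, certRuleTriggered(left, 30)) // 20 true
}
```

In a real check, `notAfter` would come from the `NotAfter` field of the peer's `x509.Certificate` and `now` from `time.Now()`.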

Notification Configuration

  • type: Notification type ("log", more coming soon)

Metrics

CheckMate exposes Prometheus metrics at :9100/metrics

Core Metrics

  • checkmate_host_check_status: Service availability (1 = up, 0 = down)
  • checkmate_host_check_latency_milliseconds: Response time in milliseconds
  • checkmate_check_latency_histogram_seconds: Response time distribution
  • checkmate_hosts_up: Number of hosts up in a group
  • checkmate_hosts_total: Total number of hosts in a group
  • checkmate_cert_expiry_days: Days until certificate expiration
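
For quick checks without a Prometheus server, the metrics endpoint can be read directly, since it serves the plain Prometheus text format. The sketch below sums samples of one metric from such text and derives an overall host availability ratio. The sample data and its `id` label are made up for illustration; in practice the text would come from an HTTP GET on `:9100/metrics`.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleMetrics is invented example output, not real CheckMate data.
const sampleMetrics = `# HELP checkmate_hosts_up Number of hosts up in a group
checkmate_hosts_up{id="web"} 2
checkmate_hosts_total{id="web"} 3
checkmate_hosts_up{id="db"} 1
checkmate_hosts_total{id="db"} 1
`

// sumMetric adds up every sample of the named metric in Prometheus text
// exposition format, ignoring labels, comments, and other metrics.
func sumMetric(text, name string) float64 {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the label block or the first space.
		end := strings.IndexAny(line, "{ ")
		if end < 0 || line[:end] != name {
			continue
		}
		rest := line[end:]
		if strings.HasPrefix(rest, "{") {
			closing := strings.LastIndex(rest, "}")
			if closing < 0 {
				continue // malformed line
			}
			rest = rest[closing+1:]
		}
		fields := strings.Fields(rest) // value, optionally followed by a timestamp
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil {
			total += v
		}
	}
	return total
}

func main() {
	up := sumMetric(sampleMetrics, "checkmate_hosts_up")
	total := sumMetric(sampleMetrics, "checkmate_hosts_total")
	fmt.Println(up, total, up/total) // 3 4 0.75
}
```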

Graph Visualization Metrics (In Development)

Note: These metrics are designed for Grafana's Node Graph visualization and are currently in flux

  • checkmate_node_info: Node information for graph visualization

    • Labels: id, type (site/group/host), name, tags, port, protocol
    • Values: 1 for active nodes, 0 for inactive
  • checkmate_edge_info: Edge information with latency

    • Labels: source, target, type, metric, port, protocol
    • Values: latency in milliseconds

Example Prometheus queries:

# Filter checks by site
checkmate_host_check_status{site="mars-lab"}

# Average response time for production APIs
avg(checkmate_host_check_latency_milliseconds{tags=~".*prod.*", tags=~".*api.*"})

# 95th percentile latency by site
histogram_quantile(0.95, sum(rate(checkmate_check_latency_histogram_seconds_bucket[5m])) by (le, site))

# Host availability ratio per group
sum(checkmate_hosts_up) by (id) / sum(checkmate_hosts_total) by (id)

# Graph Visualization (In Development)
checkmate_node_info{type="host", port="443", protocol="HTTPS"}
avg(checkmate_edge_info{type="contains", metric="latency"}) by (source, target, port, protocol)

Grafana Node Graph Setup (In Development)

To visualize your infrastructure in Grafana's Node Graph:

  1. Create a new Node Graph panel
  2. Configure the Node Query:
    checkmate_node_info
    
  3. Configure the Edge Query:
    checkmate_edge_info{metric="latency"}
    
  4. Set transformations:
    • Nodes: Use 'id' for node ID, 'type' for node class
    • Edges: Use 'source' and 'target' for connections

Note: Graph visualization features are in flux and the query/configuration interface may change

Health Checks

CheckMate provides Kubernetes-compatible health check endpoints:

  • /health/live - Liveness probe

    • Returns 200 OK when the service is running
  • /health/ready - Readiness probe

    • Returns 200 OK when ready to receive traffic
    • Returns 503 Service Unavailable during initialization

All health check endpoints are served on port 9100 alongside metrics.
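
The probe behaviour described above can be sketched with the standard library alone. This is an illustration under assumptions, not CheckMate's implementation: the `Health` type and its methods are hypothetical, and the example exercises the handlers in memory instead of listening on port 9100.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Health tracks whether initialization has finished (hypothetical type).
type Health struct {
	ready atomic.Bool
}

// SetReady marks the service as ready to receive traffic.
func (h *Health) SetReady() { h.ready.Store(true) }

// Mux returns handlers for the liveness and readiness paths.
func (h *Health) Mux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
		// The process is running, so liveness always succeeds.
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		if !h.ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable) // still initializing
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe sends an in-memory GET request and returns the status code.
func probe(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := &Health{}
	mux := h.Mux()

	fmt.Println(probe(mux, "/health/live"))  // 200
	fmt.Println(probe(mux, "/health/ready")) // 503
	h.SetReady()
	fmt.Println(probe(mux, "/health/ready")) // 200
}
```

A real server would pass the same mux to `http.ListenAndServe(":9100", mux)` so the probes share the port with the metrics endpoint.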

Mini Roadmap

High Priority

  • Config Hot Reload
  • Notification system expansion (Slack, Email)
  • Configurable notification thresholds
    • Time between alerts
    • Service restoration notification
    • Configurable custom alert levels (for example: insignificant, minor, critical, all hands on deck)
    • etc.
  • Move alert logic to notifications (any/all)

Low Priority

  • Database support for historical data
  • Web UI for monitoring (MAYBE)

Completed

  • Environment variables for config
  • Dockerfile for dev
  • Additional protocol support (HTTPS, TLS verification)
  • Kubernetes readiness/liveness probe support
  • Multiple host monitoring
  • Multi-protocol per host
  • Service tagging system
  • Site-based infrastructure organization
  • High availability group monitoring

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Development

Prerequisites

  • Go 1.21 or higher
  • air for live reloading (optional)

Live Reloading

For development with automatic rebuilding on code changes:

  1. Install Air:
    go install github.com/air-verse/air@latest
  2. Run with Air:
    air