CheckMate is a service monitoring tool written in Go that provides real-time health checks and metrics for infrastructure. It supports multiple protocols, customizable rules, and Prometheus integration.
DISCLAIMER: This is a personal project under heavy development. It is not feature-complete, secure, or tested, and it is not meant to be used in a production environment.
- Multi-protocol support (TCP, HTTP, HTTPS with cert validation, SMTP, DNS)
- Hierarchical configuration (Sites → Groups → Hosts → Checks)
- High availability monitoring with configurable modes
- Configurable check intervals per service
- Prometheus metrics integration
- Simple rule-based monitoring with custom conditions
- Flexible notification system
- Service tagging system
- TLS certificate expiration monitoring
- Extensible design for easy protocol additions
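As a rough illustration of what an extensible, multi-protocol design can look like, the sketch below puts each protocol behind a small interface and looks checkers up by protocol name. The `Checker` interface, `TCPChecker` type, and `registry` map are hypothetical names chosen for this example; CheckMate's actual types may differ.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Checker is a hypothetical interface used for illustration only.
type Checker interface {
	// Check probes addr and returns the observed latency.
	Check(addr string, timeout time.Duration) (time.Duration, error)
}

// TCPChecker treats a host as up when a TCP connection can be opened.
type TCPChecker struct{}

// Check dials addr and returns how long the connection took to establish.
func (TCPChecker) Check(addr string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	return time.Since(start), nil
}

// registry maps a protocol name to its checker. Under this design,
// supporting a new protocol means adding one entry here.
var registry = map[string]Checker{
	"TCP": TCPChecker{},
}

func main() {
	// Start a throwaway local listener so the example has something to probe.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	latency, err := registry["TCP"].Check(ln.Addr().String(), time.Second)
	fmt.Println("latency:", latency, "err:", err)
}
```

Keeping the interface this small means HTTP, SMTP, or DNS checkers would only need to implement a single method.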
Groups support two monitoring modes that can be configured at different levels:
- All Mode (Default)
  - Group is considered "up" if any host is responding
  - Rules only trigger when all hosts are down
  - For redundant services where one available host is sufficient
- Any Mode
  - Group monitoring tracks all hosts individually
  - Rules trigger when any host goes down
  - Suitable for services where each host's availability is critical
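The difference between the two modes can be sketched as a single decision function. This is a minimal illustration, assuming host state is available as a slice of booleans; `shouldTrigger` is a hypothetical name, not CheckMate's API.

```go
package main

import "fmt"

// shouldTrigger reports whether a rule fires for a group, given each
// host's up/down state and the rule mode ("all" or "any").
func shouldTrigger(mode string, hostsUp []bool) bool {
	down := 0
	for _, up := range hostsUp {
		if !up {
			down++
		}
	}
	switch mode {
	case "any":
		// Any mode: a single host going down is enough.
		return down > 0
	default:
		// All mode (the documented default): every host must be down.
		return len(hostsUp) > 0 && down == len(hostsUp)
	}
}

func main() {
	hosts := []bool{true, false, true} // one of three hosts is down

	fmt.Println(shouldTrigger("all", hosts)) // false
	fmt.Println(shouldTrigger("any", hosts)) // true
}
```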
Rule modes can be configured at three levels (in order of precedence):
- Check level - Overrides group settings for specific checks
- Group level - Default for all checks in the group
- Default - Falls back to "all" mode if not specified
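The precedence order above amounts to a short fallback chain. The sketch below assumes an unset mode is represented by an empty string; `resolveRuleMode` is a hypothetical helper.

```go
package main

import "fmt"

// resolveRuleMode picks the effective rule mode for a check.
// An empty string means "not specified" at that level.
func resolveRuleMode(checkMode, groupMode string) string {
	if checkMode != "" {
		return checkMode // check level overrides the group
	}
	if groupMode != "" {
		return groupMode // group level is the default for its checks
	}
	return "all" // documented fallback
}

func main() {
	fmt.Println(resolveRuleMode("any", "all")) // any
	fmt.Println(resolveRuleMode("", "any"))    // any
	fmt.Println(resolveRuleMode("", ""))       // all
}
```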
- monitor_site: Name of the monitoring instance
- sites: List of infrastructure sites to monitor
  - name: Site identifier
  - tags: Site-level tags
  - groups: List of service groups
    - name: Group identifier
    - tags: Group-level tags (combined with site tags)
    - hosts: List of hosts to monitor
      - host: Hostname or IP
      - tags: Host-specific tags
    - checks: Service checks applied to all hosts
      - port: Port number
      - protocol: TCP, HTTP, SMTP, or DNS
      - interval: Check frequency (e.g., "30s", "1m")
      - tags: Check-specific tags
      - rule_mode: Override group's rule mode
      - verify_cert: Enable certificate checking
    - rule_mode: Group-level rule mode ("all" or "any")
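One way to picture the hierarchy is as nested Go structs. The sketch below mirrors the documented keys, but the type names, the `yaml` struct tags, and the helpers `mergeTags` and `checkInterval` are assumptions made for illustration. In particular, merging tags from all four levels is a guess based on the note that group tags are combined with site tags. The host and group names in `main` are made up.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical types mirroring the documented configuration keys.
type Check struct {
	Port       int      `yaml:"port"`
	Protocol   string   `yaml:"protocol"`
	Interval   string   `yaml:"interval"`
	Tags       []string `yaml:"tags"`
	RuleMode   string   `yaml:"rule_mode"`
	VerifyCert bool     `yaml:"verify_cert"`
}

type Host struct {
	Host string   `yaml:"host"`
	Tags []string `yaml:"tags"`
}

type Group struct {
	Name     string   `yaml:"name"`
	Tags     []string `yaml:"tags"`
	Hosts    []Host   `yaml:"hosts"`
	Checks   []Check  `yaml:"checks"`
	RuleMode string   `yaml:"rule_mode"`
}

type Site struct {
	Name   string   `yaml:"name"`
	Tags   []string `yaml:"tags"`
	Groups []Group  `yaml:"groups"`
}

type Config struct {
	MonitorSite string `yaml:"monitor_site"`
	Sites       []Site `yaml:"sites"`
}

// mergeTags combines tags from several levels, keeping first-seen order
// and dropping duplicates.
func mergeTags(levels ...[]string) []string {
	seen := make(map[string]bool)
	var merged []string
	for _, tags := range levels {
		for _, t := range tags {
			if !seen[t] {
				seen[t] = true
				merged = append(merged, t)
			}
		}
	}
	return merged
}

// checkInterval parses values such as "30s" or "1m" with the standard library.
func checkInterval(c Check) (time.Duration, error) {
	return time.ParseDuration(c.Interval)
}

func main() {
	site := Site{Name: "mars-lab", Tags: []string{"prod"}}
	group := Group{Name: "web", Tags: []string{"api"}}
	host := Host{Host: "10.0.0.5", Tags: []string{"primary"}}
	check := Check{Port: 443, Protocol: "HTTP", Interval: "30s", Tags: []string{"api", "https"}}

	fmt.Println(mergeTags(site.Tags, group.Tags, host.Tags, check.Tags)) // [prod api primary https]

	d, err := checkInterval(check)
	fmt.Println(d, err) // 30s <nil>
}
```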
Rules define conditions for generating notifications. Each rule requires a type field:
# Standard Rule Example
- name: "prod_service_degraded"
  type: "standard"
  condition: "responseTime > 1000 || downtime > 0"
  tags: ["prod", "critical"]
  notifications: ["log"]
# Certificate Rule Example
- name: "cert_expiring_soon"
  type: "cert"
  min_days_validity: 30
  tags: ["https-api"]
  notifications: ["log"]
Common Fields:
- name: Rule identifier
- type: Either "standard" or "cert"
- tags: Tags to match against groups/checks
- notifications: Notification types to use
Type-specific Fields:
- Standard Rules:
  - condition: Expression using downtime and responseTime variables
- Certificate Rules:
  - min_days_validity: Days before expiration to trigger alert
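To make the two rule types concrete, the sketch below hard-codes the example standard condition and the certificate threshold as plain Go functions. It does not show how CheckMate parses condition expressions. The function names, the assumption that responseTime is in milliseconds, and the choice to fire when the remaining days are strictly below `min_days_validity` are all illustrative guesses.

```go
package main

import (
	"fmt"
	"time"
)

// standardRuleFires hard-codes the example condition
// "responseTime > 1000 || downtime > 0".
func standardRuleFires(responseTime, downtime float64) bool {
	return responseTime > 1000 || downtime > 0
}

// daysUntilExpiry returns the whole days left between now and notAfter.
func daysUntilExpiry(notAfter, now time.Time) int {
	return int(notAfter.Sub(now).Hours() / 24)
}

// certRuleFires reports whether a certificate has fewer days left than
// min_days_validity (strict comparison is an assumption).
func certRuleFires(daysLeft, minDaysValidity int) bool {
	return daysLeft < minDaysValidity
}

func main() {
	fmt.Println(standardRuleFires(1500, 0)) // true: slow response
	fmt.Println(standardRuleFires(200, 0))  // false: fast and no downtime

	now := time.Now()
	notAfter := now.Add(10 * 24 * time.Hour) // certificate expires in 10 days
	fmt.Println(certRuleFires(daysUntilExpiry(notAfter, now), 30)) // true
}
```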
- type: Notification type ("log", more coming soon)
CheckMate exposes Prometheus metrics at :9100/metrics
- checkmate_host_check_status: Service availability (1 = up, 0 = down)
- checkmate_host_check_latency_milliseconds: Response time in milliseconds
- checkmate_check_latency_histogram_seconds: Response time distribution
- checkmate_hosts_up: Number of hosts up in a group
- checkmate_hosts_total: Total number of hosts in a group
- checkmate_cert_expiry_days: Days until certificate expiration
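A quick way to inspect these metrics without a Prometheus server is to fetch the endpoint and keep only the CheckMate samples. The sketch below assumes CheckMate is reachable on localhost; `checkMateSamples` is a hypothetical helper that filters the Prometheus text exposition format by metric-name prefix.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// checkMateSamples returns the sample lines whose metric name starts with
// "checkmate_". Comment lines (# HELP, # TYPE) start with "#" and are skipped.
func checkMateSamples(body string) []string {
	var samples []string
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "checkmate_") {
			samples = append(samples, line)
		}
	}
	return samples
}

func main() {
	// localhost is an assumption; point this at the host running CheckMate.
	resp, err := http.Get("http://localhost:9100/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for _, sample := range checkMateSamples(string(body)) {
		fmt.Println(sample)
	}
}
```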
Note: These metrics are designed for Grafana's Node Graph visualization and are currently in flux
- checkmate_node_info: Node information for graph visualization
  - Labels: id, type (site/group/host), name, tags, port, protocol
  - Values: 1 for active nodes, 0 for inactive
- checkmate_edge_info: Edge information with latency
  - Labels: source, target, type, metric, port, protocol
  - Values: latency in milliseconds
Example Prometheus queries:
# Filter checks by site
checkmate_host_check_status{site="mars-lab"}
# Average response time for production APIs
avg(checkmate_host_check_latency_milliseconds{tags=~".*prod.*", tags=~".*api.*"})
# 95th percentile latency by site
histogram_quantile(0.95, sum(rate(checkmate_check_latency_histogram_seconds_bucket[5m])) by (le, site))
# Host availability ratio per group
sum(checkmate_hosts_up) by (id) / sum(checkmate_hosts_total) by (id)
# Graph Visualization (In Development)
checkmate_node_info{type="host", port="443", protocol="HTTPS"}
avg(checkmate_edge_info{type="contains", metric="latency"}) by (source, target, port, protocol)
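Queries like these can also be run programmatically through the Prometheus HTTP API's instant-query endpoint, `/api/v1/query`. The sketch below only builds the request URL; the Prometheus address `http://localhost:9090` and the helper name `instantQueryURL` are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// instantQueryURL builds a Prometheus instant-query URL for promQL.
// base is the Prometheus server address without a trailing slash.
func instantQueryURL(base, promQL string) string {
	v := url.Values{}
	v.Set("query", promQL) // url.Values handles escaping of braces and quotes
	return base + "/api/v1/query?" + v.Encode()
}

func main() {
	// The server address is an assumption; adjust it to your Prometheus instance.
	fmt.Println(instantQueryURL("http://localhost:9090", `checkmate_node_info{type="host"}`))
}
```

The resulting URL can be passed to `http.Get`; the response is JSON with a `status` field and a `data` object holding the result.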
To visualize your infrastructure in Grafana's Node Graph:
- Create a new Node Graph panel
- Configure the Node Query:
checkmate_node_info
- Configure the Edge Query:
checkmate_edge_info{metric="latency"}
- Set transformations:
- Nodes: Use 'id' for node ID, 'type' for node class
- Edges: Use 'source' and 'target' for connections
Note: Graph visualization features are in flux and the query/configuration interface may change
CheckMate provides Kubernetes-compatible health check endpoints:
- /health/live - Liveness probe
  - Returns 200 OK when the service is running
- /health/ready - Readiness probe
  - Returns 200 OK when ready to receive traffic
  - Returns 503 Service Unavailable during initialization
All health check endpoints are served on port 9100 alongside metrics.
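The probe behavior described above can be sketched with the standard library alone. This is an illustration of the pattern, not CheckMate's actual handler code; the `ready` flag, `newHealthMux`, and `probe` are hypothetical names, and `probe` uses `httptest` so the example runs without opening a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready flips to true once initialization has finished.
var ready atomic.Bool

// newHealthMux registers the liveness and readiness endpoints.
func newHealthMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
		// The process is running if it can answer at all.
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe sends an in-memory GET request to h and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newHealthMux()

	fmt.Println(probe(mux, "/health/live"))  // 200
	fmt.Println(probe(mux, "/health/ready")) // 503 while initializing

	ready.Store(true) // initialization finished
	fmt.Println(probe(mux, "/health/ready")) // 200
}
```

In a real service the mux would be served with `http.ListenAndServe(":9100", mux)` so that the probes share the metrics port.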
- Config Hot Reload
- Notification system expansion (Slack, Email)
- Configurable notification thresholds
  - time between alerts
  - service restoration notification
  - configurable custom alert levels (example: insignificant, minor, critical, all hands on deck)
  - etc.
- move alert logic to notifications (any/all)
- Database support for historical data
- Web UI for monitoring (MAYBE)
- Env Variables for config
- Dockerfile for dev
- Additional protocol support (HTTPS, TLS verification)
- Kubernetes readiness/liveness probe support
- Multiple host monitoring
- Multi-protocol per host
- Service tagging system
- Site-based infrastructure organization
- High availability group monitoring
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
- Go 1.21 or higher
- air for live reloading (optional)
For development with automatic rebuilding on code changes:
- Install Air:
go install github.com/air-verse/air@latest
- Run with Air:
air