Overview of Nagios
Will Parker edited this page Jan 12, 2021 · 2 revisions
Nagios is an industry-standard server monitoring service that gives users an at-a-glance overview of the machines in their server farm. Users can specify exactly which aspects of each machine to monitor and how often each one is checked.
For most typical build and test machines, we have the following checks in place:
Check | How Often | Warning Boundary | Critical Boundary |
---|---|---|---|
Check SSH | Every 15 minutes | - | Can't connect to machine |
Current Load (1-, 5-, 15-minute averages) | Every 30 minutes | 15,10,5 | 30,24,20 |
Disk Space Root Partition | Every 60 minutes | 20% free | 10% free |
Check Jenkins connection | Every 30 minutes | Temporarily disabled | Fully disconnected |
Ping | Every 15 minutes | rta 200 ms, 20% packet loss | rta 500 ms, 60% packet loss |
Check RAM | Every 10 minutes | 15% free | 5% free |
Check Timesync | Every 15 minutes | Time not synchronized / service not running | Can't find required info |
Check Package Manager | Once a day | Any updates required | Critical Updates required |
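
To make the boundaries in the table concrete, here is a minimal Python sketch of how a reading could be mapped to a Nagios state. The `State` values follow the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names `classify_free_percent` and `classify_load` are hypothetical and are not taken from the actual check scripts. The comparison rules (a "free" reading breaches a boundary when it falls below it, and a load reading breaches when any one of the three averages exceeds its limit) are assumptions for illustration.

```python
from enum import IntEnum


class State(IntEnum):
    """Standard Nagios plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def classify_free_percent(free_pct: float, warn: float, crit: float) -> State:
    """Classify a 'percent free' reading, e.g. disk space or RAM.

    Assumption: a boundary is breached when the reading drops below it.
    """
    if free_pct < crit:
        return State.CRITICAL
    if free_pct < warn:
        return State.WARNING
    return State.OK


def classify_load(loads, warn=(15, 10, 5), crit=(30, 24, 20)) -> State:
    """Classify 1-, 5- and 15-minute load averages against per-window limits.

    Assumption: exceeding the limit in any single window raises the state.
    """
    if any(load > limit for load, limit in zip(loads, crit)):
        return State.CRITICAL
    if any(load > limit for load, limit in zip(loads, warn)):
        return State.WARNING
    return State.OK
```

With the root partition boundaries from the table, `classify_free_percent(12.0, warn=20, crit=10)` returns `State.WARNING`, and `classify_load((4.0, 6.0, 21.0))` returns `State.CRITICAL` because the 15-minute average exceeds 20.
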
The output of the `check_ssh` check determines whether the host is considered connected to the Nagios server. If this check is critical, it can be assumed that no other checks will work either.
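
The gating role of the SSH check can be sketched as follows. This is only an illustration of the reasoning above, not how the Nagios server itself is configured: the helper `effective_states` and the check names in the example are hypothetical. It treats every other result as `UNKNOWN` once the SSH check is `CRITICAL`, since those results can no longer be trusted.

```python
from typing import Dict


def effective_states(results: Dict[str, str], gate: str = "check_ssh") -> Dict[str, str]:
    """Return check states, masking all non-gate checks when the gate is CRITICAL.

    Assumption: `results` maps a check name to its last reported state string.
    """
    if results.get(gate) != "CRITICAL":
        # Host is reachable, so every result stands as reported.
        return dict(results)
    # Host is unreachable: only the gate check's own state is meaningful.
    return {
        name: (state if name == gate else "UNKNOWN")
        for name, state in results.items()
    }
```

For example, `effective_states({"check_ssh": "CRITICAL", "check_ram": "OK"})` yields `{"check_ssh": "CRITICAL", "check_ram": "UNKNOWN"}`, which signals that the RAM result is stale and should not be acted on.
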
Note: Checks shown in bold are platform-specific.
An up-to-date version of the checks can be found in `ansible/playbooks/Supporting_Scripts/Nagios_Ansible_Config_Tool/templates`.