diff --git a/check-plugins/disk-io/README.rst b/check-plugins/disk-io/README.rst
index 9a814566..db9f96ac 100644
--- a/check-plugins/disk-io/README.rst
+++ b/check-plugins/disk-io/README.rst
@@ -6,20 +6,21 @@ Overview

 Checks disk bandwidth over a period of time. The check tracks the maximum bandwidth and alerts if the bandwidth over the last n reads is above a certain percentage (by default 80/90% over the last 5 reads). This works similar to Load5, but at the disk I/O level.

-On Linux, the check plugin by default tries to find "important" disks automatically and returns only useful perfdata information, so as not to waste disk space in a time series database with unnecessary disk information (as in earlier versions). To do this, it looks for disks that are mounted to a folder.
+On Linux, the check plugin by default tries to find "important" disks automatically and returns only useful perfdata information, so as not to waste disk space in a time series database with unnecessary disk information (as in earlier versions). To do this, it looks for disks that are mounted to a folder. If you want to monitor more disks than the automatic scan provides, you can use the match parameter. This generates a list of all disks, including the "important" ones, and then acts on those matching the provided regex. This is necessary, for example, on systems with ZFS pools, where the pools are not recognised automatically and you need to monitor the raw disks with the match option. As a starting point, the following regex matches most disks: ``^(nvme[0-9]{1,}n[0-9]{1,}$|[sv]d[a-z][0-9]{1,}|md|dm)``

 Disk I/O always starts at 10 MiB/sec, but stores the highest measured bandwidth, so it adjusts the ``RWmax/s`` value accordingly. For this reason, this check takes some time to warm up its (cached) readings: The check will throw some warnings and criticals during the first major disk activities above 10Mib/sec until the maximum bandwidth of the disk has been determined.

 Example: The (shortened) result of ``./disk-io --count 5 --warning 80 --critical 90`` could look like this:

 .. code-block:: text
-
-    /dev/dm-4: 0.0B/s read1, 48.7KiB/s write1, 48.7KiB/s total, 227.9MiB/s max
-
-    Name ! RWmax/s ! R1/s     ! W1/s     ! R5/s     ! W5/s     ! RW5/s
-    -----+---------+----------+----------+----------+----------+--------------------
-    dm-0 ! 44.9MiB ! 42.8MiB  ! 17.2MiB  ! 23.1MiB  ! 18.6MiB  ! 36.3MiB [CRITICAL]
-    dm-1 ! 10.0MiB ! 4.7KiB   ! 4.0KiB   ! 2.0KiB   ! 6.8KiB   ! 8.7KiB
+    /dev/dm-0: 0.0B/s read1, 380.0KiB/s write1, 380.0KiB/s total, 10.0MiB/s max, 0/s readops, 89/s writeops
+
+    Name ! MntPnts ! DvMppr      ! RWmax/s ! R1/s    ! W1/s     ! R5/s    ! W5/s    ! RW5/s              ! R1/s ! W1/s ! R5/s ! W5/s
+    -----+---------+-------------+---------+---------+----------+---------+---------+--------------------+------+------+------+------
+    dm-0 ! /       ! ubuntu-root ! 44.9MiB ! 42.8MiB ! 17.2MiB  ! 23.1MiB ! 18.6MiB ! 36.3MiB [CRITICAL] ! 0    ! 89   ! 0    ! 71
+    md0  ! /boot   !             ! 10.0MiB ! 0.0B    ! 0.0B     ! 0.0B    ! 0.0B    ! 0.0B               ! 0    ! 0    ! 0    ! 0
+    dm-2 ! /var    ! ubuntu-var  ! 10.0MiB ! 0.0B    ! 0.0B     ! 0.0B    ! 0.0B    ! 0.0B               ! 0    ! 0    ! 0    ! 0
+    dm-1 ! /home   ! ubuntu-home ! 10.0MiB ! 0.0B    ! 0.0B     ! 0.0B    ! 0.0B    ! 0.0B               ! 0    ! 0    ! 0    ! 0
     ...

 The first line always shows the disk with the currently highest bandwidth (here ``dm-0``).
@@ -30,6 +31,8 @@ The table columns mean:
 * R1, W1: The current bandwidth is 23.6 MB/sec read and 17.2 MB/sec write.
 * R5, W5: The bandwidth from now to 5 measured values in the past is 23.1 MB/sec read and 18.6 MB/sec write.
 * First line in the table, RW5: Compared to the current values, there was a higher bandwidth for a while. Since a maximum of 44.9 MB/sec bandwidth has been measured for this disk so far, a mean bandwidth (RW5) value of 36.3 MB/sec results in a warning (``36.3 MB/sec >= 44.9 MB/sec * 80%``). The current value of 42.8 MB/sec doesn't matter, this is only a peak. The check alerts because there is unusual high disk I/O over a certain amount of time.
+* R1, W1 (last four columns): The current IOPs for read and write.
+* R5, W5 (last four columns): The IOPs from now to 5 measured values in the past for read and write.

 Hints:

@@ -96,32 +99,32 @@ Just check disk ``dm-0`` (if listed as ``/dev/dm-0``):

 .. code-block:: bash

-    ./disk-io --match='.*dm-0$'
+    ./disk-io --match='dm-0$'

 Match all disks except ``vdc``, ``vdh`` and ``vdz``:

 .. code-block:: bash

-    ./disk-io --match='^(?:(?!.*vdc|.*vdh|.*vdz).)*$'
+    ./disk-io --match='^(?:(?!vdc|vdh|vdz).)*$'
+
+Match all ``sd`` and ``vd`` partitions (but not the raw disks themselves), NVMe namespaces, and ``md`` and ``dm`` devices:
+
+.. code-block:: bash
+
+    ./disk-io --match='^(nvme[0-9]{1,}n[0-9]{1,}$|[sv]d[a-z][0-9]{1,}|md|dm)'

 Example Output:

 .. code-block:: text

-    /dev/dm-8: 5.6KiB/s read1, 2.2MiB/s write1, 2.2MiB/s total, 10.0MiB/s max
-
-    Name ! MntPnts        ! DvMppr           ! RWmax/s ! R1/s   ! W1/s    ! R5/s   ! W5/s    ! RW5/s
-    -----+----------------+------------------+---------+--------+---------+--------+---------+---------
-    dm-0 ! /              ! rl-root          ! 10.0MiB ! 0.0B   ! 426.0B  ! 0.0B   ! 343.0B  ! 343.0B
-    vda2 ! /boot          !                  ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B
-    vda1 ! /boot/efi      !                  ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B
-    dm-5 ! /var           ! rl-var           ! 10.0MiB ! 0.0B   ! 586.0B  ! 0.0B   ! 1.1KiB  ! 1.1KiB
-    dm-8 ! /data          ! rl-lv_data       ! 10.0MiB ! 5.6KiB ! 2.2MiB  ! 8.3KiB ! 2.3MiB  ! 2.3MiB
-    dm-6 ! /tmp           ! rl-tmp           ! 10.0MiB ! 0.0B   ! 4.8KiB  ! 0.0B   ! 7.1KiB  ! 7.1KiB
-    dm-7 ! /home          ! rl-home          ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B
-    dm-2 ! /var/tmp       ! rl-var_tmp       ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B
-    dm-4 ! /var/log       ! rl-var_log       ! 10.0MiB ! 0.0B   ! 51.8KiB ! 0.0B   ! 51.2KiB ! 51.2KiB
-    dm-3 ! /var/log/audit ! rl-var_log_audit ! 10.0MiB ! 0.0B   ! 918.0B  ! 0.0B   ! 876.0B  ! 876.0B
+    /dev/dm-0: 0.0B/s read1, 380.0KiB/s write1, 380.0KiB/s total, 10.0MiB/s max, 0/s readops, 89/s writeops
+
+    Name ! MntPnts ! DvMppr      ! RWmax/s ! R1/s ! W1/s     ! R5/s ! W5/s     ! RW5/s    ! R1/s ! W1/s ! R5/s ! W5/s
+    -----+---------+-------------+---------+------+----------+------+----------+----------+------+------+------+------
+    dm-0 ! /       ! ubuntu-root ! 10.0MiB ! 0.0B ! 380.0KiB ! 0.0B ! 305.0KiB ! 305.0KiB ! 0    ! 89   ! 0    ! 71
+    md0  ! /boot   !             ! 10.0MiB ! 0.0B ! 0.0B     ! 0.0B ! 0.0B     ! 0.0B     ! 0    ! 0    ! 0    ! 0
+    dm-2 ! /var    ! ubuntu-var  ! 10.0MiB ! 0.0B ! 0.0B     ! 0.0B ! 0.0B     ! 0.0B     ! 0    ! 0    ! 0    ! 0
+    dm-1 ! /home   ! ubuntu-home ! 10.0MiB ! 0.0B ! 0.0B     ! 0.0B ! 0.0B     ! 0.0B     ! 0    ! 0    ! 0    ! 0

     Top 5 processes that generate the most I/O traffic:
     1. nfsd: 149.2GiB/5.7TiB (r/w)
@@ -149,8 +152,10 @@ Per (matched) disk, where is the block device name:
     Name, Type, Description
     _busy_time, Continous Counter, Time spent doing actual I/Os (in milliseconds).
     _read_bytes, Continous Counter, Number of bytes read.
+    _read_count, Continous Counter, Number of read operations.
     _read_time, Continous Counter, Time spent reading from disk (in milliseconds).
     _write_bytes, Continous Counter, Number of bytes written.
+    _write_count, Continous Counter, Number of write operations.
     _write_time, Continous Counter, Time spent writing to disk (in milliseconds).


diff --git a/check-plugins/disk-io/disk-io b/check-plugins/disk-io/disk-io
index f20151b6..88b923fb 100755
--- a/check-plugins/disk-io/disk-io
+++ b/check-plugins/disk-io/disk-io
@@ -138,16 +138,19 @@ def get_max_bandwidth(disk, current_bandwidth):
     return max_bandwidth


-def get_rate(ts1, ts2, r1, r2, w1, w2):
+def get_rate(ts1, ts2, rr1, rr2, wr1, wr2, r1, r2, w1, w2):
     """Given two read-, write- and timestamp-values, return the read- and write-rate
-    plus bandwidth.
+    plus bandwidth and iops.
     """
     timediff = abs(ts1 - ts2) # in seconds
     if timediff == 0:
-        return 0, 0, 0, 0
+        return 0, 0, 0, 0, 0, 0
+    rr = abs(int(float(rr1 - rr2) / timediff))
+    wr = abs(int(float(wr1 - wr2) / timediff))
     r = abs(int(float(r1 - r2) / timediff))
     w = abs(int(float(w1 - w2) / timediff))
-    return timediff, r, w, r + w
+
+    return timediff, rr, wr, rr + wr, r, w


 def top(count):
@@ -214,6 +217,8 @@ def main():
                 bd TEXT NOT NULL,
                 dmd TEXT,
                 mp TEXT,
+                read_count INT DEFAULT 0,
+                write_count INT DEFAULT 0,
                 busy_time INT DEFAULT 0,
                 read_bytes INT DEFAULT 0,
                 read_merged_count INT DEFAULT 0,
@@ -244,17 +249,26 @@ def main():

     # analyze and enrich data, store it to database
     real_disks = lib.disk.get_real_disks()
+
+    # if match argument is supplied, try the match on all disks from psutil's disk_io_counters
+    # do not try the match if the disk is already included in real_disks
+    if args.MATCH:
+        for disk in disk_io_counters.keys():
+            if lib.base.coe(lib.txt.match_regex(compiled_regex, disk)) and not any(disk in x['bd'] for x in real_disks):
+                real_disks.append({'bd': disk, 'dmd': '', 'mp': ''})
+
     for disk in real_disks:
+        psutil_name = os.path.basename(disk['bd'])
+
         # disks we have to match
         if args.MATCH \
         and all((
-            not lib.base.coe(lib.txt.match_regex(compiled_regex, disk['bd'])),
+            not lib.base.coe(lib.txt.match_regex(compiled_regex, psutil_name)),
             not lib.base.coe(lib.txt.match_regex(compiled_regex, disk['dmd'])),
             not lib.base.coe(lib.txt.match_regex(compiled_regex, disk['mp'])),
         )):
             continue

-        psutil_name = os.path.basename(disk['bd'])
         if psutil_name not in disk_io_counters:
             continue

@@ -262,7 +276,8 @@ def main():
         data['bd'] = disk['bd']
         data['dmd'] = disk['dmd']
         data['mp'] = disk['mp']
-        # read_count and write_count are the same value over all disks, so simply ignore them
+        data['read_count'] = getattr(disk_io_counters[psutil_name], 'read_count', 0)
+        data['write_count'] = getattr(disk_io_counters[psutil_name], 'write_count', 0)
         data['busy_time'] = getattr(disk_io_counters[psutil_name], 'busy_time', 0)
         data['read_bytes'] = getattr(disk_io_counters[psutil_name], 'read_bytes', 0)
         data['read_merged_count'] = getattr(disk_io_counters[psutil_name], 'read_merged_count', 0)
@@ -300,13 +315,17 @@ def main():
             lib.base.oao('Waiting for more data.', state)

         # calculate current rates (like "load1")
-        timediff, read_bytes_per_second1, write_bytes_per_second1, bandwidth1 = get_rate(
+        timediff, read_bytes_per_second1, write_bytes_per_second1, bandwidth1, read_per_second1, write_per_second1 = get_rate(
             data[0]['timestamp'],
             data[1]['timestamp'],
             data[0]['read_bytes'],
             data[1]['read_bytes'],
             data[0]['write_bytes'],
             data[1]['write_bytes'],
+            data[0]['read_count'],
+            data[1]['read_count'],
+            data[0]['write_count'],
+            data[1]['write_count']
         )
         if timediff <= 0:
             # often happens after a reboot
@@ -318,28 +337,34 @@ def main():

         if bandwidth1 > busiest_disk:
             # get the current busiest disk for the first line of the message
-            msg = '{}: {}/s read1, {}/s write1, {}/s total, {}/s max'.format(
+            msg = '{}: {}/s read1, {}/s write1, {}/s total, {}/s max, {}/s readops, {}/s writeops'.format(
                 disk['bd'],
                 lib.human.bytes2human(read_bytes_per_second1),
                 lib.human.bytes2human(write_bytes_per_second1),
                 lib.human.bytes2human(bandwidth1),
                 lib.human.bytes2human(bandwidth_max),
+                read_per_second1,
+                write_per_second1
             )
             if args.MATCH:
                 msg += ' (disks matching `{}`).'.format(args.MATCH)
             busiest_disk = bandwidth1

-        # calculate read/write rate over the entire period (like "load15")
+        # calculate read/write rate over the entire period (like "load5")
         if len(data) != args.COUNT:
             # not enough data yet
             continue
-        timediff, read_bytes_per_second15, write_bytes_per_second15, bandwidth15 = get_rate(
+        timediff, read_bytes_per_second5, write_bytes_per_second5, bandwidth5, read_per_second5, write_per_second5 = get_rate(
             data[0]['timestamp'],
             data[args.COUNT - 1]['timestamp'],
             data[0]['read_bytes'],
             data[args.COUNT - 1]['read_bytes'],
             data[0]['write_bytes'],
             data[args.COUNT - 1]['write_bytes'],
+            data[0]['read_count'],
+            data[args.COUNT - 1]['read_count'],
+            data[0]['write_count'],
+            data[args.COUNT - 1]['write_count'],
         )
         if timediff <= 0:
             # often happens after a reboot
@@ -348,7 +373,7 @@ def main():

         # get state based on max measured I/O values
         local_state = lib.base.get_state(
-            bandwidth15,
+            bandwidth5,
             bandwidth_max * args.WARN / 100,
             bandwidth_max * args.CRIT / 100,
         )
@@ -360,21 +385,25 @@ def main():
             'dmd': disk['dmd'].replace('/dev/mapper/', ''),
             'mp': disk['mp'],
             'max': lib.human.bytes2human(bandwidth_max),
-            'r1': lib.human.bytes2human(read_bytes_per_second1),
-            'w1': lib.human.bytes2human(write_bytes_per_second1),
-            'r15': lib.human.bytes2human(read_bytes_per_second15),
-            'w15': lib.human.bytes2human(write_bytes_per_second15),
-            't15': lib.human.bytes2human(bandwidth15) + lib.base.state2str(local_state, prefix=' '),
+            'rr1': lib.human.bytes2human(read_bytes_per_second1),
+            'wr1': lib.human.bytes2human(write_bytes_per_second1),
+            'rr5': lib.human.bytes2human(read_bytes_per_second5),
+            'wr5': lib.human.bytes2human(write_bytes_per_second5),
+            'tr5': lib.human.bytes2human(bandwidth5) + lib.base.state2str(local_state, prefix=' '),
+            'r1': read_per_second1,
+            'w1': write_per_second1,
+            'r5': read_per_second5,
+            'w5': write_per_second5,
         })
         # perfdata
         try:
             perfdata += lib.base.get_perfdata('{}_busy_time'.format(bd), data[0]['busy_time'], 'c', None, None, 0, None) # pylint: disable=C0301
             perfdata += lib.base.get_perfdata('{}_read_bytes'.format(bd), data[0]['read_bytes'], 'c', None, None, 0, None) # pylint: disable=C0301
-            #perfdata += lib.base.get_perfdata('{}_read_merged_count'.format(bd), data[0]['read_merged_count'], 'c', None, None, 0, None) # pylint: disable=C0301
+            perfdata += lib.base.get_perfdata('{}_read_count'.format(bd), data[0]['read_count'], 'c', None, None, 0, None) # pylint: disable=C0301
             perfdata += lib.base.get_perfdata('{}_read_time'.format(bd), data[0]['read_time'], 'c', None, None, 0, None) # pylint: disable=C0301
             perfdata += lib.base.get_perfdata('{}_write_bytes'.format(bd), data[0]['write_bytes'], 'c', None, None, 0, None) # pylint: disable=C0301
-            #perfdata += lib.base.get_perfdata('{}_write_merged_count'.format(bd), data[0]['write_merged_count'], 'c', None, None, 0, None) # pylint: disable=C0301
+            perfdata += lib.base.get_perfdata('{}_write_count'.format(bd), data[0]['write_count'], 'c', None, None, 0, None) # pylint: disable=C0301
             perfdata += lib.base.get_perfdata('{}_write_time'.format(bd), data[0]['write_time'], 'c', None, None, 0, None) # pylint: disable=C0301
         except:
             pass
@@ -391,11 +420,15 @@ def main():
             'mp',
             'dmd',
             'max',
+            'rr1',
+            'wr1',
+            'rr5',
+            'wr5',
+            'tr5',
             'r1',
             'w1',
-            'r15',
-            'w15',
-            't15',
+            'r5',
+            'w5'
         ],
         header=[
             'Name',
@@ -406,7 +439,11 @@ def main():
             'W1/s',
             'R{}/s'.format(args.COUNT),
             'W{}/s'.format(args.COUNT),
-            'RW{}/s'.format(args.COUNT)
+            'RW{}/s'.format(args.COUNT),
+            'R1/s',
+            'W1/s',
+            'R{}/s'.format(args.COUNT),
+            'W{}/s'.format(args.COUNT)
         ],
     )

diff --git a/check-plugins/disk-io/grafana/disk-io.yml b/check-plugins/disk-io/grafana/disk-io.yml
index db7c19fe..03885e01 100644
--- a/check-plugins/disk-io/grafana/disk-io.yml
+++ b/check-plugins/disk-io/grafana/disk-io.yml
@@ -41,7 +41,7 @@ spec:
         name: metric
         query: SHOW TAG VALUES FROM "cmd-check-disk-io" WITH KEY = "metric"
         refresh: 2
-        regex: /^(.*)_.*_.*_.*_.*$/
+        regex: /^(.*)_.*_.*$/
         sort: 1
         type: query

@@ -271,6 +271,142 @@ spec:
               operator: =~
               value: /^${metric}_write_bytes_per_second15/

+    - title: Disk I/O - $metric - IOPs per Second
+      type: timeseries
+      gridPos:
+        h: 8
+        w: 12
+        x: 0
+        y: 1
+      fieldConfig:
+        defaults:
+          color:
+            mode: palette-classic
+          custom:
+            lineInterpolation: smooth
+            spanNulls: true
+          decimals: 0
+          mappings: []
+          min: 0
+          unit: number
+      options:
+        legend:
+          calcs:
+            - first
+            - min
+            - mean
+            - max
+            - last
+          displayMode: table
+          placement: bottom
+          showLegend: true
+
+      targets:
+
+        - alias: read_count_per_second1
+          groupBy:
+            - params:
+                - $interval
+              type: time
+          measurement: /^$command$/
+          refId: disk-io-read_count_per_second1
+          resultFormat: time_series
+          select:
+            - - params:
+                  - value
+                type: field
+              - params: []
+                type: mean
+          tags:
+            - key: hostname
+              operator: =~
+              value: /^$hostname$/
+            - condition: AND
+              key: service
+              operator: '='
+              value: Disk I/O
+            - key: metric
+              operator: =~
+              value: /^${metric}_read_count_per_second1/
+
+        - alias: read_count_per_second5
+          groupBy:
+            - params:
+                - $interval
+              type: time
+          measurement: /^$command$/
+          refId: disk-io-read_count_per_second5
+          resultFormat: time_series
+          select:
+            - - params:
+                  - value
+                type: field
+              - params: []
+                type: mean
+          tags:
+            - key: hostname
+              operator: =~
+              value: /^$hostname$/
+            - condition: AND
+              key: service
+              operator: '='
+              value: Disk I/O
+            - key: metric
+              operator: =~
+              value: /^${metric}_read_count_per_second5/
+
+        - alias: write_count_per_second1
+          groupBy:
+            - params:
+                - $interval
+              type: time
+          measurement: /^$command$/
+          refId: disk-io-write_count_per_second1
+          resultFormat: time_series
+          select:
+            - - params:
+                  - value
+                type: field
+              - params: []
+                type: mean
+          tags:
+            - key: hostname
+              operator: =~
+              value: /^$hostname$/
+            - condition: AND
+              key: service
+              operator: '='
+              value: Disk I/O
+            - key: metric
+              operator: =~
+              value: /^${metric}_write_count_per_second1/
+
+        - alias: write_count_per_second5
+          groupBy:
+            - params:
+                - $interval
+              type: time
+          measurement: /^$command$/
+          refId: disk-io-write_count_per_second5
+          resultFormat: time_series
+          select:
+            - - params:
+                  - value
+                type: field
+              - params: []
+                type: mean
+          tags:
+            - key: hostname
+              operator: =~
+              value: /^$hostname$/
+            - condition: AND
+              key: service
+              operator: '='
+              value: Disk I/O
+            - key: metric
+              operator: =~
+              value: /^${metric}_write_count_per_second5/
+
     - title: Disk I/O - $metric - Bytes
       type: timeseries
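
The extended ``get_rate()`` call convention introduced in this patch (byte counters first, then operation counters, six return values) can be tried out in isolation. The following is a minimal standalone sketch, not the plugin's code: the descriptive parameter names, the ``per_second`` helper and the sample counter values are assumptions made for illustration only.

```python
def get_rate(ts1, ts2, rbytes1, rbytes2, wbytes1, wbytes2,
             rcount1, rcount2, wcount1, wcount2):
    """Return (timediff, read B/s, write B/s, total B/s, read ops/s, write ops/s)
    for two samples of cumulative counters taken at timestamps ts1 and ts2.
    """
    timediff = abs(ts1 - ts2)  # in seconds
    if timediff == 0:
        # no time passed between the samples, so no rate can be derived
        return 0, 0, 0, 0, 0, 0

    def per_second(a, b):
        # hypothetical helper: absolute counter difference per second, truncated to int
        return abs(int((a - b) / timediff))

    read_bps = per_second(rbytes1, rbytes2)
    write_bps = per_second(wbytes1, wbytes2)
    read_ops = per_second(rcount1, rcount2)
    write_ops = per_second(wcount1, wcount2)
    return timediff, read_bps, write_bps, read_bps + write_bps, read_ops, write_ops


# made-up sample values: two readings taken 10 seconds apart
result = get_rate(
    110, 100,               # timestamps
    5_000_000, 1_000_000,   # read_bytes (newer, older)
    3_000_000, 1_000_000,   # write_bytes
    900, 0,                 # read_count
    710, 0,                 # write_count
)
print(result)
```

Because the return tuple grew from four to six values, every caller has to unpack all six, which is why both ``get_rate()`` call sites in ``main()`` change in this diff.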