
All-Thing
=========

Continuous, high-resolution monitoring system for HPC/cloud environments

Current state: design/pre-alpha

All-Thing is a monitoring system similar to Ganglia in scope and function, with
a few major and minor differences. Most notably, it's designed for
high-resolution sampling of data with minimal performance impact, for
“real-time” performance analysis. The default poll interval is 5 seconds.
Currently, All-Thing consists of:

* A monitoring daemon, written in C, that polls various data on a Linux OS
  image. This daemon is very simple and written to limit CPU and I/O activity
  as much as possible. It has no plug-in architecture or fancy hooks, and it
  cannot be queried or remotely told to change behavior. (A rough sketch of
  the polling loop follows this list.)
* A master daemon that listens for reports from monitoring daemons. Data is
  accepted, counters are calculated and data is made available via a simple 
  service that accepts queries for data. There is no authentication for
  connecting either the monitor clients or the data services.
* A web front-end, still in the making, that lets the user view the heck out
  of the performance data collected from the reporting monitors.
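
The following is a minimal sketch, not the actual All-Thing source, of the
shape the monitor daemon's main loop takes: sample a /proc file, hand the
values off for reporting, and sleep for the poll interval. `POLL_INTERVAL` and
`report()` are illustrative names, not real All-Thing identifiers.

```c
#include <stdio.h>
#include <unistd.h>

#define POLL_INTERVAL 5 /* seconds, the default interval described above */

/* Placeholder for the real work: in the actual daemon the sampled values
 * would be serialized to JSON and sent to the master daemon. */
static void report(double l1, double l5, double l15)
{
    printf("load: %.2f %.2f %.2f\n", l1, l5, l15);
}

int main(void)
{
    for (;;) {
        FILE *fp = fopen("/proc/loadavg", "r");
        if (fp != NULL) {
            double l1, l5, l15;
            if (fscanf(fp, "%lf %lf %lf", &l1, &l5, &l15) == 3)
                report(l1, l5, l15);
            fclose(fp);
        }
        sleep(POLL_INTERVAL); /* fixed cadence keeps CPU and I/O cost tiny */
    }
    return 0;
}
```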
  
I'd love to say that the lack of monitor extensibility, encryption,
authentication, or multi-platform support was all due to some snotty design
philosophy, but really it's just that I don't have the time for it right now.

Features:

* Continuous monitoring of nodes limited only by the bandwidth, memory and CPU
  available to the master daemon(s). For practical purposes, this could be tens
  of thousands of nodes. Perhaps hundreds of thousands in beefy environments.
* Recording of time-series data to a database (PostgreSQL, whatever) according
  to user configuration. Prototyping will be done with PostgreSQL. Support for
  other backends should not be at all difficult to implement, since the
  database activities are rather simple and the web app, being Django, should
  be fairly backend-agnostic.
* Elegant, intuitive web application for watching node data in real-time and for
  analyzing historical time-series data on all monitored attributes.
* Node performance report generation within the web application.
* Data transfer from reporting daemons to the master daemon is in JSON format.
  I thought about fancy containers and even binary chunks, but I like the
  flexibility of JSON since I may want to reimplement the master process in
  Python.
* Data service uses JSON for queries and data transfer. Since the end target
  is a web application, JSON is used everywhere on the wire.
* cJSON is used in the monitor processes and Jansson is used in the master
  process. I'll have to convert the master to cJSON, since Jansson isn't
  available on Red Hat 6 and cJSON can be included right in the source. (A
  sketch of a cJSON-built report follows this list.)
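
As a hedged illustration of what a cJSON-built report might look like, here's
a small sketch. The field names (host, ts, load1, mem_free_kb) are
hypothetical, not the actual All-Thing payload.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "cJSON.h" /* bundled with the source, as described above */

int main(void)
{
    /* Assemble one hypothetical report object. */
    cJSON *report = cJSON_CreateObject();
    cJSON_AddStringToObject(report, "host", "node001");
    cJSON_AddNumberToObject(report, "ts", (double)time(NULL));
    cJSON_AddNumberToObject(report, "load1", 0.42);
    cJSON_AddNumberToObject(report, "mem_free_kb", 1048576);

    /* Compact output keeps the per-interval payload small on the wire. */
    char *wire = cJSON_PrintUnformatted(report);
    puts(wire); /* the daemon would write this to the master's socket */

    free(wire); /* print buffers come from malloc() with default hooks */
    cJSON_Delete(report);
    return 0;
}
```

The output would be something like
`{"host":"node001","ts":1380000000,"load1":0.42,"mem_free_kb":1048576}`.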

Compared to Ganglia:

* Like Ganglia, this is a continuous monitoring system that commits data to a
  database for analysis.
* It is currently designed to work in a much less extensible manner, making it
  easier to configure and use.
* It uses no plug-in architecture for extending the data it collects; however,
  what it collects and reports can be configured. Extending its capabilities
  in C shouldn't be any more difficult than enabling a plug-in architecture.
* It consists of two very separate daemons: a reporting daemon and a
  collection daemon, as opposed to Ganglia, where the same binary serves both
  roles and daemons are designated as "deaf" or "mute." Ganglia does big
  distributed architectures; this does not.
* Monitor traffic is JSON. Ganglia uses XDR, which is more suitable for the
  extensible nature of that project but adds some complexity I don't want to
  deal with.
* The monitor daemon is extremely lightweight and is designed to collect data
  at frequent intervals without appreciably impacting running jobs. The
  default interval is every five seconds. This daemon was benchmarked on hosts
  running large molecular dynamics jobs: there was no impact on job
  performance (wall time) over a multi-month time scale. This was the goal of
  the first versions of this software.
* Use cases will be different. This is simply a real-time monitor for lots of
  hosts. This makes it great for scientific and HPC environments where
  developers/scientists can watch their jobs for problems that could impact the
  efficiency of their work. I've found that scaling and I/O constraint issues
  often go unnoticed in these environments.
