-
Notifications
You must be signed in to change notification settings - Fork 0
Performance monitoring system for HPC or development systems
License
thirstler/all-thing
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
All-Thing ========= Continuous, high-resolution monitoring system for HPC/cloud environments Current state: design/pre-alpha All-Thing is a monitoring system similar to Ganglia in scope and function with a few major and minor differences. Most notably, it's designed for high resolution sampling of data with minimal performance impact for “real-time” performance analysis. Default poll interval is 5 seconds. Currently, All-Thing consists of: * A monitoring daemon, written in C, that polls various data on an Linux OS image. This daemon is very simple and written to limit CPU and I/O activity as much as possible. It has no plug-in architecture or fancy hooks. It cannot be queried or remotely told to change behavior. * A master daemon that listens for reports from monitoring daemons. Data is accepted, counters are calculated and data is made available via a simple service that accepts queries for data. There is no authentication for connecting either the monitor clients or the data services. * A web front-end is in the making that allows the user to view the heck out of performance data collected from the reporting monitors. I'd love to say that the lack of monitor extensibility, encryption, authentication or multi-platform support was all due to some snotty design philosophy but really it's just that I don't have the time for it right now. Features: * Continuous monitoring of nodes limited only by the bandwidth, memory and CPU available to the master daemon(s). For practical purposes, this could be tens of thousands of nodes. Perhaps hundreds of thousands in beefy environments. * Recording of time-series data to a database (PostgreSQL, whatever) according to user configuration. Prototyping to be done with PostgreSQL. Support for other formats will not be at all difficult to implement since database activities are rather simple and the web app is Django and should be fairly agnostic. * Elegant, intuitive web application for watching node data in real-time and for analyzing historical time-series data on all monitored attributes. * Node performance report generating within web application. * Data transfer from reporting daemons to master daemon is in JSON format. I though about fancy containers and even binary chunks but I like the flexibility of JSON since I may want to reimpliment the master process in Python. * Data service uses JSON for queries and data transfer. Since the end target is a web application, JSON is used everywhere on the wire. * cJSON is used on the monitor processes and JANSSON is used on the master process. I'll have to convert the master to cJSON – the reason for this is because JANSSON isn't available on Red Hat 6 and I can include cJSON right in the source. Compared to Ganglia: * Like Ganglia, this is a continuous monitoring system that commits data to a database for analysis. * It is currently designed to work in a much less extensible manner making it easier to configure and use * It uses no plug-in architecture for extending the data it collects however what it collects and reports can be configured. Extending capabilities in C can't be more difficult than enabling a plug-in arch. * It consists of two very separate daemons: a reporting daemon and a collection daemon. As opposed to Ganglia in which the binary is the same and daemons are designated as "deaf" or "mute." Ganglia does big distributed architectures. This does not. * Monitor traffic is JSON. Ganglia uses XDR which is more suitable for the extensible nature of that project, but adds some complexity I don't want to deal with. * Monitor daemon is extremely light-weight and is designed to collect data at frequent intervals without appreciably impacting running jobs. The default interval is every five seconds. This daemon was benchmarked using hosts running large molecular dynamics jobs. There was no impact on the performance (wall time) of the jobs over a multi-month time-scale. This was the goal of the first versions of this software. * Use cases will be different. This is simply a real-time monitor for lots of hosts. This makes it great for scientific and HPC environments where developers/scientists can watch their jobs for problems that could impact the efficiency of their work. I've found that scaling and I/O constraint issues often go unnoticed in these environments.
About
Performance monitoring system for HPC or development systems
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published