Monitoring

Research Computing

What tools can you use to monitor the health of a cluster? Here are some recommendations:

Grafana: provides operational dashboards to monitor cluster health.
Slurm-web: a web frontend and REST API for the [Slurm]({{ site.baseurl }}/docs/schedulers/slurm) workload manager. It shows job states, reservations, and node metrics.
Prometheus: an open source monitoring solution that is also widely used outside the HPC space.
Telegraf: an open source agent for collecting system metrics.
InfluxDB: a time series data platform.
Netdata: provides real-time node-level performance metrics, but does not retain much history.
XDMoD: provides historical performance data for jobs, queues, and resources.
XDMoD SUPReMM: an XDMoD extension that adds job-level performance data.
Ganglia: not real-time, but provides node-level performance metrics and history. Note that in 2019 the project was looking for maintainers.
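
To make the idea of node-level metrics concrete, here is a minimal sketch of how a node could render its load averages in the Prometheus text exposition format, which a Prometheus server can scrape and Grafana can then chart. The metric name `cluster_node_load_average`, the label names, and the function `render_load_metrics` are hypothetical choices for illustration and do not come from any of the tools listed above. In practice an existing exporter or agent would usually do this job.

```python
import os
import socket


def render_load_metrics(load1, load5, load15, node):
    """Render three load averages as Prometheus text exposition lines.

    The metric and label names are illustrative assumptions, not a standard.
    The node name is assumed to contain no quotes or backslashes.
    """
    lines = [
        "# HELP cluster_node_load_average System load average per window.",
        "# TYPE cluster_node_load_average gauge",
    ]
    for window, value in (("1m", load1), ("5m", load5), ("15m", load15)):
        lines.append(
            f'cluster_node_load_average{{node="{node}",window="{window}"}} {value:.2f}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    print(render_load_metrics(*os.getloadavg(), node=socket.gethostname()), end="")
```

Run on a compute node, the script prints one gauge sample per load window. Serving that text over HTTP for a scraper is left out to keep the sketch short.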

Last update: Feb 02, 2023