Monitoring

Research Computing

What tools can you use to monitor the health of a cluster? Here are some recommendations:

  • Grafana: provides operational dashboards to monitor cluster health.

  • Slurm-web: web frontend and REST API for the [Slurm]({{ site.baseurl }}/docs/schedulers/slurm) workload manager. It shows job states, reservations, and node metrics.

  • Prometheus: open source monitoring and alerting toolkit that is also widely used outside of the HPC space.

  • Telegraf: open source server agent that collects and reports system metrics.

  • InfluxDB: time series data platform for storing and querying metrics.

  • Netdata: provides real-time, node-level performance metrics, but does not keep much history.

  • XDMoD: provides historical performance data for jobs, queues, and resources.

  • XDMoD SUPReMM: an XDMoD extension that also provides job-level performance data.

  • Ganglia: not real-time, but provides node-level performance metrics and history. Note that in 2019 the project was looking for maintainers.
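
Several of the tools above (Prometheus, Telegraf, Netdata) collect node-level metrics and hand them to a store or dashboard. As a rough illustration of what such a metric looks like on the wire, here is a minimal sketch that renders the system load averages in the Prometheus text exposition format. The metric name `node_load_average`, the `node` and `window` labels, and the default node name `node001` are hypothetical, chosen for this example only. A real deployment would normally use an existing exporter or agent rather than hand-written code.

```python
import os


def format_load_metrics(load1, load5, load15, node="node001"):
    """Render three load averages in the Prometheus text exposition format.

    The metric name and label names are hypothetical, not taken from any
    real exporter.
    """
    lines = [
        "# HELP node_load_average System load average over a time window.",
        "# TYPE node_load_average gauge",
    ]
    for window, value in (("1m", load1), ("5m", load5), ("15m", load15)):
        # One sample per line: name{label="value",...} sample_value
        lines.append(
            f'node_load_average{{node="{node}",window="{window}"}} {value:.2f}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    print(format_load_metrics(*os.getloadavg()), end="")
```

Run on a compute node, this prints two comment lines followed by one gauge sample per load window. Serving that text over HTTP is what would let a Prometheus server scrape it, and a Grafana dashboard could then plot the stored series.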


Last update: Feb 02, 2023