Help:Performance monitoring


This is a list of performance monitoring tools to help you optimize your jobs and understand their bottlenecks.

Locally written tools

If you want to use any of these tools and they are not available, please ask for instructions on how to get them!

nvidia-ps
view of short-term GPU utilization; an alternative to nvidia-smi, which gives only an instantaneous view
ganglia
web view of per-node and cluster-wide performance metrics
heatmap
web view of GPU performance, job statistics, and Slurm status
squeue-gpu
command line view of GPU performance, job statistics, and Slurm status
gpust
stacked GPU graph for a cluster
cgp
web front end for viewing collectd statistics
slimits
view system-wide QOS limits and usage (beta)
scontrol show assoc_mgr
the gory details of Slurm's internal association manager state (users, associations, and QOS)
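
To illustrate the difference between an instantaneous reading and a short-term view of GPU utilization, here is a minimal Python sketch that averages several nvidia-smi samples. It is not how nvidia-ps is implemented; it only assumes nvidia-smi is on your PATH and uses its --query-gpu and --format=csv options. The helper names (parse_sample, average, sample_gpus) and the default sample count and interval are illustrative assumptions.

```python
import subprocess
import time
from collections import defaultdict

# Ask nvidia-smi for one CSV line per GPU: "<index>, <utilization percent>"
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(text):
    """Turn one nvidia-smi CSV sample into {gpu_index: utilization_percent}."""
    sample = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        try:
            sample[int(fields[0])] = float(fields[1])
        except ValueError:
            continue  # e.g. a GPU that reports utilization as unavailable
    return sample


def average(samples):
    """Average utilization per GPU over a list of samples."""
    readings = defaultdict(list)
    for sample in samples:
        for index, util in sample.items():
            readings[index].append(util)
    return {index: sum(vals) / len(vals) for index, vals in readings.items()}


def sample_gpus(count=10, interval=1.0):
    """Poll nvidia-smi `count` times, `interval` seconds apart (assumed defaults)."""
    samples = []
    for _ in range(count):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        samples.append(parse_sample(result.stdout))
        time.sleep(interval)
    return average(samples)
```

On a GPU node, sample_gpus() would return the mean utilization per GPU over roughly ten seconds, which is usually a better indicator of how busy a job keeps the GPU than a single reading.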

Slurm command line tools

squeue
view pending and current jobs
scontrol
view and modify all parameters for current and pending jobs
sacct
view collected statistics for past jobs
sinfo
view current status of cluster nodes
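
The statistics that sacct reports for past jobs can also be post-processed in a script. The following is a minimal Python sketch, not an official Slurm interface: it assumes a Slurm client is on your PATH and relies on sacct's -j, --parsable2, and --format options. The chosen field list and the helper names (parse_sacct, job_stats) are illustrative assumptions.

```python
import subprocess

# Accounting fields to request; adjust to taste (illustrative selection)
SACCT_FIELDS = ["JobID", "JobName", "Elapsed", "MaxRSS", "State"]


def parse_sacct(text):
    """Parse pipe-delimited `sacct --parsable2` output into a list of dicts.

    The first non-empty line is expected to be the header row.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


def job_stats(job_id):
    """Run sacct for one job and return one dict per job step."""
    cmd = [
        "sacct",
        "-j", str(job_id),
        "--parsable2",
        "--format=" + ",".join(SACCT_FIELDS),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_sacct(result.stdout)
```

For instance, job_stats(12345) (a made-up job ID) would return one dictionary per job step, so the MaxRSS and Elapsed values can be compared against what the job requested.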

Screen-based system performance

  • top
  • atop
  • htop
  • nmon
  • iftop (root only)

Command-line-based system performance

  • iostat
  • netstat
  • ps
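
Because these tools print plain text, their output is easy to capture from a script. As a minimal sketch, the Python below takes one ps snapshot and picks out the processes using the most CPU. It assumes a Linux procps-style ps that accepts -e and -o pid,pcpu,pmem,comm; the helper names (parse_ps, top_cpu, snapshot) are illustrative assumptions.

```python
import os
import subprocess

# One line per process: PID, %CPU, %MEM, command name
PS_CMD = ["ps", "-e", "-o", "pid,pcpu,pmem,comm"]


def parse_ps(text):
    """Parse `ps -e -o pid,pcpu,pmem,comm` output (header line first)."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)     # command names may contain spaces
        if len(parts) < 4:
            continue
        pid, cpu, mem, command = parts
        rows.append({
            "pid": int(pid),
            "cpu": float(cpu),
            "mem": float(mem),
            "command": command.strip(),
        })
    return rows


def top_cpu(rows, n=5):
    """Return the n rows with the highest CPU percentage."""
    return sorted(rows, key=lambda row: row["cpu"], reverse=True)[:n]


def snapshot():
    """Run ps once; LC_ALL=C keeps the decimal separator a dot."""
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(PS_CMD, capture_output=True, text=True,
                            check=True, env=env)
    return parse_ps(result.stdout)
```

Calling top_cpu(snapshot()) on a compute node gives a quick, scriptable counterpart to glancing at top. Note that the %CPU figure from ps is averaged over each process's lifetime, not over the last few seconds.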

Process-based performance and debugging

  • perf