Orchestration Dashboard

Enterprise Feature

This feature is available under the ClearML Enterprise plan.

Use the orchestration dashboard to monitor all of your available and in-use compute resources:

  • Available and utilized resources, both as a global total and by category
  • Resource utilization over time
  • Resource history event log
  • Detailed resource performance metrics

Resource Categories and Groups

The orchestration dashboard organizes your workers by category and group. Both are taken from the worker name, which follows the naming policy <category>:<group>:<name>.

When no category is specified, workers are assigned the DEFAULT category.

When no group is specified, workers are assigned to the group named Default Group.
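
To make the naming policy concrete, here is a minimal Python sketch of how a worker name could be split into its category, group, and name. The function parse_worker_name and the WorkerLabel type are hypothetical and not part of the ClearML SDK. How the dashboard interprets names with fewer than three segments is not stated here, so the sketch assumes that missing or empty leading segments count as "not specified".

```python
from typing import NamedTuple

# Fallback values described in the text above.
DEFAULT_CATEGORY = "DEFAULT"
DEFAULT_GROUP = "Default Group"


class WorkerLabel(NamedTuple):
    category: str
    group: str
    name: str


def parse_worker_name(worker_name: str) -> WorkerLabel:
    """Split '<category>:<group>:<name>' and apply the documented defaults.

    Assumption (not confirmed by the documentation): a name with two
    segments is read as '<group>:<name>', and a name with one segment
    is read as '<name>' only.
    """
    parts = worker_name.split(":", 2)  # at most three segments
    if len(parts) == 3:
        category, group, name = parts
    elif len(parts) == 2:
        category, (group, name) = "", parts
    else:
        category, group, name = "", "", parts[0]
    # Empty segments are treated the same as unspecified ones.
    return WorkerLabel(category or DEFAULT_CATEGORY, group or DEFAULT_GROUP, name)
```

For example, under these assumptions parse_worker_name("training:a100:node-1") yields the category training and the group a100, while parse_worker_name("node-1") falls back to DEFAULT and Default Group.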

Current Usage Data

The top of the dashboard displays the current resource availability and utilization counts, giving you an overall picture of the resources available and in use. The Totals section displays available and idle resource counts across all workers. The same counts are also shown per worker category.

The Totals section displays:

  • GPUs - The total number of GPUs in currently running workers out of the total number of GPUs in all provisioned workers, and the number of idle GPUs. GPUs are considered idle when their average utilization falls below 80%.
  • CPUs - The total number of CPUs in currently running workers out of the total number of CPUs in all provisioned workers, and the number of idle CPUs. CPUs are considered idle when their average utilization falls below 30%.
  • Workers - The number of currently running workers out of the total number of provisioned workers (through autoscalers or K8S), and the number of idle workers. Workers are considered idle if all of their GPUs and CPUs are idle or if they are not executing any task.
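
The idle rules above can be sketched in a few lines of Python. This is an illustration only, not ClearML's implementation: the Worker class, its fields, and the helper functions are hypothetical, and the utilization values are assumed to already be averages (the averaging window is not specified here).

```python
from dataclasses import dataclass, field
from typing import List

# Thresholds as stated in the text above (percent).
GPU_IDLE_THRESHOLD = 80.0
CPU_IDLE_THRESHOLD = 30.0


@dataclass
class Worker:
    # Hypothetical record: average utilization (%) per device.
    gpu_utilization: List[float] = field(default_factory=list)
    cpu_utilization: List[float] = field(default_factory=list)
    running_task: bool = False


def idle_gpus(worker: Worker) -> int:
    """Count GPUs whose average utilization is below the GPU threshold."""
    return sum(u < GPU_IDLE_THRESHOLD for u in worker.gpu_utilization)


def idle_cpus(worker: Worker) -> int:
    """Count CPUs whose average utilization is below the CPU threshold."""
    return sum(u < CPU_IDLE_THRESHOLD for u in worker.cpu_utilization)


def is_idle(worker: Worker) -> bool:
    """A worker is idle if all of its GPUs and CPUs are idle,
    or if it is not executing any task."""
    all_devices_idle = (
        idle_gpus(worker) == len(worker.gpu_utilization)
        and idle_cpus(worker) == len(worker.cpu_utilization)
    )
    return all_devices_idle or not worker.running_task
```

Note that the two conditions are joined by "or": a worker that is executing a task still counts as idle when every one of its GPUs and CPUs is below its threshold.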

Category sections display the resource count and utilization for:

  • Workers
  • GPUs
  • CPUs

Hover over any of this data to see the number of currently idle machines.

Use the Event Log to follow worker events: workers being added or removed, and workers becoming idle or busy. Hover over the log to download it (Download) or open the expanded view (Maximize).

Resource Graph

The Resource graph displays resource usage over time. Use the dropdown menu above the graph to set the time span (between 3 hours and 1 month). Hover over the plot to see specific data point values.

Click on a group in the Resource Groups list below the graph to have the graph display usage for that specific group.

When viewing a group's usage, you can select what data to view in the dropdown menu at the top of the plot:

  • Compute Units - Available/Idle CPUs/GPUs
  • Compute Utilization - Average CPU/GPU utilization
  • Available Memory - Total and Free RAM
  • Free Home Storage
  • Network Throughput - Rx/Tx

Resource Groups

The Resource Groups table displays current usage numbers for each group:

  • Worker count - number of workers in the group
  • Average GPU Utilization (%)
  • Average CPU Load (%)
  • Available (total) RAM (GB)
  • Free RAM (GB)
  • Free home disk (GB)
  • Network (Tx/Rx Mbps)

Click Expand to expand the resource group and view the stats of each worker within the group. To filter the table, click Filter on a column and set the filter that appears. To clear all active filters, click Clear filters.

Hover over a worker and click Dot menu to access the worker's information panel.

The table highlights values that cross user-configured thresholds.

Click Tune to define the threshold values.

info

The threshold values applied to the dashboard table affect all workspace users who view the page.

Clicking on a resource group opens the group's info panel and replaces the Overview graph with that resource group's usage history.

The info panel displays the group's:

  • Total GPU count
  • Total CPU count
  • Total Worker RAM
  • Total GPU RAM
  • Aggregate idle time in the last 30 days