Datadog Announces "GPU Monitoring" to Support Cost Optimization and Performance Improvement as AI Projects Scale
Datadog has launched "GPU Monitoring" globally to help organizations manage the rising costs and performance challenges of AI infrastructure by providing unified visibility across the entire AI stack.
📋 Article Processing Timeline
- 📰 Published: April 24, 2026 at 20:00
- 🔍 Collected: April 24, 2026 at 11:31
- 🤖 AI Analyzed: April 25, 2026 at 02:34 (15h 3m after Collected)
NEW YORK – Datadog, Inc. (NASDAQ: DDOG), the leading observability and security platform for cloud applications, today announced the general availability of GPU Monitoring for customers worldwide. This product addresses one of the most common challenges currently faced by organizations seeking scalable and effective management methods to cope with expanding AI costs.
"GPU instances account for 14% of compute costs, representing a significant challenge as organizations strive to build AI-first technologies scalably and efficiently," said Yanbin Li, Chief Product Officer at Datadog. "While many companies recognize increasing costs, they are unable to allocate GPU costs by business unit, lack context for workloads, or cannot identify clear next steps for improvement. Consequently, proper budgeting and planning have become extremely difficult."
The launch of GPU Monitoring provides integrated visibility across the entire AI stack for the first time in a single solution. This enables teams to monitor the health, cost, and performance of GPU fleets in connection with the departments and members using them on a single screen, allowing for rapid troubleshooting of underperforming workloads and cost reduction.
Li continued, "When situations such as misallocated capacity, stalled training and inference workloads, or rising costs occur, proper management of AI costs becomes a boardroom priority. Everyone recognizes that managing GPU costs is a major issue to solve, but many companies are still in the trial-and-error stage, making it very difficult to understand what is happening across the stack in a single view. GPU Monitoring solves this challenge with unprecedented efficiency and reliability."
Current GPU tools provide high-level metrics regarding device health but fail to reveal cross-departmental resource contention issues, explain why training or inference workloads fail, or visualize which devices are idle or used inefficiently. This lack of visibility leads to time-consuming investigations, and development departments tend to over-provision resources as a safety measure, resulting in wasted costs.
GPU Monitoring streamlines this process by directly linking GPU fleet telemetry to the workloads consuming those resources. It also provides a shared view for platform engineering and machine learning teams to collaborate on investigations, enabling the following:
- Scale AI while controlling excessive costs: Visibility and forecasting based on GPU usage patterns, along with specific guidelines for deciding whether to purchase new GPUs or release existing resources, allow platform teams to avoid expensive capital investments or long procurement processes. ML teams can secure necessary capacity faster, and executives can achieve higher ROI under predictable spending.
- Accelerate AI implementation and deployment: By directly associating stalled workloads with the underlying GPUs, Pods, and processes, teams can identify performance bottlenecks in minutes instead of hours, allowing engineers to focus on delivering AI projects.
- Avoid costly failures: Proactively identify unhealthy GPUs and address them before failures ripple across the cluster and delay training or inference.
- Maximize ROI of GPU costs: Teams take responsibility for GPU utilization and costs, easily identifying where over-provisioning or under-utilization occurs. This enables resource reclamation and reallocation, reducing wasteful spending.
"Datadog GPU Monitoring has made it easy for us to understand the status of our multi-tenant GPU infrastructure," said Kai Fan, Head of Product at Hyperbolic. "Without additional configuration, we can immediately visualize core utilization, memory, power consumption, and temperature at the instance and device level. The dashboards are comprehensive out of the box and easy to customize, allowing us to build isolated screens for each customer in minutes."
"GPU instances account for 14% of compute costs, representing a significant challenge as organizations strive to build AI-first technologies scalably and efficiently," said Yanbin Li, Chief Product Officer at Datadog. "While many companies recognize increasing costs, they are unable to allocate GPU costs by business unit, lack context for workloads, or cannot identify clear next steps for improvement. Consequently, proper budgeting and planning have become extremely difficult."
The launch of GPU Monitoring provides integrated visibility across the entire AI stack for the first time in a single solution. This enables teams to monitor the health, cost, and performance of GPU fleets in connection with the departments and members using them on a single screen, allowing for rapid troubleshooting of underperforming workloads and cost reduction.
Li continued, "When situations such as misallocated capacity, stalled training and inference workloads, or rising costs occur, proper management of AI costs becomes a boardroom priority. Everyone recognizes that managing GPU costs is a major issue to solve, but many companies are still in the trial-and-error stage, making it very difficult to understand what is happening across the stack in a single view. GPU Monitoring solves this challenge with unprecedented efficiency and reliability."
Current GPU tools provide high-level metrics regarding device health but fail to reveal cross-departmental resource contention issues, explain why training or inference workloads fail, or visualize which devices are idle or used inefficiently. This lack of visibility leads to time-consuming investigations, and development departments tend to over-provision resources as a safety measure, resulting in wasted costs.
GPU Monitoring streamlines this process by directly linking GPU fleet telemetry to the workloads consuming those resources. It also provides a shared view for platform engineering and machine learning teams to collaborate on investigations, enabling the following:
- Scale AI while controlling excessive costs: Visibility and forecasting based on GPU usage patterns, along with specific guidelines for deciding whether to purchase new GPUs or release existing resources, allow platform teams to avoid expensive capital investments or long procurement processes. ML teams can secure necessary capacity faster, and executives can achieve higher ROI under predictable spending.
- Accelerate AI implementation and deployment: By directly associating stalled workloads with the underlying GPUs, Pods, and processes, teams can identify performance bottlenecks in minutes instead of hours, allowing engineers to focus on delivering AI projects.
- Avoid costly failures: Proactively identify unhealthy GPUs and address them before failures ripple across the cluster and delay training or inference.
- Maximize ROI of GPU costs: Teams take responsibility for GPU utilization and costs, easily identifying where over-provisioning or under-utilization occurs. This enables resource reclamation and reallocation, reducing wasteful spending.
"Datadog GPU Monitoring has made it easy for us to understand the status of our multi-tenant GPU infrastructure," said Kai Fan, Head of Product at Hyperbolic. "Without additional configuration, we can immediately visualize core utilization, memory, power consumption, and temperature at the instance and device level. The dashboards are comprehensive out of the box and easy to customize, allowing us to build isolated screens for each customer in minutes."