High CPU Usage Alert – TrueScan Host (Datadog)
Category: Monitoring → Infrastructure → CPU Alerts
Applicable Systems: All TrueScan Windows Hosts monitored via Datadog Agent
1. Overview
This KB provides troubleshooting and resolution steps for High CPU Usage alerts triggered in Datadog for TrueScan hosts.
Metrics Reference:
- system.cpu.user
- system.cpu.system
- system.cpu.idle
- process.cpu.total_pct
2. Alert Description
Alert Name: High CPU Usage – TrueScan Host
Example Host: TrueScan-1-1-7
A High CPU alert indicates that the host is spending most of its time executing tasks (low system.cpu.idle), leaving little headroom for additional work.
3. Business Impact
If not addressed promptly, high CPU may cause:
- Sluggish application performance
- Delayed background services
- Timeout errors
- Service degradation
- Possible system crash (if sustained)
4. Troubleshooting Procedure
Step 1: Identify Affected Host
1. Log in to Datadog.
2. Navigate to Infrastructure → Hosts.
3. Search for the affected host.
4. Open Host Details.
Step 2: Review CPU Metrics
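Check the following metrics over the alert window:
- system.cpu.user (high values point to application workload)
- system.cpu.system (high values point to kernel, driver, or I/O activity)
- system.cpu.idle (low values confirm the host is saturated)
- process.cpu.total_pct (identifies the individual process responsible)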
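The sketch below shows one way to read a single sample of these metrics. It treats everything that is not idle as busy time and reports whether user or system time dominates. The function name summarize_cpu and the 80% cut-off (borrowed from the alert threshold in Preventive Measures) are illustrative assumptions, not part of Datadog.

```python
def summarize_cpu(user: float, system: float, idle: float) -> dict:
    """Summarize one sample of CPU percentages (each in the range 0-100).

    Hypothetical helper for illustration; the 80% cut-off is an assumption.
    """
    # Busy time is everything that is not idle (user, system, and any
    # remaining categories such as interrupt time).
    busy = 100.0 - idle

    if busy < 80.0:
        dominant = "none"      # host is not saturated
    elif user >= system:
        dominant = "user"      # application code is the main consumer
    else:
        dominant = "system"    # kernel, driver, or I/O work dominates

    return {"busy_pct": round(busy, 1), "dominant": dominant}


# Example values are made up for the demonstration.
print(summarize_cpu(user=72.5, system=14.0, idle=11.5))
```

A "user" result suggests starting with the application processes in Step 3, while a "system" result suggests looking at drivers, antivirus scans, or heavy disk and network activity first.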
Step 3: Check Live Processes
1. Navigate to Infrastructure → Live Processes.
2. Filter by affected host.
3. Sort by CPU usage.
4. Identify top CPU-consuming process.
Step 4: Validate via Server (If Required)
For Linux:
top -o %CPU
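For Windows (PowerShell):
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
Note: the CPU column is cumulative processor time in seconds since the process started, not current utilization.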
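Cumulative CPU seconds, as shown in the CPU column of Get-Process, can be turned into a utilization figure by sampling the same process twice and dividing the difference by the elapsed wall-clock time across all logical cores. The following is a minimal sketch of that arithmetic; the function name cpu_percent and the sample numbers are assumptions for illustration only.

```python
def cpu_percent(cpu_seconds_start: float,
                cpu_seconds_end: float,
                interval_seconds: float,
                logical_cores: int) -> float:
    """Convert two cumulative CPU-time readings into average utilization.

    Returns the share of total host CPU capacity (0-100) that the process
    used between the two readings. Hypothetical helper for illustration.
    """
    if interval_seconds <= 0 or logical_cores <= 0:
        raise ValueError("interval_seconds and logical_cores must be positive")

    cpu_time_used = cpu_seconds_end - cpu_seconds_start
    capacity = interval_seconds * logical_cores  # CPU-seconds available
    return 100.0 * cpu_time_used / capacity


# A process went from 1200 s to 1216 s of CPU time in 10 s on a 4-core host.
print(cpu_percent(1200.0, 1216.0, interval_seconds=10, logical_cores=4))
```

A process with a large cumulative value but a near-zero result here is long-running rather than currently busy, so it is unlikely to be the cause of the alert.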
5. Root Cause Scenarios & Resolution
Runaway Process → Restart or stop the high-CPU process
Traffic Spike → Validate workload increase
Recent Deployment → Roll back or review the change
JVM / Container Misconfiguration → Tune memory & CPU limits
Scheduled Task Overload → Reschedule or optimize task
6. Resolution Steps
- Restart affected application service
- Stop unnecessary background jobs
- Verify system patch level
- Escalate to application team if required
- Monitor CPU for 15–30 minutes post-resolution
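The post-resolution monitoring step can be expressed as a simple check: CPU readings must cover at least the minimum observation window and stay below the alert threshold throughout. This is a sketch under stated assumptions; the function name is_stable, the sample format (minutes since the fix, busy percentage), and the 80% default are hypothetical.

```python
def is_stable(samples, threshold_pct=80.0, min_minutes=15):
    """Return True if CPU stayed below the threshold for the whole window.

    samples: list of (minutes_since_fix, busy_pct) tuples in time order.
    Hypothetical helper for illustration; defaults are assumptions.
    """
    if not samples:
        return False

    # The readings must span at least the minimum observation window.
    covered_minutes = samples[-1][0] - samples[0][0]
    if covered_minutes < min_minutes:
        return False

    # Every reading must be below the alert threshold.
    return all(busy_pct < threshold_pct for _, busy_pct in samples)


# Made-up readings taken every 5 minutes after a service restart.
readings = [(0, 45.0), (5, 52.0), (10, 48.0), (15, 50.0)]
print(is_stable(readings))
```

If the check fails because of a new spike, the issue should be treated as unresolved and escalated as described in the Escalation Matrix.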
7. Escalation Matrix
Application related → Application Team
Infrastructure issue → Infra Team
Repeated alerts → Service Lead / Engineering
8. Preventive Measures
- Configure the CPU alert to trigger when usage stays above 80% for 5 minutes
- Enable process-level monitoring
- Review system sizing
- Implement capacity planning
- Analyze recurring spike patterns
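The 80% for 5 minutes threshold above can be written as a Datadog metric monitor query on system.cpu.idle, since CPU usage above 80% corresponds to idle time below 20%. The helper below only builds the query string as a sketch; the function name build_cpu_monitor_query is hypothetical, and the host tag value is an assumption that should be checked against the tags actually reported for the host.

```python
def build_cpu_monitor_query(host: str,
                            threshold_pct: int = 80,
                            window: str = "last_5m") -> str:
    """Build a metric monitor query that fires on sustained high CPU.

    Hypothetical helper for illustration. The query averages
    system.cpu.idle over the window and alerts when it drops below
    (100 - threshold_pct).
    """
    idle_floor = 100 - threshold_pct
    return f"avg({window}):avg:system.cpu.idle{{host:{host}}} < {idle_floor}"


# The host name is the example host from this KB; the tag format is assumed.
print(build_cpu_monitor_query("TrueScan-1-1-7"))
```

Averaging over the window rather than alerting on a single reading keeps short bursts from paging the on-call engineer while still catching sustained saturation.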