High CPU Usage Alert – TrueScan Host (Datadog)

Category: Monitoring → Infrastructure → CPU Alerts
Applicable Systems: All TrueScan Windows Hosts monitored via Datadog Agent

1. Overview
This KB provides troubleshooting and resolution steps for High CPU Usage alerts triggered in Datadog for TrueScan hosts.

Metrics Reference:
- system.cpu.user
- system.cpu.system
- system.cpu.idle
- process.cpu.total_pct

2. Alert Description
Alert Name: High CPU Usage – TrueScan Host
Example Host: TrueScan-1-1-7

A High CPU alert means the host is spending most of its CPU time executing work (system.cpu.idle is low), leaving little headroom for additional load.

3. Business Impact
If not addressed promptly, high CPU may cause:
- Sluggish application performance
- Delayed background services
- Timeout errors
- Service degradation
- Possible system crash (if sustained)

4. Troubleshooting Procedure

Step 1: Identify Affected Host
1. Log in to Datadog.
2. Navigate to Infrastructure → Hosts.
3. Search for the affected host.
4. Open Host Details.

Step 2: Review CPU Metrics
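Review the following over the alert window:
- system.cpu.user: percentage of CPU time spent in user-space (application) code
- system.cpu.system: percentage of CPU time spent in the kernel
- system.cpu.idle: percentage of CPU time idle; overall utilization is roughly 100 minus this value
- process.cpu.total_pct: CPU usage attributed to individual processes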
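
As a rough illustration of how these metrics relate, the sketch below derives overall utilization from the idle percentage and reports whether user or system time dominates. The function names and the sample values are hypothetical, and it assumes all three inputs are percentages on a 0 to 100 scale for the same time window.

```python
def cpu_busy_pct(idle_pct: float) -> float:
    """Overall CPU utilization implied by system.cpu.idle (percent)."""
    if not 0.0 <= idle_pct <= 100.0:
        raise ValueError("idle_pct must be between 0 and 100")
    return 100.0 - idle_pct


def summarize_cpu(user_pct: float, system_pct: float, idle_pct: float) -> dict:
    """Break a CPU reading into busy, other, and dominant components."""
    busy = cpu_busy_pct(idle_pct)
    # Time that is neither user, system, nor idle (for example interrupts).
    other = max(0.0, busy - user_pct - system_pct)
    dominant = "user" if user_pct >= system_pct else "system"
    return {"busy_pct": busy, "other_pct": other, "dominant": dominant}


if __name__ == "__main__":
    # Hypothetical reading: 70% user, 15% system, 10% idle.
    print(summarize_cpu(70.0, 15.0, 10.0))
```

A user-dominant reading usually points toward application code, while a system-dominant reading more often suggests kernel, driver, or I/O activity; treat this as a starting hint rather than a diagnosis.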

Step 3: Check Live Processes
1. Navigate to Infrastructure → Live Processes.
2. Filter by affected host.
3. Sort by CPU usage.
4. Identify top CPU-consuming process.

Step 4: Validate via Server (If Required)
For Linux:
top -o %CPU

For Windows (PowerShell):
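Get-Process | Sort-Object CPU -Descending | Select-Object -First 10
Note: the CPU column is cumulative processor time in seconds since each process started, not current utilization.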
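
Because the CPU column of Get-Process reports cumulative CPU seconds, a long-running process can top the list without being busy right now. One way to estimate current usage is to take two samples a few seconds apart and compare them. The sketch below shows only the arithmetic; the function name, the input shape (process ID mapped to cumulative CPU seconds), and the sample numbers are assumptions for illustration.

```python
def cpu_pct_between_samples(before, after, interval_s, logical_cores):
    """Estimate per-process CPU % from two cumulative CPU-seconds samples.

    before / after: dict mapping process ID -> cumulative CPU seconds.
    Returns (pid, pct) pairs sorted from highest to lowest usage, where
    100% means every logical core was busy for the whole interval.
    """
    if interval_s <= 0 or logical_cores <= 0:
        raise ValueError("interval_s and logical_cores must be positive")
    usage = {}
    for pid, cpu_after in after.items():
        if pid not in before:
            continue  # process started mid-interval; no baseline to compare
        delta_s = cpu_after - before[pid]
        usage[pid] = 100.0 * delta_s / (interval_s * logical_cores)
    return sorted(usage.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical samples taken 10 seconds apart on a 4-core host.
    first = {100: 50.0, 200: 1000.0}
    second = {100: 58.0, 200: 1001.0, 300: 2.0}
    for pid, pct in cpu_pct_between_samples(first, second, 10, 4):
        print(pid, pct)
```

In this made-up sample, process 200 has far more cumulative CPU time, yet process 100 is the one consuming CPU during the interval.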

5. Root Cause Scenarios & Resolution

Runaway Process → Restart or stop the high-CPU process
Traffic Spike → Confirm the workload increase is expected and review capacity if it persists
Recent Deployment → Rollback or review change
JVM / Container Misconfiguration → Tune memory & CPU limits
Scheduled Task Overload → Reschedule or optimize task

6. Resolution Steps
- Restart affected application service
- Stop unnecessary background jobs
- Verify system patch level
- Escalate to application team if required
- Monitor CPU for 15–30 minutes post-resolution
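
The post-resolution monitoring step can be expressed as a simple check. This is a minimal sketch, assuming one CPU utilization sample per minute collected after the fix; the function name is hypothetical, and the 80% threshold is borrowed from the alert threshold in the Preventive Measures section rather than being a stated recovery criterion.

```python
def cpu_recovered(busy_pct_per_minute, threshold_pct=80.0, min_minutes=15):
    """Return True once CPU has stayed below the threshold long enough.

    busy_pct_per_minute: utilization samples (percent), one per minute,
    collected after the resolution step was applied.
    """
    if len(busy_pct_per_minute) < min_minutes:
        return False  # not enough observation time yet
    # Every sample in the observation period must be below the threshold.
    return max(busy_pct_per_minute) < threshold_pct


if __name__ == "__main__":
    # Hypothetical readings for the 15 minutes after a service restart.
    readings = [42.0, 38.5, 40.1] * 5
    print(cpu_recovered(readings))
```

Passing min_minutes=30 applies the stricter end of the 15 to 30 minute range.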

7. Escalation Matrix
Application related → Application Team
Infrastructure issue → Infra Team
Repeated alerts → Service Lead / Engineering

8. Preventive Measures
- Set the CPU alert to trigger when utilization stays above 80% for 5 minutes
- Enable process-level monitoring
- Review system sizing
- Implement capacity planning
- Analyze recurring spike patterns
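
To illustrate the threshold and spike-pattern items above, the sketch below finds sustained breaches in exported CPU samples and counts them by hour of day. It interprets "80% for 5 minutes" as five consecutive one-minute samples above 80%, which is an assumption; a Datadog monitor may aggregate the window differently (for example, by averaging). The function names and the input format are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def sustained_breaches(samples, threshold_pct=80.0, window_minutes=5):
    """Return the start time of each sustained breach.

    samples: time-ordered list of (datetime, busy_pct), one per minute.
    A breach is a run of at least window_minutes consecutive samples
    above threshold_pct; each run is reported once, at its first sample.
    """
    starts = []
    run_start = None
    run_len = 0
    for ts, busy in samples:
        if busy > threshold_pct:
            if run_len == 0:
                run_start = ts
            run_len += 1
            if run_len == window_minutes:
                starts.append(run_start)
        else:
            run_len = 0
    return starts


def breaches_by_hour(starts):
    """Count breach start times by hour of day to expose recurring spikes."""
    return Counter(ts.hour for ts in starts)


if __name__ == "__main__":
    # Hypothetical one-minute samples starting at 02:00.
    base = datetime(2024, 1, 1, 2, 0)
    values = [50, 85, 90, 95, 88, 91, 60, 85, 85, 50]
    samples = [(base + timedelta(minutes=i), v) for i, v in enumerate(values)]
    print(breaches_by_hour(sustained_breaches(samples)))
```

If most breaches fall in the same hour across several days, a scheduled task or batch job running at that time is a reasonable first suspect.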