Low Available Memory on Host (TrueScan-12)

Overview
An alert was triggered indicating that host TrueScan-12 has low available memory.

The alert is based on the system.mem.pct_usable metric reporting less than 10% usable memory on average over the last 5 minutes.

This condition typically means that running processes are consuming most of the system RAM, leaving insufficient memory for additional workloads or spikes in demand.

Symptoms
·       Slower application performance due to paging or swapping
·       Increased service latency
·       Service crashes or unexpected restarts
·       Background job failures
·       System instability or degraded responsiveness
·       Out-of-memory (OOM) kill events in kernel logs
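To confirm the last symptom, kernel log output can be scanned for OOM kill messages. The sketch below is illustrative only: the function name find_oom_kills and the regular expression are assumptions, and the exact wording of the kernel message varies between kernel versions, so the pattern may need adjusting for this host.

```python
import re

# Assumed pattern: many Linux kernels log "Killed process <pid> (<name>)"
# when the OOM killer acts. Wording differs across kernel versions.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, process_name) tuples for OOM kill lines."""
    kills = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

The input can be any iterable of log lines, for example the captured output of dmesg or journalctl -k split into lines.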

Impact
If not resolved, low memory can impact user-facing applications, backend services, scheduled jobs, monitoring agents, and overall host stability.

Alert Details
Metric: system.mem.pct_usable
Threshold: Less than 10%
Evaluation Window: 5-minute average
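The alert condition can be pictured as a simple average over the evaluation window. The sketch below is not the monitor definition itself. It assumes the metric is reported as a fraction between 0 and 1 (so 10% is 0.10), and the function name is_low_memory is hypothetical.

```python
from statistics import mean

# Assumption: system.mem.pct_usable arrives as a 0-1 fraction, so 10% == 0.10.
THRESHOLD = 0.10


def is_low_memory(samples, threshold=THRESHOLD):
    """Return True when the window's average usable fraction is below threshold.

    `samples` holds the metric values collected during the 5-minute window.
    """
    if not samples:
        # No data in the window: this sketch treats it as "not alerting".
        return False
    return mean(samples) < threshold
```

Because the check uses the window average, a single brief dip below 10% does not trigger the alert on its own, while sustained low values do.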

Initial Troubleshooting Steps

Review Metrics in Datadog
Navigate to Infrastructure → Host Map → Select Host → Metrics Tab.
Review:
- system.mem.pct_usable
- system.mem.total
- system.mem.used
- system.swap.used
- system.processes.mem.rss (Process check, if configured)
- system.processes.mem.real (Process check, if configured)

Check Live Processes
Use Infrastructure → Live Processes and sort by memory usage (RSS) to identify the processes consuming the most memory.

SSH Into the Host
Run the following commands:
·       top -o %MEM
·       htop
·       free -m
·       ps aux --sort=-%mem | head -10
·       vmstat 1 5
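To relate the output of these commands back to the alert threshold, the usable fraction can be approximated on the host from /proc/meminfo. This is a rough sketch, assuming a Linux kernel that exposes MemAvailable. The helper names are hypothetical, and the value may not match the Agent's own calculation exactly.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def usable_fraction(meminfo):
    """Approximate usable memory as MemAvailable / MemTotal (an assumption)."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


# Example usage on the host:
#   with open("/proc/meminfo") as fh:
#       info = parse_meminfo(fh.read())
#   print(f"usable: {usable_fraction(info):.1%}")
```

A result below 0.10 is consistent with the alert condition. A much higher value suggests that memory has already been freed or that the issue is intermittent.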

Common Causes & Resolutions
1.     Memory leak or runaway process: Restart or stop the problematic process.
2.     Memory-intensive workload: Scale vertically (increase RAM) or redistribute workloads.
3.     Concurrent heavy jobs: Stagger or reschedule batch tasks.
4.     Misconfigured JVM/application heap: Tune heap size, buffers, or memory limits.
5.     Insufficient instance size: Resize instance or migrate to higher memory class.
6.     Swap disabled or insufficient: Enable/configure swap (short-term mitigation only).

Preventive Measures
·       Configure early warning alerts at 20–25% usable memory.
·       Implement autoscaling where applicable.
·       Enable process-level memory monitoring.
·       Conduct regular capacity planning reviews.
·       Monitor OOM events and logs.
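The early warning recommendation amounts to a two-tier threshold. The sketch below shows one way to express it, assuming a warning level of 25% (the upper end of the suggested range), the existing 10% critical level, and a metric reported as a 0-1 fraction. The function name classify_usable_memory is hypothetical.

```python
def classify_usable_memory(fraction, warning=0.25, critical=0.10):
    """Map a usable-memory fraction (0-1) to "ok", "warning", or "critical".

    The default thresholds are assumptions taken from this runbook:
    critical below 10%, early warning below 25%.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if fraction < critical:
        return "critical"
    if fraction < warning:
        return "warning"
    return "ok"
```

With these defaults, a host at 18% usable memory is classified as "warning", which leaves time to investigate before the 10% alert fires.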