How to Tune CPU, Memory, and Disk Alerts Without Missing Real Incidents
CPU, memory, and disk alerts are the easiest alerts to create and the easiest to get wrong. A static threshold with no duration or context fires on every harmless spike, and the result is noise.
CPU
CPU alerts should separate short spikes from sustained contention. A batch server and a web checkout server do not deserve identical severity rules.
Warning: CPU > 85% for 15 minutes
Critical: CPU > 95% for 10 minutes AND production service impact
Ticket: only if owner and runbook exist
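The rules above can be sketched as a duration check over recent samples. This is a minimal illustration, not a reference implementation: it assumes one CPU sample per minute, and the names `Rule`, `sustained`, `cpu_severity`, and the `service_impacted` flag are hypothetical stand-ins for whatever your monitoring tool provides.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    threshold_pct: float  # CPU must stay above this value...
    duration_min: int     # ...for this many consecutive one-minute samples


# Thresholds taken from the rules above.
WARNING = Rule(threshold_pct=85.0, duration_min=15)
CRITICAL = Rule(threshold_pct=95.0, duration_min=10)


def sustained(samples: Sequence[float], rule: Rule) -> bool:
    """True if the most recent `duration_min` samples all exceed the threshold."""
    if len(samples) < rule.duration_min:
        return False
    window = samples[-rule.duration_min:]
    return all(value > rule.threshold_pct for value in window)


def cpu_severity(samples: Sequence[float], service_impacted: bool) -> Optional[str]:
    """Return 'critical', 'warning', or None for a series of per-minute CPU %."""
    # Critical needs both sustained load and confirmed production impact.
    if sustained(samples, CRITICAL) and service_impacted:
        return "critical"
    if sustained(samples, WARNING):
        return "warning"
    return None


# A one-minute spike to 99% after a quiet period raises nothing.
print(cpu_severity([50.0] * 14 + [99.0], service_impacted=True))
# Fifteen minutes at 90% is a warning.
print(cpu_severity([90.0] * 15, service_impacted=False))
```

Because every sample in the window must exceed the threshold, a single dip resets the condition. Some tools use an average over the window instead, which behaves differently on bursty workloads.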
Memory
Memory alerts need OS-specific interpretation. High used memory is not automatically a problem: operating systems fill otherwise idle memory with cache and release it under pressure. Focus on available memory, paging, swap use, process growth, and application errors.
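As one OS-specific illustration, Linux reports a `MemAvailable` field in `/proc/meminfo` that estimates how much memory can be used without swapping. The sketch below parses text in that format and compares a naive "used" figure with the available fraction. The sample values are made up, and the helper names are hypothetical.

```python
from typing import Dict

# Made-up sample in the /proc/meminfo format (values in kB).
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:    9830400 kB
Cached:          9000000 kB
SwapTotal:       2097152 kB
SwapFree:        2097152 kB
"""


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value} (kB for sized fields)."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def naive_used_fraction(info: Dict[str, int]) -> float:
    """Counts cache as 'used', which overstates memory pressure."""
    return 1 - info["MemFree"] / info["MemTotal"]


def available_fraction(info: Dict[str, int]) -> float:
    """Share of memory the kernel estimates is usable without swapping."""
    return info["MemAvailable"] / info["MemTotal"]


def swap_used_kb(info: Dict[str, int]) -> int:
    return info["SwapTotal"] - info["SwapFree"]


info = parse_meminfo(SAMPLE_MEMINFO)
print(f"naive used: {naive_used_fraction(info):.1%}")
print(f"available:  {available_fraction(info):.1%}")
print(f"swap used:  {swap_used_kb(info)} kB")
```

On this sample the naive figure is about 97% used, yet about 60% of memory is still available and no swap is in use. An alert keyed to the naive figure would fire on a healthy host. On a real system, read the text from `/proc/meminfo` instead of the sample string.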
Disk
Disk alerting should use both percent free and absolute free space. Ten percent free on a 50 GB volume is 5 GB, which a single log burst can consume; ten percent free on a 20 TB volume is 2 TB, which may be months of headroom.
| Volume type | Warning | Critical |
|---|---|---|
| OS disk | < 15% and < 10 GB | < 8% or < 5 GB |
| Database/log volume | Growth trend abnormal | Projected full before next support window |
| Temporary/cache volume | Only if app impact exists | Service degradation or failed jobs |
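The OS disk row and the "projected full" rule in the table above can be sketched as follows. This is an illustration under stated assumptions: the projection assumes linear growth, and the function names and the `hours_to_next_window` parameter are hypothetical.

```python
from typing import Optional


def os_disk_severity(total_gb: float, free_gb: float) -> Optional[str]:
    """Apply the OS disk row of the table: percent free plus absolute free space."""
    free_pct = 100.0 * free_gb / total_gb
    # Critical: < 8% free OR < 5 GB free.
    if free_pct < 8 or free_gb < 5:
        return "critical"
    # Warning: < 15% free AND < 10 GB free.
    if free_pct < 15 and free_gb < 10:
        return "warning"
    return None


def hours_until_full(free_gb: float, growth_gb_per_hour: float) -> Optional[float]:
    """Linear projection; None when the volume is not growing."""
    if growth_gb_per_hour <= 0:
        return None
    return free_gb / growth_gb_per_hour


def full_before_window(free_gb: float, growth_gb_per_hour: float,
                       hours_to_next_window: float) -> bool:
    """True if the volume is projected to fill before the next support window."""
    eta = hours_until_full(free_gb, growth_gb_per_hour)
    return eta is not None and eta < hours_to_next_window


# Ten percent free on a 50 GB volume (5 GB) warns; on a 20 TB volume it does not.
print(os_disk_severity(total_gb=50, free_gb=5))
print(os_disk_severity(total_gb=20_000, free_gb=2_000))
# 120 GB free, growing 4 GB/hour, next support window in 48 hours.
print(full_before_window(120, 4, 48))
```

The linear projection is the simplest possible trend model. Volumes with daily or weekly cycles usually need a longer lookback or a seasonal baseline before the projection is trustworthy.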
When to create incidents
Create an incident only when there is an action for someone to take. If a condition only needs observation, record it as an event or alert instead. This distinction keeps operators from becoming human filters for poorly designed monitors.
Tuning workflow
- Pull the last 30 days of CPU, memory, and disk incidents.
- Mark incidents that were closed with no action taken.
- Mark incidents that self-cleared in under 10 minutes.
- Raise the duration threshold on monitors that produce self-clearing noise.
- Add runbook links to the remaining actionable alerts.
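The marking steps in this workflow can be sketched as a small classifier over exported incident records. The `Incident` fields and monitor names here are hypothetical; map them to whatever your ticketing or monitoring export provides.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable

SELF_CLEAR_LIMIT = timedelta(minutes=10)


@dataclass(frozen=True)
class Incident:
    monitor: str        # name of the monitor that raised the incident
    opened: datetime
    closed: datetime
    self_cleared: bool  # closed by the monitor, not by a person
    action_taken: bool  # an operator actually did something


def classify(incident: Incident) -> str:
    """Label an incident as fast self-clearing noise, no-action, or actionable."""
    duration = incident.closed - incident.opened
    if incident.self_cleared and duration < SELF_CLEAR_LIMIT:
        return "self_cleared_fast"
    if not incident.action_taken:
        return "no_action"
    return "actionable"


def noise_by_monitor(incidents: Iterable[Incident]) -> Dict[str, Counter]:
    """Count labels per monitor so the noisiest monitors stand out."""
    counts: Dict[str, Counter] = {}
    for incident in incidents:
        counts.setdefault(incident.monitor, Counter())[classify(incident)] += 1
    return counts


start = datetime(2024, 1, 1, 9, 0)
history = [
    Incident("cpu-web", start, start + timedelta(minutes=4), True, False),
    Incident("cpu-web", start, start + timedelta(minutes=6), True, False),
    Incident("disk-db", start, start + timedelta(hours=2), False, True),
    Incident("mem-app", start, start + timedelta(minutes=45), False, False),
]
for monitor, labels in noise_by_monitor(history).items():
    print(monitor, dict(labels))
```

Monitors dominated by `self_cleared_fast` are candidates for a longer duration threshold. Monitors dominated by `no_action` are candidates for demotion from incident to event.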