How to Tune CPU, Memory, and Disk Alerts Without Missing Real Incidents

CPU, memory, and disk alerts are the easiest alerts to create and the easiest to get wrong. Static thresholds without duration or context create noise.

Quick answer: Use duration, consecutive samples, resource-specific thresholds, and separate severity rules for production impact. A one-minute spike should not behave like sustained user impact.

CPU

CPU alerts should separate short spikes from sustained contention. A batch server and a web checkout server do not deserve identical severity rules.

Warning: CPU > 85% for 15 minutes
Critical: CPU > 95% for 10 minutes AND production service impact
Ticket: only if owner and runbook exist
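
The rules above can be sketched as a small evaluator. This is a minimal illustration, not a vendor implementation: it assumes one CPU sample per minute, and the names `Rule`, `sustained`, and `evaluate_cpu`, plus the `prod_impact` flag, are hypothetical stand-ins for whatever your monitoring tool provides.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Rule:
    severity: str
    threshold_pct: float
    duration_min: int                  # number of consecutive one-minute samples
    requires_prod_impact: bool = False


# Thresholds and durations taken from the rules in the text; checked in order.
RULES: List[Rule] = [
    Rule("critical", 95.0, 10, requires_prod_impact=True),
    Rule("warning", 85.0, 15),
]


def sustained(samples: Sequence[float], threshold: float, duration_min: int) -> bool:
    """True only if every one of the most recent `duration_min` samples exceeds the threshold."""
    if len(samples) < duration_min:
        return False
    return all(s > threshold for s in samples[-duration_min:])


def evaluate_cpu(samples: Sequence[float], prod_impact: bool) -> Optional[str]:
    """Return the first matching severity, or None when nothing is sustained."""
    for rule in RULES:
        if rule.requires_prod_impact and not prod_impact:
            continue
        if sustained(samples, rule.threshold_pct, rule.duration_min):
            return rule.severity
    return None


# A single one-minute spike after a quiet period raises nothing.
spike = [50.0] * 14 + [99.0]
print(evaluate_cpu(spike, prod_impact=False))
```

Requiring every sample in the window to exceed the threshold is the strictest reading of "for 15 minutes". Some tools allow "N of the last M samples" instead, which tolerates a single dip during real contention.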

Memory

Memory alerts need OS-specific interpretation. High used memory is not automatically bad if cache behavior is normal. Focus on available memory, paging, swap, process growth, or application errors.
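
As one OS-specific illustration, the sketch below assumes a Linux host and reads `/proc/meminfo`-style text, comparing "used" memory (which counts cache) with `MemAvailable`. The function names and the 10% available-memory threshold are placeholder assumptions, not recommended values.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_pressure(info: Dict[str, int], min_available_pct: float = 10.0) -> dict:
    """Judge pressure from MemAvailable rather than from 'used' memory."""
    total = info["MemTotal"]
    used_pct = 100.0 * (total - info["MemFree"]) / total
    available_pct = 100.0 * info["MemAvailable"] / total
    return {
        "used_pct": used_pct,
        "available_pct": available_pct,
        "under_pressure": available_pct < min_available_pct,
    }


# Illustrative values: 95% "used", but most of it is reclaimable cache.
sample = """MemTotal:       16000000 kB
MemFree:          800000 kB
MemAvailable:    9600000 kB
"""
print(memory_pressure(parse_meminfo(sample)))
```

On a real Linux host the text would come from reading `/proc/meminfo`. Paging, swap activity, and per-process growth would need separate checks.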

Disk

Disk alerting should use both percent free and absolute free space. Ten percent free on a 50 GB volume is different from ten percent free on a 20 TB volume.

Volume type               Warning                        Critical
OS disk                   < 15% free and < 10 GB free    < 8% free or < 5 GB free
Database/log volume       Growth trend abnormal          Projected full before next support window
Temporary/cache volume    Only if app impact exists      Service degradation or failed jobs
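
The OS disk row and the "projected full" rule can be sketched as follows. This is a rough illustration under stated assumptions: GB is treated as 1024³ bytes, growth is projected linearly, and `classify_os_disk`, `projected_full_before`, and `check_path` are hypothetical names. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil
from typing import Optional

GB = 1024 ** 3  # assumption: "GB" in the table is read as GiB


def classify_os_disk(total_bytes: int, free_bytes: int) -> Optional[str]:
    """Apply the OS disk thresholds: percent free combined with absolute free space."""
    free_pct = 100.0 * free_bytes / total_bytes
    free_gb = free_bytes / GB
    if free_pct < 8 or free_gb < 5:        # critical: either condition
        return "critical"
    if free_pct < 15 and free_gb < 10:     # warning: both conditions
        return "warning"
    return None


def projected_full_before(free_bytes: int, growth_bytes_per_hour: float,
                          hours_to_next_window: float) -> bool:
    """Linear projection: will the volume fill before the next support window?"""
    if growth_bytes_per_hour <= 0:
        return False
    return free_bytes / growth_bytes_per_hour < hours_to_next_window


def check_path(path: str = "/") -> Optional[str]:
    """Classify a mounted volume using live numbers from the operating system."""
    usage = shutil.disk_usage(path)
    return classify_os_disk(usage.total, usage.free)


# Ten percent free means different things on different volumes.
print(classify_os_disk(50 * GB, 5 * GB))                 # 50 GB volume, 5 GB free
print(classify_os_disk(20 * 1024 * GB, 2 * 1024 * GB))   # 20 TB volume, 2 TB free
```

With these rules the 50 GB volume raises a warning while the 20 TB volume stays quiet, even though both have ten percent free.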

When to create incidents

Create an incident when there is an action to take. When a condition only needs observation, record it as an event or alert instead. This distinction keeps operators from becoming human filters for poorly designed monitors.

Tuning workflow

  1. Pull the last 30 days of CPU, memory, and disk incidents.
  2. Mark incidents that were closed with no action.
  3. Mark incidents that self-cleared in under 10 minutes.
  4. Raise the duration threshold on monitors that produce self-clearing noise.
  5. Add runbook links to the remaining actionable alerts.

Measure: Track duplicate count, self-clear count, and incidents requiring actual remediation.
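
The workflow and its measurements could be computed from an incident export along these lines. This is a sketch under assumptions: the `Incident` fields are hypothetical and would need mapping to your ticketing tool's columns, a "duplicate" is taken to mean a repeat incident from the same monitor within the review window, and "self-cleared" means closed with no action in under 10 minutes.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    monitor: str         # hypothetical key, e.g. "host:metric"
    opened: datetime
    closed: datetime
    action_taken: bool   # True if someone actually remediated something


def summarize(incidents: List[Incident],
              self_clear_limit: timedelta = timedelta(minutes=10)) -> dict:
    """Count the noise categories from the tuning workflow."""
    no_action = [i for i in incidents if not i.action_taken]
    self_cleared = [i for i in no_action if i.closed - i.opened < self_clear_limit]
    per_monitor = Counter(i.monitor for i in incidents)
    noisy = Counter(i.monitor for i in self_cleared)
    return {
        "total": len(incidents),
        "closed_no_action": len(no_action),
        "self_cleared": len(self_cleared),
        "duplicates": sum(n - 1 for n in per_monitor.values()),
        "remediated": sum(1 for i in incidents if i.action_taken),
        # Monitors ranked by self-clear count: candidates for a longer duration.
        "noisiest_monitors": noisy.most_common(),
    }


t0 = datetime(2024, 1, 1)
history = [
    Incident("web01:cpu", t0, t0 + timedelta(minutes=5), False),
    Incident("web01:cpu", t0 + timedelta(hours=1), t0 + timedelta(hours=1, minutes=4), False),
    Incident("db01:disk", t0, t0 + timedelta(hours=2), True),
    Incident("app01:mem", t0, t0 + timedelta(minutes=30), False),
]
print(summarize(history))
```

Running the same summary before and after a threshold change gives a simple check on whether the tuning reduced noise without reducing the remediated count.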

About the author

Jason Purvis works in enterprise monitoring and IT operations, with hands-on experience across ServiceNow ITOM/Event Management, SolarWinds-style infrastructure monitoring, Microsoft 365 operations, alert routing, and incident process improvement.