Monitoring Outage Playbook

Use this when the monitoring platform, collector, agent, or integration path appears to be broken.

First 10 minutes

  1. Confirm whether the issue affects one CI (configuration item), one application, one monitoring source, or everything.
  2. Check if users are impacted or if this is monitoring noise only.
  3. Identify the top repeated alert signature, CI, resource, and source tool.
  4. Stop duplicate ticket creation if the flood is operationally unsafe.
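
One way to approach steps 1 and 3 is to group an alert export by signature, CI, and source tool, then look at how many distinct CIs and sources are involved. The following is a minimal sketch, not a definitive implementation: the property names (Signature, CI, Source) and the sample rows are hypothetical, and in practice the data would come from whatever export your alerting tool provides.

```powershell
# Hypothetical sample alerts; replace with an export from your own tool.
$alerts = @(
    [pscustomobject]@{ Signature = 'HeartbeatMissed'; CI = 'server01'; Source = 'AgentA' }
    [pscustomobject]@{ Signature = 'HeartbeatMissed'; CI = 'server01'; Source = 'AgentA' }
    [pscustomobject]@{ Signature = 'HeartbeatMissed'; CI = 'server01'; Source = 'AgentA' }
    [pscustomobject]@{ Signature = 'DiskFull';        CI = 'server02'; Source = 'AgentB' }
    [pscustomobject]@{ Signature = 'HeartbeatMissed'; CI = 'server03'; Source = 'AgentA' }
)

# Step 3: most repeated signature / CI / source combinations, highest count first.
$top = @(
    $alerts |
        Group-Object -Property Signature, CI, Source |
        Sort-Object -Property Count -Descending |
        Select-Object -First 5 -Property Count, Name
)

# Step 1: how wide is the problem? One CI and one source suggests a local issue;
# many CIs from one source points at that collector or integration.
$ciCount     = @($alerts | Select-Object -ExpandProperty CI -Unique).Count
$sourceCount = @($alerts | Select-Object -ExpandProperty Source -Unique).Count

$top | Format-Table -AutoSize
"Distinct CIs: $ciCount, distinct sources: $sourceCount"
```

If a single combination dominates the counts, that is usually the first pattern to consider for suppression.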

Questions to answer

Fast commands and checks

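# Windows services that are not running (this also lists services stopped by design)
Get-Service | Where-Object { $_.Status -ne 'Running' } | Select-Object Name, Status

# Basic port test (server01 is a placeholder host name)
Test-NetConnection -ComputerName server01 -Port 443

# Last reboot time
Get-CimInstance -ClassName Win32_OperatingSystem | Select-Object CSName, LastBootUpTime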

Containment

Containment does not mean hiding the outage. It means stopping duplicate operational work while preserving the real signal. Suppress known duplicate patterns, pause broken integrations if needed, and create one parent incident or problem record for coordinated response.
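
The idea of suppressing duplicates while keeping the real signal can be sketched as follows. This is an illustration under stated assumptions, not the behavior of any particular ticketing tool: the alert properties, the parent ID INC-PARENT-001, and the action labels are all hypothetical. The first alert for each signature and CI pair is kept as a child of the parent record, and later repeats are marked for suppression.

```powershell
# Hypothetical sample alerts; replace with data from your own tool.
$alerts = @(
    [pscustomobject]@{ Signature = 'HeartbeatMissed'; CI = 'server01' }
    [pscustomobject]@{ Signature = 'HeartbeatMissed'; CI = 'server01' }
    [pscustomobject]@{ Signature = 'DiskFull';        CI = 'server02' }
    [pscustomobject]@{ Signature = 'HeartbeatMissed'; CI = 'server01' }
)

$parentId = 'INC-PARENT-001'   # hypothetical parent incident ID
$seen = @{}                    # signature|CI pairs already encountered

$plan = foreach ($alert in $alerts) {
    $key = '{0}|{1}' -f $alert.Signature, $alert.CI
    $isDuplicate = $seen.ContainsKey($key)
    $seen[$key] = $true

    [pscustomobject]@{
        Signature = $alert.Signature
        CI        = $alert.CI
        Parent    = $parentId
        # Keep the first occurrence as real signal; suppress the repeats.
        Action    = if ($isDuplicate) { 'Suppress' } else { 'KeepAsChild' }
    }
}

$plan | Format-Table -AutoSize
```

Every row stays linked to the parent record, so suppressed alerts remain available as evidence for the after-action review.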

After-action tuning
