ServiceNow Alert Grouping by CI and Resource: A Practical Rule Design
Good alert grouping is not “group everything on the same server.” That pattern produces quiet failures, where a second, unrelated symptom disappears under an alert that is already open. The safer pattern is to group by configuration item (CI) plus resource plus symptom, then allow related alerts to roll into the same operational story.
The grouping rule that usually works
For infrastructure events, start with this key:
Grouping key = node_ci + resource + metric_family + environment
Examples:
| Event | Bad grouping | Better grouping |
|---|---|---|
| High CPU on server A | server A | server A + CPU + production |
| D drive low space on server A | server A | server A + D: + disk + production |
| SQL service stopped on server A | server A | server A + MSSQLSERVER + service + production |
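
The key above can be sketched as a small function. This is a minimal illustration, not a ServiceNow API: the function name `buildGroupingKey` and the flat event shape are assumptions, and the fields are assumed to be normalized before they reach it.

```javascript
// Hypothetical event shape: { node_ci, resource, metric_family, environment }.
// The field names mirror the grouping key in the text; they are not ServiceNow column names.
function buildGroupingKey(event) {
  const parts = [event.node_ci, event.resource, event.metric_family, event.environment];
  // Refuse to build a key from incomplete data rather than silently
  // collapsing events into an overly broad group.
  if (parts.some((p) => typeof p !== "string" || p.trim() === "")) {
    return null;
  }
  return parts.map((p) => p.trim()).join("|");
}

const diskEvent = {
  node_ci: "server-a",
  resource: "D:",
  metric_family: "disk",
  environment: "production",
};

console.log(buildGroupingKey(diskEvent)); // server-a|D:|disk|production
```

Returning `null` for an incomplete event is a design choice in this sketch: it forces the caller to decide what to do with events that cannot be grouped safely, rather than grouping them on the CI alone.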
Why CI-only grouping causes pain
CI-only grouping looks clean during demos because the incident count drops fast. In production it creates a different problem: unrelated symptoms get buried under a parent alert. A server can have high CPU, a dead service, and low disk at the same time. Those may share a root cause, but they still need separate evidence until someone confirms the relationship.
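
To make the difference concrete, the sketch below groups the same three simultaneous symptoms twice, once by CI alone and once by the composite key. The event data and the `groupBy` helper are illustrative assumptions.

```javascript
// Three simultaneous symptoms on one server (illustrative data).
const events = [
  { node_ci: "server-a", resource: "cpu", metric_family: "cpu", environment: "production" },
  { node_ci: "server-a", resource: "D:", metric_family: "disk", environment: "production" },
  { node_ci: "server-a", resource: "MSSQLSERVER", metric_family: "service", environment: "production" },
];

// Group items by whatever key function is supplied.
function groupBy(items, keyFn) {
  const groups = new Map();
  for (const item of items) {
    const key = keyFn(item);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(item);
  }
  return groups;
}

const ciOnly = groupBy(events, (e) => e.node_ci);
const composite = groupBy(events, (e) =>
  [e.node_ci, e.resource, e.metric_family, e.environment].join("|")
);

console.log(ciOnly.size);    // 1: all three symptoms share one group
console.log(composite.size); // 3: each symptom keeps its own evidence
```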
Fields to normalize before grouping
- CI: The actual operational CI, not a generic “other” record.
- Resource: Disk letter, mount point, interface name, process, service, URL, database, queue, or synthetic check.
- Metric family: CPU, memory, disk, availability, latency, errors, process, certificate, job, backup.
- Environment: Production, DR, QA, dev, lab.
- Source: SolarWinds, Dynatrace, Zabbix, Splunk, Azure Monitor, etc.
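
A normalization pass for these fields might look like the sketch below. The alias tables, the `normalizeEvent` name, and the raw field names are all assumptions for illustration; real mappings depend on what each monitoring source actually sends.

```javascript
// Hypothetical alias tables; real ones would be built from your own sources.
const METRIC_FAMILY_ALIASES = new Map([
  ["cpu", "cpu"], ["processor", "cpu"],
  ["mem", "memory"], ["memory", "memory"],
  ["disk", "disk"], ["volume", "disk"], ["filesystem", "disk"],
]);

const ENVIRONMENT_ALIASES = new Map([
  ["prod", "production"], ["production", "production"],
  ["dr", "dr"], ["qa", "qa"],
  ["dev", "dev"], ["development", "dev"], ["lab", "lab"],
]);

function normalizeEvent(raw) {
  const trimmed = (v) => String(v ?? "").trim();
  const lower = (v) => trimmed(v).toLowerCase();
  return {
    node_ci: lower(raw.node_ci),
    // Resource keeps its case: mount points such as /Data and /data can differ.
    resource: trimmed(raw.resource),
    // Unmapped values become "unknown" so they surface in review instead of
    // quietly forming their own groups.
    metric_family: METRIC_FAMILY_ALIASES.get(lower(raw.metric_family)) ?? "unknown",
    environment: ENVIRONMENT_ALIASES.get(lower(raw.environment)) ?? "unknown",
    source: lower(raw.source),
  };
}

const normalized = normalizeEvent({
  node_ci: " Server-A ",
  resource: "D:",
  metric_family: "Volume",
  environment: "PROD",
  source: "SolarWinds",
});
console.log(normalized.metric_family, normalized.environment); // disk production
```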
Routing rule example
    // Pseudocode for assignment group fallback
    if (ci.support_group) {
        assignment_group = ci.support_group;
    } else if (ci.change_group) {
        assignment_group = ci.change_group;
    } else {
        assignment_group = "SysOps";
    }
This gives operations a deterministic route even when the CMDB is incomplete. It can also expose missing CMDB ownership as a data quality issue, provided someone tracks which CIs land in the fallback queue; otherwise the fallback is just another generic queue where alerts die.
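
One way to make the fallback visible is to return a flag alongside the group, then report the CIs that needed it. The sketch below assumes plain objects with `name`, `support_group`, and `change_group` properties; `resolveAssignmentGroup` and `findUnownedCis` are hypothetical helpers, not ServiceNow functions.

```javascript
const FALLBACK_GROUP = "SysOps";

// Returns the assignment group plus a flag showing whether the CMDB lacked an owner.
function resolveAssignmentGroup(ci) {
  if (ci && ci.support_group) {
    return { group: ci.support_group, usedFallback: false };
  }
  if (ci && ci.change_group) {
    return { group: ci.change_group, usedFallback: false };
  }
  return { group: FALLBACK_GROUP, usedFallback: true };
}

// Lists the CIs that have neither group set, so they can be reported as CMDB gaps.
function findUnownedCis(cis) {
  return cis
    .filter((ci) => resolveAssignmentGroup(ci).usedFallback)
    .map((ci) => ci.name);
}

const cis = [
  { name: "server-a", support_group: "Windows Ops" },
  { name: "server-b", change_group: "DBA" },
  { name: "server-c" },
];

console.log(resolveAssignmentGroup(cis[1]).group); // DBA
console.log(findUnownedCis(cis));                  // [ 'server-c' ]
```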
Suppression versus grouping
Grouping keeps related alerts visible under a shared operational context. Suppression hides or delays alerts that are not useful during a known condition. Do not use suppression to compensate for bad grouping. Suppress planned maintenance, flapping non-production checks, and duplicate raw events. Group actionable symptoms that still need visibility.
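
The split can be expressed as a decision made before grouping runs. This is a sketch under assumed field names (`id`, `in_maintenance`, `flapping`, `environment`); how maintenance windows and flapping are actually detected is outside its scope.

```javascript
// Decide whether an event is suppressed or passed on to grouping.
// seenEventIds is a Set of raw event ids that have already been processed.
function classifyEvent(event, seenEventIds) {
  // Duplicate raw events carry no new information.
  if (seenEventIds.has(event.id)) return "suppress";
  seenEventIds.add(event.id);

  // Planned maintenance is a known condition.
  if (event.in_maintenance) return "suppress";

  // Flapping checks are only suppressed outside production.
  if (event.flapping && event.environment !== "production") return "suppress";

  // Everything else is an actionable symptom that still needs visibility.
  return "group";
}

const seen = new Set();
const decisions = [
  { id: "e1", in_maintenance: false, flapping: false, environment: "production" },
  { id: "e1", in_maintenance: false, flapping: false, environment: "production" },
  { id: "e2", in_maintenance: true, flapping: false, environment: "production" },
  { id: "e3", in_maintenance: false, flapping: true, environment: "qa" },
  { id: "e4", in_maintenance: false, flapping: true, environment: "production" },
].map((e) => classifyEvent(e, seen));

console.log(decisions); // [ 'group', 'suppress', 'suppress', 'suppress', 'group' ]
```

Note that the flapping production check (`e4`) is still grouped: in this sketch, suppression never hides a production symptom unless it is a duplicate or falls in a maintenance window.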
Validation checklist
- Pull the top 25 alert patterns from the last 30 days.
- Confirm every pattern has a CI, resource, metric family, source, severity, and environment.
- Test grouping against historical events before enabling incident creation.
- Manually review examples where multiple alerts hit the same CI.
- After rollout, measure the reduction in incident count and track complaints about high-severity alerts that were missed or buried in a group.
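
Replaying historical events through the grouping key is one way to cover the testing and review items above before incident creation is enabled. The sketch below assumes already-normalized events in the hypothetical flat shape `{ node_ci, resource, metric_family, environment }`; `replayGrouping` is an illustrative name.

```javascript
// Replay historical events and summarize what the grouping rule would have done.
function replayGrouping(history) {
  const groups = new Map();      // grouping key -> alert count
  const groupsPerCi = new Map(); // CI -> set of grouping keys seen on that CI

  for (const e of history) {
    const key = [e.node_ci, e.resource, e.metric_family, e.environment].join("|");
    groups.set(key, (groups.get(key) ?? 0) + 1);
    if (!groupsPerCi.has(e.node_ci)) groupsPerCi.set(e.node_ci, new Set());
    groupsPerCi.get(e.node_ci).add(key);
  }

  // CIs with more than one group are the cases worth reviewing by hand.
  const reviewCis = [...groupsPerCi]
    .filter(([, keys]) => keys.size > 1)
    .map(([ci]) => ci);

  const reductionPct = history.length === 0
    ? 0
    : Math.round((1 - groups.size / history.length) * 100);

  return { alertCount: history.length, groupCount: groups.size, reductionPct, reviewCis };
}

const history = [
  { node_ci: "server-a", resource: "cpu", metric_family: "cpu", environment: "production" },
  { node_ci: "server-a", resource: "cpu", metric_family: "cpu", environment: "production" },
  { node_ci: "server-a", resource: "cpu", metric_family: "cpu", environment: "production" },
  { node_ci: "server-a", resource: "D:", metric_family: "disk", environment: "production" },
  { node_ci: "server-b", resource: "cpu", metric_family: "cpu", environment: "production" },
];

const report = replayGrouping(history);
console.log(report.groupCount, report.reductionPct, report.reviewCis); // 3 40 [ 'server-a' ]
```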