Have you ever missed (or almost missed) a critical network alarm that could have prevented a serious network performance or availability problem because it was hidden among non-essential alarms? Hopefully the answer is no, but the situation highlights a serious problem – “alarm clutter”.
Today’s network devices and servers are capable of providing a dizzying set of alarms on almost anything from packet errors to available memory. That’s a lot of power for troubleshooting and problem solving, but it can also mean that even in a small network of only a few hundred elements you can become overwhelmed by a storm of alarms.
Here are three easy techniques for managing the volume of alarms and their relative severity. Using them in the right circumstances can help you find and fix problems more quickly by spending less time wading through a sea of distractions.
Technique 1: Duration-based alarming
Duration-based alarming is a common technique for reducing the number of alarms from a particular device or server. Instead of reporting every instance of an alarm condition, an alarm is issued only if the condition persists for longer than a defined period of time.
For example, suppose interface utilization on a router briefly exceeds 90% every few minutes. Normally, this isn’t a concern and doesn’t warrant an alarm (in fact, it could mean the router is optimally “sized” for the expected or nominal level of traffic on the interface). On the other hand, if interface utilization stays above 90% for 15 minutes or more, a bottleneck has likely developed and an alarm should be generated. With duration-based alarming, you are notified only when an actual problem develops, not every time a short, transient condition occurs.
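As a rough illustration, the duration check can be sketched in a few lines of Python. The class name DurationAlarm, its fields, and the idea of feeding it timestamped samples are assumptions made for this example; they do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DurationAlarm:
    """Hypothetical helper: alarm only when a breach lasts `hold_seconds`."""
    threshold: float                        # e.g. 90.0 (% utilization)
    hold_seconds: float                     # e.g. 900 (15 minutes)
    breach_started: Optional[float] = None  # when the current breach began

    def update(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True while the alarm should be active."""
        if value <= self.threshold:
            # Condition cleared: forget any breach in progress.
            self.breach_started = None
            return False
        if self.breach_started is None:
            # First sample above the threshold: start timing the breach.
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.hold_seconds


alarm = DurationAlarm(threshold=90.0, hold_seconds=15 * 60)
print(alarm.update(0, 95.0))    # breach just started, no alarm yet
print(alarm.update(900, 95.0))  # still above 90% after 15 minutes
```

In this sketch a single sample at or below the threshold resets the timer, so short spikes never accumulate into an alarm. A real system might tolerate a brief dip instead; that is a design choice, not something the technique dictates.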
Technique 2: Average-value alarming
Average-value alarming offers a similar approach. Instead of creating an alarm every time a measure exceeds a pre-determined threshold, an alarm is issued only if the average value of the measure over time exceeds the threshold.
It’s not uncommon, for example, to see processor utilization periodically “spike” to 100% for a few seconds. However, if a processor averages more than 90% utilization over 20 minutes, that is cause for concern and you would fully expect an alarm.
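One way to picture this is a sliding window of recent samples whose mean is compared against the threshold. The sketch below is only an illustration: the class name AverageAlarm, the min_samples guard (which stops a single early spike from counting as a full-window average), and the 60-second polling interval are all assumptions for this example.

```python
from collections import deque


class AverageAlarm:
    """Hypothetical helper: alarm when the windowed mean exceeds a threshold."""

    def __init__(self, threshold: float, window_seconds: float, min_samples: int):
        self.threshold = threshold            # e.g. 90.0 (% utilization)
        self.window_seconds = window_seconds  # e.g. 1200 (20 minutes)
        self.min_samples = min_samples        # samples needed before judging
        self.samples = deque()                # (timestamp, value) pairs

    def update(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True if the windowed average is too high."""
        self.samples.append((timestamp, value))
        # Drop samples that have aged out of the window. The sample just
        # appended is always inside the window, so the deque never empties.
        while self.samples[0][0] <= timestamp - self.window_seconds:
            self.samples.popleft()
        if len(self.samples) < self.min_samples:
            return False
        mean = sum(v for _, v in self.samples) / len(self.samples)
        return mean > self.threshold


# Assumed polling: one sample every 60 seconds, so 20 samples fill 20 minutes.
alarm = AverageAlarm(threshold=90.0, window_seconds=20 * 60, min_samples=20)
```

With these settings, a lone 100% spike among otherwise normal readings barely moves the average, while 20 minutes of readings around 95% pushes the mean over the threshold.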
Technique 3: Severity-level alarming
Rather than setting just one alarm threshold, try setting multiple threshold values that represent increasing levels of severity.
Disk space used, for example, typically increases gradually until applications can no longer function. Obviously, you want an alarm when disk space used reaches 90%, but wouldn’t it be helpful to know when it reaches 70% and then 80% so you have time to “clean up” the disk before applications suffer? You could configure a minor alarm when disk space used reaches 70%, a major alarm at 80%, and finally a critical alarm at 90%.
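A minimal sketch of the mapping from a measurement to a severity level might look like the following. It treats the 70%, 80%, and 90% figures as percentages of disk space used (an assumption, since that is the quantity that grows over time), and the names DISK_USED_LEVELS and severity are hypothetical.

```python
from typing import List, Optional, Tuple

# (threshold in % of disk space used, severity name); values from the example.
DISK_USED_LEVELS: List[Tuple[float, str]] = [
    (70.0, "minor"),
    (80.0, "major"),
    (90.0, "critical"),
]


def severity(value: float,
             levels: List[Tuple[float, str]] = DISK_USED_LEVELS) -> Optional[str]:
    """Return the highest severity whose threshold `value` has reached."""
    # Check the highest threshold first so the most severe match wins.
    for threshold, name in sorted(levels, reverse=True):
        if value >= threshold:
            return name
    return None  # below every threshold: no alarm


for used in (65.0, 72.5, 85.0, 93.0):
    print(used, severity(used))
```

Keeping the thresholds in a single table makes it easy to tune them per device or add a fourth level later without touching the lookup logic.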
These are just three of the most useful ways to reduce alarm clutter and focus on actionable alarms. Using them will help you identify significant network issues earlier, before users are affected.