Wednesday, July 21, 2010

Munin and Alerting: Method 1

Munin internal alerting system

Munin graphs are nice, but if you don't want to check them every morning for suspiciously high network traffic or critical disk usage, you will probably want Munin to send you an alert when it finds an "unusual" value. Munin has a very basic alerting system built in. Suppose your email address is user@foo.com, and you want to receive a mail if the load on serverA goes over 3, and another mail if it reaches the critical value of 5. You also want your partner, partner@foo.com, to be notified.

In /etc/munin/munin.conf, add the following lines (usually above the part defining the nodes to monitor):


contact.user.command mail -s "Munin Notification" user@foo.com
contact.partner.command mail -s "Munin Notification" partner@foo.com


Then, in the part describing serverA:


[domain;serverA]
        address aaa.aaa.aaa.aaa
        use_node_name yes
        load.warning 3
        load.critical 5
        contacts user partner



The values 3 and 5 here are maximum values. If you wanted to be warned when the load goes under 1 instead, you could replace 3 with 1: (a minimum followed by a colon). You can also set both a minimum and a maximum: load.warning 1:3 would warn you if the load goes under 1 or over 3.
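To make the min:max notation concrete, here is a minimal Python sketch of how such a threshold could be read. The function names parse_range and is_out_of_range are hypothetical, and the strict comparisons at the boundaries are an assumption; this only illustrates the notation and is not Munin's own code.

```python
def parse_range(spec):
    """Split a threshold such as "3", "1:" or "1:3" into (minimum, maximum)."""
    if ":" in spec:
        low, high = spec.split(":", 1)
    else:
        low, high = "", spec  # a bare number is read as a maximum
    minimum = float(low) if low else None
    maximum = float(high) if high else None
    return minimum, maximum


def is_out_of_range(value, spec):
    """Return True if value is under the minimum or over the maximum."""
    minimum, maximum = parse_range(spec)
    if minimum is not None and value < minimum:
        return True
    if maximum is not None and value > maximum:
        return True
    return False


# Try the three forms from the text against a low, a normal and a high load.
for spec in ("3", "1:", "1:3"):
    print(spec, [is_out_of_range(load, spec) for load in (0.5, 2.0, 4.0)])
```

A missing side of the colon simply means "no limit on that side", which is why 1: only complains about low values.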
To monitor part of a service with Munin, you need the internal name of the element you want to check. For example, suppose we want to be warned if the usage of the disk /dev/sdb1 on serverA exceeds 95%; the line to add is _dev_sdb1.warning 95, _dev_sdb1 being the internal name of the element. There are two ways to find this internal name.


We know the usage of that disk is monitored by the "df" plugin. So we can go to the HTML page produced by Munin and click on the graph corresponding to the df plugin; at the bottom of the page, a table lists all the elements monitored by the df plugin, with their internal names. The other way is to connect to the node with telnet and fetch the df plugin:



fetch df
_dev_sdb1.value 90
varrun_var_run.value 1
varlock_var_lock.value 0
procbususb.value 1
udev_dev.value 1
devshm_dev_shm.value 0
lrm_lib_modules_2_6_20_16_generic_volatile.value 9
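If you would rather script this than type into telnet, the same exchange can be sketched in Python. This is a rough sketch under a few assumptions: the node listens on port 4949 (the usual munin-node default), it sends a one-line banner on connect, and it ends the reply with a line containing a single dot. The names parse_fetch_output and fetch_plugin are hypothetical helpers, not part of Munin.

```python
import socket


def parse_fetch_output(text):
    """Turn "<name>.value <number>" lines into a {name: value} dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, "#" banner/comment lines and the closing "."
        if not line or line.startswith("#") or line == ".":
            continue
        field, _, raw = line.partition(" ")
        if field.endswith(".value"):
            name = field[: -len(".value")]
            try:
                values[name] = float(raw)
            except ValueError:
                pass  # ignore values that are not numbers
    return values


def fetch_plugin(host, plugin, port=4949, timeout=10):
    """Ask a munin-node for one plugin's values, as the telnet session does."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("rw", encoding="ascii", newline="\n") as stream:
            stream.readline()  # banner line (assumed to be sent on connect)
            stream.write(f"fetch {plugin}\n")
            stream.flush()
            lines = []
            for line in stream:
                if line.strip() == ".":  # a lone dot is assumed to end the reply
                    break
                lines.append(line)
            stream.write("quit\n")
            stream.flush()
    return parse_fetch_output("".join(lines))


# Offline demonstration on two lines of the output shown above.
sample = "_dev_sdb1.value 90\nvarrun_var_run.value 1\n.\n"
print(sorted(parse_fetch_output(sample)))
```

Against a real node, fetch_plugin("serverA", "df") would return a dictionary whose keys are the internal names you need for munin.conf.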


Anyway, remember: Munin is run by cron every five minutes, and it does not keep track of whom it has already mailed. Imagine what happens if the usage of your disk /dev/sdb1 goes up to 96% on Friday evening, just after you have left work: on Monday morning you may find tens if not hundreds of mails waiting for you. You cannot define groups of contacts or groups of machines either. If you want warnings on 10 services on 50 machines, it starts to get quite complicated. Therefore I would recommend that you use one of the Nagios methods.
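To put a rough number on that Monday-morning surprise, here is a small Python sketch that counts the cron runs over a weekend. The leaving and returning times are made-up examples, count_runs is a hypothetical helper, and "one mail per run for each contact" is the worst case assumed from the behavior described above.

```python
import math
from datetime import datetime, timedelta

RUN_INTERVAL = timedelta(minutes=5)  # cron period mentioned in the post


def count_runs(start, end, interval=RUN_INTERVAL):
    """Count cron runs in [start, end), assuming the first run is at start."""
    if end <= start:
        return 0
    return math.ceil((end - start) / interval)


# Hypothetical weekend: leave Friday 18:00, come back Monday 09:00.
friday_evening = datetime(2010, 7, 23, 18, 0)
monday_morning = datetime(2010, 7, 26, 9, 0)
print(count_runs(friday_evening, monday_morning))
```

Multiply the result by the number of contacts and by the number of fields in alert to get an upper bound for your own setup.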
