Munin and alerting

Submitted by yann (not verified) on Mon, 12/21/2009 - 11:28

Munin internal alerting system

Munin graphs are nice, but if you don't want to check them every morning for suspiciously high network traffic or critical disk usage, you will probably want Munin to send you an alert when it finds an "unusual" value. Munin has a very basic alerting system built in. Imagine your email address is yann@foo.com, and you want to receive an email if the load on serverA goes over 3, and another if it reaches the critical value of 5. You also want Chris, chris@foo.com, to be notified.

In /etc/munin/munin.conf, add the following lines (usually above the part defining the nodes to monitor):

contact.yann.command mail -s "Munin notification" yann@foo.com
contact.chris.command mail -s "Munin notification" chris@foo.com

Then, in the part describing serverA:

[domain;serverA]
address aaa.aaa.aaa.aaa
use_node_name yes
load.warning 3
load.critical 5
contacts yann chris

The values 3 and 5 here are maximum values. If you wanted to be warned when the load goes under 1 instead, you could replace 3 with 1: (a minimum followed by a colon). You can also set both a minimum and a maximum value: load.warning 1:3 would warn you if the load goes under 1 or over 3.
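The range syntax described above can be sketched in a few lines of Python. This is only an illustration of the min:max rules as stated in this article, not Munin's own code; the function names parse_range, out_of_range and load_state are made up for the example.

```python
from typing import Optional, Tuple


def parse_range(spec: str) -> Tuple[Optional[float], Optional[float]]:
    """Split a threshold such as '3', '1:' or '1:3' into (minimum, maximum).

    A bare number is treated as a maximum only, '1:' as a minimum only,
    and '1:3' sets both bounds.
    """
    if ":" not in spec:
        return None, float(spec)
    low, high = spec.split(":", 1)
    return (float(low) if low else None, float(high) if high else None)


def out_of_range(value: float, spec: str) -> bool:
    """Return True when the value falls outside the allowed range."""
    low, high = parse_range(spec)
    if low is not None and value < low:
        return True
    if high is not None and value > high:
        return True
    return False


def load_state(value: float, warning: str, critical: str) -> str:
    """Classify a value; the critical threshold is checked first."""
    if out_of_range(value, critical):
        return "critical"
    if out_of_range(value, warning):
        return "warning"
    return "ok"
```

With the thresholds used for serverA, load_state(3.5, "3", "5") falls in the warning band, while a value of 6 is classified as critical.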
To monitor one element of a service with Munin, you need the internal name of the element you want to check. For example, suppose we want to be warned if the usage of the disk /dev/sdb1 on serverA exceeds 95%; the line we will add is _dev_sdb1.warning 95, _dev_sdb1 being the internal name of the element. There are two ways to find this internal name.

We know the usage of that disk is monitored by the "df" plugin. So we can go to the HTML page produced by Munin and click on the graph corresponding to the df plugin; at the bottom of the page, a table lists all the elements monitored by the plugin, with their internal names. The other way is to connect to the node with telnet and fetch the df plugin:

fetch df
_dev_sdb1.value 90
varrun_var_run.value 1
varlock_var_lock.value 0
procbususb.value 1
udev_dev.value 1
devshm_dev_shm.value 0
lrm_lib_modules_2_6_20_16_generic_volatile.value 9
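If you would rather script the lookup than type into telnet, the same exchange can be sketched in Python. This is a rough sketch that assumes the node listens on the default munin-node port 4949, greets with a one-line banner, and ends its reply with a single "." line; the function names parse_fetch and fetch_plugin are hypothetical.

```python
import socket
from typing import Dict


def parse_fetch(text: str) -> Dict[str, str]:
    """Map each internal field name to its raw value string."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":  # a lone "." marks the end of the reply
            continue
        name, _, value = line.partition(" ")
        if name.endswith(".value"):
            fields[name[: -len(".value")]] = value.strip()
    return fields


def fetch_plugin(host: str, plugin: str, port: int = 4949,
                 timeout: float = 10.0) -> Dict[str, str]:
    """Ask a munin-node for one plugin's values (needs a reachable node)."""
    with socket.create_connection((host, port), timeout=timeout) as sock, \
            sock.makefile("rw", encoding="ascii", newline="\n") as stream:
        stream.readline()  # skip the "# munin node at ..." banner
        stream.write(f"fetch {plugin}\n")
        stream.flush()
        lines = []
        for line in stream:
            if line.strip() == ".":
                break
            lines.append(line)
        stream.write("quit\n")
        stream.flush()
    return parse_fetch("".join(lines))
```

Calling fetch_plugin("serverA", "df") against a reachable node would return a dictionary whose keys are the internal names, such as _dev_sdb1.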

Anyway, remember: Munin is run by cron every five minutes. And no, Munin doesn't keep track of whom it has already mailed. I'll let you imagine what happens if the usage of your disk /dev/sdb1 goes up to 96% on Friday evening, just after you left work: on Monday morning you may find tens if not hundreds of emails waiting for you. You cannot define groups of contacts or groups of machines either. If you want warnings on 10 services on 50 machines, things start to get quite complicated. I would therefore recommend that you use one of the Nagios methods.

Integration with Nagios: via a NSCA server

First you need a way for Nagios to accept messages from Munin. Nagios has exactly such a thing, namely NSCA, which is documented here: http://nagios.sourceforge.net/docs/1_0/addons.html#nsca.

NSCA consists of a client (a binary usually named send_nsca) and a server usually run from inetd. We recommend that you enable encryption on NSCA communication.

You also need to configure Nagios to accept messages via NSCA. Those will be passive alerts.

# This is an example of the correct way to activate Nagios warnings
contact.nagios.command /usr/local/nagios/bin/send_nsca nagioshost.example.com -c /usr/local/nagios/etc/send_nsca.cfg -to 60
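Munin pipes the notification text to the contact command on standard input, and send_nsca expects each passive result as one tab-separated line: host name, service description, return code, plugin output. The Python sketch below only illustrates that line format; the helper name nsca_line and the host and service names in the usage note are hypothetical, and the service description has to match a passive service defined on the Nagios side.

```python
# Nagios return codes for passive service check results.
NAGIOS_CODES = {"ok": 0, "warning": 1, "critical": 2, "unknown": 3}


def nsca_line(host: str, service: str, state: str, output: str) -> str:
    """Build one passive check result line in the format send_nsca reads."""
    code = NAGIOS_CODES[state]
    # Tabs or newlines inside the output would corrupt the record.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{code}\t{clean}\n"
```

For example, nsca_line("serverA", "Disk usage", "warning", "_dev_sdb1 is 96") produces a single line with the return code 1 in the third field.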

Integration with Nagios: via a Nagios plugin

If you don't want to use passive checks, you can use the check_munin_rrd plugin.

Basically, munin-node data is stored on the Munin server as usual, and Nagios reads that data to check the status of the node.

$ /usr/lib/nagios/plugins/check_munin_rrd.pl --help

Monitor server via Munin-node pulled data
Usage: /usr/lib/nagios/plugins/check_munin_rrd.pl -H <host> -M
<Module> [-D <domain>] -w <warn level> -c <crit level> [-V]
-h, --help
print this help message
-H, --hostname=HOST
name or IP address of host to check
-M, --module=MUNIN MODULE
Munin module value to fetch
-D, --domain=DOMAIN
Domain as defined in munin
-w, --warn=INTEGER
warning level
-c, --crit=INTEGER
critical level
-v --verbose
Be verbose
-V, --version
prints version number
check_munin_rrd.pl (nagios-plugins 1.4.2) 0.9
The nagios plugins come with ABSOLUTELY NO WARRANTY. You may redistribute
copies of the plugins under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.

The previous implementation ran a check from Nagios directly against munin-node, which is overkill since the Munin server already collects the data via cron.
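Whatever the plugin reads, Nagios only looks at its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The following Python sketch shows how the -w and -c levels could map onto those codes. It is not the code of check_munin_rrd.pl; in particular, the strict "greater than" comparison and the function names evaluate and report are assumptions made for the illustration.

```python
from typing import Optional

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value: Optional[float], warn: float, crit: float) -> int:
    """Map a value onto a Nagios exit code (assumes higher is worse)."""
    if value is None:  # no data could be read
        return UNKNOWN
    if value > crit:
        return CRITICAL
    if value > warn:
        return WARNING
    return OK


def report(module: str, value: Optional[float], warn: float, crit: float) -> int:
    """Print the one-line status Nagios displays and return the exit code."""
    status = evaluate(value, warn, crit)
    print(f"{LABELS[status]} - {module} = {value}")
    return status
```

A real plugin would end with something like sys.exit(report("df", value, 75, 90)), where value is the latest number read from the Munin data.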

You need to define the following:

    new command :
    define command{
    command_name check_munin
    command_line /usr/lib/nagios/plugins/check_munin_rrd.pl -H $HOSTALIAS$ -M $ARG1$ -w $ARG2$ -c $ARG3$
    }
    new service template :
    # generic service template definition check via munin
    define service{
    name generic-munin-service ; The 'name' of this service template
    active_checks_enabled 1 ; Active service checks are enabled
    passive_checks_enabled 0 ; Passive service checks are disabled
    parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service 1 ; We should obsess over this service (if necessary)
    check_freshness 0 ; Default is to NOT check service 'freshness'
    notifications_enabled 1 ; Service notifications are enabled
    event_handler_enabled 1 ; Service event handler is enabled
    flap_detection_enabled 1 ; Flap detection is enabled
    failure_prediction_enabled 1 ; Failure prediction is enabled
    process_perf_data 1 ; Process performance data
    retain_status_information 1 ; Retain status information across program restarts
    retain_nonstatus_information 1 ; Retain non-status information across program restarts
    notification_interval 0 ; Only send notifications on status change by default.
    is_volatile 0
    check_period 24x7
    normal_check_interval 5 ; This directive is used to define the number of "time units" to wait before scheduling the next "regular" check of the service.
    retry_check_interval 3 ; This directive is used to define the number of "time units" to wait before scheduling a re-check of the service.
    max_check_attempts 2 ; This directive is used to define the number of times that Nagios will retry the service check command if it returns any state other than an OK state. Setting this value to 1 will cause Nagios to generate an alert without retrying the service check again.
    notification_period 24x7
    notification_options w,u,c,r
    contact_groups admins
    register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
    }

    Don't use a smaller value for normal_check_interval: Munin only updates its data every 5 minutes.

    new service example :
    # check the disk usage via munin
    define service{
    hostgroup_name web-servers
    service_description disk-usage
    check_command check_munin!df!75!90
    use generic-munin-service
    }
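
To see how the pieces above fit together: in a check_command, the part before the first "!" selects the command definition, and the remaining parts fill $ARG1$, $ARG2$ and so on in its command_line. The Python sketch below mimics that substitution for illustration only; it is not how Nagios is implemented, and the function name expand_check_command and the host alias in the usage note are hypothetical.

```python
from typing import Dict


def expand_check_command(check_command: str, command_line: str,
                         macros: Dict[str, str]) -> str:
    """Substitute $ARGn$ and other macros into a command_line template."""
    # The first field names the command definition; the rest are arguments.
    _name, *args = check_command.split("!")
    values = dict(macros)
    for index, arg in enumerate(args, start=1):
        values[f"$ARG{index}$"] = arg
    for macro, value in values.items():
        command_line = command_line.replace(macro, value)
    return command_line
```

With the template from the command definition and a host alias of serverA, expand_check_command("check_munin!df!75!90", template, {"$HOSTALIAS$": "serverA"}) yields a command line ending in -H serverA -M df -w 75 -c 90.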