Thursday, July 22, 2010

Linux Server Monitoring With Munin




Introduction

Munin is a monitoring tool written in Perl, started by Jimmy Olsen in late 2003 and based on the excellent RRDtool by Tobias Oetiker. Even though development has slowed down since 2005, Munin is a stable tool; it is also very widely used, thanks to its very easy setup.

It consists of munin-node, a daemon you install on every server you want to monitor and which gathers the data, and munin, which you install on your monitoring server and which connects to every node at regular intervals to retrieve the data. Munin then uses the data to generate the corresponding graphs and HTML pages.

Below, you will find links that explain the different features of Munin and how to use them. Have fun setting up your Munin monitoring server and nodes!

I hope you have learned from these posts, and I want to thank you for reading.

Wednesday, July 21, 2010

Munin's Performance

Munin's CPU usage


As seen before, munin is run by a cron job every five minutes. So, every five minutes, it connects to all the servers it has to monitor, fetches all the data, writes the data in hundreds of RRD files, and recreates all the HTML files and hundreds of PNG files; the more servers monitored, the more CPU munin will use.
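On Debian-based systems, that schedule typically lives in /etc/cron.d/munin; an approximate sketch of such an entry (exact paths and wording vary by distribution and version):

```
# /etc/cron.d/munin (approximate) - run munin-cron every five minutes as the munin user
*/5 * * * *     munin   if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi
```

munin-cron in turn runs munin-update, munin-limits, munin-html and munin-graph, the four processes timed below.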

Some other tests are also rather interesting:

/usr/share/munin$ time sudo -u munin ./munin-update

real    0m27.453s
user    0m0.152s
sys     0m0.036s


/usr/share/munin$ time sudo -u munin ./munin-limits

real    0m0.179s
user    0m0.132s
sys     0m0.016s


/usr/share/munin$ time sudo -u munin ./munin-html

real    0m0.270s
user    0m0.176s
sys     0m0.020s


/usr/share/munin$ time sudo -u munin ./munin-graph

real    0m11.376s
user    0m10.465s
sys     0m0.500s


This test (made on my desktop, with only one node monitored) shows two interesting things: first, the generation of the PNGs is the heaviest part of the process (10.965 seconds of CPU usage vs 0.532 for the three other processes); second, the munin-update process takes nearly 30 seconds to complete, but barely uses the CPU - probably because it is waiting for the node to run all its plugins. That's why, when munin starts, it forks and runs a process for each node, and why you should not prevent it from forking (there is an option for that - don't use it).



If I were now monitoring 10 nodes, it would take approx. 110 seconds on my desktop (if nothing else is running), every five minutes. In other words: as you add nodes, munin tends to become quite heavy.


Run Munin as a CGI

One way to improve performance is to change the way Munin creates the graphs: instead of recreating them every five minutes, we can create them only when a user requests them by displaying one of the web pages. This is made possible with CGI.

So, how does it work? When installed, Munin creates a script, munin-cgi-graph, in /usr/lib/cgi-bin/. When configured for CGI, Munin changes the links to the images in the HTML files, making them point to munin-cgi-graph:

<img src="/cgi-bin/munin-cgi-graph/localdomain/localhost.localdomain/df_inode-day.png" ...>

Depending on the path, munin-cgi-graph will create the appropriate graph, which is then displayed. There is also a caching system, so that if you reload the page within five minutes, the graphs won't be regenerated; and since munin writes the files to disk, the directory /var/www/munin must be writable by the apache process. Making the files belong to the user munin and the group www-data, and giving the group write access, is one solution:



/var/www$ sudo chown -R munin:www-data /var/www/munin
/var/www$ sudo chmod -R g+w /var/www/munin

The performance gain is huge; but one drawback of this method is that it takes a lot more time to display a page containing several graphs, like the node view.

To configure Munin for CGI, you need to add the following lines to /etc/munin/munin.conf:


graph_strategy cgi
cgiurl /cgi-bin
cgiurl_graph /cgi-bin/munin-cgi-graph

These lines help Munin create correct links to the graphs. Now, assuming you are using Apache, you need to edit your main apache configuration file to allow /usr/lib/cgi-bin to run CGI scripts:

<Directory /usr/lib/cgi-bin>
        AllowOverride None
        Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
        Order allow,deny
        Allow from all
</Directory>


Finally, you need to tell Apache that your website is going to use CGI. If you have a special virtual host set up for munin, add the line there; otherwise add it somewhere in the main apache configuration file:

ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

munin-cgi-graph also uses the Perl module Date::Manip, which you need to install (on Debian/Ubuntu, it is packaged as libdate-manip-perl). Your Munin is now running as CGI!

Move Munin's RRD databases to a TMPFS


On an installation monitoring approximately 30 servers, I have over 2000 RRD databases in /var/lib/munin. This number varies with the number of services you monitor per server, but the point to remember is: every time munin runs (every 5 minutes), hundreds if not thousands of databases are written to and read from. If your disks aren't very fast, this can prove quite costly as the number of monitored servers grows.

This can be improved by moving the files in /var/lib/munin to a tmpfs. In my example, with 30 servers monitored and 2250 RRD files, only 115MB are used on disk - considering the amount of RAM in servers nowadays, it may be worth saving some disk I/O at the cost of some RAM. As all the data would be lost on a server restart, we will back it up every hour/day/week, depending on how much data you are willing to lose.

Make a backup of the folder:


cd /var/lib
cp -ra munin/ munin-cache

Add this to your /etc/fstab:

tmpfs /var/lib/munin tmpfs rw,size=512M 0 0

Mount it:

sudo rm -rf /var/lib/munin/*
sudo mount /var/lib/munin

Copy the data back from the backup:

sudo cp -ra /var/lib/munin-cache/* /var/lib/munin

Create an hourly (or daily) cronjob that copies the files from munin to munin-cache:

ServerX $ cd /etc/cron.hourly
ServerX $ ls -l
total 4
-rwxr-xr-x 1 root root 57 2009-03-10 18:55 munin-cache
ServerX $ cat munin-cache
#!/bin/sh
cp -ra /var/lib/munin/* /var/lib/munin-cache/

And then to restore the files from the backup automatically after a reboot, add this at the end of /etc/rc.local:

cp -ra /var/lib/munin-cache/* /var/lib/munin

Frequently asked questions




Is there a munin-node for Windows?
Yes, but it is unofficial. It is maintained by TOCOMPLETE? and is written in C++. Search Google for a download link.



Expand the Power of Munin: Write Your Own Plugins!

Introduction


Munin comes with many plugins that work out of the box after installation. Even so, you may still want to monitor values for which no plugin has been created yet. In this chapter, we will create a plugin that monitors the CPU usage of a defined set of users; I created it to help identify which applications were using most of the CPU in a fastCGI environment, on one of our Ubuntu-eu servers.

You can create plugins in the language of your choice; most of the ones munin comes with are either Perl or shell scripts, but you should be able to use whatever language you are familiar with. As the plugin described here is written in sh, this chapter requires a basic knowledge of shell programming and awk.

Retrieve the Data



First, you need to define which data you want to graph, and how to retrieve it. In our example, it is the CPU usage of a specified user at the time the plugin is run. I noticed that the output of ps on my system, with its BSD syntax, prints the CPU usage of every process. It is definitely not very precise, as the values are rounded; but I believe it gives a fairly good overview of which users are using most of the CPU.


~$ ps aux | head

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.2   2952  1856 ?        Ss   09:21   0:02 /sbin/init
root         2  0.0  0.0      0     0 ?        S<   09:21   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S<   09:21   0:00 [migration/0]
root         4  0.0  0.0      0     0 ?        SN   09:21   0:00 [ksoftirqd/0]
[...]


So, there are three things to do: filter the processes belonging to a specific user, isolate the CPU column, and add up all the values. To filter processes by user, we will use the -u option of ps; to isolate the CPU column, the easiest way is the -o option of ps:



~$ ps -u "yann" -o "%C"

%CPU
 0.0
 0.0
 1.5
 0.3
 0.0
 0.0
 [...]

We can now add the values with a small awk script (here in a single line, to make it easy to try out):


~$ ps -u "yann" -o "%C" | awk 'BEGIN {FS=" ";CPU_USER=0}{CPU_USER+=$0}END{print CPU_USER}'
8.6

Now that looks interesting... But we want to monitor not just a single user but several, and the variables need to be presented in the form var_name.value:



#!/bin/sh
export USERS="root user"
for USER in $USERS ; do
        ps -u "$USER" -o "%C" |
        awk '
        BEGIN {
                FS=" "
                CPU_USER=0
        }
        {
                CPU_USER+=$0
        }
        END {
                print "'$USER'.value "CPU_USER
        }'
done


Save this in /usr/share/munin/plugins/cpubyuser, add a soft link to it in /etc/munin/plugins/, and try it out:


~$ sudo vi /usr/share/munin/plugins/cpubyuser
~$ sudo ln -s /usr/share/munin/plugins/cpubyuser /etc/munin/plugins/cpubyuser
~$ sudo chmod 755 /usr/share/munin/plugins/cpubyuser
~$ /etc/munin/plugins/cpubyuser

root.value 3.6
user.value 4.2

Now, we don't want the usernames to be hardcoded in the plugin, but set in one of munin's configuration files. Edit the file /etc/munin/plugin-conf.d/munin-node, and add the following lines at the end:



[cpubyuser]
env.USERS root user


... and remove the line defining $USERS in our script. Reload munin-node (sudo /etc/init.d/munin-node restart).

~$ /etc/munin/plugins/cpubyuser
~$ munin-run cpubyuser
root.value 3
user.value 3

If you run the script directly, the variable $USERS won't be defined, and the script will exit without output. If you run it via munin-run, the environment variable is set up for the script from the plugin configuration, so it will work.
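You can see the same effect with a tiny, self-contained sketch: a child shell only sees USERS if the caller places it in the child's environment, which is essentially what munin-run does for the plugin (the variable name is simply the one our plugin uses):

```shell
# Run "directly": USERS is not in the child's environment,
# so the fallback "unset" is printed (assuming the caller has not exported USERS)
sh -c 'echo "USERS is: ${USERS:-unset}"'

# Run the way munin-run would, with the variable provided to the child
USERS="root user" sh -c 'echo "USERS is: ${USERS:-unset}"'
```

The first command prints "USERS is: unset", the second "USERS is: root user".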


Autoconf


As we have seen in the previous chapter, when munin-node installs, it runs every script with the autoconf parameter to decide whether or not to activate it. Therefore, our plugin also needs to handle this parameter.




if [ "$1" = "autoconf" ] ; then
        if [ -n "$USERS" ] ; then
                echo "yes"
        else
                echo "no (\$USERS not defined - edit your /etc/munin/plugin-conf.d/!)."
        fi
        exit
fi

Add this at the top of the plugin, before the main loop. If the plugin is run with the autoconf parameter, it will check for the environment variable $USERS; if it is defined, our plugin will reply "yes", meaning that it can be used. Otherwise, it will reply "no". In this section, your plugin is supposed to check that everything it needs to run correctly is in place - for example, an apache plugin would check that apache is running.


Config


When run with the "config" parameter, your plugin is supposed to output data telling RRDtool how to graph the different values the plugin provides, plus metadata like the title of the graph, the titles of the axes...



if [ "$1" = "config" ] ; then
        echo "graph_args --base 1000 -r --lower-limit 0";
        echo "graph_title CPU usage, by user";
        echo "graph_category system";
        echo "graph_info This graph shows CPU usage, for monitored users.";
        echo 'graph_vlabel %'
        echo 'graph_scale no'
        echo 'graph_period second'
        echo "graph_order $USERS"


        FIRSTUSER=1;
        for USER in $USERS; do
                echo "${USER}.label $USER"
                echo "${USER}.info CPU used by user $USER"
                echo "${USER}.type GAUGE"
                if [ $FIRSTUSER -eq 1 ] ; then
                        echo "${USER}.draw AREA"
                        export FIRSTUSER=0;
                else
                        echo "${USER}.draw STACK"
                fi
        done ;


        exit
fi


The "base" parameter on the second line states if we consider that 1M=1024K or 1M=1000K. One of the most important part of this is the loop in the end, which tells rrdtool how it should process the data we feed it with; for every variable monitored, we need to tell rrdtool what to do with it.

For example, the variable "user" is labelled "user" (you could change it to "CPU usage for $USER", or whatever you think is better). The label is what appears in the legend on the graph, so it should be reasonably short. For a longer description, use the "info" field.

For the "type" of the value, you should read man rrdcreate, there is a whole part on the different types of data sources. A "gauge" is described like this: " GAUGE: is for things like temperatures or number of people in a room or the value of a RedHat share.".

Add information to your plugin

Well, our plugin is nearly done. Now add these two lines at the top, between the shebang and the autoconf part:


#%# family=auto
#%# capabilities=autoconf


This tells munin that it should attempt to automatically install and configure your plugin. Now you just need to add some comments - installation instructions, license, history, author contact, etc. - and your first munin plugin is ready!

Now, this was a very simple plugin; depending on what you want to do, it can be a lot harder (look at the memory plugin!). If you want to create complex graphs, you should definitely complete your knowledge by reading rrdtool's documentation. On the other hand, if you only want to create simple plugins, reading plugins in /usr/share/munin/plugins and trying to adapt them is a very good way to get started.


Download and Install Munin Plugins

Munin-node (depending on the package) comes with a lot of plugins; they can be found in /usr/share/munin/plugins (on a server with munin-node installed). When you install it, it checks which ones are relevant (for example, it won't add apache monitoring if you don't have an apache server installed), and adds soft links to them in /etc/munin/plugins. So, to get the list of currently used plugins:

$ ls /etc/munin/plugins

acpi      if_err_eth0  irqstats     postfix_mailqueue
cpu       if_err_eth1  load         postfix_mailvolume
df        if_eth0      memory       processes
df_inode  if_eth1      netstat      swap
entropy   interrupts   open_files   vmstat
forks     iostat       open_inodes

And to see the list of available plugins:


$ ls /usr/share/munin/plugins/

acpi                    irqstats              sensors_
apache_accesses         load                  server_room_tmp
apache_processes        loggrep               smart_
apache_volume           memory                snmp__df
apt                     multips               snmp__fc_if_
apt_all                 munin_graph           snmp__fc_if_err_
courier_mta_mailqueue   munin_update          snmp__if_err_
courier_mta_mailstats   mysql_bytes           snmp__if_localhost_
courier_mta_mailvolume  mysql_isam_space_     snmp__load
cps_                    mysql_queries         snmp__processes
cpu                     mysql_slowqueries     snmp__sensors_fsc_bx_fan
cpu0                    mysql_threads         snmp__sensors_fsc_bx_temp
cpu1                    netstat               snmp__sensors_fsc_fan
cupsys_pages            nfs_client            snmp__sensors_fsc_temp
df                      nfsd                  snmp__sensors_mbm_fan
df_abs                  ntp_                  snmp__sensors_mbm_temp
df_inode                ntp_states            snmp__sensors_mbm_volt
entropy                 open_files            snmp__server_room_tmp
exim_mailqueue          open_inodes           snmp__test_temp
exim_mailstats          ping_                 snmp__users
forks                   plugins.history       squid_cache
fw_conntrack            port_                 squid_icp
fw_forwarded_local      postfix_mailqueue     squid_requests
fw_packets              postfix_mailstats     squid_traffic
hddtemp_smartctl        postfix_mailvolume    swap
if_                     processes             sybase_space
if_err_                 ps_                   uptime
interrupts              psu_                  vlan_
iostat                  sendmail_mailqueue    vlan_inetuse_
ip_                     sendmail_mailstats    vlan_linkuse_
ircu                    sendmail_mailtraffic  vmstat

There is a tool provided with munin-node that does pretty much the same, and can sometimes be useful: munin-node-configure. Run without parameters, it lists all the plugins and says whether they are used or not.


/usr/share/munin/plugins
$ sudo munin-node-configure

Plugin            | Used | Extra information
------            | ---- | -----------------
acpi              | yes  |
apache_accesses   | no   |
apache_processes  | no   |
apache_volume     | no   |
[...]

As you can see, there are a lot of plugins available, and only a few are activated. To see whether a plugin is usable, run it with the "autoconf" parameter:


/usr/share/munin/plugins$ ./apache_accesses autoconf
no (no apache server-status or ExtendedStatus missing on ports 80)

The plugin will answer "yes" if it is installable, and "no", sometimes with an explicit additional error message, if it is not. You can also run munin-node-configure with the --suggest parameter to see, among the plugins that are not installed, which ones you can install, and the error messages for those you can't:


~$ sudo munin-node-configure --suggest

[sudo] password for user:
Plugin                  | Used | Suggestions
------                  | ---- | -----------
apache_accesses         | no   | yes
apache_processes        | no   | yes
apache_volume           | no   | yes
courier_mta_mailqueue   | no   | [spooldir not found]
courier_mta_mailstats   | no   | [could not find executable]
courier_mta_mailvolume  | no   | [could not find executable]
cpu0                    | no   | yes
[...]

If you find an interesting plugin that is not yet installed but is installable, install it by adding a link in /etc/munin/plugins:


~$ sudo ln -s /usr/share/munin/plugins/plugin_name /etc/munin/plugins/plugin_name
~$ sudo /etc/init.d/munin-node restart




Overriding Munin's Plugins Default Configuration

One of the best aspects of munin is its ease of configuration. The drawback is that the default labels are sometimes less informative than they could be: "Wireless" may be more precise than "eth1", "NAS share" more informative than "/dev/sdb3". Munin makes this possible by letting you override, on a per-node basis, some of the plugins' configuration data.

Alright, let's assume that on your computer eth1 is your wireless interface. The graph of the bandwidth usage on that interface is generated by the plugin if_eth1 (all the plugins are in /usr/share/munin/plugins; from the name, it is fairly easy to guess which plugin produces which graph).
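In fact, there is no file named if_eth1 in /usr/share/munin/plugins: if_eth1 is a symlink to the wildcard plugin if_, and the plugin derives the interface to monitor from its own link name. A minimal sketch of the idea (the variable names are illustrative; a real plugin would use basename "$0"):

```shell
# How a wildcard plugin typically recovers its parameter from its own name:
# /etc/munin/plugins/if_eth1 -> /usr/share/munin/plugins/if_
SELF="if_eth1"              # in a real plugin: SELF=$(basename "$0")
INTERFACE=${SELF#if_}       # strip the "if_" prefix
echo "$INTERFACE"           # prints: eth1
```

This is why the plugin list below contains names ending in an underscore, like if_, snmp__if_ or ping_.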

We need the output of config for that plugin. We have seen before how to do that with telnet; this time we will log in directly to the server with munin-node installed and use munin-run - the result is exactly the same, only this time you execute the operation locally.


serverA $ sudo -u munin munin-run if_eth1 config
graph_order down up
graph_title eth1 traffic
graph_args --base 1000
graph_vlabel bits in (-) / out (+) per ${graph_period}
graph_category network
graph_info This graph shows the traffic of the eth1 network interface. Please note that the traffic is shown in bits per second, not bytes. IMPORTANT: Since the data source for this plugin uses 32bit counters, this plugin is really unreliable and unsuitable for most 100Mb (or faster) interfaces, where bursts are expected to exceed 50Mbps. This means that this plugin is unsuitable for most production environments. To avoid this problem, use the ip_ plugin instead.
down.label received
[...]


The element we want to override here is the graph_title variable. This is done in munin.conf on the server running munin (serverX). Find the section referring to the server you want to modify, and add a line setting the value:


[domain;serverA]
address aaa.aaa.aaa.aaa
use_node_name yes
if_eth1.graph_title Wireless traffic




This should already be enough :) You can also override the other variables returned by config, like the description (graph_info), which can be useful for adding special information, like "the disk is getting full, but it's fine, we already ordered a new one", or for describing more precisely how monitoring the entropy of a system can be useful...


Aggregate Munin Data on a Single Graph

It is sometimes useful to compare data coming from several nodes; if you have a cluster of 5 load balanced HTTP servers, it can be interesting to have the curves for the load of all 5 servers on a single graph, to check if the load balancer is properly configured. You could also put on the same graph the load of your fileserver, and the CPU time spent on "i/o wait" on a server accessing it, to study the correlation between both values.

Let's take the example with the 5 HTTP servers. First, we create a new virtual node:

[domain;Comparisons]
  update no # Turn off data-fetching for this "host".

Then we create our new graph:

apaches_load.graph_title Loads of our HTTP servers
# Use domain;server1 or server1.domain, depending on the notation you use
apaches_load.graph_order server1=domain;server1:load.load server2=domain;server2:load.load server3=domain;server3:load.load server4=domain;server4:load.load server5=domain;server5:load.load
apaches_load.graph_category Apache

After the next run of munin-cron, your new graph should be available.

A slightly more complex example now. I have 3 plugins collecting the number of users on forums: the first collects the number of guests, hidden users and online registered users on one forum; the second collects the number of guests and registered users on a second forum; and the third the number of guests, hidden users and online registered users on a third forum.

The idea is to aggregate all these users on a single graph; I want one stack per forum, so for each plugin we have to sum guests, hidden users and registered users. (Drops in the resulting graph are due to timeouts of one of the plugins under heavy load on the host.)



This is made with the following configuration, in munin.conf, in the node you want to put the graph in:

forum_sum.graph_title Active users on the forums we host

forum_sum.users_fr.sum Misc;Forums:forum_users_fr.members Misc;Forums:forum_users_fr.guests
forum_sum.users_fr.draw STACK
forum_sum.users_fr.label Users on Forum1

forum_sum.users_de.sum Misc;Forums:forum_users_de.members Misc;Forums:forum_users_de.hidden Misc;Forums:forum_users_de.guests
forum_sum.users_de.draw STACK
forum_sum.users_de.label Users on Forum2

forum_sum.users_ru.sum Misc;Forums:forum_users_ru.members Misc;Forums:forum_users_ru.hidden Misc;Forums:forum_users_ru.guests
forum_sum.users_ru.draw STACK
forum_sum.users_ru.label Users on Forum3

Misc is the domain, Forums the name of the node. forum_sum is the name of the virtual plugin we create; users_fr, users_de and users_ru are the names of its fields. forum_users_* are the names of the plugins fetching data from the forums.

Munin-node Monitoring External Sources of Data, or Virtual Nodes Explained

Twice I have written or used munin-node plugins that did not directly concern the server the node was installed on. The first was a plugin using the SNMP interface of a UPS to check the temperature in our server room; the other was a plugin returning the number of visitors on a forum, by downloading the page and extracting the value with a regular expression.

As these plugins were installed on the monitoring machine, the temperature of the server room ended up somewhere between the available entropy and the load graphs of the monitoring machine. With virtual nodes, I can create a virtual node called ServerRoom1 with, for example, two plugins: temperature and humidity. There is no physical machine called "ServerRoom1"; the munin-node installed on serverA will simply tell me, when I fetch its nodes, which plugins are for serverA and which are for ServerRoom1. This is what "config temperature" in a telnet session to my server returns:


config temperature
host_name ServerRoom1
[...]


As you can see, there is an additional line specifying the host name. For the other plugins, no host_name is defined, so they are assumed to belong to the default node (the one you get the greeting from just after connecting with telnet, remember?). The nodes command will also return several nodes:


nodes
serverA
ServerRoom1
.



There are several ways to tell munin that a plugin reports data for a virtual node. The preferred method is the following: edit the file /etc/munin/plugin-conf.d/munin-node on the machine running the node:


serverA $ sudo vi /etc/munin/plugin-conf.d/munin-node

Find the section related to your plugin, or if you cannot find it, add it at the end; if your plugin is named temperature, and the name you want to give to the virtual node is ServerRoom1, add this:



[temperature]
host_name ServerRoom1


Remember to restart munin-node; you should now be able to check via telnet that munin-node on serverA presents two nodes.

You then need to configure munin on the monitoring server:


[ServerRoom1]
  address aaa.aaa.aaa.aaa
  use_node_name no



ServerRoom1 is the name of the virtual node, and aaa.aaa.aaa.aaa the IP address of the server running munin-node.
Save, wait 5 minutes, and it should work :)


Munin and Alerting: Method 3


Integration With Nagios: Via a Nagios Plugin

If you don't want to use passive checks, you can use the check_munin_rrd plugin.

Basically, munin-node data is stored on the munin server as usual, and Nagios reads that data to check the status of the node.

$ /usr/lib/nagios/plugins/check_munin_rrd.pl --help

Monitor server via Munin-node pulled data
Usage: /usr/lib/nagios/plugins/check_munin_rrd.pl -H HOST -M MODULE
[-D DOMAIN] -w WARN -c CRIT [-V]
-h, --help
      print this help message
-H, --hostname=HOST
      name or IP address of host to check
-M, --module=MUNIN MODULE
      Munin module value to fetch
-D, --domain=DOMAIN
      Domain as defined in munin
-w, --warn=INTEGER
      warning level
-c, --crit=INTEGER
      critical level
-v      --verbose
      Be verbose
-V, --version
      prints version number
check_munin_rrd.pl (nagios-plugins 1.4.2) 0.9
The nagios plugins come with ABSOLUTELY NO WARRANTY. You may redistribute
copies of the plugins under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING.

A previous implementation used a check from Nagios directly against munin-node, which is overkill since the munin server already gets the data via cron.

You need to define:

  • a new command:
define command{
     command_name check_munin
     command_line /usr/lib/nagios/plugins/check_munin_rrd.pl -H $HOSTALIAS$ -M $ARG1$ -w $ARG2$ -c $ARG3$
     }
  • a new service template:
# generic service template definition check via munin
define service{
       name                            generic-munin-service ; The 'name' of this service template
       active_checks_enabled           1       ; Active service checks are enabled
       passive_checks_enabled          0       ; Passive service checks are disabled
       parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
       obsess_over_service             1       ; We should obsess over this service (if necessary)
       check_freshness                 0       ; Default is to NOT check service 'freshness'
       notifications_enabled           1       ; Service notifications are enabled
       event_handler_enabled           1       ; Service event handler is enabled
       flap_detection_enabled          1       ; Flap detection is enabled 
       failure_prediction_enabled      1       ; Failure prediction is enabled
       process_perf_data               1       ; Process performance data
       retain_status_information       1       ; Retain status information across program restarts
       retain_nonstatus_information    1       ; Retain non-status information across program restarts
       notification_interval           0       ; Only send notifications on status change by default.
       is_volatile                     0
       check_period                    24x7
       normal_check_interval           5             ; This directive is used to define the number of "time units" to wait before scheduling the next "regular" check of the service.
       retry_check_interval            3       ; This directive is used to define the number of "time units" to wait before scheduling a re-check of the service.
       max_check_attempts              2             ; This directive is used to define the number of times that Nagios will retry the service check command if it returns any state other than an OK state. Setting this value to 1 will cause Nagios to generate an alert without retrying the service check again.
       notification_period             24x7
       notification_options            w,u,c,r
       contact_groups                  admins
       register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
       }

Don't use a smaller value for normal_check_interval; munin only updates its data every 5 minutes.

  • a new service example:
# check the disk usage via munin
define service{
       hostgroup_name                  web-servers
       service_description             disk-usage
       check_command                   check_munin!df!75!90
       use                             generic-munin-service
     }


Munin and Alerting: Method 2

Integration with Nagios: via a NSCA server


First, you need a way for Nagios to accept messages from Munin. Nagios has exactly such a thing, namely NSCA, which is documented here: http://nagios.sourceforge.net/docs/1_0/addons.html#nsca.

NSCA consists of a client (a binary usually named send_nsca) and a server usually run from inetd. I recommend that you enable encryption on NSCA communication.

You also need to configure Nagios to accept messages via NSCA. These will be passive checks.

# This is an example of the correct way to activate Nagios warnings
contact.nagios.command /usr/local/nagios/bin/send_nsca nagioshost.example.com -c /usr/local/nagios/etc/send_nsca.cfg
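send_nsca reads passive service check results on its standard input as tab-separated lines of host, service, return code and message. A sketch of how such a line can be built in the shell (the host and service names here are made up for illustration):

```shell
# Build one passive service check result in the format send_nsca expects:
# host<TAB>service<TAB>return code (0=OK, 1=WARNING, 2=CRITICAL)<TAB>plugin output
HOST="serverA"; SERVICE="load"; CODE=1; MSG="WARNING: load is 3.2"
printf '%s\t%s\t%s\t%s\n' "$HOST" "$SERVICE" "$CODE" "$MSG"
# In practice this output would be piped into the send_nsca command configured above.
```

Munin pipes its notification text to the configured contact command, so the command must produce lines in this shape for Nagios to accept them.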

Munin and Alerting: Method 1

Munin internal alerting system

Munin graphs are nice; but if you don't want to check them every morning for suspiciously high network traffic or critical disk usage, you will probably want munin to send you an alert when it finds an "unusual" value. Munin has a very basic alerting system built in. Imagine your email address is user@foo.com, and you want to receive a mail if the load on serverA goes over 3, and another mail if it reaches the critical value 5. You also want your partner, partner@foo.com, to be notified.

In /etc/munin/munin.conf, add the following line (usually over the part defining the nodes to monitor) :


contact.user.command mail -s "Munin Notification" user@foo.com
contact.partner.command mail -s "Munin Notification" partner@foo.com


Then, in the part describing serverA:


[domain;serverA]
        address aaa.aaa.aaa.aaa
        use_node_name yes
        load.warning 3
        load.critical 5
        contacts user partner



The values 3 and 5 here are maximum values. If you wanted instead to be warned when the load goes under 1, you would replace 3 with 1:. You can also set both a minimum and a maximum: load.warning 1:3 warns you if the load goes under 1 or over 3.
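
To summarize the threshold syntax, here are the alternative forms side by side (a sketch; you would keep only one warning line per field):

```
# warn if the load goes over 3
load.warning 3
# warn if the load goes under 1
load.warning 1:
# warn if the load goes under 1 or over 3
load.warning 1:3
# critical if the load goes over 5
load.critical 5
```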
To monitor part of a service with munin, you will need the internal name of the element you want to check. For example, we want to be warned if the usage of the disk /dev/sdb1 on serverA exceeds 95%; the line we will add is _dev_sdb1.warning 95, _dev_sdb1 being the internal name of the element. There are two ways to find this internal name.


We know the usage of that disk is monitored by the plugin "df". So we can go to the HTML page produced by munin and click on the graph corresponding to the df plugin; at the bottom of the page, a table lists all the elements monitored by the df plugin, with their internal names. The other way is to connect to the node with telnet, and fetch the df plugin:



fetch df
_dev_sdb1.value 90
varrun_var_run.value 1
varlock_var_lock.value 0
procbususb.value 1
udev_dev.value 1
devshm_dev_shm.value 0
lrm_lib_modules_2_6_20_16_generic_volatile.value 9


Anyway, remember: munin is run by cron every five minutes, and munin does not keep track of whom it has already mailed. Imagine that the usage of your disk /dev/sdb1 climbs to 96% on Friday evening, just after you leave work: on Monday morning, you may find tens if not hundreds of mails waiting for you. You cannot make groups of contacts either, or groups of machines; if you want warnings on 10 services across 50 machines, it gets quite complicated. Therefore I would recommend using one of the Nagios methods.

Understanding Munin's Protocol

Throughout this tutorial, you will create and install plugins, add nodes, virtual nodes... It is really important for troubleshooting and debugging that you understand how munin communicates with the nodes - so we are going to play with telnet a little bit here. Open a shell on a machine which is allowed to connect to the munin-node; for our example I will take serverA (the node) and serverX (the server allowed to access the node). The default port of munin-node is 4949.


ServerX $ telnet aaa.aaa.aaa.aaa 4949
Trying aaa.aaa.aaa.aaa...
Connected to ServerA.
Escape character is '^]'.
# munin node at ServerA


As there is no help command, just enter something and hit enter to get a list of the commands available:


?
# Unknown command. Try list, nodes, config, fetch, version or quit

Let's start with the list command: as you can guess, it lists all the "services" presented by the node (your output may differ a little bit).


list
open_inodes if_err_eth0 irqstats entropy if_eth0 processes postfix_mailqueue acpi netstat interrupts swap load df_inode if_err_eth1 if_eth1 postfix_mailvolume iostat open_files forks memory vmstat


This list actually depends on what you have installed; if you have a MySQL server installed, for example, you will see mysql-related services in the list. The two commands config and fetch take a service as argument. Let's have a look at our swap:


config swap  
graph_title Swap in/out
graph_args -l 0 --base 1000
graph_vlabel pages per ${graph_period} in (-) / out (+)
graph_category system
swap_in.label swap
swap_in.type DERIVE
swap_in.max 100000
swap_in.min 0
swap_in.graph no
swap_out.label swap
swap_out.type DERIVE
swap_out.max 100000
swap_out.min 0
swap_out.negative swap_in

.
fetch swap
swap_in.value 5
swap_out.value 8



The config command tells munin how to build the graph: it gives a title for the graph, a category, a legend for the axes... These variables are needed by rrdtool to build the graphs. The fetch command retrieves the values themselves.
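
Since fetch returns plain "field.value N" lines, its output is easy to post-process in scripts. Here is a minimal sketch with awk, with the sample output hard-coded in place of a live node:

```shell
# Sample "fetch swap" output, hard-coded here in place of a live node
out='swap_in.value 5
swap_out.value 8'

# Pick out the value reported for swap_out
printf '%s\n' "$out" | awk '$1 == "swap_out.value" { print $2 }'   # prints: 8
```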

The nodes command lists the nodes made available by the current node. Right now we have only one, but as a single node can in theory monitor several servers or pieces of equipment, munin has introduced the concept of virtual nodes, which we will detail later. This command is not too important for now.


nodes
ServerA
.


Finally, we have the version and the quit commands:


version
munins node on ServerA version: 1.4.4

quit
Connection closed by foreign host.

Using telnet to access your munin-node is usually not necessary, although it can sometimes be useful for debugging. It has been described here to help you understand how munin and munin-node communicate, and how the whole data gathering process works.

Getting Started With Munin - A Simple Setup

We have two machines; a server we want to monitor (serverA, IP: aaa.aaa.aaa.aaa) and a server which will monitor it (serverX, IP: xxx.xxx.xxx.xxx).


Setting up the node


On the server we want to monitor, we need to install munin-node.

serverA $ sudo apt-get install munin-node

By default, only serverA itself is allowed to connect to this node to retrieve the data; we need to explicitly allow serverX to connect to it. This is done at the end of munin-node's configuration file (usually /etc/munin/munin-node.conf).

serverA $ sudo vi /etc/munin/munin-node.conf


You will find the following line:

allow ^127\.0\.0\.1$


Below it, add a line allowing the IP address of serverX to connect (the ^ and $ anchors at the beginning and the end are important):

allow ^xxx\.xxx\.xxx\.xxx$
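
Since allow takes a regular expression, a whole subnet can be admitted in one line (a sketch; substitute your own network):

```
allow ^192\.168\.1\.[0-9]+$
```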


As munin-node runs as a daemon, you need to restart it to make the changes active.

serverA $ sudo /etc/init.d/munin-node restart


Setting up munin


On the monitoring server, we install munin:

serverX $ sudo apt-get install munin


We need to tell munin that we want it to monitor serverA. Munin's configuration file, munin.conf, is usually found in /etc/munin/.

serverX $ sudo vi /etc/munin/munin.conf


At the end of the file, add the following:

[Domain;serverA]
        address aaa.aaa.aaa.aaa
        use_node_name yes


serverA should be the name of your machine. Domain is the "domain" of your machine; in fact it is more of a group name, used to sort your servers. You can sort by location (server1.edmonton...), by role (server1.apache), or whatever you feel is relevant. The notation usually found everywhere is [serverA.domain]; I don't know why, as it creates problems with domain names containing dots, and makes the domain name appear after every node name on the overview page. I recommend using the notation given above.
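
For instance, grouping two hypothetical machines by role would give:

```
[Web;serverA]
        address aaa.aaa.aaa.aaa
        use_node_name yes

[Mail;serverB]
        address bbb.bbb.bbb.bbb
        use_node_name yes
```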

Munin is a Perl script run every 5 minutes by cron. The cronjob should have been set automatically during the installation. Therefore, you don't have to restart it; just wait 5 minutes.

Making the files accessible


Your files should already have been generated in /var/www/munin by now. Make them accessible by installing an HTTP server; lighttpd will do the job here.

serverX $ sudo apt-get install lighttpd


You can then access the monitoring via your favorite browser with the address http://xxx.xxx.xxx.xxx/munin/.
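
If the generated files end up outside your document root, a minimal lighttpd alias can expose them; a sketch, assuming the default output directory /var/www/munin and that mod_alias is available:

```
server.modules += ( "mod_alias" )
alias.url      += ( "/munin/" => "/var/www/munin/" )
```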

Congratulations! Munin is working. Your server is now monitored, and the default setup should be enough for most cases. But there is a lot of other cool stuff you can do with munin, which I will describe now :).

Monday, July 5, 2010

Linux Tips: Learn 10 good Linux usage habits

Introduction

When you use a system often, you tend to fall into set usage patterns. Sometimes, you do not start the habit of doing things in the best possible way. Sometimes, you even pick up bad practices that lead to clutter and clumsiness. One of the best ways to correct such inadequacies is to conscientiously pick up good habits that counteract them. This article suggests 10 UNIX command-line habits worth picking up -- good habits that help you break many common usage foibles and make you more productive at the command line in the process. Each habit is described in more detail following the list of good habits.

Adopt 10 good habits

Ten good habits to adopt are:
  1. Make directory trees in a single swipe.
  2. Change the path; do not move the archive.
  3. Combine your commands with control operators.
  4. Quote variables with caution.
  5. Use escape sequences to manage long input.
  6. Group your commands together in a list.
  7. Use xargs outside of find.
  8. Know when grep should do the counting -- and when it should step aside.
  9. Match certain fields in output, not just lines.
  10. Stop piping cats.
Conclusion: Embrace good habits

It is good to examine your command-line habits for any bad usage patterns. Bad habits slow you down and often lead to unexpected errors. This article presents 10 new habits that can help you break away from many of the most common usage errors. Picking up these good habits is a positive step toward sharpening your Linux command-line skills.

    Good Habit #10: Stop piping cats

    A basic-but-common grep usage error involves piping the output of cat to grep to search the contents of a single file. This is absolutely unnecessary and a waste of time, because tools such as grep take file names as arguments. You simply do not need to use cat in this situation at all, as in the following example:

    Listing 1. Example of good and bad habit #10: Using grep with and without cat
     
    ~ $ time cat tmp/a/longfile.txt | grep and
    2811
    
    real    0m0.015s
    user    0m0.003s
    sys     0m0.013s
    ~ $ time grep and tmp/a/longfile.txt
    2811
    
    real    0m0.010s
    user    0m0.006s
    sys     0m0.004s
    ~ $ 

    This mistake applies to many tools. Because most tools take standard input as an argument using a hyphen (-), even the argument for using cat to intersperse multiple files with stdin is often not valid. It is really only necessary to concatenate before a pipe when you use cat with one of its several filtering options.
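
    One legitimate case is when you actually need one of cat's filtering options; for example, -n numbers the lines before grep sees them, so a match is printed with its line number (a small sketch using a throwaway sample file):

```shell
# Build a small sample file, then number its lines before searching,
# so the match is printed together with its line number
printf 'foo\nbar and baz\n' > /tmp/cat_n_sample.txt
cat -n /tmp/cat_n_sample.txt | grep and   # prints line 2 with its number
```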

    Good Habit #9: Match certain fields in output, not just lines

    A tool like awk is preferable to grep when you want to match the pattern in only a specific field in the lines of output and not just anywhere in the lines.

    The following simplified example shows how to list only those files modified in December:

    Listing 1. Example of bad habit #9: Using grep to find patterns in specific fields

    ~/tmp $ ls -l /tmp/a/b/c | grep Dec
    -rw-r--r--  7 joe joe  12043 Jan 27 20:36 December_Report.pdf
    -rw-r--r--  1 root root  238 Dec 03 08:19 README
    -rw-r--r--  3 joe joe   5096 Dec 14 14:26 archive.tar
    ~/tmp $

    In this example, grep filters the lines, outputting all files with Dec in their modification dates as well as in their names. A file such as December_Report.pdf is therefore matched, even though it was last modified in January, not December.

    This probably is not what you want. To match a pattern in a particular field, it is better to use awk, where a relational operator matches the exact field, as in the following example:

    Listing 2. Example of good habit #9: Using awk to find patterns in specific fields

    ~/tmp $ ls -l /tmp/a/b/c | awk '$6 == "Dec"'
    -rw-r--r--  1 root root  238 Dec 03 08:19 README
    -rw-r--r--  3 joe joe   5096 Dec 14 14:26 archive.tar
    ~/tmp $
     
     Check the awk man pages for more details about how to use awk.

    Good Habit #8: Know when grep should do the counting -- and when it should step aside

    Avoid piping a grep to wc -l in order to count the number of lines of output. The -c option to grep gives a count of lines that match the specified pattern and is generally faster than a pipe to wc, as in the following example:

    Listing 1. Example of good habit #8: Counting lines with and without grep

    ~ $ time grep and tmp/a/longfile.txt | wc -l
    2811
    
    real    0m0.097s
    user    0m0.006s
    sys     0m0.032s
    ~ $ time grep -c and tmp/a/longfile.txt
    2811
    
    real    0m0.013s
    user    0m0.006s
    sys     0m0.005s
    ~ $ 

    In addition to the speed factor, the -c option is also a better way to do the counting. With multiple files, grep -c returns a separate count for each file, one per line, whereas a pipe to wc gives a total count for all files combined.
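
    The per-file behaviour is easy to see with two small sample files (a quick sketch):

```shell
# Two sample files with a known number of matches each
printf 'and\nand\n' > /tmp/f1.txt
printf 'and\n'      > /tmp/f2.txt

grep -c and /tmp/f1.txt /tmp/f2.txt        # one count per file: 2 and 1
grep and /tmp/f1.txt /tmp/f2.txt | wc -l   # one combined total: 3
```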

    However, regardless of speed considerations, this example showcases another common error to avoid. These counting methods only give counts of the number of lines containing matched patterns -- and if that is what you are looking for, that is great. But in cases where lines can have multiple instances of a particular pattern, these methods do not give you a true count of the actual number of instances matched.

    To count the number of instances, use wc to count, after all. First, run a grep command with the -o option, if your version supports it. This option outputs only the matched pattern, one on each line, and not the line itself. But you cannot use it in conjunction with the -c option, so use wc -l to count the lines, as in the following example:

    Listing 2. Example of good habit #8: Counting pattern instances with grep

    ~ $ grep -o and tmp/a/longfile.txt | wc -l
    3402
    ~ $

    In this case, a call to wc is slightly faster than a second grep -c call with a dummy pattern put in to match every line.

    Good Habit #7: Use xargs outside of find

    Use the xargs tool as a filter for making good use of output culled from the find command. The general precept is that a find run provides a list of files that match some criteria. This list is passed on to xargs, which then runs some other useful command with that list of files as arguments, as in the following example:

    Listing 1. Example of the classic use of the xargs tool

    ~ $ find some-file-criteria some-file-path | \
    > xargs some-great-command-that-needs-filename-arguments
    However, do not think of xargs as just a helper for find; it is one of those underutilized tools that, when you get into the habit of using it, you want to try on everything, including the following uses.

    Passing a space-delimited list

    In its simplest invocation, xargs is like a filter that takes as input a list (with each member on its own line). The tool puts those members together on a single space-delimited line:

    Listing 2. Example of output from the xargs tool

    ~ $ xargs
    a
    b
    c
    Control-D
    a b c
    ~ $

    You can send the output of any tool that outputs file names through xargs to get a list of arguments for some other tool that takes file names as an argument, as in the following example:

    Listing 3. Example of using of the xargs tool

    ~/tmp $ ls -1 | xargs
    December_Report.pdf README a archive.tar mkdirhier.sh
    ~/tmp $ ls -1 | xargs file
    December_Report.pdf: PDF document, version 1.3
    README: ASCII text
    a: directory
    archive.tar: POSIX tar archive
    mkdirhier.sh: Bourne shell script text executable
    ~/tmp $

    The xargs command is useful for more than passing file names. Use it any time you need to filter text into a single line:

    Listing 4. Example of good habit #7: Using the xargs tool to filter text into a single line

    ~/tmp $ ls -l | xargs
    -rw-r--r-- 7 joe joe 12043 Jan 27 20:36 December_Report.pdf -rw-r--r-- 1 \
    root root 238 Dec 03 08:19 README drwxr-xr-x 38 joe joe 354082 Nov 02 \
    16:07 a -rw-r--r-- 3 joe joe 5096 Dec 14 14:26 archive.tar -rwxr-xr-x 1 \
    joe joe 3239 Sep 30 12:40 mkdirhier.sh
    ~/tmp $

    Be cautious using xargs
     
    Technically, a rare situation occurs in which you could get into trouble using xargs. By default, the end-of-file string is an underscore (_); if that character is sent as a single input argument, everything after it is ignored. As a precaution against this, use the -e flag, which, without arguments, turns off the end-of-file string completely.

    Good Habit #6: Group your commands together in a list

    Most shells have ways to group a set of commands together in a list so that you can pass their sum-total output down a pipeline or otherwise redirect any or all of its streams to the same place. You can generally do this by running a list of commands in a subshell or by running a list of commands in the current shell.

    Run a list of commands in a subshell

    Use parentheses to enclose a list of commands in a single group. Doing so runs the commands in a new subshell and allows you to redirect or otherwise collect the output of the whole, as in the following example:

    Listing 1. Example of good habit #6: Running a list of commands in a subshell

    ~ $ ( cd tmp/a/b/c/ || mkdir -p tmp/a/b/c && \
    > VAR=$PWD; cd ~; tar xvf archive.tar -C "$VAR" ) \
    > | mailx -s "Archive contents" admin

    In this example, the content of the archive is extracted in the tmp/a/b/c/ directory while the output of the grouped commands, including a list of extracted files, is mailed to the admin address.

    The use of a subshell is preferable in cases when you are redefining environment variables in your list of commands and you do not want those definitions to apply to your current shell.
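
    A quick sketch of that isolation:

```shell
VAR=outer
( VAR=inner; echo "inside: $VAR" )   # the subshell sees its own value: inner
echo "outside: $VAR"                 # the current shell still prints: outer
```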

    Run a list of commands in the current shell

    Use curly braces ({}) to enclose a list of commands to run in the current shell. Make sure you include spaces between the braces and the actual commands, or the shell might not interpret the braces correctly. Also, make sure that the final command in your list ends with a semicolon, as in the following example:

    Listing 2. Another example of good habit #6: Running a list of commands in the current shell

    ~ $ { cp -r ${VAR}a . && chown -R guest.guest a && \
    > tar cvf newarchive.tar a; } | mailx -s "New archive" admin

    Good Habit #5: Use escape sequences to manage long input

    You have probably seen code examples in which a backslash (\) continues a long line over to the next line, and you know that most shells treat what you type over successive lines joined by a backslash as one long line. However, you might not take advantage of this function on the command line as often as you can.

    The backslash is especially handy if your terminal does not handle multi-line wrapping properly or when your command line is smaller than usual (such as when you have a long path on the prompt). The backslash is also useful for making sense of long input lines as you type them, as in the following example:

    Listing 1. Example of good habit #5: Using a backslash for long input

    ~ $ cd tmp/a/b/c || \
    > mkdir -p tmp/a/b/c && \
    > tar xvf ~/archive.tar -C tmp/a/b/c

    Alternatively, the following configuration also works:

    Listing 2. Alternative example of good habit #5: Using a backslash for long input

    ~ $ cd tmp/a/b/c \
    >                 || \
    > mkdir -p tmp/a/b/c \
    >                    && \
    > tar xvf ~/archive.tar -C tmp/a/b/c

    However you divide an input line over multiple lines, the shell always treats it as one continuous line, because it strips out each backslash-newline pair before interpreting the command.

    Note: In most shells, when you press the up arrow key, the entire multi-line entry is redrawn on a single, long input line.

    Good Habit #4: Quote variables with caution

    Always be careful with shell expansion and variable names. It is generally a good idea to enclose variable calls in double quotation marks, unless you have a good reason not to. Similarly, if you are directly following a variable name with alphanumeric text, be sure also to enclose the variable name in curly braces ({}) to distinguish it from the surrounding text. Otherwise, the shell interprets the trailing text as part of your variable name -- and most likely returns a null value.

    Listing 1 provides examples of various quotation and non-quotation of variables and their effects.

    Listing 1. Example of good habit #4: Quoting (and not quoting) a variable

    ~ $ ls tmp/
    a b
    ~ $ VAR="tmp/*"
    ~ $ echo $VAR
    tmp/a tmp/b
    ~ $ echo "$VAR" 
    tmp/*
    ~ $ echo $VARa
     
    ~ $ echo "$VARa"
    
    ~ $ echo "${VAR}a"
    tmp/*a
    ~ $ echo ${VAR}a
    tmp/a
    ~ $


    Good Habit #3: Combine your commands with control operators

    You probably already know that in most shells, you can combine commands on a single command line by placing a semicolon (;) between them. The semicolon is a shell control operator, and while it is useful for stringing together multiple discrete commands on a single command line, it does not work for everything. For example, suppose you use a semicolon to combine two commands in which the proper execution of the second command depends entirely upon the successful completion of the first. If the first command does not exit as you expected, the second command still runs -- and fails. Instead, use more appropriate control operators (some are described in this article). As long as your shell supports them, they are worth getting into the habit of using.

    Run a command only if another command returns a zero exit status

    Use the && control operator to combine two commands so that the second is run only if the first command returns a zero exit status. In other words, if the first command runs successfully, the second command runs. If the first command fails, the second command does not run at all. For example:

    Listing 1. Example of good habit #3: Combining commands with control operators

    ~ $ cd tmp/a/b/c && tar xvf ~/archive.tar

    In this example, the contents of the archive are extracted into the ~/tmp/a/b/c directory, provided that it exists. If the directory does not exist, the tar command does not run, so nothing is extracted.

    Run a command only if another command returns a non-zero exit status

    Similarly, the || control operator separates two commands and runs the second command only if the first command returns a non-zero exit status. In other words, if the first command is successful, the second command does not run. If the first command fails, the second command does run. This operator is often used to test whether a given directory exists and, if not, to create it:

    Listing 2. Another example of good habit #3: Combining commands with control operators

    ~ $ cd tmp/a/b/c || mkdir -p tmp/a/b/c

    You can also combine the control operators described in this section. Each works on the last command run:

    Listing 3. A combined example of good habit #3: Combining commands with control operators

    ~ $ cd tmp/a/b/c || mkdir -p tmp/a/b/c && tar xvf ~/archive.tar -C tmp/a/b/c