A Blog for Those With a Big Appetite for IT Knowledge...: Understanding Linux system performance management using top

If there's something wrong with the performance of your Linux server, chances are that you're already using top to find out what's happening. It seems, however, that few people really know how to tell what their system is doing from the information that top provides. Here I will explain how to read that performance data.

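You can run top as any user, but you need root privileges to kill or renice processes that belong to other users, so make sure that you are logged in as root. To start, open a console on your favorite Linux distribution and enter the top command. The result should look similar to this: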

host:~ # top
top - 12:41:34 up 1 day, 3:29, 6 users, load average: 0.00, 0.00, 0.00
Tasks: 99 total, 1 running, 98 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 775064k total, 560056k used, 215008k free, 136216k buffers
Swap: 136544k total, 0k used, 136544k free, 275624k cached
   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8830 root 15 0 2232 936 692 R 2 0.1 0:00.03 top
   1 root 16 0 728 284 244 S 0 0.0 0:01.25 init
   2 root RT 0 0 0 0 S 0 0.0 0:00.10 migration/0
   3 root 34 19 0 0 0 S 0 0.0 0:00.34 ksoftirqd/0
   4 root RT 0 0 0 0 S 0 0.0 0:00.04 migration/1
   5 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/1
   6 root 10 -5 0 0 0 S 0 0.0 0:00.33 events/0
   7 root 10 -5 0 0 0 S 0 0.0 0:00.10 events/1
   8 root 11 -5 0 0 0 S 0 0.0 0:00.01 khelper
   9 root 12 -5 0 0 0 S 0 0.0 0:00.00 kthread
   13 root 10 -5 0 0 0 S 0 0.0 0:00.13 kblockd/0
   14 root 10 -5 0 0 0 S 0 0.0 0:00.08 kblockd/1
   15 root 13 -5 0 0 0 S 0 0.0 0:00.00 kacpid
   16 root 13 -5 0 0 0 S 0 0.0 0:00.00 kacpi_notify
   227 root 20 0 0 0 0 S 0 0.0 0:00.00 pdflush
   228 root 15 0 0 0 0 S 0 0.0 0:00.93 pdflush
   229 root 17 0 0 0 0 S 0 0.0 0:00.00 kswapd0

The first relevant piece of information that top provides can be found in the first line: the load average. It describes how busy your computer has been recently. The load average is always given as three numbers: the average for the last minute, the last five minutes and the last fifteen minutes. You should always start by interpreting these numbers, as they tell you whether your system is overloaded.

To understand the load average values, you must relate them to the number of CPUs or CPU cores in your computer. If you're not sure how many you have, press the 1 key while the top interface is active; this gives you a line for each CPU core that is present in your computer. The load average is not scaled to the number of cores: when one CPU core has been completely busy for the last minute, top shows 1.00, whether the system has one core or eight. On a one-core system, that means the machine is fully used. If you have eight cores installed in your computer, and one has been completely busy while the others were doing nothing, the same 1.00 works out to 1.00 / 8 = 0.125 per core. On Linux, the load average also counts processes that are blocked waiting for I/O, so a high value does not always mean that the CPU itself is busy. In order to interpret the load average, you need to know the value at which your server is fully used, which equals its number of cores. For instance, on a four-core machine, that would be 4.00. A load average that stays above that value is bad, as it indicates that queuing occurs and processes are waiting for their slice of system time. Anything below this value is generally fine. If your system is getting beyond the ideal value for that system, the next step is to determine what exactly is happening. The following listing gives an example of a one-core system that is too busy:

top - 12:49:38 up 1 day, 3:37, 6 users, load average: 1.37, 0.34, 0.11
Tasks: 101 total, 4 running, 97 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.1%us, 16.7%sy, 0.0%ni, 6.2%id, 67.3%wa, 0.4%hi, 2.4%si, 0.0%st
Mem: 775064k total, 767664k used, 7400k free, 514236k buffers
Swap: 136544k total, 0k used, 136544k free, 102744k cached
   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8859 root 18 0 1788 524 448 R 46 0.1 0:21.25 cat
8860 root 18 0 1792 524 448 R 41 0.1 0:18.83 cat
  229 root 15 0 0 0 0 S 9 0.0 0:01.54 kswapd0
3695 root 16 0 1864 700 616 S 3 0.1 0:16.00 hald-addon-stor
4580 root 16 0 95916 14m 11m S 3 1.9 0:36.44 main-menu
4552 root 15 0 101m 24m 17m S 2 3.2 0:13.44 nautilus
   13 root 10 -5 0 0 0 S 1 0.0 0:00.70 kblockd/0
4270 root 15 0 146m 12m 5716 S 1 1.7 1:01.66 X
4536 root 15 0 29012 9688 7896 S 1 1.2 0:05.47 gnome-settings-
4578 root 16 0 18136 5500 4124 S 1 0.7 0:14.87 gnome-power-man
8861 root 16 0 2236 1036 780 R 1 0.1 0:00.64 top
   14 root 10 -5 0 0 0 S 1 0.0 0:00.49 kblockd/1
4575 root 15 0 95736 13m 10m S 1 1.8 0:07.39 application-bro
3131 root 16 0 4692 3244 1444 S 1 0.4 0:34.57 hald
4586 root 15 0 93868 11m 9796 S 1 1.6 0:06.62 mixer_applet2
  228 root 15 0 0 0 0 S 0 0.0 0:00.95 pdflush
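
To make the per-core reasoning concrete, here is a minimal Python sketch that pulls the three load averages out of the first line of the listing above and divides them by a core count. The function names parse_load_average and load_per_core are made up for this illustration, and the core count of 1 is taken from the text rather than detected.

```python
import re


def parse_load_average(header_line: str) -> tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages from top's first line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", header_line
    )
    if match is None:
        raise ValueError("no load average found in line")
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen


def load_per_core(load: float, cores: int) -> float:
    """Scale a load average to the number of cores (1.0 means fully used)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


# First line of the busy listing above; the article describes a one-core system.
header = "top - 12:49:38 up 1 day, 3:37, 6 users, load average: 1.37, 0.34, 0.11"
one, five, fifteen = parse_load_average(header)

for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
    per_core = load_per_core(value, cores=1)
    status = "overloaded" if per_core > 1.0 else "ok"
    print(f"{label}: {value:.2f} total, {per_core:.2f} per core ({status})")
```

Only the one-minute figure exceeds 1.0 per core here, which suggests that the overload started recently.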

If the load is getting too high, you need to find out what is happening. To do this, look at the CPU line(s). You will see no fewer than eight different parameters, of which only three really matter. The first is the "us" parameter. It indicates the percentage of time your system is busy handling requests that were made in user space. If a task is not in user space, it is a high-privileged task that runs in system (kernel) space, which you can see reflected in the "sy" parameter. In kernel space, code can communicate directly with the drivers. Therefore, you should worry more if your system shows a high load in system space. The third important parameter in the CPU line is "wa". This stands for waiting, and indicates the percentage of time the CPU sits idle while waiting for I/O devices. A high value here indicates a problem on the I/O channel; normally this is a hard disk that is too slow, although slow network storage can show up here as well.
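
As an illustration, the following Python sketch splits a CPU line into its eight fields and reports which of the three parameters discussed above is the largest. It assumes the older "7.1%us" layout shown in this article (newer versions of top print the fields differently), and parse_cpu_line is a name invented for this example.

```python
import re


def parse_cpu_line(line: str) -> dict[str, float]:
    """Map each two-letter field (us, sy, ni, id, wa, hi, si, st) to its percentage."""
    pairs = re.findall(r"([\d.]+)%\s*([a-z]{2})", line)
    return {name: float(value) for value, name in pairs}


# CPU line taken from the busy listing in this article.
cpu_line = "Cpu(s): 7.1%us, 16.7%sy, 0.0%ni, 6.2%id, 67.3%wa, 0.4%hi, 2.4%si, 0.0%st"
cpu = parse_cpu_line(cpu_line)

# Look only at the three parameters the article singles out.
important = {name: cpu[name] for name in ("us", "sy", "wa")}
dominant = max(important, key=important.get)

print(important)
print(f"largest of us/sy/wa: {dominant} at {important[dominant]}%")
```

For the sample line, the I/O wait value is clearly the largest of the three.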

The second listing shows a system that spends far too much time waiting for I/O (67.3%wa). This is all too common: many system performance problems are related to slow I/O devices. One solution is to install a faster hard drive, but before doing that, it is a good idea to check the BIOS of your server and see if there are parameters that you can tune. One of the most important candidates is the write cache parameter. By writing data to the write cache before writing it to the disk platters themselves, you can dramatically reduce waiting times. Since write cache is, as a rule of thumb, about 1,000 times as fast as the hard disk, chances are that you can gain a lot by enabling this feature.

Use top to reveal memory efficiency on Linux servers

Apart from showing how busy your system's CPU is, top also shows you how memory-efficient your server is. You can find this information in the lines that start with Mem: and Swap:. Let's start with swap. This is RAM that is emulated on the hard drive, and ideally your computer should never have to use it. There are some exceptions though, for instance if your server runs Oracle, SAP or another application that is built to use swap. Normally, however, Linux starts swapping in earnest only when it is running out of normal memory. As an exception, your server may put some data in swap ahead of time so that it can free memory faster if it's needed. But in most cases, you should install more RAM in a server that starts swapping.

After you have verified that your system isn't swapping, you should find out what it is doing with available memory. To understand memory, you should know that Linux uses memory quite efficiently. If there's no real need for memory to service processes, it will be used as read cache or write buffers. The read cache contains files that were recently read from your computer's hard drive. The kernel keeps them in RAM because you might need them again, and if you do, it's a lot faster to serve these files from RAM than from the hard disk. The write buffers, on the other hand, are used as a waiting room for your server's hard drive. Instead of sending data directly to the hard drive, the operating system places it in the write buffers, where it waits until the hard disk has time to flush these write buffers (i.e., write them to disk). This also gives you a performance benefit.

The nice thing about read cache and write buffers is that the operating system can make them available almost instantaneously when it needs memory. Therefore, you should add the read cache and write buffers to the amount of free memory. A nice way of doing this is by using the free -m command. On the -/+ buffers/cache line, you can see how much free memory your computer really has.

host:~ # free -m
   total used free shared buffers cached
Mem: 756 746 10 0 595 40
-/+ buffers/cache: 110 646
Swap: 133 0 133

As you can see in this listing, at first sight it looks as if this server has almost no available memory left (10 MB free), but once you know that buffers and cache can be released immediately, you can see that it has plenty of available memory (646 MB).
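
The arithmetic behind the -/+ buffers/cache line can be sketched in a few lines of Python. The MemoryFigures class and its property names are invented for this illustration; the numbers are copied from the free -m listing above.

```python
from dataclasses import dataclass


@dataclass
class MemoryFigures:
    """Figures from the Mem: line of `free -m`, all in megabytes."""
    total: int
    used: int
    free: int
    buffers: int
    cached: int

    @property
    def really_free(self) -> int:
        # Buffers and cache can be handed back to processes on demand.
        return self.free + self.buffers + self.cached

    @property
    def used_by_processes(self) -> int:
        # Memory that processes actually occupy, without buffers and cache.
        return self.used - self.buffers - self.cached


mem = MemoryFigures(total=756, used=746, free=10, buffers=595, cached=40)

print(f"free at first sight: {mem.free} MB")
print(f"free including buffers/cache: {mem.really_free} MB")
print(f"used by processes: {mem.used_by_processes} MB")
```

The results (645 and 111) are one megabyte off from the 646 and 110 in the listing, presumably because free rounds every column to whole megabytes separately.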

Determining active processes on a Linux server

The last interesting part of top is where it shows the most active processes on your server. This is not hard to determine: the most active process is listed first in the process list. If this process uses too many system resources, top offers some options to handle it. You can terminate it by pressing the k key in the top interface. Top will then ask you which signal you want to send to this process. You should always try signal 15 (SIGTERM) first; this is the nice way of asking the process to please stop its activity. If that doesn't work, use signal 9 (SIGKILL), which just terminates the process without further waiting.
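
The "signal 15 first, signal 9 as a last resort" routine can also be scripted. The Python sketch below is only an illustration: stop_process and the five-second grace period are choices made for this example, and it stops a child process that the script starts itself. For an arbitrary PID taken from top, os.kill(pid, signal.SIGTERM) sends the same signal.

```python
import signal
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask a process to stop with SIGTERM; fall back to SIGKILL if it ignores us."""
    proc.send_signal(signal.SIGTERM)      # signal 15: the polite request
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGKILL)  # signal 9: cannot be caught or ignored
        proc.wait()
    return proc.returncode


# Start a harmless child that would otherwise sleep for a minute, then stop it.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child)

# On Linux, a negative return code means "terminated by that signal number".
print(f"child exited with code {exit_code}")
```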

Another way of taming a process is by "renicing" it, i.e., adjusting the priority the process is using. To do this, press the r key in the top interface. By giving a process a negative nice value, you increase its priority with regard to other processes. By assigning a positive nice value, you give more room to the other processes. The values you can use are between -20 and 19. It's a good idea not to assign the value of -20. By doing this, you would give the highest possible priority to a process, allowing it to leave no time for the other processes (if it is a busy process).
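
For completeness, here is a small Python sketch of the same idea outside top, using the standard os.getpriority and os.setpriority calls. The helper names clamp_nice and make_nicer are made up for this example. It only raises the nice value of the script itself (PID 0 means "the calling process"), which does not require root; lowering a nice value normally does.

```python
import os

NICE_MIN, NICE_MAX = -20, 19  # the range of nice values mentioned above


def clamp_nice(value: int) -> int:
    """Keep a nice value inside the allowed range."""
    return max(NICE_MIN, min(NICE_MAX, value))


def make_nicer(pid: int, increment: int = 5) -> int:
    """Raise the nice value of a process (lower its priority); return the new value."""
    current = os.getpriority(os.PRIO_PROCESS, pid)
    target = clamp_nice(current + increment)
    os.setpriority(os.PRIO_PROCESS, pid, target)
    return os.getpriority(os.PRIO_PROCESS, pid)


# Give other processes a little more room by renicing this script itself.
new_nice = make_nicer(0, increment=1)
print(f"nice value is now {new_nice}")
```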

Wednesday, October 27, 2010
