Monday, July 5, 2010

Good Habit #8: Know when grep should do the counting -- and when it should step aside

Avoid piping a grep to wc -l in order to count the number of lines of output. The -c option to grep gives a count of lines that match the specified pattern and is generally faster than a pipe to wc, as in the following example:

Listing 1. Example of good habit #8: Counting lines with and without grep

~ $ time grep and tmp/a/longfile.txt | wc -l
2811

real    0m0.097s
user    0m0.006s
sys     0m0.032s
~ $ time grep -c and tmp/a/longfile.txt
2811

real    0m0.013s
user    0m0.006s
sys     0m0.005s
~ $ 

An addition to the speed factor, the -c option is also a better way to do the counting. With multiple files, grep with the -c option returns a separate count for each file, one on each line, whereas a pipe to wc gives a total count for all files combined.

However, regardless of speed considerations, this example showcases another common error to avoid. These counting methods only give counts of the number of lines containing matched patterns -- and if that is what you are looking for, that is great. But in cases where lines can have multiple instances of a particular pattern, these methods do not give you a true count of the actual number of instances matched.

To count the number of instances, use wc to count, after all. First, run a grep command with the -o option, if your version supports it. This option outputs only the matched pattern, one on each line, and not the line itself. But you cannot use it in conjunction with the -c option, so use wc -l to count the lines, as in the following example:

Listing 2. Example of good habit #8: Counting pattern instances with grep

~ $ grep -o and tmp/a/longfile.txt | wc -l
3402
~ $

In this case, a call to wc is slightly faster than a second call to grep with a dummy pattern put in to match and count each line (such as grep -c).

No comments:

Post a Comment