AWK in a day's work

It is pretty common that when you have a log file, you sometimes tend to zone in on a single column and do some sort of aggregration. This happened when I was checking out my HTTP access logs (I had not installed awstats yet). So to get the frequency of the host hits, I did the following:

    cat access.log | cut -d" " -f1 | sort | uniq -c | sort -nr
    

Assuming you have the logging where the request IP comes first, we take the first column and then cut/sort and then have the uniq command count it and then the subsequent command sort its numerically in a descending order of hits and the unevenful output will look like:

    558 66.249.76.137
    428 66.249.76.55
    223 66.249.76.67
    159 66.249.76.99
    ...
    ...
    

With this output, I got curious with the list of IPs and started firing ‘nslookup’ on the list of the IPs to find out the actual domains of the IPs. Instead of typing the individual ‘nslookup’ commands, it would be great if my output were to look like this:

    63 68-188-44-138.static.stls.mo.charter.com.
    32 ppp-99-109.25-151.libero.it.
    21 02dcfc50.bb.sky.com.
    14 173.199.116.187.choopa.net.
    10 spider-199-21-99-80.yandex.com.
    ...
    ...
    

This started the quest to develop a single liner where instead of IPs, the output lists the domain names with their frequencies.

Using the host hits and IPs as the starting point, I came up with this one-liner on my Mac.

    echo "9 173.48.211.36" | awk '{print $1,"\n", system("nslookup " $2 "|cat")}' | awk '/name/ {print count,$NF}; { if(NR==1) count=$0}'
    

This command saves the host hits in a “count” variable and then searches for the “name” field in the nslookup command. This command worked fine on Mac but not on Ubuntu because the order of host hits and output of nslookup was reversed. To fix this I added an explicit action block for host hits and also gave it a column name. The improved one-liner is as follows:

    echo "9 173.48.211.36" | awk '{print "Count: "$1};{ns=system("nslookup " $2 "|cat")}' | awk '/name/ {print count,$NF}; /Count:/ { count=$NF}'
    

Now let’s merge the above one liner with the one with the IPs and this gets us a final version.

    cat access.log | cut -d" " -f1 | sort | uniq -c | sort -nr | awk '{print "Count: "$1};{ns=system("nslookup " $2 "|cat")}' | awk '/name/ {print count,$NF}; /Count:/ { count=$NF}'
    

But wait, there’s more, As I read more about nslookup, I found that it has been deprecated in favor of ‘host’ and ‘dig’. So finally the one-liner using ‘host’ looks like this:

    cat access.log | cut -d" " -f1 | sort | uniq -c | sort -nr | awk '{printf "%s %s ", $1,$2};{ns=system("host " $2 "|cat")}' | awk '{print $1, $2, $7}'  
    

The new output looks this:

    56 95.108.158.231 img-spider-95-108-158-231.yandex.com.
    39 89.178.214.185 89-178-214-185.broadband.corbina.ru.
    32 67.152.243.196 ip67-152-243-196.z243-152-67.customer.algx.net.
    4 65.96.129.102 c-65-96-129-102.hsd1.ma.comcast.net.
    3 77.75.77.32 fulltextrobot-77-75-77-32.seznam.cz.
    3 199.21.99.80 spider-199-21-99-80.yandex.com.
    2 81.38.42.65 65.Red-81-38-42.dynamicIP.rima-tde.net.
    1 74.125.19.28 3(NXDOMAIN)
    1 74.125.183.31 3(NXDOMAIN)
    1 66.249.76.228 crawl-66-249-76-228.googlebot.com.
    1 180.76.6.223 3(NXDOMAIN)
    1 180.76.5.64 2(SERVFAIL)
    1 173.199.116.51 173.199.116.51.choopa.net.
    1 173.199.116.227 173.199.116.227.choopa.net.
    

If you have any more thoughts how this one-liner can be improved, drop in a comment.