Bash script to clean Bots out of Apache Logs

If you’ve ever spent some time looking at webserver logs, you know how much crap there is in there from crawlers, bots, indexers, and all the bottom feeders of the internet. If you’re looking for a specific problem with the webserver, this stuff can quickly become a nuisance, stopping you from finding the information you need. In addition, it’s often a surprise exactly how much of your website’s traffic is served up to these bots.

The script below helps with both these problems. It takes stats of a logfile (Apache, but it should also work on nginx), makes a backup, counts the number of lines it removes for each kind of bot, and then repeats the new stats at the end. Copy the following into a file and run it with the name of the logfile as an argument, e.g. logfile.log. (Use a copy of the logfile. It DELETES the lines.)
You’ll definitely want to edit the LOCALTRAFFIC bit to fit your needs. You may also want to add bots to the BOTLIST. Run the script once on a copy of the logfile and then view it to see what bots are left …



#!/bin/bash
# Logfile to clean, passed as the first argument
INFILE="${1:?Usage: $0 logfile}"

# Keep a backup before deleting anything
# (the .bak name is an assumption; the original backup step is not shown)
cp -p "$INFILE" "$INFILE.bak"

#filestats before
PRELINES=$(wc -l < "$INFILE")
PRESIZE=$(stat -c %s "$INFILE")
echo "$INFILE is $PRESIZE bytes and contains $PRELINES lines"

# List of local traffic patterns to delete from logfiles, space separated
LOCALTRAFFIC=" wp-cron.php "

echo "-------- Removing local traffic ---------"
for PATTERN in $LOCALTRAFFIC; do
    # Count the matching lines first, then delete them in place
    PATTERNCOUNT=$(grep -c "$PATTERN" "$INFILE")
    echo "Removing $PATTERNCOUNT instances of $PATTERN"
    sed -i "/$PATTERN/d" "$INFILE"
done

# List of patterns to delete from logfiles, space separated
# (slashes are escaped because sed uses / as its delimiter)
BOTLIST="ahrefs Baiduspider bingbot Cliqzbot DomainCrawler DuckDuckGo Exabot Googlebot linkdexbot magpie-crawler MJ12bot msnbot opensiteexplorer pingdom rogerbot SemrushBot SeznamBot\/docs tt-rss Wotbox YandexBot YandexImages ysearch\/slurp "

echo "------- Removing Bots ---------"
for PATTERN in $BOTLIST; do
    PATTERNCOUNT=$(grep -c "$PATTERN" "$INFILE")
    echo "Removing $PATTERNCOUNT instances of $PATTERN"
    sed -i "/$PATTERN/d" "$INFILE"
done

#filestats after
POSTLINES=$(wc -l < "$INFILE")
POSTSIZE=$(stat -c %s "$INFILE")
# Remaining lines as a percentage of the original, rounded to the nearest integer
PERCENT=$(awk "BEGIN { pc=100*${POSTLINES}/${PRELINES}; i=int(pc); print (pc-i<0.5)?i:i+1 }")

echo "$INFILE is now $POSTSIZE bytes and contains $POSTLINES lines"
echo "Log reduced to $PERCENT percent of its original size."

And here is a sample output.

~/temp/log $ ./ 02Apr.log
02Apr.log is 2432560 bytes and contains 10238 lines

-------- Removing local traffic ---------
Removing 1054 instances of wp-cron.php
Removing 776 instances of
------- Removing Bots ---------
Removing 525 instances of Googlebot
Removing 226 instances of DomainCrawler
Removing 1061 instances of Baiduspider
Removing 377 instances of pingdom
Removing 1343 instances of
Removing 1087 instances of opensiteexplorer
Removing 212 instances of YandexBot
Removing 26 instances of YandexImages
Removing 163 instances of SemrushBot
Removing 17 instances of ysearch\/slurp
Removing 44 instances of msnbot
Removing 385 instances of bingbot
Removing 95 instances of Wotbox
Removing 51 instances of Cliqzbot
Removing 22 instances of Exabot
Removing 116 instances of SeznamBot
Removing 5 instances of magpie-crawler
Removing 57 instances of\/docs
Removing 6 instances of rogerbot

02Apr.log is now 566074 bytes and contains 2590 lines
Log reduced to 25 percent of its original size.

So you can see that around 75% of the traffic on here is crap. And now the log file is much easier to read.
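
To see which bots are left after a run, one option is to count the user agent strings that remain in the file. The sketch below assumes the log is in Apache's combined format, where the user agent is the last quoted field on each line. The function name top_agents is a hypothetical helper, not part of the script above.

```shell
#!/bin/bash
# Print the most frequent user agents in a combined-format access log.
# Assumption: splitting each line on double quotes puts the user agent in field 6
# (request is field 2, referer is field 4).
top_agents() {
    local logfile="$1"
    local limit="${2:-20}"   # how many agents to show, default 20
    awk -F'"' '{ print $6 }' "$logfile" | sort | uniq -c | sort -rn | head -n "$limit"
}
```

For example, top_agents logfile.log 30 prints the 30 most common agents with their line counts. Anything in that list that looks like a crawler is a candidate for BOTLIST.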


Turning off IPv6 in Ubuntu 16

My home router doesn’t handle IPv6, and for that matter, neither does my ISP, so I get a lot of IPv6-related garbage in my syslog and kern.log. To turn it off, create a new file rather than editing a system file, and then reload the settings.

sudo nano /etc/sysctl.d/95-disable-ipv6.conf
#add the following lines
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
# reload kernel parameters
sudo service procps reload
# Check that it's taken ... the result of this should be 1
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
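
The check above only looks at the "all" setting. A rough sketch for checking all three values from the config file in one go might look like the following. The function name check_ipv6_disabled and its optional directory argument are my own additions for illustration; by default it reads from /proc/sys/net/ipv6/conf.

```shell
#!/bin/bash
# Report whether IPv6 is disabled for the all, default and lo entries.
# The first argument is an optional base directory (an assumption added so the
# function can be pointed somewhere else); it defaults to the real /proc path.
check_ipv6_disabled() {
    local base="${1:-/proc/sys/net/ipv6/conf}"
    local iface value status=0
    for iface in all default lo; do
        value=$(cat "$base/$iface/disable_ipv6" 2>/dev/null)
        if [ "$value" = "1" ]; then
            echo "$iface: disabled"
        else
            echo "$iface: NOT disabled (value: ${value:-unreadable})"
            status=1
        fi
    done
    # Non-zero return means at least one entry is still enabled or unreadable
    return "$status"
}
```

Calling check_ipv6_disabled with no arguments after the reload should print "disabled" three times if the settings have taken.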


Secret WordPress Options Page

OK, so maybe it’s not a complete secret, but after around 10 years of running WordPress I only just found out about it. Here it is:

Obviously you have to be logged in as admin. This basically gives you access to the wp_options table without having to go to a separate database management app or do some MySQL command-line ninja work.

Who knew? It’s not on a menu, probably to hide it from fat fingers, but hey, I wish I’d known about this earlier. It would have saved me a few hours over the course of my work life.

Blocking irritating PLDT Billing popups

So you forgot to pay your bill. It happens. PLDT used to set up an automated phone call that rang you once a day until you paid. Now they have something much more irritating, and something that feels borderline illegal: they hijack your internet connection. Every four or five page loads, they inject an HTML frame with a monster ad, which sits on top of all your work. You have no choice but to click on the button, which takes you to a page with the amount you owe (which means they know which IP is your account, BTW).

However there is no way to remove the popup, which sits on top of your browser. There is no ‘close window’ button, which means your only choice is to refresh the page, or close the window and re-open it. Too bad if you were working on something and hadn’t pressed the save button: that’s now gone. And even after you pay, the popup sticks around for a day or two … (how hard can it be to automatically cancel it?)

So how do we block it? The frame is an iframe which uses an IP address rather than a hostname, so we can’t use DNS blocking or hosts-file blocking. Changing the routing table seems to be the way to go.

On Linux, using sudo if necessary:

# check routing table
route -n
# Add rule
route add gw lo
# check routing table again
route -n
# check desired result
ip route get
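
On newer systems where the route command is not installed, a similar effect can probably be had with the ip tool by adding a blackhole route. This is a different technique from the one above, shown only as a sketch. The block_ip helper is hypothetical, and 192.0.2.1 in the usage note is a placeholder from the documentation address range, not the real address of the popup server. By default the function only prints the command; it runs it (via sudo) when APPLY=1 is set.

```shell
#!/bin/bash
# Build an "ip route add blackhole" command for a single address.
# Prints the command by default; set APPLY=1 to actually run it with sudo.
block_ip() {
    local ip="$1"
    local cmd="ip route add blackhole $ip/32"
    if [ "${APPLY:-0}" = "1" ]; then
        # Word splitting of $cmd is intentional here
        sudo $cmd
    else
        echo "$cmd"
    fi
}
```

For example, block_ip 192.0.2.1 prints the command for review, and APPLY=1 block_ip 192.0.2.1 applies it. Running the same command with "del" in place of "add" should remove the route again.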


Getting logwatch to print out a list of apt packages with upgrades

A useful one to have appear in your inbox in the morning. I have just done this on a server, so I thought I’d put it here to remind myself. This works for Debian and Ubuntu variants, which have the apt command (a meta-script for the apt-get ecosystem).

In note form …

Add the following text to /etc/logwatch/scripts/services/apt

/usr/bin/apt list --upgradable

Add the following text to /etc/logwatch/conf/services/apt.conf

# The title shown in the report.
Title = "Packages to upgrade"

# The name of the log file group (file name). 
# e.g for /etc/logwatch/conf/logfiles/apt.conf, we'd have Logfile = apt
LogFile = NONE

As we have NONE there, we don’t need to create /etc/logwatch/conf/logfiles/apt.conf.
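
One possible refinement: apt list prints a "Listing..." header line on stdout and a warning about its unstable CLI on stderr, which can clutter the report. The sketch below shows a small filter that could be used in the service script to keep only the package lines. The function name filter_upgradable is hypothetical and not something logwatch provides.

```shell
#!/bin/bash
# Read "apt list --upgradable" output on stdin and keep only package lines,
# dropping the "Listing..." header and any blank lines.
filter_upgradable() {
    sed -e '/^Listing/d' -e '/^$/d'
}
```

In /etc/logwatch/scripts/services/apt this would be used as /usr/bin/apt list --upgradable 2>/dev/null | filter_upgradable, where the 2>/dev/null discards the CLI warning.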