Bash script to clean Bots out of Apache Logs

If you’ve ever spent some time looking at webserver logs, you know how much crap there is in there from crawlers, bots, indexers, and all the bottom feeders of the internet. If you’re looking for a specific problem with the webserver, this stuff can quickly become a nuisance, stopping you from finding the information you need. In addition, it’s often a surprise exactly how much of the traffic your website serves up to these bots.

The script below helps with both these problems. It reports the size and line count of a logfile (Apache, but it should also work on nginx), makes a backup, counts the lines it removes for each kind of bot, and then reports the new stats at the end. Copy the following into a file and run it with the name of the logfile as an argument, e.g. logfile.log. (Use a copy of the logfile. It DELETES the lines.)
You’ll definitely want to edit the LOCALTRAFFIC bit to fit your needs. You may also want to add bots to the BOTLIST. Run the script once on a copy of the logfile and then view it to see what bots are left …



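#!/bin/bash
# Usage: pass the logfile to clean as the first argument.
INFILE="$1"
[ -f "$INFILE" ] || { echo "Usage: $0 logfile" >&2; exit 1; }

# Keep a backup, since the cleanup below edits the file in place.
# (The .bak suffix is an assumption; the original backup step is not shown.)
cp -p "$INFILE" "$INFILE.bak"

#filestats before
PRELINES=$(wc -l < "$INFILE")
PRESIZE=$(stat -c %s "$INFILE")
echo "$INFILE is $PRESIZE bytes and contains $PRELINES lines"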

# List of patterns to delete from logfiles
LOCALTRAFFIC=" wp-cron.php "

echo "-------- Removing local traffic ---------"
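for PATTERN in $LOCALTRAFFIC; do
    # Count matching lines first, then delete them in place
    TERMCOUNT=$(grep -c "$PATTERN" "$INFILE")
    echo "Removing $TERMCOUNT instances of $PATTERN"
    sed -i "/$PATTERN/d" "$INFILE"
done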

# List of patterns to delete from logfiles, space separated
BOTLIST="ahrefs Baiduspider bingbot Cliqzbot DomainCrawler DuckDuckGo Exabot Googlebot linkdexbot magpie-crawler MJ12bot msnbot opensiteexplorer pingdom rogerbot SemrushBot SeznamBot\/docs tt-rss Wotbox YandexBot YandexImages ysearch\/slurp "

echo "------- Removing Bots ---------"
for PATTERN in $BOTLIST; do
    # Count matching lines first, then delete them in place
    TERMCOUNT=$(grep -c "$PATTERN" "$INFILE")
    echo "Removing $TERMCOUNT instances of $PATTERN"
    sed -i "/$PATTERN/d" "$INFILE"
done

#filestats after

POSTLINES=$(wc -l < "$INFILE")
POSTSIZE=$(stat -c %s "$INFILE")
# Percentage of lines remaining, rounded to the nearest whole number
PERCENT=$(awk "BEGIN { pc=100*${POSTLINES}/${PRELINES}; i=int(pc); print (pc-i<0.5)?i:i+1 }")

echo "$INFILE is now $POSTSIZE bytes and contains $POSTLINES lines"
echo "Log reduced to $PERCENT percent of its original size."

And here is a sample output.

~/temp/log $ ./ 02Apr.log
02Apr.log is 2432560 bytes and contains 10238 lines

-------- Removing local traffic ---------
Removing 1054 instances of wp-cron.php
Removing 776 instances of
------- Removing Bots ---------
Removing 525 instances of Googlebot
Removing 226 instances of DomainCrawler
Removing 1061 instances of Baiduspider
Removing 377 instances of pingdom
Removing 1343 instances of
Removing 1087 instances of opensiteexplorer
Removing 212 instances of YandexBot
Removing 26 instances of YandexImages
Removing 163 instances of SemrushBot
Removing 17 instances of ysearch\/slurp
Removing 44 instances of msnbot
Removing 385 instances of bingbot
Removing 95 instances of Wotbox
Removing 51 instances of Cliqzbot
Removing 22 instances of Exabot
Removing 116 instances of SeznamBot
Removing 5 instances of magpie-crawler
Removing 57 instances of\/docs
Removing 6 instances of rogerbot

02Apr.log is now 566074 bytes and contains 2590 lines
Log reduced to 25 percent of its original size.

So you can see that around 75% of the traffic on here is crap. And now the log file is much easier to read.
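
To see which bots are still left after a run, it helps to list the most common user agents in the cleaned file. Here is a rough sketch of that. The function name top_agents is made up for this example, and it assumes the Apache "combined" log format, where the user agent is the last quoted field on each line. If your log format differs, the field number will need adjusting.

```shell
#!/bin/bash
# Hypothetical helper: list the most frequent user agents in a logfile.
# Assumes the Apache "combined" format, where splitting a line on double
# quotes puts the user agent in field 6.
top_agents() {
    local logfile="$1"
    local limit="${2:-20}"    # how many agents to show (default 20)
    awk -F'"' 'NF >= 6 { print $6 }' "$logfile" \
        | sort | uniq -c | sort -rn | head -n "$limit"
}
```

After loading the function into your shell, something like top_agents 02Apr.log 10 prints the ten busiest agents with their hit counts. Anything that looks like a crawler is a candidate for the BOTLIST.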


Turning off IPv6 in Ubuntu 16

My home router doesn’t handle IPv6, and for that matter, neither does my ISP, so I get a lot of IPv6-related garbage in my syslog and kern.log. To turn it off, you need to create a new file, rather than editing a system file, and then reload the kernel settings.

sudo nano /etc/sysctl.d/95-disable-ipv6.conf
#add the following lines
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
# reload kernel parameters
sudo service procps reload
# Check that it has taken effect ... the result of this should be 1
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
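
The check above only looks at the "all" entry. If you want to see the flag for every interface at once, a small loop does the job. This is just a sketch: the function name ipv6_status is made up, and the optional directory argument is only there so it can be pointed somewhere other than the real /proc path.

```shell
#!/bin/bash
# Hypothetical helper: print the disable_ipv6 flag for each interface.
# The base directory defaults to the real /proc location; the argument is
# an assumption of this sketch so the function can be tried elsewhere.
ipv6_status() {
    local base="${1:-/proc/sys/net/ipv6/conf}"
    local f iface
    for f in "$base"/*/disable_ipv6; do
        [ -r "$f" ] || continue          # skip if nothing matched
        iface=$(basename "$(dirname "$f")")
        echo "$iface $(cat "$f")"
    done
}
```

Each line shows an interface name followed by 1 (IPv6 disabled) or 0 (still enabled). If the reload step does not pick up the new file on your system, sudo sysctl --system is another way to load every file under /etc/sysctl.d/.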


Blocking irritating PLDT Billing popups

So you forgot to pay your bill. It happens. PLDT used to set up an automated phone call that rang you once a day until you paid. Now they have something much more irritating, and something that feels borderline illegal: they hijack your internet connection. Every four or five page loads, they inject an HTML frame with a monster ad, which sits on top of all your work. You have no choice but to click on the button, which takes you to a page with the amount you owe (which means they know which IP is your account, BTW).

However, there is no way to remove the popup, which sits on top of your browser. There is no ‘close window’ button, which means your only choice is to refresh the page, or close the window and re-open it. Too bad if you were working on something and hadn’t pressed the save button: that’s now gone. And even after you pay, the popup sticks around for a day or two … (how hard can it be to automatically cancel it?)

So how do we block it? The frame is an iframe which uses an IP address rather than a hostname, so we can’t use DNS blocking or hosts-file blocking. Changing the routing table seems to be the way to go.

On Linux, using sudo if necessary:

# check routing table
route -n
# Add rule
route add gw lo
# check routing table again
route -n
# check desired result
ip route get
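
If you prefer the newer ip tool over route, a blackhole route is another way to get a similar effect. Note this is a swapped-in technique, not the commands above. The sketch below only prints the commands rather than running them, so you can check them first. The function name blackhole_cmds is made up, and the address in the usage note is a documentation placeholder, not the real popup server.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) the iproute2 commands that would
# blackhole a single IPv4 address, list blackhole routes, and undo it.
blackhole_cmds() {
    local ip="$1"
    # Basic shape check only: four dot-separated groups of 1-3 digits.
    if [[ ! "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
        echo "not an IPv4 address: $ip" >&2
        return 1
    fi
    echo "ip route add blackhole $ip/32"     # drop traffic to this host
    echo "ip route show type blackhole"      # confirm the route exists
    echo "ip route del blackhole $ip/32"     # remove it once you have paid
}
```

For example, blackhole_cmds 192.0.2.10 prints the three commands for that placeholder address. Run the ones you want with sudo. Like the route rule, this does not survive a reboot.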


More Control Over Logwatch Report Dates

I’ve been happily running Logwatch on several servers with the default ‘yesterday’ date range for several years. However, I needed to run it for a client with a larger date range to check out a problem. But the options available for logwatch are only ‘today’, ‘yesterday’ and ‘all’. Or so it told me. And even worse, the ‘yesterday’ option takes the date from the previous day, and pulls out all the info on that date. So if you run your logwatch report at 4pm, you’re missing out on 16 hours’ worth of data! But it turns out logwatch is smarter than that …


How to change the time anacron runs.

Well this one took me a while to figure out, so I thought I’d blog about it in case I could save someone else some time. Anacron is installed on desktop / laptop orientated distributions as they’re often switched off. It basically makes sure the daily, weekly and monthly cron jobs are run by checking the time they were last run, and running them if they weren’t run in the last day, 7 days or 30 days respectively. It will also check these when the machine is rebooted. So it makes sense on machines that are rebooted frequently.

However I have a server where it is also installed, alongside cron, and I wanted to change the time logwatch ran every day. My first attempt was to simply change the times in /etc/crontab, which contained lines like this
25 00    * * *    root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
Changing the time was ineffective, although I gather that if I’d removed anacron (or specifically the file /usr/sbin/anacron ) then this approach would have worked.

My research led me next to the /var/spool/anacron directory, where there are three timestamp files: cron.daily, cron.weekly and cron.monthly. I experimented with changing the time on these to fool anacron into running earlier in the day. This was also ineffective.

Hmm. Finally I found the solution, which kinda makes sense, but is well hidden. In /etc/cron.d/ there is a file which runs anacron containing the line
30 23    * * *   root    test -x /etc/init.d/anacron && /usr/sbin/invoke-rc.d anacron start >/dev/null

Changing the time on this will make anacron run at your chosen time, and you’ll have your logwatch report before breakfast.
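
If you have to do this on more than one server, the edit can be scripted. Here is a rough sketch using sed. The function name set_anacron_time is made up, and the file is passed in as an argument, so point it at whichever file under /etc/cron.d/ holds the anacron line on your system. It keeps a .bak copy of the file next to the original.

```shell
#!/bin/bash
# Hypothetical helper: rewrite the minute and hour fields of the cron line
# that starts anacron. Usage: set_anacron_time FILE MINUTE HOUR
set_anacron_time() {
    local cronfile="$1" minute="$2" hour="$3"
    # Refuse anything that is not a plain number, to avoid mangling the file.
    if [[ ! "$minute" =~ ^[0-9]+$ || ! "$hour" =~ ^[0-9]+$ ]]; then
        echo "usage: set_anacron_time FILE MINUTE HOUR" >&2
        return 1
    fi
    # Only lines that begin with two numeric fields and mention anacron are
    # changed; comments and other entries are left alone.
    sed -E -i.bak \
        "/^[0-9]+[[:space:]]+[0-9]+[[:space:]].*anacron/ s/^[0-9]+[[:space:]]+[0-9]+/${minute} ${hour}/" \
        "$cronfile"
}
```

Run as root, set_anacron_time followed by the file, 30 and 6 would move the job to 06:30. It is worth looking at the file afterwards to confirm the line reads the way you expect.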