{"id":389,"date":"2017-04-04T16:09:34","date_gmt":"2017-04-04T08:09:34","guid":{"rendered":"https:\/\/play.datalude.com\/blog\/?p=389"},"modified":"2024-01-16T15:09:12","modified_gmt":"2024-01-16T07:09:12","slug":"bash-script-to-clean-bots-out-of-apache-logs","status":"publish","type":"post","link":"https:\/\/play.datalude.com\/blog\/2017\/04\/bash-script-to-clean-bots-out-of-apache-logs\/","title":{"rendered":"Bash script to clean Bots out of Apache Logs"},"content":{"rendered":"<p>If you've ever spent some time looking at webserver logs, you know how much crap there is in there from crawlers, bots, indexers, and all the bottom feeders of the internet. If you're looking for a specific problem with the webserver, this stuff can quickly become a nuisance, stopping you from finding the information you need. In addition, it's often a surprise exactly <strong><em>how much<\/em><\/strong> of the traffic your website serves up to these bots.<\/p>\n<p>The script below helps with both of these problems. It makes a working copy of a logfile (Apache, but it should also work on nginx), deletes the matching lines from the copy, reports how many lines each pattern removed, and prints before-and-after stats at the end. Copy the following into a file, e.g. log-squish.sh, and run it with the name of the logfile as an argument, e.g. .\/log-squish.sh logfile.log<br \/>\nYou'll definitely want to edit the LOCALTRAFFIC bit to fit your needs. You may also want to add bots to the BOTLIST. Run the script once on a sample logfile and then view the result to see what bots are left &#8230;<\/p>\n<pre>#!\/bin\/bash\r\n\r\n# Pass the input file as a commandline argument, or set it here\r\nINFILE=$1\r\nOUTFILE=.\/$1.squish\r\nTMPFILE=.\/squish.tmp\r\n\r\nif [ -z \"$INFILE\" ] ; then\r\n    echo \"Usage: $0 logfile\"\r\n    exit 1\r\nfi\r\n\r\nif [ -f \"$TMPFILE\" ] ; then\r\n    rm \"$TMPFILE\"\r\nfi\r\n\r\n# Check before we go ...\r\nread -p \"Will copy $INFILE to $OUTFILE and perform all operations on the file copy. 
Press ENTER to proceed ...\"\r\n\r\ncp \"$INFILE\" \"$OUTFILE\"\r\n\r\n# List of installation-specific patterns to delete from logfiles\r\n# (this example is for WP, and also excludes a local IP address). Edit to suit your environment.\r\nLOCALTRAFFIC=\" wp-cron.php 10.10.0.2 wp-login.php \\\/wp-admin\\\/ \"\r\necho\r\necho \"-------- Removing local traffic ---------\"\r\nfor TERM in $LOCALTRAFFIC; do\r\n    TERMCOUNT=$( grep -c \"$TERM\" \"$OUTFILE\" )\r\n    echo \"$TERMCOUNT instances of $TERM removed\" &gt;&gt; \"$TMPFILE\"\r\n    sed -i \"\/$TERM\/d\" \"$OUTFILE\"\r\ndone\r\nsort -nr \"$TMPFILE\"\r\nrm \"$TMPFILE\"\r\n\r\n# List of bot patterns to delete from logfiles, space separated\r\nBOTLIST=\"ahrefs Baiduspider bingbot Cliqzbot cs.daum.net DomainCrawler DuckDuckGo Exabot Googlebot linkdexbot magpie-crawler MJ12bot msnbot OpenLinkProfiler.org opensiteexplorer pingdom rogerbot SemrushBot SeznamBot sogou.com\\\/docs tt-rss Wotbox YandexBot YandexImages ysearch\\\/slurp BLEXBot Flamingo_SearchEngine okhttp scalaj-http UptimeRobot YisouSpider proximic.com\\\/info\\\/spider \"\r\necho\r\necho \"------- Removing Bots ---------\"\r\nfor TERM in $BOTLIST; do\r\n    TERMCOUNT=$( grep -c \"$TERM\" \"$OUTFILE\" )\r\n    echo \"$TERMCOUNT instances of $TERM removed\" &gt;&gt; \"$TMPFILE\"\r\n    sed -i \"\/$TERM\/d\" \"$OUTFILE\"\r\ndone\r\nsort -nr \"$TMPFILE\"\r\nrm \"$TMPFILE\"\r\n\r\necho\r\necho \"======Summary=======\"\r\n\r\n# File stats before\r\nPRELINES=$( wc -l &lt; \"$INFILE\" )\r\nPRESIZE=$( stat -c %s \"$INFILE\" )\r\n\r\n# File stats after\r\nPOSTLINES=$( wc -l &lt; \"$OUTFILE\" )\r\nPOSTSIZE=$( stat -c %s \"$OUTFILE\" )\r\nPERCENT=$( awk \"BEGIN { pc=100*${POSTLINES}\/${PRELINES}; i=int(pc); print (pc-i&lt;0.5)?i:i+1 }\" )\r\n\r\necho \"Original file $INFILE is $PRESIZE bytes and contains $PRELINES lines\"\r\necho \"Processed file $OUTFILE is $POSTSIZE bytes and contains $POSTLINES lines\"\r\necho \"Log reduced to $PERCENT percent of its original size.\"\r\necho \"Original file was untouched.\"<\/pre>\n<p>And here is a sample 
output.<\/p>\n<pre>~\/temp $ .\/log-squish.sh access.log.2017-09-03\r\nWill copy access.log.2017-09-03 to .\/access.log.2017-09-03.squish and perform all operations on the file copy. Press ENTER to proceed\r\n\r\n-------- Removing local traffic ---------\r\n5536 instances of wp-login.php removed\r\n507 instances of \\\/wp-admin\\\/ removed\r\n84 instances of wp-cron.php removed\r\n0 instances of 10.10.0.2 removed\r\n\r\n------- Removing Bots ---------\r\n2769 instances of bingbot removed\r\n2342 instances of Googlebot removed\r\n2177 instances of sogou.com\\\/docs removed\r\n1815 instances of MJ12bot removed\r\n1651 instances of ahrefs removed\r\n1016 instances of opensiteexplorer removed\r\n578 instances of Baiduspider removed\r\n447 instances of Flamingo_SearchEngine removed\r\n357 instances of okhttp removed\r\n295 instances of UptimeRobot removed\r\n122 instances of scalaj-http removed\r\n74 instances of YandexBot removed\r\n60 instances of ysearch\\\/slurp removed\r\n24 instances of YisouSpider removed\r\n22 instances of magpie-crawler removed\r\n9 instances of linkdexbot removed\r\n7 instances of YandexImages removed\r\n7 instances of SeznamBot removed\r\n5 instances of rogerbot removed\r\n2 instances of tt-rss removed\r\n1 instances of SemrushBot removed\r\n0 instances of Wotbox removed\r\n0 instances of proximic.com\\\/info\\\/spider removed\r\n0 instances of pingdom removed\r\n0 instances of OpenLinkProfiler.org removed\r\n0 instances of msnbot removed\r\n0 instances of Exabot removed\r\n0 instances of DuckDuckGo removed\r\n0 instances of DomainCrawler removed\r\n0 instances of cs.daum.net removed\r\n0 instances of Cliqzbot removed\r\n0 instances of BLEXBot removed\r\n\r\n======Summary=======\r\nOriginal file access.log.2017-09-03 is 19395785 bytes and contains 74872 lines\r\nProcessed file .\/access.log.2017-09-03.squish is 15432796 bytes and contains 54965 lines\r\nLog reduced to 73 percent of its original size.\r\nOriginal file was 
untouched.\r\n<\/pre>\n<p>So you can see that over a quarter of the traffic on here is crap. And now the log file is much easier to read.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you've ever spent some time looking at webserver logs, you know how much crap there is in there from crawlers, bots, indexers, and all the bottom feeders of the internet. If you're looking for a specific problem with the webserver, this stuff can quickly become a nuisance, stopping you from finding the information you &#8230; <a title=\"Bash script to clean Bots out of Apache Logs\" class=\"read-more\" href=\"https:\/\/play.datalude.com\/blog\/2017\/04\/bash-script-to-clean-bots-out-of-apache-logs\/\" aria-label=\"Read more about Bash script to clean Bots out of Apache Logs\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","footnotes":""},"categories":[1,4,5],"tags":[],"class_list":["post-389","post","type-post","status-publish","format-standard","hentry","category-it","category-linux","category-security"],"_links":{"self":[{"href":"https:\/\/play.datalude.com\/blog\/wp-json\/wp\/v2\/posts\/389","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/play.datalude.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/play.datalude.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/play.datalude.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/play.datalude.com\/blog\/wp-json\/wp\/v2\/comments?post=389"}],"version-history":[{"count":0,"href":"https:\/\/play.datalude.com\/blog\/wp-json\/wp\/v2\/posts\/389\/revisions"}],"wp:attachment":[{"href":"https:\/\/play.datalude.com\/blog\/wp-json\/wp\/v2\/media?parent=389"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/play.datalude.com\/blog\/wp-json\/wp\/v
2\/categories?post=389"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/play.datalude.com\/blog\/wp-json\/wp\/v2\/tags?post=389"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}