I’ve had a lot of hits (into the thousands now) over the last month or so from one particular browser summary: “TalkTalk Virus Alerts Scanning Engine”. I blocked it pretty much as soon as I first saw it browsing my sites. For the first two weeks I couldn’t find anything on the net about it, until laeknishendr posted an interesting article: http://laeknishendr.wordpress.com/2011/05/15/talktalk-virus-alerts-scanning-engine/. This browser summary appears to occur every time someone from the TalkTalk group of companies browses my site. An interesting thing about it is that it doesn’t appear to be limited to a particular IP range, as seen in the image below:
It’s one thing that TalkTalk are trying to make sure their customers are not browsing sites which contain viruses, but according to the linked post, TalkTalk send their bot to the site after the customer has clicked and seen the page. That means there’s a separate page request, which surely means TalkTalk cannot stop the user seeing the website before it renders in their browser the first time. So what is the bot doing?!
Why does the bot operate across such a large number of IP addresses?
In my opinion this is a massive invasion of privacy! It makes me glad I switched from TalkTalk two months ago.
Anyway, the browser summary is in vsf-simple-block which is doing a fine job of protecting my users’ privacy 🙂
If, like me, you check your site logs most days, you’ve probably noticed one or two site scrapes. Fortunately most are from the same user (or bot) and use the same IP, host and browser summary. They grab 20 or so pages in a minute and then stop and come back another day to finish the site off. On a bad day, they might rip the entire site in one pass… With my plugin, targeting those individuals is easier than it used to be: add the browser, host or IP to the block rules in vsf-simple-block and no more annoying user, e.g. host = ‘amazonaws.com’.
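To make the idea of a block rule concrete, here is a minimal sketch in Python of matching a hit against rules on browser, host or IP. It is not how vsf-simple-block is actually written; the `Hit` class, the `BLOCK_RULES` list and the `is_blocked` function are names made up for illustration, and the case-insensitive substring match is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    ip: str
    host: str
    browser: str  # the "browser summary" (user-agent string)


# Hypothetical rules: (field name, substring to look for in that field).
BLOCK_RULES = [
    ("host", "amazonaws.com"),
    ("browser", "TalkTalk Virus Alerts Scanning Engine"),
]


def is_blocked(hit, rules=BLOCK_RULES):
    """Return True if any rule's substring occurs in the named field of the hit."""
    for field, needle in rules:
        # Case-insensitive substring match (an assumption, not the plugin's behaviour).
        if needle.lower() in getattr(hit, field).lower():
            return True
    return False
```

With these example rules, a hit whose host ends in amazonaws.com is blocked whatever its IP or browser summary, while a hit that matches no rule is let through.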
But the other week, I had another visit by an above-average site scraping “team.” I’ve only had about 3 of these types of site scrapes before, so I don’t automatically twig and tend to give the benefit of the doubt.
I happened to be looking at my log when I saw the first 4 hits shown in the image below. With so few hits, I assumed it was just coincidence that they all had the same browser summary and kept an eye on it. About ten minutes later though, the hits were still coming in. A different URL, a random IP and host, but the same browser summary. Blocking on browser summary isn’t my favourite choice, as it can potentially exclude a lot of users who are completely innocent, but I added the rule to the block plugin and prevented another hour’s worth of hits. When blocked, the number of hits increased three-fold.
I’m still at the thinking stage on how to code a solution to prevent this type of site scrape. As you can see from the image, between the first hit and the last before being blocked, there is a gap of nineteen minutes. My site doesn’t have a high volume of hits, so all the hits are clumped together. It is relatively easy to detect patterns in clumped hits, but on a busier site there would statistically be a hit by a genuine user in the middle, which would make the pattern harder to detect.
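One possible direction, sketched in Python purely as a thinking aid: group hits by browser summary and flag any summary that arrives from several different IPs within a short window. Because the grouping is per browser summary, a genuine user landing in the middle of the scrape with a different summary would not break the pattern. The function name, the 20-minute window and the threshold of four distinct IPs are all assumptions, not tested values.

```python
from collections import defaultdict, deque


def find_distributed_scrapers(hits, window_seconds=1200, min_distinct_ips=4):
    """Flag browser summaries seen from many different IPs in a short time.

    hits: iterable of (timestamp_in_seconds, ip, browser), sorted by timestamp.
    Returns the set of browser summaries that were seen from at least
    min_distinct_ips different IPs within any window of window_seconds.
    """
    recent = defaultdict(deque)  # browser summary -> recent (timestamp, ip) pairs
    flagged = set()
    for ts, ip, browser in hits:
        window = recent[browser]
        window.append((ts, ip))
        # Drop hits that are older than the window, relative to the current hit.
        while window and ts - window[0][0] > window_seconds:
            window.popleft()
        # Count the distinct IPs still inside the window for this summary.
        if len({seen_ip for _, seen_ip in window}) >= min_distinct_ips:
            flagged.add(browser)
    return flagged
```

The obvious weakness is the one already mentioned about blocking on browser summary: a popular, genuine browser will also show up from many IPs, so on a busier site the threshold would need to be much higher or combined with another signal.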