These past couple of weeks I've been bombarded by a very pesky spider that seemed to request every single entry in my blog. The weird part was that the bot seemed to disregard 'robots.txt'; that is, it never bothered to read the file before running roughshod through my site. Because the IP addresses all came from the same block (they shared the same leading octets), I realized that I had a chance to defend myself. The user-agent string claimed some Linux/Mozilla combination. Before banning the IPs from my site, I figured that some investigative work was in order.
The 208.111.xxx block seemed to belong to Limelight Networks LLC. Hmm, I thought it odd that Limelight would be interested in crawling my site. Doug Kaye and all of the IT Conversations folks didn't need any of my content, as they have one of the largest CDNs in the nation. So, after performing the ubiquitous GOOG search, I discovered that others in the blogosphere were having similar issues with these bots.
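For anyone who wants to check whether a pile of client addresses really sits inside one block, here is a minimal sketch using Python's standard ipaddress module. Treating "the 208.111.xxx block" as 208.111.0.0/16 is my assumption, and the function names are made up for illustration; adjust the prefix to whatever a whois lookup actually reports.

```python
import ipaddress

# Assumption: "the 208.111.xxx block" is read as the /16 network below.
SUSPECT_BLOCK = ipaddress.ip_network("208.111.0.0/16")


def in_suspect_block(addr, block=SUSPECT_BLOCK):
    """Return True if addr (a dotted-quad string) lies inside block."""
    try:
        return ipaddress.ip_address(addr) in block
    except ValueError:
        # Not a parseable IP address at all, so it cannot be in the block.
        return False


def filter_block(addrs, block=SUSPECT_BLOCK):
    """Keep only the addresses that fall inside block, preserving order."""
    return [a for a in addrs if in_suspect_block(a, block)]
```

Feeding filter_block the client column pulled from an access log gives a quick list of the requests that came from the suspect range.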
Grepping through my access and error logs shows that I have successfully blocked the IPs.
[Fri Feb 8 06:20:12 2008] [error] [client 220.127.116.11] client denied by server configuration: /home/bkaeg/public_html/blog/archives/000632.html
[Fri Feb 8 04:01:42 2008] [error] [client 18.104.22.168] client denied by server configuration: /home/bkaeg/public_html/blog/archives/000522.html
[Fri Feb 8 04:01:43 2008] [error] [client 22.214.171.124] client denied by server configuration: /home/bkaeg/public_html/favicon.ico
[Fri Feb 8 04:11:20 2008] [error] [client 126.96.36.199] client denied by server configuration: /home/bkaeg/public_html/blog/archives/000053.html
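Grep does the job, but a short script makes it easier to tally the denials. The sketch below assumes the error log entries look exactly like the ones above; the regular expression and function names are my own, not anything Apache provides.

```python
import re
from collections import Counter

# Matches entries shaped like the error-log lines shown above.
# The path stops at whitespace or "[" so entries that run together
# on one line are still split correctly.
DENIED = re.compile(
    r"\[(?P<when>[^\]]+)\] \[error\] \[client (?P<ip>[^\]]+)\] "
    r"client denied by server configuration: (?P<path>[^\s\[]+)"
)


def denied_requests(log_text):
    """Yield (timestamp, client, path) for every denied request in log_text."""
    for match in DENIED.finditer(log_text):
        yield match.group("when"), match.group("ip"), match.group("path")


def denials_per_client(log_text):
    """Count how many requests were denied for each client address."""
    return Counter(ip for _, ip, _ in denied_requests(log_text))
```

Reading the error log into a string and passing it to denials_per_client gives a per-address count, which is a quick way to see which clients keep hammering away after the ban.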
Being of an inquisitive nature, I decided to take it a step further. A simple traceroute and a stealth nmap scan revealed that it was not Limelight Networks at all. It was actually Kavam, a company out of Tempe, AZ. When I say these bots were aggressive, I am not exaggerating. It was not uncommon to see 3-4 hits from the same bot within an hour.
I discovered that the CEO of Kavam was Randy Adams. So I called Mr. Adams and chatted with him. He confirmed my suspicions, as well as the story I had read from other bloggers.
Apparently, Kavam is working with the same venture capitalists that funded Limelight Networks LLC. Kavam is taking snapshots of every website on the Internet. God bless 'em. They wish to challenge the GOOG. The brute-force method of getting content is admirable but quite annoying. I suppose it would not have been so bad if they had simply told people of their intent. Perhaps if it had gone on for just a couple of days... This barrage went on for two weeks.
I have been blogging for roughly five years, so there is quite a bit of content to be archived. I wonder why Fooky never took this gestapo approach to collecting metadata? Anyway, I did ask Mr. Adams to be a guest on AG Speaks; hopefully he will chat with me before the company launches its killer application.