For the past couple of weeks I've been bombarded by a very pesky spider that seemed to request every single entry in my blog. The weird part was that the bot seemed to disregard 'robots.txt'; that is, it never bothered to read the file before running roughshod through my site. Because the IP addresses all came from the same address block, I realized that I had a chance to defend myself. The user-agent string claimed some Linux/Mozilla combination. Before I decided to ban the IPs from my site, I figured that some investigative work would be necessary.
The 208.111.xxx block seemed to be connected to Limelight Networks LLC. Hmm, I thought it odd that Limelight would be interested in crawling my site. Doug Kaye and all of the IT Conversations folks didn't need any of my content, as they have one of the largest CDNs in the nation. So, after performing the ubiquitous GOOG search, I discovered that others in the blogosphere were having similar issues with these bots.
Grepping through my access and error logs confirms that I have successfully blocked the IPs.
[Fri Feb 8 06:20:12 2008] [error] [client 208.111.154.15] client denied by server configuration: /home/bkaeg/public_html/blog/archives/000632.html
[Fri Feb 8 04:01:42 2008] [error] [client 208.111.154.198] client denied by server configuration: /home/bkaeg/public_html/blog/archives/000522.html
[Fri Feb 8 04:01:43 2008] [error] [client 208.111.154.198] client denied by server configuration: /home/bkaeg/public_html/favicon.ico
[Fri Feb 8 04:11:20 2008] [error] [client 208.111.154.198] client denied by server configuration: /home/bkaeg/public_html/blog/archives/000053.html
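For anyone who would rather tally these denials than eyeball grep output, here is a minimal Python sketch. It is not what I actually ran; the regular expression simply assumes the error-log layout shown in the entries above, and the names `DENIED`, `count_denials`, and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the error-log entries quoted in this post:
# [timestamp] [error] [client IP] client denied by server configuration: PATH
DENIED = re.compile(
    r"\[(?P<when>[^\]]+)\] \[error\] \[client (?P<ip>[\d.]+)\] "
    r"client denied by server configuration: (?P<path>[^\s\[]+)"
)


def count_denials(lines):
    """Return a Counter mapping client IP -> number of denied requests."""
    hits = Counter()
    for line in lines:
        # finditer also copes with two entries that ended up on one line.
        for match in DENIED.finditer(line):
            hits[match.group("ip")] += 1
    return hits


# The four entries quoted above, used here as sample input.
SAMPLE = [
    "[Fri Feb 8 06:20:12 2008] [error] [client 208.111.154.15] client denied by server configuration: /home/bkaeg/public_html/blog/archives/000632.html",
    "[Fri Feb 8 04:01:42 2008] [error] [client 208.111.154.198] client denied by server configuration: /home/bkaeg/public_html/blog/archives/000522.html",
    "[Fri Feb 8 04:01:43 2008] [error] [client 208.111.154.198] client denied by server configuration: /home/bkaeg/public_html/favicon.ico",
    "[Fri Feb 8 04:11:20 2008] [error] [client 208.111.154.198] client denied by server configuration: /home/bkaeg/public_html/blog/archives/000053.html",
]

if __name__ == "__main__":
    for ip, count in count_denials(SAMPLE).most_common():
        print(ip, count)
```

On the four entries quoted above, this reports three denials for 208.111.154.198 and one for 208.111.154.15. To run it against a real log, pass an open file object to `count_denials` instead of `SAMPLE`.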
Because I am inquisitive by nature, I decided to take it a step further. A simple traceroute and stealth nmap revealed that it was not Limelight Networks at all. It was actually Kavam, a company out of Tempe, AZ. When I say these bots were aggressive, I am not exaggerating. It was not uncommon to see 3-4 hits from the same bot within an hour.
I discovered that the CEO of Kavam was Randy Adams. So I called Mr. Adams and chatted with him. He confirmed my suspicions, as well as the story that I had read from other bloggers.
Apparently, Kavam is working with the same venture capitalists that funded Limelight Networks LLC. Kavam is taking snapshots of every website on the Internet. God bless 'em. They wish to challenge the GOOG. The brute force method of getting content is admirable but quite annoying. I suppose it would not have been so bad if they had simply told people of their intent. Perhaps if it had gone on for just a couple of days... This barrage went on for two weeks.
I have been blogging for roughly five years, so there is quite a bit of content to be archived. I wonder why Fooky never took this gestapo approach to collecting metadata? Anyway, I did ask Mr. Adams to be a guest on AG Speaks. Hopefully he will chat with me before the company launches its killer application.

There is no need for a gestapo approach to collecting metadata. We at Fooky.com have no need to bombard servers over and over again.
What I will say is that it could be a marketing ploy. Think about it - if I had Fooky.com ScorpionBots bombard millions of web sites each day, then thousands of techies who run web sites would investigate and discover who we are. I've decided against that kind of marketing as I likened it to spam-style tactics.
Please note that Fooky, Inc is only 'web crawling' for a limited engagement, as we will move away from this practice altogether. I can discuss this in more detail later if you like.
Sure, I'd be happy to chat with you in more detail.
Perfect topic for another edition of AG Speaks. Probably a conversation that we should have had a while ago.