Are Bitacle blog thieves too?
Tuesday, 14 March 2006
I have previously talked about websites which offer stolen content under the guise of some sort of “service” and now we’ve been forced to ban Bitacle from accessing our server.
Over the last few weeks, their crawler has persistently been hammering away at our website, only to be consistently bounced by our anti-spam software for having an invalid HTTP request header. While we could modify our anti-spam rules to permit their crawler to access our website regardless, a quick investigation revealed a couple of reasons why that would have been unwise:
- They don’t obey the robots.txt standard. That’s usually reason enough to get them banned.
- They publish entire articles on their website, instead of article summaries, like proper blog search engines such as Technorati, Feedster and Blogdigger do.
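For context, honouring robots.txt takes very little effort on the crawler's side. Here is a minimal sketch in Python using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the may_fetch helper are assumptions made for illustration, not the rules this site actually serves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: Bitacle
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download them first.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("Bitacle bot/1", "http://example.com/articles/1"))
    print(may_fetch("Googlebot/2.1", "http://example.com/articles/1"))
```

A polite crawler makes this check before every request and skips any URL for which it returns False.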
While the Bitacle website is still in “beta” mode and doesn’t seem to be polluted with any form of advertising, don’t let their “add site” link on the front page fool you into thinking they operate a legitimate business. They don’t publish their email address, nor do they offer a contact form on their website. I have sent four emails to “postmaster@bitacle.org” in the last ten days. My emails didn’t bounce, which suggests they were delivered.
But I’m still waiting for a reply. I guess they must be very busy fixing their crawler.
Alright. I faced the same problem and am sending an email to these guys. Let's see if they happen to reply.
btw. I suggest that we report this behavior to the Search Engines.
Spamlog reporting sites like Splog Reporter and SplogSpot should be interested in this too.
Hi, my name is David Martin and i'm working in Bitacle.
I have just read the article and your comments about our service.
1 - There aren't a norm that forces us to obey the robots.txt. We index and archive the whole content of any XML. We never steal because always we link the original source and the original feed that you provide. When they have seen a spammer that links the original source? I haven't seen anybody.
2 - "They publish entire articles on their website, instead of article summaries, like proper blog search engines such as Technorati, Feedster and Blogdigger do."
The reason it's that we don't be only a blog search engine we are a "archive blog search engine" that it's different concept.
One question: why you don't ban Google, Yahoo or MSN? That search engines cache all your pages.
In the help page (It's possible that before it was more hidden) you can find a contact email (bitacle (@) gmail (dot) com) but also you can write me to seo.bitacle (@) gmail (dot) com and I reply you without any problem.
And yes, we are very busy fixing our crawler, we work hard on it :)
Regards and thanks for your comments.
David Martín.
Thanks for the reply, David.
Yes, you're right, there's no "norm" or rule which stipulates that crawlers must obey robots.txt. It's just that if you don't, then we're not interested in providing you with any content. And I don't think we're the only ones either. The fact is you haven't given a good reason for ignoring robots.txt, and we expect every crawler to obey the robots.txt standard.
As far as you claiming that "we never steal because always we link the original source and the original feed that you provide": while your organisation does link to the original source, the way you provide the content is dubious. If we take one sample from your website, the article HOWTO made a handbag out of ties shows me that:
You publish the full article and do not clearly indicate that the original content came from Boing Boing.
The only indication anywhere on that page is in the form of the article header pointing to the original. I think that's bordering on the unethical, irrespective of how different "the concept" is behind you being an "archive blog search engine".
Why don't I ban Google, Yahoo or MSN? Because they obey robots.txt and that's what I use to control which crawlers have the right to see which pages. You see, on my website, I'm in charge and robots.txt enables me to keep things that way. I simply do not trust any crawler which doesn't bother obeying the rules I've put in there.
In any case, you were originally being blocked by our anti-spam software for submitting invalid HTTP headers in your request. We specifically didn't like your "Range: bytes=0-511999" data because it matches a pattern synonymous with a lot of the automated crapware which tries to pollute our content.
P.S. The contact email addresses you provided are an indication of a "dodgy" or "backyard" business. Why use gmail.com instead of bitacle.org? Doesn't that strike you as something which lowers the authenticity factor of your organisation?
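To illustrate the kind of header check mentioned above, here is a hypothetical sketch in Python. It is not the actual rule our anti-spam software applies. The pattern, the looks_like_scraper_range name and the decision to flag any Range header that asks for a fixed window starting at byte 0 are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a Range header requesting a window that starts at byte 0,
# such as "bytes=0-511999". The real anti-spam rule is not reproduced here.
RANGE_FROM_ZERO = re.compile(r"^bytes=0-\d+$")


def looks_like_scraper_range(headers: dict[str, str]) -> bool:
    """Return True if the request carries a Range header matching the pattern."""
    value = headers.get("Range", "").strip()
    return RANGE_FROM_ZERO.match(value) is not None


if __name__ == "__main__":
    print(looks_like_scraper_range({"Range": "bytes=0-511999"}))
    print(looks_like_scraper_range({"User-Agent": "Mozilla/5.0"}))
```

A production filter would normally weigh this signal together with others rather than reject a request on the Range header alone.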
Sent emails to David to have my site removed and he never responded. Yes, thieves.
Thanks, Ivan, your last comment said it better than I ever could have. Bitacle is very sneaky and needs to be taken down. They steal full content and then post ads around it. Sorry, if I wanted someone else to be making ad money off my content, I wouldn't be posting MY content on MY site. I am especially upset with having my content not only republished, but set under a Creative Commons license when my site never says it's CC, it is All Rights Reserved. Bitacle also violates Flickr's ToS by taking my Flickr photos and posting them without linking back to the Flickr page (as was originally done in my blog post, so I don't understand this).
I have emailed them to remove my content (after all, I don't want my content on my pregnancy and my kid and my life, etc. on someone else's site where I cannot edit it) but I doubt I will receive an answer. Any new content they attempt to grab from my site will give them only messages saying that they steal content from other sites. I have also contacted Google about the violation of their use of AdSense. I have a few more steps in mind as well...
Yesterday I wrote an email to bitacle@gmail.com and seo.bitacle@gmail.com about their violation of my copyrights and Creative Commons license. I told them to end their practices. The answer I received today from bitacle@gmail.com was as follows:
...............
Bitacle es un lector de rss, atom via web.
Como puede ser para correo electrónico via web hotmail, gmail, etc.
Existen el mercado multitud de lectores ya sea por web o a través del sistema operativo que usted utilice.
La licencia solamente se aplica a nuestro sitio para nuestra programación, pero el contenido de la información y la licencia que usted aplique es suya.
Usted debe de controlar los contenidos que suministra a un lector de blogs y el tráfico que recibe de ellos, nadie le obliga a sindicar contenidos.
Atentamente bitacle.
...............
Well, I don't understand the Spanish language that well, but this message appears to contain the same nonsense I heard before.
I forgot to mention: http://stopbitacleorg.wordpress.com/
"Bitacle is a reader of rss, atom via Web. As it can be for electronic mail via hotmail Web, gmail, etc. They exist the market multitude of readers or by Web or through operating system that you use. The license is only applied to our site for our programming, but the content of the information and the license that you apply you are hers. You must control the contents that she provides to a reader of blogs and the traffic that receives from them, nobody forces to him to syndicate contents. Kindly bitacle."
That's what they say (translated with Google). I understand it, but they should not scrape whole blogs :( just the title or the first 50 words.
I allowed bitacle.org comments, thinking they were a social/collaborative bookmarking service. Now I see they're just scraping and cleaning, and definitely violating the cc license agreement.
Do we think their IP address should be banned in .htaccess? I'm seeing an IP address of 212.22.59.251.
Bill,
I'm not sure if that's the current IP address but in the past they've come from 81.172.108.9, 81.172.108.83 and 81.172.109.145, all with a user agent of "Bitacle bot/1", so that's what we now block on using this .htaccess rule:
RewriteCond %{HTTP_USER_AGENT} ^Bitacle\ bot [NC]
RewriteRule ^.* - [F,L]

That's been working nicely for us since March, 2006.
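For anyone not running Apache, the logic of the rule above can be sketched in a few lines of Python: any user agent beginning with "Bitacle bot", in any letter case, gets a 403 Forbidden. The status_for helper is a hypothetical name used for illustration and is not part of our setup.

```python
import re

# Mirrors the RewriteCond pattern: anchored at the start, case-insensitive ([NC]).
BLOCKED_AGENT = re.compile(r"^Bitacle bot", re.IGNORECASE)


def status_for(user_agent: str) -> int:
    """Return 403 for blocked user agents (the [F] flag), otherwise 200."""
    return 403 if BLOCKED_AGENT.search(user_agent) else 200


if __name__ == "__main__":
    print(status_for("Bitacle bot/1"))
    print(status_for("Mozilla/5.0"))
```

Matching on the user agent rather than on IP addresses keeps the block working when the crawler moves to a new address, although it stops working as soon as the crawler changes its user agent string.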
Their picture-mirror-thingy comes from 213.201.119.162 - 213.201.119.164 (at least). I've built a small .htaccess/php script combination to stop bitacle from stealing pictures.
The page itself is in German, but you'll get the relevant parts...