Scrapers suck.

About half of my posts on The Next Wave site seem to get trackback comments from a site called “University Update.” The only thing is, there are no humans involved with “University Update”; it’s just a bot stealing my content.

I mark each trackback as spam, but I can’t get my content off their site. Why do they do this? To aggregate content for search engine optimization, and then to hope to score some cash from Google AdSense click-throughs.

Once again: if Google weren’t so powerful, and if so many lame sites didn’t pay out huge cash to get hits by buying AdWords, we wouldn’t have this problem.

The bots, known as scrapers, are discussed at length in the following CNET article, which quotes Lorelle VanFossen, who writes extensively about WordPress.

Please don’t steal this Web content | CNET
automated digital plagiarism in which software bots can copy thousands of blog posts per hour and publish them verbatim onto Web sites on which contextual ads next to them can generate money for the site owner.

Such Web sites are known among Web publishers as “scraper sites” because they effectively scrape the content off blogs, usually through RSS (Really Simple Syndication) and other feeds on which those blogs are sent.

One of the questions that comes up often in the seminar is about what constitutes “Fair Use” and how much to use via the PressIt function of WordPress. My answer isn’t great, but I believe it works: always cite the source, and don’t put it on your site unless you contribute something to the meaning or understanding, making it more useful than it was in its original version.

This is where scrapers fail: they just copy and steal. Google should easily be able to see the original publisher, identify sites that are entirely made up of stolen content, and vote them off the island. The problem is, that’s how Google makes a lot of money, and according to their ethos, getting filthy rich apparently doesn’t count as evil under their “don’t be evil” mantra.

