Not many of us are aware of what goes on behind the scenes of a search engine and how it can affect us. We have all heard about penalties for duplicate content, especially from Google, but how does this work?
I read this article recently and thought you would benefit from it. It was written by Sophie Baxter and explains how the clever guys can overpower the small guys. It is something to watch out for. It goes like this.
“There is a current and active way to knock a website out of Google’s search engine results. It’s simple and effective. This information is already in the public domain, and the more people who know about it, the more likely it is that Google will do something about it. This article will tell you how it works, how a website gets knocked out of the search engine rankings and, most importantly, how to defend your own website from having it happen to you.
To understand this exploit, you must first understand Google’s Duplicate Content filter. It’s simply described thus: Google doesn’t want you to search for “blue widget” and have the top 10 results all be copies of the same article on how great blue widgets are. They want to give you ONE copy of the Great Blue Widget article and 9 other different results, just on the off chance that you’ve already read that article and the other results are actually what you wanted.
To handle this, every time Google spiders and indexes a page, it checks whether it has already got a page that is predominantly the same: a duplicate page, if you will. Nobody outside Google knows exactly how it works this out, but it is going to be a combination of some or all of: page text length, page title, headings, keyword densities, checks for exactly copied sentence fragments, etc. As a result of this duplicate content filter, a whole industry has grown up around trying to get round the filter. Just search for “spin article”.
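To make the idea of a “predominantly the same” check concrete, here is a minimal sketch in Python. It is not Google’s method, which is not public. It assumes one common textbook approach: break each page into overlapping three-word fragments (“shingles”) and compare the two sets. The function names and the 0.8 threshold are hypothetical choices for illustration only.

```python
import re


def shingles(text, k=3):
    """Return the set of overlapping k-word fragments found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short texts: treat the whole text as a single fragment.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(text_a, text_b, k=3):
    """Share of fragments the two texts have in common (0.0 to 1.0)."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        return 1.0  # two empty texts are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


def looks_duplicate(text_a, text_b, threshold=0.8):
    """Hypothetical rule: flag pages whose similarity reaches the threshold."""
    return jaccard_similarity(text_a, text_b) >= threshold
```

A “spun” article tries to change enough words that a comparison like this falls below whatever threshold the search engine really uses.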
Getting back to the story: say Google indexes a page and it fails its duplicate content check. What does Google do? These days, it dumps that duplicate page in Google’s Supplemental Index. What, you didn’t know that Google has 2 indexes? Well, they do: the main one and a supplemental one. Two things are important here: Google will always return results from its main index if it can, and it will only go to the supplemental index if it doesn’t get enough joy from the main index. What this means is that if your page is in the supplemental index, it’s almost certain that you will never show up in the Search Engine Ranking Pages, unless there is next to no competition for the phrase that was searched for.
This all seems pretty reasonable to me, so what’s the problem? Well, there’s another little step I haven’t mentioned yet. What happens if someone copies your page, let’s say the homepage of your business website, and when Google indexes that copy, it correctly determines that it’s a duplicate? Now Google knows about 2 pages that are duplicates, and it has to decide which to dump into the supplemental index and which to keep in the main one. That’s pretty obvious, right? But how does Google know which is the original and which is the copy? They don’t. Sure, they have some clever algorithms to work it out, but even if they are 99% accurate, that leaves a lot of problems for the 1% of times they get it wrong!
And this is the heart of the exploit: if someone copies your website’s homepage, say, and manages to convince Google that *their* page is the original, then your homepage will get tossed into the supplemental index, never to see the light of day in the Search Engine Ranking Pages again. In case I’m not being clear enough, that’s bad! But wait, it gets worse:
It’s fair to say that in the case of a person physically copying your page and hosting it, you can often get them to take it down through the use of copyright lawyers and cease and desist letters to ISPs and the like, followed by a quick “Re-inclusion Request” to Google. But recently there’s a new threat that’s a whole lot harder to stop: the use of publicly accessible proxy websites. (If you don’t know what a proxy is, it’s basically a way of making the web run faster by caching content closer to the user. In principle, proxies are generally a good thing.)
There are many such web proxies out there, and I won’t list any here. However, I will describe the process: they send out spiders (much like Google’s) that spider your page and take your content. They then host a copy of your website on their proxy site, nominally so that when their users request your page, they can serve up their local copy quickly rather than having to retrieve it from your server. The big issue is that Google can sometimes decide that the proxy copy of your web page is the original and yours is not.
Worse again, there’s some evidence that people are deliberately and maliciously using proxy servers to cache copies of web pages, then using normal (white and black hat) Search Engine Optimization (SEO) techniques to make those proxy pages rank in the search engine, increasing the likelihood that your legitimate page will be the one dumped by the search engines’ duplicate content filters. Danger, Will Robinson!
Even worse still, some of the proxy spiders actively spoof their origins so that you don’t realise that it’s a spider from a proxy. They pretend to be Googlebot, for example, or a spider from Yahoo. This is why the major search engines actively publish guidelines on how to identify and validate their own spiders.
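As an illustration of that validation, Google’s published guidance is to do a reverse DNS lookup on the visiting IP address, check that the hostname belongs to googlebot.com or google.com, then do a forward lookup on that hostname and confirm it maps back to the same IP. The Python sketch below follows those steps. The function name and the injectable lookup parameters are hypothetical, added so the logic can be tried without network access, and `socket.gethostbyname_ex` only handles IPv4.

```python
import socket

# Hostname suffixes Google documents for its crawlers.
# Other search engines publish their own; adjust as needed.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_spider(ip, suffixes=GOOGLE_SUFFIXES,
                       reverse_lookup=socket.gethostbyaddr,
                       forward_lookup=socket.gethostbyname_ex):
    """Return True only if ip passes the reverse-then-forward DNS check."""
    try:
        # Step 1: reverse DNS, IP address -> hostname.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    # Step 2: the hostname must end in one of the crawler's domains.
    if not hostname.lower().rstrip(".").endswith(suffixes):
        return False
    try:
        # Step 3: forward DNS, hostname -> list of IP addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # Step 4: the original IP must be among them, or the name was spoofed.
    return ip in addresses
```

A spider that merely sends “Googlebot” in its User-Agent header fails this check, because the proxy operator does not control the DNS records for Google’s domains.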
Now for the big question: how can you defend against this? There are several possible solutions, depending on your web hosting technology and technical competence:
Option 1
- If you are running Apache and PHP on your server, you can set the webhost up to check for search engine spiders that purport to be from the main search engines, and, using PHP and the .htaccess file, you can block proxies from other sources. However, this only works for proxies that are playing by the rules and identifying themselves correctly.
Option 2
- If you are using MS Windows and IIS on your server, or if you are on a shared hosting solution that doesn’t give you the ability to do anything clever, it’s an awful lot harder, and you should take the advice of a professional on how to defend yourself from this kind of attack.
Option 3
- This is currently the best solution available, and it applies if you are running a PHP or ASP based website: you set the robots meta tag on ALL pages to noindex and nofollow, then you implement a PHP or ASP script on each page that checks for valid spiders from the major search engines and, if the visitor is one, resets the robots meta tag to index and follow. The important distinction here is that it’s easier to validate a real spider, and to discount a spider that’s trying to spoof you, because the major search engines publish processes and procedures to do this, including IP lookups and the like.
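The tag-switching logic in Option 3 is small enough to sketch. The article talks about PHP or ASP; Python is used here purely as a stand-in, and the function name is hypothetical. The sketch assumes the spider check has already been done elsewhere and arrives as a simple true/false value.

```python
def robots_meta_tag(spider_is_verified):
    """Build the robots meta tag for the current visitor.

    Defaults to noindex/nofollow, so a proxy that copies the page carries
    an instruction telling search engines not to index the copy. Only a
    visitor already validated as a real search engine spider is served
    index/follow.
    """
    content = "index, follow" if spider_is_verified else "noindex, nofollow"
    return f'<meta name="robots" content="{content}">'
```

Each page would call this while building its head section, so the restrictive tag is what everyone sees unless the check explicitly passes.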
So, stay aware, stay knowledgeable, and stay protected. And if you see that you’ve suddenly been dumped from the Search Engine Ranking Pages, now you might know why, how, and what to do about it.
Sophie White is an Internet Marketing and Website Promotion Consultant at Intrinsic Marketing, an SEO and Pay-Per-Click firm dedicated to supplying Better Website ROI.”
So, be warned. These clever types have us at their mercy. I hope you found it interesting. But things are constantly changing on the internet, so do you really want to waste time worrying about things like optimization and keywords? Why not get ready-prepared sites and spend that time on marketing? Just take a look at these Exclusive Sites.
If you write articles, or use them as a marketing tool, have you thought of automating the task? Can you be sure you won’t be penalized for duplicate content? Make sure you get the very best software available. Get Content Infinity here: http://www.dersalsites.com/contentinf/
Derek Robson is a South African Internet marketer. He has a vision of empowering all fellow South Africans and other non-U.S. folk to have equal opportunity and success on the internet by finding solutions to the many obstacles facing them. He is a syndicated article writer. He and his wife Sally have started a string of sites, resources, courses and articles as part of Dersalsites. For daily postings and articles on Internet marketing, South African online business, list building, affiliate marketing, article marketing, blogging, SEO, the law of attraction, rugby and other general topics, visit Derek at: http://dersalsites.com/southafricanbusiness/











