You know those annoying people who tell the same tired stories at every opportunity? It gets old pretty fast, doesn’t it?

Search engines feel much the same way about websites that contain duplicate content.

There are many different reasons why a website might contain multiple copies of the same content:

On many websites it is possible to reach the same page with different URLs. For example, all four of these addresses may display the same HTML file:

  • http://www.example.com/
  • http://example.com/
  • http://www.example.com/index.html
  • http://example.com/index.html

Many content management systems (CMS) and blog software packages publish pages in such a way that the same content can be reached in multiple ways. For example, a new blog post may be available at:

  • The main blog page: http://www.example.com/blog/
  • One or more category pages: http://www.example.com/blog/category/tips/ and http://www.example.com/blog/category/articles/
  • A date-based archive: http://www.example.com/blog/2009/09/
  • Its permalink: http://www.example.com/blog/car-maintenance-101/

In an e-commerce website, the same product may be available in more than one category, e.g. http://www.example.com/books/children/twilight/ and http://www.example.com/books/fiction/twilight/, or there may be very similar products with almost identical descriptions, such as clothes that come in different sizes or colours.

Some websites put session IDs in their URLs to help them track individual visitors as they move through the website. An unfortunate side effect of this is that every time a search engine spider visits the site, it will see a new unique URL for each page.

If your website is hosted in a UNIX/Linux environment, its URLs are case sensitive, but if it is hosted on Windows, http://www.example.com/TEST.html and http://www.example.com/test.html will both load the same page. The search engines, however, would consider these to be two separate URLs that contain duplicate content.

Or on a less legitimate note, the content could have been stolen (or scraped) from another website.

So why is this a problem for you as a website owner?

Well, there are two main issues as far as SEO is concerned:

  • Links pointing to your website may be spread across the different possible URLs, diluting the link popularity of each page and pushing those pages further from the top of the SERPs (Search Engine Results Pages).
  • Search engines don’t really want to store multiple copies of the same content, so they may decide to keep only one version (and it may not be the version you would prefer), or they may decide not to bother storing it at all.

So what can you as a website owner do to prevent or at least reduce this problem?

Be consistent with how you link within your website: don’t mix and match different versions of URLs that load the same page.
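
For example, if you have decided that http://www.example.com/blog/car-maintenance-101/ is the preferred address for a post, always link to it in exactly that form (the link text below is just a hypothetical illustration):

<a href="http://www.example.com/blog/car-maintenance-101/">Car Maintenance 101</a>

rather than sometimes linking to http://example.com/blog/car-maintenance-101/ or some other variation of the same address.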

Google Webmaster Tools gives you the option to tell Google whether you prefer the http://www.example.com or the http://example.com version of your domain name.

Some forms of duplicate content can be completely eliminated by setting up 301 redirects, so that whenever someone tries to access one form of a URL (such as http://example.com) they are automatically redirected to your preferred version (such as http://www.example.com).
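
As a sketch of how this might be set up, assuming your site runs on an Apache server with mod_rewrite enabled (the exact method depends on your hosting), a few lines like these in the .htaccess file in your site’s root directory would 301-redirect the non-www address to the www version:

# Permanently redirect example.com to www.example.com
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

The R=301 flag marks the redirect as permanent, which tells the search engines to credit your preferred address with any link popularity.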

If you can’t eliminate the duplicate content, then you need to have some way of either telling the search engines which is your preferred version, or stopping them from trying to index your less preferred versions. There are several different approaches to this:

Use the canonical tag

Inside the <head> section of your less preferred pages you can add a line such as:
<link rel="canonical" href="http://www.example.com/blog/car-maintenance-101/"  />
This acts as a strong hint to the search engines that this is the address you would prefer them to index instead of the current page. For more information on the canonical tag, see http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

robots.txt

Alternatively, you can tell the search engines not to index certain pages (or directories) by adding a Disallow rule to the robots.txt file in the root directory of your website:

User-agent: *
Disallow: /blog/category

This rule stops all search engines from indexing any pages in the /blog/category directory of the website.

robots meta tag

Sometimes it is not really possible to create a suitable rule in the robots.txt file. An alternative approach is to set the robots meta tag on specific pages that you don’t want to have indexed.

<meta name="robots" content="noindex, nofollow" />

This would tell the search engines not to bother indexing this page and not to follow any of the links from this page.

Final Thoughts

It is possible to run into duplicate content problems even on pages that appear to be totally different if you make the mistake of using the same title and description meta tags across your whole website.
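
For instance, each page’s <head> should contain a title and meta description written specifically for that page, along these lines (the wording here is just a hypothetical illustration):

<title>Car Maintenance 101 | Example Blog</title>
<meta name="description" content="A beginner's guide to basic car maintenance, from checking your oil to changing a flat tyre." />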

Be careful if you are considering using robots.txt or the robots meta tag on legacy pages that may already have inbound links. You don’t want to lose any existing link juice. In that case, it would probably be better to use the canonical tag or a 301 redirect so that the link juice is attributed to the desired page.