Tuesday, August 26, 2008

Google’s Case Sensitive Issues

When Google crawls the web for fresh and unique content, it also crawls pages with duplicate content. Below are some rarely discussed factors that cause Google to crawl duplicate URLs and duplicate content.

Google crawling is case sensitive for paths

Starting with the URI specification

The scheme and hostname are case insensitive, i.e. the URLs below are treated as the same.

http://www.xyz.com/ = HTTP://www.Xyz.com/

Directories and filenames, however, are case sensitive.

The examples below are treated as 3 different URLs:

* http://www.xyz.com/Page1.html
* http://www.xyz.com/PAGE1.HTML
* http://www.xyz.com/page1.html
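
The rule above can be sketched in a few lines of Python. This is only an illustration using the standard library's urllib.parse, not how Google itself normalizes URLs; the function name normalize_scheme_and_host is made up for this example, and it assumes the URL carries no user:password part in front of the hostname.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_scheme_and_host(url: str) -> str:
    """Lowercase only the case-insensitive parts of a URL (scheme and host).

    The path, query and fragment are left untouched because they are
    case sensitive.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # assumption: netloc holds just host[:port]
        parts.path,
        parts.query,
        parts.fragment,
    ))


urls = [
    "http://www.xyz.com/Page1.html",
    "http://www.xyz.com/PAGE1.HTML",
    "http://www.xyz.com/page1.html",
]

# Scheme and host variations collapse to the same URL...
same_home = normalize_scheme_and_host("HTTP://www.Xyz.com/")

# ...but the three path variations stay distinct.
distinct = {normalize_scheme_and_host(u) for u in urls}
```

Here same_home equals "http://www.xyz.com/", while distinct still holds three separate URLs.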

Google and Case Issues

Crawling

Google treats case variations in directories and filenames as distinct, so it considers the URLs below to be different and may crawl all 3:

* http://www.xyz.com/Page1.html
* http://www.xyz.com/PAGE1.HTML
* http://www.xyz.com/page1.html

Indexing

When case-varied URLs are accessible and the web server does not redirect to the preferred URL:

* Duplicate content is crawled under the different case variants of the URL.
* Google consolidates properties (such as link information) across the duplicate URLs and stores them.
* It displays the high-ranking URL selected by comparing the case variants.
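
One way to avoid this situation is to have the web server redirect every case variant to a single preferred URL. The sketch below only computes the redirect target; it assumes the preferred form is all lowercase and that lowercasing the path is safe for the site. The function names preferred_url and redirect_target are hypothetical, and wiring the result into an actual 301 response depends on the server in use.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def preferred_url(url: str) -> str:
    """Return the assumed preferred form: lowercase scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.lower(),  # assumption: the site uses all-lowercase paths
        parts.query,         # query strings are left as they are
        parts.fragment,
    ))


def redirect_target(requested_url: str) -> Optional[str]:
    """Return the URL to redirect to, or None if no redirect is needed."""
    target = preferred_url(requested_url)
    return None if target == requested_url else target
```

With this, redirect_target("http://www.xyz.com/PAGE1.HTML") gives "http://www.xyz.com/page1.html", and a request that is already in the preferred form gives None.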

URL Case Recommendations

Default web server behavior is as follows:

* IIS is case insensitive: it treats Page1.html = page1.html, so the two are served as the same page
* Apache is case sensitive: it treats Page1.html != page1.html, so the two are served as different pages

The most important issue, and one that is rarely discussed, is that robots.txt is case sensitive for paths.

The examples below illustrate this:

* Disallow: /abc = disallow: /abc, the directive name itself is case insensitive
* Disallow: /ABC != Disallow: /abc, the two paths are treated as different
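
The same behavior can be observed with Python's standard urllib.robotparser module. Note that this is Python's parser, not Google's, so it only illustrates the matching rule; the helper name is_allowed and the sample rules are made up for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The directive name is written in lowercase on purpose: it still applies.
rules = """User-agent: *
disallow: /ABC
"""

blocked = is_allowed(rules, "http://www.xyz.com/ABC/page.html")
not_blocked = is_allowed(rules, "http://www.xyz.com/abc/page.html")
```

Here blocked is False because the path matches the rule exactly, while not_blocked is True because /abc does not match /ABC.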

Recommendation

1. Follow a consistent format for URLs: choose either ePuppy.html or epuppy.html and stick to it

2. All-lowercase URLs such as epuppy.html are recommended and are often more error-proof

3. Verify case-sensitive paths with the robots.txt analysis tool in Webmaster Tools

If the points above are considered when creating a website, many duplicate content issues can be avoided.
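
To check an existing site against these recommendations, a list of known URLs (for example from a sitemap or server log) can be scanned for entries that differ only by case. This is a minimal sketch; the function name find_case_variants and the sample URLs are assumptions made for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def find_case_variants(urls):
    """Group URLs whose host and path differ only by letter case.

    Returns a dict mapping (lowercased host, lowercased path) to the sorted
    list of variants, keeping only groups with more than one variant.
    """
    groups = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        key = (parts.netloc.lower(), parts.path.lower())
        groups[key].add(url)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


site_urls = [
    "http://www.xyz.com/Page1.html",
    "http://www.xyz.com/PAGE1.HTML",
    "http://www.xyz.com/page1.html",
    "http://www.xyz.com/about.html",
]

variants = find_case_variants(site_urls)
```

For the sample list, variants contains a single group for /page1.html with all three case variants, and about.html is not reported.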

2 comments:

Sandhya said...

Thanks Raman for sharing all about this about Google. I was not aware of this fact but from now onwards i will take care of this. Thanks for sharing.