Video: Expanding your site to more languages

Webmaster Level: Intermediate to Advanced

We filmed a video providing more details about expanding your site to more languages or country-based language variations. The video covers details about rel=”alternate” hreflang and potential implementation on your multilingual and/or multinational site.
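
For instance, an English page and its Spanish counterpart could each reference all of their language versions with markup along these lines in the <head> (the URLs here are only placeholders):

<link rel="alternate" hreflang="en" href="http://www.example.com/en/" />
<link rel="alternate" hreflang="es" href="http://www.example.com/es/" />
<!-- each language or country version lists itself plus every alternate version -->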


Video and slides on expanding your site to more languages

You can watch the entire video or skip to the relevant sections. Additional resources on hreflang are available in our Help Center. Good luck as you expand your site to more languages!

5 common mistakes with rel=canonical

Webmaster Level: Intermediate to Advanced

Including a rel=canonical link in your webpage is a strong hint to search engines about your preferred version to index among duplicate pages on the web. It’s supported by several search engines, including Yahoo!, Bing, and Google. The rel=canonical link consolidates indexing properties from the duplicates, like their inbound links, and specifies which URL you’d like displayed in search results. However, rel=canonical can be a bit tricky because it’s not very obvious when there’s a misconfiguration.
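
For instance, if http://example.com/cupcake.html?session=123 (a hypothetical URL) duplicates http://example.com/cupcake.html, the duplicate page could include the following in its <head>:

<!-- placed on the duplicate page, pointing at the preferred URL -->
<link rel="canonical" href="http://example.com/cupcake.html" />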


While the webmaster sees the “red velvet” page on the left in their browser, search engines notice the webmaster’s unintended “blue velvet” rel=canonical on the right.

We recommend the following best practices for using rel=canonical:
  • A large portion of the duplicate page’s content should be present on the canonical version.
  • One test is to imagine you don’t understand the language of the content—if you placed the duplicate side-by-side with the canonical, does a very large percentage of the words of the duplicate page appear on the canonical page? If you need to speak the language to understand that the pages are similar (for example, if they’re only topically similar but not extremely close in exact words), the canonical designation might be disregarded by search engines.
  • Double-check that your rel=canonical target exists (it’s not an error or “soft 404”)
  • Verify the rel=canonical target doesn’t contain a noindex robots meta tag
  • Make sure you’d prefer the rel=canonical URL to be displayed in search results (rather than the duplicate URL)
  • Include the rel=canonical link in either the <head> of the page or the HTTP header (see the header example after this list)
  • Specify no more than one rel=canonical for a page. When more than one is specified, all rel=canonicals will be ignored.
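
For non-HTML files such as PDFs, the HTTP header option mentioned above might look roughly like this (the URL is hypothetical):

Link: <http://example.com/downloads/cupcake-recipes.pdf>; rel="canonical"
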
Mistake 1: rel=canonical to the first page of a paginated series

Imagine that you have an article that spans several pages:
  • example.com/article?story=cupcake-news&page=1
  • example.com/article?story=cupcake-news&page=2
  • and so on
Specifying a rel=canonical from page 2 (or any later page) to page 1 is not correct use of rel=canonical, as these are not duplicate pages. Using rel=canonical in this instance would result in the content on pages 2 and beyond not being indexed at all.


Good content (e.g., “cookies are superior nutrition” and “to vegetables”) is lost when specifying rel=canonical from component pages to the first page of a series.

In cases of paginated content, we recommend either a rel=canonical from component pages to a single-page version of the article, or to use rel=”prev” and rel=”next” pagination markup.
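
For instance, page 2 of the cupcake article above might carry markup roughly like this in its <head> (assuming a page 3 exists):

<link rel="prev" href="http://example.com/article?story=cupcake-news&page=1" />
<link rel="next" href="http://example.com/article?story=cupcake-news&page=3" />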


rel=canonical from component pages to the view-all page


If rel=canonical to a view-all page isn’t designated, paginated content can use rel=”prev” and rel=”next” markup.

Mistake 2: Absolute URLs mistakenly written as relative URLs


The <link> tag, like many HTML tags, accepts both relative and absolute URLs. Relative URLs include a path “relative” to the current page. For example, “images/cupcake.png” means “from the current directory go to the “images” subdirectory, then to cupcake.png.” Absolute URLs specify the full path—including the scheme like http://.

Specifying <link rel=canonical href="example.com/cupcake.html" /> (a relative URL since there’s no “http://”) implies that the desired canonical URL is http://example.com/example.com/cupcake.html even though that is almost certainly not what was intended. In these cases, our algorithms may ignore the specified rel=canonical. Ultimately this means that whatever you had hoped to accomplish with this rel=canonical will not come to fruition.
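
A corrected version of the same link simply spells out the absolute URL, scheme included:

<link rel="canonical" href="http://example.com/cupcake.html" />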

Mistake 3: Unintended or multiple declarations of rel=canonical

Occasionally, we see rel=canonical designations that we believe are unintentional. In very rare circumstances we see simple typos, but more commonly a busy webmaster copies a page template without thinking to change the target of the rel=canonical. Now the site owner’s pages specify a rel=canonical to the template author’s site.


If you use a template, check that you didn’t also copy the rel=canonical specification.

Another issue is when pages include multiple rel=canonical links to different URLs. This happens frequently in conjunction with SEO plugins that often insert a default rel=canonical link, possibly unbeknownst to the webmaster who installed the plugin. In cases of multiple declarations of rel=canonical, Google will likely ignore all the rel=canonical hints. Any benefit that a legitimate rel=canonical might have offered will be lost.

In both these types of cases, double-checking the page’s source code will help correct the issue. Be sure to check the entire <head> section as the rel=canonical links may be spread apart.


Check the behavior of plugins by looking at the page’s source code.

Mistake 4: Category or landing page specifies rel=canonical to a featured article

Let’s say you run a site about desserts. Your dessert site has useful category pages like “pastry” and “gelato.” Each day the category pages feature a unique article. For instance, your pastry landing page might feature “red velvet cupcakes.” Because the “pastry” category page has nearly all the same content as the “red velvet cupcake” page, you add a rel=canonical from the category page to the featured individual article.

If we were to accept this rel=canonical, then your pastry category page would not appear in search results. That’s because the rel=canonical signals that you would prefer search engines display the canonical URL in place of the duplicate. However, if you want users to be able to find both the category page and featured article, it’s best to only have a self-referential rel=canonical on the category page, or none at all.


Remember that the canonical designation also implies the preferred display URL. Avoid adding a rel=canonical from a category or landing page to a featured article.

Mistake 5: rel=canonical in the <body>

The rel=canonical link tag should only appear in the <head> of an HTML document. Additionally, to avoid HTML parsing issues, it’s good to include the rel=canonical as early as possible in the <head>. When we encounter a rel=canonical designation in the <body>, it’s disregarded.

This is an easy mistake to correct. Simply double-check that your rel=canonical links are always in the <head> of your page, and as early as possible if you can.


rel=canonical designations in the <head> are processed, not the <body>.

Conclusion

To create valuable rel=canonical designations:
  • Verify that most of the main text content of a duplicate page also appears in the canonical page.
  • Check that rel=canonical is only specified once (if at all) and in the <head> of the page.
  • Check that rel=canonical points to an existent URL with good content (i.e., not a 404, or worse, a soft 404).
  • Avoid specifying rel=canonical from landing or category pages to featured articles as that will make the featured article the preferred URL in search results.
And, as always, please ask any questions in our Webmaster Help forum.

A new opt-out tool

Webmasters have several ways to keep their sites' content out of Google's search results. Today, as promised, we're providing a way for websites to opt out of having their content that Google has crawled appear on Google Shopping, Advisor, Flights, Hotels, and Google+ Local search.

Webmasters can now choose this option through our Webmaster Tools, and crawled content currently being displayed on Shopping, Advisor, Flights, Hotels, or Google+ Local search pages will be removed within 30 days.

We created a first steps cheat sheet for friends & family


Webmaster level: beginner
Everyone knows someone who just set up their first blog on Blogger, installed WordPress for the first time, or has had a website for some time but never gave search much thought. We came up with a first steps cheat sheet for just these folks. It’s a short how-to list with basic tips on search engine-friendly design that can help Google and others better understand the content and increase your site’s visibility. We made sure it’s available in thirteen languages. Please feel free to read it, print it, share it, copy and distribute it!

We hope this content will help those who are just about to start their webmaster adventure or have so far not paid too much attention to search engine-friendly design. Over time, as you gain experience, you may want to have a look at our more advanced Google SEO Starter Guide. As always, we welcome all webmasters and site owners, new and experienced, to join discussions on our Google Webmaster Help Forum.




Come see us at SES London and hear tips on successful site architecture

If you're planning to be at Search Engine Strategies London February 13-15, stop by and say hi to one of the many Googlers who will be there. I'll be speaking on Wednesday at the Successful Site Architecture panel and thought I'd offer up some tips for building crawlable sites for those who can't attend.

Make sure visitors and search engines can access the content
  • Check the Crawl errors section of webmaster tools for any pages Googlebot couldn't access due to server or other errors. If Googlebot can't access the pages, they won't be indexed and visitors likely can't access them either.
  • Make sure your robots.txt file doesn't accidentally block search engines from content you want indexed. You can see a list of the files Googlebot was blocked from crawling in webmaster tools. You can also use our robots.txt analysis tool to make sure you're blocking and allowing the files you intend.
  • Check the Googlebot activity reports to see how long it takes to download a page of your site to make sure you don't have any network slowness issues.
  • If pages of your site require a login and you want the content from those pages indexed, ensure you include a substantial amount of indexable content on pages that aren't behind the login. For instance, you can put several content-rich paragraphs of an article outside the login area, with a login link that leads to the rest of the article.
  • How accessible is your site? How does it look in mobile browsers and screen readers? It's well worth testing your site under these conditions and ensuring that visitors can access the content of the site using any of these mechanisms.

Make sure your content is viewable

  • Check out your site in a text-only browser or view it in a browser with images and Javascript turned off. Can you still see all of the text and navigation?
  • Ensure the important text and navigation in your site is in HTML, not in images, and make sure all images have ALT text that describes them.
  • If you use Flash, use it only when needed. Particularly, don't put all of the text from your site in Flash. An ideal Flash-based site has pages with HTML text and Flash accents. If you use Flash for your home page, make sure that the navigation into the site is in HTML.

Be descriptive

  • Make sure each page has a unique title tag and meta description tag that aptly describe the page (see the example after this list).
  • Make sure the important elements of your pages (for instance, your company name and the main topic of the page) are in HTML text.
  • Make sure the words that searchers will use to look for you are on the page.
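
As a rough sketch (the site and wording here are purely hypothetical), a descriptive <head> might look like this:

<head>
  <title>Red Velvet Cupcake Recipes - Example Bakery</title>
  <meta name="description" content="Step-by-step red velvet cupcake recipes with ingredient lists, photos, and frosting tips.">
</head>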

Keep the site crawlable


  • If possible, avoid frames. Frame-based sites don't allow for unique URLs for each page, which makes indexing each page separately problematic.
  • Ensure the server returns a 404 status code for pages that aren't found. Some servers are configured to return a 200 status code, particularly with custom error messages, and this can result in search engines spending time crawling and indexing non-existent pages rather than the valid pages of the site.
  • Avoid infinite crawls. For instance, if your site has an infinite calendar, add a nofollow attribute to links to dynamically-created future calendar pages. Each search engine may interpret the nofollow attribute differently, so check the help documentation for each. Alternatively, you could use the nofollow meta tag to ensure that search engine spiders don't crawl any outgoing links on a page, or use robots.txt to prevent search engines from crawling URLs that can lead to infinite loops (see the sketch after this list).
  • If your site uses session IDs or cookies, ensure those are not required for crawling.
  • If your site is dynamic, avoid using excessive parameters and use friendly URLs when you can. Some content management systems enable you to rewrite URLs to friendly versions.
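
For the infinite calendar example above, the per-link and page-level options might look roughly like this (the calendar path is hypothetical):

<!-- per-link: ask spiders not to follow one dynamically-created link -->
<a href="/calendar?month=2099-12" rel="nofollow">December 2099</a>
<!-- page-level: ask spiders not to follow any outgoing link on this page -->
<meta name="robots" content="nofollow">
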
See our tips for creating a Google-friendly site and webmaster guidelines for more information on designing your site for maximum crawlability and usability.

If you will be at SES London, I'd love for you to come by and hear more. And check out the other Googlers' sessions too:

Tuesday, February 13th

Auditing Paid Listings & Clickfraud Issues 10:45 - 12:00
Shuman Ghosemajumder, Business Product Manager for Trust & Safety

Wednesday, February 14th

A Keynote Conversation 9:00 - 9:45
Matt Cutts, Software Engineer

Successful Site Architecture 10:30 - 11:45
Vanessa Fox, Product Manager, Webmaster Central

Google University 12:45 - 1:45

Converting Visitors into Buyers 2:45 - 4:00
Brian Clifton, Head of Web Analytics, Google Europe

Search Advertising Forum 4:30 - 5:45
David Thacker, Senior Product Manager

Thursday, February 15th

Meet the Crawlers 9:00 - 10:15
Dan Crow, Product Manager

Web Analytics and Measuring Success Overview 1:15 - 2:30
Brian Clifton, Head of Web Analytics, Google Europe

Search Advertising Clinic 1:15 - 2:30
Will Ashton, Retail Account Strategist

Site Clinic 3:00 - 4:15
Sandeepan Banerjee, Sr. Product Manager, Indexing

      Discover your links

      Update on October 15, 2008: For more recent news on links, visit Links Week on our Webmaster Central Blog. We're discussing internal links, outbound links, and inbound links.

      You asked, and we listened: we've extended our support for querying links to your site well beyond the link: operator you might have used in the past. Now you can use webmaster tools to view a much larger sample of links to pages on your site that we found on the web. Unlike the link: operator, this data is much more comprehensive and can be classified, filtered, and downloaded. All you need to do is verify site ownership to see this information.


      To make this data even more useful, we have divided the world of links into two types: external and internal. Let's understand what kind of links fall into which bucket.


      What are external links?
      External links to your site are the links that reside on pages that do not belong to your domain. For example, if you are viewing links for http://www.google.com/, all the links that do not originate from pages on any subdomain of google.com would appear as external links to your site.

      What are internal links?

      Internal links to your site are the links that reside on pages that belong to your domain. For example, if you are viewing links for http://www.google.com/, all the links that originate from pages on any subdomain of google.com, such as http://www.google.com/ or mobile.google.com, would appear as internal links to your site.

      Viewing links to a page on your site

      You can view the links to your site by selecting a verified site in your webmaster tools account and clicking on the new Links tab at the top. Once there, you will see the two options on the left: external links and internal links, with the external links view selected. You will also see a table that lists pages on your site, as shown below. The first column of the table lists pages of your site with links to them, and the second column shows the number of external links to that page that we have available to show you. (Note that this may not be 100% of the external links to this page.)


      This table also provides the total number of external links to your site that we have available to show you.
      When in this summary view, click the linked number to go to the detailed list of links to that page.
      When in the detailed view, you'll see the list of all the pages that link to a specific page on your site, and the time we last crawled that link. Since you are on the External Links tab on the left, this list shows the external pages that point to the page.


      Finding links to a specific page on your site
      To find links to a specific page on your site, you first need to find that specific page in the summary view. You can do this by navigating through the table, or if you want to find that page quickly, you can use the handy Find a page link at the top of the table. Just fill in the URL and click See details. For example, if the page you are looking for has the URL http://www.google.com/?main, you can enter “?main” in the Find a page form. This will take you directly to the detailed view of the links to http://www.google.com/?main.


      Viewing internal links

      To view internal links to pages on your site, click on the Internal Links tab on the left side bar in the view. This takes you to a summary table that, just like the external links view, displays information about pages on your site with internal links to them.

      However, this view also provides you with a way to filter the data further: to see links from any of the subdomains on the domain, or links from just the specific subdomain you are currently viewing. For example, if you are currently viewing the internal links to http://www.google.com/, you can either see links from all the subdomains, such as links from http://mobile.google.com/ and http://www.google.com, or you can see links only from other pages on http://www.google.com.


      Downloading links data
      There are three different ways to download links data about your site. First, you can download the current view of the table you see; this lets you navigate to any summary or details table and download the data shown there. Second, and probably the most useful, is the list of all external links to your site. This lets you download a list of all the links that point to your site, along with information about the page they point to and the last time we crawled that link. Third, we provide a similar download for all internal links to your site.


      We do limit the amount of data you can download for each type of link (for instance, you can currently download up to one million external links). Google knows about more links than the total we show, but the overall fraction of links we show is much, much larger than the link: command currently offers. Why not visit us at Webmaster Central and explore the links for your site?

      Better understanding of your site

      SES Chicago was wonderful. Meeting so many of you made the trip absolutely perfect. It was as special as if (Chicago local) Oprah had joined us!

      While hanging out at the Google booth, I was often asked about how to take advantage of our webmaster tools. For example, here's one tip on Common Words.

      Common Words: Our prioritized listing of your site's content
      The common words feature lists in order of priority (from highest to lowest) the prevalent words we've found in your site, and in links to your site. (This information isn't available for subdirectories or subdomains.) Here are the steps to leveraging common words:

      1. Determine your website's key concepts. If it offers getaways to a cattle ranch in Wyoming, the key concepts may be "cattle ranch," "horseback riding," and "Wyoming."

      2. Verify that Google detected the same phrases you believe are of high importance. Log in to webmaster tools, select your verified site, and choose Page analysis from the Statistics tab. Here, under "Common words in your site's content," we list the phrases detected from your site's content in order of prevalence. Do the common words lack any concepts you believe are important? Are they listing phrases that have little direct relevance to your site?

      2a. If you're missing important phrases, you should first review your content. Do you have solid, textual information that explains and relates to the key concepts of your site? If, in the cattle-ranch example, "horseback riding" was absent from common words, you may then want to review the "activities" page of the site. Does it include mostly images, or only list a schedule of riding lessons, rather than conceptually relevant information?

      It may sound obvious, but if you want to rank for a certain set of keywords and we don't even see those keyword phrases on your website, then ranking for those phrases will be difficult.

      2b. When you see general, non-illustrative common words that don't relate helpfully to your site's content (e.g. a top listing of "driving directions" or "contact us"), then it may be beneficial to increase the ratio of relevant content on your site. (Although don't be too worried if you see a few of these common words, as long as you also see words that are relevant to your main topics.) In the cattle ranch example, you would give visitors "driving directions" and "contact us" information. However, if these general, non-illustrative terms surface as the highest-rated common words, or the entire list of common words is only these types of terms, then Google (and likely other search engines) could not find enough "meaty" content.

      2c. If you find that many of the common words still don't relate to your site, check out our blog post on unexpected common words.

      3. Here are a few of our favorite posts on improving your site's content:
      Target visitors or search engines?

      Improving your site's indexing and ranking

      NEW! SES Chicago - Using Images

      4. Should you decide to update your content, please keep in mind that we will need to recrawl your site in order to recognize changes, and that this may take time. Of course, you can notify us of modifications by submitting a Sitemap.

      Happy holidays from all of us on the Webmaster Central team!

      SES Chicago: Googlers Trevor Foucher, Adam Lasnik and Jonathan Simon

      Deftly dealing with duplicate content

      At the recent Search Engine Strategies conference in freezing Chicago, many of us Googlers were asked questions about duplicate content. We recognize that there are many nuances and a bit of confusion on the topic, so we'd like to help set the record straight.

      What is duplicate content?
      Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Most of the time when we see this, it's unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and -- worse yet -- linked) via multiple distinct URLs, and so on. In some cases, content is duplicated across domains in an attempt to manipulate search engine rankings or garner more traffic via popular or long-tail queries.

      What isn't duplicate content?
      Though we do offer a handy translation utility, our algorithms won't view the same article written in English and Spanish as duplicate content. Similarly, you shouldn't worry about occasional snippets (quotes and otherwise) being flagged as duplicate content.

      Why does Google care about duplicate content?
      Our users typically want to see a diverse cross-section of unique content when they do searches. In contrast, they're understandably annoyed when they see substantially the same content within a set of search results. Also, webmasters become sad when we show a complex URL (example.com/contentredir?value=shorty-george&lang=en) instead of the pretty URL they prefer (example.com/en/shorty-george.htm).

      What does Google do about it?
      During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in "regular" and "printer" versions and neither set is blocked in robots.txt or via a noindex meta tag, we'll choose one version to list. In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering rather than ranking adjustments ... so in the vast majority of cases, the worst thing that'll befall webmasters is to see the "less desired" version of a page shown in our index.

      How can Webmasters proactively address duplicate content issues?
      • Block appropriately: Rather than letting our algorithms determine the "best" version of a document, you may wish to help guide us to your preferred version. For instance, if you don't want us to index the printer versions of your site's articles, disallow those directories or make use of regular expressions in your robots.txt file (see the sketch after this list).
      • Use 301s: If you have restructured your site, use 301 redirects ("RedirectPermanent") in your .htaccess file to smartly redirect users, the Googlebot, and other spiders.
      • Be consistent: Endeavor to keep your internal linking consistent; don't link to /page/ and /page and /page/index.htm.
      • Use TLDs: To help us serve the most appropriate version of a document, use top level domains whenever possible to handle country-specific content. We're more likely to know that .de indicates Germany-focused content, for instance, than /de or de.example.com.
      • Syndicate carefully: If you syndicate your content on other sites, make sure they include a link back to the original article on each syndicated article. Even with that, note that we'll always show the (unblocked) version we think is most appropriate for users in each given search, which may or may not be the version you'd prefer.
      • Use the preferred domain feature of webmaster tools: If other sites link to yours using both the www and non-www version of your URLs, you can let us know which way you prefer your site to be indexed.
      • Minimize boilerplate repetition: For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.
      • Avoid publishing stubs: Users don't like seeing "empty" pages, so avoid placeholders where possible. This means not publishing (or at least blocking) pages with zero reviews, no real estate listings, etc., so users (and bots) aren't subjected to a zillion instances of "Below you'll find a superb list of all the great rental opportunities in [insert cityname]..." with no actual listings.
      • Understand your CMS: Make sure you're familiar with how content is displayed on your Web site, particularly if it includes a blog, a forum, or related system that often shows the same content in multiple formats.
      • Don't worry be happy: Don't fret too much about sites that scrape (misappropriate and republish) your content. Though annoying, it's highly unlikely that such sites can negatively impact your site's presence in Google. If you do spot a case that's particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and have us deal with the rogue site.
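
      For the printer-version case in the first point above, a robots.txt entry along these lines could keep those copies from being crawled (the directory name is hypothetical):

      User-agent: *
      # keep the printer-friendly copies out of the crawl
      Disallow: /print/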

      In short, a general awareness of duplicate content issues and a few minutes of thoughtful preventative maintenance should help you to help us provide users with unique and relevant content.

      SES Chicago - Using Images

      We all had a great time at SES Chicago last week, answering questions and getting feedback.

      One of the sessions I participated in was Images and Search Engines, and the panelists had great information about using images on your site, as well as on optimizing for Google Image search.

      Ensuring visitors and search engines know what your content is about
      Images on a site are great -- but search engines can't read them, and not all visitors can. Make sure your site is accessible and can be understood by visitors viewing your site with images turned off in their browsers, on mobile devices, and with screen readers. If you do that, search engines won't have any trouble. Some things that you can do to ensure this:

      • Don't put the bulk of your text in images. It may sound simple, but the best thing you can do is to put your text into, well, text. Reserve images for graphical elements. If all of the text on your page is in an image, it becomes inaccessible.
      • Take advantage of alt attributes for all of your images. Make sure the alt text is descriptive and unique (see the example after this list). For instance, alt text such as "picture1" or "logo" doesn't provide much information about the image. "Charting the path of stock x" and "Company Y" give more details.
      • Don't overload your alt text. Be descriptive, but don't stuff it with extra keywords.
      • It's important to use alt text for any image on your pages, but if your company name, navigation, or other major elements of your pages are in images, alt text becomes especially important. Consider moving vital details to text to ensure all visitors can view them.
      • Look at the image-to-text ratio on your page. How much text do you have? One way of looking at this is to look at your site with images turned off in your browser. What content can you see? Is the intent of your site obvious? Do the pages convey your message effectively?
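
      For instance, the stock-chart image mentioned above might be marked up like this (the filename is hypothetical):

      <img src="/images/stock-x-chart.png" alt="Charting the path of stock X">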

      Taking advantage of Image search
      The panelists pointed out that shoppers often use Image search to see the things they want to buy. If you have a retail site, make sure that you have images of your products (and that they can be easily identified with alt text, headings, and textual descriptions). Searchers can then find your images and get to your site.

      One thing that can help your images be returned for results in Google Image search is opting in to enhanced image search in webmaster tools. This enables us to use your images in the Google Image Labeler, which harnesses the power of the community for adding metadata to your images.

      Someone asked if we have a maximum number of images per site that we accept for the Image Labeler. We don't. You can opt in no matter how many, or how few, images your site has.

      Update: More information on using images can be found in our Help Center. 

      The number of pages Googlebot crawls

      The Googlebot activity reports in webmaster tools show you the number of pages of your site Googlebot has crawled over the last 90 days. We've seen some of you asking why this number might be higher than the total number of pages on your sites.


      Googlebot crawls pages of your site based on a number of things including:
      • pages it already knows about
      • links from other web pages (within your site and on other sites)
      • pages listed in your Sitemap file
      More specifically, Googlebot doesn't access pages; it accesses URLs. And the same page can often be accessed via several URLs. Consider the home page of a site that can be accessed from the following four URLs:
      • http://www.example.com/
      • http://www.example.com/index.html
      • http://example.com
      • http://example.com/index.html
      Although all URLs lead to the same page, all four URLs may be used in links to the page. When Googlebot follows these links, a count of four is added to the activity report.

      Many other scenarios can lead to multiple URLs for the same page. For instance, a page may have several named anchors, such as:
      • http://www.example.com/mypage.html#heading1
      • http://www.example.com/mypage.html#heading2
      • http://www.example.com/mypage.html#heading3
      And dynamically generated pages often can be reached by multiple URLs, such as:
      • http://www.example.com/furniture?type=chair&brand=123
      • http://www.example.com/hotbuys?type=chair&brand=123
      As you can see, when you consider that each page on your site might have multiple URLs that lead to it, the number of URLs that Googlebot crawls can be considerably higher than the number of total pages for your site.

      Of course, you (and we) only want one version of the URL to be returned in the search results. Not to worry -- this is exactly what happens. Our algorithms select a version to include, and you can provide input on this selection process.

      Redirect to the preferred version of the URL
      You can do this using a 301 (permanent) redirect. In the first example that shows four URLs that point to a site's home page, you may want to redirect index.html to www.example.com/. And you may want to redirect example.com to www.example.com so that any URLs that begin with one version are redirected to the other version. Note that you can do this latter redirect with the Preferred Domain feature in webmaster tools. (If you also use a 301 redirect, make sure that this redirect matches what you set for the preferred domain.)
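
      On an Apache server, for instance, the two redirects could be handled with mod_rewrite rules in .htaccess roughly like the following (a sketch only; adapt it to your own server setup):

      RewriteEngine On
      # send example.com URLs to the www version
      RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
      RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
      # send /index.html to the root URL
      RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]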

      Block the non-preferred versions of a URL with a robots.txt file
      For dynamically generated pages, you may want to block the non-preferred version using pattern matching in your robots.txt file. (Note that not all search engines support pattern matching, so check the guidelines for each search engine bot you're interested in.) For instance, in the third example that shows two URLs that point to a page about the chairs available from brand 123, the "hotbuys" section rotates periodically and the content is always available from a primary and permanent location. In that case, you may want to index the first version and block the "hotbuys" version. To do this, add the following to your robots.txt file:

      User-agent: Googlebot
      Disallow: /hotbuys?*

      To ensure that this directive will actually block and allow what you intend, use the robots.txt analysis tool in webmaster tools. Just add this directive to the robots.txt section on that page, list the URLs you want to check in the "Test URLs" section and click the Check button. For this example, you'd see a result like this:

      Don't worry about links to anchors, because while Googlebot will crawl each link, our algorithms will index the URL without the anchor.

      And if you don't provide input such as that described above, our algorithms do a really good job of picking a version to show in the search results.