Target visitors or search engines?

Last Friday afternoon, I was able to catch the end of the Blog Business Summit in Seattle. At the session called "Blogging and SEO Strategies," John Battelle brought up a good point: as a writer, he doesn't want to have to think about all of this search engine optimization stuff. Dave Taylor had just been talking about the order of words in title tags, keyword density, and using hyphens rather than underscores in URLs.

We agree, which is why the main point in our webmaster guidelines is to make sites for visitors, not for search engines. Visitor-friendly design makes for search-engine-friendly design as well. The team at Google Webmaster Central talks with plenty of site owners who care about the details of how Google crawls and indexes sites (hyphens and underscores included), but many site owners are simply concerned with building great sites. The good news is that our guidelines and tips about how Google crawls and indexes sites come down to one thing: we want great content for our search results.

In the spirit of John Battelle's point, here's a recap of some quick tips for ensuring your site is friendly for visitors.

Make good use of page titles
This applies to the main heading on the page itself, but also to the title that appears in the browser's title bar (the page's <title> tag).


Whenever possible, ensure each page has a unique title that describes the page well. For instance, if your site is for your store "Buffy's House of Sofas", a visitor may want to bookmark your home page and the order page for your red, fluffy sofa. If all of your pages have the same title, "Welcome to my site!", then a visitor will have trouble finding your site again in their bookmarks. However, if your home page has the title "Buffy's House of Sofas" and your red sofa page has the title "Buffy's red fluffy sofa", then visitors can glance at the title to see what a page is about and can easily find it in their bookmarks later. And if your visitors are anything like me, they may have several browser tabs open and will appreciate descriptive titles for easier navigation.

This simple tip for visitors helps search engines too. Search engines index pages based on the words contained in them, and including descriptive titles helps search engines know what the pages are about. And search engines often use a page's title in the search results. "Welcome to my site" may not entice searchers to click on your site in the results quite so much as "Buffy's House of Sofas".
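
For example (using the made-up sofa store from above), the title tags for those two pages might look something like this:

<!-- Home page -->
<title>Buffy's House of Sofas</title>

<!-- Red sofa order page -->
<title>Buffy's red fluffy sofa - Buffy's House of Sofas</title>
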
Write with words
Images, Flash, and other multimedia make for pretty web pages, but make sure your core messages are in text, or use ALT text to provide textual descriptions of your multimedia. This is great for search engines, which are based on text: searchers enter search queries as words, after all. But it's also great for visitors, who may have images or Flash turned off in their browsers or might be using screen readers or mobile devices. You can also provide HTML versions of your multimedia-based pages (if you do that, be sure to block the multimedia versions from being indexed using a robots.txt file).
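
As a quick illustration (the file and directory names here are hypothetical), descriptive ALT text and a robots.txt rule that keeps a multimedia-only version out of the index might look like this:

<img src="red-fluffy-sofa.jpg" alt="Buffy's red fluffy sofa">

User-agent: *
Disallow: /flash-version/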

Make sure the words you want to be found for are in your content
Visitors may not read your web site linearly like they would a newspaper article or book. Visitors may follow links from elsewhere on the web to any of your pages. Make sure that they have context for any page they're on. On your order page, don't just write "order now!" Write something like "Order your fluffy red sofa now!" But write it for people who will be reading your site. Don't try to cram as many words in as possible, thinking search engines can index more words that way. Think of your visitors. What are they going to be searching for? Is your site full of industry jargon when they'll be searching for you with more informal words?

As I wrote in that guest post on Matt Cutts' blog when I talked about hyphens and underscores:

You know what your site’s about, so it may seem completely obvious to you when you look at your home page. But ask someone else to take a look and don’t tell them anything about the site. What do they think your site is about?

Consider this text:

“We have hundreds of workshops and classes available. You can choose the workshop that is right for you. Spend an hour or a week in our relaxing facility.”

Will this site show up for searches for [cooking classes] or [wine tasting workshops] or even [classes in Seattle]? It may not be as obvious to visitors (and search engine bots) what your page is about as you think.

Along those same lines, does your content use words that people are searching for? Does your site text say “check out our homes for sale” when people are searching for [real estate in Boston]?

Make sure your pages are accessible
I know -- this post was supposed to be about writing content, not technical details. But visitors can't read your site if they can't access it. If the network is down or your server returns errors when someone tries to access the pages of your site, it's not just search engines that will have trouble. Fortunately, webmaster tools makes these problems easy to spot: we'll let you know if we've had any trouble accessing any of your pages, and we'll tell you the specific page we couldn't access and the exact error we got. These problems aren't always easy to fix, but we try to make them easy to find.
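
If you'd like a quick check outside of webmaster tools as well, a small script can tell you whether your key pages are reachable and returning a successful status. Here's a rough sketch in Python (the URLs are placeholders -- substitute your own):

import urllib.error
import urllib.request

# Replace these with the pages you care about most.
pages = [
    "http://www.example.com/",
    "http://www.example.com/sofas/red-fluffy-sofa.html",
]

for url in pages:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            print(url, "->", response.status)  # 200 means success
    except urllib.error.HTTPError as e:
        print(url, "-> server returned error", e.code)
    except urllib.error.URLError as e:
        print(url, "-> could not connect:", e.reason)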

Googlebot activity reports

The webmaster tools team has a very exciting mission: we dig into our logs, find as much useful information as possible, and pass it on to you, the webmasters. Our reward is that you can more easily understand what Google sees and why some pages don't make it into the index.

The latest batch of information that we've put together for you is the amount of traffic between Google and a given site. We show you the number of requests, number of kilobytes (yes, yes, I know that tech-savvy webmasters can usually dig this out, but our new charts make it really easy to see at a glance), and the average document download time. You can see this information in chart form, as well as in hard numbers (the maximum, minimum, and average).

For instance, here's the number of pages Googlebot has crawled in the Webmaster Central blog over the last 90 days. The maximum number of pages Googlebot has crawled in one day is 24 and the minimum is 2. That makes sense, because the blog was launched less than 90 days ago, and the chart shows that the number of pages crawled per day has increased over time. The number of pages crawled is sometimes more than the total number of pages in the site -- especially if the same page can be accessed via several URLs. So http://googlewebmastercentral.blogspot.com/2006/10/learn-more-about-googlebots-crawl-of.html and http://googlewebmastercentral.blogspot.com/2006/10/learn-more-about-googlebots-crawl-of.html#links are different, but point to the same page (the second points to an anchor within the page).


And here's the average number of kilobytes downloaded from this blog each day. As you can see, as the site has grown over the last two and a half months, the average number of kilobytes downloaded each day has increased as well.


The first two reports can help you diagnose the impact that changes in your site may have on its coverage. If you overhaul your site and dramatically reduce the number of pages, you'll likely notice a drop in the number of pages that Googlebot accesses.

The average document download time can help pinpoint subtle networking problems. If the average time spikes, you might have network slowdowns or bottlenecks that you should investigate. The report for this blog shows that we did have a short spike in early September (the maximum time was 1057 ms), but it quickly went back to a normal level, so things now look OK.

In general, the load time of a page doesn't affect its ranking, but we wanted to provide this information because it can help you spot problems. We hope you find this data as useful as we do!

Learn more about Googlebot's crawl of your site and more!

We've added a few new features to webmaster tools and invite you to check them out.

Googlebot activity reports
Check out these cool charts! We show you the number of pages Googlebot has crawled from your site per day, the number of kilobytes of data Googlebot has downloaded per day, and the average time it took Googlebot to download your pages. Webmaster tools shows each of these for the last 90 days. Stay tuned for more information about this data and how you can use it to pinpoint issues with your site.

Crawl rate control
Googlebot uses sophisticated algorithms that determine how much to crawl each site. Our goal is to crawl as many pages from your site as we can on each visit without overwhelming your server's bandwidth.

We've been conducting a limited test of a new feature that lets you give us information about how we crawl your site. Today, we're making this tool available to everyone. You can access it from the Diagnostic tab. If you'd like Googlebot to slow down its crawl of your site, simply choose the Slower option.

If we feel your server could handle the additional bandwidth, and we can crawl your site more, we'll let you know and offer the option for a faster crawl.

If you request a changed crawl rate, the change will last for 90 days. If you'd like to keep the new rate after that, simply return to webmaster tools and make the change again.


Enhanced image search
You can now opt in to enhanced image search for the images on your site. This lets tools such as Google Image Labeler associate the images on your site with labels, which improves the indexing and search quality of those images. After you've opted in, you can opt out at any time.

Number of URLs submitted
Recently at SES San Jose, a webmaster asked me if we could show the number of URLs we find in a Sitemap. He said that he generates his Sitemaps automatically and he'd like confirmation that the number he thinks he generated is the same number we received. We thought this was a great idea. Simply access the Sitemaps tab to see the number of URLs we found in each Sitemap you've submitted.

As always, we hope you find these updates useful and look forward to hearing what you think.

Useful information you may have missed

How to verify Googlebot

Lately I've heard a couple of smart people ask that search engines provide a way to know that a bot is authentic. After all, any spammer could name their bot "Googlebot" and claim to be Google, so which bots do you trust and which do you block?

The common request we hear is to post a list of Googlebot IP addresses in some public place. The problem with that is that if/when the IP ranges of our crawlers change, not everyone will know to check. In fact, the crawl team migrated Googlebot IPs a couple years ago and it was a real hassle alerting webmasters who had hard-coded an IP range. So the crawl folks have provided another way to authenticate Googlebot. Here's an answer from one of the crawl people (quoted with their permission):


Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.


This answer has also been provided to our help-desk, so I'd consider it an official way to authenticate Googlebot. In order to fetch from the "official" Googlebot IP range, the bot has to respect robots.txt and our internal hostload conventions so that Google doesn't crawl you too hard.
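
If you'd like to automate this check, here's a minimal sketch of the reverse-then-forward lookup in Python (error handling kept to a bare minimum):

import socket

def is_googlebot(ip):
    """Verify a claimed Googlebot IP: reverse DNS, then a forward lookup."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup: IP -> name
    except socket.herror:
        return False
    # The name must be in the googlebot.com domain.
    if not host.endswith(".googlebot.com"):
        return False
    try:
        # Forward lookup of that name must point back to the original IP.
        return socket.gethostbyname(host) == ip
    except socket.gaierror:
        return False

print(is_googlebot("66.249.66.1"))  # True for the example IP above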

(Thanks to N. and J. for help on this answer from the crawl side of things.)

Debugging blocked URLs

Vanessa's been posting a lot lately, and I'm starting to feel left out. So here's my tidbit of wisdom for you: I've noticed a couple of webmasters confused by "blocked by robots.txt" errors, and I wanted to share the steps I take when debugging robots.txt problems:

A handy checklist for debugging a blocked URL

Let's assume you are looking at crawl errors for your website and notice a URL restricted by robots.txt that you weren't intending to block:
http://www.example.com/amanda.html URL restricted by robots.txt Sep 3, 2006

Check the robots.txt analysis tool
The first thing you should do is go to the robots.txt analysis tool for that site. Make sure you are looking at the correct site for that URL, paying attention to the protocol and subdomain. (Each subdomain and protocol may have its own robots.txt file, so https://www.example.com/robots.txt may be different from http://example.com/robots.txt and may be different from http://amanda.example.com/robots.txt.) Paste the blocked URL into the "Test URLs against this robots.txt file" box. If the tool reports that it is blocked, you've found your problem. If the tool reports that it's allowed, we need to investigate further.

At the top of the robots.txt analysis tool, take a look at the HTTP status code. If we are reporting anything other than a 200 (Success) or a 404 (Not found) then we may not be able to reach your robots.txt file, which stops our crawling process. (Note that you can see the last time we downloaded your robots.txt file at the top of this tool. If you make changes to your file, check this date and time to see if your changes were made after our last download.)
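
As an extra sanity check outside of webmaster tools, Python's standard library includes a robots.txt parser. Something like this (with your own URLs substituted) confirms what a straightforward reading of the file says, though it won't always match Googlebot's interpretation exactly, so the analysis tool remains the authoritative check:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser("http://www.example.com/robots.txt")
parser.read()  # downloads and parses the robots.txt file

# Is Googlebot allowed to fetch the URL that showed up in the crawl errors?
print(parser.can_fetch("Googlebot", "http://www.example.com/amanda.html"))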

Check for changes in your robots.txt file
If that all looks fine, check whether your robots.txt file has changed since the error occurred by looking at the date it was last modified. If it was modified after the date given for the error in the crawl errors, it might be that someone changed the file so that the new version no longer blocks this URL.

Check for redirects of the URL
If you can be certain that this URL isn't blocked, check to see if the URL redirects to another page. When Googlebot fetches a URL, it checks the robots.txt file to make sure it is allowed to access the URL. If the robots.txt file allows access to the URL, but the URL returns a redirect, Googlebot checks the robots.txt file again to see if the destination URL is accessible. If at any point Googlebot is redirected to a blocked URL, it reports that it could not get the content of the original URL because it was blocked by robots.txt.

Sometimes this behavior is easy to spot because a particular URL always redirects to another one. But sometimes this can be tricky to figure out. For instance:
  • Your site may not have a robots.txt file at all (and therefore, allows access to all pages), but a URL on the site may redirect to a different site, which does have a robots.txt file. In this case, you may see URLs blocked by robots.txt for your site (even though you don't have a robots.txt file).
  • Your site may prompt for registration after a certain number of page views. You may have the registration page blocked by a robots.txt file. In this case, the URL itself may not redirect, but if Googlebot triggers the registration prompt when accessing the URL, it will be redirected to the blocked registration page, and the original URL will be listed in the crawl errors page as blocked by robots.txt.
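
If you want to trace a redirect chain by hand and check robots.txt at each hop, here's a rough sketch in Python using only the standard library (the URL is a placeholder, and Googlebot's actual behavior may differ in the details):

import urllib.error
import urllib.request
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None tells urllib not to follow redirects, so each 3xx
    # response surfaces as an HTTPError we can inspect ourselves.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def robots_allows(url, user_agent="Googlebot"):
    parts = urlsplit(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

def trace(url, max_hops=10):
    opener = urllib.request.build_opener(NoRedirect())
    for _ in range(max_hops):
        if not robots_allows(url):
            print("blocked by robots.txt:", url)
            return
        try:
            response = opener.open(url, timeout=10)
            print("final URL:", url, "status:", response.status)
            return
        except urllib.error.HTTPError as e:
            location = e.headers.get("Location")
            if e.code in (301, 302, 303, 307, 308) and location:
                url = urljoin(url, location)  # follow the redirect manually
                print("redirects to:", url)
            else:
                print("error", e.code, "at", url)
                return
    print("gave up: too many redirects")

trace("http://www.example.com/amanda.html")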

Ask for help
Finally, if you still can't pinpoint the problem, you might want to post on our forum for help. Be sure to include the blocked URL in your message. Sometimes it's easier for other people to notice oversights you may have missed.

Good luck debugging! And by the way -- unrelated to robots.txt -- make sure that you don't have "noindex" meta tags at the top of your web pages; those also keep Google from showing your pages in our index.

Setting the preferred domain

Based on your input, we've recently made a few changes to the preferred domain feature of webmaster tools. And since you've had some questions about this feature, we'd like to answer them.

The preferred domain feature enables you to tell us if you'd like URLs from your site crawled and indexed using the www version of the domain (http://www.example.com) or the non-www version of the domain (http://example.com). When we initially launched this, we added the non-preferred version to your account when you specified a preference so that you could see any information associated with the non-preferred version. But many of you found that confusing, so we've made the following changes:
  • When you set the preferred domain, we no longer will add the non-preferred version to your account.
  • If you had previously added the non-preferred version to your account, you'll still see it listed there, but you won't be able to add a Sitemap for the non-preferred version.
  • If you have already set the preferred domain and we had added the non-preferred version to your account, we'll be removing that non-preferred version from your account over the next few days.
Note that if you would like to see any information we have about the non-preferred version, you can always add it to your account.

Here are some questions we've had about this preferred domain feature, and our replies.

Once I've set my preferred domain, how long will it take before I see changes?
The time frame depends on many factors (such as how often your site is crawled and how many pages are indexed with the non-preferred version). You should start to see changes within a few weeks of setting your preferred domain.

Is the preferred domain feature a filter or a redirect? Does it simply cause the search results to display on the URLs that are in the version I prefer?
The preferred domain feature is not a filter. When you set a preference, we:
  • Consider all links that point to the site (whether those links use the www version or the non-www version) to be pointing at the version you prefer. This helps us more accurately determine PageRank for your pages.
  • Once we know that both versions of a URL point to the same page, we try to select the preferred version for future crawls.
  • Index pages of your site using the version you prefer. If some pages of your site are indexed using the www version and other pages are indexed using the non-www version, then over time, you should see a shift to the preference you've set.

If I use a 301 redirect on my site to point the www and non-www versions to the same version, do I still need to use this feature?
You don't have to use it, as we can follow the redirects. However, you still can benefit from using this feature in two ways: we can more easily consolidate links to your site and over time, we'll direct our crawl to the preferred version of your pages.

If I use this feature, should I still use a 301 redirect on my site?
You don't need to use it for Googlebot, but you should still use the 301 redirect, if it's available. This will help visitors and other search engines. Of course, make sure that you point to the same URL with the preferred domain feature and the 301 redirect.
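
As an example, if your server happens to run Apache with mod_rewrite enabled (an assumption -- the right approach depends on your server), a common way to 301 the non-www version to the www version is a rule like this in your .htaccess file:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]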

You can find more about this in our webmaster help center.

Better details about when Googlebot last visited a page

Most people know that Googlebot downloads pages from web servers to crawl the web. Not as many people know that if Googlebot accesses a page and gets a 304 (Not Modified) response to an If-Modified-Since qualified request, Googlebot doesn't download the contents of that page. This reduces the bandwidth consumed on your web server.
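
You can watch this happen yourself by sending a conditional request, for example with curl (the URL and date are placeholders). If the page hasn't changed since that date and the server supports conditional requests, the response should be a 304 with no page body:

> curl -I -H "If-Modified-Since: Wed, 12 Apr 2006 20:02:06 GMT" http://www.example.com/
HTTP/1.1 304 Not Modified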

When you look at Google's cache of a page (for instance, by using the cache: operator or clicking the Cached link under a URL in the search results), you can see the date that Googlebot retrieved that page. Previously, the date we listed for the page's cache was the date that we last successfully fetched the content of the page. This meant that even if we visited a page very recently, the cache date might be quite a bit older if the page hadn't changed since the previous visit. This made it difficult for webmasters to use the cache date we display to determine Googlebot's most recent visit. Consider the following example:
  1. Googlebot crawls a page on April 12, 2006.
  2. Our cached version of that page notes that "This is G o o g l e's cache of http://www.example.com/ as retrieved on April 12, 2006 20:02:06 GMT."
  3. Periodically, Googlebot checks to see if that page has changed, and each time, receives a Not-Modified response. For instance, on August 27, 2006, Googlebot checks the page, receives a Not-Modified response, and therefore, doesn't download the contents of the page.
  4. On August 28, 2006, our cached version of the page still shows the April 12, 2006 date -- the date we last downloaded the page's contents, even though Googlebot last visited the day before.
We've recently changed the date we show for the cached page to reflect when Googlebot last accessed it (whether the page had changed or not). This should make it easier for you to determine the most recent date Googlebot visited the page. For instance, in the above example, the cached version of the page would now say "This is G o o g l e's cache of http://www.example.com/ as retrieved on August 27, 2006 13:13:37 GMT."

Note that this change will be reflected for individual pages as we update those pages in our index.

All About Googlebot

I've seen a lot of questions lately about robots.txt files and Googlebot's behavior. Last week at SES, I spoke on a new panel called the Bot Obedience course. And a few days ago, some other Googlers and I fielded questions on the WebmasterWorld forums. Here are some of the questions we got:

If my site is down for maintenance, how can I tell Googlebot to come back later rather than to index the "down for maintenance" page?
You should configure your server to return a status of 503 (Service Unavailable) rather than 200 (Success). That lets Googlebot know to try the pages again later.
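
In practice you'd configure this in your regular web server, but as a minimal sketch of the idea, here's a tiny Python maintenance server that answers every request with a 503 and a Retry-After hint:

from http.server import BaseHTTPRequestHandler, HTTPServer

class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 tells crawlers the outage is temporary; Retry-After suggests
        # (in seconds) when it's worth trying again.
        self.send_response(503)
        self.send_header("Retry-After", "3600")
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>Down for maintenance. Back soon!</body></html>")

if __name__ == "__main__":
    HTTPServer(("", 8000), MaintenanceHandler).serve_forever()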

What should I do if Googlebot is crawling my site too much?
You can contact us -- we'll work with you to make sure we don't overwhelm your server's bandwidth. We're experimenting with a feature in our webmaster tools for you to provide input on your crawl rate, and have gotten great feedback so far, so we hope to offer it to everyone soon.

Is it better to use the meta robots tag or a robots.txt file?
Googlebot obeys either, but meta tags apply to single pages only. If you have a number of pages you want to exclude from crawling, you can structure your site in such a way that you can easily use a robots.txt file to block those pages (for instance, put the pages into a single directory).
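
For a single page, the meta robots tag goes in the page's <head>; for example, to keep a page out of the index:

<meta name="robots" content="noindex">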

If my robots.txt file contains a directive for all bots as well as a specific directive for Googlebot, how does Googlebot interpret the line addressed to all bots?
If your robots.txt file contains both a generic directive (User-agent: *) and a directive specifically for Googlebot, Googlebot obeys only the lines specifically directed at it and ignores the generic directive.

For instance, for this robots.txt file:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/

Googlebot will crawl everything in the site other than pages in the cgi-bin directory.

For this robots.txt file:
User-agent: *
Disallow: /

Googlebot won't crawl any pages of the site.

If you're not sure how Googlebot will interpret your robots.txt file, you can use our robots.txt analysis tool to test it. You can also test how Googlebot will interpret changes to the file.

For complete information on how Googlebot and Google's other user agents treat robots.txt files, see our webmaster help center.

Back from SES San Jose

Thanks to everyone who stopped by to say hi at the Search Engine Strategies conference in San Jose last week!

I had a great time meeting people and talking about our new webmaster tools. I got to hear a lot of feedback about what webmasters liked, didn't like, and wanted to see in our Webmaster Central site. For those of you who couldn't make it or didn't find me at the conference, please feel free to post your comments and suggestions in our discussion group. I do want to hear about what you don't understand or what you want changed so I can make our webmaster tools as useful as possible.

Some of the highlights from the week:

This year, Danny Sullivan invited some of us from the team to "chat and chew" during a lunch-hour panel discussion. Anyone interested in hearing about Google's webmaster tools was welcome to come, and many did -- thanks for joining us! I loved showing off our product, answering questions, and getting feedback about what to work on next. Many people had already tried Sitemaps, but hadn't seen the new features like preferred domain and full crawl errors.

One of the questions I heard more than once at the lunch was about how big a Sitemap can be, and how to use Sitemaps with very large websites. Google can handle all of your URLs, and the goal of Sitemaps is to tell us about all of them. A single Sitemap file can contain up to 50,000 URLs and should be no larger than 10MB when uncompressed. If you have more URLs than that, simply break them up into several smaller Sitemaps and tell us about them all. You can also create a Sitemap index file, which is just a list of all your Sitemaps, to make managing several Sitemaps a little easier.
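
A Sitemap index file is just a short XML file listing your individual Sitemaps. A minimal example looks something like this (the URLs are placeholders, and you should check the Sitemap protocol documentation for the schema that matches the version you're using):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2006-08-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2006-08-15</lastmod>
  </sitemap>
</sitemapindex>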

While hanging out at the Google booth I got another interesting question: One site owner told me that his site is listed in Google, but its description in the search results wasn't exactly what he wanted. (We were using the description of his site listed in the Open Directory Project.) He asked how to remove this description from Google's search results. Vanessa Fox knew the answer! To specifically prevent Google from using the Open Directory for a page's title and description, use the following meta tag:
<meta name="GOOGLEBOT" content="NOODP">

My favorite panel of the week was definitely Pimp My Site. The whole group was dressed to match the theme as they gave some great advice to webmasters. Dax Herrera, the coolest "pimp" up there (and a fantastic piano player), mentioned that a lot of sites don't explain their product clearly on each page. For instance, when pimping Flutter Fetti, there were many instances when all the site had to do was add the word "confetti" to the product description to make it clear to search engines and to users reaching the page exactly what a Flutter Fetti stick is.

Another site pimped was a Yahoo! Stores web site. Someone from the audience asked if the webmaster could set up a Google Sitemap for their store. As Rob Snell pointed out, it's very simple: Yahoo! Stores will create a Google Sitemap for your website automatically, and even verify your ownership of the site in our webmaster tools.

Finally, if you didn't attend the Google dance, you missed out! There were Googlers dancing, eating, and having a great time with all the conference attendees. Vanessa Fox represented my team at the Meet the Google Engineers hour that we held during the dance, and I heard Matt Cutts even starred in a music video!

While demoing Webmaster Central over in the labs area, someone asked me about the ability to share site information across multiple accounts. We associate your site verification with your Google Account, and we allow multiple accounts to verify ownership of a site independently. Each account has its own verification file or meta tag, and you can remove one at any time and re-verify your site to revoke that user's verification. This means that your marketing person, your techie, and your SEO consultant can each verify the same site with their own Google Account. And if you start managing a site that someone else used to manage, all you have to do is add that site to your account and verify ownership; you don't need to transfer the account information from the person who previously managed it.

Thanks to everyone who visited and gave us feedback. It was great to meet you!