Badware alerts for your sites

As part of our efforts to protect users, we have been warning people using Google search before they visit sites that have been determined to distribute badware under the guidelines published by StopBadware. Warning users is only part of the solution, though; the real win comes from helping webmasters protect their own users by alerting them when their sites have been flagged for badware -- and working with them to remove the threats.

It's my pleasure to introduce badware alerts in Google webmaster tools. You can see on the Diagnostic Summary tab if your site has been determined to distribute badware and can access information to help you correct this.

If your site has been flagged and you believe you've since removed the threats, go to http://stopbadware.org/home/review to request a review. If that's successful, your site will no longer be flagged -- and your users will be safer as a result of your diligence.

This version is only the beginning: we plan to continue to provide more data to help webmasters diagnose issues on their sites. We realize that in many cases, badware distribution is unintentional and the result of being hacked or running ads which lead directly to pages with browser exploits. Stay tuned for improvements to this feature and others on webmaster tools.

Update: this post has been updated to provide a link to the new form for requesting a review.


Update: More information is available in our Help Center article on malware and hacked sites.

The number of pages Googlebot crawls

The Googlebot activity reports in webmaster tools show you the number of pages of your site Googlebot has crawled over the last 90 days. We've seen some of you asking why this number might be higher than the total number of pages on your sites.


Googlebot crawls pages of your site based on a number of things including:
  • pages it already knows about
  • links from other web pages (within your site and on other sites)
  • pages listed in your Sitemap file
More specifically, Googlebot doesn't access pages; it accesses URLs. And the same page can often be accessed via several URLs. Consider the home page of a site that can be accessed from the following four URLs:
  • http://www.example.com/
  • http://www.example.com/index.html
  • http://example.com
  • http://example.com/index.html
Although all URLs lead to the same page, all four URLs may be used in links to the page. When Googlebot follows these links, a count of four is added to the activity report.

Many other scenarios can lead to multiple URLs for the same page. For instance, a page may have several named anchors, such as:
  • http://www.example.com/mypage.html#heading1
  • http://www.example.com/mypage.html#heading2
  • http://www.example.com/mypage.html#heading3
And dynamically generated pages often can be reached by multiple URLs, such as:
  • http://www.example.com/furniture?type=chair&brand=123
  • http://www.example.com/hotbuys?type=chair&brand=123
As you can see, when you consider that each page on your site might have multiple URLs that lead to it, the number of URLs that Googlebot crawls can be considerably higher than the number of total pages for your site.

Of course, you (and we) want only one version of the URL to be returned in the search results. Not to worry -- this is exactly what happens. Our algorithms select a version to include, and you can provide input on this selection process.

Redirect to the preferred version of the URL
You can do this using a 301 (permanent) redirect. In the first example, which shows four URLs that point to a site's home page, you may want to redirect index.html to www.example.com/. And you may want to redirect example.com to www.example.com so that any URLs that begin with the non-www version are redirected to the www version. Note that you can handle this latter case with the Preferred Domain feature in webmaster tools. (If you also use a 301 redirect, make sure that the redirect matches what you set for the preferred domain.)
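If your site runs on Apache with mod_rewrite enabled, for example, a couple of rewrite rules along these lines can handle both redirects. This is only a sketch -- the rules and domain names are placeholders, and the details depend on your own server setup:

# .htaccess -- assumes Apache with mod_rewrite enabled
RewriteEngine On

# Send any request for the non-www host to the www host with a 301 (permanent) redirect
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# Send requests for index.html to the root URL
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]

If your site runs on a different server, you can usually configure equivalent redirects there; the important part is that they return a 301 status code.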

Block the non-preferred versions of a URL with a robots.txt file
For dynamically generated pages, you may want to block the non-preferred version using pattern matching in your robots.txt file. (Note that not all search engines support pattern matching, so check the guidelines for each search engine bot you're interested in.) For instance, in the third example, which shows two URLs that point to a page about the chairs available from brand 123, the "hotbuys" section rotates periodically, while the content is always available from a primary and permanent location. In that case, you may want the first version indexed and the "hotbuys" version blocked. To do this, add the following to your robots.txt file:

User-agent: Googlebot
Disallow: /hotbuys?*

To ensure that this directive will actually block and allow what you intend, use the robots.txt analysis tool in webmaster tools. Just add this directive to the robots.txt section on that page, list the URLs you want to check in the "Test URLs" section and click the Check button. For this example, you'd see a result like this:

Don't worry about links to anchors, because while Googlebot will crawl each link, our algorithms will index the URL without the anchor.

And even if you don't provide input like that described above, our algorithms do a really good job of picking a version to show in the search results.

Googlebot activity reports

The webmaster tools team has a very exciting mission: we dig into our logs, find as much useful information as possible, and pass it on to you, the webmasters. Our reward is that you more easily understand what Google sees, and why some pages don't make it to the index.

The latest batch of information that we've put together for you is the amount of traffic between Google and a given site. We show you the number of requests, number of kilobytes (yes, yes, I know that tech-savvy webmasters can usually dig this out, but our new charts make it really easy to see at a glance), and the average document download time. You can see this information in chart form, as well as in hard numbers (the maximum, minimum, and average).

For instance, here's the number of pages Googlebot has crawled in the Webmaster Central blog over the last 90 days. The maximum number of pages Googlebot has crawled in one day is 24 and the minimum is 2. That makes sense, because the blog was launched less than 90 days ago, and the chart shows that the number of pages crawled per day has increased over time. The number of pages crawled is sometimes more than the total number of pages in the site -- especially if the same page can be accessed via several URLs. So http://googlewebmastercentral.blogspot.com/2006/10/learn-more-about-googlebots-crawl-of.html and http://googlewebmastercentral.blogspot.com/2006/10/learn-more-about-googlebots-crawl-of.html#links are different, but point to the same page (the second points to an anchor within the page).


And here's the average number of kilobytes downloaded from this blog each day. As you can see, as the site has grown over the last two and a half months, the average number of kilobytes downloaded per day has increased as well.


The first two reports can help you diagnose the impact that changes in your site may have on its coverage. If you overhaul your site and dramatically reduce the number of pages, you'll likely notice a drop in the number of pages that Googlebot accesses.

The average document download time can help pinpoint subtle networking problems. If the average time spikes, you might have network slowdowns or bottlenecks that you should investigate. Here's the report for this blog that shows that we did have a short spike in early September (the maximum time was 1057 ms), but it quickly went back to a normal level, so things now look OK.

In general, the load time of a page doesn't affect its ranking, but we wanted to give this info because it can help you spot problems. We hope you will find this data as useful as we do!

Learn more about Googlebot's crawl of your site and more!

We've added a few new features to webmaster tools and invite you to check them out.

Googlebot activity reports
Check out these cool charts! We show you the number of pages Googlebot's crawled from your site per day, the number of kilobytes of data Googlebot's downloaded per day, and the average time it took Googlebot to download pages. Webmaster tools show each of these for the last 90 days. Stay tuned for more information about this data and how you can use it to pinpoint issues with your site.

Crawl rate control
Googlebot uses sophisticated algorithms that determine how much to crawl each site. Our goal is to crawl as many pages from your site as we can on each visit without overwhelming your server's bandwidth.

We've been conducting a limited test of a new feature that enables you to provide us with information about how we crawl your site. Today, we're making this tool available to everyone. You can access it from the Diagnostic tab. If you'd like Googlebot to slow down its crawl of your site, simply choose the Slower option.

If we think your server can handle the additional bandwidth and we're able to crawl your site more, we'll let you know and offer the option of a faster crawl.

If you request a changed crawl rate, the change will last for 90 days. If you'd like to keep the changed rate after that, simply return to webmaster tools and make the change again.


Enhanced image search
You can now opt in to enhanced image search for the images on your site, which enables tools such as Google Image Labeler to associate your images with labels, improving the indexing and search quality of those images. After you've opted in, you can opt out at any time.

Number of URLs submitted
Recently at SES San Jose, a webmaster asked me if we could show the number of URLs we find in a Sitemap. He said that he generates his Sitemaps automatically and he'd like confirmation that the number he thinks he generated is the same number we received. We thought this was a great idea. Simply access the Sitemaps tab to see the number of URLs we found in each Sitemap you've submitted.
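For reference, the number we report corresponds to the <url> entries in your Sitemap file. As a quick illustration (using the standard sitemaps.org schema and placeholder URLs), a Sitemap like this would show a count of two:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>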

As always, we hope you find these updates useful and look forward to hearing what you think.

Useful information you may have missed

Fresher query stats

Query stats in webmaster tools provide information about the search queries that most often return your site in the results. You can view this information by a variety of search types (such as web search, mobile search, or image search) and countries. We show you the top search types and locations for your site. You can access these stats by selecting a verified site in your account and then choosing Query stats from the Statistics tab.


If you've checked your site's query stats lately, you may have noticed that they're changing more often than they used to. This is because we recently changed how frequently we calculate them. Previously, we showed data averaged over a period of three weeks; now, we show data averaged over a period of one week. This gives you fresher stats that more accurately reflect the current queries that return your site in the results. We generally update this data each Monday, so if you'd like to keep a history of the top queries for your site week by week, you can simply download the data each week.

How we calculate query stats
Some of you have asked how we calculate query stats.

These stats are based on the results that searchers actually see. For instance, say a search for [Britney Spears] brings up your site at position 21, which is on the third page of results. And say 1000 people searched for [Britney Spears] during the course of a week (in reality, a few more people than that search for her name, but just go with me for this example). 600 of those people only looked at the first page of results, and the other 400 browsed to at least the third page. That means that your site was seen by 400 searchers. Even though your site was at position 21 for all 1000 searchers, only those 400 are counted for the purposes of this calculation.

Both top search queries and top search query clicks are based on the total number of searches for each query. The stats we show are based on the queries that most often return your site in the results. For instance, going back to that familiar [Britney Spears] query -- 400 searchers saw your site in the results. Now, maybe your site isn't really about Britney Spears -- it's more about Buffy the Vampire Slayer. And say Google received 50 queries for [Buffy the Vampire Slayer] in the same week, and your site was returned in the results at position 2. So, all 50 searchers saw your site in the results. In this example, Britney Spears would show as a top search query above Buffy the Vampire Slayer (because your site was seen by 400 searchers for Britney but 50 searchers for Buffy).

The same is true of top search query clicks. If 100 of the Britney-seekers clicked on your site in the search results and all 50 of the Buffy-searchers clicked on your site in the search results, Britney would still show as a top search query above Buffy.

At times, this may cause some of the query stats we show you to seem unusual. If your site is returned for a very high-traffic query, then even if a low percentage of searchers click on your site for that query, the total number of searchers who click on your site may still be higher for the query than for queries for which a much higher percentage of searchers click on your site in the results.

The average top position for top search queries is the position of the page on your site that ranks most highly for the query. The average top position for top search query clicks is the position of the page on your site that searchers clicked on (even if a different page ranked more highly for the query). We show you the average position for this top page across all data centers over the course of the week.

A variety of download options are available. You can:
  • download individual tables of data by clicking the Download this table link.
  • download stats for all subfolders on your site (for all search types and locations) by clicking the Download all query stats for this site (including subfolders) link.
  • download all stats (including query stats) for all verified sites in your account by choosing Tools from the My Sites page, then choosing Download data for all sites and then Download statistics for all sites.

Debugging blocked URLs

Vanessa's been posting a lot lately, and I'm starting to feel left out. So here's my tidbit of wisdom for you: I've noticed a couple of webmasters confused by "blocked by robots.txt" errors, and I wanted to share the steps I take when debugging robots.txt problems:

A handy checklist for debugging a blocked URL

Let's assume you are looking at crawl errors for your website and notice a URL restricted by robots.txt that you weren't intending to block:
http://www.example.com/amanda.html URL restricted by robots.txt Sep 3, 2006

Check the robots.txt analysis tool
The first thing you should do is go to the robots.txt analysis tool for that site. Make sure you are looking at the correct site for that URL, and that you're looking at the right protocol and subdomain. (Subdomains and protocols may each have their own robots.txt file, so https://www.example.com/robots.txt may be different from http://example.com/robots.txt, which may in turn be different from http://amanda.example.com/robots.txt.) Paste the blocked URL into the "Test URLs against this robots.txt file" box. If the tool reports that it is blocked, you've found your problem. If the tool reports that it's allowed, we need to investigate further.
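As a purely hypothetical example of how an unintended block can happen, remember that Disallow rules match by prefix, so a rule meant for one set of URLs can catch others:

User-agent: *
# Meant to block only an /amazon-affiliate/ section,
# but this prefix also matches /amanda.html
Disallow: /ama

In a case like this, the analysis tool would report http://www.example.com/amanda.html as blocked, which points you straight at the rule to fix.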

At the top of the robots.txt analysis tool, take a look at the HTTP status code. If we are reporting anything other than a 200 (Success) or a 404 (Not found) then we may not be able to reach your robots.txt file, which stops our crawling process. (Note that you can see the last time we downloaded your robots.txt file at the top of this tool. If you make changes to your file, check this date and time to see if your changes were made after our last download.)

Check for changes in your robots.txt file
If these look fine, check whether your robots.txt file has changed since the error occurred by looking at the date it was last modified. If it was modified after the date given for the error in the crawl errors, it might be that someone changed the file so that the new version no longer blocks this URL.

Check for redirects of the URL
If you can be certain that this URL isn't blocked, check to see if the URL redirects to another page. When Googlebot fetches a URL, it checks the robots.txt file to make sure it is allowed to access the URL. If the robots.txt file allows access to the URL, but the URL returns a redirect, Googlebot checks the robots.txt file again to see if the destination URL is accessible. If at any point Googlebot is redirected to a blocked URL, it reports that it could not get the content of the original URL because it was blocked by robots.txt.
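Here's a hypothetical sketch of that situation (the URLs and the robots.txt rule are made up for illustration). Suppose the URL redirects like this:

http://www.example.com/amanda.html --> 301 redirect --> http://private.example.com/amanda.html

and http://private.example.com/robots.txt contains:

User-agent: *
Disallow: /

Googlebot is allowed to fetch the original URL, but because the redirect destination is disallowed, the original URL is what shows up in your crawl errors as blocked by robots.txt.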

Sometimes this behavior is easy to spot because a particular URL always redirects to another one. But sometimes this can be tricky to figure out. For instance:
  • Your site may not have a robots.txt file at all (and therefore, allows access to all pages), but a URL on the site may redirect to a different site, which does have a robots.txt file. In this case, you may see URLs blocked by robots.txt for your site (even though you don't have a robots.txt file).
  • Your site may prompt for registration after a certain number of page views. You may have the registration page blocked by a robots.txt file. In this case, the URL itself may not redirect, but if Googlebot triggers the registration prompt when accessing the URL, it will be redirected to the blocked registration page, and the original URL will be listed in the crawl errors page as blocked by robots.txt.

Ask for help
Finally, if you still can't pinpoint the problem, you might want to post on our forum for help. Be sure to include the blocked URL in your message. Sometimes it's easier for other people to notice oversights you may have missed.

Good luck debugging! And by the way -- unrelated to robots.txt -- make sure that you don't have "noindex" meta tags in the head of your web pages; those also keep Google from showing those pages in our index.
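For reference, the tag in question is the standard robots meta tag, which goes in the <head> of a page:

<meta name="robots" content="noindex">

Remove it (or change it) on any pages that you do want indexed.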

Setting the preferred domain

Based on your input, we've recently made a few changes to the preferred domain feature of webmaster tools. And since you've had some questions about this feature, we'd like to answer them.

The preferred domain feature enables you to tell us if you'd like URLs from your site crawled and indexed using the www version of the domain (http://www.example.com) or the non-www version of the domain (http://example.com). When we initially launched this, we added the non-preferred version to your account when you specified a preference so that you could see any information associated with the non-preferred version. But many of you found that confusing, so we've made the following changes:
  • When you set the preferred domain, we no longer will add the non-preferred version to your account.
  • If you had previously added the non-preferred version to your account, you'll still see it listed there, but you won't be able to add a Sitemap for the non-preferred version.
  • If you have already set the preferred domain and we had added the non-preferred version to your account, we'll be removing that non-preferred version from your account over the next few days.
Note that if you would like to see any information we have about the non-preferred version, you can always add it to your account.

Here are some questions we've had about this preferred domain feature, and our replies.

Once I've set my preferred domain, how long will it take before I see changes?
The time frame depends on many factors (such as how often your site is crawled and how many pages are indexed with the non-preferred version). You should start to see changes within a few weeks of setting your preferred domain.

Is the preferred domain feature a filter or a redirect? Does it simply cause the search results to display the URLs in the version I prefer?
The preferred domain feature is not a filter. When you set a preference, we:
  • Consider all links that point to the site (whether those links use the www version or the non-www version) to be pointing at the version you prefer. This helps us more accurately determine PageRank for your pages.
  • Once we know that both versions of a URL point to the same page, we try to select the preferred version for future crawls.
  • Index pages of your site using the version you prefer. If some pages of your site are indexed using the www version and other pages are indexed using the non-www version, then over time, you should see a shift to the preference you've set.

If I use a 301 redirect on my site to point the www and non-www versions to the same version, do I still need to use this feature?
You don't have to use it, as we can follow the redirects. However, you still can benefit from using this feature in two ways: we can more easily consolidate links to your site and over time, we'll direct our crawl to the preferred version of your pages.

If I use this feature, should I still use a 301 redirect on my site?
You don't need to use it for Googlebot, but you should still use the 301 redirect, if it's available. This will help visitors and other search engines. Of course, make sure that you point to the same URL with the preferred domain feature and the 301 redirect.

You can find more about this in our webmaster help center.

System maintenance

We're currently doing routine system maintenance, and some data may not be available in your webmaster tools account today. We're working as quickly as possible, and all information should be available again by Thursday, 8/24. Thank you for your patience in the meantime.

Update: We're still finishing some things up, so thanks for bearing with us. Note that the preferred domain feature is currently unavailable, but it will be available again as soon as our maintenance is complete.

Back from SES San Jose

Thanks to everyone who stopped by to say hi at the Search Engine Strategies conference in San Jose last week!

I had a great time meeting people and talking about our new webmaster tools. I got to hear a lot of feedback about what webmasters liked, didn't like, and wanted to see in our Webmaster Central site. For those of you who couldn't make it or didn't find me at the conference, please feel free to post your comments and suggestions in our discussion group. I do want to hear about what you don't understand or what you want changed so I can make our webmaster tools as useful as possible.

Some of the highlights from the week:

This year, Danny Sullivan invited some of us from the team to "chat and chew" during a lunch hour panel discussion. Anyone interested in hearing about Google's webmaster tools was welcome to come and many did -- thanks for joining us! I loved showing off our product, answering questions, and getting feedback about what to work on next. Many people had already tried Sitemaps, but hadn't seen the new features like Preferred domain and full crawling errors.

One of the questions I heard more than once at the lunch was about how big a Sitemap can be, and how to use Sitemaps with very large websites. Since Google can handle all of your URLs, the goal of Sitemaps is to tell us about all of them. A Sitemap file can contain up to 50,000 URLs and should be no larger than 10MB when uncompressed. But if you have more URLs than this, simply break them up into several smaller Sitemaps and tell us about them all. You can create a Sitemap Index file, which is just a list of all your Sitemaps, to make managing several Sitemaps a little easier.
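A Sitemap index file is itself just a small XML file that lists the locations of your individual Sitemaps. A minimal sketch (using the standard sitemaps.org schema, with placeholder file names) looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml.gz</loc>
  </sitemap>
</sitemapindex>

You submit the index file, and each Sitemap it lists can itself contain up to 50,000 URLs.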

While hanging out at the Google booth I got another interesting question: One site owner told me that his site is listed in Google, but its description in the search results wasn't exactly what he wanted. (We were using the description of his site listed in the Open Directory Project.) He asked how to remove this description from Google's search results. Vanessa Fox knew the answer! To specifically prevent Google from using the Open Directory for a page's title and description, use the following meta tag:
<meta name="GOOGLEBOT" content="NOODP">

My favorite panel of the week was definitely Pimp My Site. The whole group was dressed to match the theme as they gave some great advice to webmasters. Dax Herrera, the coolest "pimp" up there (and a fantastic piano player), mentioned that a lot of sites don't explain their product clearly on each page. For instance, when the panel pimped Flutter Fetti, there were many instances where all the site had to do was add the word "confetti" to the product description to make it clear to search engines and to users reaching the page exactly what a Flutter Fetti stick is.

Another site pimped was a Yahoo! Stores web site. Someone from the audience asked if the webmaster could set up a Google Sitemap for their store. As Rob Snell pointed out, it's very simple: Yahoo! Stores will create a Google Sitemap for your website automatically, and even verify your ownership of the site in our webmaster tools.

Finally, if you didn't attend the Google dance, you missed out! There were Googlers dancing, eating, and having a great time with all the conference attendees. Vanessa Fox represented my team at the Meet the Google Engineers hour that we held during the dance, and I heard Matt Cutts even starred in a music video!

While demoing Webmaster Central over in the labs area, someone asked me about the ability to share site information across multiple accounts. We associate your site verification with your Google Account, and we allow multiple accounts to verify ownership of a site independently. Each account has its own verification file or meta tag, and you can remove any of them at any time and re-verify your site to revoke that user's verification. This means that your marketing person, your techie, and your SEO consultant can each verify the same site with their own Google Account. And if you start managing a site that someone else used to manage, all you have to do is add that site to your account and verify ownership. You don't need to transfer the account information from the person who previously managed it.

Thanks to everyone who visited and gave us feedback. It was great to meet you!