Would you rate-control Googlebot? How much crawling is too much crawling?

lzhao

One of our sites is very large - over 500M pages. Google has indexed 1/8th of the site, and it tends to crawl between 800k and 1M pages per day.

A few times a year, Google will significantly increase its crawl rate - overnight hitting 2M pages per day or more. This creates big problems for us, because at 1M pages per day Google is consuming 70% of our API capacity, and the API overall is at 90% capacity. At 2M pages per day, 20% of our page requests return 500 errors.
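For what it's worth, the capacity figures above can be turned into a rough headroom estimate. This is only a sketch: it assumes Googlebot's API load scales linearly with pages crawled and that non-Google traffic stays constant, neither of which is stated in the post, and the function names are made up for illustration.

```python
# Figures quoted in the post.
GOOGLEBOT_PAGES_PER_DAY = 1_000_000  # typical Googlebot crawl rate
GOOGLEBOT_SHARE = 0.70               # share of API capacity Googlebot uses at that rate
TOTAL_UTILIZATION = 0.90             # overall API utilization at that rate


def other_traffic_share():
    """Share of API capacity used by everything that is not Googlebot."""
    return TOTAL_UTILIZATION - GOOGLEBOT_SHARE


def utilization_at(pages_per_day):
    """Projected API demand as a fraction of capacity.

    ASSUMPTION: Googlebot load is linear in pages crawled and
    other traffic does not change.
    """
    googlebot = GOOGLEBOT_SHARE * pages_per_day / GOOGLEBOT_PAGES_PER_DAY
    return googlebot + other_traffic_share()


def max_sustainable_crawl():
    """Googlebot pages/day at which projected demand reaches 100% of capacity."""
    available = 1.0 - other_traffic_share()
    return available / GOOGLEBOT_SHARE * GOOGLEBOT_PAGES_PER_DAY


print(f"Demand at 2M pages/day: {utilization_at(2_000_000):.0%} of capacity")
print(f"Crawl rate that saturates the API: {max_sustainable_crawl():,.0f} pages/day")
```

Under those assumptions, demand at 2M pages/day comes out at about 160% of capacity, and the API saturates at roughly 1.14M Googlebot pages/day. The linear model is crude, since a real system sheds load in its own way, but it shows how little headroom there is above the normal crawl rate.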

I've lobbied for an investment / overhaul of the API configuration to allow for more Google bandwidth without compromising user experience. My tech team counters that it's a wasted investment - as Google will crawl to our capacity whatever that capacity is.

Questions to Enterprise SEOs:

*Is there any validity to the tech team's claim? I thought Google's crawl rate was based on a combination of PageRank and the frequency of page updates. That would imply some upper limit - one we perhaps haven't reached - at which the crawl rate would stabilize.

*We've asked Google to rate-limit our crawl rate in the past. Is that harmful? I've always looked at a robust crawl rate as a good problem to have.

Is 1.5M Googlebot API calls a day desirable, or something any reasonable Enterprise SEO would seek to throttle back?

*What about setting a longer refresh rate in the sitemaps? Would that reduce the daily crawl demand? We could increase it to a month, but at 500M pages Google could still have a ball at the 2M pages/day rate.
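To put numbers on that last point, here is a minimal sketch of the crawl rate that a given refresh interval would imply across the whole site. It assumes, purely for illustration, that every page is fetched exactly once per interval; whether Google actually follows a sitemap refresh hint is a separate question that this does not address. The function name is hypothetical.

```python
TOTAL_PAGES = 500_000_000  # site size quoted in the post


def pages_per_day_for_refresh(total_pages, refresh_days):
    """Daily fetches needed to visit every page once per refresh interval.

    ASSUMPTION: each page is fetched exactly once per interval.
    """
    return total_pages / refresh_days


monthly = pages_per_day_for_refresh(TOTAL_PAGES, 30)
print(f"Monthly refresh of every page: {monthly:,.0f} pages/day")
print(f"Multiple of the 2M/day spike rate: {monthly / 2_000_000:.1f}x")
```

A monthly refresh of all 500M pages would work out to roughly 16.7M fetches a day, several times the 2M/day spike, so a one-month interval on its own would not mathematically cap demand below that spike level.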

Thanks

CraigBradford

I agree with Matt that there can probably be a reduction of pages, but that aside, how much of an issue this is comes down to which pages aren't being indexed. It's hard to advise without seeing the site. Are you able to share the domain? If the site has been around for a long time, that seems a low level of indexation. Is this a site where the age of the content matters? For example, Craigslist?

Craig

lzhao

Thanks for your response. I get where you're going with that. (Ecomm store gone bad.) It's not actually an Ecomm FWIW. And I do restrict parameters - the list is about a page and a half long. It's a legitimately large site.

You're correct - I don't want Google to crawl the full 500M. But I do want them to crawl 100M. At the current crawl rate we limit them to, it's going to take Google more than 3 months to get to each page a single time. I'd actually like to let them crawl 3M pages a day. Is that an insane amount of Googlebot bandwidth? Does anyone else have a similar situation?
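The coverage arithmetic behind that can be sketched as follows. It optimistically assumes every Googlebot fetch lands on a page in the 100M target set that has not been crawled yet (no recrawls, no wasted fetches), so real coverage would take longer. The function name is made up for illustration.

```python
TARGET_PAGES = 100_000_000  # pages the poster wants crawled


def days_to_cover(target_pages, pages_per_day):
    """Days for one full pass over the target set.

    ASSUMPTION: every fetch hits a distinct, not-yet-crawled target page.
    """
    return target_pages / pages_per_day


for rate in (800_000, 1_000_000, 3_000_000):
    days = days_to_cover(TARGET_PAGES, rate)
    print(f"{rate:>9,} pages/day: {days:.0f} days")
```

At the current 800k to 1M pages/day, one pass over 100M pages takes 100 to 125 days, which matches the "more than 3 months" figure. At 3M pages/day it drops to about 33 days.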

MattAntonino

Gosh, that's a HUGE site. Are you having Google crawl parameter pages with that? If so, that's a bigger issue.

I can't imagine the crawl issues with 500M pages. A site:amazon.com search only returns 200M. Ebay.com returns 800M, so your site is somewhere in between these two? (I understand both probably have a lot more pages - they just aren't all returned as indexed.)

You always WANT a full site crawl - but your techs do have a point. Unless there's an absolutely necessary reason to have 500M indexed pages, I'd also seek to cut that to what you want indexed. That sounds like a nightmare ecommerce store gone bad.
