Google crawling: how to optimize the crawl budget of your website
On Google’s official webmaster blog, Google’s Gary Illyes wrote about crawl budgets and how they affect your website. Prioritizing the pages that should be indexed can help you to get high rankings for your more important pages. Two factors influence the crawl budget of a website:
1. The crawl rate limit
Crawling is the main priority of Google’s web crawler, Googlebot. The crawl rate limit represents the number of simultaneous parallel connections Googlebot may use to crawl the site, as well as the time it has to wait between fetches.
The crawl rate is influenced by how quickly a website responds to requests: if the site responds quickly, the limit goes up, and if it slows down or returns server errors, the limit goes down. You can also limit the crawl rate in Google’s Search Console. Unfortunately, Google does not support the crawl-delay directive for robots.txt that is supported by many other bots.
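To illustrate what the crawl-delay directive looks like for bots that do honor it, here is a minimal Python sketch using the standard library's robots.txt parser. The robots.txt content and the bot name "ExampleBot" are made up for illustration; the sketch only shows how a crawler could read the directive and says nothing about how any real bot schedules its fetches.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "ExampleBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay (in seconds) requested for the made-up bot.
example_delay = parser.crawl_delay("ExampleBot")

# No group in this file names Googlebot, so the parser finds no delay for it.
googlebot_delay = parser.crawl_delay("Googlebot")

print(example_delay, googlebot_delay)
```

Because Google ignores this directive, the crawl rate setting in Search Console remains the way to slow Googlebot down.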
2. Crawl demand
The crawl demand represents Google’s interest in a website. URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in Google’s index. Google also attempts to prevent URLs from becoming stale in the index.
If a website moves to a new address, the crawl demand might increase in order to reindex the content under the new URLs.
The crawl rate limit and the crawl demand define the crawl budget as the number of URLs Googlebot can and wants to crawl.
How to optimize your crawl budget
Having many low-value-add URLs can negatively affect a site’s crawling and indexing. Here are some low-value-add URLs that should be excluded from crawling:
1. Pages with session IDs: If the same page can be accessed with multiple session IDs, add a rel=canonical link to these pages to show Google the preferred version of the page. The same applies to all other duplicate content pages on your site, for example print versions of web pages. Google will then usually ignore the duplicates in favor of the preferred URL.
2. Faceted navigation (filtering by color, size, etc.): Filtering pages by color, size and other criteria can also lead to a lot of duplicate content. Use the robots.txt file of your site to make sure that these duplicates aren’t crawled.
3. Soft 404 pages: Soft 404 pages are error pages that show a “this page was not found” message but return the wrong HTTP status code, “200 OK”. These error pages should return the HTTP status code “404 Not Found”.
4. Infinite spaces: For example, if your website has a calendar with a “next month” link, Google could follow these “next month” links forever. If your website contains automatically created pages that do not really contain new content, add the rel=nofollow attribute to the links that lead to them.
5. Low-quality and spam content: Check whether there are pages on your website that offer little value to visitors. If your website has a large number of pages, removing the weak ones can result in better rankings.
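Items 1 and 2 in the list above both deal with duplicate URLs, and the two fixes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the session parameter names, the example.com URLs and the /shop/filter/ path are hypothetical and need to be adapted to your own site. The first part derives the preferred URL for a rel=canonical link by dropping session parameters; the second part checks that a robots.txt rule really keeps crawlers away from filter pages.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Assumed session parameter names; use whatever your site actually appends.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def canonical_url(url: str) -> str:
    """Return the URL without session parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link> element that goes into the page's <head>."""
    href = escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


# Hypothetical robots.txt rule that blocks faceted navigation under one path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shop/filter/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

filter_allowed = robots.can_fetch("Googlebot", "https://example.com/shop/filter/color-red")
product_allowed = robots.can_fetch("Googlebot", "https://example.com/shop/shoes")

print(canonical_link_tag("https://example.com/article?sid=abc123&page=2"))
print(filter_allowed, product_allowed)
```

Note that Python's built-in parser only matches path prefixes, so the sketch places all filter pages under one directory instead of relying on wildcard patterns.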
If you do not block these page types, you will waste server resources on unimportant pages that do not have value. Excluding these pages will make sure that Google indexes the important pages of your site.
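Soft 404 pages (item 3 above) are hard to spot by eye because the browser shows an error message while the server reports success. A rough heuristic for finding them could look like the following Python sketch. The phrase list is an assumption and should be tuned to the wording of your own error template; the function works on a status code and a page body that you have already fetched, so no particular HTTP library is implied.

```python
# Assumed phrases that suggest an error page; tune these for your own templates.
NOT_FOUND_PHRASES = ("page was not found", "page not found", "no longer exists")


def is_soft_404(status_code: int, body: str) -> bool:
    """Return True when a 200 response looks like a 'not found' page."""
    if status_code != 200:
        return False  # only 200 responses count as soft 404s in this sketch
    text = body.lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)


# Hypothetical responses: (URL, status code, body).
samples = [
    ("https://example.com/old-product", 200, "<h1>This page was not found</h1>"),
    ("https://example.com/gone", 404, "<h1>This page was not found</h1>"),
    ("https://example.com/shoes", 200, "<h1>Red shoes</h1>"),
]

soft_404_urls = [url for url, status, body in samples if is_soft_404(status, body)]
print(soft_404_urls)
```

Any URL flagged this way should be changed on the server side so that it returns a real 404 status code.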
What does this mean for your web page rankings on Google?
It’s likely that you do not have to worry about crawl budgets. If Google indexes your pages on the same day they are published (or a day later) then you do not have to do anything.
Google crawls websites with a few thousand URLs efficiently. If you have a very big site with tens of thousands of URLs, it is more important to prioritize what to crawl and to consider how many resources the server hosting the site can allocate to crawling.
Crawling is not a ranking factor. There are many factors that are used by Google’s ranking algorithms. The crawling rate is not one of them.