Crawling, Indexing and Deindexing
Crawling
Crawling a site means following a path. A site crawler (also known as a spider) follows your links and crawls around every inch of your website.
- Crawlers can validate HTML code or hyperlinks. They can also extract data from websites, which is called web scraping.
- When Google’s bots come to your website to crawl around, they follow links to other pages that are also on your site.
- The bots then use this information to provide up-to-date data to searchers about your pages. Google also feeds it into its ranking algorithms.
- This is one of the reasons why sitemaps are so important. A sitemap lists the URLs on your site that you want crawled, so that Google’s bots can easily find your pages and take a deeper look at them.
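- Googlebot (Google’s search engine bot) has a “crawl budget.” It is made up of two parts: crawl rate limit (how fast Googlebot may fetch pages without overloading your server) and crawl demand (how much Google wants to crawl your URLs, based on factors such as popularity and staleness).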
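The link-following behaviour described above can be sketched in a few lines of Python. This is only a minimal illustration, not how Googlebot works: the `crawl` function, the `LinkExtractor` class and the in-memory `SITE` dictionary are all hypothetical names, and the "site" is three made-up pages on example.com rather than anything fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl that only follows links on the same host.

    `fetch` is a caller-supplied function mapping a URL to its HTML,
    or None when the page cannot be loaded.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory (assumption for the example).
SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="https://other.example/">Ext</a>',
    "https://example.com/blog": '<a href="/about">About</a>',
}

pages = crawl("https://example.com/", SITE.get)
print(pages)
```

The external link to other.example is skipped because its host differs from the starting host, which mirrors the note that the bots follow linked pages that are also on your site.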
Indexing
Indexing refers to the process of adding web pages to the index of all pages that are searchable on Google.
If a web page is indexed, it can show up in Google’s search results. Once you deindex a page, it will no longer appear there.
For example, by default, every WordPress post and page is indexed.
It’s good to have relevant pages indexed because the exposure on Google can help you earn more clicks and bring in more traffic, which translates into more money and brand exposure.
But if you let nonessential parts of your blog or website be indexed, you could be doing more harm than good.
Deindexing
- There are many different occasions where you may need (or want) to exclude a web page (or at least a portion of it) from search engine indexing and crawling.
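One widely used way to ask search engines not to index a page is a robots meta tag whose content includes `noindex`. As a rough sketch, the Python below checks whether a page carries that directive. The helper names (`RobotsMetaParser`, `is_noindex`) and the two sample pages are assumptions made for illustration, and the check ignores other mechanisms such as HTTP response headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_noindex(html):
    """Return True if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical sample pages (assumptions for the example).
THANK_YOU = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Thanks!</body></html>"
)
BLOG_POST = "<html><head><title>Post</title></head><body>Hello</body></html>"

print(is_noindex(THANK_YOU))
print(is_noindex(BLOG_POST))
```

A page with no robots meta tag, like the sample blog post, is treated as indexable here, which matches the note that pages are indexed by default.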
Pages you may want to deindex
- DUPLICATE PAGES: pages containing duplicate content. Duplicate content means there is more than one version of one of your web pages. For example, one might be a printer-friendly version while the other is not.
- THANK YOU PAGE: the page that visitors land on after taking a desired action, such as downloading your software.
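To find candidate duplicate pages on your own site, one simple approach is to compare a fingerprint of each page's text. The sketch below is an assumption-laden illustration, not how search engines detect duplicates: it only catches pages whose text is identical after lowercasing and collapsing whitespace, and the function names and the `PAGES` sample data are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Group URLs whose body text has the same fingerprint."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    # Keep only groups that contain more than one URL.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical page texts keyed by path (assumptions for the example).
PAGES = {
    "/guide": "How to install the app.\nStep 1: download.",
    "/guide/print": "How to install   the app. Step 1: download.",
    "/thank-you": "Thanks for downloading!",
}

duplicates = find_duplicates(PAGES)
print(duplicates)
```

Here the regular page and its printer-friendly version land in the same group, so one of them would be a candidate for deindexing.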
Here is a set of guidelines for deindexing pages effectively.