Web crawling is the process of fetching a document, extracting the hyperlinks it contains, and recursively fetching the pages those links point to.
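The fetch-and-follow loop can be sketched in a few lines of Python using only the standard library. This is a minimal illustration rather than a production crawler: the names `LinkExtractor`, `fetch_html` and `crawl` are made up for this example, the crawl is assumed to stay on the starting host, and it ignores robots.txt, politeness delays and non-HTML content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher: download a page over the network and decode it."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl limited to the start URL's host.

    Returns the list of URLs fetched, in visiting order.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The `fetch` parameter is injectable so the loop can be tried against an in-memory dictionary of pages before pointing it at a real site, and `max_pages` keeps an experiment from running away.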
Web crawlers are most commonly used for search engine indexing, but not every crawler is benign. Malicious crawlers may scrape content, probe for weaknesses, or try to harvest sensitive information such as exposed passwords or credit card numbers. Bot management systems can detect and filter out this kind of traffic.
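Bot management systems typically combine many signals, and their internals differ from product to product. As a rough sketch of just one possible signal, the snippet below flags a client that sends more than a set number of requests within a sliding time window. The class name `RateFlagger` and the default thresholds are assumptions chosen for illustration, not values taken from any real product.

```python
from collections import defaultdict, deque


class RateFlagger:
    """Flags a client that makes too many requests within a sliding time window."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id -> timestamps (in seconds) of that client's recent requests
        self._hits = defaultdict(deque)

    def record(self, client_id, timestamp):
        """Record one request; return True if the client is over the limit."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

A server could call `record(client_ip, time.time())` on each request and throttle or challenge clients for which it returns True. A request rate alone is a weak signal, so a real system would weigh it alongside others.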
Large crawlers that need high throughput are often written in languages such as C++ or Java. For quick jobs, such as looking through e-commerce catalogues or product reviews, a crawler can also be scripted in a high-level language such as Python.
Types of Web Crawlers
Web crawlers can be grouped into three main types:
- In-house web crawlers
- Commercial web crawlers
- Open-source web crawlers
In-house web crawlers are developed by a company for its own use, for example to generate sitemaps or to check its entire website for broken links. Search engine operators also build their crawlers in-house, in their case to index the wider web.
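As an illustration of the sitemap use case, the sketch below turns a list of crawled URLs into a sitemap document in the sitemaps.org XML format. The function name `build_sitemap` is made up for this example, and it assumes the crawl has already produced the list of URLs; optional fields such as `lastmod` are left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string listing each URL once, in sorted order."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(urls)):
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & in the URL text.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Calling `build_sitemap(["https://example.com/", "https://example.com/about"])` returns a single `urlset` element with one `url` entry per page, which can then be written to a `sitemap.xml` file.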
Commercial web crawlers are software products sold or licensed by the companies that develop them. Some large companies prefer to build custom spiders of their own rather than buy one.
Open-source crawlers are released under a free or open-source license, so anybody can use and modify them to suit their needs. Although they often lack some of the advanced features of their commercial counterparts, they make it possible to read the source code and understand how a crawler works.
List of Common Web Crawlers
In-House Web Crawlers
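- Applebot – Apple’s crawler, which crawls the web for products such as Siri and Spotlight Suggestions
- Googlebot – Google’s crawler, which crawls the web to index content for Google Search
- Baiduspider – the crawler of the Chinese search engine Baidu, which crawls the web to build Baidu’s search index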
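These crawlers announce themselves in the User-Agent header of each request, so a site can spot them in its logs. The sketch below matches a User-Agent string against the three names listed above. The function name `identify_crawler` and the token table are made up for this example, and a match only shows what the client claims to be, since the header is easy to spoof.

```python
# Lowercase tokens to look for, mapped to the crawler name to report.
KNOWN_CRAWLER_TOKENS = {
    "applebot": "Applebot",
    "googlebot": "Googlebot",
    "baiduspider": "Baiduspider",
}


def identify_crawler(user_agent):
    """Return the crawler name claimed by a User-Agent string, or None."""
    lowered = user_agent.lower()
    for token, name in KNOWN_CRAWLER_TOKENS.items():
        if token in lowered:
            return name
    return None
```

For example, `identify_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")` returns `"Googlebot"`, while an ordinary browser User-Agent returns `None`.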
Commercial Web Crawlers
- Swiftbot – Swiftype’s web crawler, which indexes customer websites for its site search service
- SortSite – a web crawler for testing, monitoring and auditing websites
Open-Source Web Crawlers
- Apache Nutch – a highly extensible and scalable open-source web crawler that can also be used to create a search engine
- OpenSearchServer – a Java-based search engine with a built-in web crawler that can be used to index web content