Web scraping (also called web harvesting or screen scraping) is the process of automatically extracting data from websites and online services. The extracted data can be stored in a structured format for further use. A web scraper relies on web crawling programs that mimic browsers to access and communicate with different websites, follow their hypertext structure, and extract data according to predefined parameters. Scraped data is usually stored in local databases.
How web scrapers work
Most websites today are dynamically driven and serve large volumes of generated content, so extracting data by hand is impractical and people use bots instead. Bots access a website and follow its hypertext structure, which is composed of linked HTML pages. When they encounter a new page, they read its content and extract specific information as defined by the user. To do this, bots need programming tools that let them emulate browser behavior and adhere to standard protocols such as HTTP and HTTPS.
To control scrapers, website owners sometimes add code or configuration to their sites that denies access to unauthorized bots or to bots that attempt to scrape their data. Site owners can also block or restrict the IP addresses that bots connect from. Scrapers often respond by spreading their requests across many IP addresses so that traffic doesn’t come from a single source, which makes it very hard for a site to distinguish between human visitors and scrapers.
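The read-and-extract step described above can be sketched with Python's standard library. This is only a minimal illustration: the page content, the URLs, and the names `LinkAndTitleParser` and `extract` are assumptions made for the example, and a real bot would download the HTML over HTTP(S) rather than read it from a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the page title and every hyperlink target it encounters."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(base_url, html):
    """Parse one page and return its title plus the links a bot could follow."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Hypothetical page content standing in for a downloaded document.
SAMPLE_HTML = """
<html><head><title>Example shop</title></head>
<body><a href="/products?page=2">Next</a>
<a href="https://example.org/about">About</a></body></html>
"""

title, links = extract("https://example.com/products", SAMPLE_HTML)
print(title)
print(links)
```

The list of links is what lets a bot follow the hypertext structure: each collected URL can be queued and processed the same way.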
Why people scrape websites
There are many reasons why somebody would want to scrape information from another website. Many companies want to know what their competitors are up to and use web scraping as a means to gather this type of information. Others want to automate the process of gathering information from websites, whether through a public API or, where none exists, through screen scraping. Some just want to build bots that work 24/7 without any manual action from them.
Additionally, some people scrape data from a website to later sell the scraped information to other companies for marketing purposes.
Web scraping is not all bad
While there are many cases where web scrapers can be used for evil, they also offer great opportunities. For example, they can be used to create data-driven tools, like price comparison websites. In such cases, scraping is an integral part of the development process and it enables companies to deliver great services at scale.
Problems that might arise from web scraping
Site owners may run into several issues when they discover they have been scraped:
- Data confidentiality – Scraped data is often collected in bulk and kept in a raw form that is not easy to read. This makes it difficult for site owners to see what kind of information has been extracted from their website. For example, it may not be obvious from the scraped output alone whether it contains passwords or credit card numbers. It’s also possible that sensitive private information is accidentally made public.
- Data integrity and data quality – Sites, including those that provide public APIs, might get overloaded by web scraping requests since an automated bot can send thousands of requests in a very short time, while a human would be limited to just a few. If too many people are scraping the same website concurrently, there’s also the chance that some of them are scraping outdated information. Sites don’t always have the resources to fight bots 24/7.
- Legal issues – Web scrapers can be considered automated hacking tools in some cases. Especially when scraping is done without the site owner’s consent, the bots are likely accessing resources or performing actions they are not permitted to (e.g., posting comments on someone else’s blog).
Types of web scrapers
There are several types of web scrapers based on how they work:
- Classical or traditional web scraping is when a scraper works through a browser, loading pages and following the hypertext structure of the site it is pulling data from. If a website has an API, some services can retrieve its content without using a browser at all.
- Parallelized web scraping is a way to speed up processing by distributing tasks among different machines and network nodes. A scheduler hands out the tasks, the machines process them in parallel, and the results are collected in one place. This approach enables companies to reach very high levels of throughput when scraping websites with public APIs (e.g., Facebook).
- Real-time web scraping doesn’t use schedulers. Instead, it parses website HTML in real-time to extract information while the pages are being served to site visitors. This approach doesn’t work for sites with dynamic content since it can only scrape what has been retrieved by web browsers at the moment when parsing is done.
- Crawling and spidering is a slightly different kind of scraping where bots crawl (follow links) and index (store data in search engines like Google or Bing) certain websites rather than monotonously pulling data from them. For crawlers, scraped data is usually saved in databases for later retrieval and analysis.
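The parallelized approach from the list above can be sketched on a single machine with a thread pool. This is a simplified stand-in and not a distributed system: worker threads take the place of separate machines, the `FAKE_SITE` dictionary takes the place of the network, and every name and URL here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the network: URL -> page body.
FAKE_SITE = {
    "https://example.com/item/1": "price=10",
    "https://example.com/item/2": "price=25",
    "https://example.com/item/3": "price=7",
}


def fetch(url):
    """Placeholder for an HTTP request; a real worker would download the URL."""
    return FAKE_SITE[url]


def scrape_price(url):
    """One unit of work: fetch a page and extract a single field from it."""
    body = fetch(url)
    return url, int(body.split("=", 1)[1])


def scrape_all(urls, max_workers=3):
    """Distribute the URLs across worker threads and gather results in one dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map acts as the scheduler: it hands tasks to idle workers
        # and yields the results in the same order as the input URLs.
        return dict(pool.map(scrape_price, urls))


results = scrape_all(sorted(FAKE_SITE))
print(results)
```

In a real deployment the pool would typically be replaced by a task queue shared between machines, but the shape stays the same: many independent fetch-and-extract tasks and one place where the results are collected.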
Common web scrapers
There are many web scrapers available online, but only a few of them are popular. Here are some examples:
- Scrapy – An open-source web scraping framework written in Python. Scrapy is one of the most widely used web scrapers and it can be combined with several other Python libraries for specific purposes.
- Beautiful Soup – A Python library for pulling information out of HTML and XML documents. It works on top of a parser such as Python’s built-in html.parser or lxml, and supports navigating, searching, and selecting elements, extracting information, and handling broken markup.
- MechanicalSoup – Another open-source Python library, built on Requests and Beautiful Soup, that makes screen scraping easy. It automates interaction with websites by storing and sending cookies, following redirects and links, and submitting forms. It does not execute JavaScript.
- Selenium – A browser automation tool that lets you scrape data from almost any site since all major browsers are supported (Firefox, Chrome, Safari, Edge). Selenium can be used in conjunction with several other tools and offers bindings for several programming languages, including Java and Python. It works by sending commands to the specified browser and reading back the results within an automated session.
- WebScarab – An open-source tool written in Java for inspecting HTTP and HTTPS requests and responses. WebScarab works as an intercepting proxy: it sits between the browser and the web server, which lets the operator review and modify the traffic passing between them. It is primarily a security testing tool rather than a scraper.
- MobSF – The Mobile Security Framework, an open-source security testing framework for mobile applications. MobSF is written in Python and performs static and dynamic analysis of mobile apps. Like WebScarab, it is a security testing tool rather than a general-purpose scraper.
How to prevent web scraping
There are several possible approaches for how organizations deal with web scraping:
- Rate limit web crawling requests – This is the most commonly used approach to combat scraping. It involves placing limits on the number of requests that can be sent per time period (e.g., per minute or hour). If a user exceeds these limits, they are either blocked by the server sending an HTTP response code 429 Too Many Requests or given limited access.
- Hide content behind CAPTCHAs – Some websites try to prevent bots from accessing certain resources by requiring users to answer a CAPTCHA (“Completely Automated Public Turing test to tell Computers and Humans Apart”). The idea behind this technique is that humans pass this test easily, while bots fail since they aren’t able to recognize images or read distorted text. However, not all bots are stopped by CAPTCHAs, especially those that use image or text recognition or hand the challenges to human solvers.
- Block web crawling requests using the robots.txt file – Some websites allow access to certain kinds of data only to specific user agents (e.g., search engines). Using this technique, site owners can specify which user agents have permission to access which parts of their site and tell web scrapers to stay away from the rest. Well-behaved crawlers honor robots.txt and will not fetch content that is disallowed in it. However, the file is advisory and is not enforced by the server, so bot creators who want to scrape everything without restrictions simply ignore it.
- Blacklist web scraping services – When a website or a web scraping service gets hacked, sensitive information can be exposed. To protect their users from malware or other threats that might arise after such incidents, webmasters can blacklist certain services and refuse them access to any page on their site. However, there are various ways of getting around this restriction.
- Block an entire range of IP addresses – Sometimes websites blacklist an entire range of addresses used by bots or scrapers rather than blocking individual IPs one by one. This approach is very effective in preventing web scraping, but it also causes problems: many legitimate users are blocked as well, which eventually leads to loss of visitors and revenue.
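The rate limiting approach from the list above can be sketched as a sliding-window counter. This is one possible way to do it under stated assumptions, not how any particular server implements it: the class name, the limit of 3 requests per 60 seconds, and the client address are all made up for the example, and timestamps are passed in by hand so the behavior is easy to follow.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allows at most `limit` requests per client within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        # Client id -> timestamps of that client's recent accepted requests.
        self._hits = defaultdict(deque)

    def check(self, client_id, now):
        """Return the HTTP status to send: 200 if allowed, 429 if over the limit."""
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return 429  # Too Many Requests
        hits.append(now)
        return 200


limiter = SlidingWindowRateLimiter(limit=3, window=60)
# Hypothetical client sending requests at t = 0, 1, 2, 3 and 61 seconds.
statuses = [limiter.check("203.0.113.7", t) for t in (0, 1, 2, 3, 61)]
print(statuses)  # [200, 200, 200, 429, 200]
```

The fourth request is rejected because three requests already fall inside the 60-second window. By the fifth request the two oldest ones have expired, so it is accepted again.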
The future of web scraping
Web scraping is a growing issue in the tech industry. It continues to evolve, and its impacts are felt by more people every day. It poses a threat to companies that deal with sensitive information, such as banks, credit card companies, and government organizations.
Frequently Asked Questions about web scraping
How common is web scraping?
Web scraping is used by search engines, academic researchers, news aggregation sites, marketing firms, and many others for a wide variety of reasons.
How do web scrapers differ from analytics software?
There is some confusion about the difference between web scrapers and analytics software, and some web scraping services like to brand themselves as analytics tools. However, the two are easy to tell apart: web scrapers use poorly disclosed or hidden user agents to access and process online resources in bulk, without first requesting users’ consent or giving them a way to opt out.
How do web scrapers work?
Web scrapers run automated tasks over the Internet in order to collect specific types of information. They use various methods and tools (e.g., web crawling and screen scraping) in order to carry out these tasks.
How can I protect my website from web scrapers?
Webmasters can protect their sites with a number of techniques, such as using the robots exclusion standard (robots.txt), limiting access by IP address or range, implementing CAPTCHAs, blocking bots by user agent, and blacklisting known service providers that gather data with bots.
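To illustrate the robots exclusion standard mentioned above, the following sketch uses Python's standard `urllib.robotparser` module to evaluate a small robots.txt file the way a well-behaved crawler would. The file content and the bot names `ExampleSearchBot` and `HypotheticalScraper` are invented for the example. Keep in mind that this check runs on the crawler's side, so it only restrains bots that choose to perform it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one search bot may fetch everything,
# every other user agent is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow:

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """Ask whether the given user agent may fetch the URL under these rules."""
    return parser.can_fetch(user_agent, url)


private_url = "https://example.com/private/report.html"
public_url = "https://example.com/products"

print(allowed("ExampleSearchBot", private_url))
print(allowed("HypotheticalScraper", private_url))
print(allowed("HypotheticalScraper", public_url))
```

Under these rules the search bot is allowed everywhere, while the other agent is allowed on the public page but not under /private/.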
What data can web scrapers access?
Web scrapers can access any information that is publicly available online. They can search for and collect data from web pages, images, PDFs, videos, news feeds, and so on. But they can’t scrape encrypted or password-protected documents, such as those that require HTTP authentication before granting access.
Are there legal restrictions on web scraping?
Web scraping is mostly unregulated at this time. However, some jurisdictions have specific legislation that applies to the use of automated tools for collecting user information online (e.g., GDPR in the European Union). Also, many websites prohibit or limit content gathering by bots in their terms of service, which may be enforceable as contracts between site owners and users depending on the jurisdiction.
What data can’t web scrapers access?
Since web scraping is a data mining technique, it only works on data that’s available online. E-commerce sites with payment pages that require authentication before the user can place an order are protected from data mining since they don’t publicly share their data. Also, password-protected documents and encrypted data cannot be accessed by web scrapers. However, some services claim to have techniques for bypassing encryption and authentication so this may no longer be the case in the future.
Can I go after a website or a service if they scrape my data without my consent?
That depends on your location and what data was collected. For example, GDPR protects personal data in the European Union, while HIPAA protects health data in the United States. You can also try to take legal action against data-mining companies that scrape data, but you need to prove damages, loss of revenue, or data theft for this to work.
What data is most valuable?
Any data that is available online and publicly shared can potentially be scraped by web crawlers. But some types of data are more valuable than others since they contain information about individuals, their preferences, and their habits. These include personal data such as names, dates of birth, social security numbers, home addresses, and credit card details, because they enable a complete profile to be built on a person using data from various sources. Web scrapers use people’s information to create profiles which can be sold at a profit to other data-mining companies.
How many websites get scraped?
The data is not readily available but it’s estimated that web-based data has been scraped from billions of websites. Some data aggregators have been known to scrape data from tens of thousands of sites simultaneously and store it in their centralized databases.