Content theft is the use of bots and other automated processes to gather content from your site to use elsewhere without your consent.
This can be content you have created yourself (such as journalism or thought leadership) or content extracted from paid data sources (such as sports statistics or product information).
Content theft is part of the family of bot activity that we classify as data harvesting, more generically known as scraping. This category encompasses some of the oldest forms of bot activity and underpins how most people interact with the internet today: we refer, of course, to search engine spiders.
Search engine spiders illustrate the challenge in identifying content theft: not all bot activity is bad, and some is vital for business success. Other bot activity that could be good includes content aggregation for display on aggregation sites (these can be good or bad depending on how they display content and whether they link back to the source appropriately) and content scraping by affiliates to help them market your products/services.
Because such a wide range of web scrapers hits every website, we have to look at more than just the behaviour that indicates a visitor is scraping (e.g. number or frequency of requests) or how visitors identify themselves (e.g. as googlebot).
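To illustrate why self-identification alone is not enough, the sketch below checks a visitor that claims to be Googlebot by doing a reverse DNS lookup on its IP address and then a forward lookup to confirm the hostname maps back to the same IP. This is a generic verification technique, not a description of how Netacea works. The function name, the return labels and the list of accepted hostname suffixes are assumptions made for illustration; the suffixes should be checked against Google's current guidance.

```python
import socket

# Assumption: hostname suffixes treated as genuine Google crawler hosts.
GOOGLE_HOST_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_dns(ip: str) -> str:
    """Return the primary hostname registered for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host: str) -> list[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def classify_googlebot_claim(ip, user_agent,
                             reverse_dns=_reverse_dns,
                             forward_dns=_forward_dns) -> str:
    """Classify a request as 'not_claimed', 'verified', 'spoofed' or 'unverified'.

    The resolver functions are injectable so the logic can be exercised
    without network access.
    """
    if "googlebot" not in user_agent.lower():
        return "not_claimed"
    try:
        host = reverse_dns(ip)
        # The hostname must belong to a Google crawler domain...
        if not host.lower().endswith(GOOGLE_HOST_SUFFIXES):
            return "spoofed"
        # ...and must resolve back to the original IP address.
        return "verified" if ip in forward_dns(host) else "spoofed"
    except OSError:
        # DNS failure: the claim can be neither confirmed nor rejected.
        return "unverified"
```

A request labelled "spoofed" here is claiming a trusted identity it cannot prove, which is a stronger signal than request volume alone. The forward lookup in this sketch only handles IPv4 addresses.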
Netacea Intelligence uses advanced machine learning techniques to detect scrapers and to categorise them by the sort of scraping activity they are undertaking, using the information they are collecting and the pattern of that collection. This is then combined with data from a wide range of industry sources to add a further layer of insight to the activity categorisation.
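As a rough illustration of what "the information they are collecting and the pattern of that collection" could mean in practice, the sketch below summarises one visitor session into a few numeric features that a classifier might consume. It is not Netacea's model or feature set: the `Request` type, the `session_features` function and the three features are hypothetical and chosen only to make the idea concrete.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since epoch
    path: str         # requested URL path, e.g. "/products/123"


def session_features(requests: list[Request]) -> dict[str, float]:
    """Summarise what a session collected and how it collected it."""
    if not requests:
        return {"requests_per_minute": 0.0,
                "top_section_share": 0.0,
                "distinct_path_ratio": 0.0}
    times = [r.timestamp for r in requests]
    # Floor the duration at one second so single-request sessions do not divide by zero.
    duration_min = max((max(times) - min(times)) / 60.0, 1 / 60.0)
    # Treat the first path segment as the site section being collected.
    sections = Counter(r.path.strip("/").split("/")[0] or "(root)" for r in requests)
    return {
        # Pattern of collection: how fast requests arrive.
        "requests_per_minute": len(requests) / duration_min,
        # Information collected: how concentrated the session is on one section.
        "top_section_share": sections.most_common(1)[0][1] / len(requests),
        # Pattern of collection: how rarely the session revisits a page.
        "distinct_path_ratio": len({r.path for r in requests}) / len(requests),
    }
```

A session that requests many distinct pages from a single section at a steady high rate looks quite different on these features from a person browsing, though real categorisation would need far richer signals than these three.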
Netacea has identified a range of categories of scrapers, both good and bad, that can be used as the basis of bot management policies. The custom whitelisting available also ensures that known affiliates are not blocked.