Digital Provenance

18th Jul 2018

 

Just as the provenance of a work of art is used to determine its authenticity, we use a variety of machine learning techniques that examine the behavioural history of the origin IP, as well as its current behaviour, to build a dynamic, behaviour-driven intelligence system tailored to the unique requirements of your enterprise estate.

Eliminate The Fakes

Quickly and reliably determining which Internet visitors are fake and which are real is challenging. Sophisticated malicious bots mimic many of the telltale signs of a ‘real human’, such as mouse trails, but many of the more basic bots use obvious browser emulators. These are easier to detect using a simple JavaScript Proof of Work test, which asks the browser to prove it is a full-stack browser attached to a CPU by performing a simple calculation.

Hash Cash

The hashcash algorithm we use is a modified form of the proof of work used in Bitcoin blockchain mining. In our case we are simply using hashcash to ensure the browser is a full-stack browser, and not a bot-based browser emulator. The hashcash processing needed is minimal and will not cause a legitimate client any difficulty. However, if we are suspicious of a particular series of IPs, we can set a harder calculation, which forces the botnet to use up expensive CPU cycles.
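To make the idea concrete, here is a minimal hashcash-style sketch in Python. It is not the modified algorithm described above, whose details are not given here. It assumes SHA-256, a hypothetical `challenge:nonce` string format, and a difficulty expressed as a number of leading zero bits. The function names are illustrative only.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # 8 - bit_length() is the number of leading zeros in this byte
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to check the client's answer."""
    # Assumed message format: "<challenge>:<nonce>"
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: try nonces until the hash meets the difficulty."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


# Hypothetical usage: a low difficulty for ordinary visitors.
nonce = solve("session-1234", 12)
print(verify("session-1234", nonce, 12))
```

In this sketch each extra difficulty bit roughly doubles the expected number of hashes the client must compute, while verification stays at one hash. That asymmetry is what would let a harder calculation be set for suspicious IP ranges at little cost to the server.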

Moving Beyond The Fingerprint

Together, all these data points are fed into the machine learning engine. We call this the ‘digital provenance’ to reflect both the historical analysis and the in-depth digital forensic insight needed to verify authenticity. A bot may identify itself as Bing, but is it verified and authenticated as actually Bing, traceable to a Bing data centre?

This approach goes far beyond a simple fingerprint. We build up a detailed picture of your visitors, based on origin authentication, IP reputation and history, user journey, and the behavioural activity of the visitors themselves.

Once the machine learning engine understands your web estate’s critical paths and your own risk criteria, which can be set using a simple visual tool, it can start to understand the visitor flow.

Historical Analysis

Bots rotate IP addresses very quickly, and keeping up with the latest attacks, botnets, and hijacking of legitimate PCs by rogue software is very tough. As the data on each IP ages, false positives can easily be introduced.

A PC hijacked by malware can be cleaned, yet still be flagged as a rogue PC. We examine the history of each IP to see whether it has been flagged multiple times. Each IP is periodically ‘released’ based upon its prior score, so we can verify and eliminate persistent offenders while taking reasonable steps to ensure false positives are reduced over time.

Advanced Captcha

Google captcha is easily bypassed by advanced bots, which use AI techniques such as linking a video camera to the screen, or even hand the challenge to human solvers who bypass it manually. We offer advanced captcha techniques that look at the digital fingerprint to search for the telltale signs of human v. bot behaviour.

Using an Advanced Captcha defeats many of the known bypass methods, and allows us to feed the Captcha fingerprinting data into our machine learning algorithm, making it even more precise.

As bots try to evade the Captcha, they in turn leave telltale signs that we can easily identify as sure indicators of scripted bots.

Static v. Dynamic Lists

Most of the publicly available IP reputation lists are static and quickly become out of date. Because they cannot keep up with the ever-changing bot threat, these lists often generate many more false positives.

Historical IP Analysis

Instead of using static lists, we assign each IP address a threat score based on historical data. An IP address that is repeatedly flagged receives a longer reputation ‘sentence’, marking it as a bad actor for longer. Each IP is periodically reassessed to ensure that we don’t create too many false positives in the data.
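One way to picture the escalating ‘sentence’ is the short Python sketch below. It is an illustration only: the one-hour base sentence, the doubling rule, and the names `ReputationStore`, `flag` and `is_blocked` are all assumptions, not the scoring actually used.

```python
from dataclasses import dataclass

# Hypothetical: a first offence is held for one hour.
BASE_SENTENCE_SECONDS = 3600


@dataclass
class IpRecord:
    flag_count: int = 0      # how many times this IP has been flagged
    release_at: float = 0.0  # time after which the IP is released


class ReputationStore:
    def __init__(self):
        self.records = {}  # maps IP address -> IpRecord

    def flag(self, ip, now):
        """Record an offence; each repeat doubles the sentence (assumed rule)."""
        record = self.records.setdefault(ip, IpRecord())
        record.flag_count += 1
        sentence = BASE_SENTENCE_SECONDS * 2 ** (record.flag_count - 1)
        record.release_at = now + sentence
        return sentence

    def is_blocked(self, ip, now):
        """An IP counts as a bad actor only until its sentence expires."""
        record = self.records.get(ip)
        return record is not None and now < record.release_at


store = ReputationStore()
store.flag("203.0.113.7", now=0)     # first offence: held for one hour
store.flag("203.0.113.7", now=4000)  # repeat offence: held for two hours
```

Because the flag count survives each release, a cleaned machine drops off the list once its sentence ends, while a persistent offender is held for progressively longer.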

Browser Fingerprinting

Browser fingerprinting allows us to inspect which device and browser a visitor is using, whether they are using standard configurations and how they interact with the site. This analysis is valuable in separating automated, non-standard browsing behaviour from legitimate users.
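As a rough illustration of what checking for ‘standard configurations’ can involve, the Python sketch below looks for a few widely known automation giveaways. It assumes a client-side script has already collected values such as `navigator.userAgent`, `navigator.webdriver`, `navigator.languages` and the screen size into a dictionary. The dictionary keys and the function name are hypothetical, and a real fingerprint would draw on far more signals.

```python
def fingerprint_anomalies(signals):
    """Return the reasons a client-reported fingerprint looks automated."""
    reasons = []
    user_agent = signals.get("user_agent", "")
    # Headless Chrome announces itself in its default user agent string.
    if "HeadlessChrome" in user_agent:
        reasons.append("headless browser user agent")
    # Browsers under WebDriver automation report navigator.webdriver as true.
    if signals.get("webdriver"):
        reasons.append("navigator.webdriver is set")
    # A normal browser reports at least one preferred language.
    if not signals.get("languages"):
        reasons.append("no browser languages reported")
    # A zero-sized screen suggests there is no real display attached.
    if signals.get("screen_width", 0) == 0 or signals.get("screen_height", 0) == 0:
        reasons.append("zero-sized screen")
    return reasons


# Hypothetical visitor data for illustration.
suspect = {
    "user_agent": "Mozilla/5.0 HeadlessChrome/120.0",
    "webdriver": True,
    "languages": [],
    "screen_width": 0,
    "screen_height": 0,
}
print(fingerprint_anomalies(suspect))
```

Simple rules like these only catch the more basic bots. In the approach described here they would be one input among many rather than a verdict on their own.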

Source Verification

Some bots deliberately disguise themselves as known good actors. For example, we’ve found lots of bots that claim to be Google but aren’t. They rely on the reluctance of webmasters to block anything Google related, but the provenance forensics shows they are just masquerading as Google bots.
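A common way to check such a claim is a reverse DNS lookup followed by a confirming forward lookup, which is the approach the major search engines document for verifying their own crawlers. The Python sketch below shows that general technique. It is not a description of the provenance forensics above, and the function names are hypothetical. The hostname suffixes are the ones commonly documented for Googlebot and Bingbot and should be checked against current guidance before use.

```python
import socket

# Assumed crawler hostname suffixes; verify against the engines' own docs.
CRAWLER_SUFFIXES = {
    "google": (".googlebot.com", ".google.com"),
    "bing": (".search.msn.com",),
}


def reverse_lookup(ip):
    """Return the primary hostname registered for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_genuine_crawler(ip, claimed, reverse=reverse_lookup, forward=forward_lookup):
    """Check that an IP really belongs to the crawler it claims to be."""
    try:
        hostname = reverse(ip)
        # Step 1: the reverse DNS name must sit in the crawler's domain.
        if not hostname.lower().endswith(CRAWLER_SUFFIXES[claimed]):
            return False
        # Step 2: the hostname must resolve back to the same IP, since
        # anyone who controls their own reverse DNS can fake step 1.
        return ip in forward(hostname)
    except (OSError, KeyError):
        # Lookup failures and unknown crawler names count as unverified.
        return False
```

The lookups are passed in as parameters so the check can be exercised without network access. In practice, results would normally be cached, because two DNS queries per request is expensive.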