Blog, Events & News
Using The Skip-Gram Model to Understand Web Traffic
By Matt Jackson & John Modin / 8th Jul 2018
How we use word embedding techniques to give domain owners insight into, and control over, the bots visiting their web estate.
At Netacea we aim to give domain owners visibility and control over all visitors to their web estate, so they can provide assisted learning that helps our algorithms automatically manage visitors according to their business priorities.
Key to this is to identify different kinds of bots based on their behaviour. We achieve this through the application of machine learning on the traffic data.
We rely on machine learning models to classify different types of bots; successful training of these models relies on well-structured data representations, known as features.
We can use all kinds of hand-engineered features to train our models, such as the time between requests, or the number of requests made. Different sets of features will be good at classifying certain bots, and will hopefully generalise well across different domains. However, hand-engineering features is extremely time-consuming, and when we wish to classify a new type of bot, chances are we need to find some new clever features. It would be much better if we could automate this feature engineering and let the model do the work, so we can have time to try out new board games!
When analysing data logs with the naked eye, we often scan through the actual webpage requests and the sequence they arrive in; during this exercise we can clearly distinguish behaviour types, and it is their characteristics we try to capture by devising original features. If we could automate this process with machine learning, it would massively help generalise our approach and reduce manual intervention. Who knows, a trained model might even be better than us at spotting the most suitable features. It may even work 24/7 without pay.
To pass requests directly to the model we need to represent them numerically somehow, and this presents a new challenge: for a given website, the number of unique paths requested is very large, and most paths are only requested a few times.
Aside: Word Embeddings
Natural language processing (NLP) is an exciting area of machine learning where models are trained to understand language. Common tasks in NLP include language translation, text classification and sentiment analysis. Each of these problems shares a common challenge: how can words and sentences be represented numerically?
The simplest representation would be one-hot encoding. Here each word is represented by a vector with length equal to the size of the vocabulary. All indices in the vector are set to zero except the index corresponding to the word's position in the vocabulary, i.e. the nth word in the vocabulary is all zeros except for a one in the nth place. See the example below where we have a vocabulary of three words.
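The three-word case can be sketched in a few lines of Python (the vocabulary here is invented for illustration):

```python
# One-hot encoding over a hypothetical three-word vocabulary.
vocab = ["orangutan", "pear", "rambutan"]

def one_hot(word, vocab):
    """Return a vector of zeros with a one at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("pear", vocab))  # [0, 1, 0]
```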
This works well in the case when we are trying to represent three words, but when the vocabulary is much larger, such as the English language, the result of one-hot-encoding is a huge, sparse matrix. This leads to many problems and should be avoided… like the plague.
Another issue with one-hot encoding is that it does not capture anything about the meaning of the words. For instance, in our vocabulary, orangutan is equally far from pear as pear is from rambutan which is clearly ridiculous.
When representing sentences using these one-hot vectors, it is typical to just sum up the vectors into what is known as a 'bag of words'. In this approach the context of words is lost; the phrases “the orangutan ate the rambutan” and “the rambutan ate the orangutan” would have the same representation. This is not great news; the first is perfectly in order whereas the latter is bizarre.
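A quick sketch of this, using Python's `collections.Counter` to stand in for the summed one-hot vectors:

```python
from collections import Counter

def bag_of_words(sentence):
    # Summing one-hot vectors is equivalent to counting words,
    # so a word-count dictionary captures the same representation.
    return Counter(sentence.split())

a = bag_of_words("the orangutan ate the rambutan")
b = bag_of_words("the rambutan ate the orangutan")
print(a == b)  # True: word order is lost entirely
```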
Ideally, information about the vocabulary should be embedded into a smaller, dense matrix, where words with similar meaning have similar representations. There are several ways to do this; one such method is the skip-gram model. The skip-gram model trains a neural network to predict the words that appear within a window around a given word in a sentence, and the weights of this model are extracted as the word embeddings. Each embedding is a vector of numbers whose number of dimensions is much smaller than the size of the vocabulary; 300 dimensions would be typical for the English language. In this approach the vocabulary is represented by a dense matrix with size equal to the size of the vocabulary times the number of dimensions, which solves the sparsity problem described above. Also, words with similar meaning end up with similar representations.
Our hypothesis was that the paths requested in a session were analogous to the words in a sentence, and that sessions were analogous to sentences. Just as in the sentence case, the sequence of paths requested is important. To test this, we extracted one day of data from a leading e-commerce site and trained a skip-gram model on it.
We used gensim to train the model. Our goal here was to prove out our hypothesis, so we have not agonised over optimising parameters. For now, we have made some sensible choices based on our data; all other parameters are left at their defaults.
- Limit our vocabulary to paths with at least 10 occurrences in a day; this keeps around 20% of our ‘vocabulary’ (roughly 20,000 distinct paths)
- A context window of 10 was chosen based on the average session length
- 300 dimensions were selected through some trial and error
We were quite excited to see the results of this approach before any tuning or optimisation.
We visualised the embeddings using the t-SNE algorithm, which attempts to represent high-dimensional data in a 2- or 3-dimensional space. This can be seen below.
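As a sketch of this step, scikit-learn's `TSNE` can project an embedding matrix down to two dimensions; random vectors stand in for the real embeddings here:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the trained embedding matrix (e.g. model.wv.vectors):
# 50 "paths", each embedded in 300 dimensions.
rng = np.random.default_rng(0)
vectors = rng.random((50, 300))

# Project down to 2-D; perplexity must be below the number of samples.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(vectors)

print(coords.shape)  # (50, 2) — one 2-D point per path, ready to plot
```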
We can clearly see separate clusters, and inspection reveals that they are surprisingly intuitive; for example:
- Paths relating to similar products had similar embeddings
- Help and support pages were given similar embeddings, with those relating to similar products even more closely related
- Paths relating to blog content were found to be closely related
The idea of embedding a website based on regular user behaviour could be an exciting development towards identifying bots. We have undertaken a proof of concept which indicates there may be some truth in our hypothesis, but many challenges remain.
The next step in our journey will be to use the pre-trained embeddings as input to some of our models and understand how we can leverage the embedded information.