The Netacea Data Science Team Discusses Structured Streaming
By Matt Jackson / 10th Jul 2018
The Data Science team here at Netacea is always looking at the latest technologies to help us in our quest for real-time bot detection. The task is a difficult one: processing web log data for multiple customers, monitoring behaviour, and making predictions with trained machine learning models, all in real time.
Data Science Tech Stack
The Data Science team owns a small yet important element of the technology stack powering Netacea. Our inline stack consumes data from Kafka, processes the stream with Spark on EMR and outputs results through Kafka and Elasticsearch.
A key principle in our development cycle is to keep our offline analysis as close to the real-world application as possible. Wherever possible we reuse the same shared functionality in our offline analysis, which we generally undertake in Zeppelin notebooks on EMR.
Spark is central to our data pipeline, doing most of the heavy lifting. Spark allows us to process huge amounts of data both offline, when we are analysing data and building machine learning models, and inline when we are using our knowledge and trained models to make predictions.
Since Spark 2.0, the Structured Streaming API has brought the easy-to-use, familiar concept of data frames to the world of streaming. Already huge fans of the Spark SQL API for offline analysis and batch jobs, we set out to bring our framework up to date and try out Structured Streaming.
Overview of Structured Streaming
There are many blog posts outlining Spark Structured Streaming and its benefits, along with some excellent documentation, so I won't bore you with another. Instead, I thought it would be interesting to share a few lessons learned. If you are interested in reading more, the official documentation is a good place to start.
We are constantly trying to balance offline research and development with continuous improvements to our real-time scripts. One unexpected benefit of Structured Streaming has been how it streamlines this journey. Since both the Spark Structured Streaming and Spark SQL APIs work with data frames, our offline analysis is much closer to our production code, allowing us to iterate much more quickly.
Take the following example of reading some data from a JSON file, applying some transformations and then writing the results out as Parquet.
Spark SQL API
```python
input = spark.read.format("json").load("path")
output = input.transformations...  # placeholder for your transformations
output.write.format("parquet").save("s3://my-bucket/spark_sql_example")
```
Spark Structured Streaming API
```python
input = spark.readStream.format("json").load("path")
result = input.transformations...  # placeholder for your transformations
result.writeStream.format("parquet") \
    .option("checkpointLocation", "s3://my-bucket/checkpoints") \
    .start("s3://my-bucket/structured_streaming_example")
```

(Note that file sinks require a checkpoint location; the path above is a placeholder.)
Notice how few differences there are in syntax. This similarity makes our development journey from idea to production much simpler.
Our inline Spark jobs consume data flowing through Kafka; for each customer we consume a topic containing processed web log data ready for analysis. Previously, the old DStreams API made it difficult to treat these topics separately: the workaround was to initialise multiple DStreams, one for each topic, and this didn't scale well.
Thankfully, Structured Streaming offers a much simpler solution: when using Kafka as an input, the topic appears as a column in the data frame, so keeping customer data separate is as simple as grouping the data frame by topic.
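To make the idea concrete, here is a plain-Python sketch of the grouping logic (not actual Spark code; the records and topic names are hypothetical). It mirrors what a groupBy on the Kafka `topic` column achieves in a data frame:

```python
from collections import defaultdict

# Hypothetical web log records as (topic, payload) pairs, as they might
# arrive from a multi-topic Kafka subscription (one topic per customer).
records = [
    ("customer_a", {"path": "/login", "status": 200}),
    ("customer_b", {"path": "/", "status": 200}),
    ("customer_a", {"path": "/checkout", "status": 403}),
]

# Group records by topic so each customer's data is processed separately,
# mirroring a data frame groupBy on the Kafka "topic" column.
by_topic = defaultdict(list)
for topic, payload in records:
    by_topic[topic].append(payload)

for topic, payloads in sorted(by_topic.items()):
    print(topic, len(payloads))
```

In Spark the grouping happens lazily over the stream, but the separation of customers by topic is conceptually the same.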
Most of our work revolves around training, testing and evaluating machine learning models, and Structured Streaming has made it even easier to integrate trained models into Spark. MLlib now makes use of data frames, keeping things familiar across the board, and its API is consistent across ML algorithms and multiple languages.
Another great feature of MLlib is ML Pipelines, which help us streamline our model-building process and version control. This is extremely important for us as we work in an ever-changing environment.
The biggest problem we encountered during the migration was the lack of support for multiple aggregations on a single stream. For example, say we wanted to compare one session's features to a global average within a given time window; this would require multiple aggregations to achieve.
To solve this problem we decided to use Kafka as an output sink, pushing intermediate aggregations to it. Our inline script then picks up these streams and continues as before. This solution works nicely and has led us to think about new ways we could store longer-term behaviour through different windowed aggregations – a promising direction to investigate.
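The two-stage workaround can be sketched in plain Python (a conceptual sketch, not our actual Spark jobs; the session names and counts are made up). The first stage emits per-session aggregates to an intermediate stream, and the second stage consumes them to compute the global comparison:

```python
# Stage 1: per-session aggregation (the first streaming job).
events = [
    {"session": "s1", "requests": 1},
    {"session": "s1", "requests": 1},
    {"session": "s2", "requests": 1},
]

per_session = {}
for e in events:
    per_session[e["session"]] = per_session.get(e["session"], 0) + e["requests"]

# In the real pipeline these intermediate results would be published to
# a Kafka topic here, rather than passed in memory.
intermediate = [{"session": s, "count": c} for s, c in sorted(per_session.items())]

# Stage 2: a second job consumes the intermediate stream and compares
# each session against the global average within the window.
global_avg = sum(r["count"] for r in intermediate) / len(intermediate)
flagged = [r["session"] for r in intermediate if r["count"] > global_avg]
```

Splitting the work across two jobs means each streaming query only ever performs one aggregation, which keeps Structured Streaming happy.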
Porting our code from DStreams to Structured Streaming was a great learning experience and has resulted in some real technical gains. We now benefit from all the latest updates to Spark Streaming, making our code cleaner and more efficient. Unexpectedly, we have also greatly improved our workflow, allowing us to iterate through ideas and improvements much more effectively.