Data science and Machine Learning: How to drive innovation

By Richard Jones / 23rd Oct 2018

Machine Learning and Agile - How we created a culture for innovation

At Netacea we’ve been on a very rapid journey from an initial proof of concept built by a single data scientist to architecting and bringing into production a sophisticated machine learning product hardened for our enterprise customers.

Along the way, we’ve inevitably made lots of mistakes and sometimes even managed to get a few things right. Our biggest challenge was how to let the data science team innovate while ensuring we remained highly focused on meeting the core customer requirements.

As we integrated the data science team into our existing Agile squads and tried to sync up the two, we all gradually came to realise that the typical, well-worn Agile playbook just wasn’t working.

We had yet to solve the core challenge: how a data science team can work inside a traditional software company that is geared up to be a finely honed Agile machine, relentlessly making mincemeat of requirements and features.

So what was going wrong?

When we turned the microscope onto our process, it turned out nearly everything was wrong with the way we’d tried to integrate the data science team into our existing Agile squads.

The core reason was that we had tried to apply a typical Agile process without taking into account the unique demands of our data science mission. One of the worst mistakes you can make is taking someone else’s process and trying to apply it to your own business. This is particularly true in a start-up team that is growing and creating its own cultural identity.

The tyrannical side of Agile - the relentless focus on features and requirements, as well as the daily ceremonies - resulted in a slowdown of actual development and a dearth of innovation for the data science team.

The team was spending considerable time with Product Owners and other internal stakeholders without a background in machine learning, who were trying and failing to understand the decision gates, or was simply time-boxing work whose scope couldn’t be predicted.

It became very difficult to apportion and cost-justify time correctly, and we spent too much time bogged down in planning and not enough creating. None of the traditional burndown metrics helped us; sometimes they even went the wrong way.

To sync the data science team with the production squads we held the usual retros, planning sessions and stand-ups, but as the team grew these meetings began to take up far too much time and resource, and it became apparent we were sucking the lifeblood out of our data science team.

Developing machine learning models is complex and unpredictable. If you know what the result of the model is going to be before you let the machine learning work, the chances are that you are not doing the data science correctly in the first place.

Given that it’s impossible to know the results of the data science, and the underlying core requirement driving it, until you’ve developed and written it all up, the entire process cycle proved to be comprehensively broken.

After much careful thought, for Netacea the fundamental insight was to allow the team the time and flexibility to drive innovation through the data discovery.

Enabling innovation

We were already driving outcome-based requirements through the Jobs To Be Done (JTBD) framework. This gave us a respectable high-level set of core customer requirements, and with it a customer-focused framework within which the innovation could happen. Once the data science team really understood the core customer “why”, they could bring large helpings of “how”.

We took the decision to turn the data science team into its own collegiate, self-organising squad and called the team “Discovery”. The team includes its own Machine Learning engineers, who support the transition from pure data science, helping to scale and push the algorithms into actual production.

Letting the data scientists, with their PhDs and scientific know-how, build those decision gates transformed our process: we began not only to see good results but to innovate rapidly.

It turns out that if data scientists can share and socialise their ideas and POC results with their peers, along with a matrix report into the very high-level core desired outcomes, they almost immediately start to generate better experiments. We began to see a marked improvement in production candidates.

The core Jobs To Be Done framework helped frame the outcomes around constant, iterative hypothesis testing in a collegiate working culture that values open debate around a known set of outcomes, driven by the actual data discovery.

Reporting into fellow experts and sharing results through peer review meant we had true experts moving us quickly through the decision gates. We use peer review at every stage - all analysis, all code, all decision gates are peer reviewed.

Fostering Collaboration Amongst The Data Scientists

One of the core benefits for the data science team was that they became a much happier team. The collegiate framework proved to be a very rich and fertile one. Too often data scientists operate in a vacuum, with no one to bounce new ideas and models off. Too often they report to, and must cost-justify their work to, people who don’t have a background in data science, which often leads to a broken process.

Instead of introducing yet another layer of process and slowing us down, the collegiate system of hypothesis testing and peer review produced true debate, true integration and finally true innovation. We let the machine learning drive the process of data discovery and iteratively fed the improved results back into our next sprint. We were able to rapidly prototype multiple contenders for a product candidate and quickly verify which of the ideas was going to be the best fit to harden into production.

The data science team is still very much using many core Agile processes. We have daily stand-ups and use a simple Kanban board. We develop in thin slices, iteratively. We have retros. We take the feedback from each iterative slice and re-apply the knowledge back into our development cycle. We don’t estimate and time-box, but we do report progress through the stand-ups and make decisions based on peer review and the results of the actual data.

One of the keys to driving the innovation was to increase the number of ideas we could test and rapidly move down the decision gates from the initial idea to the final production candidate.

It’s too easy for data science teams to fall in love with their own ideas, and with machine learning it’s particularly dangerous not to be hypothesis-driven and not to frame each mini POC with strict outcome criteria.

It’s also completely unhelpful to consider too many ideas all at once.

The collegiate team structure really allows individuals to develop their ideas and skills through exposure to many different aspects of the machine learning work. Our flat structure is one of the best features of our process and creates an environment where everyone’s ideas are equal as well as equally scrutinised. We can rapidly progress through to focus on just the best two or three ideas.

It turns out the Goldilocks number of ideas to examine initially is usually three. One is definitely too few, four is definitely too many. Again, this initial decision gate may be very quick, as we have all the relevant data science, modelling, visualisation and machine learning engineering skills in the team to assess the relevant machine learning and architectural concerns and move an idea through the gates.
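
As a rough illustration only, here is one way a round of mini POCs with strict outcome criteria and a cap of three ideas could be written down in Python. This is a sketch under stated assumptions, not Netacea’s actual tooling: the `MiniPoc` class, the `next_gate` function, the metric names and every number below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_IDEAS_PER_ROUND = 3  # the "Goldilocks number" from the text


@dataclass
class MiniPoc:
    """One idea framed as a hypothesis with an outcome criterion set up front."""
    name: str
    hypothesis: str
    metric: str                     # hypothetical metric name, e.g. "precision"
    threshold: float                # value the metric must reach to pass the gate
    result: Optional[float] = None  # filled in once the experiment has run

    def passes_gate(self) -> bool:
        # An experiment with no result yet cannot pass.
        return self.result is not None and self.result >= self.threshold


def next_gate(pocs: List[MiniPoc], limit: int = MAX_IDEAS_PER_ROUND) -> List[MiniPoc]:
    """Return the POCs that met their criterion, largest margin first."""
    if len(pocs) > limit:
        raise ValueError(f"examine at most {limit} ideas per round, got {len(pocs)}")
    passed = [p for p in pocs if p.passes_gate()]
    return sorted(passed, key=lambda p: p.result - p.threshold, reverse=True)


# Made-up example round: names, hypotheses and numbers are purely illustrative.
round_one = [
    MiniPoc("idea-a", "Feature set A separates the two classes", "precision", 0.90, 0.93),
    MiniPoc("idea-b", "Model B generalises to unseen customers", "recall", 0.80, 0.71),
    MiniPoc("idea-c", "Clustering C finds a stable grouping", "precision", 0.90),
]
survivors = next_gate(round_one)
print([p.name for p in survivors])  # ['idea-a']
```

The point of the sketch is that the threshold is written down before the experiment runs, so an idea moves through the gate on its data rather than on how attached anyone is to it.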

[Diagram: Machine Learning Data Science]

The diagram shows a logical series of gates progressing smoothly from ideation to production, like a cognitive ladder. In fact, the machine learning process is often shaped more like a bush: we have clusters of data and innovation that don’t flow in a linear progression but do give us the insight we need to feed into the next iteration.

Setting the framework for this innovation is tough.

Without all the building blocks (reliable real-time streaming data, a relentless focus on achieving clean data, and a complete focus on early thin-slice experimentation as a series of mini POCs) you don’t have any ability to perform the data science, let alone create the innovation framework needed to ensure you’re solving the right problems.

We’re still improving our process over time as our product priorities change. Without clearly thinking through your own process, and understanding the unique composition of your team, you’re highly unlikely to succeed by simply applying someone else’s playbook.
