When businesses plan to incorporate machine learning into their solutions, they more often than not assume it is mostly about algorithms and analytics. Most blogs and training material on the subject likewise cover little beyond reading fixed-format files, training models and printing results. Naturally, businesses conclude that hiring good data scientists should get the job done. What they often fail to appreciate is that it is also a good old system and data engineering problem, with the data models and algorithms sitting at the core.
A few years ago, at an organisation I was working in, the business deliberated on using machine learning models to enhance user engagement. The use cases initially planned revolved around content recommendations. Later, as we worked more in the field, we started applying it to more diverse problems such as topic classification, keyword extraction and newsletter content selection.
I will use our experience of designing and deploying machine-learnt models in production to illustrate the engineering and human aspects of building a data science application and team.
Training models was the crux of our data science application. But to make things work in production, several missing pieces of the puzzle also had to be solved:
- Getting data into the system on a regular basis from multiple sources.
- Cleaning and transforming data into more than one structure for use.
- Training and retraining models, then saving and reusing them as required.
- Applying incremental changes.
- Exposing model outputs for consumption through APIs.
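The "train, save and reuse" piece above can be sketched briefly. This is an illustrative example only, assuming a scikit-learn model persisted with joblib; the model type, file path and data are hypothetical, not our actual jobs.

```python
# Minimal sketch: train a model once, persist it, and let a separate
# scoring job reload it instead of retraining. All names are illustrative.
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression

MODEL_PATH = Path("models/topic_classifier.joblib")  # hypothetical path

def train_and_save(X, y, path=MODEL_PATH):
    """Train a model and persist it for later scoring jobs."""
    model = LogisticRegression().fit(X, y)
    path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path)
    return model

def load_and_score(X, path=MODEL_PATH):
    """Reload the saved model rather than retraining from scratch."""
    model = joblib.load(path)
    return model.predict(X)

# Toy data standing in for real feature vectors.
X = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]]
y = [0, 1, 1, 0]
trained = train_and_save(X, y)
preds = load_and_score(X)
```

The same split applies to retraining: a scheduled job overwrites the saved artefact, and consumers pick up the new model on their next load.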
Scaling the consumption APIs was also a concern. In our existing system, content was mostly static and served from a CDN cache. Certain content-related data was served by application servers, but every user received the same data, served from a cache that was refreshed every 5–10 seconds and covering only around 7,000 items on any given day. Overall, that meant low memory consumption and a low number of writes.
The personalized content output, by contrast, was for around 35 million users, with new content arriving every 10 minutes or so, and everything had to be served by our application servers. This meant a far higher number of writes, and a far larger cache, than anything we had handled earlier.
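To make the shift concrete, here is a toy sketch of a per-user output cache with a time-to-live. It is a stdlib stand-in for what a store like Redis provides natively; the class, TTL value and keys are assumptions for illustration, not our production code.

```python
# Toy per-user cache with expiry. In production a store like Redis
# handles this via key TTLs; this sketch only illustrates the idea.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # user_id -> (expires_at, payload)

    def put(self, user_id, payload):
        """Write the latest personalized output for a user."""
        self._store[user_id] = (time.monotonic() + self.ttl, payload)

    def get(self, user_id):
        """Return cached output, or None if absent or expired."""
        entry = self._store.get(user_id)
        if entry is None:
            return None
        expires_at, payload = entry
        if time.monotonic() > expires_at:
            del self._store[user_id]
            return None
        return payload

# New output lands roughly every 10 minutes, so a TTL in that range
# keeps stale recommendations from being served indefinitely.
cache = TTLCache(ttl_seconds=600)
cache.put("user-42", ["story-1", "story-7"])
```

With tens of millions of users, each refresh cycle rewrites a large fraction of the keyspace, which is exactly the write volume the text describes.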
The challenge for us was to design a system that did all of this. A data science / ML project was thus not limited to building vectors and running models; it involved designing a complete system with data as the lead player.
When we started building our solution, we found that our decisions needed to cater to three facets: the system, the data and the team. I will discuss our approach to each of these separately.
We had data in multiple types of databases backing our various applications, with structures ranging from tabular to document to key-value. We had also decided to use Hadoop-ecosystem frameworks such as Spark and Flink for our processing. We therefore chose HDFS as the storage system for analytics.
We built a three-tier data storage system.
- Raw Data Layer: This is essentially our data lake and the foundation layer. Data is ingested into it from all our sources, both databases and Kafka streams.
- Cleaned / Transformed / Enriched Data Layer: This layer stores data in structures that are directly consumable by our analytics and machine learning applications. Jobs take data from the lake, clean it and transform it into standardised structures, creating Primary Data; other jobs merge in changes to maintain an updated state. Primary Data is further enriched to create Secondary or Tertiary Data. Jobs also create and save feature vectors in this layer, designed for reuse across multiple algorithms. For example, the content feature vector is used for section/topic classification; the same vector, enhanced over time with consumption information, was used for newsletter candidate selection and recommendation.
- Processed Output Layer: Analytics and model outputs are stored in this layer, along with the trained models themselves for subsequent use.
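One way to keep the three layers navigable is a consistent, date-partitioned path convention on HDFS. The layout below is an assumption for illustration; the layer names, base path and datasets are not our actual scheme.

```python
# Illustrative path convention for the three storage layers.
# Base path, layer names and dataset names are hypothetical.
from pathlib import PurePosixPath

BASE = PurePosixPath("/data")

def layer_path(layer, dataset, dt):
    """Build a date-partitioned path under one of the three layers."""
    assert layer in {"raw", "primary", "output"}
    return BASE / layer / dataset / f"dt={dt}"

raw = layer_path("raw", "user_activity", "2024-01-15")
primary = layer_path("primary", "content_features", "2024-01-15")
output = layer_path("output", "recommendations", "2024-01-15")
```

A fixed convention like this lets ingestion, processing and consumption jobs agree on locations without hard-coding paths in each job.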
All the jobs and applications we built catered to data ingestion, processing or output consumption, so we built a three-tier application layer to match.
- Data Ingestion Layer: This layer includes batch jobs to import data from RDBMS and document storage, for which we used Apache Sqoop. A set of jobs ingests data from Kafka message streams, for example user activity data: an Apache Netty based REST API server collects activity events and pushes them to Kafka, and Apache Flink jobs consume them, generate basic statistics and push the data on to HDFS.
- Data Processing Layer: We used Apache Spark for all our processing: cleaning, enrichment, feature-vector building, ML models and model-output generation. The jobs are written in Java, Scala and Python.
- Result Consumption Layer: Processed output was pushed to RDBMS and Redis for consumption, using jobs built on Spark or Sqoop. The output is exposed through Spring Boot REST API endpoints, and the same results were also published to event streams for downstream processing and consumption.
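As a toy illustration of the "basic statistics" step in the ingestion layer, here is a stdlib-only aggregation over a stream of activity events. The event schema and field names are assumptions; the real jobs ran on Flink, not plain Python.

```python
# Stdlib sketch of stream statistics: count views per content item.
# Event schema ("type", "content_id") is illustrative only.
from collections import Counter

def aggregate(events):
    """Tally view events per content item from an activity stream."""
    views = Counter()
    for event in events:
        if event["type"] == "view":
            views[event["content_id"]] += 1
    return views

stream = [
    {"type": "view", "content_id": "c1"},
    {"type": "click", "content_id": "c1"},
    {"type": "view", "content_id": "c2"},
    {"type": "view", "content_id": "c1"},
]
stats = aggregate(stream)
```

In the real pipeline the same shape of computation runs continuously over Kafka, with results written to HDFS alongside the raw events.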
The team was the most crucial aspect for the success of the entire enterprise. Two needs had to be fulfilled:
- Skill in understanding and applying machine learning principles and tools.
- Knowledge of our domain: a deep understanding of our content types, the important aspects of a piece of content, how it matures and dies, and what affects it.
When it became known that we were planning ML-based products, many people in our existing team wanted to be part of the initiative, and it was important for us to cater to their aspirations too.
Moreover, the overall system design meant there were two distinct parts to the problem: the core ML section, and the peripherals, which were more like good old software engineering.
We decided to build our team from a combination of two sets of people:
- Data science experts, whom we hired. They were entrusted with the data science part of the puzzle, and they also taught and mentored the rest of the team.
- A system development team, picked from our existing staff. They built the ingestion pipelines, stream processing engines, output consumption APIs and so on.
By drawing on our existing team, we could get the ingestion pipelines under development while we were still hiring the data science experts. Figuratively speaking, we were able to kick-start work from day one.
As our experience illustrates, building a bunch of applications to train models and generate output is only the beginning. Building a system and a team to harness them is an entirely different proposition.