CapFlow: A scalable ML framework for training and serving CRM model operations

Biswa G Singh
Capillary Technologies
Sep 14, 2021

Capillary Technologies provides a retail CRM, loyalty, and e-commerce SaaS platform that helps businesses grow amidst changing consumer expectations. AI models are crucial to creating personalised engagement: selecting the right audience, pairing each individual with the right product and offer, and engaging them on the right channel at the right time. To power personalised recommendations, insights, decision-making, and several other retail CRM predictions, we have built many models for our customers. Models such as customer churn propensity, customer transaction propensity, campaign response propensity, product cross-sell recommendation, product repurchase recommendation, customer lifetime value prediction, fraud detection, communication channel propensity, and campaign offer/promotion propensity, to name a few, need to be trained and refreshed regularly for 200+ clients. We have hundreds of customers and terabytes of data that we consume to train and serve these models through online and offline inference. To operate at this scale with continuous training and delivery of results, we built our own MLOps framework, CapFlow.

Challenges:

  • 30+ machine learning models for 200+ brands
  • Efficient scaling to large volumes of data: customers with billions of transactions and millions of users
  • A large brand: ~80 million customers and ~1 billion transactions every month
  • A medium-sized brand: 20–30 million customers and ~100 million transactions every month

Apart from scale, the core objectives of the MLOps framework are:

  • Deploy techniques to monitor model performance.
  • Deploy techniques to automate the model-building process.
  • Plug new models into the suite efficiently and validate them across multiple customers.
  • Deliver model results continuously.
  • Stay flexible across model types: TensorFlow/PyTorch deep learning models, decision-tree-based boosting models, statistical/rule-based models, and ensemble models.
  • Work with diverse types of data: structured, unstructured, image, and audio.

Please refer to ml-ops.org for more information about MLOps practices and frameworks.

Solutions:

To fulfil the above requirements, we chose to build a platform on the capabilities of Airflow and Databricks.

Why Airflow?:

  • Maturity
  • Popularity
  • Simplicity for Monitoring and CICD
  • Scalability
  • Compatibility: Language, Libraries, Scheduling, Ease of testing
Airflow is the most popular solution, followed by Luigi. There are newer contenders too, and they’re all growing fast.

Why Databricks?:

  • Highly optimised Spark 3.0
  • Cost-efficient clusters supporting PySpark
  • Support for deep learning/machine learning libraries
  • Managed clusters: auto-scaling and auto-termination
  • Production job monitoring and alerts
  • Resource monitoring
  • Security: access control for notebooks, clusters, jobs, and structured data
  • Easy integration and handshaking between Apache Airflow and Databricks (REST API)

Simplified ML Pipeline with tech stack:

Overview of the CapFlow architecture

All the above modules are pipelined through Airflow on Kubernetes. We use the Databricks Jobs and Clusters APIs to manage Databricks jobs from the Airflow DAG. A training and inference scheduling framework takes care of load management and submits Airflow jobs based on the currently running active DAGs. Since we run Airflow on Kubernetes, we plan to leverage the Kubernetes executor for Airflow to scale to multiple DAGs (Directed Acyclic Graphs) horizontally.
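To make the Airflow–Databricks handshake concrete, here is a minimal, hypothetical sketch of building a Jobs-API-style `runs/submit` payload that an Airflow task could POST to the Databricks REST API. The job names, notebook path, node type, and cluster sizes are illustrative, not CapFlow's actual configuration.

```python
import json

def build_submit_payload(model_name: str, org_id: int, notebook_path: str) -> dict:
    """Build a Databricks runs/submit-style request body for one training run."""
    return {
        "run_name": f"capflow-{model_name}-{org_id}",
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",  # a Spark 3.0-era runtime
            "node_type_id": "i3.xlarge",
            # Autoscaling keeps the cluster cost-efficient under varying load
            "autoscale": {"min_workers": 2, "max_workers": 8},
        },
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": {"org_id": str(org_id), "model": model_name},
        },
    }

payload = build_submit_payload("churn_propensity", 42, "/capflow/train")
print(json.dumps(payload, indent=2))
```

An Airflow task would send this payload to the Databricks REST endpoint (or hand it to the Databricks provider's operator) and then poll the run state until completion.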

CapFlow pipeline components:

The workflow components

Following is a description of the different components of the CapFlow pipeline, as shown in the figure above:

(A) Data Access Platform:

The Data Access Platform (DAP) is a Spark-based data reading and preprocessing platform.

  • DAP is developed to handle different data sources:
      • Hive on an S3 data lake
      • CSV
      • SQL
      • NoSQL
      • Image URLs
  • It exposes a uniform structured table/view API to the subsequent layers, irrespective of the data source, giving data processing and training a consistent structure to work against.
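The uniform-view idea can be sketched as follows. This is a hedged, hypothetical illustration (the names are not the actual DAP SDK): each source type registers its own adapter, but every adapter returns the same tabular structure, modelled here as a list of column-keyed dicts.

```python
import csv
import io

def read_csv_source(payload: str):
    """Adapter for CSV text: returns rows as column-keyed dicts."""
    return list(csv.DictReader(io.StringIO(payload)))

# Hive/SQL/NoSQL/image-URL adapters would register here too,
# each returning the same row-dict structure.
READERS = {"csv": read_csv_source}

def load(source_type: str, payload):
    """Dispatch to the right adapter; callers see one uniform view."""
    return READERS[source_type](payload)

table = load("csv", "user_id,amount\n1,250\n2,90\n")
print(table)
```

Downstream layers only ever see the uniform table, so swapping the storage backend does not ripple into feature generation or training code.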

(B) Data Cleaning and Processing:

Data preprocessing and feature generation are two of the most crucial steps in any end-to-end ML pipeline. We have developed a separate data-preprocessing module in Spark on Scala that handles data preprocessing and feature generation for all of our ML models.

The data processing module communicates with the DAP SDK to fetch the appropriate data; in this phase it mostly reads structured data produced by the DAP. We have incorporated various feature generation techniques such as recursive feature generation, RFM (recency, frequency, monetary) features, user–item frequency matrices, and many other features that are important for model building.
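As an illustration of the RFM idea, here is a toy recency/frequency/monetary computation in pure Python. In CapFlow this kind of aggregation runs in Spark over billions of rows; this sketch only shows the feature logic.

```python
from datetime import date

def rfm_features(transactions, as_of):
    """transactions: iterable of (customer_id, txn_date, amount) tuples."""
    feats = {}
    for cid, txn_date, amount in transactions:
        f = feats.setdefault(cid, {"recency": None, "frequency": 0, "monetary": 0.0})
        f["frequency"] += 1            # number of transactions
        f["monetary"] += amount        # total spend
        days = (as_of - txn_date).days
        if f["recency"] is None or days < f["recency"]:
            f["recency"] = days        # days since the most recent transaction
    return feats

txns = [
    (1, date(2021, 8, 1), 120.0),
    (1, date(2021, 9, 1), 80.0),
    (2, date(2021, 7, 15), 40.0),
]
feats = rfm_features(txns, as_of=date(2021, 9, 10))
print(feats)
```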

Prior to feature generation, several data cleaning techniques are applied:

  1. Outlier removal
  2. Z-score and IQR based anomaly removal
  3. Frequency-based outlier removal in categorical variables
  4. Time- and threshold-based data filtering
  5. Product attribute selection based on the fill rate of columns
  6. Missing value prediction
  7. Regex-based cleaning of text data
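Two of the steps above can be sketched on a single numeric column. The thresholds here (z > 3, 1.5 × IQR) are common defaults, not necessarily the values CapFlow uses in production.

```python
import statistics

def zscore_filter(values, threshold=3.0):
    """Keep values within `threshold` standard deviations of the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma <= threshold]

def iqr_filter(values, k=1.5):
    """Keep values within the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]

amounts = [100, 102, 98, 101, 99, 5000]
cleaned = iqr_filter(amounts)
print(cleaned)  # the 5000 spike is dropped
```

Note that on tiny samples a single extreme value can inflate the standard deviation enough to mask itself from the z-score filter, which is one reason to keep both techniques available.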

Spark with Scala offers strong concurrency support, which is key to parallelising the processing of large datasets, and several of Spark’s high-performance data frameworks are written in Scala. We run our heavy data preprocessing and feature generation on Databricks clusters. Databricks supports autoscaling, which lets clusters resize automatically based on workload, making the setup cost-efficient.

(C) Model Training:

Once the business problem is identified, we collect the relevant data points that can be used for feature engineering. Feature selection can be a recurring process in which we choose the most important features and train on them. Once the important features are selected, cross-validation and auto-tuning ensure that the training process is robust and scales to multiple customers without much intervention. The algorithm used for model building depends on the type of target we are trying to predict. The trained model is stored at an S3 path, and the path is written to the database; this model is then used for inference, where we load the pre-trained model and predict for given data points. We have built training libraries around the business-level model objectives, and they support both deep learning and machine learning capabilities.
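The cross-validation and auto-tuning step can be sketched as a small grid search; `train_and_score` below stands in for the real model library call (boosting, deep learning, etc.), and the grid and toy scorer are chosen purely so the example runs anywhere.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(n_samples, params, train_and_score, k=3):
    """Average the scores of training on k-1 folds and scoring on the held-out fold."""
    scores = []
    for fold in kfold_indices(n_samples, k):
        train = [i for i in range(n_samples) if i not in fold]
        scores.append(train_and_score(train, fold, params))
    return sum(scores) / len(scores)

def auto_tune(n_samples, grid, train_and_score):
    """Pick the hyperparameter setting with the best cross-validated score."""
    return max(grid, key=lambda p: cross_validate(n_samples, p, train_and_score))

grid = [{"max_depth": d} for d in (3, 5, 8)]
# Toy scorer that happens to prefer max_depth=5; a real one would fit and evaluate a model.
toy_scorer = lambda train_idx, test_idx, p: 1.0 - abs(p["max_depth"] - 5) * 0.1
best = auto_tune(9, grid, toy_scorer)
print(best)
```

Because the selection loop is decoupled from the model library, the same harness can tune a gradient-boosted tree for one brand and a deep model for another without intervention.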

(D) Feature and Model Storage:

Once the training process is over, we deploy the model automatically based on thresholds and rules, and the version is bumped to the newly built model. The feature columns selected during training are also stored for reference during inference, along with some non-model-specific parameters kept for future reference.
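A hypothetical shape for this metadata (field names, paths, and the AUC threshold are illustrative, not CapFlow's actual schema) is a per-model registry record holding the S3 path, the selected feature columns, a version, and a rule-based deploy decision:

```python
def register_model(registry, model_name, org_id, s3_path, feature_columns, auc, min_auc=0.70):
    """Append a new versioned record; auto-deploy only if the metric clears the threshold."""
    versions = registry.setdefault((model_name, org_id), [])
    record = {
        "version": len(versions) + 1,
        "s3_path": s3_path,                 # where the trained artifact lives
        "feature_columns": feature_columns, # reused verbatim at inference time
        "auc": auc,
        "deployed": auc >= min_auc,         # threshold/rule-based auto-deploy
    }
    versions.append(record)
    return record

registry = {}
v1 = register_model(registry, "churn", 42, "s3://models/churn/42/v1",
                    ["recency", "frequency"], auc=0.81)
v2 = register_model(registry, "churn", 42, "s3://models/churn/42/v2",
                    ["recency", "frequency"], auc=0.55)
print(v1["deployed"], v2["deployed"])
```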

(E) Model Inferencing:

In our case, inference is the process of generating results or predictions by applying the already-trained model. The training step produces the optimum features and parameters; we then run the model over fresh data to generate results and write them to the appropriate destination for consumption by other services. We generally keep reusing a trained model until we see a dip in its performance, at which point we retrain.
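The batch inference loop can be sketched as follows (all names hypothetical): load the persisted model and the feature columns saved at training time, score each row, and emit results for downstream services.

```python
def batch_infer(model, feature_columns, rows):
    """Score each row using exactly the feature columns saved during training."""
    results = []
    for row in rows:
        features = [row[c] for c in feature_columns]
        results.append({"customer_id": row["customer_id"], "score": model(features)})
    return results

# Stand-in for the real loaded model artifact (e.g. a boosted tree or a network)
toy_model = lambda feats: min(1.0, sum(feats) / 100.0)

rows = [
    {"customer_id": 1, "recency": 9, "frequency": 2},
    {"customer_id": 2, "recency": 57, "frequency": 1},
]
preds = batch_infer(toy_model, ["recency", "frequency"], rows)
print(preds)
```

Reusing the stored feature-column list guards against training/serving skew: inference fails loudly if an expected column is missing rather than silently scoring on the wrong inputs.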

Batch Inferencing

Conclusion:

MLOps is a set of practices that companies follow to manage, scale, and productise their machine learning models. The practices and tool preferences vary from organisation to organisation based on requirements. At Capillary, we built CapFlow to suit the requirements we needed to fulfil. We did a thorough study of those requirements and tried to reuse the tools and frameworks Capillary already trusted in-house, leveraging open-source tools and frameworks as much as possible to build this infrastructure.

Following is the future pipeline to enhance CapFlow capabilities:

  • Model drift detection
  • Robust feature selection techniques
  • Load-based cluster launch
  • Distributed training using Horovod and MMLSpark
  • Addition of vision and audio models to the same framework (these models are part of our smart store product)
  • A more streamlined and efficient (both computationally and cost-wise) path to real-time recommendations based on users’ streams of clicks
  • Coupling the entire pipeline with a UI to trigger requests, turning it into an ML-Suite product

Co-authors: Anuja Prakash Kolse, Sankalp Apharande
