While developing Machine learning models to solve different problems like Fraud detection, Smart routing, forecasting, etc. at Razorpay we faced different challenges. One of the major challenge was the realtime feature generation and ML model serving in real-time at scale.
As a leading payment processor in India, we process millions of transactions each day and receive billions of different events in our real-time streaming engine. As the digital payment processor, we get few milliseconds to generate features and predict the results from our AI models.
In this blog post, we will describe how we are using AI at scale and using ML models in a real-time stream. We are using Apache Flink as our core engine, Kafka as data queue and control stream, HDFS and S3 as the raw data lake, xgboost as classification models, NLP in address parsing and some micro models.
For achieving this in real-time and on the high scale we developed our Data intelligent platform Mitra. It’s based on Kappa+ architecture where we process all data on streams. Our core engine is based on Apache Flink with Kafka as a data queue and Rocksdb as in-memory states. We use Kafka for both Data flow and as a control stream to send dynamic control signals to our platform.We have a lot of other components in our Mitra platform like Graph DB, ML Model server, Dynamic rule engine on streams and data lake.
Introduction to Mitra
Razorpay users generate a lot of data events every day, which is a huge asset for us. We leverage the data every day in both strategic decisions and product intelligence to better serve users and fight fraud. Flink allows us to process data at a large scale in real-time.
Key features of Mitra :
- Predict results within 200 milliseconds in the distributed environment
- Generate Hundreds of features on the fly during model serving
- Serve results from deployed ML models
- Dynamic rule engine on Flink streams
We heavily use Flink’s in-memory states, CEP (Complex event processing) and Async IO to achieve this. We have more than 50 tasks in our Flink application.
Why Apache Flink?
- Low latency, we have to generate the results in near real-time at scale.
- As we receive the events asynchronously so there is a need for CEP (Complex event processing) and event time streams to handle event sequence.
- We need to maintain the data history for our AI features and we can not afford to go outside from our application to maintain low latency. Therefore, we use Flink’s in-memory states heavily.
- As some-time, we need to query external services like graph DB for the community features we use Async IO.
- Automatic Checkpointing for fault tolerance.
- Active community support and active development. For example, Flink just introduced the most awaited features schema evolution and State TTL.
ML Model Training and Deployment on scale
As we have to predict results in milliseconds by generating hundreds of features on the fly and predict from our ML models, Scalability and low latency play an important role in our platform. Feature generation and prediction from the model we scale well on Apache Flink. But for the serving, we have to scale the model server too for low latency and handle the load.
Generally, on a small scale, people use a single server or cluster for the model training and serving. But this doesn’t scale well on load. So why we separate out our training and serving servers for the following reasons :
Resource Allocation: Most of the time during training we allocate maximum resources to training due to that serving requests suffer and start failing.
Network Load: As during training we pass features in bulk so it chocks the network and starts impacting serving requests.
Separate scalability: After separation, we can scale the training and serving cluster according to load and requirements.
In the near future, we will continue our work on Mitra to enhance its functionality and scalability. We will also develop functionality for online learning on the scale. We will keep working on our Data science and Mitra platform As we have access to huge Data and have a lot of challenges that we can solve through Data intelligence.
If you are interested in a highly scalable and real-time data intelligence platform, consider applying for one of the available roles in Razorpay! We are a data-first company which believes in solving challenges and decisioning using data intelligence.
If you have any questions or feedback you’d like to share, please get in touch.
Apache Flink, Flink®, Apache®, the squirrel logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.