How fin-tech should evolve to respect data privacy

Solving data privacy in alternate data analytics through an offline distributed machine learning architecture.

Vishnu Satis
6 min read · Jul 14, 2018

“Arguing that you don’t care about the right to privacy because you have nothing to hide is no different than saying you don’t care about free speech because you have nothing to say.”

- Edward Snowden

My journey with machine learning (ML) started 5 years ago, when I first heard of a brain-inspired learning algorithm: the perceptron-based neural network. Fascinated by this amazing technology, which could learn on its own and perform human-like tasks, I started building my own ML PaaS platform. The objective was to let users create machine learning apps with ease and use them on the fly. In a span of 3 months, I integrated popular regression algorithms, different types of neural networks, data classifiers and more into the platform. The end result: a platform that enabled hundreds of users to create ML apps and use plain REST APIs to train them and make predictions.

I then joined my friends at Mho & MessAI, who were building a mobile-first Alternate Data Engine.

At the time, we needed a similar ML platform that could support real-time decision making. Even though machine learning was the last piece of our text-to-decision puzzle, its “prodigal” nature brought along problems of its own. The costs were always high, from data pre-processing and training the algorithm to obtaining the predictions. The lack of infrastructure on the client side had always restricted ML algorithms to the server side. The problem we wanted to address was enabling businesses to make decisions based on the customer data available on their devices, at an affordable price, while respecting user privacy. We came up with two flavours of a solution.

An online decision-making server

MessAI was integrated with my ML PaaS platform, enabling clients to use REST API calls to send normalised vectors (devoid of any personally identifiable information) created from customer metadata by our proprietary MessAI Parser. The trained, business-specific ML model on the server returns its output as a REST response, based on which enterprises can make customer-targeted decisions on the go. This approach ensured that no personally identifiable information of the customers was compromised. With this solution we had overcome the hurdle of customer data privacy, but was it good enough? How much would our server costs amount to? How much time would the decision making take? With a data processing and aggregation task running before the decision making, is making a network call to get the decision the best possible solution? We were in one of the most competitive niches of the fin-tech market, and both server cost and time were major concerns.
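
To make the flow concrete, here is a minimal sketch of the online flavour. The endpoint, payload fields and response shape are illustrative placeholders, not the actual MessAI API.

```python
import requests

# Normalised feature vector produced by the parser on the device;
# it carries no personally identifiable information.
vector = [0.42, 0.17, 0.88, 0.05]

# Hypothetical endpoint and payload shape for a business-specific model.
resp = requests.post(
    "https://ml-paas.example.com/v1/models/loan-eligibility/predict",
    json={"vector": vector},
    timeout=10,
)
print(resp.json())  # e.g. {"approved_probability": 0.93}
```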

Thus, we faced our next hurdle: finishing data processing, business logic/signal creation and decision making in under 30 seconds, with minimal load on the server. Overcoming this paved the way for our next solution.

Offline decision making at the client

To facilitate this change, we decided to integrate Google’s TensorFlow into our machine learning PaaS platform. We hoped that with this approach we could make decisions within the client devices and thus bring the decision-making time down to under 30 seconds. If all went as expected, we would conquer offline decision making, reduce server costs, adhere to data privacy and remove network latency, all in one go. But as you all know, ideation and design is just the first step to solving any problem. It took us days to create a proof of concept (POC) for this approach.

We set out to build a decision model that predicts loan eligibility for a user from factors like revolving credit balance, average account balance, annual income and so on. Fortunately, we got our hands on an open-source dataset that had most of the metrics we were looking for.

The major challenge was choosing the right algorithm for the dataset. As we were trying to solve a classification problem between loan approvals and rejections, we decided to go with artificial neural networks (ANNs). We chose ANNs so that we could scale the platform to more complex scenarios in customer environments. We started training the ANN with the data we had, and everything seemed to be going well. The training loss started at ~0.3 and came down to as low as ~0.06, giving an accuracy of nearly 94%. In TensorBoard, the graphs showed a steady decrease in both training and cross-validation errors.
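
For illustration, here is a minimal sketch of the kind of binary classifier we were training. The layer sizes, feature count and hyperparameters are placeholders, not our production configuration.

```python
import numpy as np
import tensorflow as tf

# Placeholder data standing in for the normalised loan features
# (revolving credit balance, average account balance, annual income, ...).
X_train = np.random.rand(1000, 8).astype(np.float32)
y_train = np.random.randint(0, 2, size=(1000, 1)).astype(np.float32)

# A small feed-forward ANN for binary classification (approve vs reject).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # approval probability
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_split=0.2)
```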

When we finally tested the model against real-world scenarios, everything went for a toss.

The model always returned very high probabilities for loan eligibility. When we varied individual features, the probability changed, but only marginally (~0.00002), and the probability of approval always stayed above ~97%. The training stats in TensorBoard assured us there was nothing wrong with the training process, so the problem had to be in the dataset, and we started digging into it. We quickly found that the dataset was highly skewed: 270,000 out of 300,000 records had fully approved loan eligibility. This was biasing the ANN heavily towards the majority class, and we had to rebalance the dataset before using it to train the model. A lot of research on imbalanced classification problems led us to the Synthetic Minority Over-sampling Technique (commonly known as SMOTE). By applying SMOTE to our data we created a balanced dataset to train the model.
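
A sketch of the rebalancing step using imbalanced-learn’s SMOTE implementation; the synthetic dataset below stands in for our skewed loan data.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Stand-in for the skewed loan dataset: ~90% approvals vs ~10% rejections.
X, y = make_classification(n_samples=30000, n_features=8,
                           weights=[0.1, 0.9], random_state=42)
print(Counter(y))  # heavily skewed towards class 1 (approved)

# SMOTE synthesises new minority-class samples by interpolating between
# existing minority-class neighbours, yielding a balanced training set.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_bal))  # both classes now roughly equally represented
```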

With all this effort, we were able to create a model that gave promising results. It showed the patterns we were hoping for: the probability of loan approval responded well to changes in individual features. Next, we had to convert the model into a frozen model file (.pb) so that we could port it to Android. Finding the right documentation was strenuous, and we ended up spending hours scrambling through numerous GitHub issues and Stack Overflow answers to find the right APIs to freeze the graph. Once frozen, we placed the .pb file in our B2B Android demo app (Meteor) to test our scenarios.
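
For context, freezing in the TensorFlow 1.x of that era meant converting the graph’s variables into constants and serialising the result. A minimal sketch, assuming a saved checkpoint and a known output node name (both placeholders here):

```python
import tensorflow as tf
from tensorflow.python.framework import graph_util

with tf.Session() as sess:
    # Restore the trained graph and its weights from a checkpoint.
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    saver.restore(sess, "model.ckpt")

    # Bake the variables into the graph as constants, keeping only the
    # subgraph needed to compute the named output node.
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["output/Sigmoid"])

    # Serialise the self-contained graph to a .pb file for Android.
    with tf.gfile.GFile("frozen_model.pb", "wb") as f:
        f.write(frozen.SerializeToString())
```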

At this point, we thought we had cracked it.

Unfortunately, our first round of end-to-end testing gave unexpected results: the model we had trained gave different output probabilities when running on Android. We quickly concluded that something was wrong with our code for freezing the model. After a lot of research, we finally managed to load the frozen model into TensorBoard, and it was obvious that several layers were missing from the graph. When we loaded the frozen model in Python and ran predictions, the results matched the incorrect Android outputs, which meant the code we were using to freeze the model was altering it. We had to keep researching to find the right way to freeze the model. Finally, we figured out a workaround that produced a frozen file giving the right results when tested in Python. We tried the same on Android, and it finally worked!
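
The parity check itself is straightforward: load the frozen graph back in Python and compare its predictions with the unfrozen model’s. A sketch, with placeholder tensor names:

```python
import numpy as np
import tensorflow as tf

# Load the frozen graph definition from disk.
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    # "input:0" and "output/Sigmoid:0" are placeholder tensor names.
    x = graph.get_tensor_by_name("input:0")
    y = graph.get_tensor_by_name("output/Sigmoid:0")
    with tf.Session(graph=graph) as sess:
        probs = sess.run(y, feed_dict={x: np.random.rand(1, 8)})
        print(probs)  # should match the unfrozen model's predictions
```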

Although the whole process was cumbersome and stressful, it was definitely worth it. We were able to create a demo app in which our MessAI SDK could take your SMS inbox as input and predict your loan eligibility completely offline. This was a breakthrough for us and for the fin-tech industry, with enterprises now able to make business decisions over customer data without compromising data privacy. Adding to this was the reduced pricing that came from lower server costs. Most of the current solutions in the market upload customer data to servers and run analytics in the cloud, compromising data privacy and incurring hefty server costs.

This is how 4 passionate tech folks from NIT Calicut used machine learning to overcome the lack of data privacy in Alternate Data in India, paving the way for a new perspective on customer targeting and instant decision making.
