The number of data scientists has grown a lot in the past few years. Ten years ago we saw the rise of big data; now we see the rise of people who gather insights from big data — the so-called “data scientists.”
Data scientists are the fortune tellers of the new era.
We see every company creating big data or data science departments, either to understand their clients and maximize profits or to disrupt their industry. At Feedzai, we started with fraud detection. After that, we looked toward machine learning as a disruptive technology that would get us ahead of the curve in the market.
New Feedzai clients typically find themselves facing one of two scenarios:
- The client is looking for their first fraud detection solution. They know little about the topic and trust us to help them build an effective solution for their use case. They’re also looking to us to teach them about the solution.
- The client already has a data science team (sometimes really small, sometimes a full department) and a solution in place, but they want to migrate to Feedzai’s solution.
For clients facing the second scenario, we found their data science teams experienced friction when it came to adopting our platform — mostly because they already had a ton of custom scripts, tools, and even deployed machine learning models that they wanted to keep for the first phase. It made sense. After all, why would you throw away years of experience?
These were some of the main reasons why we developed our OpenML API, a public API that enables the seamless integration of any machine learning model into the Feedzai platform.
Through our research, we found that in most cases, client teams were working in R and Python. So we started developing OpenML integrations for those common use cases.
We’ve already covered our approach and challenges when integrating R code into our platform, so in this article we’ll focus on the challenges of integrating Python into the Feedzai platform.
Java + Python = ❤
Python is an interpreted, high-level, general-purpose programming language. Created in 1991, Python has a design philosophy that emphasizes code readability. It provides constructs that enable clear programming on both small and large scales.
Python was adopted by data scientists for its readability and ease of use. There are many libraries in Python that address a wide range of common data science tasks. There are also many widely used machine learning frameworks, and one of the most widely adopted is scikit-learn.
With this in mind, and similar to what we did with our R OpenML provider, we built a Python provider that is able to integrate both with the scikit-learn ML framework (with minimal effort) and with custom Python code.
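As an illustrative sketch (not Feedzai's actual API), this is the kind of scikit-learn model such a provider can pick up: a classifier trained offline that exposes a class distribution through predict_proba. The feature set and data here are made up.

```python
# Illustrative only: a toy scikit-learn fraud classifier of the kind a
# provider like this can load. Features: [amount, hour_of_day].
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[10.0, 14], [15.0, 10], [9000.0, 3], [8500.0, 4]])
y = np.array([0, 0, 1, 1])  # 1 = fraudulent

model = LogisticRegression().fit(X, y)

# Scoring a new transaction yields a class distribution, which is the
# shape of output a fraud-scoring platform consumes.
proba = model.predict_proba(np.array([[9200.0, 2]]))[0]
```

A serialized model like this (e.g. via joblib) is what client teams typically already have and want to keep.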
As always, integrating two different programming languages poses challenges, so we researched the best options available:
- Jython was one of the first solutions we found. It is an implementation of Python seamlessly integrated with the Java platform, similar to other ports of programming languages to the JVM, such as JRuby.
- JEP was the other viable solution we found, as it embeds CPython through the Java Native Interface (JNI). The JNI protocol allows Java applications to call native applications and libraries written in other programming languages.
Unfortunately, we had to dismiss Jython as a viable alternative due to two big problems:
- It didn’t seem to fully support extension modules written in C, like NumPy — a widely used Python extension.
- Python 3 support is still under development.
We then went back to explore JEP as our most viable solution.
But JEP is not a perfect solution either. As with everything else in software development, every decision has its set of trade-offs. The most important ones for us are:
- Ease of installation — Since most of our clients use self-hosted machines with different configurations, different network and firewall setups, and a gazillion other subtle differences, we cannot rely on standard industry solutions, like Docker.
- Performance — As always, performance is crucial in our system. Our clients can’t wait forever to know if a given transaction is fraudulent.
By introducing JEP, we are introducing yet another set of steps when configuring a machine to run the Feedzai Platform.
When installing JEP, there’s a common problem regarding the LD_PRELOAD and LD_LIBRARY_PATH environment variables, which let you change the loading order of the operating system’s shared libraries (see more at https://blog.fpmurphy.com/2012/09/all-about-ld_preload.html). This is a known JEP issue and forces us to add an extra setup step (see https://github.com/feedzai/feedzai-openml-python#running-the-tests).
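Concretely, the workaround amounts to preloading the CPython shared library before starting the JVM. The paths, library names, and launch command below are examples only; they vary with the distribution and Python build.

```shell
# Example only: library names/paths depend on your distro and Python
# build (e.g. libpython3.6m.so vs libpython3.8.so).
export LD_LIBRARY_PATH="/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu:$LD_LIBRARY_PATH"
export LD_PRELOAD="libpython3.6m.so"

# Then launch the JVM process that embeds JEP (hypothetical command):
java -jar my-app.jar
```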
Tinkering with these low-level details is always risky. To make things worse, we need to install Python packages directly on the OS, where other users and processes may install even more packages. It’s a ticking time bomb.
To avoid unpredictable errors, we decided to go with Anaconda, which lets us manage virtual environments and thus gain some isolation. Anaconda acts similarly to rbenv and RVM, and it gives us more confidence that we won’t break other processes that depend on different versions of the same Python packages.
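A minimal sketch of that isolation step, assuming a conda installation; the environment name and package versions here are illustrative:

```shell
# Create a dedicated environment so the platform's Python packages
# cannot clash with whatever else is installed on the host.
conda create --yes --name openml-python python=3.6
conda activate openml-python
conda install --yes numpy scikit-learn
```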
Non-thread-safe dependencies
After a stable period without reported problems, our internal team hit an unexpected error. While running some tests with TensorFlow, a weird behavior appeared: the team tried to load two TensorFlow models using our OpenML Python provider, and the loading failed.
We found similar issues on GitHub. It seems that TensorFlow has dependencies that cannot be imported twice. Luckily, this problem had already been reported to JEP and can be solved using shared modules, which import the module only once across all sub-interpreters.
This allowed us to keep some isolation while reusing dependencies that were not ready for this concurrency use case.
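Shared modules essentially restore plain CPython import semantics across JEP's sub-interpreters: a module is imported once and then cached. A minimal plain-CPython illustration of that caching, using json as a stand-in for a heavyweight module like TensorFlow:

```python
import sys

# In a single CPython interpreter, imports are cached in sys.modules,
# so importing the same module twice yields the very same object.
# JEP's shared modules extend this guarantee across sub-interpreters.
first = __import__("json")
second = __import__("json")

same_object = first is second and sys.modules["json"] is first
```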
Performance
Finally, when it comes to dealing with a real-time system that scores millions of transactions, performance is always a topic we cannot forget.
One of our first concerns was this: “Can we use JEP in production under our strict SLAs?” While we noticed a performance decrease compared to our native machine learning models, JEP has been performant enough for more relaxed SLAs. We are currently conducting more exhaustive benchmarks to understand the full performance implications.
Nevertheless, allowing arbitrary code is a Pandora’s box. The code itself needs to be efficient, both in terms of memory and speed.
So far, our approach has been to point anyone using our OpenML Python provider to the existing guidelines: https://github.com/ninia/jep/wiki/Performance-Considerations. Pay particular attention to memory leaks when using custom Python code.
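For example, a custom scoring function called millions of times from Java benefits from not allocating a fresh array per call. The function below is hypothetical (a stand-in for a real model), but it shows the buffer-reuse pattern that kind of guideline encourages:

```python
import numpy as np

# Hypothetical custom scorer: one preallocated buffer is reused across
# calls instead of allocating a new array per transaction, reducing
# garbage-collection pressure under a high-throughput workload.
_features = np.empty(2, dtype=np.float64)

def score(amount, hour_of_day):
    _features[0] = amount
    _features[1] = hour_of_day
    # Stand-in for a real model: flag large amounts at odd hours.
    return 1 if _features[0] > 5000 and _features[1] < 6 else 0
```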
If you want to see the gritty details, or just give it a try, take a look at our GitHub repository, which contains the code and the documentation to assist you:
- Repository: https://github.com/feedzai/feedzai-openml-python
- OpenML Generic Python provider: https://github.com/feedzai/feedzai-openml-python/tree/master/openml-generic-python
- OpenML Scikit provider: https://github.com/feedzai/feedzai-openml-python/tree/master/openml-scikit