Scalable machine learning with Apache Ignite, Python and Julia: from prototype to production

Peter Gagarinov
Alliedium
May 28, 2021

In the previous article, Boosting Jira Cloud app development with Apache Ignite[44], we explained the benefits of using Apache Ignite in combination with Spring Boot for building scalable and distributed backends for Jira add-ons. One important aspect of the backend design that was left without much attention there was the machine learning infrastructure. In this post I’ll explain why we started with a combination of Scikit-learn and Celery and ended up with a combination of Apache Ignite, Ray Serve[4],[5],[6], Scikit-learn and PyTorch.

Our Alliedium AIssistant Jira add-on[1] simplifies project management by automating ticket assignment, labeling and ranking by priority; it works by using ML to infer statistical patterns from existing Jira tickets. This requires a scalable backend infrastructure for online training and serving of a large number of both classical ML and Deep Learning models in parallel for different clients (treated as separate tenants). Initially we had a prototype of the ML pipeline implemented with Python’s Scikit-learn library, so it was quite natural to use the Python-based distributed task queue framework Celery as the model orchestration layer inside a Kubernetes cluster. PostgreSQL was our first choice of database (until we switched to Apache Ignite later).

As the number of users and the complexity of our ML models increased, we quickly started to feel that PostgreSQL was a limiting factor on the way to a fault-tolerant, scalable and distributed backend architecture. We could scale Celery by running more pods in Kubernetes, and we could scale the web server part of the backend by running more Spring Boot pods, but we could not scale PostgreSQL without switching to one of its distributed derivatives such as Greenplum[17], CitusDB[16] or TimescaleDB[18]. Manual sharding among different PostgreSQL instances was not an option either, as we wanted a more elegant solution that would be less painful to use in a multi-tenant setting with complex schemas, preferably without bulky ORM tools such as Hibernate. After evaluating a few alternatives we chose Apache Ignite because it

  • integrates with Java natively
  • is highly available and horizontally scalable
  • is fault-tolerant and distributed
  • supports distributed ACID transactions
  • provides both persistent and in-memory storage
  • supports SQL for distributed data (see the short sketch after this list)
  • supports user-defined distributed jobs
  • provides automatic failover (for jobs and DB connections)
  • provides Transparent Data Encryption for safety reasons
  • supports native configurations for deployment in Kubernetes
  • is free and open-source
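As a short illustration of the SQL-related points above, here is a minimal sketch using the Python thin client (pyignite); the table and its columns are hypothetical and serve only to show the flavor of the API:

```python
# Minimal sketch using the pyignite thin client; the table and columns are hypothetical.
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)  # default thin-client port of a local Ignite node

# Create a table backed by a distributed (partitioned) cache and insert a row.
client.sql(
    'CREATE TABLE IF NOT EXISTS PROJECT_TICKET '
    '(id INT PRIMARY KEY, project VARCHAR, assignee VARCHAR) '
    'WITH "template=partitioned"'
)
client.sql("INSERT INTO PROJECT_TICKET (id, project, assignee) VALUES (1, 'AIS', 'alice')")

# The same SQL works regardless of how many cluster nodes actually hold the data.
for row in client.sql('SELECT project, assignee FROM PROJECT_TICKET WHERE id = ?', query_args=[1]):
    print(row)

client.close()
```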

Switching from Celery to Apache Ignite + Ray Serve

We decided to keep Celery for the time being (though it was no longer a perfect fit, since Apache Ignite had everything we needed for running distributed user-defined jobs on the cluster; see the next section for details). The problem was that Celery didn’t have any native Java API, while the Apache Ignite Python thin client API didn’t have all the features we needed (see https://ignite.apache.org/docs/latest/thin-clients/getting-started-with-thin-clients). To work around this limitation we wrapped the full-featured thick Java client for Apache Ignite with the Python-Java bridge Py4J (https://www.py4j.org/).
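In practice the bridge boils down to starting a Py4J GatewayServer inside the JVM process that holds the Ignite thick client and calling it from Python. Below is a simplified sketch of the Python side only; the getIgnite() entry-point method and the cache name are hypothetical rather than the actual add-on code:

```python
# Simplified sketch of calling the Ignite thick Java client from Python via Py4J.
# It assumes a JVM process is already running a py4j.GatewayServer whose entry point
# exposes a started Ignite instance (the entry-point method name is hypothetical).
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()                    # connects to the GatewayServer on localhost:25333
ignite = gateway.entry_point.getIgnite()   # hypothetical accessor returning org.apache.ignite.Ignite

# From here on we are calling the regular Ignite Java API, proxied by Py4J.
cache = ignite.getOrCreateCache('ml-models')
cache.put('tenant-42/model-version', 7)
print(cache.get('tenant-42/model-version'))

# Distributed jobs can be submitted the same way, e.g. via ignite.compute(),
# as long as the corresponding Java classes are deployed on the cluster nodes.
```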

Celery wasn’t a great fit for building a distributed ML serving engine simply because it is a generic task queue rather than a system specialized for ML serving. Ray Serve seemed a much better choice (see the Machine Learning Serving is Broken article[5] for details). However, Celery provided the ability to run generic, not necessarily ML-related, tasks on the cluster, which was important for us because PostgreSQL, being a pure SQL database, didn’t have any features for running user-defined jobs. The strong sides of Celery for our use case were

  • Celery is written in Python and is thus easier to integrate with Python-based ML frameworks (see the task sketch after this list)
  • An “At Least Once” delivery guarantee for Celery message queues (implemented via RabbitMQ)[38]
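For context, a generic Celery task in such a setup looks roughly like the sketch below; the broker URL, task name and task body are placeholders rather than our actual code:

```python
# Minimal sketch of a generic Celery task backed by RabbitMQ; names and URLs are placeholders.
from celery import Celery

app = Celery('alliedium_tasks', broker='amqp://guest:guest@rabbitmq:5672//')

@app.task(acks_late=True)  # acknowledge only after the task finishes ("at least once" semantics)
def retrain_tenant_models(tenant_id: str) -> str:
    # ...load the tenant's tickets, retrain the Scikit-learn models, persist them...
    return f'models retrained for tenant {tenant_id}'

# A producer (e.g. the web backend) would enqueue the task with:
#   retrain_tenant_models.delay('tenant-42')
```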

Apache Ignite, on the other hand, has built-in support for running distributed jobs on a cluster, which eliminated the need for Celery. We appreciated the following strong sides of Apache Ignite as a compute grid:

  • Native Java API for messages and distributed computing tasks
  • Built-in distributed basic ML models
  • Automatic connection failover for both thick and thin clients

Despite Apache Ignite having a set of built-in ML models, we decided not to use them (for the reasons explained below) and to keep using Python-based pipelines. The other problems we noticed with Apache Ignite as a compute grid are:

  • Weaker delivery guarantees; not suitable for critical messages (e.g. in finance)[42]
  • Built-in ML models lack certain features we need for our use case
  • The Python thin client supports neither the messaging nor the compute API[34]
  • Using the thick client API from Python requires the Py4J Python-Java bridge[36]

Celery was far from perfect either, because it

  • Requires a separate message broker (RabbitMQ) for submitting tasks[39]
  • Requires a separate results backend for large results[39]
  • Provides no out-of-the-box pure Java API[40]
  • Requires special care for RabbitMQ auto-failover when not run inside K8s[41]
  • Offers automatic connection failover only inside Kubernetes

In the end the strong sides of Apache Ignite outweighed the weak ones, as we found the following workarounds:

  • Weaker delivery guarantees were handled on the front-end side inside JavaScript
  • The thick-client Java API wrapped with Py4J was used instead of the less-featured Python thin client API

While we no longer needed Celery for running generic distributed computing tasks, we still needed a place to train and serve our ML models. Thus, after releasing a version with a combination of Celery and Apache Ignite, we decided to replace Celery with a combination of Apache Ignite’s built-in distributed computing capabilities and Ray Serve for training and serving ML models.
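To give a feel for what serving a model with Ray Serve involves, here is a minimal sketch of exposing a pickled Scikit-learn model over HTTP. It uses the current Ray Serve deployment API (which differs from the 1.x API we started with), and the model path and payload format are hypothetical:

```python
# Minimal Ray Serve sketch for serving a pickled Scikit-learn model over HTTP.
# The model path and the JSON payload format are hypothetical.
import pickle
import numpy as np
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)  # Ray Serve scales the replicas across the Ray cluster
class TicketClassifier:
    def __init__(self, model_path: str):
        with open(model_path, 'rb') as f:
            self.model = pickle.load(f)  # e.g. a fitted sklearn Pipeline

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        features = np.asarray(payload['features'], dtype=float).reshape(1, -1)
        return {'label': self.model.predict(features).tolist()}

# Deploys the model behind http://127.0.0.1:8000/ by default.
serve.run(TicketClassifier.bind('ticket_classifier.pkl'))
```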

Evaluating Julia language as Python alternative for ML

While some of our models are Deep Learning-based and run on GPU, most of our models are still in the classical ML camp and run on CPU. Training speed is a very important factor for us, as it directly impacts user wait times and, consequently, the user experience. Scikit-learn models run quite fast, but we wanted to see if we could make the classical ML models even faster and started to evaluate the following alternatives: Linfa (Rust), CuML (RAPIDS), MLJ (Julia)[48] and Apache Ignite ML.

Linfa didn’t seem mature enough for our case (e.g. it didn’t have a multinomial logistic regression classifier at the time of writing this post), though it turned out to be very fast.

CuML requires a GPU, which is why we didn’t seriously consider it.

Out of all four options we considered, Julia’s MLJ was the most promising one. Firstly, we were already familiar with Julia from a few previous projects. Secondly, Julia was built for speed and doesn’t have the overhead introduced by calling NumPy functions (written in C) from high-level Python functions in Scikit-learn.

As for Apache Ignite ML, it wasn’t built for fast ML model training on small datasets (~10k observations), but we decided to include it in our comparison anyway.

Here are the results of our comparison between Scikit-learn (Python), MLJ (Julia) and Apache Ignite ML:

We noticed that

  • ML in both Julia and Python is much faster than Apache Ignite ML for our case (~8–10k observations)
  • Apache Ignite ML offers a limited set of optimization solvers (e.g. only LogisticRegressionSGDTrainer for LogisticRegressionModel, while Scikit-learn offers 5 solvers for logistic regression)
  • Apache Ignite ML has neither nested cross-validation[51] nor stratified cross-validation[52], both of which Scikit-learn provides (see the sketch below)

and concluded that Python is stronger than Julia’s MLJ when it comes to

  • The Apache Ignite thin client for Python (there is no such client for Julia)
  • Ray Serve (e.g. Genie.jl + Dagger.jl is not an equivalent replacement)[4],[5],[6]
  • A much more mature ML ecosystem
  • Raw speed in some cases: Scikit-learn is sometimes faster than MLJ
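To make the cross-validation point above concrete, here is a minimal Scikit-learn sketch (on synthetic data, purely for illustration) of the kind of setup Apache Ignite ML lacks: a multinomial logistic regression with a selectable solver evaluated under stratified cross-validation:

```python
# Scikit-learn sketch: multinomial logistic regression with a selectable solver,
# evaluated with stratified k-fold cross-validation (synthetic data for illustration only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=10_000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0)

model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class ratios per fold

scores = cross_val_score(model, X, y, cv=cv, scoring='f1_macro')
print(scores.mean(), scores.std())
```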

Python and Julia go head to head for

  • Calling Apache Ignite thick client Java API: Py4J (Python)[36] vs JavaCall.jl (Julia)[53]
  • Calling Apache Ignite thick client C++ API: Cython (Python)[54] vs CxxWrap.jl (Julia)[55]

Julia in its turn is stronger than Python when it comes to

  • Being a more flexible language (custom models and transformations do not have to be written in C to be fast; they can be written in Julia itself)
  • Easier parallelism (native threads in Julia vs GIL in Python)[56]

Based on the results of this comparison, the decision was to stick with Python (Scikit-learn for classical ML algorithms and PyTorch for Deep Learning models) for now and to keep an eye on the Julia ML ecosystem, most notably the MLJ framework[48].

As for Apache Ignite ML, it showed lower performance for our use case (~10k observations) as expected, simply because it was designed for large datasets distributed across multiple nodes of an Apache Ignite cluster and wasn’t optimized for small datasets like ours.

The Final Architecture Layout

After switching from PostgreSQL + Celery to Apache Ignite + Ray Serve we ended up with the following architecture layout of the backend:

The blue hexagons represent Apache Ignite nodes (each running in a separate Kubernetes pod when the cloud deployment is used), while the green hexagons represent the Ray Serve nodes. In our case both blue and green pods run inside the same Kubernetes cluster, and the grey hexagons represent the GPU-enabled Deep Learning containers managed by the Amazon SageMaker service. The diagram above hides certain aspects of the design, such as the Spring Boot-based web server pods and the load balancer in front of them. When an Alliedium client performs an action requiring training of an ML model (or triggering inference for an already trained model), the load balancer forwards the request to one of the Spring Boot Kubernetes pods, which in its turn forwards it to one of the Ray Serve pods. If the request is related to a classical ML model, the Ray Serve pod handles it all by itself, running all the calculations on CPU. Deep Learning model-related requests are forwarded by Ray Serve to the GPU-enabled Amazon SageMaker pods.
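The routing step inside a Ray Serve deployment can be sketched roughly as follows; the deployment structure, the SageMaker endpoint name and the request schema are placeholders for illustration, not our actual configuration:

```python
# Rough sketch of the routing step inside a Ray Serve deployment:
# classical ML requests are handled in-process on CPU, Deep Learning requests
# are forwarded to a GPU-backed Amazon SageMaker endpoint (all names are placeholders).
import json
import boto3
from ray import serve
from starlette.requests import Request

@serve.deployment
class ModelRouter:
    def __init__(self):
        self.sagemaker = boto3.client('sagemaker-runtime')
        self.classical_models = {}  # tenant_id -> fitted Scikit-learn model, loaded elsewhere

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        if payload['model_type'] == 'classical':
            model = self.classical_models[payload['tenant_id']]
            return {'label': model.predict([payload['features']]).tolist()}
        # Deep Learning: delegate to the GPU-enabled SageMaker endpoint.
        response = self.sagemaker.invoke_endpoint(
            EndpointName='alliedium-dl-endpoint',   # placeholder endpoint name
            ContentType='application/json',
            Body=json.dumps({'features': payload['features']}),
        )
        return json.loads(response['Body'].read())

serve.run(ModelRouter.bind())
```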

References

[1] Alliedium AIssistant Jira App
[2] Johansson Lovisa, Running Celery with RabbitMQ, www.cloudamqp.com, 2019
[3] Scikit-learn: Machine Learning in Python, scikit-learn.org, ver. 0.24, 2020
[4] Ray Serve: Scalable and Programmable Serving, docs.ray.io, ver. 1.3.0, 2021
[5] Mo Simon, Machine Learning Serving is Broken: And How Ray Serve Can Fix it, medium.com, 2020
[6] Oakes Edward, The Simplest Way to Serve your NLP Model in Production with Pure Python, medium.com, 2020
[7] How would you compare Scikit-learn with PyTorch?, www.quora.com, 2020
[8] K Dhiraj, Why PyTorch Is the Deep Learning Framework of the Future, medium.com, 2019
[9] Chiniara Dan, Installing PostgreSQL for Mac, Linux, and Windows, medium.com, 2019
[10] Atlassian Connect Spring Boot, bitbucket.org, ver. 2.1.6, 2021
[11] Oliveira Junior, The best and easy way to handle database migrations (version control), medium.com, 2019
[12] Gopal Vineet, Move fast and migrate things: how we automated migrations in Postgres, medium.com, 2019
[13] PostgreSQL: Appendix D. SQL Conformance, www.postgresql.org, ver. 13.2, 2021
[14] PostgreSQL vs SQL Standard, wiki.postgresql.org, 2020
[15] Kuizinas Gajus, Lessons learned scaling PostgreSQL database to 1.2bn records/month: Choosing where to host the database, materialising data and using database as a job queue, medium.com, 2019
[16] Slot Marco, Why the RDBMS is the future of distributed databases, ft. Postgres and Citus, www.citusdata.com, 2018
[17] Knoldus Inc., Want to know about Greenplum?, medium.com, 2020
[18] TimescaleDB 2.0: A multi-node, petabyte-scale, completely free relational database for time-series, blog.timescale.com, 2020
[19] Chen Neil, Rise and Fall for an expected feature in PostgreSQL — Transparent Data Encryption, highgo.ca, 2020
[20] PostgreSQL Transparent Data Encryption, www.cybertec-postgresql.com, 2021
[21] Huang Cary, Approaches to Achieve in-Memory Table Storage with PostgreSQL Pluggable API, highgo.ca, 2020
[22] Westermann Daniel, Can I put my temporary tablespaces on a RAM disk with PostgreSQL?, blog.dbi-services.com, 2020
[23] PostgreSQL: 22.6. Tablespaces, www.postgresql.org, ver. 13.2, 2021
[24] Ringer Craig, Putting a PostgreSQL tablespace on a ramdisk risks ALL your data, 2014
[25] Apache Ignite: ACID Transactions with Apache Ignite, ignite.apache.org, ver. 2.10.0, 2021
[26] Apache Ignite: Cluster Snapshots: Current Limitations, ignite.apache.org, ver. 2.10.0, 2021
[27] Ignite in-memory + other SQL store without fully loading all data into Ignite, Apache Ignite Users, 2020
[28] SQL Conformance, ignite.apache.org, ver. 2.10.0, 2021
[29] Apache Ignite: SQL Transactions, ignite.apache.org, ver. 2.10.0, 2021
[30] Using Spring Boot: 8. Developer Tools, 8.2.7. Known Limitations, docs.spring.io, ver. 2.4.6, 2021
[31] ClassCastException while fetching data from IgniteCache (with custom persistent store), Apache Ignite Users, 2016
[32] Spring Session and Dev Tools Cause ClassCastException, github.com/spring-projects/spring-boot, 2017
[33] Bhuiyan Shamim, A Simple Checklist for Apache Ignite Beginners (5. Ghost Nodes), dzone.com, 2019
[34] Apache Ignite: Thin Clients Overview, ignite.apache.org, ver. 2.10.0, 2021
[35] Kulichenko Valentin, Apache Ignite: Client Connectors Variety, dzone.com, 2020
[36] Py4J; A Bridge between Python and Java, py4j.org, ver. 0.10.9.2, 2021
[37] MavenRepository: Ignite Spring, ver. 2.10.0, 2021
[38] RabbitMQ: Reliability Guide, Acknowledgements and Confirms, ver. 3.8.16, 2021
[39] First Steps with Celery: Configuration, docs.celeryproject.org, ver. 5.1.0, 2021
[40] Celery: Message Protocol, docs.celeryproject.org, ver. 5.1.0, 2021
[41] Paudice Genny, High availability with RabbitMQ, blexin.com, 2019
[42] Messaging Reliability, Apache Ignite Users, 2016
[43] Apache Ignite: Python Thin Client, ignite.apache.org, ver. 2.10.0, 2021
[44] Peter Gagarinov & Ilya Roublev, Boosting Jira Cloud app development with Apache Ignite, medium.com, 2020
[45] Apache Ignite JavaDoc: Interface Binarylizable, ignite.apache.org, ver. 2.10.0, 2021
[46] Apache Ignite: SQL API, Query Entities, ignite.apache.org, ver. 2.10.0, 2021
[47] Unable to query system cache through Visor console, Apache Ignite Users, 2017
[48] A Machine Learning Framework for Julia, alan-turing-institute.github.io, ver. 0.16.4, 2021
[49] Credit Card Fraud Detection dataset, kaggle.com, 2018
[50] Chaudhri Akmal, Using Apache Ignite’s Machine Learning for Fraud Detection at Scale, dzone.com, 2018
[51] Brownlee Jason, Nested Cross-Validation for Machine Learning with Python, machinelearningmastery.com, 2021
[52] Brownlee Jason, How to Fix k-Fold Cross-Validation for Imbalanced Classification, machinelearningmastery.com, 2020
[53] Call Java programs from Julia, juliainterop.github.io, ver. 0.7.7, 2021
[54] Using C++ in Cython, cython.readthedocs.io, ver. 0.29.6, 2019
[55] Janssens Bart, Wrapping a C++ library using CxxWrap.jl, JuliaCon, 2020
[56] Ajitsaria Abhinav, What Is the Python Global Interpreter Lock (GIL)?, realpython.com, 2018
