At ING’s Wholesale Banking Advanced Analytics team we’ve been using Apache Flink, and more specifically its DataStream API, for over six months now. Although we’re very enthusiastic about this open source project, we’ve identified one major pain point in using it in a production environment: managing job deployments. This post will highlight the challenges and the open source solution we’ve built to tackle this.
Managing job deployments and state recovery with Apache Flink is tedious and tricky within a production cluster, and sometimes requires chaining operations that are provided by the native Apache Flink CLI. The deployer CLI that we’ve developed provides several commands that encapsulate this process for you. …
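As a rough illustration of the chaining the post refers to (this is our own sketch, not the deployer’s actual implementation), a manual redeploy with the native Flink CLI comes down to a cancel-with-savepoint followed by a run-from-savepoint. A minimal Python wrapper around those two commands, with hypothetical helper and argument names, could look like this:

```python
import subprocess


def redeploy_with_savepoint(job_id, jar_path, savepoint_dir, main_class=None):
    """Cancel a running Flink job with a savepoint, then start a new jar from it.

    A hypothetical wrapper around the native Flink CLI; the real deployer CLI
    may differ in interface and behaviour.
    """
    # Step 1: cancel the running job and trigger a savepoint in savepoint_dir.
    out = subprocess.run(
        ["flink", "cancel", "-s", savepoint_dir, job_id],
        capture_output=True, text=True, check=True,
    ).stdout

    # Step 2: pick the savepoint path that Flink prints on success.
    savepoint_path = next(
        tok.rstrip(".") for tok in out.split() if tok.startswith(savepoint_dir)
    )

    # Step 3: submit the new job, restoring its state from the savepoint.
    cmd = ["flink", "run", "-s", savepoint_path]
    if main_class:
        cmd += ["-c", main_class]
    cmd.append(jar_path)
    subprocess.run(cmd, check=True)
```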
In most cases, using Python User Defined Functions (UDFs) in Apache Spark has a large negative performance impact. This blog will show you how to use Apache Spark native Scala UDFs in PySpark, and gain a significant performance boost.
To create your Scala UDF, follow these steps:
Find an example project in the link…
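One common way to use a JVM UDF from PySpark (not necessarily the exact approach taken in the linked project) is to compile the UDF into a jar, put it on the Spark classpath and register it by class name; the jar path and class name below are placeholders for your own project:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

# Start a session with the jar containing the compiled UDF on the classpath.
spark = (
    SparkSession.builder
    .appName("scala-udf-from-pyspark")
    .config("spark.jars", "path/to/your-scala-udfs.jar")
    .getOrCreate()
)

# Register a class implementing org.apache.spark.sql.api.java.UDF1 under a
# SQL-callable name, together with its return type.
spark.udf.registerJavaFunction("plus_one", "com.example.udf.PlusOne", LongType())

df = spark.range(5)
df.createOrReplaceTempView("numbers")

# The registered function is usable from Spark SQL (and selectExpr), and runs
# entirely on the JVM, avoiding the Python serialization overhead of a
# regular PySpark UDF.
spark.sql("SELECT id, plus_one(id) AS id_plus_one FROM numbers").show()
```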
Real data behaves in many unexpected ways that can break even the most well-engineered data pipelines. To catch as much of this weird behaviour as possible before users are affected, the ING Wholesale Banking Advanced Analytics team has created 7 layers of data testing that they use in their CI setup and Apache Airflow pipelines to stay in control of their data. The 7 layers are:
On the 11th and 12th of October, World Summit AI took place in Amsterdam.
The summit is the world’s first and only industry-organized applied AI event, meaning that to be involved you should be leading some AI initiative, no matter the industry. ING decided to become a main sponsor for the summit, ensuring that the classic ING logo was visible throughout the venue. Furthermore, the bank provided speakers to give presentations and participate in the rest of the summit. …
Comparing very large feature vectors and picking the best matches in practice often comes down to performing a sparse matrix multiplication followed by selecting the top-n multiplication results. In this blog, we implement a customized Cython function for this purpose. When comparing our Cythonic approach to doing the same with SciPy and NumPy functions, our approach improves the speed by about 40% and reduces memory consumption. The GitHub code of our approach is available here.
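For reference, a SciPy/NumPy baseline for the same idea looks roughly like the sketch below (function and variable names are ours); the key difference is that it materialises the full product before pruning it, which is what the Cython version avoids:

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_dot_topn_baseline(A, B, ntop, lower_bound=0.0):
    """Multiply two sparse matrices and keep only the top-n entries per row."""
    # Full sparse product -- this is where most of the memory goes.
    C = A.dot(B).tocsr()

    rows, cols, data = [], [], []
    for i in range(C.shape[0]):
        start, end = C.indptr[i], C.indptr[i + 1]
        row_data = C.data[start:end]
        row_cols = C.indices[start:end]

        # Drop entries below the threshold, then keep the n largest.
        mask = row_data > lower_bound
        row_data, row_cols = row_data[mask], row_cols[mask]
        if len(row_data) > ntop:
            top = np.argpartition(row_data, -ntop)[-ntop:]
            row_data, row_cols = row_data[top], row_cols[top]

        rows.extend([i] * len(row_data))
        cols.extend(row_cols)
        data.extend(row_data)

    return csr_matrix((data, (rows, cols)), shape=C.shape)
```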
ING Wholesale Banking has huge amounts of data about many companies, but because the data comes from different source systems, inside and outside the bank, there is no single identifier that can be used to easily connect the data sets. …
Niels Denissen — PyData lecture on Asyncio
At the time of this posting, it has been two months since Niels Denissen held his lecture at the PyData conference on the 8th of April. His lecture, which focused on the Python library asyncio, drew the attention of many attendees. The conference was full of excited participants eager to listen and share knowledge. The venue, Booking.com, felt particularly full as a wave of data scientists, data engineers and more filled all the lecture halls they had set up.
Niels is a team member of Wholesale Banking Advanced Analytics. WBAA builds data-driven products that create value for clients and for the bank itself. If you wish to see Niels’ lecture “A practical guide to speed up your application with Asyncio”, please visit the link below; it is sure to help out with some of your Python needs!
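To give a flavour of the topic (this snippet is ours, not taken from the talk): asyncio lets a single thread overlap many I/O-bound waits instead of performing them one after another.

```python
import asyncio


async def fetch(name, delay):
    # Stand-in for an I/O-bound call (HTTP request, database query, ...).
    await asyncio.sleep(delay)
    return f"{name} done after {delay}s"


async def main():
    # The three "requests" run concurrently, so the total wall time is
    # roughly the slowest one (~1s) rather than the sum (~1.8s).
    results = await asyncio.gather(fetch("a", 1.0), fetch("b", 0.5), fetch("c", 0.3))
    print(results)


asyncio.run(main())
```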
We have one Hadoop cluster with Apache Airflow as a workflow scheduler and monitor in our current environment. In the near future, we want to build two new Hadoop clusters to handle the production workloads. The Airflow instance that we currently use is a single node, with everything installed on one server: the web server, scheduler and worker, plus a PostgreSQL database for the metadata. We started investigating the possibility of making our Airflow environment highly available in both data centers.
The new requirements for the HA solution are:
The devil is in the details. It’s an expression that is often used to say that some things need to be done thoroughly, that details can be important. This is of course true for many things that people do every day, but it is particularly the case in the fields of data science and advanced analytics.
Now when it comes to data science, I am the first one to admit that I am a stickler for the nitty-gritty, that I need to thoroughly understand every model or algorithm that I work with. But I have noticed that I am more the exception than the rule in this. …