How we built a modern data platform for a digital bank from zero to hero

Anil Palwai
3 min read · Apr 22, 2020


The first step in any organization's data journey is to build a data platform. Building a modern data platform is not rocket science, but it does follow the rocket-ship analogy: get the fundamental components right before you launch. In this blog, I would like to give an overview of how we built a modern data platform for a digital bank in Singapore. The challenge for this project was to build a completely new big data platform to serve data that previously ran on legacy tools such as Teradata Studio, Oracle Studio, and so on.

Every organization designs its data platform around its own business and functional factors: the type of data (batch, near-real-time, or real-time), the type of architecture (monolithic or microservices), and the type of environment (on-premise, cloud, or hybrid; single cloud or multi-cloud).

While architecting this digital data platform, we set a target of crucial areas that could not be compromised: performance, scalability, security, high availability, ease of monitoring, and flexibility for future changes. We came up with the following key fundamentals to reach that target.

Sample Architecture
  1. Decoupling the storage (data lake on S3) and compute layers (hybrid VMs): We introduced Alluxio, a virtual distributed file system, to decouple storage from computation. Processing engines such as Apache Spark and Hive read from and write to Alluxio only, while Alluxio reads from and persists to the storage layer synchronously or asynchronously, depending on the job's configuration (see the PySpark/Alluxio sketch after this list).
  2. Better performance for data pipelines through data co-location: We again take advantage of Alluxio's ability to co-locate data in worker-node memory, so jobs run at memory speed.
  3. Resilience in job runs: We used Apache Airflow to schedule and run jobs as tasks in Airflow DAGs, which gives us complete control over data pipelines and their lineage.
  4. High availability for every service on the platform: We run almost every service in the cluster with multiple agents behind a virtual IP, so the service is unaffected if any single agent goes down.
  5. Resource isolation in the compute layer: Because this is a multi-tenant computational cluster, we configured YARN resource pools and deployed every job with its related queue configuration (see the queue sketch after this list).
  6. Distributed workflow management: We used Airflow as the workflow management system to schedule and run pipelines in the cluster, which makes monitoring, testing, and runtime resilience easier.
  7. Automation of runtime failures to meet SLAs: We monitor job runtime activity through Airflow logs exported to Grafana dashboards; for known runtime issues, automation scripts or workarounds are triggered between Airflow retries so jobs recover without failing (see the retry-callback sketch after this list).
  8. Benchmark tests for cluster capacity analysis: We run stress and benchmark tests on the cluster frequently to understand its performance and capacity.
  9. Integration tests: We run integration-test jobs every 10 minutes for every service in the cluster and visualize their uptime/downtime on dashboards to understand availability (see the probe DAG sketch after this list).
  10. Role-based security: We integrated LDAP and Kerberos to provide authentication and to onboard users to the platform.
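
To make the storage/compute decoupling and data co-location concrete, here is a minimal PySpark sketch of a job that talks only to the Alluxio namespace while Alluxio handles persistence to the S3 data lake. The master hostname, paths, and column names are hypothetical, and the Alluxio client jar is assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession

# Build a Spark session; alluxio:// paths resolve as long as the Alluxio
# client jar is on the driver and executor classpath.
spark = (
    SparkSession.builder
    .appName("alluxio-decoupled-etl")
    .getOrCreate()
)

# Read from the Alluxio virtual filesystem instead of hitting S3 directly;
# hot data is served from worker memory when it is already cached.
txns = spark.read.parquet("alluxio://alluxio-master:19998/datalake/raw/transactions/")

daily_totals = txns.groupBy("account_id", "txn_date").sum("amount")

# Write back through Alluxio; persistence to the S3 data lake happens
# synchronously or asynchronously depending on the configured write type
# (e.g. CACHE_THROUGH vs. ASYNC_THROUGH).
daily_totals.write.mode("overwrite").parquet(
    "alluxio://alluxio-master:19998/datalake/curated/daily_txn_totals/"
)
```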
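For the resource isolation point, a job can be pinned to its tenant's YARN queue at submission time. The queue name and executor cap below are illustrative; the queues themselves would be defined in the cluster's YARN scheduler configuration.

```python
from pyspark.sql import SparkSession

# Submit this job into a tenant-specific YARN queue so it can only consume
# that queue's share of the multi-tenant cluster.
spark = (
    SparkSession.builder
    .appName("risk-team-nightly-etl")
    .config("spark.yarn.queue", "risk_etl")                 # hypothetical queue name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "20")   # cap within the queue
    .getOrCreate()
)
```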
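The resilience, workflow-management, and retry-automation points come together in the DAG definition itself: retries give resilience, and an on_retry_callback can apply a known workaround between attempts. This is a sketch using Airflow 1.10-style imports; the DAG id, task commands, and remediation logic are hypothetical, not our production pipeline.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def remediate_known_issues(context):
    # Runs between retries: inspect the failed task instance and apply a
    # known workaround (for example, clean up a stale lock or temp dir)
    # before the next attempt starts.
    ti = context["task_instance"]
    print(f"Remediation hook fired for {ti.task_id}, attempt {ti.try_number}")


default_args = {
    "owner": "data-platform",
    "retries": 3,
    "retry_delay": timedelta(minutes=10),
    "on_retry_callback": remediate_known_issues,
}

with DAG(
    dag_id="daily_ingest_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 2 * * *",   # daily at 02:00
    default_args=default_args,
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="spark-submit extract.py")
    transform = BashOperator(task_id="transform", bash_command="spark-submit transform.py")
    extract >> transform
```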
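Finally, for the 10-minute integration tests, each probe is a tiny end-to-end check against one service, and the task success/failure history then feeds the uptime panels on Grafana. The service endpoints and commands below are assumptions for illustration, not the exact probes we run.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="platform_service_probes",
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/10 * * * *",   # run every 10 minutes
    catchup=False,
) as dag:
    # Trivial HiveServer2 round trip via beeline.
    probe_hive = BashOperator(
        task_id="probe_hive",
        bash_command='beeline -u "jdbc:hive2://hive-server:10000" -e "SELECT 1;"',
    )
    # List the Alluxio root to confirm the master is reachable.
    probe_alluxio = BashOperator(
        task_id="probe_alluxio",
        bash_command="alluxio fs ls / > /dev/null",
    )
```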

Beyond these fundamentals, the platform has evolved over time with more innovative in-house products, such as data security frameworks and a self-service compute engine for designing, testing, validating, and deploying ETL pipelines with open-source Apache Spark.

Please reach out to me on LinkedIn for any queries: www.linkedin.com/in/anilpalwai


Anil Palwai

A data engineer specializing in Hadoop, Apache Spark, Apache Kafka, and Airflow solutions for on-premise, cloud, and hybrid environments.