Architecting Big Data at Scale

Ashish Bhan · Published in DBS Tech Blog · 11 min read · Nov 24, 2020

Background

Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured and unstructured data from different sources, varying in size from terabytes to petabytes.

The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyse it to find answers that enable:

Cost reduction

Time reduction

New product development and optimised offerings

Smart decision making

We started our journey with Cloudera Hadoop in 2016 as an extension of our existing data warehouse capabilities, building a deep understanding of unstructured data and modernising our technology stack to support analytics.

Over four years, the Cloudera Distribution Including Apache Hadoop (CDH) platform has become essential to our data architecture and strategy. It supports nearly all our business units, from retail, wholesale and investment banking to compliance, and powers more than 40 applications, the notable ones being:

· Anti-money laundering reporting and fighting fraud

· Institutional banking reporting

· Risk regulatory platform

· Trading platform

· Finance platform

· Cost allocation

· Customer science

· Customer due diligence as a service

Challenges and Pushing Ahead with a Resilient Architecture

With increasing usage and the lessons we had learnt, we redesigned the enterprise data platform in 2019 to overcome several challenges, such as:

a. Misalignment between storage and compute requirements causing resource wastage.

b. Slow turnaround due to delays in procuring and provisioning physical hardware.

c. Multiple hardware failures impacting service availability.

d. Contention risk due to single cluster handling all use cases.

e. Handling multitenancy with noisy neighbours.

f. Poor capacity forecasting.

g. Difficulty in deploying new capacity stacks and enabling experimentation.

Figure 1 Challenges of running Hadoop on Physical Hardware

We architected a platform stack that handles multitenancy better, lowers contention risk, reduces operational costs and ensures uptime for our critical applications. We took inspiration from Google and Amazon, who have done the same for the services they offer in the big data arena.

The key to successfully building and running a scalable data platform is to segregate compute from storage, which we achieved by building:

A. Data Storage Platform — Our core data storage platform, which is highly resilient and performant.

B. Microservices — Containerised self-services that encapsulate the complexities of provisioning and managing big data tools, making them available as on-demand services for consumption by applications.

C. GPU Compute using a shared GPU farm — Accelerated compute using GPUs integrated with persistent and transient compute clusters.

Figure 2 How we evolved our data storage platform design

Data Storage Platform

Continuously innovating our data storage platform, we redesigned it in 2019 by leveraging vSAN, a software-defined storage solution that offers many advantages. The new design includes:

1. Aggregating locally attached storage from each ESXi host in a cluster.
2. Leveraging a flash-optimised storage solution.
3. VM-centric data operations and policy-driven management principles.
4. Resilient design based on a distributed RAID architecture.

Figure 3 Transitioning to Analytical Private Cloud

This new architecture mitigated the pain points we had experienced with physical hardware in the past, helping us achieve tremendous success.

The new design is now able to:

· Enhance multitenancy for greater performance: We segregated our clusters based on use cases and access to CDH. This allows clusters to run independently and improves the efficiency of our IT teams, especially when it comes to managing code and production queries.

· Reduce contention risk: Too many applications on the same cluster create contention risk, leading to performance issues and severe downtime. Through segregation, we minimise our risk and enhance our uptime performance.

· Real-time synchronisation: The active-active redesign ensured real-time synchronisation between the production and disaster recovery sites. As we accommodate greater volumes of data, we had to build in the capacity to make sure that the two sites are always in sync to meet recovery time objectives.

Figure 4 Active-Active Analytical Private Cloud

To ensure that both active-active sites always have a copy of each data block, we had to modify the block placement policy and create a custom block placement policy at the HDFS layer.

The code extends the original BlockPlacementPolicyWithNodeGroup class, as DBS currently uses this class to support VMware hypervisor-hosted data nodes. It resides in the same package as the existing HDFS project, namely org.apache.hadoop.hdfs.server.blockmanagement, because some protected classes in that package cannot be accessed externally.

Most of the functionality remains the same, except for the following function from the BlockPlacementPolicyDefault class, which was customised to handle block placement.

Figure 5 Custom Block placement Policy

This function is re-implemented in the extended class to override the default block placement behaviour of placing more than one replica in the same rack.

· Accurate capacity sizing: We created a dashboard providing insights for more accurate forecasts of IT resource usage for projects. With accurate forecasting, we ensure that we have adequate capacity and resources to complete projects on time, reducing delays and unnecessary expenditure.

Figure 6 Dashboard for Availability and Capacity view

· Segregating compute and storage: The new platform allows independent growth of compute and storage, so we can scale each without having to scale the other. This means we can allocate the necessary resources accurately and cost-efficiently.

· Future proofing for big data: The redesign allowed us to create a compute framework layer that is agile and scalable, accommodating future big data technologies with minimal disruption.

· Establishing strong governance to reduce high costs — Along with the redesign, we incorporated a strong governance framework to prevent investments that waste compute and memory resources. As we migrated to the Hadoop platform, we optimised our compute and ingestion frameworks and avoided incidents caused by poorly designed batch jobs and queries. We further used the Cloudera API to monitor and automatically flag jobs that need tuning or exhibit anomalies; a minimal sketch of this monitoring idea follows.
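The sketch below shows one way job-level metrics could be pulled from the Cloudera Manager REST API and surfaced for review. The host, API version, cluster and service names, credentials and time window are placeholders, not our production values; the exact response attributes depend on the Cloudera Manager version.

```python
import requests

# Hypothetical sketch: query the Cloudera Manager API for recent YARN applications
# so an operator (or a follow-up script) can review candidates for tuning.
CM_API = "https://cm.example.internal:7183/api/v19"            # placeholder host/version
URL = f"{CM_API}/clusters/EnterpriseDataPlatform/services/yarn/yarnApplications"

resp = requests.get(
    URL,
    params={"from": "2020-11-23T00:00:00", "to": "2020-11-24T00:00:00", "limit": 200},
    auth=("monitoring_user", "********"),                       # read-only account (placeholder)
)
resp.raise_for_status()

for app in resp.json().get("applications", []):
    # Attribute keys vary across Cloudera Manager versions; print identifiers for review.
    print(app.get("applicationId"), app.get("name"), app.get("state"))
```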

· Handling small files — Applications tend to write many small files, which forces the platform to constantly increase NameNode (NN) memory. We segregated the clusters to reduce the risk of poor performance due to high CPU processing. In addition, we enabled reporting to ensure that applications merge the files in their partitions to meet the recommended size of 128 MB, as sketched below.
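For example, a batch application could compact a day’s partition with a few lines of PySpark before publishing it. This is a minimal sketch with placeholder paths; in practice the number of output files would be derived from the partition size divided by 128 MB rather than hard-coded.

```python
from pyspark.sql import SparkSession

# Minimal sketch: rewrite a partition's many small files into a handful of
# files sized close to the recommended 128 MB. Paths are placeholders.
spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

src = "/data/app/events/ds=2020-11-24"              # placeholder source partition
dst = "/data/app/events_compacted/ds=2020-11-24"    # placeholder compacted output

df = spark.read.parquet(src)

# num_files would normally be ceil(partition_size_bytes / 128 MB);
# a fixed value is used here purely for illustration.
num_files = 8
df.coalesce(num_files).write.mode("overwrite").parquet(dst)

spark.stop()
```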

With the new design, we have achieved the following:

a) Optimised costs

By streamlining compute and storage, we avoid the technical debt that would have accrued from poorly designed batch jobs and queries. In 2019, we avoided 20% of costs by optimising our ingestion and compute frameworks, and reduced our provisioning costs by up to 75%. These efficiencies have flowed through to the business, particularly helping regulatory risk reporting and finance reporting. The new architecture has also saved us millions in licensing costs.

b) Improved productivity and time to market

With faster provisioning times, our IT teams have seen a productivity increase of about 25% for provisioning-related tasks, with provisioning time reduced from three months to four hours. They are now able to serve internal business stakeholders more efficiently, provisioning environments for them at a much faster rate.

For developers especially, the redesign has given a significant boost. The architecture now provides them with metadata-based frameworks, which nearly halve the time they need to create new business applications and serve new use cases. IT teams can now take an app or service live to meet new business use case requirements in less than three weeks, where it would previously have taken a few months. We have enabled more than 40 business applications on our platform to date.

c) Reduced risk

We have achieved a 100% reduction in contention risk. Since its launch, the architecture has experienced zero availability loss.

Components within the same clusters no longer compete for the same compute resources, and clusters are physically segregated based on usage patterns and CDH component usage. We also segregated our Impala and Spark workloads and switched to vSAN on SSD, which improved our IO throughput.

d) Streamlined joyful customer experience initiatives

We have improved our read throughput by 10 times and tripled our write speeds, improving batch response times by 72%. Our IT teams have more confidence when it comes to tackling new use case deployments. The project has also enhanced our predictive analytics for improving customer experience among internal stakeholders, raising customer satisfaction scores by 14% through automated intervention.

Microservices

To support growing demand on the technology stack, we have built data microservices to complement Cloudera with on-demand data services.

The security framework is tightly coupled with the CDH implementation, allowing us to comply with the banking industry’s strict compliance regulations.

Our core platform runs on Cloudera and follows Shared Data Experience (SDX) principles, while transient clusters are offered as microservice consumables.

The microservices are built with our cloud strategy in mind, where common data services can be abstracted and provided as a service to enforce on-demand usage and improve resource utilisation. This paves the way to explore a hybrid cloud strategy without changing application code.

Figure 7 Microservices — Service framework

A. The underlying backbone of the microservices is VMware Integrated Containers and its supporting software stack.

B. The service framework consists of:

a. Control Plane — For user interaction with the microservices, enabling users to perform self-service operations. This interface is a home-grown Angular frontend tied to a dynamic Content Management Service, which enables onboarding of new services as a no-code setup.

b. Backend Automation — Performs the actual provisioning and management of resources. We use an Ansible- and Python-based orchestrator that integrates with Bitbucket and creates a dynamic Docker Compose file customised for each application, based on the specifications from the front end (a simplified sketch follows this list).

Figure 8 Backend automation workflow

c. API Interfaces — For programmatically accessing and managing resources. These APIs are exposed as part of an orchestration framework and allow users to execute operational changes and runbook activities via API calls instead of the UI. This gives them the flexibility to spin up resources and perform runbook activities as part of batch execution.
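To make the backend automation idea concrete, here is a simplified, hypothetical sketch of the concept: take a service specification submitted through the control plane and render a Docker Compose definition for it. The field names, image references and values are illustrative placeholders, not our actual orchestrator code.

```python
import yaml

# Hypothetical sketch: turn a control-plane service specification into a
# Docker Compose definition. All names and values are placeholders.
def render_compose(spec: dict) -> str:
    compose = {
        "version": "3",
        "services": {
            spec["app_name"]: {
                "image": spec["image"],                  # approved registry image (placeholder)
                "environment": spec.get("env", {}),      # app-specific settings from the UI
                "deploy": {
                    "resources": {
                        "limits": {
                            "cpus": str(spec.get("cpus", 2)),
                            "memory": spec.get("memory", "4g"),
                        }
                    }
                },
            }
        },
    }
    return yaml.safe_dump(compose, sort_keys=False)

# Example specification as the frontend might submit it (placeholder values)
print(render_compose({
    "app_name": "risk-ingestion",
    "image": "registry.local/data-svc/nifi:1.11",
    "env": {"TZ": "Asia/Singapore"},
    "cpus": 4,
    "memory": "8g",
}))
```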

This configuration has greatly streamlined IT operations such as cataloguing, security and governance.

In the redesigned architecture, nearly 100% of our work across Hadoop clusters is automated. This frees up a tremendous amount of time and energy for our IT teams to serve our business stakeholders and help the bank adapt to rapid changes in data and technology, leveraging data engineering principles.

It took us nearly two years to migrate all applications to the new cloud-native infrastructure, and as technology evolves, we are ready to keep adding to and accelerating our compute frameworks so that both application and business units can benefit and deliver value to the enterprise. Migrating to software-defined storage and immutable compute is a key milestone in how big data platforms should be built at scale.

For big data, compute remains an ever-growing challenge, and we need to keep looking for new technology innovations to reduce our compute footprint and accelerate our batches.

GPU Compute

The vast majority of conventional data lakes are based on traditional multicore CPU technology. Even with significant enhancements to storage, memory and networking subsystems, nearly every enterprise user and system is constrained by the lack of performance and burst capacity for batch analytics on data in an enterprise data lake or warehouse.

GPU technology, pioneered through years of research and development by Nvidia and others, has witnessed a decade of rapid industrial growth. GPUs are inherently able to process data in parallel because of the thousands of cores packed into them.

Figure 9 CPU vs GPU Architecture

GPU cards are typically plugged into servers, where they augment the main CPU. They also offer faster I/O capabilities, which increases the amount of data that can be passed between them and the main CPU.

GPU accelerators have a great advantage in performing compute-heavy tasks for industrial applications; however, GPUs are expensive and there is a learning curve for applications to adopt them. We, at DBS, use vSphere Bitfusion to support AI- and ML-based workloads by virtualising hardware accelerators such as GPUs. vSphere Bitfusion decouples physical resources from the servers within an environment.

Figure 10 How vSphere Bitfusion enables a shared GPU farm

The platform can share GPUs in a virtualised infrastructure, as a pool of network-accessible resources, rather than isolated resources per server. Bitfusion works across AI frameworks, clouds, networks, and in environments such as virtual machines, containers, and notebooks. Bitfusion stands ready to carry virtualisation forward as new hardware accelerators are introduced.

DBS chose vSphere Bitfusion for the following reasons:

a. Dynamic GPU Attach Anywhere — Bitfusion disaggregates GPU compute and dynamically attaches GPUs anywhere in the data centre, just like attaching remote storage.

b. Fractional GPUs for Efficiency — Bitfusion enables the use of arbitrary fractions of GPUs, supporting more users in the test and development phases.

c. Standards-Based Accelerator Access — GPUs can be leveraged across the infrastructure, and evolving technologies can be integrated as standards emerge.

d. Application Run Time Virtualisation — Bitfusion attaches GPUs based on CUDA calls at run-time, maximising utilisation of GPU servers anywhere in the network.

e. Use with any Application — Bitfusion is a transparent layer and runs with any workload, framework, container or notebook, as the sketch below illustrates.
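The snippet below illustrates point (e) under an assumption about how such a setup is typically used: the training or scoring script is plain PyTorch with nothing Bitfusion-specific in it, because Bitfusion intercepts the standard CUDA calls at run time and forwards them to whichever GPU (or GPU fraction) was attached when the job was launched through the Bitfusion client.

```python
import torch

# Minimal sketch: this code is identical whether the GPU is local or remotely
# attached from the shared GPU farm; it simply issues standard CUDA calls.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b                       # runs on whichever accelerator was attached

print(f"Ran matmul on {device}; result shape: {tuple(c.shape)}")
```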

We, at DBS, are using Apache Spark as part of the Cloudera distribution. Spark is an open-source, distributed, general-purpose cluster-computing framework that provides a unified analytics engine for large-scale data processing.

Spark can accelerate Python code by leveraging PySpark libraries together with Python modules such as PyTorch and TensorFlow; however, this means batch applications would have to rewrite their Scala code in Python, which is cumbersome.

We have leveraged Apache Spark 3.0 and enabled integration with the open-source RAPIDS suite of software libraries, which are themselves built on CUDA-X AI. The RAPIDS Accelerator for Apache Spark combines the power of the RAPIDS cuDF library with the scale of the Spark distributed computing framework. It enables developers to take their Spark code and, without modification, run it on GPUs instead of CPUs to achieve acceleration.

The RAPIDS Accelerator is enabled by adding the RAPIDS libraries to the Spark classpath and setting spark.conf.set('spark.rapids.sql.enabled', 'true').
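A minimal PySpark sketch of such a configuration is shown below. The jar paths are placeholders for wherever the RAPIDS Accelerator and cuDF jars are deployed; the plugin class name is the one documented for the RAPIDS Accelerator for Apache Spark.

```python
from pyspark.sql import SparkSession

# Sketch of enabling the RAPIDS Accelerator (jar paths are placeholders).
spark = (
    SparkSession.builder
    .appName("rapids-enabled-job")
    # RAPIDS Accelerator and cuDF jars must be visible to the driver and executors
    .config("spark.jars", "/opt/rapids/rapids-4-spark.jar,/opt/rapids/cudf.jar")
    # Register the RAPIDS SQL plugin so supported operators run on the GPU
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)

# Existing DataFrame/SQL code runs unchanged; supported operations are GPU-accelerated.
df = spark.range(0, 1_000_000).selectExpr("id % 100 AS key", "id AS value")
df.groupBy("key").count().show(5)
```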

Data architecture and big data deployment is a fast-growing field, with rapid technology transformation driven by both the open-source community and cloud providers. We, at DBS, strive to provide the right balance of technology in a modern data architecture to ensure cost efficiency for our applications.

(This article has also been translated into Traditional Chinese: 星展銀行如何打造大數據平台 — Ashish Bhan — Medium.)
