Part 3 — Modernising a Data Platform and BigQuery Concepts

Nikhil (Srikrishna) Challa
Published in Data Knowledge Hub
5 min read · Jul 11, 2020

In Part 1 and Part 2, we discussed the concept of modernisation and some key elements of a DWH design.

In this part, we will talk about what makes a Data Platform/DWH modern.

There are various DWH solutions available in the market, and one would weigh quite a few parameters before deciding on the right fit for their architecture or organisation. Some of those parameters are:

1. Architectural complexity

2. Data Size

3. TCO (total cost of ownership), which covers storage costs, compute costs and maintenance costs

4. Source system technology

5. Data Engineering capabilities

6. End user requirements
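To make the TCO parameter a little more concrete, here is a minimal sketch in Python that adds up the three cost components over an evaluation period. The class and function names are my own, and the figures are made up purely for illustration; they are not vendor pricing.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCosts:
    """Monthly cost components in one currency (hypothetical figures)."""
    storage: float
    compute: float
    maintenance: float


def total_cost_of_ownership(monthly: MonthlyCosts, months: int) -> float:
    """Sum storage, compute and maintenance costs over the whole period."""
    return (monthly.storage + monthly.compute + monthly.maintenance) * months


# Made-up monthly numbers for an imaginary platform, evaluated over 3 years.
example = MonthlyCosts(storage=2_000.0, compute=5_000.0, maintenance=4_000.0)
example_tco = total_cost_of_ownership(example, months=36)
print(f"TCO over 36 months: {example_tco:,.0f}")
```

Running the same function over each candidate platform with your own estimates gives a like-for-like number to compare, which is usually more useful than looking at storage or compute prices in isolation.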

Whilst DWH technology is not new, there are many leaders in the market who offer on-premise DWH solutions, and unsurprisingly most of them have adapted to modern techniques and started offering cloud solutions too. A few of the renowned DWH solution providers are (the order does not signify their capabilities):

Teradata (Onprem/Cloud)

Oracle (Onprem/Cloud)

IBM DB2 (Onprem/Cloud)

Amazon Redshift (Cloud)

Google BigQuery (Cloud)

Microsoft Azure Synapse Analytics (Cloud)

Snowflake (Cloud)

DWH operations and data engineering teams are often found brainstorming over a handful of questions, which give a good insight into what the modernisation of a data platform is all about. A few of them are:

What does the data in my organisation look like?

The answer to this is something every data geek knows. It’s fondly known as ‘disparate data sources’. Very often the data sources are a combination of legacy systems, relational databases on different technologies such as Oracle, DB2 on z/OS and SQL Server, non-relational/NoSQL databases and, at times, SaaS applications such as Salesforce and SAP. The data in these sources is of different types (structured and unstructured). All of this makes the composition of a data platform extremely complex.

However, if the first version of the DWH is carefully handled, then the evolution becomes smooth and streamlined. As we discussed briefly in part 2, DWH maturity evolves with time.

What is the quality of my data?

This is a common problem that most organisations have to handle. I don’t think it is an exaggeration to say that organisations spend most of their time in a maturity life cycle analysing the quality of their data.

The very fact that the data comes from different sources often deteriorates its quality. For example, customer data is highly prone to data quality issues, except perhaps in financial services, where a customer is bound to provide authentic information. Customer data comes from online channels, call centres, marketing surveys and so on, and not every source handles it uniformly.
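As a small illustration of the customer data problem, the sketch below brings two records for the same person, captured by two different channels, into a common shape. The field names and sample records are hypothetical, and the cleaning rules are deliberately naive; real pipelines need far more careful handling of names, addresses and phone formats.

```python
import re


def normalise_customer(record: dict) -> dict:
    """Bring one raw customer record into a common shape."""
    # Collapse repeated whitespace and use consistent capitalisation.
    name = " ".join(record.get("name", "").split()).title()
    email = record.get("email", "").strip().lower()
    # Keep digits only so differently punctuated numbers compare equal.
    phone = re.sub(r"\D", "", record.get("phone", ""))
    return {"name": name, "email": email, "phone": phone}


# The same (made-up) customer as captured by two different channels.
online = {"name": "jane  DOE", "email": " Jane.Doe@Example.com ",
          "phone": "+44 20 7946 0000"}
call_centre = {"name": "Jane Doe", "email": "jane.doe@example.com",
               "phone": "44-20-7946-0000"}

same_customer = normalise_customer(online) == normalise_customer(call_centre)
print(same_customer)
```

Only after this kind of normalisation can the two records be recognised as one customer, which is why data quality work tends to come before integration rather than after it.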

How do I store my data and organise my compute resources?

While the first two points discussed above determine how easy or complex it is to integrate the data, this point determines how big a data platform is going to be. It’s quite common for organisations today to deal with millions of rows of data that are batched or streamed into their data platforms.

As an architect, one should be able to size the data platform correctly and allocate the compute resources needed to process such huge volumes.

Finally, the most important aspect: Performance

The end users or stakeholders of the data platform should be happy with its performance. What’s the fun if a SQL query runs for 30 minutes and scans 20 TB of data to retrieve a simple piece of information?

Performance, in my head, is not something that should come into the picture at the end of the proceedings. It should be an integral part of the design strategy, where the designers/enterprise architects consistently pose what-if questions about performance for every design decision or choice they make.
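To put the 30-minutes-for-20-TB example in perspective, here is a rough back-of-the-envelope sketch of the kind of what-if question a designer could ask. It uses decimal units (1 TB = 1000 GB). The function names and the 5% figure are assumptions for illustration only, not measurements from any real system.

```python
def scan_throughput_gb_per_s(data_tb: float, minutes: float) -> float:
    """Average scan rate in GB/s, using decimal units (1 TB = 1000 GB)."""
    return (data_tb * 1000) / (minutes * 60)


def scan_minutes(data_tb: float, fraction_scanned: float, gb_per_s: float) -> float:
    """Minutes needed when only a fraction of the data has to be read."""
    return (data_tb * 1000 * fraction_scanned) / gb_per_s / 60


# Scan rate implied by reading 20 TB in 30 minutes.
rate = scan_throughput_gb_per_s(data_tb=20, minutes=30)
print(f"Implied scan rate: {rate:.1f} GB/s")

# What if a design choice (say, partitioning; an assumption here) let the
# same query read only 5% of the data at the same scan rate?
pruned = scan_minutes(data_tb=20, fraction_scanned=0.05, gb_per_s=rate)
print(f"Time when scanning 5% of the data: {pruned:.1f} minutes")
```

The point is less the exact numbers and more the habit: a query that reads less data finishes sooner at the same scan rate, so decisions that reduce the data scanned are worth raising at design time.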

Resilience:

Resilience is a key aspect of modernisation and it determines how adaptable our platform is to “crisis” situations.

Everyone who is alive on this earth today is resilient. We are fighting a pandemic without letting our willpower waver.

Elasticity:

I was very hungry yesterday and ate 3 slices of bread for breakfast. I am not too hungry today, so I had just 2. Hunger is my workload, and the number of slices I give myself to satisfy it is the resources. The ability of a platform to up-size or down-size its resources automatically for a given workload is called elasticity.
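The bread analogy can be sketched in a few lines of Python. This is only a toy scaling rule under my own assumptions (a fixed capacity per worker and hypothetical minimum and maximum limits); it is not how any particular cloud platform implements autoscaling.

```python
import math


def workers_needed(pending_jobs: int, jobs_per_worker: int,
                   min_workers: int = 1, max_workers: int = 10) -> int:
    """Return how many workers to run for the current workload."""
    wanted = math.ceil(pending_jobs / jobs_per_worker)
    # Clamp to the allowed range so we never scale to zero or without limit.
    return max(min_workers, min(max_workers, wanted))


# Yesterday's heavier workload versus today's lighter one.
yesterday = workers_needed(pending_jobs=30, jobs_per_worker=10)
today = workers_needed(pending_jobs=20, jobs_per_worker=10)
print(yesterday, today)
```

An elastic platform re-evaluates a rule like this continuously, so the resources follow the workload up and down without anyone resizing things by hand.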

I feel it is not difficult to conclude that resilience and elasticity in the face of modern-day workloads are what make a platform modern. I definitely think so.

Along with these, there are a few more important characteristics that make a data platform modern.

Ability to accommodate Petabytes of data.

Trivia: Forbes expects that the amount of digital data generated by 2025 will be around 175 zettabytes. 1 ZB = 1 trillion gigabytes. (Offered without comment :) )

Serverless/No-Ops: It is good at times to have operational overheads such as allocation of resources, maintenance of the systems and infrastructure management handled by someone else, so that we can focus on the essentials.

An ecosystem of tools: As discussed briefly in Part 1 and Part 2, data analysts, data management offices, marketing teams and others are the end users of data platforms. They often implement Machine Learning workflows, set up data visualisation dashboards and build pipelines that move data across locations in the same or a different form/shape. If there are options to perform these tasks within the ecosystem, then implementing, testing and productionising these services becomes much easier.

Up-to-date data: “Real-time” is the word. The more up to date my data platform is, the more accurate and meaningful the insights derived from it will be.

Security: Data and security cannot be separated. Especially with all the regulations across the globe on the usage of personal data, security considerations must be paramount. There is no point in building a castle with a weak front wall.

Collaboration: As we discussed earlier, organisations have data across disparate sources. A modernised data platform should be collaborative and able to integrate with different technologies. The easier the collaboration, the easier it is to make a choice.

Machine Learning capabilities: If I were offered a penny every time a techy said Machine Learning, I would be a multi-billionaire by day 10 :) . Such is the buzz and importance. It’s a no-brainer to expect embedded ML capabilities in a modernised data platform.

All the above ingredients together make a data platform modernised. Amazon, Google and Microsoft are the big players who offer modernised data platforms, along with various solutions and options to migrate an organisation’s existing data platform to theirs and get “modernised”.

While each of those has its respective pros and cons, comparing them is outside this article’s scope.

In the next parts, I will discuss Google’s modernised data platform, BigQuery, along with its features and data migration solutions/architectures, with a few use cases.

For example, if the on-premise data platform is on Teradata, the data modernisation solution/architecture is different from that for a platform set up on IBM DB2 or Oracle.

We will take a closer look at how we can achieve those in upcoming articles.

Happy reading!
