Building a Data Lake and Warehouse on GCP

From theory to practice, main considerations and GCP services

Alexander Goida
Geek Culture
11 min read · Jun 19, 2022


This article won’t be technically deep. We will talk about the Data Lake and Data Warehouse, the main principles to consider, and the GCP services that could be used to build them. I will guide you through those services and explain why you might consider them.

Before presenting my version of the Data Lake and Data Warehouse, I would like to mention a few well-known architectures, which you might have seen if you’re interested in this topic. The architecture I suggest is more generic than those.

My version of the Data Lake and Data Warehouse serves a more generic purpose and broader use cases, and it relies on GCP services such as Data Transfer Service, Dataproc, Cloud Storage, Cloud Scheduler, BigQuery, and Cloud SQL.

We will assemble the “puzzle” and see why and how these services can be used.

Main Considerations

There are a lot of important points to think about when designing your own Data Lake and Data Warehouse. While the features of a Data Warehouse are almost fully covered by BigQuery, the features of a Data Lake require more careful analysis and a more deliberate approach to implementation. All those “buzzwords” like fault-tolerant, reliable, and scalable should also apply to a Data Lake. You want data to be of good quality when it reaches the storage, you want fast processing, easy integration of new data sources, and much more. All these points can be divided into three main groups of questions which address specific areas of the system. I will mention them only briefly, because discussing them in depth is not the purpose of this article.

  1. Data transmission. This group includes points about transmitting the data itself and storing it. The raw data should be transmitted from your main system to the Data Lake without risk of being lost, and it should be stored “as is”. The Data Lake should not fail when something in the incoming data changes unexpectedly. In case of an unexpected increase in incoming data volume, the system should still be able to accommodate it quickly. Integrating a new data source should be a matter of a few hours (even minutes), not days. In case of downtime or failure, nothing should be lost and the system should recover automatically.
  2. Data organization. In this group of questions you need to decide how your data will be stored and queried in the Data Lake, and this has its own challenges. The structure of incoming data is subject to change over time. There might be no correlation between the time data arrives at the Data Lake and the business time related to the data object. The same data could come from multiple suppliers. You will need to decide how to partition your raw data for cost-effective and performant usage. You might also need to consider data access rights and security for different roles and users, though this could be skipped at this level.
  3. Data processing. The last group is related to data processing techniques. First of all, you want to keep data processing simple enough and tool-agnostic, so that your organization can use the best tools to achieve its goals. Writing new processing logic should be as simple as writing a new SQL script once you know the data structure. The processing logic should be equally able to handle small and large volumes of data. Building new data processing pipelines should be fast, and deploying them easy. It should be equally simple to create real-time or batch processing, and on-demand pipelines or constantly running ones.

The answers to the questions in these groups depend heavily on the needs of your business and its prerequisites. While designing our Data Lake and Warehouse with my team, we discussed many points. Below I would like to dwell on a few concrete ones on which we spent most of our time choosing the right approach.

Choosing between Cloud Storage and BigQuery for Data Lake

One of the first questions is whether to use BigQuery or Cloud Storage for the Data Lake. If your data has a very stable structure, its schema changes very slowly over time, and you can consider it constant for long periods, then BigQuery may be a good choice for you. You can define external tables which read files from Cloud Storage and work with them almost the same way as with other tables. You can automate the import of such data into your data product tables in the Data Warehouse. In those rare cases when your data changes its structure, you cover it with semi-manual procedures and don’t face the problem of supporting various schemas at the same time. You can build data processing using SQL and scheduled queries in BigQuery and keep the required skills very comfortable for DBAs.
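To give an idea of how small that setup can be, here is a minimal sketch using the BigQuery Python client; the project, dataset, and bucket names are hypothetical, and the same DDL can be run directly in the BigQuery console or as scheduled queries.

from google.cloud import bigquery

client = bigquery.Client()

# External table over raw JSON files in Cloud Storage (hypothetical names).
client.query("""
    CREATE OR REPLACE EXTERNAL TABLE `my_project.data_lake.raw_orders`
    OPTIONS (
      format = 'NEWLINE_DELIMITED_JSON',
      uris = ['gs://my-data-lake/orders/*.json']
    )
""").result()

# A scheduled query could then materialize the data into a product table.
client.query("""
    CREATE OR REPLACE TABLE `my_project.warehouse.orders` AS
    SELECT * FROM `my_project.data_lake.raw_orders`
""").result()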

In cases when the data structure changes more often, the effort to support such a Data Lake built on BigQuery may grow exponentially and become too costly and too complex to stay cost-effective for the business. In such a case, Cloud Storage and the processes around it could be a better solution. This is what we chose, because in our case the data schema changes from record to record and we have multiple data pipelines and suppliers.

Choosing the Raw Data Format: Parquet, Avro, or JSON

The choice between data formats has the same basis as the choice between BigQuery and Cloud Storage for a Data Lake. The Parquet and Avro file formats require a certain structure to be embedded in the files to support the features they offer. For example, Parquet is extremely effective for querying because it is a columnar storage format. It also offers better compression. But it is more expensive on writes, because it requires more CPU and RAM to operate. Avro looks like a better choice for cases where write operations should be fast. You may append row by row to Avro files, but you cannot do this with Parquet files. This makes Avro preferable in scenarios where you process data in a streaming manner. I’ve found a very nice explanation of the details of these formats.

Thus, if your schemas do not evolve fast and you can guarantee that all records in a single file have the same structure, then Avro or Parquet is a very good choice. But when the structure changes frequently and you need to keep records with different schemas in the same file, it becomes challenging.

Parquet and Avro require the schema to be embedded in the file. If the schema changes for almost every record, then you would need to patch every record to fit some “target” schema, or create a lot of files with slightly different schemas. Avro declares that it supports schema evolution for both adding and removing columns, but our research showed that it still requires all records in the same file to have the same structure. This is solvable, but it increases the complexity of the system, adds more points of failure, and adds overhead to supporting such a system.

In contrast, the JSON format doesn’t have the aforementioned problems, and it is well supported by many processing frameworks. The only problem you may think of is the size of text files with JSON data. Spark allows you to read files with different schemas: when you specify an expected schema, it loads the data correctly even when records have significantly different schemas. You can also store records with different schemas in the same files. Eventually, after several experiments, we settled on JSON.
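Here is a minimal PySpark sketch of that behavior (the bucket path and field names are made up): when an explicit schema is supplied, fields missing from a record become NULL and unknown fields are simply ignored.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("raw-json-load").getOrCreate()

# The "target" schema we care about; records do not have to match it exactly.
expected_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])

df = spark.read.schema(expected_schema) \
    .json("gs://my-data-lake/orders/2022/06/19/*/*/*.json")
df.show()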

Choosing the right Data Organization for Data Lake

The main purpose of data organization is to offer a uniform structure in the Data Lake and to enable effective data querying. How you store data, how you organize it, and how you partition it all affect query performance. A very good guide about this is attached below as a link. The main idea is that the ingestion date should be part of the data organization at the 1st tier of your Data Lake.

The 2nd tier of the Data Lake may be more tailored to the needs of the business. This makes it possible to rebuild the 2nd tier from the 1st one in case you decide to reorganize the prepared data.

It’s important to keep the Data Lake’s 1st tier agnostic of the data structure and not to couple it with your business. That will make it much easier to integrate new suppliers and to manipulate the data further inside the Data Lake.
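As an illustration, here is a small sketch of writing a raw record into the 1st tier under an ingestion-date path. The bucket, object, and supplier names are hypothetical; the layout matches the one used in the wildcard examples later in the article.

from datetime import datetime, timezone
from google.cloud import storage

def store_raw_record(bucket_name: str, obj: str, instance: str, payload: str) -> None:
    """Write a raw JSON payload under an ingestion-date path in the 1st tier."""
    now = datetime.now(timezone.utc)
    # gs://<bucket>/<object>/<year>/<month>/<day>/<hour>/<minute>/<instance>.json
    blob_path = f"{obj}/{now:%Y/%m/%d/%H/%M}/{instance}.json"
    storage.Client().bucket(bucket_name).blob(blob_path).upload_from_string(payload)

store_raw_record("my-data-lake", "orders", "supplier-a", '{"order_id": "1"}')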

GCP Services for Data Lake and Warehouse

Now I would like to talk about the building blocks of a possible Data Lake and Warehouse. All components are GCP services, and together they fully cover the needs of specific areas: data transmission, processing, and querying. We will not be configuring any of the services, but I will outline their main benefits and why they are suitable for your Data Lake and Warehouse. You may find the diagram of the system I’m talking about at the beginning of the article.

Data Transfer Service

In order to ingest data into your Data Lake you will need to establish a reliable data transfer from the source system. Data Transfer Service helps to transfer data from AWS, Azure, Cloud Storage, or on-premises storage to Cloud Storage or another on-premises location. The service transfers files, verifies the transmission, and can resume if it has been interrupted. On-premises agents can be scaled out to increase the performance of the data transfer.

In our case we transfer data from on-premises storage to Cloud Storage. To configure it, an agent pool should be created in Data Transfer Service. All agents communicate with each other via the Pub/Sub service to coordinate which files have been sent and where to resume in case of an infrastructure failure. For example, if an agent dies or connectivity drops, the agents which can still work will continue the transmission. In the end, all agents work in cooperation, making the data transfer fault-tolerant without skipping any files.

Dataproc

Dataproc is a managed Apache Spark and Apache Hadoop service. It’s the core of the ETL processes over raw data files. It’s possible to configure auto-scaling, which uses preemptible secondary worker nodes by default and is based on YARN metrics. You can create a minimal cluster of one master node and two worker nodes, and set the autoscaler’s maximum number of secondary workers to some high value. The difference between primary and secondary worker nodes is that the latter cannot store data. It’s also possible to use Dataproc in serverless mode by submitting batch jobs. We run several ETLs every three minutes, and because of that our Dataproc cluster runs constantly.
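To illustrate what such an ETL could look like, here is a simplified sketch of a PySpark job that might be submitted to Dataproc. The paths and table names are hypothetical, and the spark-bigquery connector may need to be added to the cluster if your image doesn’t bundle it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-warehouse").getOrCreate()

# Read one hour of raw JSON from the 1st tier of the Data Lake.
raw = spark.read.json("gs://my-data-lake/orders/2022/06/19/10/*")
raw.createOrReplaceTempView("raw_orders")

# Transformation logic stays in SQL, which keeps it comfortable for DBAs.
orders = spark.sql("""
    SELECT
        order_id,
        customer_id,
        CAST(amount AS FLOAT) AS amount
    FROM raw_orders
""")

# Write to BigQuery via the spark-bigquery connector.
orders.write.format("bigquery") \
    .option("table", "my_project.warehouse.orders") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .mode("append") \
    .save()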

Wildcards in file paths

A very interesting feature of Spark is the way it handles wildcards in the paths of the files to load. For example, consider what can be specified as the source for spark.read.json. Let’s say you have the following folder organization in your Data Lake: gs://object/year/month/day/hour/minute/instance.json. Almost any service supports an asterisk at the end of the path, like this: gs://object/year/month/day/hour/* — if you want to load data for a specific hour of a certain day. But not all would support this: gs://object/year/month/*/*/minute/* — if you want to load data only for a specific minute across the whole month. And your folder organization could be a bit more complicated than this, especially on the 2nd tier of the Data Lake, where you work with prepared data files and the storage organization could include business information. For example, you may want to load data only for a certain customer across the whole year.
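Assuming the hypothetical layout above, the two cases look like this in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load a specific hour of a specific day.
hour_df = spark.read.json("gs://my-data-lake/orders/2022/06/19/10/*")

# Load only the 30th minute of every hour of every day across the month.
minute_df = spark.read.json("gs://my-data-lake/orders/2022/06/*/*/30/*")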

SparkSQL

SparkSQL allows you to use a well-known DSL to operate on data. There are a lot of built-in functions, which makes the learning curve very gentle, especially for DBAs.

All your operations can be executed within the spark.sql operator:

spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW my_view AS
    SELECT
        CAST(field AS FLOAT) AS mapped_field
    FROM another_view
""")

Cloud Scheduler

The Cloud Scheduler service allows you to call a URL or send a message to a Pub/Sub topic on a defined schedule. We use it to trigger Data Transfer jobs and Dataproc workflows via URL. The scheduler calls the GCP REST API of a service to start an operation. Data Transfer jobs do not support schedules more frequent than once an hour, and Dataproc workflows and jobs have no built-in scheduling at all. It’s possible to use Cloud Composer or Cloud Functions to trigger operations on Dataproc or Data Transfer Service, but this would increase the complexity of the system, introducing more components and more points of failure.
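For illustration, here is a hedged sketch of creating such a scheduler job with the google-cloud-scheduler Python client; the project, region, workflow template, and service account names are hypothetical, and the same job can be created from the console or with gcloud.

from google.cloud import scheduler_v1

project, region = "my-project", "us-central1"
parent = f"projects/{project}/locations/{region}"

# REST endpoint that instantiates a Dataproc workflow template.
uri = (f"https://dataproc.googleapis.com/v1/projects/{project}"
       f"/regions/{region}/workflowTemplates/raw-etl:instantiate")

job = scheduler_v1.Job(
    name=f"{parent}/jobs/run-raw-etl",
    schedule="*/3 * * * *",          # every three minutes
    time_zone="Etc/UTC",
    http_target=scheduler_v1.HttpTarget(
        uri=uri,
        http_method=scheduler_v1.HttpMethod.POST,
        # The service account needs permission to instantiate the workflow.
        oauth_token=scheduler_v1.OAuthToken(
            service_account_email="scheduler@my-project.iam.gserviceaccount.com"
        ),
    ),
)

scheduler_v1.CloudSchedulerClient().create_job(parent=parent, job=job)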

Cloud SQL

This component is not present on the diagram of our Data Lake and Warehouse, but we’re using it. Its main purpose is traceability of operations and ensuring that all files that arrive at the Data Lake are processed. Cloud Functions are triggered when a new file arrives in the bucket and add the file to the metadata database on Cloud SQL. To overcome transient errors, retries are enabled for the functions. With retries enabled, a function will keep trying to register a file for up to 7 days, which eventually guarantees that every file is added to the database regardless of transient errors. We use PostgreSQL.
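A simplified sketch of such a function is shown below. It assumes a 1st-gen background function triggered on object finalize, deployed with retries enabled; the table name and environment variables are hypothetical.

import os
import psycopg2

def register_file(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize.

    Records every new Data Lake file in the metadata database; retries plus
    ON CONFLICT make the registration idempotent.
    """
    conn = psycopg2.connect(
        host=f"/cloudsql/{os.environ['INSTANCE_CONNECTION_NAME']}",
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASS"],
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO lake_files (bucket, name, created_at)
            VALUES (%s, %s, now())
            ON CONFLICT (bucket, name) DO NOTHING
            """,
            (event["bucket"], event["name"]),
        )
    conn.close()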

The Rest, but Not the Least

The rest of the GCP services do not need special coverage. We use Cloud Storage for our Data Lake and BigQuery for our Data Warehouse.

When you organize your Data Lake on Cloud Storage, it is worth configuring lifecycle rules for data objects. It’s possible to have objects of different storage classes in the same bucket. Lifecycle rules can change the class of objects depending on their age, from Standard to Nearline and so on. Even though the cost of data storage is low, when you deal with a huge volume of data and read operations, it generates a noticeable bill. Managing object classes will cut some of that cost.
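For example, lifecycle rules can be set with the Cloud Storage Python client; the bucket name and age thresholds below are just placeholders, and the same can be configured in the console or with gsutil.

from google.cloud import storage

bucket = storage.Client().get_bucket("my-data-lake")

# Move objects to cheaper classes as they age; tune the thresholds to your access patterns.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)
bucket.patch()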

In BigQuery you may want to have multiple partitioning schemes for the same data product. For example, the same history of orders could be partitioned by customer and by location. BigQuery tables do not support more than one partitioning scheme. We overcome this by having multiple tables with the same data but different partitioning. The group of tables which belong to the same data product is generated by the same ETL, so it’s not difficult to support several tables.
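As an illustration, here is one way to generate two such tables from the same staging data with DDL; the names are hypothetical, and the integer-range partitioning assumes customer_id is an INT64 column.

from google.cloud import bigquery

client = bigquery.Client()

# Same order history, partitioned by date and clustered by customer...
client.query("""
    CREATE OR REPLACE TABLE `my_project.warehouse.orders_by_date`
    PARTITION BY DATE(order_ts)
    CLUSTER BY customer_id AS
    SELECT * FROM `my_project.warehouse.orders_staging`
""").result()

# ...and the same data again, partitioned by an integer range over customer_id.
client.query("""
    CREATE OR REPLACE TABLE `my_project.warehouse.orders_by_customer`
    PARTITION BY RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 1000000, 1000)) AS
    SELECT * FROM `my_project.warehouse.orders_staging`
""").result()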

Afterword

Designing a Data Lake and Data Warehouse is an interesting process from many perspectives: technical architecture and implementation, collaboration with people from different business areas, and learning the advanced usage of systems which were previously treated as a “black box”. In further articles, I might describe the details of using Data Transfer Service, triggering Dataproc workflows, and more, if I find the time between these tasks.
