How I used Google Cloud Data Fusion to create a data warehouse — Part 2

In part 1 of this post I explored what exactly you can use Google Cloud Data Fusion for, explaining the use case of a POC for one of our customers. I also talked about the differences between Cloud Dataflow and Cloud Dataproc. The next steps in creating a data warehouse with Google Cloud Data Fusion are to create the data pipeline, wrangle the data, and schedule and export the pipeline. Read on to see how I did this, plus my final thoughts on the pros and cons of using Google Cloud Data Fusion.

Creating the Pipeline

Creating a data pipeline is quite easy in Google Cloud Data Fusion thanks to the Data Pipeline Studio. There you select your data source, choose the transformations you want to perform, and define the sink.

All of this is done with just a couple of clicks and drag-and-drop actions.

Once you click on ‘Create Pipeline’ you end up in the Pipeline Studio. You then have six categories of plugins to choose from (a sketch of how these pieces fit together in a pipeline follows the list):

  1. Source
  2. Transform
  3. Analytics
  4. Sink
  5. Conditions and Actions
  6. Error Handlers and Alerts
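
Under the hood, a pipeline you assemble in Pipeline Studio is stored as a JSON document describing the stages (plugins) and the connections between them. The sketch below expresses that structure as a Python dict purely for illustration; the stage names, plugin names and properties are my assumptions, not a verbatim export.

```python
# Rough sketch of how a Data Fusion pipeline is described: a list of stages
# (each a plugin with its properties) plus the connections wiring
# source -> transform -> sink. All names below are illustrative assumptions.
pipeline_sketch = {
    "name": "gcs_to_bigquery_demo",          # hypothetical pipeline name
    "config": {
        "stages": [
            {
                "name": "Read raw files",
                "plugin": {"name": "GCSFile", "type": "batchsource",
                           "properties": {"path": "gs://my-bucket/input/"}},
            },
            {
                "name": "Clean records",
                "plugin": {"name": "Wrangler", "type": "transform",
                           "properties": {"directives": "trim :name"}},
            },
            {
                "name": "Write to warehouse",
                "plugin": {"name": "BigQueryTable", "type": "batchsink",
                           "properties": {"dataset": "dwh", "table": "orders"}},
            },
        ],
        "connections": [
            {"from": "Read raw files", "to": "Clean records"},
            {"from": "Clean records", "to": "Write to warehouse"},
        ],
    },
}
```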

You can choose from various types of data source connectors to connect your data, such as:

  • BigQuery
  • Cloud Datastore
  • Cloud Storage
  • Cloud Spanner
  • Microsoft Excel
  • FTP
  • Amazon S3
  • Etc…

Once you select the data source, Data Pipeline Studio creates a box for you on the grid. At this point you need to set the configuration parameters by clicking on “Properties”.

Pipeline Studio and adjusting properties of the data sources
Google Cloud Storage Properties

For example, if you have your data in Google Cloud Storage, you have to provide a label and a reference name. Then you can either select your Project ID or leave it on auto-detect. You’ll also have to provide the path to the bucket where your data resides, and that’s it.
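
As a rough illustration, the handful of values you fill in for a Cloud Storage source boil down to something like the following. The bucket, project and file pattern are made-up examples, and the exact property keys may differ per plugin version.

```python
# Hedged sketch of the values typed into the GCS source's "Properties" panel;
# the bucket, reference name and format below are made-up examples.
gcs_source_properties = {
    "label": "Raw sales files",
    "referenceName": "raw_sales_gcs",          # used for lineage tracking
    "project": "auto-detect",                  # or an explicit project ID
    "path": "gs://my-data-bucket/sales/*.csv", # where the data resides
    "format": "csv",
}
```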

Pre-built or custom Transformations

If you’re creating an ETL (Extract, Transform, Load) pipeline you can also perform some transformations using pre-built transformation plugins, or you can create your own. Another interesting option is to use Data Wrangler to perform data cleaning, formatting, and so on.

In the Analytics section you have options such as Deduplicate, Group By, Row Normalizer, Distinct and Joiner. This is super handy, since these are very general transformations that are performed quite often.
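
To give an idea of how little configuration these plugins need, here is a sketch of what a Group By aggregation amounts to. The property names and the aggregate syntax are my assumption of how the plugin is configured and may not match the exact field names in the UI; the column names are made up.

```python
# Hedged sketch of a Group By aggregation: group rows by some key columns and
# compute aggregates over the rest. Columns and syntax are illustrative only.
group_by_properties = {
    "groupByFields": "customer_id,country",
    "aggregates": "total_amount:sum(amount),order_count:count(order_id)",
}
```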

Sink Tab

After deciding on your transformations you can choose to save your transformed data into a sink. For this you can use one of the predefined options from the “Sink” tab, such as the following (a sketch of a BigQuery sink configuration follows the list):

  • BigQuery
  • Google Cloud Storage
  • Google Cloud Datastore
  • Google Cloud Pub/Sub
  • Database
  • HBase
  • Avro
  • Amazon S3
  • Etc.
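
For example, pointing the pipeline at a BigQuery sink comes down to a few fields along these lines. The dataset and table names are made up, and the exact property keys (and available write modes) may vary by plugin version.

```python
# Hedged sketch of a BigQuery sink configuration; names are examples only.
bigquery_sink_properties = {
    "referenceName": "dwh_orders_sink",
    "dataset": "data_warehouse",   # target BigQuery dataset
    "table": "orders",             # target table
    "operation": "insert",         # assumption: append new rows on each run
}
```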

In addition to this, if you want to create or delete a bucket from Google Cloud Storage or move the data contained in it, you can easily do this through your pipeline using the relevant plugins.

In this Sink tab you also have the option to use the “Remote Program Executor”, which basically enables you to run your own code on a virtual machine. For example, if you have custom Python code or Bash scripts you want to run, you can create a VM and give Cloud Data Fusion access so it can connect over SSH and execute the code that resides there.
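
Conceptually, what you set up here is not unlike opening an SSH session to that VM and running a script that already lives there. The sketch below shows that idea in plain Python with paramiko; the host, user, key and script path are placeholders, and this is an illustration of the concept, not the plugin’s actual implementation.

```python
import paramiko

# Conceptual illustration: connect to a VM over SSH and run a script that
# already resides there. Host, user, key and script path are placeholders.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname="10.0.0.5", username="datafusion",
               key_filename="/path/to/private_key")

stdin, stdout, stderr = client.exec_command("bash /opt/jobs/refresh_lookup.sh")
print(stdout.read().decode())
print(stderr.read().decode())

client.close()
```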

Error Handlers and Alerts Tab

In the Error Handlers and Alerts tab you can trigger alerts and run an error collector to keep track of the errors in the process.

Pipelines List

After you’re satisfied with your pipeline design you can save it and click ‘Deploy’ to have it readily available for execution in production. Once deployed, your pipeline will appear in the Pipelines List.

Pipelines list in a namespace

If you want to duplicate, export or delete a pipeline design, you can perform these actions from the Pipelines List as well.
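
Exporting can also be done programmatically: a Cloud Data Fusion instance exposes the CDAP REST API, so a sketch along the lines of the one below can pull a deployed pipeline’s JSON, for example to version it in a repository. The instance endpoint, namespace and pipeline name are placeholders, and the exact API path should be checked against the Data Fusion documentation.

```python
import json
import requests
import google.auth
from google.auth.transport.requests import Request

# Hedged sketch: fetch a deployed pipeline's definition through the REST API
# exposed by the Data Fusion instance. The endpoint below is a placeholder.
API_ENDPOINT = "https://my-instance.example-datafusion.googleusercontent.com/api"
NAMESPACE = "default"
PIPELINE = "gcs_to_bigquery_demo"

credentials, _ = google.auth.default()
credentials.refresh(Request())

resp = requests.get(
    f"{API_ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}",
    headers={"Authorization": f"Bearer {credentials.token}"},
)
resp.raise_for_status()

with open(f"{PIPELINE}.json", "w") as f:
    json.dump(resp.json(), f, indent=2)
```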

Data Wrangling

So you also want to wrangle your data? Of course you do… If you not only want to replicate your data from an on-premises source system to a cloud data warehouse like BigQuery, but also clean it and manipulate it to extract the information that you need, you can use a great tool in Google Cloud Data Fusion called “Data Wrangler”.

Using this tool will make your life a bit easier, as you don’t have to write long SQL queries to perform all the cleanups and joins.

In Data Wrangler, you can clean and manipulate your data with just a few simple clicks. You can then save this Data Wrangler recipe to use in your data pipeline processes.

With Data Wrangler in Cloud Data Fusion you can select a column, apply a filter to it, change its format, and copy or extract fields from it with just a few clicks. Each step you apply is saved, so you can go back and modify it at any point in the data wrangling process.
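
Behind the point-and-click steps, Wrangler records a recipe of directives. As a rough example, a recipe for cleaning a CSV of orders could look something like the list below; the column names are made up and the exact directive syntax may vary by Wrangler version.

```python
# Hedged sketch of the kind of directive recipe Wrangler builds up as you
# click through cleaning steps; columns and values are made-up examples.
wrangler_recipe = [
    "parse-as-csv :body ',' true",        # split the raw line into columns
    "drop :body",                         # drop the original unparsed column
    "rename :body_1 :order_id",
    "rename :body_2 :amount",
    "set-type :amount double",
    "trim :order_id",
    "fill-null-or-empty :country unknown",  # assumption: default missing values
]
```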

After you are satisfied with the steps, you can just click ‘Create Pipeline’ and it will automatically switch to the Pipeline Studio, so you can directly use the data wrangling steps you just created and effortlessly operationalise them in a production pipeline.

Read the rest of the article here!

If you like my blog post, please don’t forget to clap and share!

Want to know more about Data Fusion or other Google Cloud Platform solutions?

Visit the Fourcast GCP page or just send a message to our expert engineers!

Originally published at the Fourcast Blog.
