Running Cloud Dataprep jobs on Cloud Dataflow for more control

Mehdi BHA
Jan 10 · 7 min read

Cloud Dataprep is cool but… jobs run only in the us-central1 region 😢

I love Cloud Dataprep: first, it offers a large set of transformations on your raw data without writing a single line of code. It also integrates well with Google Cloud: Cloud Dataprep jobs run as Cloud Dataflow jobs that read from and write to Google Cloud Storage and/or Google BigQuery.

With Cloud Dataprep you create a flow. A flow design starts by importing one or many datasets. A dataset can be a local file uploaded to Cloud Dataprep, a file in Cloud Storage, or a table in BigQuery. Then you add one or more recipes to your imported dataset(s). A recipe contains a sequence of steps to structure, clean, transform, and combine your dataset(s). For each recipe, you can create an output. An output lets you define one or more publishing actions, which set the output format, location, and other publishing options applied to the results generated by a job run on the recipe. Today you can only choose Google Cloud Storage and BigQuery as publishing destinations. When you are satisfied with your recipe and have finished configuring your output action, you can run a job to materialise the output of your recipe. The job will execute your recipe, and you can choose to profile the results by checking the Profile Results option in the Publishing Settings screen, as you can see below:

Profiling the results will generate a visual summary, as you can see in the screenshot below (see Result profile). It can provide important feedback on data quality and can help you refine your recipe, but it will also make your job slower because it adds extra steps (transforms) to the corresponding Cloud Dataflow job. So if you prefer faster jobs, you can disable this option. Below you can see two runs of the same recipe, one with profiling and one without. You can see the difference in their execution durations: the first took 8m49s while the second, without profiling, took only 5m19s (see Two Dataprep jobs with and without profiling).

Result profile

The problem

Two Dataprep jobs with and without profiling

Did you notice a problem in the screenshot above? No? Pay closer attention… still not? Let me give you a hint: I live in France. Still not? I framed the problem with a purple rectangle. Yes 😄 indeed… the Dataflow jobs corresponding to the Dataprep ones ran in the us-central1 region, the default regional endpoint used by Cloud Dataflow.

This is a problem because it increases our network latency and our jobs’ duration, not to mention the needless network transport costs. For more detail, take a look at why specify a regional endpoint.

By the way, us-central1 is used even though:

  • Cloud Dataprep is set up to use a Google Cloud Storage regional bucket located in Europe (see the two screenshots below)
  • The imported dataset is a file located in a Google Cloud Storage regional bucket in Europe
  • The output is written to a Google Cloud Storage bucket and a BigQuery dataset that are also located in Europe

At the time of writing, there is no option in Cloud Dataprep to make its jobs run in a region other than the default one. But there is a workaround 😉: using Cloud Dataflow templates.

Running Dataprep job from a Dataflow template

Getting the Cloud Dataflow template file

By now you know that when you run a Dataprep job, it runs as a Cloud Dataflow job. When a Cloud Dataflow job execution completes, it creates a template file. This file is stored in Cloud Storage; you can find it in the folder set as the Cloud Dataprep Temp Directory (see the screenshot above).

You may have run many Cloud Dataprep jobs, so here is how to get the Cloud Dataflow template file corresponding to a specific job:

  1. Go to the Jobs page in the Cloud Dataprep UI. For more information, see Jobs Page.
  2. Hover over the desired job. A context menu will appear at the end of the row. Open the menu and select Export Results (see the Jobs page screenshot below).
  3. In the Export Results window, copy the Cloud Dataflow Template URL value (see the Export Results window screenshot below). You will use this link to reference the template in Cloud Dataflow. For more information, see Export Results Window.
  4. Close the window.
Jobs page
Export Results window
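
If you prefer to locate the template file programmatically rather than through the UI, you can also list the objects stored under your Dataprep temp directory with the Cloud Storage client library. This is only a minimal sketch; the project ID, bucket name, and prefix below are placeholders for your own Temp Directory settings.

# A minimal sketch, assuming the google-cloud-storage library is installed
# (pip install google-cloud-storage). Bucket name and prefix are placeholders.
from google.cloud import storage

client = storage.Client(project="my-project")  # hypothetical project ID

# List the objects stored under the Dataprep temp directory and print the ones
# that look like Dataflow template files.
for blob in client.list_blobs("my-dataprep-bucket", prefix="temp/"):
    if "template" in blob.name.lower():
        print(f"gs://my-dataprep-bucket/{blob.name}")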

Using the template to choose the region

Now that you have your Cloud Dataflow template, you can execute it through the Google Cloud Platform Console, the REST API, or the gcloud command-line tool; most importantly, you will be able to customize the execution of your job and choose, for example, a proper regional endpoint.
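
Before diving into the Console, here is a rough idea of what the same launch could look like through the REST API, using the Google API Python client. This is only a sketch: the project ID, region, template path, and parameter values are placeholders, and the exact parameter names expected by your template are shown as hints in the Dataflow UI and in the Export Results window.

# A minimal sketch of launching a Dataflow template at a chosen regional endpoint
# with the Google API Python client (pip install google-api-python-client).
# Every name, path, and parameter value below is a placeholder.
from googleapiclient.discovery import build

PROJECT = "my-project"
REGION = "europe-west1"  # the regional endpoint we want instead of us-central1
TEMPLATE_PATH = "gs://my-dataprep-bucket/temp/my-template"  # the Cloud Dataflow Template URL

dataflow = build("dataflow", "v1b3")

request = dataflow.projects().locations().templates().launch(
    projectId=PROJECT,
    location=REGION,  # this is what pins the job to europe-west1
    gcsPath=TEMPLATE_PATH,
    body={
        "jobName": "dataprep-job-in-europe",
        "parameters": {
            # Required parameters of the template; check the hints shown in the UI
            # for the exact names and JSON shape expected by your template.
            "inputLocations": '{"location1": "gs://my-dataprep-bucket/input/data.csv"}',
            "outputLocations": '{"location1": "gs://my-dataprep-bucket/output/data"}',
            "customGcsTempLocation": "gs://my-dataprep-bucket/tmp",
        },
        "environment": {
            # Optional worker settings, equivalent to "Additional parameters" in the UI.
            "zone": "europe-west1-b",
            "machineType": "n1-standard-1",
            "tempLocation": "gs://my-dataprep-bucket/tmp",
        },
    },
)
response = request.execute()
print(response.get("job", {}).get("id"), response.get("job", {}).get("location"))

The key point is the location argument: it selects the regional endpoint that will run the job, exactly like the Regional endpoint field in the Console described below.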

Here I will explain how to do this through the Google Cloud Platform Console:

Go to the Cloud Dataflow page in the GCP Console.

Click CREATE JOB FROM TEMPLATE.

Enter a valid job name in the Job Name field.

Select Custom Template from the Cloud Dataflow template drop-down menu.

In the template Cloud Storage path field, enter the Cloud Dataflow Template URL value that you copied from the Export Results window.

The UI will dynamically load one or more fields corresponding to the required parameters of your template, which you have to fill in.

The first field is the key to our problem: it lets you choose a Regional endpoint for your Dataflow job. Here you can set a region other than the default us-central1, for example europe-west1.

The other fields are the remaining required parameters of your template. Typically, for a template created by Dataprep, you will have to provide input and output locations; both values are JSON documents. The last required field is a custom location in Google Cloud Storage for storing temporary files. Luckily, each field comes with an example of the value to provide, along the lines of the sketch below.
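
To make this concrete, here is the kind of value these fields expect. The paths below are purely illustrative, and the exact parameter names and JSON shape are given by the hints next to each field in the UI.

# Purely illustrative values for a Dataprep-generated template's required fields;
# replace the bucket and file names with your own, following the UI hints.
input_locations = '{"location1": "gs://my-dataprep-bucket/input/data.csv"}'
output_locations = '{"location1": "gs://my-dataprep-bucket/output/data"}'
custom_gcs_temp_location = "gs://my-dataprep-bucket/tmp"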

If you need to specify additional technical parameters, click on Additional parameters. You will be able to specify, for example, the zone and the machine type of the workers.

If your template needs more parameters, click on Add item in the Additional Parameters section. Enter the Name and Value of the parameter. Repeat this step for each needed parameter.

Finally, all that remains is to click Run Job 😃 and the Cloud Dataflow job will run in the region you selected, as you can see in the screenshot below.

Conclusion

Even though we found a workaround to the problem of running Cloud Dataprep jobs in a region other than the default one, I would have preferred to be able to do it directly from Cloud Dataprep rather than go through all the steps I detailed in this post.

The other issue is that we can only get the template file after a first successful execution of our Cloud Dataprep job. This means that we have to accept a first run in us-central1; then we can use the workaround for future recurring jobs with different parameters. I say recurring jobs because if you use Cloud Dataprep for a one-shot job, our workaround is useless, unless you run the job on a tiny sample of your data just to get the template file before using it on the real data.

Have fun!
