Google Cloud Dataprep: Prepare Data of Any Size

Easy data preparation with clicks and no code.

Dhvani Vashist
The Startup
7 min read · Nov 11, 2020


We are all familiar with the pain of cleaning and altering raw data before its analysis. This step is highly important in any analysis or evaluation, but it can be an intense process that consumes a lot of our time. Fortunately, there are many tools available online to ease this task. One such tool is Cloud Dataprep from Google.

What is the preparation of data and how important is it?

Data preparation is the process of exploring, cleaning, and modifying raw and messy data so that it can be used for reporting, analysis, and evaluation of machine learning algorithms. Its purpose is to get structured or unstructured data ready for use.

The following are the steps involved in data preparation:

  • Deleting unnecessary data
  • Consolidating/separating fields
  • Transforming data
  • Making corrections to data
  • Changing formats
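
Although Cloud Dataprep does all of this with clicks, it can help to see what the steps above amount to in code. The following is only a minimal sketch in plain Python, assuming a tiny made-up CSV; the column names, the sample rows, and the prepare function are all hypothetical and are not part of Dataprep.

```python
import csv
import io

# Hypothetical raw data: stray spaces, a duplicate row, a row with no name.
RAW_CSV = """id,full_name,signup_date,notes
1, Ada Lovelace ,2020/11/01,vip
2,alan turing,2020/11/03,
2,alan turing,2020/11/03,
3,,2020/11/05,test row
"""


def prepare(raw_text):
    """Apply the five preparation steps to a small CSV string."""
    cleaned = []
    seen_ids = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Deleting unnecessary data: skip rows with no name and duplicate ids.
        name = row["full_name"].strip()
        if not name or row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])
        # Making corrections to data: fix spacing and capitalization.
        name = " ".join(part.capitalize() for part in name.split())
        # Separating fields: split the full name into first and last name.
        first, _, last = name.partition(" ")
        # Changing formats: 2020/11/01 becomes 2020-11-01.
        signup = row["signup_date"].replace("/", "-")
        # Transforming data: cast the id to an integer and drop the
        # unused "notes" column by not copying it over.
        cleaned.append({
            "id": int(row["id"]),
            "first_name": first,
            "last_name": last,
            "signup_date": signup,
        })
    return cleaned


for record in prepare(RAW_CSV):
    print(record)
```

Four raw rows go in and two clean rows come out. A tool like Dataprep lets you build the same kind of recipe visually instead of writing it by hand.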

Now that we are clear about what data preparation is, let’s focus on its importance. Not all the data we have is useful, so it is necessary to extract quality data from the original raw data. Data preparation improves the ability to use data in a distributed manner for data discovery, data mining, and advanced analytics. It also increases the accuracy of the data, which raises the quality of decision making and in turn allows organizations to work better and faster. Preparing, blending, fixing, and refining data makes for a much smoother analysis experience and helps institutions improve the way they operate.

Data preparation Tools:

Although data preparation is an important step, it is often said to consume almost 80% of a developer’s time. This data-related issue is one of the major setbacks for AI practices in organizations. To help with this process, many tools are available online. These tools are used for processing, blending, exploring, cleaning, and transforming data, and they provide quick and efficient integration of data.

The popularity of these tools is increasing worldwide.

There are certain major requirements that a data preparation tool must meet to qualify. It must be sold as an independent software program or as an integrated data tool that has certain data preparation capabilities. It must allow users to merge, combine, and transform datasets for basic data analysis and integration. It must also offer a higher level of cleansing and purification for enhanced data quality. Many data preparation tools are available online, and each of them is different in some way.

One such tool is Google Cloud Dataprep by Trifacta.

Cloud Dataprep is an integrated partner service operated by Trifacta and based on their industry-leading data preparation solution.

Google, along with Trifacta, ensures a smooth user experience for preparing structured and unstructured data for analysis and similar uses. Cloud Dataprep is an intelligent, fully managed data service that scales on demand to match your needs.

Which data prep tools and software are available, and how is Google Cloud Dataprep different?

As discussed earlier, there are several options for data preparation and ETL tools available online. So why not use other tools?

To answer this question, I must discuss some other tools as well. ETL (extract, transform, load) tools like Google Cloud Dataflow and Stitch can also be used. In case you don’t know what these are, Dataflow is a data processing service from Google that uses pipelines to transform and analyze batch and real-time data. Stitch is a similar ETL product that performs transformations such as translating data types within the pipeline.

ETL and data preparation sound very similar because their basic functionalities overlap. The main downside of ETL is that it requires a well-defined structure, so altering the data takes much more time. Data preparation tools, on the other hand, can process the entire dataset rather than just parts of it, so the process executes much more quickly. Another convenience of data preparation is that end users can do it themselves, which saves the IT team the time and burden of creating and preparing data assets. Other data preparation tools include Trifacta, OpenRefine (earlier known as Google Refine), and Alteryx. They are all dependable and have many outstanding features, but Google Cloud Dataprep holds an advantage over all of them.

With Cloud Dataprep there is no need to worry about the size of the data file. The documentation on its website states that there is no limit to the file size. (Uploading a very large file directly would throw an error, but loading the same file from Google Cloud Storage works just fine.)

Some other advantages of Google Cloud Dataprep are:

  • It is easy to use and integrates with other services provided by Google Cloud Platform.
  • People can access multiple data sources from Cloud Storage and BigQuery for further analysis.
  • The prepared data can also be used by services like Google Data Studio for analysis or Google Cloud Machine Learning Engine to train ML models.
  • There is no need for any VM.
  • The process becomes much easier with the intuitive GUI, which lets you see the whole process.
  • Cloud Dataprep provides its users with more functionalities like anomaly detection, drag-and-drop development, and out-of-the-box integration with GCP.

There are, of course, some limitations with Dataprep:

  • Sort transform is not supported.
  • It doesn’t support user-defined functions.
  • It doesn’t support custom dictionaries and data types.
  • User access to administrative functions is not supported.
  • There may also be some limitations to file formats.

Check out the following link to see all the file formats that are supported by Google Cloud Dataprep.

How to use Google Dataprep?

To start using the Dataprep application you can refer to the following site. Each and every step is explained in detail.

Pricing:

Google Dataprep jobs are executed by Dataflow workers, which are priced per second for CPU, memory, and storage resources. A Dataprep job is billed according to the number of Dataflow worker virtual CPUs (vCPUs) used to process it: the time that these vCPUs are in use is multiplied by the Dataprep service rate of $0.60 per vCPU per hour.

Example: A Dataprep job runs for 1 hour and requires 5 Dataflow virtual CPUs. Dataprep job cost = 1 hour * $0.60 * 5 vCPUs = $3.00
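
If you want to estimate costs for your own jobs, the same arithmetic can be wrapped in a few lines of Python. This is just a sketch of the formula above: the function name and its parameters are my own, and the $0.60 rate is the one quoted in this article, so check the current pricing page before relying on it.

```python
# Dataprep service rate in USD per vCPU per hour, as quoted in this article.
# Treat it as an assumption; the real rate may have changed.
DATAPREP_RATE_PER_VCPU_HOUR = 0.60


def dataprep_job_cost(runtime_hours, vcpus, rate=DATAPREP_RATE_PER_VCPU_HOUR):
    """Estimate the Dataprep charge: hours * vCPUs * hourly rate."""
    if runtime_hours < 0 or vcpus < 0:
        raise ValueError("runtime_hours and vcpus must be non-negative")
    # Round to cents, since the result is a dollar amount.
    return round(runtime_hours * vcpus * rate, 2)


# The example from the text: 1 hour on 5 vCPUs.
print(dataprep_job_cost(1, 5))  # 3.0
```

Because billing is based on actual usage time, a job that finishes in 30 minutes on 2 vCPUs would be estimated with dataprep_job_cost(0.5, 2).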

So what’s next for Google Dataprep?

As the data science industry grows, the demand for Google Cloud Dataprep is growing too. To meet this demand, continuous innovation and upgrades are under way, and customers will soon get access to them. Dataprep already provides its users with many remarkable functionalities, and the future of both the data preparation process and the Cloud Dataprep application looks very promising.

Conclusion:

Data preparation is of high importance and is a necessity for analytics, as it increases the quality of data. We discussed how it can be time-consuming, and that’s where online data preparation tools come in handy. Google Cloud Dataprep, operated by Trifacta, is one such tool.

To conclude this article, I must say that there are some areas in which the application needs improvement, for example the extraction of zipped files. But overall, Cloud Dataprep is a pretty good application with an amazing user interface that can surely help you with data preparation and will save you a ton of time.

All that is left now is for you to try Google Cloud Dataprep for your data preparation needs.

You can refer to the above link to get started with Google Dataprep.
