My views on Google Cloud Dataprep

Kimoon Kim
Google Cloud - Community
4 min read · Oct 27, 2017

Does data cleaning, data cleansing, data prepping, data alteration etc. ring a bell? If so, this article is for you! I know your pain because I am one of you. It’s an agonizing process, and preparing and managing data for analysis can take up nearly 80% of one’s time.

Why not just use an ETL tool like Cloud Dataflow to clean and prepare your data?

For those of you who are not familiar with Cloud Dataflow, it is a data processing service from Google that uses pipelines to ingest, transform and analyze both batch and real-time data. Sure, you can use Cloud Dataflow, but do you feel like coding when correcting errors, omissions, or inconsistencies in your data? I sure don’t.

There are many data cleaning tools out there, such as Trifacta and OpenRefine (formerly Google Refine), but they weren’t suited to my use cases. Trifacta Wrangler (the free version) looked promising and had all the features I needed, but it limited me to 100 MB files. The smallest CSV file I had was around 150 MB, so I had to split it into smaller pieces and perform the same transformation on each one, which was quite frustrating. Sometimes I would even forget which transformations I had already applied and had to start all over again. Upgrading to the enterprise version made no sense for my use case, so I eventually stopped using it. OpenRefine, an open-source tool, felt ancient and buggy compared to Wrangler. What then did I eventually use? MATLAB.
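To illustrate the file-splitting workaround described above, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the exact procedure used at the time: the function name `split_csv` and the `max_bytes` threshold are hypothetical, and the sketch assumes the first row of the CSV is a header that should be repeated in every piece.

```python
import csv
import io


def split_csv(source, max_bytes):
    """Yield CSV strings that each repeat the header row.

    `source` is any iterable of lines (for example an open file).
    Each piece stays at or under `max_bytes` when possible; a single
    row larger than the limit still gets a piece of its own.
    """
    reader = csv.reader(source)
    header = next(reader)

    def render(row):
        # Serialize one row back to CSV text so quoting is preserved.
        buf = io.StringIO(newline="")
        csv.writer(buf, lineterminator="\n").writerow(row)
        return buf.getvalue()

    header_text = render(header)
    header_size = len(header_text.encode("utf-8"))

    rows, size = [], header_size
    for row in reader:
        row_text = render(row)
        row_size = len(row_text.encode("utf-8"))
        # Start a new piece if adding this row would exceed the limit.
        if rows and size + row_size > max_bytes:
            yield header_text + "".join(rows)
            rows, size = [], header_size
        rows.append(row_text)
        size += row_size
    if rows:
        yield header_text + "".join(rows)
```

With a real file, one might open it with `open(path, newline="")`, pass the file object to `split_csv` with a limit just under 100 MB, and write each yielded piece to its own file. The same transformation still has to be repeated on every piece, which is exactly the tedium described above.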

So, when I heard there would be a FULLY MANAGED service of Trifacta called Cloud Dataprep from Google, I got hyper-excited.

TL;DW (Too Long; Didn’t Watch) — Google Cloud Dataprep is an intelligent data service from GCP that allows you to visually explore, clean and prepare data that is not ready for immediate analysis.

I immediately signed up for private beta access and was granted it last week. You can do the same by clicking on the link below.

Google Dataprep

Trifacta Wrangler

The reason for my excitement was that I had access to a service that could easily integrate with all the other tools on the Google Cloud Platform.

  • I could easily import structured or unstructured data from my Google Cloud Storage buckets to explore, clean and prepare it for analysis.
  • It was easy to cleanse and enrich multiple data sources using an intuitive, visual interface.
  • I could import tables from BigQuery for further analysis!
  • I could clean and prepare the data so that I could use Google Cloud ML Engine to train machine learning models.

The use cases were endless… but I was worried about the 100 MB file size limit that had made me stop using Trifacta Wrangler. The documentation on Google’s website states that there is no file size limit.

“Prepare datasets of any size, megabytes to terabytes, with equal ease.”

I tested to see if this was true and tried to upload a file directly, and received the following message!

Okay… uploading a file directly does not work. However, IT WORKS using Google Cloud Storage! This product is what I needed! Besides that, Cloud Dataprep comes with many cool features, including anomaly detection, drag-and-drop development and out-of-the-box integration with GCP, all fully managed.

Would I use the product once released? Yes

Is it easy to use? Yes

Would I recommend it to those poor souls spending 80% of their time cleaning data? Yes

Next Steps

Kimoon
