Oracle data (ETL) to Google BigQuery using Google Cloud Dataflow and Dataprep
ETL made easy
Even though an ETL task can be pretty challenging when it comes to loading Big Data, there are still scenarios in which you can load gigabytes or terabytes of data from Oracle into BigQuery (Google Cloud's analytics data warehouse) relatively easily and very effectively.
Data preparation and transformation consume nearly 80% of a Data Scientist's working time.
In my case, the diagram below is based on the cloud architecture we created for our customer, using Data Science libraries such as pandas, numpy, and others. Our customer's database table was imported into BigQuery for further data preparation and transformation using Google Cloud Dataprep.
- I would like to mention the excellent work of TRIFACTA, the company that developed Dataprep together with Google.
This tool is really useful and time-saving; let's just say it made our Data Scientists smile.
1. Export Oracle database table to CSV.
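A minimal sketch of the export step in Python, assuming the cx_Oracle driver is available; the connection string, table name, and file name below are placeholders, not values from the original setup:

```python
import csv

def rows_to_csv(rows, header, path):
    """Write an iterable of row tuples to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Hypothetical Oracle export using cx_Oracle (all identifiers below are
# placeholders -- adapt them to your environment):
#
# import cx_Oracle
# conn = cx_Oracle.connect("user/password@host:1521/service")
# cursor = conn.cursor()
# cursor.execute("SELECT * FROM my_table")
# header = [col[0] for col in cursor.description]
# rows_to_csv(cursor, header, "my_table.csv")
```

For very large tables you would stream the cursor in batches instead of materializing all rows, but the CSV-writing logic stays the same.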
You can also use a third-party ETL tool to connect your database directly to BigQuery, as described in this link, eliminating some of the steps described below.
2. Upload file to Cloud Storage bucket.
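The upload can be done from the console, with `gsutil`, or programmatically. Here is a hedged Python sketch using the google-cloud-storage client library; the bucket and object names are placeholders:

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI where the uploaded object will live."""
    return f"gs://{bucket_name}/{blob_name}"

# Hypothetical upload with google-cloud-storage (names are placeholders,
# and credentials are assumed to be configured in the environment):
#
# from google.cloud import storage
# client = storage.Client()
# bucket = client.bucket("my-etl-bucket")
# blob = bucket.blob("exports/my_table.csv")
# blob.upload_from_filename("my_table.csv")
# print("Uploaded to", gcs_uri("my-etl-bucket", "exports/my_table.csv"))
```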
3. Create a Dataset and import the CSV data into Google Cloud Dataprep
Select Dataset and import data from cloud storage:
Choose the Cloud Storage bucket and select the file to import:
4. Add the dataset to a flow to run a job on Google Cloud Dataflow
Add new recipe
5. Clean the data and set the job to run
I really like how easily it integrates with all the other tools on Google Cloud Platform. You can save a lot of time on data cleaning, and Data Scientists can focus on the business value.
I will not go into much detail on Dataprep here, because there are other great posts you can use as a reference. ;)
6. Edit Job
You can also schedule the job to run periodically; Dataprep offers this option.
7. Create a new Table on BigQuery
8. Save Settings and Run Job
9. View job running on Dataflow
10. Check results
Dataprep is as important to data analysis as a pre-flight checklist is to a pilot.
11. View BigQuery Table
Example of a SELECT searching for a string (38.5M rows).
- Query complete (16.0s elapsed, 4.78 GB processed)
SELECT * FROM `project.dataset.table` t  -- table name is a placeholder
WHERE REGEXP_CONTAINS(LOWER(TO_JSON_STRING(t)), r'caixa marcha')
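The trick in this query is that TO_JSON_STRING(t) serializes every column of a row into one JSON string, so a single REGEXP_CONTAINS scans all columns at once. A hedged sketch of running the same search from Python with the google-cloud-bigquery client; the table name is a placeholder:

```python
def build_search_sql(table, term):
    """Build a query that searches every column of every row for a term,
    by serializing each row to JSON and applying one regex to it."""
    return (
        f"SELECT * FROM `{table}` t "
        f"WHERE REGEXP_CONTAINS(LOWER(TO_JSON_STRING(t)), r'{term}')"
    )

# Hypothetical execution with google-cloud-bigquery (table name is a
# placeholder, credentials assumed to be configured):
#
# from google.cloud import bigquery
# client = bigquery.Client()
# sql = build_search_sql("my_project.my_dataset.my_table", "caixa marcha")
# for row in client.query(sql).result():
#     print(dict(row))
```

Note that this approach performs a full scan (4.78 GB processed above), so it is convenient for exploration rather than for frequent production queries.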
Next Steps:
- Explore BigQuery with GCP Datalab (notebooks for Data Science).
- Explore Big Data visualization with GCP Data Studio.