Transfer Data from BigQuery to MongoDB

Hussain Poonawala
Quinbay
Published in
3 min readJan 30, 2021

In this article, I’ll introduce the problem, the approaches thought off, and the one which we decided to use.

Problem: We wanted to build a recommendation engine but as the calculation of scores would be computationally heavy (MapReduce and Aggregation queries in realtime would not be of much help), we wanted to use the capabilities of bigquery and keep the scores/recommendations ready on the DB side. We could then update the recommendations every day by running a job.

For this we came up with 3 approaches:

  1. Retrieve the data from the bigquery and export the data to the CSV file(s) to a common location. A scheduler job process those files going through the ETL process and persist them into a MongoDB collection.
  2. Create an AirFlow job that does the extraction of the data from the bigquery, transforms the data as per the requirement, and loads them to the MongoDB collection.
  3. Last but not least, using the bigquery client SDK, retrieve the required data from BigQuery, transform and insert the data into MongoDB.

We decided to go ahead with the 3rd approach.

Step 1: Ready the query to get the required data from Bigquery
For this blog, I am using a sample dataset and retrieving all the data from the table and will insert this into MongoDB.

Step 2: Initial Set Up to use the client library
Follow all the steps mentioned in the Before you begin section.

Step 3: Install the client library
We will be using maven, so we need to import the following dependencies in the pom:

Step 4: Complete source code to retrieve data from bigquery

After finishing the job, we can iterate over the result and insert the records into MongoDB:

Step 5: Inserting into MongoDB

You can persist or save the documents into MongoDB either by loading them one by one or in several batches. Inserting the records in batches can improve the performance dramatically. So we insert them in batches as below:

Conclusion: We were able to transfer the data from BigQuery to MongoDB using the above steps. But we found that BigQuery is quite fast at processing large sets of data, however retrieving large datasets was a bottleneck. It took a lot of time to paginate and insert into the DB. We played a lot with different page sizes and batch sizes to find the optimal parameters for us. And as our data size was not very huge (approx 15 lakh), we were fine with this approach. But if your dataset size is huge, you might want to try some other approach. Do comment if you have any suggestions or a totally new/better approach.

References:

--

--