Hack Reactor SDC | Part II | Fake the millions

Chase Norton · Published in Glitter Guys
2 min read · Mar 20, 2018

Today begins the task of understanding a legacy codebase: dissecting the connections, breaking down the schema, and then slowly building the scripts to seed/populate the database with 10 million unique data points.

The database during this part will be MongoDB, but we are required to test both a SQL and a NoSQL database, which means I will transition to SQL next. For now, the focus becomes how to create 10 million unique data points. A few ideas I’ve had:

  • Make API calls to scrape real MeetUp data
  • Create a small handful of real data points and then just replicate those across the database millions of times
  • Focus more on the shape of the data instead of the content and use a library like Faker to do the heavy lifting.

I first turned my focus to real data. Any real API calls to scrape data from MeetUp would eventually get my IP address banned/blocked, and well before the 10 million mark. In addition, the network time required to make those calls is not feasible in a real application.

Okay, but what about just using a few real data points and then replicating that data across/within the database? Sure, that is possible, but it really does not meet the guideline of unique data points. It also does not address the primary challenge set in front of us: how to deal with this much data, both insertion and retrieval.

This left me with one course of action: focus on the shape of the real data, design my schema around that shape, and simply populate the objects with fake data using faker.js. Simple? Right? Kinda.
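The shape-first idea looks something like this. The field names here are illustrative (the post doesn't show its actual schema), and simple stand-ins replace the faker.js generators so the sketch runs without dependencies; in the real script each field would be a call like `faker.lorem.words()` or `faker.address.city()`, with an index guaranteeing uniqueness:

```javascript
// Sketch of a shape-first generator. Field names are hypothetical; the
// commented faker.js calls are what would fill them in a real script.
const CITIES = ['Austin', 'Denver', 'Portland', 'Seattle'];

function makeFakeMeetup(i) {
  return {
    eventId: i,                                  // the index keeps every document unique
    title: `Meetup Event #${i}`,                 // e.g. faker.lorem.words() in practice
    city: CITIES[i % CITIES.length],             // e.g. faker.address.city() in practice
    attendees: Math.floor(Math.random() * 500),  // e.g. a faker random number in practice
    date: new Date(2018, i % 12, (i % 28) + 1),
  };
}

console.log(makeFakeMeetup(42).title); // "Meetup Event #42"
```

The content is fake, but the shape matches what a real MeetUp document would look like, which is all the insertion/retrieval challenge actually requires.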

My first attempt eventually caused an out-of-memory error. While looping 10 million times, I was pushing each data object into a single array, then planning to seed the database by looping through that array of objects. In an ideal world of infinite memory, this might have worked. In reality, I had to find a better way. The answer came on the second attempt, after learning more about async/await:

When the outer loop runs 1,000 times and the inner loop 10,000 times, an input array of data points can be built for each batch and then inserted with mongoose’s built-in insertMany method, which writes multiple objects at once. Using async/await ensures each batch is seeded before the next begins, without bogging down the machine.
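The batching pattern described above can be sketched as follows. The batch sizes are parameters rather than the hard-coded 1,000 × 10,000, and a counting stub stands in for the mongoose model so the example is self-contained; in the real script, `db.insertMany` would be `Meetup.insertMany` (or whatever the model is named) on a live connection:

```javascript
// Stand-in for a mongoose model; a real script would call Model.insertMany(docs).
const db = {
  count: 0,
  async insertMany(docs) { this.count += docs.length; },
};

// Outer loop: number of batches. Inner loop: documents per batch.
// With outerCount = 1,000 and innerCount = 10,000 this seeds 10 million
// documents while only ever holding one batch (10,000 objects) in memory.
async function seed(outerCount, innerCount, makeDoc) {
  for (let batch = 0; batch < outerCount; batch++) {
    const docs = [];
    for (let j = 0; j < innerCount; j++) {
      docs.push(makeDoc(batch * innerCount + j)); // globally unique index
    }
    await db.insertMany(docs); // awaiting each batch keeps memory bounded
  }
}

// Usage: seed(1000, 10000, makeFakeDoc) for the full 10 million.
```

The key difference from the first attempt is that the array is rebuilt per batch, so memory usage stays flat no matter how many total documents are seeded.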

So how long does it take to generate 10 million data points and then seed a MongoDB utilizing the insertMany method?

12 minutes and 34 seconds.

Next step? SQL

Just Keep Coding,

Chase Norton
