Seeding Data Into MongoDB Running On Docker

Luis Osta

Jun 20, 2020 · 14 min read

Learn how to seed data into a running Docker container in a simple and flexible way.

Legally required image that has nothing to do with the article topic
Photo by NASA on Unsplash

Why would we need to generate data?

Modern applications are, at least to some extent, data-rich. In practice, this means applications often have features like a Twitter-style feed, aggregated statistics, friends/followers, and many others that rely on complex, inter-related data.

It's this data that provides the vast majority of an application's value. Twitter would be quite useless if all you could do was post and see a handful of others' tweets.

The biggest pitfall most developers can fall into is re-using the production database for development

Due to the complexity of this data, during the early development process of an application, it's tempting to use the same database for production as for development.

The thinking goes, "if we need the most realistic data, what's more realistic than actual user-generated data?"

There are a few serious reasons why you should strongly consider not going that route:

  1. Even in a 1-person team, re-using the same database means any glitch or bug that you create during development spreads to production. Nobody thinks they'd ever accidentally nuke production until it happens to them.

  2. Development traffic and experiments hit the same database your real users rely on, so it's entirely possible to degrade production performance, or even DDOS yourself, while just testing a feature.

How can generated data solve those problems?

The combination of programmatically generated data with a locally running database prevents those problems from causing significant issues: even if you do nuke the database or DDOS yourself, it's trivial to refresh your development environment and re-generate the data you need.

By reducing the number of external dependencies during development, we also increase system consistency, which makes debugging and isolating issues much easier.

But the additional value gained from data generation will depend on two major factors:

  1. The quantity of data generated, which determines how realistic the dataset is and which issues will or won't surface during development

  2. The quality of the generated data, i.e. how closely its values and inter-relationships resemble what real usage would produce

So how should we approach the problem?

So in order to have a setup that maximizes your chances of catching bugs and testing the real quality of the software being developed, the data powering the application and its usage must follow these rules:

  1. The data should be generated on the developer's computer and stored in a locally running database. This prevents any individual developer's bugs and problems from rippling out to other developers' machines.

Our Requirements

  1. Node & NPM

  2. Docker & Docker Compose

Foundations

For this article, we will create a simple React application that renders a list of employees. Specifically, it will display:

  1. The employee's name

  2. Their job title

  3. Their department

  4. The date they joined the company

Then, since the front-end needs to display the list of employees, the API will return an array of objects, one per employee stored in the DB, each with the properties above.
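To make that concrete, a response from the endpoint might look something like this (the values below are placeholders, not from the source repo):

{
  "employees": [
    {
      "name": "Jane Doe",
      "title": "Senior Software Engineer",
      "department": "IT",
      "joined": "2019-03-14T10:22:31.000Z"
    }
  ]
}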

Since we're focusing on the database side, we'll breeze through the rest of the application, but in a future article, I'll be diving deeper into how to get complex orchestrations with Docker Compose.

Client

For our front-end client, we’ll be utilizing React to display the data stored on the MongoDB database. The client will make requests using Axios to the Node API.

To get started, we will utilize Create React App to set up the baseline application that we’ll make a few changes to.

You can create a CRA application with the following command from the project root:

npx create-react-app client

Client Dependencies

Then, we will have to download the dependencies that we’ll need for the React application. For our purposes, we’re only going to need Axios and Material-UI.

You can download them both with the following command (make sure you're in the client directory, not the project root):

npm i axios @material-ui/core --save

Getting Started On The Client

For our purposes, we will only be making changes to the App.js file, which in the starter project is the main component that displays content.

This is what that file should look like at the start:

import React from 'react';
import logo from './logo.svg';
import './App.css';

function App() {
  return (
    <div className="App">
      <header className="App-header">
        <img src={logo} className="App-logo" alt="logo" />
        <p>
          Edit <code>src/App.js</code> and save to reload.
        </p>
        <a
          className="App-link"
          href="https://reactjs.org"
          target="_blank"
          rel="noopener noreferrer"
        >
          Learn React
        </a>
      </header>
    </div>
  );
}

export default App;

The changes we will make to this file are in order:

  1. Remove all of the children of the header tag

  2. Add a useEmployees hook that fetches the employee list from the API with Axios

  3. Render each employee as a Material-UI Card inside a Grid

After the three steps, your App.js file will look something like this:

import React, { useState, useEffect } from "react";
import { Card, Grid, Typography, makeStyles } from "@material-ui/core";
import axios from "axios";
import "./App.css";

const useEmployees = () => {
  const [employees, setEmployees] = useState([]);

  useEffect(() => {
    const handleAPI = async () => {
      const { data } = await axios.get("/api/employees");
      const newEmployees = data.employees || [];
      setEmployees(newEmployees);
    };

    handleAPI();
  }, []);
  return employees;
};

const useStyles = makeStyles((theme) => ({
  card: {
    padding: theme.spacing(5),
  },
}));

function App() {
  const employees = useEmployees();
  const classes = useStyles();
  return (
    <div className="App">
      <header className="App-header">
        <Grid container direction="column" spacing={2} alignItems="center">
          {employees.map((value, index) => {
            const { name, title, department, joined } = value;
            const key = `${name}-${index}`;
            return (
              <Grid item key={key}>
                <Card raised className={classes.card}>
                  <Typography variant="h4">{name}</Typography>
                  <Typography variant="subtitle1" align="center">
                    {title} • {department}
                  </Typography>
                  <Typography variant="body1">
                    {name} has been at the company since {joined}
                  </Typography>
                </Card>
              </Grid>
            );
          })}
        </Grid>
      </header>
    </div>
  );
}

export default App;

The styling and components we used will result in the cards looking like this (note that the black background is the CRA default background, not part of the card):

You’ll be able to see it for yourself once we have wired up the API and implemented the data generation.

Client Dockerfile

The last step we need to finish the client-side portion of our small application is to create the Dockerfile.dev file that will be utilized by Docker Compose to run the React application.

Here it is. We just have to install the necessary dependencies into the image and then run the development server as normal:

FROM node:10-alpine
WORKDIR /app
COPY package.json .
RUN npm update
RUN NODE_ENV=development npm install
COPY . .
CMD ["npm", "run", "start"]

API

On the API, we’ll have a single unauthenticated route named /employees which will return an array of objects containing the properties we defined above.

The folder structure for the API will ultimately end up looking like this:

api/
  node_modules/
  src/
    models/
      Employee.js
    index.js
  Dockerfile.dev
  package-lock.json
  package.json

The Employee.js model will contain a simple Mongoose model which we'll use to interface with the database when querying for the list of employees.

API Dependencies

Then, we will have to download the necessary dependencies to quickly make a web server and integrate it with a MongoDB server. Specifically, we’ll utilize Express, Mongoose, and Nodemon.

The first two we’ll download as regular dependencies with the following command (make sure you’re in the api directory and not in the project root):

npm i express mongoose --save

Then we'll install nodemon as a development dependency:

npm i nodemon --save-dev

Once you have your dependencies downloaded, make sure to add the nodemon prefix to your npm start script. The "start" script in your package.json should look like this:

"start": "nodemon src/index.js"

Getting Started On The API

First, let's build out the Employee Mongoose model. In the Employee.js file in the models folder, the model can be created like this:

Employee.js

const mongoose = require("mongoose");
const { Schema } = mongoose;

const EmployeeSchema = new Schema({
  name: String,
  title: String,
  department: String,
  joined: Date,
});

const Employee = mongoose.model("employee", EmployeeSchema);

module.exports = Employee;

The mongoose.model call registers the schema with Mongoose, making the model available anywhere, as long as we require the file in our index.js.

Then, in our index.js file, we require the Employee model, create a basic Express server, and define our single route: GET /employees.

index.js

const express = require("express");
const mongoose = require("mongoose");
require("./models/Employee");
const Employee = mongoose.model("employee");
const PORT = process.env.PORT || 8080;
const MONGO_URI = process.env.MONGO_URI || "";
const app = express();

app.get("/employees", async (req, res) => {
  const employees = await Employee.find();
  res.status(200).send({ employees });
});

mongoose.connect(MONGO_URI, {
  useNewUrlParser: true,
  useUnifiedTopology: true,
  useFindAndModify: true,
});

app.listen(PORT, () => {
  console.log(`MONGO URI ${MONGO_URI}`);
  console.log(`API listening on port ${PORT}`);
});

API Dockerfile

The API Dockerfile will look exactly the same as the client Dockerfile, since we've set up the package.json start script to abstract away what the API needs to run.

Dockerfile.dev

FROM node:10-alpine
WORKDIR /app

COPY ./package.json ./
RUN npm update
RUN NODE_ENV=development npm install

COPY . .
CMD ["npm", "run", "start"]

NGINX

From the project root, create a folder named nginx, which will contain the configuration of an NGINX server that routes requests either to the React application or to the Node.js API.

The following is the NGINX configuration, which you should name nginx.conf. It defines the upstream servers for the client and the API.

nginx.conf

upstream client {
  server client:3000;
}

upstream api {
  server api:8080;
}

server {
  listen 80;

  location / {
    proxy_pass http://client;
  }

  location /sockjs-node {
    proxy_pass http://client;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "Upgrade";
  }

  location /api {
    rewrite /api/(.*) /$1 break;
    proxy_pass http://api;
  }
}

The blocks for sockjs-node are there to allow for the WebSocket connection that CRA utilizes during development.
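Once the full stack is running (the 3050:80 port mapping comes from the Docker Compose file below), you can sanity-check the /api rewrite with a quick request:

curl http://localhost:3050/api/employees
# NGINX strips the /api prefix and proxies the request to http://api:8080/employees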

We also need to create a Dockerfile for the NGINX server that uses our config file to override the default. Make sure to create the Dockerfile in the same folder as the config file.

Dockerfile.dev

FROM nginx
COPY ./nginx.conf /etc/nginx/conf.d/default.conf

Docker Compose

We won’t be going too deeply into how Compose works in this article, but suffice it to say that it ties together the individual containers we defined above.

docker-compose.yml

version: "3"
services:
  client:
    build:
      context: "./client"
      dockerfile: "Dockerfile.dev"
    stdin_open: true # fixes the auto exit issue: https://github.com/facebook/create-react-app/issues/8688
    volumes:
      - ./client/src:/app/src
  api:
    build:
      context: "./api"
      dockerfile: "Dockerfile.dev"
    volumes:
      - ./api/src:/app/src
    environment:
      - MONGO_URI=mongodb://mongo:27017
  nginx:
    restart: always
    depends_on:
      - api
      - client
    build:
      context: ./nginx
      dockerfile: Dockerfile.dev
    ports:
      - "3050:80"
  mongo:
    image: "mongo:latest"
    ports:
      - "27017:27017"
  dbseed:
    build:
      context: ./mongo
      dockerfile: Dockerfile.dev
    links:
      - mongo

Towards the bottom of the docker-compose.yml file, you'll see the services for the MongoDB database and the container that will seed it.

Now that we've finished defining the foundations of the application, we will move on to creating the mongo directory, where we will define the Dockerfile for the dbseed service and the scripts for generating data.

Data Generation

Now that we have all of the infrastructure set up, let's move on to seeding our MongoDB database with dynamically generated data.

Before defining the database-seeding container, we'll first focus on the actual generation of the development data.

The folder structure of the data generation script and DB seeding container will match the following:

mongo/
  node_modules/
  scripts/
    index.js
    employees.js
  Dockerfile.dev
  init.sh
  package.json
  package-lock.json

The scripts will output an array of JSON objects, which will be imported into the database. Then the bash file, init.sh, will handle importing the generated data into the running database.
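For illustration, employeedata.json will contain a plain top-level array, which is exactly what mongoimport's --jsonArray flag expects. A generated file might look something like this (your values will differ, since faker randomizes them):

[
  {
    "name": "Jane Doe",
    "title": "Senior Software Engineer",
    "department": "IT",
    "joined": "2019-03-14T10:22:31.000Z"
  }
]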

Data Gen Dependencies

As part of the data generation scripts, we only utilize two NPM libraries: faker and yargs. They can be downloaded by running the following command:

npm i faker yargs --save

These libraries make it incredibly simple to generate fake data and to handle CLI arguments in JS files, respectively.

Our Data Generation Scripts

We will have two main files for the data generation: an index.js file, which will serve as our point of contact for data generation, and employees.js, which will hold all of the data-generation functions needed for employees.

index.js

const yargs = require("yargs");
const fs = require("fs");
const { generateEmployees } = require("./employees");

const argv = yargs
  .command("amount", "Decides the number of employees to generate", {
    amount: {
      description: "The amount to generate",
      alias: "a",
      type: "number",
    },
  })
  .help()
  .alias("help", "h").argv;

if (argv.hasOwnProperty("amount")) {
  const amount = argv.amount;
  const employees = generateEmployees(amount);

  const jsonObj = JSON.stringify(employees);
  fs.writeFileSync("employeedata.json", jsonObj);
}
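You can try the generation script locally before wiring it into Docker (the amount value here is arbitrary):

node ./scripts/index.js --amount 25
# writes employeedata.json to the current working directory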

employees.js

const faker = require("faker");
// localuser.json is an optional, locally-defined file, covered in the
// "Predefined Data Alongside Faked Data" section below
const localuser = require("../localuser.json");

const generateEmployees = (amount) => {
  const employees = [];
  for (let x = 0; x < amount; x++) {
    employees.push(createEmployee());
  }
  employees.push(createEmployee(localuser));
  return employees;
};

const createEmployee = (user) => {
  const companyDepartments = [
    "Marketing",
    "Finance",
    "Operations Management",
    "Human Resources",
    "IT",
  ];
  const employeeDepartment =
    companyDepartments[Math.floor(Math.random() * companyDepartments.length)];
  const employee = {
    name: faker.name.findName(),
    title: faker.name.jobTitle(),
    department: employeeDepartment,
    joined: faker.date.past(),
    ...user,
  };
  return employee;
};

module.exports = {
  generateEmployees,
};

This script is then called by the aforementioned init.sh, a simple bash file that runs the mongoimport CLI command.

init.sh

#!/bin/bash
mongoimport --collection employees --file employeedata.json --jsonArray --uri "mongodb://mongo:27017"

Database Seeding Container

Now that we’ve defined the scripts to generate and import the data, we can define the Dockerfile that will be utilized by Docker Compose.

Specifically, we will utilize a multi-stage build: the first stage generates the data, and the second copies that data into a mongo image, which then executes the init.sh bash script.

Dockerfile.dev

# Stage 1: generate the seed data with the Node scripts
FROM node:10-alpine as generator
WORKDIR /data
COPY . .
RUN npm install
RUN node ./scripts/index.js --amount 10

# Stage 2: copy the generated data into a mongo image and import it
FROM mongo:latest

COPY . .
COPY --from=generator ./data/ .
RUN ["chmod", "+x", "init.sh"]
CMD ./init.sh
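With the seeding container defined, bringing everything up is a single command from the project root:

docker-compose up --build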

Things To Keep In Mind When Generating Data

When generating development data for a MongoDB database there are three primary concerns that must be considered:

  1. DB import method (for our case, mongoimport vs mongorestore)

  2. Predefined data alongside faked data

  3. Relationships between the generated collections

Within this article, we only have to act on the first one, but we will cover and discuss the other two as well.

Importing Data Into MongoDB

There are two major methods to import data into a running MongoDB database via the CLI: mongoimport and mongorestore. The primary difference between the two is the data types they work with and the metadata they preserve.

Specifically, mongorestore only works with BSON data, which allows it to run faster and preserve the metadata BSON provides.

This is possible because, unlike mongoimport, mongorestore doesn't have to convert the data from JSON into BSON.

That conversion process doesn't guarantee that the rich data types provided by BSON are maintained in the import, which is why mongoimport isn't recommended for production systems.

Why not go with mongorestore?

Mongorestore is:

  1. Faster than mongoimport

  2. Able to preserve the rich data types and metadata that BSON provides
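For reference, mongorestore works on BSON dumps produced by mongodump; a minimal round trip looks roughly like this (the URIs shown match this article's compose setup and are assumptions, adjust to your environment):

mongodump --uri "mongodb://mongo:27017" --out ./dump
mongorestore --uri "mongodb://mongo:27017" ./dump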

But the reason I'd advise utilizing mongoimport for development data instead is the simplicity it provides.

Due to the flexibility of the data it can receive, mongoimport is significantly easier to use than mongorestore. Unlike its faster alternative, mongoimport can directly import both JSON and CSV.

This allows us to write a simple script that generates an array of JSON, which can then be easily imported like so:

mongoimport --collection employees --file employeedata.json --jsonArray --uri "mongodb://mongo:27017"
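And since mongoimport also handles CSV, a comparable command for a CSV file (assuming a header row and a hypothetical employeedata.csv) would look like this:

mongoimport --collection employees --type csv --headerline --file employeedata.csv --uri "mongodb://mongo:27017"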

Predefined Data Alongside Faked Data

There may be times when the generated data used for development should be related to developer-dependent information.

For example, the developer has a specific login (username and userId) and the generated data is user-specific.

Hence, in order for the developer to have data generated for their specific account, there should be an optional JSON file that is only defined locally.

We can achieve this by creating a JSON file in the same folder as the data generation scripts. For example:

localuser.json

{
  "_id": "<Unique Identifier>",
  "name": "<User Name>"
}

This can then be imported and used by the general data-generation script like so:

const faker = require("faker");
const localuser = require("../localuser.json");

const generateEmployees = (amount) => {
  const employees = [];
  for (let x = 0; x < amount; x++) {
    employees.push(createEmployee());
  }
  employees.push(createEmployee(localuser.name));
  return employees;
};

const createEmployee = (name) => {
  const companyDepartments = [
    "Marketing",
    "Finance",
    "Operations Management",
    "Human Resources",
    "IT",
  ];
  const employeeDepartment =
    companyDepartments[Math.floor(Math.random() * companyDepartments.length)];
  const employee = {
    name: name ? name : faker.name.findName(),
    title: faker.name.jobTitle(),
    department: employeeDepartment,
    joined: faker.date.past(),
  };

  return employee;
};

module.exports = {
  generateEmployees,
};

module.exports = {
generateEmployees,
};

Here you can see how we can import the localuser and then create an employee based on the provided data.

In this situation, we could also use destructuring to provide an easier way to override the generated data with an arbitrary number of properties. Like this:

const createEmployee = (user) => {
  const companyDepartments = [
    "Marketing",
    "Finance",
    "Operations Management",
    "Human Resources",
    "IT",
  ];
  const employeeDepartment =
    companyDepartments[Math.floor(Math.random() * companyDepartments.length)];
  const employee = {
    name: faker.name.findName(),
    title: faker.name.jobTitle(),
    department: employeeDepartment,
    joined: faker.date.past(),
    ...user,
  };

  return employee;
};

But do note that the JSON keys must match the properties defined in the employee object. So to override the title and name properties, localuser.json must look like this:

{
  "name": "Jane Doe",
  "title": "Senior Software Engineer"
}

Let's say that the company all of our employees are a part of gives each employee a computer. In that case, we would want to keep track of each computer the company owns and the employee who currently has it.

Its schema would look a bit like this (forgive the overly simplistic example):

{
  computerName: String,
  employeeName: String
}

Hence, if we wanted to generate data for the computers the company owns, we would have to utilize the names of the employees we generated.

Note that this inter-collection example uses a computer schema that isn't how it would actually be done in real life; it would probably make more sense as an embedded document within an employee document. It's used here purely for simplicity's sake.

We can achieve this by simply passing the array of generated employees down to the function that generates the computers.

This would look roughly like this:

const yargs = require("yargs");
const fs = require("fs");
const { generateEmployees } = require("./employees");
const { generateComputers } = require("./computers");

const argv = yargs
  .command("amount", "Decides the number of employees to generate", {
    amount: {
      description: "The amount to generate",
      alias: "a",
      type: "number",
    },
  })
  .help()
  .alias("help", "h").argv;

if (argv.hasOwnProperty("amount")) {
  const amount = argv.amount;
  const employees = generateEmployees(amount);
  const computers = generateComputers(amount, employees);

  const jsonObj = JSON.stringify(employees);
  fs.writeFileSync("employeedata.json", jsonObj);
  const computerObj = JSON.stringify(computers);
  fs.writeFileSync("computerdata.json", computerObj);
}

Here, generateComputers is a function similar to generateEmployees, but it takes an extra parameter holding the data that belongs to a separate collection.
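The article doesn't include generateComputers itself; a minimal sketch of a hypothetical computers.js, assuming the simplistic computer schema above, might look like this:

const faker = require("faker");

const generateComputers = (amount, employees) => {
  const computers = [];
  for (let x = 0; x < amount; x++) {
    // assign each computer to a randomly chosen generated employee
    const employee = employees[Math.floor(Math.random() * employees.length)];
    computers.push({
      computerName: faker.random.alphaNumeric(10), // arbitrary asset identifier
      employeeName: employee.name,
    });
  }
  return computers;
};

module.exports = {
  generateComputers,
};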

Conclusion

Congrats!! Now everything you need has been hooked together and the data you need should be in the database.

You can go to localhost:3050, and you should see something like this:

All of the names, titles, departments, etc. (except for the one specified in localuser.json) will be randomly generated.

The final big-picture folder structure of the application should look kinda like this:

api/
client/
mongo/
nginx/
docker-compose.yml

You can check out the GitHub repository to double-check against your version if you're having any issues.

Future Steps

  • Integrate the scripts with Typescript types used by the Mongoose Schema

You can find the original article here.

Valencian Digital helps businesses and startups stay ahead of the curve through cutting-edge technology that integrates seamlessly into their business workflows.
