4 Pitfalls when importing MongoDB into Redshift

We store our production data in MongoDB. For batch processing and data science, we load the Mongo data into Redshift.

Doing so is actually surprisingly simple. Still, we encountered a couple of pitfalls worth sharing.

1. Importing escaped commas

We instruct mongoexport to export data as CSV files:

mongoexport -d db -c users --csv --fields name,title > users.csv

Mongoexport automatically wraps values that themselves contain the delimiter (a comma) in double quotes, e.g.:

name,title
Joe,"Coder, Dreamer"

To read this data, Redshift must be instructed to reverse the escaping:

copy "users" from 's3://my-bucket/users.csv'
credentials 'aws_iam_role=arn:aws:iam::...'
region 'us-west-1'
delimiter ','
removequotes
;

2. Importing datetime fields

Mongoexport exports datetime fields as ISO 8601 timestamps (e.g. 2016-12-06T07:47:15Z), which Redshift's COPY can't parse with its default time format.

The best solution is importing the datetime field as a varchar:

CREATE TABLE "users" (
  bday_raw varchar(20)
);

Afterwards, you can define a new field and let Redshift parse the raw datetime field:

ALTER TABLE "users" ADD COLUMN bday TIMESTAMP DEFAULT NULL;
UPDATE "users" SET bday = CAST(bday_raw as timestamp) WHERE bday_raw <> '';

Important: Redshift imports missing values as empty strings, which of course can't be converted to dates. Hence, you must filter out these cases with the WHERE bday_raw <> '' clause.

3. Removing CSV headers

Mongoexport adds a header row to each export, whereas Redshift expects data without headers. This leads to cryptic errors about invalid data, because the header field names might be too long for the defined Redshift columns.

Removing the header row is trivial:

tail -n +2 users.csv > users.no-header.csv
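
Before Redshift can load the file, it has to be uploaded to S3. With the AWS CLI (bucket name as assumed in the COPY example above), that is, for example:

aws s3 cp users.no-header.csv s3://my-bucket/users.csv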

4. Flattening structured data

MongoDB is all about easily storing structured documents. In contrast, Redshift requires flat data. Luckily, mongoexport can take care of most of the mapping.

For example, Mongo might hold a list of email addresses and information about all companies a user worked for:

{
  emails: [ "joe@home.com", "joe@work.com" ],
  companies: [
    { name: "My company 1" },
    { name: "My company 2" }
  ]
}

To export the first two email addresses and the name of the first company:

mongoexport -d db -c users --csv --fields emails.0,emails.1,companies.0.name

Need fancier mapping? For that, we added a transformation step between mongoexport and the Redshift load: a simple Node.js application streams through the exported CSV files and manipulates them accordingly.
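
A minimal sketch of such a step, using only Node.js core modules. The file names and the lower-casing of the name column are placeholder assumptions for illustration, not our actual transformation:

const fs = require('fs');
const readline = require('readline');

const output = fs.createWriteStream('users.transformed.csv');
const lines = readline.createInterface({
  input: fs.createReadStream('users.csv'),
  crlfDelay: Infinity
});

let isHeader = true;
lines.on('line', (line) => {
  if (isHeader) { isHeader = false; return; } // drop the CSV header row
  const i = line.indexOf(',');
  if (i === -1) { output.write(line + '\n'); return; } // pass odd lines through
  // Placeholder transformation: lower-case the first column (name) and keep
  // the rest of the record as-is. Note that this naive split breaks on quoted
  // values containing commas; a real pipeline should use a proper CSV parser.
  output.write(line.slice(0, i).toLowerCase() + line.slice(i) + '\n');
});
lines.on('close', () => output.end());

Because it operates line by line on streams, this approach handles exports of any size without loading them into memory.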

Happy coding!

Want to learn more about AWS? Have a look at our other articles.