How we migrated data from PostgreSQL to MongoDB

Photo by Barth Bailey on Unsplash

by Vladimir Stolyarov

At Containerum we always strive to minimize the number of technologies for simpler maintenance and increased reliability. For example, in the beginning we used only one relational database for storing data, and it was PostgreSQL.

But it turned out, storing data like this is not always convenient. For instance, one of Containerum Platform services, resource, which manages the resources within Kubernetes namespaces, operates highly nested datasets. In case of a relational database it leads to having too many tables, complicated queries, slower performance, etc.

It is way more convenient to store such data in document-oriented databases which were designed for such purposes. They allow storing objects without introducing additional entities, thus allowing for faster and simpler development process. There are quite a lot of document-oriented databases, but we opted for MongoDB since one of our developers had already had some experience with it. Besides, we got much simpler replication than what we had in PostgreSQL, which is a cool bonus for a project that works entirely on top of Kubernetes.

The first question was how to migrate the data we already had to MongoDB. There are solutions available like FME by Safe. However, we only needed to do migration once, so we decided that buying a license for one-time procedure was not the best option. The solution we finally implemented was inspired by How to migrate PostgreSQL to MongoDB article here on medium. It proposes the following scheme: SQL query through psql-> CSV file-> import via mongoimport

In our case, however, the scheme could barely be implemented, as CSV doesn’t allow operating hierarchical data structures. But it got me thinking ‘What if we export the data from PostgreSQL directly to JSON and then import it to MongoDB?’. So this was the plan: write SQL queries returning JSON objects (or an array of objects) that will be saved in a file and then imported via mongoimport.

To print the result of an SQL queries as a json object/massive, two functions are necessary: row_to_json and array_to_jsoncorrespondingly.

To print a json object, create the following query:

SELECT row_to_json(t) FROM (<primary query>) t;

The keys in the json object are column names, the values are drawn from the line of the column. E.g.:

SELECT row_to_json(t) FROM (VALUES (‘a’, ‘b’)) t(c1, c2);

This query via psql returns as follows:

row_to_json   
---------------------
{"c1":"a","c2":"b"}
(1 row)

JSON arrays are created likewise:

SELECT array_to_json(array_agg(t)) FROM (<primary query>) t;

This query will return an object array. The keys here are also column names, the values are drawn from the line for each column. The functionarray_agg is also used to transform a set of strings into a postgresql array, since thearray_to_json input expects this data type.

For instance,

SELECT array_to_json(array_agg(t)) FROM (VALUES (‘a’, ‘b’), (‘c’, ‘d’)) t(c1, c2);

returns

array_to_json       
-------------------------------------------
[{"c1":"a","c2":"b"},{"c1":"c","c2":"d"}]
(1 row)

To create a value array the query should be modified a little:

SELECT array_to_json(array_agg(t.column)) FROM (<primary query>) t(…, column, …);

For instance, the following query:

SELECT array_to_json(array_agg(t.c1)) FROM (VALUES (1), (2)) t(c1);

will return:

array_to_json
---------------
[1,2]
(1 row)

So, we know how to get json objects and arrays. Now we should learn how to nest json objects to json objects. To do this, we will use subqueries. A subquery should also return a json object or an array. For instance, the following query:

SELECT row_to_json(t)
FROM (
VALUES (
'some_value',
(
SELECT array_to_json(array_agg(t1))
FROM (
VALUES ('a', 'b'), ('c', 'd')
) t1(k1, k2)
),
42
)
) t(c1, c2, answer);

will return:

row_to_json                 
--------------------------------------------------------------------------------
{"c1":"some_value","c2":[{"k1":"a","k2":"b"},{"k1":"c","k2":"d"}],"answer":42}
(1 row)

Now we know how to get json objects with the desired structure. Now let’s check the data in the tables. Previously, our resource service operated the database with the following structure (I included only those tables that we will use to create a query):

For the ‘new’ version of the service, we need three types of objects: deployment, service and ingress.

Let’s see how we create a query for deployment object. What we want to get is a json that looks like this:

{
"_id": "36f4da21-5229-485e-a156-3f18c564399d",
"deployment": {
"status": null,
"owner": "0f4de816-043f-4d2e-a3a8-d45cb1f3394d",
"containers": [
{
"image": "nginx",
"name": "first",
"limits": {
"cpu": 100,
"memory": 10
},
"env": [
{
"value": "world",
"name": "hello"
}
],
"commands": [],
"ports": [
{
"name": "myport",
"port": 80,
"protocol": "TCP"
}
],
"volumemounts": [],
"configmaps": [
{
"name": "myconfig",
"mode": null,
"mountpath": "/",
"subpath": null
}
]
}
],
"name": "nginx2234-1-444444-555555-1",
"replicas": 1,
"totalcpu": 100,
"totalmemory": 10
},
"deleted": false,
"namespaceid": "88468eaa-f729-4293-863a-1cc6b9f866eb"
}

On the top level we have 4 fields, 3 of them being base type. By base type we mean number (integer and float), string, null. Let’s create a query:

SELECT array_to_json(array_agg(t))
FROM (
SELECT
depl.id AS _id,
depl.ns_id AS namespaceid,
depl.deleted,
(...) AS deployment,
FROM deployments depl
) t;

Now form an object for deployment field. Here we will use the query method we already know, and will change the initial query a little:

SELECT array_to_json(array_agg(t))
FROM (
SELECT
depl.id AS _id,
depl.ns_id AS namespaceid,
depl.deleted,
(
SELECT row_to_json(kube_depl)
FROM (
SELECT
depl_data.name,
depl_data.replicas,
ns.owner_user_id AS owner,
NOT depl.deleted AS active,
...
FROM(VALUES(
depl.name,
depl.replicas
)) AS depl_data(
name,
replicas
)
) AS deployment,
FROM deployments depl
JOIN namespaces ns ON ns.id = depl.ns_id
) t;

Here we merged it with namespaces table, as we will need the data it contains. We will also do SELECT for deployment field from an artificially created table (VALUES) that has two columns and one line. Thus we avoid doing SELECT twice from the real table (we kept getting an error doing that). It is clear that the subquery allows accessing query IDs from a higher level. We only have to add subquery to generate the missing fields:

Here we have added LEFT JOIN with containers table. We use LEFT JOIN, because in some cases the database contained deployments with no containers. Technically, this is not how it should be, but the mistakes we did at the early stages of development played their role. We solved the issue with ‘missing’ containers by synchronization with Kubernetes. Since one of the subqueries uses aggregation, GROUP BY should contain all columns from the tables after FROM, which were requested between SELECT and FROM. In this case, there are 6 columns in GROUP BY.

The resulting query generates a correct JSON. However, when we launch it via psql we get extra lines (table name, separators and the number of lines in a reply). We can get rid of it with COPY( … ) TO STDOUT.

Now we can just run:

psql -U <user> -h <host> -p <port> -f generate-mongo-deploys.sql > deploys.json

It will connect to the base, perform a query from the file and save the result to deployments.json. To import the data to MongoDB we used mongoimport which was launched as follows:

mongoimport --uri mongodb://<user>:<password>@<host>/deployments --jsonArray deploys.json

Before importing we launched a new version of resource, which created all the necessary indexes.

As the system had been launched just recently, there were very little entries (under 10,000), and import/export took very little time. mongoimport has a limit of max. 16Mb per import file, so it didn’t pose a problem for us. This method can also be applied to larger data with some optimization. First, copy the data to the file located on a server with COPY( … ) TO /path/to/file.json to prevent transferring data to clients, as the file will be generated on the server. Second, you can try splitting the data into several files with LIMIT-OFFSET, and, probably, optimize the query. Or you can output the data not as a JSON array, but line-by-line, changing array_to_json(array_agg(t)) to row_to_json(t) at the beginning of the query to remove file size limit in mongoimport.

Sure, we didn’t get the things right the first time — first import attempts failed. We had to fix the query (the article features the final version) and MongoDB indexes several times.

The benefit of this method is that it only uses the standard tools for PostgreSQL and MongoDB. And it taught us something about writing SQL queries. As of disadvantages, this data migration method is intended for one-time use and has not been tested on larger data volumes. If you have any thoughts on that, please share.

Thanks for reading! Feel free to leave feedback and ask questions. Don’t forget to follow us on Twitter and join our Telegram chat to stay tuned!

→ This article is part of a series by our engineers who share their stories about how we built Containerum. You can support the project by checking out our GitHub repo. Submit issues, leave feedback or just support us by giving a ⭐. Your support matters to us!