Setting Up a Data Lab Environment — Part 5 — Databases

[Post is also available at quaintitative.com]

When we work in Jupyter notebooks, we can save the output to a range of local file formats, from CSV to JSON to HDF5. But there are instances where it makes more sense to save things to a database, be it a SQL database like Postgres, or a NoSQL database like MongoDB.

Usually, when we set up a database on a server, we have to get it running at some host and port, and then make a connection to the database.

So for MongoDB, we would usually first install and run MongoDB, then install a library like pymongo, and then make a connection to the host and port of the MongoDB instance.
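For reference, the non-Docker version would look something like the sketch below (assuming a MongoDB instance running locally on the default port 27017; the database and collection names are just placeholders).

from pymongo import MongoClient

# Connect to a locally running MongoDB instance on the default port
client = MongoClient('localhost', 27017)

# Databases and collections are created lazily on first write
db = client['test_db']
db['test_collection'].insert_one({'name': 'test', 'value': 1})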

With Docker, it’s similar, but slightly easier. Having a database set up within Docker is as easy as pulling an image. Continuing from where we left off with the docker-compose.yml file in the previous part of this series, we just have to add a few lines right at the end to have access to MongoDB and Postgres databases from within the Docker setup.

version: '3'
services:
  jupyterone:
    build: docker/jupyter
    ports:
      - "8888:8888"
    volumes:
      - .:/home/jovyan/work
    env_file:
      - config/jupyter.env

  # The lines below are the new ones!
  this_mongo:
    image: mongo
    volumes:
      - mongo_data:/data/db
  this_postgres:
    image: postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:
  mongo_data:

And connecting to it from within the Jupyter container is super easy. Here are a few helper functions to make MongoDB play nice with pandas. We just have to connect to this_mongo, which is the service name we gave it in the docker-compose.yml file.

from pymongo import MongoClient
import pandas as pd

def get_mongo_database(db_name, host='this_mongo'):
    # 'this_mongo' is the service name in docker-compose.yml,
    # which Docker resolves to the MongoDB container
    conn = MongoClient(host)
    return conn[db_name]

def dataframe_to_mongo(df, db_name, collection):
    db = get_mongo_database(db_name)
    # 'records' means the dataframe is converted to a list of dicts,
    # one per row, which MongoDB stores as an array of documents
    entries = df.to_dict('records')
    db[collection].insert_many(entries)

def mongo_to_dataframe(db_name, collection, query={}, no_id=True):
    db = get_mongo_database(db_name)
    cursor = db[collection].find(query)
    df = pd.DataFrame(list(cursor))
    # Remove the mongo id that gets added on insert
    if no_id:
        del df['_id']

    return df
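A quick round trip then looks something like this (the database and collection names here are made up just for illustration):

import pandas as pd

# A tiny dataframe of made-up data
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1, 2, 3]})

# Write it to MongoDB, then read it back
dataframe_to_mongo(df, 'test_db', 'test_collection')
df_back = mongo_to_dataframe('test_db', 'test_collection')
print(df_back.head())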

And here’s how to connect to the Postgres database.

import psycopg2 as pg2
import psycopg2.extras as pgex

# 'this_postgres' is the service name in docker-compose.yml;
# the postgres image's default user and database are both 'postgres'
conn = pg2.connect(host='this_postgres',
                   user='postgres',
                   database='postgres')

# RealDictCursor returns each row as a dictionary keyed by column name
cur = conn.cursor(cursor_factory=pgex.RealDictCursor)

cur.execute("""
    BEGIN;
    CREATE TABLE jupyter_test(
        _id INTEGER,
        name TEXT,
        list DOUBLE PRECISION[],
        vector BYTEA
    );
    COMMIT;
""")

cur.execute("""
    BEGIN;
    INSERT INTO jupyter_test VALUES(1, 'Test1', '{1,2,3,4,5}');
    INSERT INTO jupyter_test VALUES(2, 'Test2', '{2,4,6,8,10}');
    COMMIT;
""")

cur.execute("""
    SELECT * FROM jupyter_test;
""")

result_raw = cur.fetchall()

print(result_raw[0]['name'])
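And since RealDictCursor returns each row as a dictionary, getting the results into pandas is just as painless. Here’s a small sketch (assuming pandas is imported as pd):

import pandas as pd

# A list of dict-like rows converts directly into a dataframe
df_pg = pd.DataFrame(result_raw)
print(df_pg)

# Close the cursor and connection when done
cur.close()
conn.close()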

Pretty simple, right?
