Python — Generic Data Ingestion Framework

alpha2phi
Published in The Startup
5 min read · Dec 6, 2020

For any project, be it software or data science related, there is always a need to load data from files into databases, which can be relational databases or big data stores like Hive or Impala.

Flat files can come in different formats, e.g. comma separated (CSV), tab delimited, or binary formats like Excel. There will also be different types of data to load, e.g. account, customer, sales, etc., and each of them will have different fields/columns and data types that you need to handle differently during loading.

In this article I will show you a simple way to build a generic data loading framework using SQLAlchemy and pandas. I am using Excel and PostgreSQL as examples, but it can easily be extended to other databases like MySQL or Hive/Impala, as long as the file types are supported by pandas and the databases by SQLAlchemy.
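To give a feel for the approach, here is a minimal sketch of such a loader: pandas parses the file and SQLAlchemy handles the database side. The function name, parameters, and connection URL are illustrative assumptions, not code from this article.

```python
import pandas as pd
from sqlalchemy import create_engine


def load_file(path, table, conn_url, reader=pd.read_excel):
    """Read a flat file with the given pandas reader and append it to `table`.

    `reader` is any pandas reader function (pd.read_excel, pd.read_csv, ...),
    so the same loader works across file formats. Returns the row count loaded.
    """
    df = reader(path)
    engine = create_engine(conn_url)
    # engine.begin() opens a transaction and commits it on success
    with engine.begin() as conn:
        df.to_sql(table, conn, if_exists="append", index=False)
    return len(df)
```

Swapping in `reader=pd.read_csv` handles CSV input, and a PostgreSQL URL such as `postgresql://user:password@localhost:5432/mydb` points the loader at the database we set up next.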

Docker Setup

First, let’s set up a PostgreSQL database with a table for ingestion purposes.

docker-compose.yml

version: '3'
services:
  pgsql:
    image: "postgres" # use latest official postgres version
    env_file:
      - db.env # configure postgres
    volumes:
      - database-data:/var/lib/postgresql/data/ # persist data even if container shuts down
      - ./sql/init-pgsql.sql:/docker-entrypoint-initdb.d/init.sql
    ports:
      - "5432:5432"
volumes:
  database-data: # named volumes can be managed easier using docker-compose
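The compose file references a `db.env` file that is not shown in this excerpt. As a hedged sketch, it could set the standard environment variables understood by the official postgres image (the values here are placeholders, and the database name is an assumption):

```
# db.env — placeholder credentials, change before use
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=ingestion
```

The mounted `./sql/init-pgsql.sql` script is run once by the image at first startup (anything under `/docker-entrypoint-initdb.d/` is), which is how the ingestion table gets created.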
