Python — Generic Data Ingestion Framework

Published in

The Startup

5 min readDec 6, 2020

For any projects, be it software or data science related projects, there is always a need to load data from files into databases, which can be relational databases or big data storages like Hive or Impala.

Flat files can come in different formats, e.g. command separated (CSV), tab delimited or in binary format like Excel. Also there will be different types of data you need to load, e.g. account, customer, sales, etc and each one of them will have different fields/columns and data types that you need to handle differently during data loading.

In this article I will show you a simple way to have a generic data loading framework by using sqlalchemy and pandas. I am using Excel and Postgresql as examples but it can be easily extended to other databases like MySQL, Hive/Impala, etc as long as the file types are supported by pandas and the databases are supported by sqlalchemy.

Docker Setup

Firstly, let’s set up postgresql database with a table for ingestion purpose.

docker-compose.yml

version: '3'
services:
  pgsql:
    image: "postgres" # use latest official postgres version
    env_file:
      - db.env # configure postgres
    volumes:
      - database-data:/var/lib/postgresql/data/ # persist data even if container shuts down
      - ./sql/init-pgsql.sql:/docker-entrypoint-initdb.d/init.sql
    ports:
      - "5432:5432"
volumes:
  database-data: # named volumes can be managed easier using docker-compose

Python — Generic Data Ingestion Framework

Docker Setup

Written by alpha2phi