How To Load Multiple Files Into Your Database With Python And SQL

SeattleDataGuy
Feb 24, 2019 · 2 min read

by Ben Rogojan


Loading files into databases is not exactly exciting work. It can be tedious and boring, and it is often constrained by tools like SSIS that freak out when a small change is made.

This is one of the core roles of BI and data engineering teams. They need to load files from operational databases into data warehouses. They do this by developing pipelines that process and transform data.

One problem we see is that teams have multiple file types, data spec versions, and data providers, each of which requires a different pipeline just to load the data into their raw database.

This forces teams to manage 10, 20, maybe even 50 smaller pipelines that are all subject to changes from the data providers. In addition, each new data provider requires an entirely new pipeline to be developed, along with infrastructure around each pipeline to log and track pipeline statuses. This can quickly become very difficult to manage and tedious to build.

Another solution, instead of attempting to manage multiple pipelines, is to develop a central system that manages the various insert statements required to insert data into each different raw table, roughly the approach we explain in the video below.

The system essentially acts as a lookup that matches flat files to their insert queries. This removes the need for redundant separate pipelines that would each need their own insert statement.
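As a rough Python sketch of that lookup idea, imagine a single mapping from file type to insert statement. The table names, columns, and the use of SQLite here are purely illustrative assumptions, not the actual schema from the video:

    import csv
    import sqlite3

    # Map each incoming file type to the insert statement for its raw table.
    # Adding a new provider means adding an entry here, not a new pipeline.
    INSERT_LOOKUP = {
        "orders": "INSERT INTO raw_orders (order_id, customer_id, amount) VALUES (?, ?, ?)",
        "customers": "INSERT INTO raw_customers (customer_id, name, email) VALUES (?, ?, ?)",
    }

    def load_file(conn, file_type, path):
        """Load one flat file into its raw table using the matching insert query."""
        insert_sql = INSERT_LOOKUP[file_type]      # look up the query for this file type
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader)                           # skip the header row
            conn.executemany(insert_sql, reader)   # one bulk insert per file
        conn.commit()

    conn = sqlite3.connect("warehouse.db")
    load_file(conn, "orders", "orders_2019_02.csv")

The point is that every file goes through the same loader function; only the lookup entry changes per file type.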

Instead, one central system now loads all the various tables. Theoretically, you can also put version numbers on each of the files that come in. This could be set up by tracking each version in a metadata table, which would then allow you to tie a raw file to a specific version of the insert statement.
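Here is one hedged sketch of what that version tracking could look like, again in Python with SQLite. The metadata table, its columns, and the versioned lookup keys are hypothetical names chosen for illustration:

    import sqlite3
    from datetime import datetime

    conn = sqlite3.connect("warehouse.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS file_load_metadata (
            file_name    TEXT,
            file_type    TEXT,
            spec_version TEXT,
            loaded_at    TEXT,
            status       TEXT
        )
    """)

    # Insert statements keyed by (file_type, spec_version), so a provider can
    # change its data spec without a new pipeline -- just a new lookup entry.
    VERSIONED_INSERTS = {
        ("orders", "v1"): "INSERT INTO raw_orders (order_id, amount) VALUES (?, ?)",
        ("orders", "v2"): "INSERT INTO raw_orders (order_id, amount, currency) VALUES (?, ?, ?)",
    }

    def record_load(conn, file_name, file_type, spec_version, status):
        """Log a raw file against the insert version that loaded it."""
        conn.execute(
            "INSERT INTO file_load_metadata VALUES (?, ?, ?, ?, ?)",
            (file_name, file_type, spec_version, datetime.utcnow().isoformat(), status),
        )
        conn.commit()

A metadata table like this also gives you the logging and status tracking mentioned earlier without building separate infrastructure around each pipeline.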

Loading multiple file types through individual pipelines requires a lot of tedious development. Whether it is through SSIS or another system, the constant development quickly weighs on data engineers. Instead of spending time developing analytical pipelines that drive clear value for your company, they will be stuck maintaining and developing operational pipelines.

This reduces the team's impact and overall value.

Thank you so much for reading! We hope you enjoyed the video. Please feel free to send us any questions. We would love to help you design your next automation system.

Also, if you liked this, feel free to check out our video on using bulk insert!
