Mini-guide, Automation, Data Cleaning

Automate Data Cleaning & Combining of Files

ShengJun
Data Arena
8 min readApr 3, 2022

--

Data preparation takes around 80% of any data practitioner’s job (for most instances). Part of the work involves consolidating data across various data files into a single repository for data ingestion. Some of these processes could be automated and here, the notebook showcases several examples to go about it. Certainly, it would not be a “one-size-fits-all” approach, considering the existing data management policies, tools and work processes that one has to work with, but hopefully, it would help inspire the ways to go about it for your own projects.

Photo by Kaleidico on Unsplash

Scenario

We have several excel worksheets — the task is to automate the combining of the data across these sheets. We would want to select only several columns of interest as well as go along with the examples.

Part 1

Let’s start from a single file, work out the kinks and then think about how we might want to scale the implementation.

# load data of first file
df = pd.read_excel('./data_expt/2022_Jan_Norm.xlsx')
df.head(1)

Great! we have the first file loaded. We might want to consider keeping tabs on the number of records (i.e. rows as well as the number of the columns), and also some descriptive…

--

--

ShengJun
Data Arena

Engineer | Analyst | Matcha & Coffee lover | Iaidoka