In-Depth ETL in Machine Learning

Data extraction

We want to extract the data from the CSV file and make it usable for our experiments. To do so, we first create a small Python Data Manager class that parses the CSV, then extracts and formats the data relevant to our analysis.
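As a rough illustration of such a class, here is a minimal sketch using only the standard library. The class name `DataManager`, its `extract` method, the `numeric_columns` parameter, and the sample column names are all assumptions made for this example, not the actual implementation used in the analysis.

```python
import csv
import io


class DataManager:
    """Hypothetical helper that parses a CSV source and returns typed rows."""

    def __init__(self, source, numeric_columns):
        # `source` is any iterable of text lines (an open file, io.StringIO, ...).
        self.source = source
        self.numeric_columns = list(numeric_columns)

    def extract(self):
        """Parse the CSV and convert the numeric columns to float.

        Empty or unparsable cells become None so that later
        preprocessing steps can decide how to handle them.
        """
        rows = []
        for record in csv.DictReader(self.source):
            row = dict(record)
            for column in self.numeric_columns:
                row[column] = self._to_float(record.get(column))
            rows.append(row)
        return rows

    @staticmethod
    def _to_float(value):
        try:
            return float(value)
        except (TypeError, ValueError):
            return None


# Usage with an in-memory CSV (made-up data); with a real file,
# pass open(path, newline="") instead.
sample = io.StringIO("id,height,weight\n1,1.80,75\n2,,68\n3,1.65,abc\n")
rows = DataManager(sample, ["height", "weight"]).extract()
```

Keeping missing cells as `None` rather than dropping them at parse time leaves that decision to the preprocessing phase.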

Data transformation

We’ll focus on two transformation phases:

  • Preprocessing transformations
  • Analytics transformations

Incoming raw data must be preprocessed first: we eliminate duplicates and drop null and missing values. Univariate analysis will also show that many samples have out-of-range values for ratio variables, so we need to detect and remove these outliers.
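The three preprocessing steps can be sketched as follows. This is only one possible approach: the function names, the `ratio` column, and the sample rows are hypothetical, and the 1.5 × IQR rule is a common convention for flagging outliers that is assumed here, not necessarily the rule used in this analysis.

```python
import statistics


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_missing(rows, columns):
    """Drop rows where any of the given columns is None."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


def remove_outliers_iqr(rows, column, k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[column] for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r[column] <= high]


# Made-up sample: one exact duplicate, one missing value, one extreme value.
raw = [
    {"id": 1, "ratio": 0.1},
    {"id": 2, "ratio": 0.2},
    {"id": 2, "ratio": 0.2},
    {"id": 3, "ratio": 0.3},
    {"id": 4, "ratio": None},
    {"id": 5, "ratio": 0.4},
    {"id": 6, "ratio": 0.5},
    {"id": 7, "ratio": 0.6},
    {"id": 8, "ratio": 0.7},
    {"id": 9, "ratio": 9.0},
]

clean = drop_duplicates(raw)
clean = drop_missing(clean, ["ratio"])
clean = remove_outliers_iqr(clean, "ratio")
```

The order matters: missing values have to be dropped before the quartiles are computed, because the quantile calculation cannot handle `None`.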

Once the data is usable, we'll apply Factor Analysis to extract features that best explain the variance and correlation in the data.
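To give a feel for what factor extraction does, here is a minimal sketch with NumPy that uses the principal-component method: factor loadings are taken from the leading eigenvectors of the correlation matrix. The function name, the synthetic data, and the choice of this method are assumptions made for illustration. Dedicated estimators, such as scikit-learn's `FactorAnalysis`, fit a different (likelihood-based) model and may give different loadings.

```python
import numpy as np


def principal_factor_loadings(X, n_factors):
    """Estimate factor loadings from the correlation matrix of X.

    Returns (loadings, explained), where loadings has shape
    (n_variables, n_factors) and explained holds the share of total
    standardized variance captured by each factor.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1][:n_factors]
    top_values = eigenvalues[order]
    top_vectors = eigenvectors[:, order]
    # Scale each eigenvector by the square root of its eigenvalue.
    loadings = top_vectors * np.sqrt(top_values)
    explained = top_values / corr.shape[0]
    return loadings, explained


# Synthetic data: three observed variables driven by one hidden factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=(500, 1))
noise = rng.normal(scale=0.3, size=(500, 3))
X = factor @ np.array([[1.0, 0.8, 0.6]]) + noise

loadings, explained = principal_factor_loadings(X, n_factors=1)
```

Because the three variables share a single underlying factor, the first factor should capture most of the variance and all three loadings should be large and share the same sign.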

Preprocessing transformations

Andre Vianna
My Data Science Journey