Data Pre-processing for ML

Iurii Katser
Published in Product AI
Sep 14, 2021

Most machine learning (ML) algorithms require pre-processed data as an input to function properly and build more accurate models. Generally, data pre-processing helps to reduce the amount of analyzed data, create additional informative features, make complex underlying dependencies and hidden patterns explicit, discard uninformative raw signals, and remove noise.

Data pre-processing consists of the following main parts:

Data cleansing and editing: This eliminates invalid values, outliers, and other issues in the data by removing or correcting them. Invalid data must first be detected; it can then be corrected or dropped from the dataset. At this stage, missing values (NaNs) are either filled in, or the data objects (or features) that contain them are removed if the proportion of missing values is large.

Feature transformation: This affects the values of the features (the distribution is changed or features are scaled), their type (continuous values converted into categorical by aggregating), modality (pictures converted into tabular data), etc. This stage mainly includes transformations focused on improving the quality of features or making features applicable for ML algorithms.

Feature selection: This reduces the number of features, either by finding a lower-dimensional subspace with dimensionality reduction methods or simply by removing irrelevant or redundant features. This stage is focused on simplifying the models, reducing the complexity of training, and avoiding the curse of dimensionality.

Feature generation and construction: This involves creating new features based on logic and domain knowledge or on standard transformations, e.g., raising features to a power, multiplying feature values together, or other kinds of feature crossing. This stage is focused on capturing complex non-linear dependencies in the data and providing easy-to-use features for ML algorithms.

Data generation and augmentation: This increases the amount of data by copying existing points (for example, oversampling the minority class), adding slightly transformed data points, creating new synthetic data from existing data, or even generating data from physics-based models.
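To make the five parts above more concrete, here is a minimal sketch that walks a toy table through each of them with pandas and NumPy. The column names, the validity limit of 100 for `temp`, the 0.95 correlation cutoff, and the choice of median imputation and z-score scaling are all assumptions made for illustration; real projects would pick these based on the data and the task.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: sensor readings with gaps, one invalid value,
# a redundant column, and an imbalanced binary label.
raw = pd.DataFrame({
    "temp":         [20.0, 21.0, np.nan, 22.0, 500.0, 21.5],
    "pressure":     [1.0, 1.1, 1.2, np.nan, 1.1, 1.0],
    "pressure_kpa": [100.0, 110.0, 120.0, np.nan, 110.0, 100.0],
    "label":        [0, 0, 0, 0, 1, 1],
})
feature_cols = ["temp", "pressure", "pressure_kpa"]

# 1. Cleansing: mark invalid readings as missing, then fill gaps with the median.
X = raw[feature_cols].copy()
X.loc[X["temp"] > 100, "temp"] = np.nan  # assumed validity limit
X = X.fillna(X.median())

# 2. Transformation: z-score scaling (zero mean, unit variance per feature).
X = (X - X.mean()) / X.std(ddof=0)

# 3. Selection: drop any feature almost perfectly correlated with an earlier one.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
X = X.drop(columns=redundant)

# 4. Generation: a polynomial term and a feature cross.
X["temp_sq"] = X["temp"] ** 2
X["temp_x_pressure"] = X["temp"] * X["pressure"]

# 5. Augmentation: oversample the minority class by copying its rows.
data = X.assign(label=raw["label"])
counts = data["label"].value_counts()
minority = data[data["label"] == counts.idxmin()]
n_extra = int(counts.max() - counts.min())
extra = minority.sample(n=n_extra, replace=True, random_state=0)
prepared = pd.concat([data, extra], ignore_index=True)

print(prepared)
```

The order of the steps here simply follows the list above; in practice the stages are often interleaved or repeated.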

During pre-processing, raw data, which is often unsuitable for analysis and ML algorithms, is transformed into pre-processed (prepared) datasets ready for specific ML tasks.
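One possible way to package this raw-to-prepared transformation is a single reusable object. The sketch below assumes scikit-learn is available and chains a few of its standard transformers in a `Pipeline`; the toy matrix and the particular steps chosen are illustrative assumptions, not a recommended recipe.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical raw feature matrix: rows are samples, columns are features.
# The third column is constant and therefore uninformative.
X_raw = np.array([
    [20.0, 1.0, 7.0],
    [21.0, np.nan, 7.0],
    [np.nan, 1.2, 7.0],
    [22.0, 1.1, 7.0],
])

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),    # cleansing: fill NaNs
    ("select", VarianceThreshold(threshold=0.0)),    # selection: drop constant columns
    ("scale", StandardScaler()),                     # transformation: z-score scaling
    ("expand", PolynomialFeatures(degree=2, include_bias=False)),  # generation
])

# Fit on the raw data and obtain the prepared matrix in one call.
X_prepared = preprocess.fit_transform(X_raw)

# The fitted pipeline applies the same learned steps to new raw data.
X_new = preprocess.transform(np.array([[19.5, np.nan, 7.0]]))

print(X_prepared.shape, X_new.shape)
```

Keeping the steps in one fitted object means the statistics learned from the original data (medians, means, variances) are reused unchanged when new raw samples arrive.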

More details regarding specific operations, algorithms, and techniques for the pre-processing stage can be found in this article and the references therein (with a focus on time-series data). Some general options and recommendations for pre-processing are presented in this article.

Iurii Katser
Product AI

Lead DS | Ph.D. alumnus | Researcher | Lecturer. Time-series analysis, Anomaly detection, Industrial data processing