Data Science Is All About Model Tuning? (2/2)

Yamac Eren Ay · Published in The Startup · 5 min read · Jan 22, 2021

If you haven’t read my previous article, go check it out:

Now that we have clean data ready to be explored, let’s proceed with the real fun stuff.


Step 4: Explore Data

Now you can take a deep breath, because the most exhausting steps are done. It’s time to get initial insights from the data. A complete Exploratory Data Analysis (EDA) consists of:

  • Data Wrangling
  • Statistical Analysis
  • Data Visualization

Data Wrangling: Apply the following operations where needed

  • Group by mean/mode/median/count …
  • Pivot/Melt
  • Create a new feature using existing features
  • Split data into different categories based on a feature/features
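The wrangling operations above can be sketched with pandas; the small sales-style DataFrame below is invented purely for illustration:

```python
import pandas as pd

# Hypothetical sales data, invented for illustration
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 90],
    "cost": [60, 70, 50, 55],
})

# Group by a category and aggregate with mean
by_region = df.groupby("region")["revenue"].mean()

# Pivot long data into a wide table, then melt it back to long
wide = df.pivot(index="region", columns="quarter", values="revenue")
long = wide.reset_index().melt(id_vars="region", value_name="revenue")

# Create a new feature from existing features
df["profit"] = df["revenue"] - df["cost"]

# Split the data into categories based on a feature
north = df[df["region"] == "North"]
```

The same `groupby` call works with `.median()`, `.count()` and friends, so one pattern covers most of the aggregation bullet above.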

Statistical Analysis: Find out everything you can about the variables

  • Description of the data: mean, mode, median, count, number of unique values, etc.
  • Testing the distribution: Shapiro–Wilk test (for normality)
  • Measuring correlation: Pearson correlation (the standard choice) and, if needed, the Predictive Power Score for asymmetric, non-linear relationships.
  • Measuring the difference between groups: t-test (two groups, e.g. test vs. control) and ANOVA (three or more groups)
  • Measuring multicollinearity: Pearson correlation and VIF (Variance Inflation Factor)
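A minimal sketch of three of these tests with SciPy; the data is synthetic, generated only to show the calls:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=200)          # roughly normal sample
y = 2 * x + rng.normal(scale=0.5, size=200)        # strongly correlated with x

# Shapiro–Wilk: p > 0.05 means we cannot reject normality
_, p_normal = stats.shapiro(x)

# Pearson correlation between two variables
r, _ = stats.pearsonr(x, y)

# Two-sample t-test: do two groups share the same mean?
group_a = rng.normal(0.0, 1, 100)
group_b = rng.normal(1.0, 1, 100)
t, p_diff = stats.ttest_ind(group_a, group_b)
```

Here `r` comes out close to 1 (y is built from x) and `p_diff` is tiny (the group means genuinely differ), which is exactly the kind of sanity check these tests give you.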

Data Visualization: See relationships differently

  • Histogram: Find out the distribution of a variable.
  • (Stacked) Bar Chart: Compare different categories based on one feature
  • Line Chart: Track how one or more variables change along a continuous axis, typically time
  • Scatterplot: Find out how two variables are co-distributed (color and size encodings can add a third and fourth); often hints at which model might fit.
  • Box Plot and Violin Plot: Compare the distributions of different variables, including quartiles, median, range, etc.
  • Heatmap: Compare all values in a table, suitable for plotting correlation.
  • Pie Chart: Compare the percentage of categories, suitable for plotting value counts.
  • Word Cloud: Pretty useful for exploring the most frequent words in a text.
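A couple of these chart types sketched with Matplotlib; the data is random, just for illustration (the Agg backend renders off-screen, so this runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: the shape of a single variable's distribution
ax1.hist(values, bins=30)
ax1.set_title("Histogram")

# Scatterplot: how two variables co-vary
x = rng.uniform(size=100)
ax2.scatter(x, 3 * x + rng.normal(scale=0.3, size=100))
ax2.set_title("Scatterplot")

fig.savefig("eda.png")
```

The remaining chart types (box plots, heatmaps, etc.) follow the same pattern with `ax.boxplot`, `ax.imshow` and so on.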

So as you can see, EDA isn’t just about plotting a curve and commenting that “there is a strong correlation and sh*t”. It is a more involved procedure, sometimes even more complex than the analysis itself.


Step 5: Let the Analysis Happen

Now that you have a good understanding of the data, welcome to the most remarkable part of the project! This is the part where you pass in the input and the model magically gives the answer. First, you need to define what the ML model will be doing. A quick overview of the main types of ML:

Clustering:

  • If your intention is to segment the data based on one or more features, or to group similar records.
  • No labels, just features.
  • There are tens of clustering algorithms, each having their own strength. The most popular ones: K-Means Clustering, Agglomerative Clustering, DBSCAN.
  • The most popular metric is the Silhouette Score. If applicable, A/B tests and ANOVA can also be very useful for testing group differences.
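A minimal clustering sketch with scikit-learn: two well-separated synthetic blobs (no labels are used for fitting), clustered with K-Means and scored with the Silhouette Score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs, generated for illustration
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.3, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Silhouette close to 1 means tight, well-separated clusters
score = silhouette_score(X, labels)
```

Swapping in `AgglomerativeClustering` or `DBSCAN` keeps the same fit/score shape, which is why trying several algorithms is cheap.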

Classification:

  • Try to predict the class of an observation based on its features.
  • There can be two (binary) or more (multi-class) categories.
  • Besides Support Vector Classifiers, Naive Bayes and Decision Trees, there are also more advanced algorithms such as Random Forests and Neural Networks.
  • The most popular metrics are Accuracy, F1 and Log Loss (called Categorical Cross-entropy in the multi-class case).
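As a rough sketch of the classification workflow, here is a Random Forest on synthetic data with Accuracy and F1; the dataset and model choice are illustrative, not prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, generated for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)   # fraction of correct predictions
f1 = f1_score(y_test, pred)          # harmonic mean of precision and recall
```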

Regression:

  • Try to predict the value of a continuous label based on given features.
  • It can be Linear or Non-Linear.
  • Linear Regression is the most popular algorithm, and it has many variants with different strengths. If accuracy is the top priority, Support Vector Regressors and Neural Networks are usually more accurate but computationally more expensive.
  • The most popular metrics are MSE (Mean Squared Error) and R² (R-squared).
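A regression sketch in the same spirit: Linear Regression fit to a synthetic linear signal with noise, evaluated with MSE and R²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic linear relationship with noise, for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 4.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=200)

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)

mse = mean_squared_error(y, pred)  # average squared residual
r2 = r2_score(y, pred)             # fraction of variance explained
```

Because the true relationship here is linear, R² lands near 1; on real data, a low R² is a hint to try a non-linear model.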

Recommender Systems:

  • If you are trying to predict how a user would rate an item, given other users’ or items’ ratings.
  • Collaborative Filtering predicts from other users with similar rating patterns; Content-Based Recommendation predicts from the attributes of items the user has already rated.
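A toy user-based Collaborative Filtering sketch with NumPy; the rating matrix is hypothetical, and the prediction is just a similarity-weighted average of other users’ ratings:

```python
import numpy as np

# Hypothetical user x item rating matrix (0 = not yet rated)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 1],
    [1, 1, 5, 4],
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Predict user 0's rating for item 2 from similar users' ratings
target_user, target_item = 0, 2
sims, ratings = [], []
for other in range(R.shape[0]):
    if other != target_user and R[other, target_item] > 0:
        sims.append(cosine(R[target_user], R[other]))
        ratings.append(R[other, target_item])

# Similarity-weighted average of the observed ratings
prediction = np.dot(sims, ratings) / np.sum(sims)
```

User 0 is far more similar to user 1 (who rated the item low) than to user 2, so the prediction comes out low, as you would hope.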

Dimensionality Reduction:

  • This reduces the number of variables while preserving most of the meaningful information, which can improve predictive performance. It comes in handy if there are too many variables or strong multicollinearity between them.
  • The most popular algorithm: PCA (Principal Component Analysis).
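A PCA sketch on synthetic data where one column is nearly a duplicate of another (the multicollinearity case from the bullet above); two components recover almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Three columns, but the third is nearly a copy of the first (multicollinear)
a = rng.normal(size=(100, 1))
b = rng.normal(size=(100, 1))
X = np.hstack([a, b, a + rng.normal(scale=0.01, size=(100, 1))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of total variance kept by the two components
explained = pca.explained_variance_ratio_.sum()
```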

Other Types of ML:

  • Association
  • Reinforcement Learning

Splitting the data into train/validation/test sets comes next. There is no perfect ratio, because it depends on the data. The conventions I use most frequently are:

  • Train/Test = 80/20 or 90/10
  • Train/Validate/Test = 70/10/20
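The 70/10/20 split above can be sketched with two calls to scikit-learn’s `train_test_split`; note that the second `test_size` is relative (12.5% of the remaining 80% is 10% overall):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data, just to show the split mechanics
X = np.arange(100).reshape(100, 1)
y = np.arange(100)

# First carve off the test set (20%)...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# ...then split validation out of the remainder (0.125 * 80% = 10% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, random_state=0)
```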

The next step is training the model on the train set (if a validation set is used, checking at each step that the model still generalizes to it). If the model performs well on the test set, proceed to Step 6. Otherwise, you may try:

  • Searching for optimal hyperparameters by iterating over every combination of hyperparameters (e.g. alpha = [0.001, 0.01, 0.1], weighted = [True, False], beta = [0.1, 0.2, 0.3, 0.4])
  • Going back to the previous steps and making a whole fresh start, if that doesn’t work for you.
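The exhaustive hyperparameter search above can be sketched with `itertools.product`; the `evaluate` function here is a hypothetical stand-in for a cross-validated metric, and the parameter names are taken from the example grid:

```python
from itertools import product

def evaluate(alpha, weighted, beta):
    """Hypothetical stand-in for a cross-validated model score."""
    return -(alpha - 0.01) ** 2 - (beta - 0.2) ** 2 + (0.1 if weighted else 0.0)

grid = {
    "alpha": [0.001, 0.01, 0.1],
    "weighted": [True, False],
    "beta": [0.1, 0.2, 0.3, 0.4],
}

# Try every combination and keep the best-scoring one
best_score, best_params = float("-inf"), None
for alpha, weighted, beta in product(*grid.values()):
    score = evaluate(alpha, weighted, beta)
    if score > best_score:
        best_score, best_params = score, (alpha, weighted, beta)
```

In practice scikit-learn’s `GridSearchCV` wraps exactly this loop (plus cross-validation) for you.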

Step 6: Present Results

Not everyone understands the technical details of your project, so make it clear and to the point. Here’s what you can do after all the hard work:

  • Integrate your model into a simple web app using a minimal web framework, e.g. Streamlit or Flask, and deploy it to Heroku, Kubernetes or a similar service.
  • Visualize the insights from the EDA. Explain what, how and why.
  • Take feedback seriously.
  • Keep training your model on real-world feedback to make it more and more accurate.

As you can see, this is a much more general framework than simply tuning models and playing with GPUs. It doesn’t matter whether you’re predicting stock trends or the winner of next year’s El Clásico. What really matters is the curiosity and reasoning that motivate you to build a project. Every single day, the world is changing: data is becoming the new currency, and Data Science the new literacy. Stay safe, keep getting better!
