Explore Wandering Data

Han Li
Data Science Student Society @ UC San Diego
2 min read · Feb 11, 2019

When we face a dataset with over a hundred trivial features, it is hard to tell which machine learning algorithm suits it best. Sometimes we don't even know whether some features add confusion and thus hurt predictive performance. That makes data exploration an unavoidable step in analyzing a dataset. In other words, we need to understand how the dataset is composed, and which features are indicative and which are trivial or useless. Then we can either remove the trivial features or combine them into a single indicative feature.
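As a rough illustration of spotting trivial features, here is a minimal sketch that flags columns where one value dominates almost every row. The helper name `near_constant_columns`, the 0.9 threshold, and the toy column names are all assumptions made for this example, not something taken from the repo.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.9):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the single most common value in this column
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy data with hypothetical column names (20 rows)
toy = pd.DataFrame({
    "has_toilet": [1] * 19 + [0],   # almost always 1, so it carries little signal
    "rooms": [3, 4, 2, 5, 3, 4, 2, 6, 3, 4, 5, 2, 3, 4, 3, 5, 2, 4, 3, 4],
    "wall_good": [1, 0] * 10,       # evenly split
})

flagged = near_constant_columns(toy)
print(flagged)
```

A column flagged this way is only a candidate for removal or merging; whether it is truly useless still depends on how it relates to the target.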

In this tutorial, I am going to show you how to explore a given dataset. The dataset I am using is the Costa Rican Household Poverty Prediction dataset. It contains over 100 features, such as house condition, education level, and family members. Unfortunately, most of them are trivial and one-hot encoded, so exploring the data is a critical step before making any predictions. I am going to focus on the house condition features and combine the trivial housing features into a single feature that evaluates the house's price.
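To make the idea of combining one-hot housing features concrete, here is a minimal sketch under some loud assumptions: the column names (`wall_bad`, `roof_good`, and so on), the three groups, and the simple sum used for `house_score` are hypothetical choices for illustration, not the actual column names of the dataset or the scoring used in the repo.

```python
import pandas as pd

# Hypothetical one-hot groups, each ordered from worst to best condition
QUALITY_GROUPS = {
    "wall": ["wall_bad", "wall_regular", "wall_good"],
    "roof": ["roof_bad", "roof_regular", "roof_good"],
    "floor": ["floor_bad", "floor_regular", "floor_good"],
}

def collapse_one_hot(df, columns):
    """Turn an ordered group of one-hot columns into an integer level 0..k-1."""
    levels = {col: i for i, col in enumerate(columns)}
    # idxmax(axis=1) gives, per row, the name of the column holding the 1
    return df[columns].idxmax(axis=1).map(levels)

def add_house_score(df, groups):
    """Replace each one-hot group with one ordinal column, then sum them."""
    out = df.copy()
    for name, cols in groups.items():
        out[f"{name}_level"] = collapse_one_hot(out, cols)
    level_cols = [f"{name}_level" for name in groups]
    out["house_score"] = out[level_cols].sum(axis=1)
    # Drop the original one-hot columns now that they are summarized
    one_hot_cols = [c for cols in groups.values() for c in cols]
    return out.drop(columns=one_hot_cols)

# Three toy households: all bad, mixed, all good
toy = pd.DataFrame({
    "wall_bad": [1, 0, 0], "wall_regular": [0, 1, 0], "wall_good": [0, 0, 1],
    "roof_bad": [1, 0, 0], "roof_regular": [0, 0, 0], "roof_good": [0, 1, 1],
    "floor_bad": [1, 0, 0], "floor_regular": [0, 1, 0], "floor_good": [0, 0, 1],
})

scored = add_house_score(toy, QUALITY_GROUPS)
print(scored)
```

Summing the levels treats every group as equally important, which is a simplification; a weighted sum would be a natural variation if some housing features matter more than others.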

Check out the repo here! Have fun!

For full repo, check here.

Thank you~
