A Data Scientist’s Deep Dive into the WiDS Datathon

Women in Data Science
Jan 25

A discussion by Sharada Kalanidhi, Lead Data Scientist at Stanford University

Working with the WiDS Datathon dataset over the past week has been a thrilling exercise. This dataset presents an opportunity to learn from interesting, real-world modeling challenges, and it differs from the curated datasets found in textbooks and classic machine learning exercises. For that reason, I discuss some of the challenges you may encounter around missing data, multicollinearity, and linear/nonlinear approaches. I will also provide resources to help you with these topics.

Sharada Kalanidhi, Stanford University

The Problem of Missing Data

The dataset has many important variables with a significant number of missing values. Data often goes missing because of issues that arise during the collection and curation of data, and these gaps are usually coded as “Null” or “Not Available”, in short “NA.” What are the best assumptions to make about NAs? Do you assume they were generated by a completely random process? Do you drop them? Or do you impute (fill in) appropriate values? Simulating values can produce data that is incongruent with reality, and it turns out that imputation (and simulation-based imputation) is an active area of research. Here are some resources, including R and Python packages (mice, simputation and autoimpute) and blogs, that address the issue of missing data in modeling challenges:

For R users:

For Python users:
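As a starting point, it helps to quantify how much data is actually missing before choosing between dropping and imputing. The sketch below uses a small, hypothetical frame (the column names are illustrative, not from the datathon data) to show the two simplest options with pandas:

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame standing in for the datathon data
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29, np.nan],
    "bmi": [22.1, 27.5, np.nan, 24.0, 30.2],
})

# Inspect how much is missing per column before deciding on a strategy
missing_frac = df.isna().mean()

# Option 1: drop rows with any NA (simple, but loses data)
dropped = df.dropna()

# Option 2: simple imputation with the column median
imputed = df.fillna(df.median())
```

Note that dropping loses two of the five rows here; on a dataset where many variables have NAs, row-wise dropping can discard most of the data.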

There is quite a bit of discussion in the technical literature about whether it is enough to simply replace NA values with the mean or median of the existing values. Some researchers caution against this because it can substitute values that are unreasonable for a given record and can distort the variance and correlations in the data. The preferred approach is to capture and simulate the entire underlying multivariate distribution; in other words, model-based imputation is preferred. If you decide to pursue imputation and are looking for further reading about the different approaches, I also recommend the following resources:

  • Van der Loo, Mark and de Jonge, Edwin, “Statistical Data Cleaning with Applications in R,” Wiley, 2018. Chapter 10 focuses on imputation and adjustment, with a detailed discussion of model-based imputation.
  • Van Buuren, Stef and Groothuis-Oudshoorn, Karin, “mice: Multivariate Imputation by Chained Equations in R,” Journal of Statistical Software, December 2011. Page 6 discusses the problems in imputing multivariate data.
    https://www.jstatsoft.org/article/view/v045i03
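To make the contrast concrete, here is a minimal sketch of model-based imputation using scikit-learn's IterativeImputer, which is in the same spirit as mice (chained equations). The data is synthetic and the setup is my own assumption: one column is made strongly dependent on another, so a model-based imputer can exploit that relationship where a column mean could not.

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn and must be enabled
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)  # column 2 depends on column 0

# Knock out ~20% of column 2 to simulate missingness
mask = rng.random(200) < 0.2
X_missing = X.copy()
X_missing[mask, 2] = np.nan

# Each feature with NAs is modeled as a function of the other features
imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X_missing)
```

Because the imputer regresses column 2 on the others, the filled-in values track the true (held-out) values closely, which a constant mean fill would not.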

Multicollinearity

When I started working with the dataset, I saw right away that several of the variables were highly correlated with each other. This situation is called multicollinearity, and it can affect the significance of regression coefficients. It can also impact the results of a classification exercise. Here are some resources that put the problem into perspective and suggest some avenues for handling multicollinearity.

The problem of Multicollinearity:
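One common diagnostic is the variance inflation factor (VIF): regress each variable on the others and compute 1/(1 − R²), where values well above ~5–10 flag collinearity. The sketch below is a small, self-contained implementation on synthetic data (the helper name `vif` and the data are my own, for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on all the other columns (with an intercept)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.05 * rng.normal(size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)               # independent
X = np.column_stack([x1, x2, x3])
```

Here `vif(X)` gives very large values for the first two columns and a value near 1 for the independent third column, pointing you at which variables to drop, combine, or regularize.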

Linear or Nonlinear Approaches

A question worth addressing early on is what type of patterns embedded in the data could support accurate classification. For example, can the data points be separated easily with a line down the middle? Or could there be nonlinear patterns in the data that, if represented appropriately, would assist with the separation of the classes? If you wish to explore nonlinear separation, there are several approaches. Here are some pointers:
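One quick way to probe this question is to compare a linear and a nonlinear classifier on the same split. The sketch below uses scikit-learn's two-moons toy data (my choice, not the datathon data) with a linear-kernel and an RBF-kernel SVM:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaved half-moons: a classically non-linearly-separable shape
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_score = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_score = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
```

If the nonlinear model's held-out accuracy is clearly higher, as it is on this toy shape, that is a hint that nonlinear structure is present and worth modeling; if the two scores are close, a simpler linear model may suffice.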

Other general thoughts:

Searching and Searching

It is tempting to do a grid search over every possible combination of parameters. Tempting, and also often infeasible: the number of fits grows multiplicatively with every parameter you add. I would suggest keeping your grid searches reasonable.
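One practical compromise is random search, which samples a fixed budget of parameter settings instead of enumerating the full grid. A minimal sketch with scikit-learn (synthetic data, illustrative parameter ranges):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# A full grid over these ranges would be hundreds of fits;
# random search samples only n_iter of them.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(2, 12),
    },
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X, y)
```

With `n_iter=10` and 3-fold cross-validation, this is 30 fits total, a fixed budget you set up front rather than one that explodes with each new parameter.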

Housekeeping

Definitely fit ensembles of models. But keep track of your dataset versions and models. In one week, you could fit dozens of models (not including parameter grid/random searches with hundreds or thousands of fits). Was RF_52 the Random Forest fit on the dataset where I dropped NAs, or the one where I imputed them? After all that fitting, it is easy to lose track. Start off by creating a system of folders and logs where you record your assumptions at the moment you train each model.
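Even a tiny append-only log goes a long way here. The sketch below is one possible shape for such a log (the function name, fields, and file path are my own assumptions): one JSON line per training run, recording the model name, dataset version, parameters, and score at the moment of fitting.

```python
import json
from datetime import datetime, timezone

def log_run(path, model_name, dataset_version, params, score):
    """Append one training run's assumptions and results to a JSONL log."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "dataset": dataset_version,
        "params": params,
        "score": score,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Example: never wonder again which dataset RF_52 was fit on
rec = log_run("runs.jsonl", "RF_52", "v3_imputed_NAs",
              {"n_estimators": 200}, 0.87)
```

A JSONL file is easy to grep, easy to load back into a DataFrame, and survives notebook restarts; dedicated tools exist for this, but a plain log captures the essentials.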

Above all…

Enjoy the data challenge. There is a lot to observe and experiment with. Take your time — iterate through the models. Go back to early fits. What was fit poorly or well? How could you change the algorithms you fit? Are there research papers discussing your observations? Once you start on this journey into statistical questioning and research, there’s no telling where it might lead you. Think of it as an invitation into the deep and rich world of mathematical statistics — an ocean of knowledge in its own right.

Learn more about the WiDS Datathon and sign up to take the challenge.


Sharada Kalanidhi is a Data Scientist/Quantitative Strategist and Inventor with over 20 years of industry experience. She spent a decade in the bond markets doing quantitative strategy for the fixed income/mortgage sector. She later transitioned to the West Coast, working on diverse data science problem areas such as biochemistry research, website engagement, IoT and mathematical economics. Having been exposed to the methods favored by different industries, she believes in a data science approach rooted in iterative research across a spectrum of statistical methodologies.
