image credit www.pexels.com

Data sets to play with while learning Data Science and Deep Learning

Most of what you learn about data science techniques comes from actually working with data. However, this can quickly become a time-consuming and frustrating adventure if you use the wrong kind of data.

Apurva Naik
Startup Data Science
3 min read · Jul 1, 2017

I’ve compiled some rules of thumb for selecting the right type of data while learning data science techniques.

Regression is used to predict a real-valued (numerical) output for the variable defined by the problem statement. An example of a regression problem is the well-known Auto MPG data set, where the mileage of a car is predicted from the make of the car, model year, horsepower and other attributes. The data sets available at the UCI Machine Learning Repository have been extensively studied and are a good starting point for learning regression.
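To make the idea concrete, here is a minimal sketch of a one-variable regression using ordinary least squares in plain Python. The numbers are invented for illustration and are not taken from the Auto MPG data set; the names `fit_line` and `predict_mpg` are hypothetical helpers, not part of any library.

```python
# Toy data invented for illustration only; NOT real Auto MPG values.
horsepower = [60.0, 80.0, 100.0, 120.0, 140.0]
mpg = [36.0, 31.0, 26.0, 21.0, 16.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(horsepower, mpg)


def predict_mpg(hp):
    """Predict a numerical output (mileage) from one input (horsepower)."""
    return slope * hp + intercept


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted mpg at 110 hp: {predict_mpg(110.0):.1f}")
```

With a real data set you would typically use several attributes at once and hold out some rows to check how well the predictions generalise.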

Classification is similar to regression, except that the predicted variable is categorical (non-numeric classes). An example of a classification problem is predicting a credit score class (high or low) based on income, age, education, etc. In addition to the data sets mentioned above, Kaggle has a wealth of open, clean data that is suitable for beginners.
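As a rough illustration of predicting a class rather than a number, the sketch below uses a nearest-centroid classifier, chosen here only because it fits in a few lines. The training rows are hypothetical, and `centroids` and `classify` are illustrative helper names.

```python
from math import dist

# Hypothetical rows: (income in thousands, age in years) -> credit label.
training = [
    ((25.0, 22.0), "low"),
    ((30.0, 27.0), "low"),
    ((35.0, 24.0), "low"),
    ((80.0, 45.0), "high"),
    ((90.0, 50.0), "high"),
    ((100.0, 40.0), "high"),
]


def centroids(rows):
    """Average the feature vectors of each class."""
    sums = {}
    for (income, age), label in rows:
        total = sums.setdefault(label, [0.0, 0.0, 0])
        total[0] += income
        total[1] += age
        total[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def classify(point, centers):
    """Return the label whose centroid is closest to the point."""
    return min(centers, key=lambda label: dist(point, centers[label]))


centers = centroids(training)
print(classify((85.0, 42.0), centers))
print(classify((28.0, 30.0), centers))
```

On real data the features would usually be scaled first, since income and age are measured in very different units.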

Time series analysis is done on data that includes a time variable. It is often used to look for patterns that help predict how the series will behave in the future, and to study the interaction between variables over time. Stock market data is a typical example. In addition to the Time Series Data Library, Quandl has interesting financial, economic and industry-based data sets.
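One simple way to look for a pattern in a series is to smooth it with a moving average. The sketch below assumes a made-up list of daily prices, and the helper `moving_average` is hypothetical. Using the last smoothed value as a forecast is only a naive baseline, not a recommended model.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values in the series."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Made-up daily closing prices, for illustration only.
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

smoothed = moving_average(prices, 3)
naive_forecast = smoothed[-1]  # baseline guess for the next value

print(smoothed)
print(naive_forecast)
```

Real series usually call for more care, such as handling trend and seasonality, but a smoothed plot is often a useful first look.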

Visualizations are an important part of exploratory analysis, both for finding out which features matter more than others and for displaying the results of an analysis. But visualizations are not confined to data science; they are often used to tell a story effectively. Flowing Data and the subreddit r/dataisbeautiful host wonderful visualizations, as does bl.ocks, a gallery of visualizations made in d3.js.

Learning Natural Language Processing does not have to be limited to analyzing the Reddit comment corpus or doing Twitter sentiment analysis. Check out the Spoken English database or Stanford’s SNAP database and get creative!

Large data sets and powerful GPUs are becoming increasingly accessible, and image classification is no longer limited to academic researchers and think tanks. Although ImageNet is still the go-to place for learners and researchers, there are many other resources, such as the Face Recognition Homepage, the Cats and Dogs data set (and many more on Kaggle), the Tiny Images data set and the Indian Movie Face data set.

Here’s a list of lists containing links to even more datasets:

All you need to do now is flex your curiosity, find an interesting problem and learn away!

Apurva Naik
Startup Data Science

Applied Data Scientist at National Real Tax Tracking LLC and Co-Host at Startup Data Science