Before performing any data science task such as exploratory data analysis or building a model, you must ask yourself the following important questions:
What do you want to find out or discover using your data?
Do you have the appropriate data to analyze?
Data is key to any data science and machine learning task. Data comes in different flavors such as numerical data, categorical data, text data, image data, sound data, and video data. The predictive power of a model depends on the quality of data used in building the model.
Advantages of high-quality data
a) High-quality data is less likely to produce errors.
b) High-quality data can lead to low generalization error, i.e., the model captures real-life effects and can be applied to unseen data for predictive purposes.
c) High-quality data will produce reliable results due to small uncertainties, i.e., the data capture real effects and contain less random noise.
d) High-quality data containing a large number of observations will reduce variance error (by the Central Limit Theorem, the standard error of an estimate decreases as the sample size grows).
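The effect in (d) can be illustrated with a quick simulation, sketched here with NumPy (the sample sizes and number of trials are arbitrary choices, not from the original article): the spread of the sample mean shrinks roughly as one over the square root of the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mean_spread(n, trials=2000):
    """Standard deviation of the sample mean over many repeated samples."""
    means = rng.normal(loc=0.0, scale=1.0, size=(trials, n)).mean(axis=1)
    return means.std()

spread_small = sample_mean_spread(25)    # n = 25
spread_large = sample_mean_spread(2500)  # n = 2500 (100x more data)

# By the Central Limit Theorem the spread should shrink by about sqrt(100) = 10x
print(spread_small, spread_large)
```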
This article will discuss the various sources of data that can be used for analysis and model building.
If you are just looking for data to practice and hone your data science skills, we will discuss how to access open and free datasets in Section I. Sometimes data used for analysis or model building isn’t available. In this case, an individual or organization has to design an experiment for collecting data or simply purchase the data. This will be discussed in Section II.
I. Building a Machine Learning Model When Data is Available
If you are interested in open and free datasets that you can use to practice your data science and machine learning skills, here are some open resources:
a) R Datasets Package
The R Datasets Package contains a variety of datasets. For a complete list, use
library(help = "datasets")
As an example, women is a dataset belonging to the datasets package containing the average heights and weights of American women, and can be accessed as follows:
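A minimal sketch of loading and inspecting it:

```r
# Load the women dataset (shipped with base R's datasets package)
data(women)

# Inspect the first few rows: height (in) and weight (lbs)
head(women)

# Quick summary statistics
summary(women)
```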
b) R Dslabs Package
The dslabs package contains datasets and functions that can be used for data analysis practice, homework, and projects in data science courses and workshops. It includes 26 datasets for case studies in data visualization, statistical inference, modeling, linear regression, data wrangling, and machine learning.
The dslabs package could be installed and accessed as follows:
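One way to install and load it (the murders dataset used below is just one of the datasets the package includes):

```r
# Install (once) and load the dslabs package
install.packages("dslabs")
library(dslabs)

# List the datasets included in the package
data(package = "dslabs")

# Example: load and inspect the murders dataset
data(murders)
head(murders)
```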
c) Python Sklearn Datasets
Sklearn datasets can be accessed as follows:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
breast_cancer_data = datasets.load_breast_cancer()
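Each loader returns an object with the feature matrix in .data and the labels in .target; a quick sketch of inspecting the iris dataset:

```python
from sklearn import datasets

# Load the classic iris dataset (150 samples, 4 features)
iris = datasets.load_iris()

print(iris.data.shape)     # feature matrix: (150, 4)
print(iris.target.shape)   # class labels: (150,)
print(iris.target_names)   # the three iris species
```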
d) University of California Irvine (UCI) Machine Learning Repository
UCI currently maintains 487 datasets as a service to the machine learning community; they can be used for data analysis practice, homework, and projects in data science courses and workshops.
e) Kaggle Datasets
Kaggle also hosts a large collection of datasets suitable for very challenging data science and machine learning projects.
f) From the Internet
Sometimes you can scrape data from websites, although a lot of work is usually needed to clean, organize, and reshape the scraped data. Some websites, however, present data in a clean and structured format. An example is the list of college towns dataset that can be scraped from Wikipedia. The scraped data can then be wrangled and saved as a text file for further analysis: Tutorial on Data Wrangling: College Towns Dataset.
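The core of such scraping is pulling rows out of an HTML table. This is a minimal sketch using only Python's standard library (not the tutorial's actual code); the hard-coded HTML snippet stands in for a page you would fetch with urllib in practice:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of every row in an HTML table."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

# In practice the HTML would come from urllib.request.urlopen(url).read();
# a hard-coded snippet keeps the example self-contained.
html = """<table>
<tr><th>State</th><th>Town</th></tr>
<tr><td>Alabama</td><td>Auburn</td></tr>
<tr><td>Alabama</td><td>Tuscaloosa</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)
print(parser.rows)
```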
Both Python and R can import data directly from a CSV file if you know the file’s URL.
(i) Import CSV File Using Python and File’s URL
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
(ii) Import CSV File Using R and File’s URL
- download.file() function
This function will download a CSV file and save it as a new file:
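For example, using the same UCI breast cancer file referenced above:

```r
# Download the CSV file from the UCI repository and save it locally
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
download.file(url, destfile = "wdbc.data")
```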
- read.csv() function
This function will download the file and save it as a data frame:
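Again using the UCI file as an example:

```r
# Read the CSV file at the URL directly into a data frame
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
df <- read.csv(url, header = FALSE)
head(df)
```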
(iii) Extracting data from pdf files
In addition to the CSV file format, internet data can also be extracted from pdf files: Extracting Data from PDF File Using Python and R.
II. Building a Machine Learning Model When Data is Not Available
In Section I, we assumed that we already have the data, or that whatever data we want can easily be collected. For example, if we would like to use content from a person’s Twitter or Facebook feed to predict the person’s likelihood of following a certain musician, the data can be obtained from the Twitter or Facebook feeds whenever it is needed.
Sometimes we don’t have the data, and getting the full dataset either isn’t possible or would take too long. In that case, we need to design a way of collecting the best subset of data we can get in a quick and cost-efficient manner, while making sure the data collected will be sufficient for answering the questions at hand.
Here are some methods for obtaining data that is not open and free:
a) Purchase raw data from organizations or companies
This method is costly, but it saves time: data purchased from a company or organization may already be in a structured format that can be used directly for analysis, without any cleaning or reshaping.
b) Data from Surveys
This method also involves cost, as it costs money to design and implement a survey. In addition, data collected from surveys may contain lots of missing values or values in an incorrect format; for example, someone could enter their age as ‘twenty-eight’ instead of 28. A lot of work is therefore needed to preprocess, organize, clean, and reshape data collected from surveys.
c) Data from Experiments
In this case, you need to decide what your response (dependent) variable and your predictors are. For example, when looking at house prices in a given neighborhood, you may decide to predict housing prices based on predictors or features such as the number of bedrooms, square footage, zip code, school district, year built, etc.
d) Data from Sensors
Industries and companies can build sensors for collecting data, for example, sensors to collect temperature, pressure, or humidity data.
e) Simulate Data
This method is mostly used for stochastic processes. For example, you can use Monte-Carlo simulation to generate data that follows a given probability distribution, such as the Poisson distribution or the normal distribution. This method of generating raw data is free, and probabilistic methods can then be used for building models. Some famous probability distributions used to simulate real-life phenomena include the uniform, Gaussian (normal), Bernoulli, Poisson, and exponential distributions.
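A hedged sketch of this idea with NumPy (the rate parameter and sample sizes below are arbitrary choices): draw simulated observations from a distribution and check that their sample statistics match the distribution’s known properties.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulate 10,000 observations from a Poisson distribution with rate 3
poisson_sample = rng.poisson(lam=3.0, size=10_000)

# Simulate 10,000 observations from a standard normal distribution
normal_sample = rng.normal(loc=0.0, scale=1.0, size=10_000)

# For a Poisson(3) distribution, both the mean and the variance equal 3
print(poisson_sample.mean())   # close to 3
print(poisson_sample.var())    # close to 3
print(normal_sample.mean())    # close to 0
```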
Here is an example of a machine learning model in which Monte-Carlo simulation was used to create replicas of the original sample dataset: Machine Learning Model for Stochastic Processes.
In summary, we’ve discussed several sources of data that can be used for data science projects. Data is key to any data science and machine learning task. It is always advisable to make sure the data used for model building is readily available and of high quality. If the data needed is not available, we have to design an experiment for collecting it, making sure the data collected will be sufficient for answering the questions at hand.