Data and Machine Learning

Raghavendra R
Published in astringe
5 min read · Oct 19, 2020

Machine learning models use algorithms that improve over time, but they need quality data to operate efficiently. Today, we will discuss what machine learning datasets are, the types of data machine learning needs to be effective, and where engineers can find datasets to use in their own machine learning models.

What is a dataset in machine learning?

To understand what a dataset is, we must first discuss its components. A single row of data is called an instance. A dataset is a collection of instances that share a common set of attributes. A machine learning project will generally use a few different datasets, each fulfilling a different role in the system.

For a machine learning model to learn how to perform a task, a training dataset must first be fed into the machine learning algorithm. A validation dataset (or testing dataset) is then used to check that the model is interpreting this data accurately.
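The split described above can be sketched in a few lines of Python. This is only an illustration: the `split_dataset` helper, the 70/15/15 proportions, and the toy instances are assumptions, not something prescribed by any particular library. It also treats validation and test sets as two separate held-out sets, which is a common convention.

```python
import random

def split_dataset(instances, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the instances and split them into training, validation, and test lists."""
    shuffled = list(instances)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains is held out for testing
    return train, val, test

# Hypothetical dataset: 20 instances (rows) that share the same attributes.
instances = [{"id": i, "value": i * 2} for i in range(20)]
train, val, test = split_dataset(instances)
print(len(train), len(val), len(test))
```

Shuffling before splitting helps keep each subset representative of the whole, and every instance ends up in exactly one of the three lists.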

Once you feed these training and validation sets into the system, subsequent datasets can be used to refine your machine learning model going forward. In general, the more quality data you provide to the ML system, the better that model can learn and improve.

What type of data does machine learning need?

Data can come in many forms, but machine learning models rely on four primary data types. These include numerical data, categorical data, time-series data, and text data.

Numerical data

Numerical data, or quantitative data, is any form of measurable data, such as your height, your weight, or the cost of your phone bill. You can check whether a set of data is numerical by trying to average the values or sort them in ascending or descending order. Whole-number counts (e.g., 26 students in a class) are considered discrete, while values that can fall anywhere within a range (e.g., a 3.6 percent interest rate) are considered continuous. Numerical data is not tied to any specific point in time; it is simply raw numbers.
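As a quick illustration of the averaging and sorting check, here is a small Python sketch. The values are made up for the example, and the `is_discrete` helper is a hypothetical name that simply tests for whole numbers.

```python
from statistics import mean

class_sizes = [26, 31, 24, 29]            # discrete: whole-number counts
interest_rates = [3.6, 4.1, 2.9, 3.75]    # continuous: any value within a range

def is_discrete(values):
    """Return True when every value is a whole number."""
    return all(float(v).is_integer() for v in values)

avg_size = mean(class_sizes)              # averaging only makes sense for numerical data
sorted_rates = sorted(interest_rates)     # so does sorting in ascending order

print(avg_size, sorted_rates)
print(is_discrete(class_sizes), is_discrete(interest_rates))
```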

Categorical data

Categorical data is sorted by defining characteristics. These can include gender, social class, ethnicity, hometown, the industry you work in, or a variety of other labels. This data type is non-numerical, meaning you cannot add the values together, average them, or sort them in any numerical or chronological order. Categorical data is great for grouping individuals or ideas that share similar attributes, which helps your machine learning model streamline its data analysis.
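Because categorical values cannot be added or averaged, they are usually counted by group or converted into numeric indicator columns before a model sees them. The sketch below shows both ideas with one-hot encoding, a common technique that the article does not name. The industry labels and the `one_hot` helper are assumptions for illustration.

```python
from collections import Counter

# Hypothetical categorical column: the industry each person works in.
industries = ["retail", "finance", "retail", "health", "finance", "retail"]

# Grouping: count how many instances fall into each category.
counts = Counter(industries)

# Alphabetical order is used only to give the encoded columns a stable layout.
categories = sorted(set(industries))

def one_hot(value, categories):
    """Encode one label as a list of 0s with a single 1 at its category's position."""
    return [1 if value == category else 0 for category in categories]

encoded = [one_hot(value, categories) for value in industries]
print(counts)
print(categories, encoded[0])
```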

Time series data

Time series data consists of data points that are indexed at specific points in time. More often than not, this data is collected at consistent intervals, which makes it easy to compare data from week to week, month to month, year to year, or according to any other time-based metric you desire. The key difference between time series data and numerical data is that time series data has established starting and ending points, while numerical data is simply a collection of numbers that isn’t rooted in a particular time period.
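A minimal sketch of the week-to-week comparison described above, using only the Python standard library. The start date and the weekly figures are invented for the example.

```python
from datetime import date, timedelta

# Hypothetical weekly readings, each indexed by the date it was collected.
start = date(2020, 1, 6)
values = [120, 135, 128, 150]
series = [(start + timedelta(weeks=i), v) for i, v in enumerate(values)]

# Week-over-week change: pair each reading with the one that follows it.
changes = [
    (later_day, later_value - earlier_value)
    for (_, earlier_value), (later_day, later_value) in zip(series, series[1:])
]

for day, change in changes:
    print(day.isoformat(), change)
```

Because every value carries its timestamp, the same list can be regrouped by month or year without losing the ordering.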

Text data

Text data is simply words, sentences, or paragraphs that can provide some level of insight to your machine learning models. Since raw text can be difficult for models to interpret on its own, it is most often grouped or analyzed using methods such as word frequency, text classification, or sentiment analysis.
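Word frequency is the simplest of these methods to show in code. The following sketch lowercases a piece of text, splits it into words with a basic regular expression, and counts them. The sample sentence and the `word_frequencies` helper are assumptions, and real projects typically use more careful tokenization.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, extract words, and count how often each one appears."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical product review used as sample text data.
review = "The battery life is great. The screen is great too, but the price is high."
freq = word_frequencies(review)
print(freq.most_common(3))
```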

Where do engineers get datasets for machine learning?

There are many places to find machine learning data, but we have compiled five of the most popular ML dataset resources to help you get started:

Google’s Dataset Search

Google released its Dataset Search engine in September 2018. Use this tool to view datasets across a wide array of topics such as global temperatures, housing market information, or anything else that piques your interest. Once you enter your search, several applicable datasets will appear on the left side of your screen. Each result includes the dataset’s date of publication, a description of the data, and a link to the data source.

Microsoft Research Open Data

Microsoft is another technology leader that has created a database of free, curated datasets in the form of Microsoft Research Open Data. These datasets are available to the public and are used to “advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.” Download datasets from published research studies or copy them directly to a cloud-based Data Science Virtual Machine.

Amazon datasets

Amazon Web Services (AWS) has grown to be one of the largest on-demand cloud computing platforms in the world. With so much data being stored on Amazon’s servers, a plethora of datasets has been made available to the public through AWS resources. These datasets are compiled into Amazon’s Registry of Open Data on AWS. Looking up datasets is straightforward, with a search function, dataset descriptions, and usage examples provided.

UCI Machine Learning Repository

The University of California, Irvine’s School of Information and Computer Sciences provides a large amount of information to the public through its UCI Machine Learning Repository. This repository includes nearly 500 datasets, domain theories, and data generators that are used for “the empirical analysis of machine learning algorithms.” The repository is easy to search, and UCI also classifies each dataset by the type of machine learning problem, simplifying the process even further.

Government datasets

The United States Government has released several datasets for public use. These datasets can be used for conducting research, creating data visualizations, developing web/mobile applications, and more. The US Government database can be found at Data.gov and contains information pertaining to industries such as education, ecosystems, agriculture, and public safety, among others. Many countries offer similar databases and most are fairly easy to find.
