“New to Data Science? 7 FREE Dataset Sites to Jumpstart Your Machine Learning Projects”

Prashant
Published in AIMLTutorial
4 min read · Jun 25, 2024

So, you’ve entered the world of Data Science & Machine Learning and want to test your newly acquired skills, but you’re unsure where to find free datasets for your practice projects?

Your search ends here!

What does a Data Scientist need?

DATA ………………….. (a lot of it)

Here is a list of the top free sites to get datasets for your Data Science and Machine Learning projects. All of these sites let you download datasets for free and use them in your projects.

Kaggle

When you Google for machine learning datasets, Kaggle is one of the top resources that comes up in the search results. That’s because Kaggle not only has a lot of datasets, it also has a big community of Data Science and Machine Learning users ranging from beginners to experts.

Why is Kaggle great for beginners?

  • Machine Learning Focused: Datasets are specifically curated for machine learning projects, categorized and searchable.
  • Competitions: Participate in competitions to test your skills (win $$ too) and learn from others.
  • Community Code: Learn from other people’s code, e.g. EDA (exploratory data analysis) and model training.
  • Model Access: Get access to thousands of machine learning models, including LLMs.
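To make "EDA" a little more concrete, here is a minimal sketch of a first pass over a downloaded CSV, using only Python's standard library. The sample data, the column names (`age`, `fare`, `city`) and the helper `summarize_numeric` are made up for illustration; with a real download you would read the file from disk instead of the inline string.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """age,fare,city
22,7.25,Paris
38,71.28,London
,8.05,Paris
35,53.10,Rome
"""

def summarize_numeric(rows, column):
    """Return count, missing count, mean, min and max for one numeric column."""
    values = []
    missing = 0
    for row in rows:
        cell = (row.get(column) or "").strip()
        if cell == "":
            missing += 1  # empty cell: count it instead of failing
        else:
            values.append(float(cell))
    return {
        "count": len(values),
        "missing": missing,
        "mean": statistics.mean(values) if values else None,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }

# For a real file, pass an opened file object to csv.DictReader instead.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
age_summary = summarize_numeric(rows, "age")
print(age_summary)
```

Checking counts, missing values and ranges like this is usually the first thing you will see in community notebooks, typically done with pandas rather than by hand.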

Click here to access Kaggle Datasets

UCI Machine Learning Repository

The University of California, Irvine maintains 600+ datasets for machine learning projects. These datasets are widely used by machine learning students and researchers.

Why are datasets from UCI recommended?

Filtering Options:

  • Filter by data type: Image, Multivariate, Sequential, Time-Series
  • Filter by subject area, such as Business, Computer Science, Engineering, Games, Law, etc.
  • Filter by task: Classification, Regression, Clustering
  • Filter by Features

Proven Quality: Datasets are well documented and have been used in numerous research papers and projects.

Click here to access UCI Machine Learning Repository

US Government Open Data

Data.gov gives anyone access to datasets from US state and federal government sources. You will find thousands of datasets for your data science projects, and some of them are updated frequently.

Why use datasets from Data.gov?

  • Huge Collection: Over 300,000 state and federal datasets.
  • Variety of Sources: Covers federal, state, city, county, university, and various other sources.
  • Multiple download formats: XML, HTML, CSV, JSON, and many others.
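Since the same data may be offered in several formats, it helps to know how to get each into a common shape. The sketch below reads made-up CSV and JSON text into lists of dictionaries with Python's standard library. The field names (`station`, `year`, `rainfall_mm`), the values, and the helpers `load_csv` and `load_json` are hypothetical; real files will have their own fields.

```python
import csv
import io
import json

# Made-up records for illustration; not real government data.
CSV_TEXT = "station,year,rainfall_mm\nA,2023,900\nB,2023,450\n"
JSON_TEXT = (
    '[{"station": "A", "year": 2023, "rainfall_mm": 900},'
    ' {"station": "B", "year": 2023, "rainfall_mm": 450}]'
)

def load_csv(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))

def load_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of records")
    return data

csv_rows = load_csv(CSV_TEXT)
json_rows = load_json(JSON_TEXT)

# CSV values arrive as strings, so convert them before doing arithmetic.
csv_total = sum(int(r["rainfall_mm"]) for r in csv_rows)
json_total = sum(r["rainfall_mm"] for r in json_rows)
print(csv_total, json_total)
```

The main difference to keep in mind is typing: JSON keeps numbers as numbers, while CSV gives you strings that you have to convert yourself.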

Click here to access US Government Open Data

OpenML

OpenML is an open platform for sharing datasets, algorithms, and experiments for machine learning research. OpenML datasets come with tasks to solve, and each task shows the evaluation runs that other people submitted and how their models performed. For each run, you can also see additional details and the pipeline that was used.

Why use datasets from OpenML?

  • Extensive Collection: Over 5000 machine learning datasets.
  • Collaborative Platform: Share your experiments and see how others performed on the same tasks.
  • Download formats: JSON, XML, CSV

Click here to access OpenML Datasets

NASA Open Data Portal

Even if you don’t work for NASA, you can download NASA’s datasets from the NASA Open Data Portal and use them in your projects. Data is available for Aerospace, Applied Science, Earth Science, and other categories.

Why use datasets from NASA?

  • Because it’s NASA!
  • Prestigious Source: Use high-quality data from a leading space agency.

Click here to access NASA datasets

Google Dataset Search

Google it!

OK, it’s not your regular Google search, but a special search engine by Google that helps you find datasets that meet your data science project needs.

Why use Google Dataset Search?

  • Focused Search: Specially developed to search for datasets.
  • Comprehensive Results: Find datasets that you may not find through a regular search engine.
  • Citations: Shows how many times a dataset is cited in scholarly articles.

Click here to access Google Dataset Search

Our World in Data

At some point, you have probably visited this site or seen its analysis cited on other leading websites. ‘Our World in Data’ is like an encyclopedia of world data. It has lots of graphs, and most of them include a link to download the underlying dataset. So, if you’re interested in analyzing world data in your data science projects, make sure to check it out.

Why use Our World in Data?

  • Diverse Topics: Hundreds of topics to choose from.
  • Geography Filtering: Filter data by geography/country.
  • Graphical Inspiration: Use existing graphs for inspiration in your own projects.
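If you download the data behind a graph, you can also do the country filtering yourself. Here is a minimal sketch using Python's standard library. It assumes the CSV has `Entity`, `Year` and `value` columns; that layout, the numbers, and the helper `rows_for_country` are assumptions for illustration, so check the header of the file you actually download.

```python
import csv
import io

# Made-up numbers for illustration only.
SAMPLE = """Entity,Code,Year,value
India,IND,2000,10.5
India,IND,2010,12.0
Brazil,BRA,2000,8.2
Brazil,BRA,2010,9.1
"""

def rows_for_country(rows, country, column="Entity"):
    """Keep only the rows whose country column matches, ignoring case."""
    wanted = country.strip().lower()
    return [r for r in rows if r[column].strip().lower() == wanted]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
india = rows_for_country(rows, "india")

# Turn the filtered rows into a year -> value mapping, ready for plotting.
series = {int(r["Year"]): float(r["value"]) for r in india}
print(series)
```

From here, the year-to-value mapping can go straight into whatever plotting library you prefer to recreate or extend the original graph.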

Click here to access Our World in Data

General Tip for Beginners in Data Science

Tip: If you’re new to data science, instead of downloading random datasets, try searching for datasets that are related to your hobbies/interests. This will keep you engaged and might provide valuable insights into topics you care about.

Prashant
AIMLTutorial

Exploring the world of data science and machine learning, one dataset at a time. Come join me on this journey together.