Top 100 Open Source Datasets for Data Science

Sowhardh Honnappa
Published in Analytics Vidhya
6 min read · Apr 24, 2020

Datasets across categories: Computer Vision, NLP, Reinforcement Learning, Deep Learning, etc.

Image Source: https://ahrefs.com/blog/public-data-sets/

1. Quandl

Quandl is a large repository of economic and financial data. Most of the datasets are free, but some are available only for purchase.

Link: https://www.quandl.com/search

2. Academic Torrents

Academic Torrents hosts data used in published scientific research papers. It offers a wide variety of datasets, all free to download.

Link: https://academictorrents.com/browse.php?cat=6

3. Data.gov

Data.gov offers a variety of datasets from US Government agencies. Domains include education, climate, food, chronic disease, and more.

Link: https://www.data.gov/

4. UCI Machine Learning Repository

This site hosts datasets maintained by the University of California, Irvine. It has a collection of more than 400 datasets aimed at the Machine Learning community.

Link: http://archive.ics.uci.edu/ml/index.php
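Some of the classic datasets in this repository are distributed as comma-separated files without a header row, with the column names listed in a separate description file. Here is a minimal sketch of reading such a file, using only the Python standard library. The sample rows and column names are made up for illustration and do not come from any real dataset.

```python
import csv
import io


def read_headerless_csv(text, column_names):
    """Parse headerless comma-separated text into a list of dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for fields in reader:
        # Skip blank lines, which some files carry at the end.
        if not fields:
            continue
        if len(fields) != len(column_names):
            raise ValueError(
                f"expected {len(column_names)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(column_names, fields)))
    return rows


# Illustrative sample only; not copied from any real file.
SAMPLE = "5.0,3.2,a\n6.1,2.9,b\n\n"
COLUMNS = ["length", "width", "label"]  # hypothetical column names

rows = read_headerless_csv(SAMPLE, COLUMNS)
for row in rows:
    print(row)
```

All values come back as strings, so numeric columns still need converting before any modelling.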

5. Google Public Datasets

Google hosts a large number of datasets through its Public Datasets program on Google Cloud Platform. You can browse and query the collection using BigQuery. The first 1 terabyte of query data processed each month is free.

Link: https://cloud.google.com/bigquery/public-data/
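Since the free allowance is measured in data processed, it can help to check how much a query would scan before running it. The sketch below is one possible approach, not an official recipe. It assumes the `google-cloud-bigquery` package is installed and credentials are configured for the dry-run helper, and it assumes 1 TB is counted as 10^12 bytes (billing may use binary units instead). The helper names `free_tier_fraction` and `estimate_query_bytes` are made up for this example.

```python
# Assumption: 1 TB is treated as 10**12 bytes here; actual billing may
# count in binary units, so treat the result as an approximation.
FREE_TIER_BYTES = 10**12


def free_tier_fraction(bytes_processed, free_bytes=FREE_TIER_BYTES):
    """Return the share of the free allowance that a query would use."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / free_bytes


def estimate_query_bytes(sql):
    """Dry-run a query and return the bytes it would process.

    Needs the google-cloud-bigquery package and valid credentials,
    so the import is kept inside the function.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # dry run: nothing is executed
    return job.total_bytes_processed


# Example: a query that would scan 25 GB (a made-up figure).
share = free_tier_fraction(25 * 10**9)
print(f"{share:.1%} of the free allowance")
```

A dry run does not execute the query, so it is a cheap way to compare alternative queries before committing to one.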

6. Datasets on GitHub

This GitHub repository collects a wide variety of public datasets, such as climate data, time series data, and plane crash data. Feel free to dig in.

Link: https://github.com/awesomedata/awesome-public-datasets

7. Socrata

Socrata hosts cleaned datasets across domains such as government data, radiation data, and workplace-related data.

Link: https://opendata.socrata.com/
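Socrata-hosted datasets can usually be read programmatically through the SODA API, where each dataset has a `/resource/<id>.json` endpoint that accepts query parameters such as `$limit` and `$where`. Here is a small sketch of building such a request URL. The dataset identifier `abcd-1234` and the `year` column are placeholders, not real ones, and `soda_url` is a helper name made up for this example.

```python
from urllib.parse import urlencode


def soda_url(domain, dataset_id, limit=5, where=None):
    """Build a SODA request URL for a Socrata-hosted dataset."""
    params = {"$limit": limit}
    if where:
        params["$where"] = where
    # Keep "$" unescaped so the parameter names stay readable.
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


# "abcd-1234" is a placeholder identifier, not a real dataset.
url = soda_url("opendata.socrata.com", "abcd-1234", limit=3)
filtered = soda_url(
    "opendata.socrata.com", "abcd-1234", limit=3, where="year > 2015"
)
print(url)
print(filtered)
```

The resulting URL can be fetched with any HTTP client once a real dataset identifier is substituted.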

8. Kaggle Datasets

Kaggle is a household name by now among data professionals. Kaggle hosts massive open source…

Head of Analytics - Paytm