10 awesome ML datasets that deserve your attention!

By Sourabh Gupta. Published in ml-concepts.com, Aug 10, 2022 (4 min read).

Note: this article was originally published at our main site https://ml-concepts.com

In today’s day and age, data is everything. Datasets have been popping up left and right on big platforms like Kaggle and Google Dataset Search. We’ve never had so much data in our entire history, and with so much to choose from, a few good datasets end up underrated.

In this article, I’d like to bring such datasets to light and hopefully give them the attention they deserve!

1. Common Crawl Corpus

The Common Crawl corpus is an open archive of web pages collected by crawling the internet since 2008. You heard that right: more than a decade of web crawls in one place. It is not literally every page ever published, but it covers a huge slice of the public web. Go check it out for yourself on their website.

This dataset is absolutely massive, probably one of the biggest datasets out there. And what’s more, it is completely free! Anyone with an internet connection can access the data, which is stored in AWS S3 buckets.

The corpus contains tens of billions of web pages and is updated monthly. Using Common Crawl, you can extract text on almost any topic you can think of. The fact that something like Common Crawl is not widely known is very frustrating.
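To give a feel for how you might pull a single site out of the corpus, here is a minimal sketch that builds a query for Common Crawl's URL index and works out which bytes of a WARC file hold one page. The crawl ID `CC-MAIN-2022-33` is only an example, the helper names are my own, and the sample record in the comment is made up, so treat this as a starting point rather than the official way to do it.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Build a URL-index query that asks for one JSON object per line."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_lines(body):
    """Parse the index response: one JSON record per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def warc_request(record):
    """Return the file URL and Range header that cover one page's record."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # HTTP byte ranges are inclusive
    url = f"{DATA_HOST}/{record['filename']}"
    return url, {"Range": f"bytes={start}-{end}"}


# Example (crawl ID is an assumption; pick a current one from the website):
# query = build_index_query("CC-MAIN-2022-33", "example.com/*")
# Made-up record, shaped like one line of the index response:
# warc_request({"offset": "1000", "length": "500", "filename": "some/file.warc.gz"})
```

The URL and header from `warc_request` can be handed to any HTTP client; the response is a gzip-compressed WARC record for that one page, so you never have to download a whole crawl.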

2. StockNote API

This one is for all the stock market enthusiasts out there. StockNote is a service provided by an Indian stock market broker called Samco. The service also provides an API that is free to use, unlike the APIs of most other brokers. All you need to use the API is a Demat account.

The API provides you with the prices for different types of financial instruments like stocks, futures, derivatives, commodities, and currencies for NSE (National Stock Exchange), BSE (Bombay Stock Exchange), MCX (Multi Commodity Exchange of India), and CDS (Currency Derivatives Segment on the National Stock Exchange).

The API is well documented and also provides an SDK on GitHub. You can get the prices on a per-minute basis for each instrument for the past 30 days, real-time price data during market hours, and daily price data using the API. I myself have worked with this for a few projects and rate it a 10/10.

3. YFinance (Yahoo Finance)

YFinance is another stock market-related data source. It is an open-source Python library that pulls its data from Yahoo Finance; it is not an official Yahoo product.

Unlike StockNote, YFinance isn’t restricted to the Indian stock market. It instead provides data related to instruments from all types of financial markets. You can check stocks, derivatives, forex, and even cryptocurrencies.

From my experience, the API is not as well documented as the StockNote API, but it does provide useful data. It is available as a Python module on PyPI (https://pypi.org/project/yfinance/).
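Here is a minimal sketch of what working with the module can look like: fetch daily closing prices for one ticker and turn them into day-over-day returns. The function names are my own and the ticker in the comment is just an example. `fetch_closes` needs the third-party `yfinance` package and a network connection, while `daily_returns` is plain Python.

```python
def fetch_closes(symbol, period="1mo"):
    """Download daily closing prices for one ticker via yfinance."""
    import yfinance as yf  # third-party: pip install yfinance

    history = yf.Ticker(symbol).history(period=period, interval="1d")
    return history["Close"].tolist()


def daily_returns(closes):
    """Fractional change between each pair of consecutive closing prices."""
    return [(today - prev) / prev for prev, today in zip(closes, closes[1:])]


# Example usage (needs network access; "AAPL" is only an example symbol):
# closes = fetch_closes("AAPL")
# print(daily_returns(closes)[-5:])
```

Keeping the download and the calculation in separate functions means you can test the maths on a hand-written list of prices without hitting the network.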

4. COVID-19 in India

As the name suggests, this dataset records the daily Covid-19 cases, deaths, and recoveries in states and Union Territories of India.

The dataset is excellent for beginners as it would help them practice data cleaning, data visualization, and inferential statistics. Check out the dataset for yourself.

I myself wrote a Kaggle notebook on this. You can check it out here.
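As a taste of the data cleaning involved, here is a small sketch that skips rows with a missing count and converts running totals into new cases per day. I am assuming the confirmed counts are cumulative. The column names (`Date`, `State`, `Confirmed`) and the sample rows are made up for illustration, so adjust them to match the CSV you download.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; they are not taken from the real file.
SAMPLE_CSV = """Date,State,Confirmed
2020-03-01,Kerala,3
2020-03-02,Kerala,3
2020-03-03,Kerala,
2020-03-04,Kerala,6
"""


def daily_new_cases(csv_text):
    """Turn cumulative confirmed counts into new cases per day, per state."""
    by_state = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row["Confirmed"].strip()
        if not raw:  # cleaning step: skip rows with a missing count
            continue
        by_state[row["State"]].append((row["Date"], int(raw)))

    result = {}
    for state, rows in by_state.items():
        rows.sort()  # ISO-formatted dates sort chronologically as strings
        result[state] = [
            (date, count - prev)
            for (_, prev), (date, count) in zip(rows, rows[1:])
        ]
    return result


print(daily_new_cases(SAMPLE_CSV))
```

Note that the difference after a skipped row covers more than one day, which is exactly the kind of quirk this dataset lets you practice handling.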

5. Weather in Szeged City

This dataset has weather-related data recorded on an hourly basis and can be very helpful for regression practice.

Plenty of people at different skill levels have worked on this dataset, so you will find approaches ranging from beginner to advanced.

Check the dataset out for yourself.
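If you want to see what the simplest regression on this kind of data looks like, here is a sketch of a one-variable least-squares fit written in plain Python. The idea of predicting temperature from humidity and the readings in the comment are hypothetical; swap in whichever columns you pick from the real file.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted line to a new input value."""
    return slope * x + intercept


# Made-up readings: humidity (0 to 1) against temperature in degrees Celsius.
# humidity = [0.9, 0.8, 0.6, 0.4]
# temperature = [8.0, 11.0, 17.0, 23.0]
# slope, intercept = fit_line(humidity, temperature)
# print(predict(0.5, slope, intercept))
```

Once this makes sense, libraries like scikit-learn do the same thing for many input columns at once.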

6. Face Mask Detection

With the rise of Covid-19, the importance of face masks skyrocketed. Wearing masks incorrectly became a significant health hazard, and something needed to be done to prevent this. Wouldn’t it be convenient if computers could tell whether someone was wearing their mask correctly? And thus was born this dataset…or at least I think that’s how it was born.

The dataset has 853 images of people wearing masks. Each of these images belongs to one of 3 classes:

  • With mask
  • Without mask
  • Mask is worn incorrectly

This dataset would be good practice for all the CV enthusiasts out there. Check the dataset out for yourself.
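A typical first step with a dataset like this is checking how balanced the classes are. The sketch below assumes the labels come as Pascal VOC-style XML files with one `object` entry per face. That layout, the label names, and the sample annotation are my assumptions for illustration, so check them against the files you actually download.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up annotation in an assumed Pascal VOC-style layout.
SAMPLE_ANNOTATION = """<annotation>
  <filename>example.png</filename>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>50</xmax><ymax>60</ymax></bndbox>
  </object>
  <object>
    <name>without_mask</name>
    <bndbox><xmin>70</xmin><ymin>25</ymin><xmax>110</xmax><ymax>65</ymax></bndbox>
  </object>
  <object>
    <name>with_mask</name>
    <bndbox><xmin>130</xmin><ymin>30</ymin><xmax>170</xmax><ymax>70</ymax></bndbox>
  </object>
</annotation>"""


def read_faces(xml_text):
    """Return (label, (xmin, ymin, xmax, ymax)) for every annotated face."""
    root = ET.fromstring(xml_text)
    faces = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        faces.append((obj.findtext("name"), coords))
    return faces


def count_labels(xml_text):
    """Count how many faces carry each class label."""
    return Counter(label for label, _ in read_faces(xml_text))


print(count_labels(SAMPLE_ANNOTATION))
```

Run `count_labels` over every annotation file and add the counters together to see whether one class dominates before you start training.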

If you have also come across any such not-so-popular dataset that you think people should know of, please let everyone know in the comments section.

Please continue reading “10 awesome ML datasets that deserve your attention!” on our main site, ML-Concepts.
