Best sources for Big Data

Yaswanth Kumar Gothireddy
3 min read · Dec 21, 2016

--

Data is exploding in every way. The amount of data accumulated in the last two years is more than the data accumulated in all previous years combined. For example, Facebook users send on average 31.25 million messages and view 2.77 million videos every minute, and up to 300 hours of video are uploaded to YouTube alone every minute. By 2020, we will have over 6.1 billion smartphone users globally. This means data will explode from 4.4 zettabytes today to around 44 zettabytes. The surprising fact is that only 0.5 percent of all data is actually analysed or used.
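To put these figures in perspective, here is a small arithmetic sketch in Python. It only rescales the numbers quoted above; the variable names are my own, and it assumes decimal units (1 zettabyte = 1000 exabytes).

```python
from fractions import Fraction

MINUTES_PER_DAY = 24 * 60

# Per-minute figures quoted in the paragraph above.
messages_per_minute = 31_250_000
video_views_per_minute = 2_770_000
upload_hours_per_minute = 300

# Scale each per-minute figure up to a full day.
messages_per_day = messages_per_minute * MINUTES_PER_DAY
video_views_per_day = video_views_per_minute * MINUTES_PER_DAY
upload_hours_per_day = upload_hours_per_minute * MINUTES_PER_DAY

# Fractions keep the decimal figures exact (no floating-point rounding).
zettabytes_today = Fraction("4.4")
zettabytes_2020 = Fraction("44")
growth_factor = zettabytes_2020 / zettabytes_today

# 0.5 percent of today's data, converted to exabytes (assumes 1 ZB = 1000 EB).
analysed_share = Fraction("0.005")
analysed_exabytes = zettabytes_today * analysed_share * 1000

print(f"Messages per day:        {messages_per_day:,}")
print(f"Video views per day:     {video_views_per_day:,}")
print(f"Upload hours per day:    {upload_hours_per_day:,}")
print(f"Growth factor by 2020:   {float(growth_factor):.0f}x")
print(f"Data analysed today:     {float(analysed_exabytes):.0f} exabytes")
```

In other words, the projected growth is tenfold, and the share of data that actually gets analysed is tiny compared to what is stored.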

There are several sites where you can find huge datasets. Here, I am listing the best of them. Most of them let you experiment with their data for free, but a few, such as Quandl, are paid.

1. KDnuggets Datasets

KDnuggets is a leading site on Business Analytics, Big Data, Data Mining, and Data Science. Knowledge Discovery Nuggets was created by Gregory I. Piatetsky-Shapiro, with the mission of covering the field with short, concise “nuggets”. The resource started as a directory of the main areas of data mining and data science, including Software, Jobs, Academic positions, CFP (calls for papers), Companies, Courses, Datasets, Education, Meetings, Publications, and Webcasts. Here, you can find a large number of data repositories.

2. UCI Machine Learning Repository

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an FTP archive in 1987 by David Aha and fellow graduate students at UC Irvine. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited “papers” in all of computer science.
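Many of the classic datasets in the repository are plain comma-separated text files without a header row. As a rough sketch of how you might read one in Python, the example below parses a few illustrative rows in the layout of the Iris dataset. The sample rows are embedded in the code, and the column names and the load_rows helper are my own assumptions, not something the repository provides.

```python
import csv
import io
from collections import Counter

# Illustrative rows in the layout of a headerless, comma-separated data file.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Assumed column names: the file itself carries no header.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "label"]


def load_rows(text):
    """Parse headerless CSV text into a list of dicts (hypothetical helper)."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        *features, label = record
        row = dict(zip(COLUMNS[:-1], map(float, features)))
        row["label"] = label
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)
label_counts = Counter(row["label"] for row in rows)

print(f"Loaded {len(rows)} rows")
for label, count in label_counts.items():
    print(f"{label}: {count}")
```

For a real file, you would read the text from disk instead of the embedded sample and adjust the column names to match the dataset's documentation.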

3. AWS Public Datasets

AWS hosts a variety of public datasets that anyone can access for free. Previously, large datasets such as the mapping of the Human Genome required hours or days to locate, download, customize, and analyze. Now, anyone can access these datasets via the AWS centralized data repository and analyze them using Amazon EC2 instances or Amazon EMR (Hosted Hadoop) clusters. By hosting this important data where it can be quickly and easily processed with elastic computing resources, AWS hopes to enable more innovation, more quickly.

4. Kaggle Datasets

Kaggle is one of the best places to find and seamlessly integrate open data. You can search and explore its 460 datasets, and anyone can publish their own dataset. Its rich UI helps you find the best and newest datasets. Besides open datasets, you can participate in Kaggle competitions, which are hosted by several companies, or host your own competition.

5. Stanford Large Network Dataset Collection

The SNAP library has been actively developed since 2004 and is growing organically as a result of research into the analysis of large social and information networks. The largest network SNAP researchers have analyzed so far using the library is the Microsoft Instant Messenger network from 2006, with 240 million nodes and 1.3 billion edges.
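Network datasets like these are commonly shared as plain-text edge lists: one edge per line, with comment lines starting with "#". Here is a minimal Python sketch, assuming that layout, that counts nodes and edges. The toy graph and the read_edge_list helper are hypothetical and only for illustration; this does not use the SNAP library itself.

```python
from collections import Counter

# Hypothetical toy graph in edge-list form, not a real dataset.
SAMPLE_EDGE_LIST = """# Toy undirected graph
# FromNodeId\tToNodeId
0\t1
0\t2
1\t2
2\t3
"""


def read_edge_list(text):
    """Return (nodes, edges) parsed from whitespace-separated edge-list text."""
    nodes = set()
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        src, dst = line.split()[:2]
        edge = (int(src), int(dst))
        edges.append(edge)
        nodes.update(edge)
    return nodes, edges


nodes, edges = read_edge_list(SAMPLE_EDGE_LIST)

# Degree of each node, treating every edge as undirected.
degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

print(f"{len(nodes)} nodes, {len(edges)} edges")
print(f"Highest-degree node: {degree.most_common(1)[0][0]}")
```

A graph with hundreds of millions of nodes will not fit comfortably in Python sets and lists like this, which is where a dedicated library becomes necessary.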

6. Google Public Data Directory

The Google Public Data Directory consists of curated datasets from Eurostat, the World Bank, the UK Open Data portal, the US Census Bureau, Statistics Iceland and Ireland's Central Statistics Office.

7. Quandl

Quandl delivers financial, market, economic, alternative and derived data. It delivers data via its API, or directly into Python, R, Excel and many other tools. Quandl hosts 25 million datasets and serves 10 million downloads daily.

8. GitHub Awesome Datasets

Here, datasets are collected and tidied from blogs, answers, and user responses. They are categorized into Economics, Earth Science, Energy, Finance, Transportation, etc.

Please comment with any huge data repository that is missing here, so that it can help others too. If you would like to know about the data analytics software available in the market, go to CybrHome. Thank you.
