Rare text-classification open datasets

DataTurks: Data Annotations Made Super Easy

8 min read · Sep 11, 2018

Hi! I'm back with some interesting and challenging datasets for training your models. A quick question: how often do you check the ratings or reviews of places or buses? Maybe regularly, or at least once a day. Have you noticed that after you look up a restaurant on Google, your news feed starts showing content related to what you searched for in the last few days?

Yes! This is all because of the algorithms running in these systems. These algorithms need to be trained on many different and challenging datasets to perform at their best.

An important step in machine learning is creating or finding suitable data for training and testing an algorithm. Working with a good dataset helps you avoid or notice errors in your algorithm and improve the results of your application.

MovieLens Dataset:

MovieLens datasets come in several sizes, each collected over a different period of time. This one is the stable benchmark dataset: 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. It also includes tag genome data with 12 million relevance scores across 1,100 tags.

Format: text
Default task: Text classification, Regression, clustering.
Created by: GroupLens Research in 2016
link to download the dataset
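To get a feel for how sparse these ratings are, here is a small back-of-the-envelope sketch. It uses only the rounded counts quoted above (so the results are approximate), and the variable names are my own.

```python
# Rounded counts quoted in the MovieLens description above.
NUM_RATINGS = 20_000_000
NUM_MOVIES = 27_000
NUM_USERS = 138_000

# Each user could in principle rate every movie, so the full
# user-by-movie matrix has NUM_USERS * NUM_MOVIES cells.
possible_ratings = NUM_USERS * NUM_MOVIES

# Fraction of those cells that actually hold a rating.
density = NUM_RATINGS / possible_ratings

# Average activity per user and per movie.
ratings_per_user = NUM_RATINGS / NUM_USERS
ratings_per_movie = NUM_RATINGS / NUM_MOVIES

print(f"matrix density:    {density:.4%}")
print(f"ratings per user:  {ratings_per_user:.1f}")
print(f"ratings per movie: {ratings_per_movie:.1f}")
```

With these numbers, only about half a percent of all possible user and movie pairs have a rating, which is why models for this kind of data usually have to cope with very sparse inputs.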

OpinRank Review Dataset:

This dataset contains full reviews for cars and hotels, collected from Edmunds (~42,230 car reviews) and TripAdvisor (~259,000 hotel reviews).

Car Reviews:

Full reviews of cars for model-years 2007, 2008, and 2009
There are about 140–250 cars for each model year
Extracted fields include dates, author names, favorites and the full textual review
Total number of reviews: ~42,230
Year 2007: 18,903 reviews
Year 2008: 15,438 reviews
Year 2009: 7,947 reviews

Hotel Reviews:

Full reviews of hotels in 10 different cities (Dubai, Beijing, London, New York City, New Delhi, San Francisco, Shanghai, Montreal, Las Vegas, Chicago)
There are about 80–700 hotels in each city
Extracted fields include date, review title and the full review
Total number of reviews: ~259,000

Format: text
Default task: classification, Sentiment analysis, clustering.
Created by: K. Ganesan et al. in 2011.
link to download the dataset.

Cyber-Trolls Dataset:

A dataset for classifying tweets as aggressive or not, to help fight trolls. It has 20001 items, all of which have been manually labeled by humans. There are 2 categories: 1 (Cyber-Aggressive) and 0 (Non-Cyber-Aggressive).

Format: Text
Default Task: Text classification
Created by: DataTurks
Link to download the dataset.
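As a rough idea of what a baseline for a two-class dataset like this could look like, here is a minimal sketch of a bag-of-words Naive Bayes classifier with add-one smoothing. The four training sentences are made up for illustration and are not taken from the dataset, and the function names are my own. On real data you would load the labeled tweets instead and evaluate on a held-out split.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()

def train_naive_bayes(examples):
    """examples: list of (text, label) pairs, label 1 = aggressive, 0 = not."""
    doc_counts = Counter()
    word_counts = {0: Counter(), 1: Counter()}
    vocabulary = set()
    for text, label in examples:
        doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return doc_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability score."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior: share of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy examples, NOT rows from the Cyber-Trolls dataset.
toy_examples = [
    ("you are an idiot", 1),
    ("shut up you loser", 1),
    ("have a great day", 0),
    ("thanks for the help", 0),
]

model = train_naive_bayes(toy_examples)
print(predict(model, "you loser"))         # scored as class 1 on this toy data
print(predict(model, "have a great day"))  # scored as class 0 on this toy data
```

A model this simple is only a starting point, but it gives you a number to beat before trying anything heavier.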

Chat Messages By Category Dataset:

A text classification dataset with 8 classes such as Alcohol & Drugs, Profanity & Obscenity, Sex, and Religion. The dataset has 20001 items, of which 68 have been manually labeled.

Format: Text
Default Task: Text classification
Created by: DataTurks
link to download Dataset

Spambase Dataset:

The Spambase dataset includes 4601 observations corresponding to email messages, 1813 of which are spam. From the original email messages, 58 different attributes were computed. The non-spam messages came from the creators' own filed work and personal email, so some attributes act as indicators of non-spam that are specific to that mailbox. To build a general-purpose spam filter, one would either have to blind such non-spam indicators or gather a very wide collection of non-spam.

Format: Text
Default task: Spam detection, classification
Created by: Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt at Hewlett-Packard Labs.
link to download the dataset.
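Before training anything on Spambase, it helps to know how balanced the classes are. The sketch below works only from the two counts quoted above; the function and key names are my own.

```python
# Counts quoted in the Spambase description above.
TOTAL_MESSAGES = 4601
SPAM_MESSAGES = 1813

def class_balance(total, positives):
    """Return class fractions and the accuracy of always guessing the majority class."""
    negatives = total - positives
    return {
        "spam_fraction": positives / total,
        "non_spam_fraction": negatives / total,
        "majority_baseline_accuracy": max(positives, negatives) / total,
    }

balance = class_balance(TOTAL_MESSAGES, SPAM_MESSAGES)
for name, value in balance.items():
    print(f"{name}: {value:.3f}")
```

Roughly 39% of the messages are spam, so a classifier that always answers "not spam" already reaches about 61% accuracy. Any model you train should be judged against that baseline rather than against 50%.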

Sentiment140 Dataset:

Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter. Use cases include brand management (e.g. Windows 10), polling (e.g. Obama), and planning a purchase (e.g. Kindle).

Format: Text
Default Task: Sentiment analysis
Created by: Alec Go, Richa Bhayani, and Lei Huang, who were Computer Science graduate students at Stanford University.
link to download the dataset.

News Classification Dataset:

A manually curated dataset of news descriptions and their classes from AGWeb.com. The descriptions are divided into 4 categories: SciTech, World, Business, and Sports. This dataset can be used as a golden set to evaluate news text classifiers, for example for showing tags on a news site.

Format: Text
Default Task: Text classification
Created by: DataTurks
link to download the dataset

Distress Classification Dataset:

This is a text classification dataset for classifying news headlines and articles by whether or not they express distress. The dataset has 1983 items, all of which have been manually labeled. The labels are distress and not-distress.

Format: Text
Default Task: Text classification
Created by: DataTurks
Link to download the dataset

Blog Authorship Dataset:

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. For each age group, there is an equal number of male and female bloggers.

Format: Text
Default Task: Sentiment analysis, summarization, classification
Created by: J. Schler et al. in 2006.
Link to download Dataset.
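Since each file name carries the blogger's metadata, you can recover labels without opening the files. The sketch below assumes the fields are dot-separated in the order listed above and followed by a file extension. That layout, the example file name, and the names `BloggerInfo` and `parse_blog_filename` are my assumptions, so check them against the files you download.

```python
from typing import NamedTuple

class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str

def parse_blog_filename(filename: str) -> BloggerInfo:
    """Split an ASSUMED '<id>.<gender>.<age>.<industry>.<sign>.<ext>' file name."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    parts = stem.split(".")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name layout: {filename!r}")
    blogger_id, gender, age, industry, sign = parts
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)

# Hypothetical file name, made up to match the assumed layout.
info = parse_blog_filename("12345.female.25.Student.Leo.xml")
print(info.gender, info.age, info.sign)
```

From here it is easy to build, for example, a gender or age-group classification task by pairing each post with the fields parsed from its file name.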

Musk Dataset:

This dataset describes a set of 102 molecules, of which 39 are judged by human experts to be musks and the remaining 63 are judged to be non-musks. The goal is to learn to predict whether new molecules will be musks or non-musks. Because bonds can rotate, a single molecule can adopt many different shapes (conformations), and each conformation is described by its own feature vector. This many-to-one relationship between feature vectors and molecules is called the “multiple instance problem”. When learning a classifier for this data, the classifier should classify a molecule as “musk” if ANY of its conformations is classified as a musk. A molecule should be classified as “non-musk” if NONE of its conformations is classified as a musk.

Format: Text
Default Task: Text Classification
Created by: Arris Pharmaceutical Corporation in 1994.
Link to download database
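The ANY/NONE rule above is easy to express in code. This is a minimal sketch that assumes you already have one musk or non-musk prediction per conformation. The molecule ids and the function name `classify_molecules` are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def classify_molecules(
    conformation_predictions: Iterable[Tuple[str, bool]]
) -> Dict[str, str]:
    """Combine per-conformation predictions into one label per molecule.

    A molecule is 'musk' if ANY of its conformations was predicted musk,
    and 'non-musk' if NONE of them was.
    """
    any_musk = defaultdict(bool)  # every molecule starts as False
    for molecule_id, predicted_musk in conformation_predictions:
        any_musk[molecule_id] = any_musk[molecule_id] or predicted_musk
    return {
        molecule_id: ("musk" if flag else "non-musk")
        for molecule_id, flag in any_musk.items()
    }

# Hypothetical predictions: (molecule id, conformation predicted as musk?)
predictions = [
    ("mol_a", False),
    ("mol_a", True),   # one positive conformation is enough
    ("mol_b", False),
    ("mol_b", False),
]

labels = classify_molecules(predictions)
print(labels)
```

Here `mol_a` ends up as a musk because one of its two conformations was flagged, while `mol_b` stays non-musk.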

Commentary Dataset:

Match commentary classified as humor, praise, stats, teasing, etc. The dataset has 1408 items, of which 1287 have been manually labeled. The labels fall into 23 categories such as injury, audience, feeling, communication, and teasing.

Format: Text
Default Task: Text classification
Created by: DataTurks
link to download the dataset

Emotion Classification Dataset:

This dataset consists of text labeled with different sentiments. It has 269 items, all of which have been manually labeled. The items are divided into 7 categories: happy, sad, excited, angry, scared, tender, and others.

Format: Text
Default Task: Text classification
Created by: DataTurks
link to download Dataset

NSDUH Dataset:

The National Survey on Drug Use and Health (NSDUH) series, formerly titled the National Household Survey on Drug Abuse, is a major source of statistical information on the use of illicit drugs, alcohol, and tobacco and on mental health issues among members of the U.S. population. There are 55,268 instances in the dataset.

Format: Text
Default task: Text classification, regression
Created by: United States Department of Health and Human Services in 2012
link to download the dataset

Zoo Dataset:

A simple database containing 17 Boolean-valued attributes. Animals are classed into 7 categories and features are given for each. Here is a breakdown of which animals are in which type:

Set of animals:

1 — aardvark, antelope, bear, boar, buffalo, calf, cavy, cheetah, deer, dolphin, elephant, fruitbat, giraffe, girl, goat, gorilla, hamster, hare, leopard, lion, lynx, mink, mole, mongoose, opossum, oryx, platypus, polecat, pony, porpoise, puma, pussycat, raccoon, reindeer, seal, sealion, squirrel, vampire, vole, wallaby, wolf
2 — chicken, crow, dove, duck, flamingo, gull, hawk, kiwi, lark, ostrich, parakeet, penguin, pheasant, rhea, skimmer, skua, sparrow, swan, vulture, wren
3 — pitviper, seasnake, slowworm, tortoise, tuatara
4 — bass, carp, catfish, chub, dogfish, haddock, herring, pike, piranha, seahorse, sole, stingray, tuna
5 — frog, frog, newt, toad
6 — flea, gnat, honeybee, housefly, ladybird, moth, termite, wasp
7 — clam, crab, crayfish, lobster, octopus, scorpion, seawasp, slug, starfish, worm

Format: Text
Default task: Text classification
Created by: R. Forsyth in 1990
Link to download the dataset

URL Dataset:

This dataset was built to support a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, the authors explored techniques that classify URLs based on their lexical and host-based features, as well as online learning to process large numbers of examples and adapt quickly to evolving URLs over time.

120 days of URL data from a large conference were collected, and many features were extracted for each URL.

Format: Text
Default task: Text classification
Created by: J. Ma in 2015.
Link to the dataset.
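To make "lexical features" concrete, here is a small sketch that computes a few of them from a raw URL string using only the Python standard library. These particular features are my own illustrative picks, not the feature set shipped with the dataset, and the example URL is made up.

```python
import re
from urllib.parse import urlparse

def lexical_features(url: str) -> dict:
    """Compute a few simple features from the URL text alone (no network access)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Split the whole URL on anything that is not a letter or digit.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "path_depth": len([p for p in parsed.path.split("/") if p]),
        "num_tokens": len(tokens),
        # A bare IPv4 address instead of a domain name is often treated as suspicious.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
    }

# Hypothetical URL used only to exercise the function.
features = lexical_features("http://login.example.com/account/verify?id=42")
for name, value in features.items():
    print(f"{name}: {value}")
```

Feature dictionaries like this can be fed to an online learner one URL at a time, which matches the idea of adapting to new URLs as they arrive.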

One last word: “practice is the main key to success”. Work with as many datasets as possible. Each dataset you work with will help sharpen your coding skills.

You can find thousands of such open datasets here

I hope this post has given you better insight into these different datasets!

I would love to hear any suggestions or queries. Please write to me at nidhi.surapaneni@dataturks.com