Rare text-classification open datasets

Hi! I'm back with some interesting and challenging datasets for training your models. A quick question: how often do you check the ratings or reviews of places or buses? Maybe regularly, or at least once a day. Have you noticed that after you look up a restaurant on Google, your news feed starts showing content related to what you searched for over the last few days?

Yes! That is the work of the algorithms running inside these systems. These algorithms need to be trained on many different and challenging datasets to perform at their best.

An important step in machine learning is creating or finding suitable data for training and testing an algorithm. Working with a good dataset helps you avoid or notice errors in your algorithm and improve the results of your application.

MovieLens Dataset:

A stable benchmark dataset. The MovieLens data sets were collected over various periods of time, depending on the size of the set. This one contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. It also includes tag genome data with 12 million relevance scores across 1,100 tags.

  • Format: text
  • Default task: Text classification, Regression, clustering.
  • Created by: GroupLens Research in 2016
  • link to download the dataset

OPIN-RANK REVIEW Dataset:

This dataset contains full reviews for cars and hotels collected from TripAdvisor (~259,000 reviews) and Edmunds (~42,230 reviews).

Car Reviews:

  • Full reviews of cars for model-years 2007, 2008, and 2009
  • There are about 140–250 cars for each model year
  • Extracted fields include dates, author names, favorites and the full textual review
  • Total number of reviews: ~42,230
  • Year 2007: 18,903 reviews
  • Year 2008: 15,438 reviews
  • Year 2009: 7,947 reviews

Hotel Reviews:

  • Full reviews of hotels in 10 different cities (Dubai, Beijing, London, New York City, New Delhi, San Francisco, Shanghai, Montreal, Las Vegas, Chicago)
  • There are about 80–700 hotels in each city
  • Extracted fields include date, review title and the full review
  • Total number of reviews: ~259,000

  • Format: text
  • Default task: classification, sentiment analysis, clustering.
  • Created by: K. Ganesan et al. in 2011.
  • link to download the dataset.

Cyber-Trolls Dataset:

A dataset used to classify tweets as aggressive or not, to help fight trolls. It has 20,001 items, all of which have been manually labeled by humans. There are 2 categories: 1 (Cyber-Aggressive) and 0 (Non-Cyber-Aggressive).

  • Format: Text
  • Default Task: Text classification
  • Created by: Data Turks
  • Link to download the dataset.

Chat Messages By Category Dataset:

A text classification dataset with 8 classes such as Alcohol & Drugs, Profanity & Obscenity, Sex, and Religion. The dataset has 20,001 items, of which 68 have been manually labeled.

  • Format: Text
  • Default Task: Text classification
  • Created by: Data Turks
  • link to download Dataset

SPAMBASE Dataset:

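The Spambase dataset includes 4,601 observations corresponding to email messages, 1,813 of which are spam. From the original email messages, 58 different attributes were computed. The non-spam messages came from the creators' own work and personal email, so some attributes (such as the word “george” or the area code “650”) are strong non-spam indicators specific to that collection. To build a general-purpose spam filter, one would either have to blind such non-spam indicators or gather a much wider collection of non-spam.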

  • Format: Text
  • Default task: Spam detection, classification
  • Created by: Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt at Hewlett-Packard Labs.
  • link to download the dataset.
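
As a quick sanity check on the numbers above, the class balance of Spambase can be worked out directly. This is only a minimal sketch that uses the two counts quoted in this post; the variable names are illustrative.

```python
# Counts quoted in the Spambase description above.
total_messages = 4601
spam_messages = 1813

non_spam_messages = total_messages - spam_messages
spam_fraction = spam_messages / total_messages

# A model that always predicts the majority class sets the
# accuracy floor that any real spam classifier should beat.
majority_baseline = max(spam_messages, non_spam_messages) / total_messages

print(f"non-spam messages: {non_spam_messages}")
print(f"spam fraction: {spam_fraction:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because the classes are not evenly split, plain accuracy can look better than it is, so it is worth comparing any model against this baseline.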

Sentiment140 Dataset:

Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter. Use cases include brand management (e.g. Windows 10), polling (e.g. Obama), and planning a purchase (e.g. Kindle).

  • Format: Text
  • Default Task: Sentiment analysis
  • Created by: Alec Go, Richa Bhayani, and Lei Huang, who were Computer Science graduate students at Stanford University.
  • link to download the dataset.

News Classification Dataset:

A manually curated dataset of news descriptions and their classes from AGWeb.com. The descriptions are divided into 4 categories: SciTech, World, Business, and Sports. This dataset can be used as a golden set to evaluate news text classifiers, for example for showing tags on a news site.

  • Format: Text
  • Default Task: Text classification
  • Created by: Data Turks
  • link to download the dataset

Distress Classification Dataset:

This is a text classification dataset for classifying news headlines and articles by whether they express distress or not. The dataset has 1,983 items, all of which have been manually labeled. The labels are distress and not-distress.

  • Format: Text
  • Default Task: Text classification
  • Created by: Data Turks
  • Link to download the dataset

Blog Authorship Dataset:

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. Each blog is stored as a separate file whose name indicates the blogger's ID and the blogger's self-provided gender, age, industry, and astrological sign. For each age group, there is an equal number of male and female bloggers.

  • Format: Text
  • Default Task: Sentiment analysis, summarization, classification
  • Created by: J. Schler et al in 2006.
  • Link to download Dataset.
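
Since each blog's metadata lives in its file name, a small parser is usually the first step when working with this corpus. The sketch below assumes dot-separated names of the form `id.gender.age.industry.sign.xml`; that layout, the example file name, and the `parse_blog_filename` helper are assumptions made for illustration, so check them against the files you actually download.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(filename: str) -> BloggerInfo:
    """Split an assumed 'id.gender.age.industry.sign.xml' name into fields."""
    parts = filename.split(".")
    if len(parts) != 6 or parts[-1] != "xml":
        raise ValueError(f"unexpected file name layout: {filename!r}")
    blogger_id, gender, age, industry, sign, _extension = parts
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Hypothetical file name following the assumed layout.
info = parse_blog_filename("1000331.female.37.indUnk.Leo.xml")
print(info)
```

With the fields parsed, the same corpus can be relabeled for different tasks, such as predicting gender or age group from the post text.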

Musk Dataset:

This dataset describes a set of 102 molecules, of which 39 are judged by human experts to be musks and the remaining 63 are judged to be non-musks. The goal is to learn to predict whether new molecules will be musks or non-musks. Because bonds can rotate, a single molecule can adopt many different shapes (conformations), and each conformation is described by its own feature vector. This many-to-one relationship between feature vectors and molecules is called the “multiple instance problem”. When learning a classifier for this data, the classifier should classify a molecule as “musk” if ANY of its conformations is classified as a musk. A molecule should be classified as “non-musk” if NONE of its conformations is classified as a musk.

  • Format: Text
  • Default Task: Text Classification
  • Created by: Arris Pharmaceutical Corporation in 1994.
  • Link to download database
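
The ANY/NONE rule described above can be sketched in a few lines. This assumes some instance-level classifier already exists; the `predict_conformation` stand-in, its threshold, and the toy feature vectors are hypothetical placeholders and are not taken from the dataset.

```python
from typing import Callable, Dict, List, Sequence

Conformation = Sequence[float]


def classify_molecule(
    conformations: List[Conformation],
    predict_conformation: Callable[[Conformation], bool],
) -> str:
    """A molecule is a musk if ANY conformation is predicted to be a musk."""
    if any(predict_conformation(c) for c in conformations):
        return "musk"
    return "non-musk"


def predict_conformation(features: Conformation) -> bool:
    # Hypothetical stand-in for a trained instance-level model.
    return sum(features) > 1.0


# Toy data: each molecule (bag) holds several conformations (instances).
molecules: Dict[str, List[Conformation]] = {
    "mol_a": [[0.1, 0.2], [0.9, 0.4]],
    "mol_b": [[0.1, 0.2], [0.3, 0.3]],
}

labels = {
    name: classify_molecule(confs, predict_conformation)
    for name, confs in molecules.items()
}
print(labels)
```

The key design point is that labels exist only at the molecule level, so the per-conformation predictions are combined with `any` rather than averaged.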

Commentary Dataset:

Comments from matches, classified as humor, praise, stats, teasing, etc. The dataset has 1,408 items, of which 1,287 have been manually labeled. The labels fall into 23 categories such as injury, audience, feeling, communication, and teasing.

  • Format: Text
  • Default Task: Text classification
  • Created by: Data Turks
  • link to download the dataset

Emotion Classification Dataset:

The dataset consists of text labeled with different emotions. It has 269 items, all of which have been manually labeled. They are divided into 7 categories: happy, sad, excited, angry, scared, tender, and others.

  • Format: Text
  • Default Task: Text classification
  • Created by: Data Turks
  • link to download Dataset

NSDUH Dataset:

The National Survey on Drug Use and Health (NSDUH) series, formerly titled the National Household Survey on Drug Abuse, is a major source of statistical information on the use of illicit drugs, alcohol, and tobacco and on mental health issues among members of the U.S. population. There are 55,268 instances in the dataset.

Zoo Dataset:

A simple database containing 17 Boolean-valued attributes. Animals are grouped into 7 classes, and features are given for each animal. Here is a breakdown of which animals belong to which class:

Set of animals:

1 — aardvark, antelope, bear, boar, buffalo, calf, cavy, cheetah, deer, dolphin, elephant, fruitbat, giraffe, girl, goat, gorilla, hamster, hare, leopard, lion, lynx, mink, mole, mongoose, opossum, oryx, platypus, polecat, pony, porpoise, puma, pussycat, raccoon, reindeer, seal, sealion, squirrel, vampire, vole, wallaby, wolf
2 — chicken, crow, dove, duck, flamingo, gull, hawk, kiwi, lark, ostrich, parakeet, penguin, pheasant, rhea, skimmer, skua, sparrow, swan, vulture, wren
3 — pitviper, seasnake, slowworm, tortoise, tuatara
4 — bass, carp, catfish, chub, dogfish, haddock, herring, pike, piranha, seahorse, sole, stingray, tuna
5 — frog, frog, newt, toad
6 — flea, gnat, honeybee, housefly, ladybird, moth, termite, wasp
7 — clam, crab, crayfish, lobster, octopus, scorpion, seawasp, slug, starfish, worm

  • Format: Text
  • Default task: Text classification
  • Created by: R. Forsyth in 1990
  • Link to download the dataset

URL Dataset:

The goal of this dataset is to support building a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, its authors explored techniques that classify URLs based on their lexical and host-based features, as well as online learning to process large numbers of examples and adapt quickly as URLs evolve over time.

The data covers 120 days of URLs taken from a large conference, with many features recorded for each URL.

  • Format: Text
  • Default task: Text classification
  • Created by: J. Ma in 2015.
  • Link to the dataset.
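
To make “lexical features” concrete, here is a minimal sketch that derives a few of them from the URL string alone. The particular features, the `lexical_features` helper, and the example URL are illustrative assumptions; they are not the feature set shipped with the dataset.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few simple string-based features of a URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path_segments = [segment for segment in parsed.path.split("/") if segment]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_path_segments": len(path_segments),
        "has_at_symbol": "@" in url,
    }


# Hypothetical URL used only to show the output shape.
features = lexical_features(
    "http://login.example.com/secure/update/account123.html"
)
for name, value in features.items():
    print(f"{name}: {value}")
```

Features like these need no network lookups, which is what makes them cheap enough for a real-time system; host-based features (for example registration or hosting details) would have to come from external sources.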

One last word: “practice is the main key to success”. Get your hands on as many datasets as possible. Every dataset you work with will help sharpen your coding skills.

You can find thousands of such open datasets here

I hope this blog has given you better insight into different datasets!

I would love to hear any suggestions or queries. Please write to me at nidhi.surapaneni@dataturks.com

DataTurks: Data Annotations Made Super Easy

Data Annotation Platform. Image Bounding, Document Annotation, NLP and Text Annotations. #HumanInTheLoop #AI, #TrainingData for #MachineLearning.