How to Label Unlabeled Tweets

Unsupervised Learning

Huda
Geek Culture

While learning data science, we mostly work with well-labeled datasets to build our models on. In real-world scenarios, however, well-labeled datasets are rare, and many data science problems involve unlabeled, unstructured data. A primary way to analyze unlabeled data is clustering. The more important question is how to cluster.

Most clustering algorithms operate on numerical values, because they measure the similarity of each data point to the others by computing a mathematical distance between them (Euclidean, Manhattan, Minkowski, etc.). Hence, we can cluster textual data points in one of two ways: by converting them into a numerical dissimilarity matrix, or by vectorizing them as word embeddings, which capture their semantics, and clustering those vectors.
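To make the first option concrete, here is a minimal sketch of turning a few short texts into count vectors and then into a dissimilarity matrix. It is an illustration under stated assumptions, not the method used later in this article: the example texts are made up, the helper names (`bag_of_words`, `dissimilarity_matrix`) are hypothetical, and a plain word-count vector only captures word overlap, not meaning.

```python
from collections import Counter
from itertools import combinations


def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def manhattan(u, v):
    """Manhattan distance is the Minkowski distance with p = 1."""
    return minkowski(u, v, 1)


def euclidean(u, v):
    """Euclidean distance is the Minkowski distance with p = 2."""
    return minkowski(u, v, 2)


def bag_of_words(texts):
    """Map each text to a word-count vector over a shared, sorted vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def dissimilarity_matrix(vectors, distance=euclidean):
    """Symmetric matrix of pairwise distances, with zeros on the diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = distance(vectors[i], vectors[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical example texts, invented purely for illustration.
texts = [
    "i love this phone",
    "i love this camera",
    "worst service ever",
]

vocab, vectors = bag_of_words(texts)
matrix = dissimilarity_matrix(vectors, manhattan)

# The first two texts share three words, so they end up closer to each
# other than either is to the third text.
for row in matrix:
    print(row)
```

A matrix like this can be handed to any clustering method that accepts precomputed distances. Because it relies on shared words only, two texts with the same sentiment but different vocabulary would still look far apart, which is the motivation for the embedding-based approach.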

When it comes to sentiment analysis, it often makes sense to capture semantics so that words with similar meanings are clustered together. I'll therefore walk through a step-by-step process for carrying out sentiment analysis on an unlabeled dataset. A prime example of unlabeled data is tweets…

Huda

Data Scientist with recent experience in data acquisition and data modeling, statistical analysis, machine learning, deep learning and NLP