A Summary of Articles for Web Mining [IS688, Spring 2021]

Doug Rizio
May 20 · 3 min read

Here is a brief overview of each of the articles I wrote for my Web Mining class at NJIT, taught by Professor Cody Buntain.

An Exploratory Data Analysis (EDA) on the CDC’s COVID-19 Data in the Tri-State Area

In this first article, we explore a dataset of Coronavirus cases and deaths in New York, New Jersey, and Connecticut, obtained from the Centers for Disease Control and Prevention (CDC). We learn how to download data from websites using APIs, how to clean and organize data, and how to plot data in order to extract insights from it, including the differences in COVID-19 incidence across the three states and what those differences might mean for each of them.
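The cleaning-and-summarizing step can be sketched roughly as below. This is a minimal illustration, not the article's actual code: the column names (`submission_date`, `state`, `new_case`) follow the shape of the CDC's state-level time series but are assumptions here, and the rows are invented; the real data would be pulled from the CDC's API first.

```python
# Sketch of cleaning and summarizing CDC-style COVID-19 data with pandas.
# Column names and values are assumptions standing in for the real API download.
import pandas as pd

raw = pd.DataFrame({
    "submission_date": ["2021-01-01", "2021-01-01", "2021-01-01",
                        "2021-01-02", "2021-01-02", "2021-01-02"],
    "state": ["NY", "NJ", "CT", "NY", "NJ", "CT"],
    "new_case": ["15000", "5000", "2000", "14000", "4800", "2100"],
})

# Clean: parse dates, coerce counts to numbers, keep only the tri-state area
raw["submission_date"] = pd.to_datetime(raw["submission_date"])
raw["new_case"] = pd.to_numeric(raw["new_case"])
tri = raw[raw["state"].isin(["NY", "NJ", "CT"])]

# Summarize: total new cases per state (ready to plot as a bar chart)
totals = tri.groupby("state")["new_case"].sum()
print(totals.to_dict())  # {'CT': 4100, 'NJ': 9800, 'NY': 29000}
```

From a summary like `totals`, a single `totals.plot(kind="bar")` call would produce the kind of comparison chart the article builds on.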

What Twitter Friends Can Tell Us About the CDC — A Social Media Network Analysis

In the second article, we introduce social media’s impact on society, the part it played in the Coronavirus pandemic, and how the Centers for Disease Control and Prevention has navigated both. We learn about the nodes, edges, and alters of an egocentric network, how to get data on the friends and friends-of-friends of Twitter users, how to place those users onto a graph, and how to organize that graph in order to detect patterns among user connections.
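The ego-network idea can be sketched with networkx. The handles and friend lists below are made up for illustration; in the article they would come from the Twitter API.

```python
# Sketch of building an egocentric graph from friend lists with networkx.
# All handles and friend lists here are hypothetical placeholders.
import networkx as nx

ego = "CDCgov"                                  # the ego node
friends = ["alice", "bob", "carol"]             # ego's friends (alters)
friends_of_friends = {                          # each alter's own friends
    "alice": ["bob", "dave"],
    "bob": ["carol"],
    "carol": ["alice"],
}

G = nx.Graph()
G.add_edges_from((ego, f) for f in friends)     # ego-to-alter edges
for f, ff in friends_of_friends.items():
    G.add_edges_from((f, x) for x in ff)        # alter-to-others edges

# The ego network: the ego plus everyone within one hop, with their edges
ego_net = nx.ego_graph(G, ego)
print(sorted(ego_net.nodes()))
```

Note that `dave` is two hops from the ego, so `ego_graph` leaves him out while keeping the alter-to-alter edges that reveal connection patterns among the friends themselves.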

COVID Tweets — Finding Similar Twitter Users in the First Days of the Pandemic

In our third article, we start by talking about Twitter’s use as a public political forum during the earliest spread of the Coronavirus, and how we might find users with similar beliefs by analyzing tweets. We learn about collaborative filtering and cosine similarity, how to extract large volumes of old tweets using a hydrator program, how to sort tweets by user location, how to find the most frequently used terms, and how to vectorize text to rank different tweets by similarity.
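The vectorize-and-rank step can be sketched with scikit-learn's TF-IDF vectorizer and cosine similarity. The three tweets below are invented placeholders; real ones would come from the hydrated dataset.

```python
# Sketch of vectorizing tweet text and ranking tweets by cosine similarity.
# The tweets are placeholders, not real data from the article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "stay home and wash your hands",
    "wash your hands and stay safe",
    "the stock market dropped again today",
]

# Turn each tweet into a TF-IDF vector over the shared vocabulary
tfidf = TfidfVectorizer().fit_transform(tweets)

# Cosine similarity of tweet 0 against every tweet (including itself)
sims = cosine_similarity(tfidf[0], tfidf)[0]
print(sims.round(2))
```

Tweet 1 shares most of its vocabulary with tweet 0 and scores high, while tweet 2 shares none and scores zero, which is exactly the ranking signal used to group similar users.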

What Makes a Politician Popular on Social Media?

In article number four, we cover the rise of social media in politics and whether we can measure a politician’s popularity based on different Twitter statistics. We learn about K-means clustering and centroids, how to use the elbow method to determine k, how to download Twitter user information from an existing list of users, and how to visualize and cluster various metrics such as followers, friends, and favorites on a graph.
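The elbow method mentioned above can be sketched on synthetic two-dimensional points standing in for metrics like followers versus friends; the article applies the same loop to real Twitter statistics.

```python
# Sketch of the elbow method for choosing k in K-means clustering.
# The data is synthetic: three loose blobs standing in for user groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc, 0.5, size=(30, 2))
    for loc in ([0, 0], [5, 5], [0, 5])
])

# Fit K-means for a range of k and record the inertia (within-cluster
# sum of squared distances to each centroid) at each k
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

print([round(i, 1) for i in inertias])
```

Inertia always shrinks as k grows; the "elbow" is the k where the drop levels off, which for three blobs appears at k = 3. Plotting `inertias` against k makes the elbow visible.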

Building a Reddit Recommendation System

In article number five, we detail Reddit’s rise to prominence and ask how a user might find different subreddits with a new recommendation system. We re-introduce collaborative filtering and cosine similarity alongside item-based filtering, similarity matrices, and the K-nearest-neighbors algorithm. We learn how to expand limited data into a larger set with implicit ratings, how to identify the most popular subreddits, how to deal with sparsity in a matrix, and how to design a recommendation program.
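Item-based filtering can be sketched on a tiny user-subreddit matrix. The implicit "ratings" (think comment counts), users, and subreddit choices below are all made up; the real matrix in the article is far larger and much sparser.

```python
# Sketch of item-based collaborative filtering with cosine similarity.
# Users, subreddits, and interaction counts are hypothetical.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows: users, columns: subreddits, values: implicit interaction counts
ratings = pd.DataFrame(
    [[5, 3, 0, 1],
     [4, 0, 0, 1],
     [0, 2, 4, 5],
     [0, 0, 5, 4]],
    index=["u1", "u2", "u3", "u4"],
    columns=["r/python", "r/learnprogramming", "r/aww", "r/pics"],
)

# Item-item similarity matrix: cosine similarity between subreddit columns
item_sims = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns, columns=ratings.columns,
)

# The nearest neighbor of r/python (excluding itself) is the recommendation
neighbors = item_sims["r/python"].drop("r/python").sort_values(ascending=False)
print(neighbors.index[0])
```

Because the same users who interact with r/python also interact with r/learnprogramming, those two columns point in similar directions, so it comes out as the top neighbor.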

Can We Use Twitter to Track COVID-caused Unemployment in the USA?

In the final article, we detail the emergence and spread of the Coronavirus around the world, how it impacted people’s lives and work, how social media use rose in response to the pandemic, and whether we can use social media to understand the intersection of multiple phenomena. We learn how to download different datasets, how to clean and organize separate sets of data so that they fit together properly, how to visualize them independently of each other, and lastly how to visualize them together in order to extract meaningful insights from both.
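The fit-two-datasets-together step can be sketched with a pandas join. The weekly tweet counts and claims figures below are invented stand-ins; the article works with real Twitter and unemployment data.

```python
# Sketch of aligning two time series on a shared date column so they
# can be charted together. All values here are hypothetical.
import pandas as pd

tweets = pd.DataFrame({
    "week": ["2020-03-01", "2020-03-08", "2020-03-15"],
    "unemployment_tweets": [120, 480, 950],
})
claims = pd.DataFrame({
    "week": ["2020-03-08", "2020-03-15", "2020-03-22"],
    "initial_claims": [211000, 282000, 3307000],
})

# Clean: parse the date columns so both sets use the same key type
for df in (tweets, claims):
    df["week"] = pd.to_datetime(df["week"])

# Inner join keeps only the weeks both datasets cover
merged = tweets.merge(claims, on="week", how="inner")
print(len(merged))  # 2 overlapping weeks
```

With the series aligned in one frame, a dual-axis plot (e.g. `merged.plot(x="week", secondary_y="initial_claims")`) puts both trends on one chart for side-by-side inspection.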

Web Mining [IS688, Spring 2021]

Extracting Insights from Web-Scale Datasets

This publication covers posts from NJIT’s IS688 course and covers machine learning, data mining, text mining, and clustering to extract useful knowledge from the web and other unstructured/semi-structured, hypertextual, distributed information repositories.
