
“Nobody is Perfect.” This quote applies not just to us humans but also to the data that surrounds us. Any data science practitioner needs to understand the imperfections present in their data and handle them accordingly in order to get the desired results. One such imperfection is the inherent Class Imbalance, which is highly prevalent in most real-world datasets. In this blog we will cover different Sample Weighting schemes that can be applied to any Loss Function in order to cater to the Class Imbalance present in your data.

What is the Class Imbalance Problem?

The Class Imbalance problem plagues most Machine Learning/Deep Learning Classification problems. It occurs when one or more classes (the majority classes) occur far more frequently than the other classes (the minority classes). Simply put, there is a skew towards the majority classes. …
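To make the idea concrete, here is a minimal sketch (my illustration, not code from the post) of one common weighting scheme, inverse class-frequency, applied to a made-up 9:1 label distribution:

```python
import numpy as np

# Hypothetical 9:1 label distribution; class 0 is the majority class.
y = np.array([0] * 900 + [1] * 100)

# Inverse class-frequency weighting: each class gets a weight
# inversely proportional to how often it occurs.
classes, counts = np.unique(y, return_counts=True)
class_weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
# -> {0: ~0.56, 1: 5.0}: a minority-class mistake now costs ~9x more.

# Per-sample weights in the form most loss functions accept,
# e.g. the sample_weight argument of scikit-learn's fit()
# or Keras' model.fit().
sample_weights = np.array([class_weights[label] for label in y])
```

Because the weights are attached to samples rather than baked into the loss itself, the same scheme plugs into essentially any loss function that supports per-sample weighting.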



The world we live in is not a just world. It is infected by different kinds of bias, be it Gender Bias or Racial Bias. More recently, the world was shocked by the tragic news of George Floyd’s death due to extreme police brutality. This brought issues like Systemic Racism, unconscious bias, and the Racial and Gender gap into focus for many people, organizations, and nations. This blog talks about what we at GumGum can do to bring change by utilizing our Natural Language Processing technology to shed light on potential bias that websites may have in their content. …



It is extremely important to understand the different evaluation metrics and when to use them. Evaluating your model on inadequate metrics, and then judging it by the improvements achieved on those metrics, is a huge trap. Often, especially in industry, these metrics decide whether a newer model makes it to production. Therefore, as a Data Scientist, one should be aware of the pros and cons of different evaluation metrics in order to avoid falling into this trap.

Evaluating a Keyword Extraction model is not as straightforward as evaluating a model for a Classification problem. Keyword extraction is fundamentally a ranking task rather than a classification task: we expect relevant keywords or key-phrases (going forward I will use the two interchangeably) to be ranked higher than irrelevant ones. Evaluating such a system therefore means comparing two ranked lists of key-phrases. Traditional tasks such as classification just predict which class a sample belongs to and therefore do not consider any form of ranking during evaluation. Keyword extraction, on the other hand, requires Rank-Aware evaluation metrics. In future sections we will explore the shortcomings of traditional evaluation metrics such as F1, Precision and Recall and then look at the following Rank-Aware…
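To give a flavor of what a Rank-Aware metric looks like, here is a minimal sketch (my illustration, not code from the post) of Average Precision alongside Precision@k, using toy key-phrase lists:

```python
def precision_at_k(predicted, gold, k):
    """Fraction of the top-k predicted key-phrases that are relevant."""
    return sum(1 for p in predicted[:k] if p in gold) / k

def average_precision(predicted, gold):
    """Rank-aware: averages precision at each rank where a hit occurs,
    so relevant key-phrases ranked higher contribute more."""
    hits, score = 0, 0.0
    for k, phrase in enumerate(predicted, start=1):
        if phrase in gold:
            hits += 1
            score += hits / k
    return score / len(gold) if gold else 0.0

# Same correct predictions, different rankings.
gold = {"neural networks", "keyword extraction"}
good_rank = ["keyword extraction", "neural networks", "python"]
bad_rank = ["python", "neural networks", "keyword extraction"]
print(average_precision(good_rank, gold))  # 1.0
print(average_precision(bad_rank, gold))   # (1/2 + 2/3) / 2 ≈ 0.58
```

Notice that plain Precision or F1 would score both rankings identically, while Average Precision rewards the model that puts the relevant key-phrases first.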



Continuing the series of blogs on different keyword extractors, this blog brings us to the Graph Based approaches. We will cover what inspired the researchers to start exploring a graphical solution for Keyword Extraction and we will then discuss the four Graph Based approaches (TextRank, SingleRank, TopicRank and PositionRank). If you would like to read up on different Statistical Approaches, please refer to the first blog in this series.

Introduction — Graph Based Approaches

All the graph-based approaches employ a ranking algorithm such as HITS or PageRank. These algorithms compute the importance of each vertex in the graph. …
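As a rough illustration of the core idea (a simplified sketch of my own that skips the part-of-speech filtering and phrase-reconstruction steps the actual papers use), one can build a word co-occurrence graph and rank its vertices with PageRank via networkx:

```python
import networkx as nx  # assumed available: pip install networkx

def textrank_keywords(tokens, window=2, top_n=5):
    """TextRank-style sketch: build a co-occurrence graph over the
    tokens and rank its vertices with PageRank."""
    graph = nx.Graph()
    # Connect words that co-occur within a sliding window.
    for i, word in enumerate(tokens):
        for neighbor in tokens[i + 1 : i + window + 1]:
            if word != neighbor:
                graph.add_edge(word, neighbor)
    scores = nx.pagerank(graph)  # importance of each vertex
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

tokens = "compatibility of systems of linear constraints over sets".split()
print(textrank_keywords(tokens))
```

The four approaches covered here differ mainly in how they build this graph and how they bias the random walk, not in the underlying vertex-ranking machinery.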



Exploring Different Keyword Extractors is an ongoing series containing a total of three blogs, and this is the first. It provides an introduction to Keyword Extraction and why it is important, and goes into the details of three Statistical approaches for Keyword Extraction. The second blog will cover four Graph Based approaches, and the third will cover different Evaluation Metrics along with a comparison of the Statistical and Graph Based approaches.

Introduction

The pace at which data and information are being generated makes summarizing them a challenge. According to Netcraft’s January 2020 Web Server Survey, there are over 1 billion websites today, with around 380 new websites created every minute. Millions of people contribute to this ever-increasing size of the internet, whether through blog posts, news articles, comments, forum posts, or social media publications. …


In this blog we will look at the impact of Covid-19 on GumGum’s publisher network from January 10th, 2020 to April 8th, 2020. I utilized GumGum’s AI capabilities to classify all the web pages into different IAB categories and see how each category was impacted by Covid-19. Links to interactive versions of each graph in this blog are also provided.

Data Collection

I queried GumGum’s database to collect the processed data from the traffic of English webpages seen by GumGum over the course of January, February, March, and April. Due to the high data volume of GumGum’s AI databases, I utilized Databricks to run a PySpark job to collect, aggregate, and deduplicate across time. The deduplication entails keeping only the first occurrence of a webpage and removing all other duplicates. …
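A minimal sketch of what that deduplication step could look like in PySpark (my illustration; the column names page_url and seen_at and the input path are placeholders, not GumGum’s actual schema):

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative schema: one row per pageview, with a URL
# and the time it was seen.
pages = spark.read.parquet("s3://…/pageviews")  # placeholder path

# Keep only the first occurrence of each webpage across the time range.
first_seen = Window.partitionBy("page_url").orderBy(F.col("seen_at").asc())
deduped = (
    pages.withColumn("rn", F.row_number().over(first_seen))
         .filter(F.col("rn") == 1)
         .drop("rn")
)
```

A row_number over a window partitioned by URL is one straightforward way to express “first occurrence wins”; dropDuplicates() would also deduplicate, but it keeps an arbitrary row rather than the earliest one.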
