In the world of Natural Language Processing (NLP), the most basic models are based on Bag of Words. But such models fail to capture the syntactic relations between words.

For example, suppose we build a sentiment analyser based on only Bag of Words. Such a model will not be able to capture the difference between “I like you”, where “like” is a verb with a positive sentiment, and “I am like you”, where “like” is a preposition with a neutral sentiment.
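To make the limitation concrete, here is a minimal sketch of a Bag of Words representation using only the Python standard library. The `bag_of_words` helper and its whitespace tokenisation are illustrative assumptions, not the model used in the article. It shows that both sentences produce the same feature for “like”, so nothing in the representation separates the verb from the preposition.

```python
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence, split on whitespace, and count tokens.

    Hypothetical helper for illustration: word order and grammatical
    role are discarded, only token counts remain.
    """
    return Counter(sentence.lower().split())


verb_use = bag_of_words("I like you")      # "like" as a verb
prep_use = bag_of_words("I am like you")   # "like" as a preposition

# Both bags contain the same count for the token "like".
print(verb_use["like"], prep_use["like"])

# The only difference between the two bags is the extra token "am".
print(prep_use - verb_use)
```

A classifier that sees only these counts has to assign “like” a single weight, regardless of how the word is used in the sentence.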

So this leaves us with a question — how do we improve on this Bag of Words technique?

Part of Speech (hereafter referred to as POS) tags are useful for building parse trees, which are used in building NERs (most named entities are nouns) and in extracting relations between words. POS tagging is also essential for building lemmatizers, which are used to reduce a word to its root form. …


Model Evaluation, Model Tuning

Welcome to part two of the predicting taxi fare using machine learning series! This is a unique challenge, wouldn’t you say? We take cab rides on a regular basis (sometimes even daily!), and yet when we’re hitting that ‘Book now’ button, we rely on manual on-the-fly calculations rather than hardcore ML ones. And that’s what I aim to demonstrate here.

In the previous post, we looked at cleaning the data and exploring it to identify relationships between variables, and also to understand various features that will have an impact on the Taxi Fare.

In this article, we will understand how to build machine learning models to predict taxi fare and look at the significant impact feature engineering plays in this process. The code for this article can be found here. …


Has there been a focus shift towards privacy since the scandal?

When the Cambridge Analytica-Facebook scandal emerged, articles about the misuse of user data by technology companies were ubiquitous. The privacy issues it raised made me want to understand how the scandal affected people's views of Facebook, and how their perception of the company has changed since.

The Scandal — A brief outline

I dug into the archives of Medium and collected all the posts that were tagged “Facebook”. …


Data Cleaning, Visualisation, Feature Engineering

I was learning Python for data analysis and wanted to apply the concepts on a real data set — and lo, there I was on Kaggle and found the New York Taxi Fare Prediction problem.

In this challenge, the training data contains 55M taxi trips in New York since 2009, and the test data contains 9914 records. The goal is to predict the fare of a taxi trip given the pickup and drop-off locations, the pickup date and time, and the number of passengers travelling.

In any analytics project 80% of the time and effort is spent on data cleaning, exploratory analysis and deriving new features. …
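As an illustration of what deriving a new feature can look like for this kind of data, here is a minimal sketch that computes the great-circle (haversine) distance between a pickup and a drop-off point. This is an assumption about a plausible feature, not necessarily one used in the article; the function name `haversine_km` and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points
    given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical pickup and drop-off coordinates, for illustration only.
trip_km = haversine_km(40.75, -73.99, 40.64, -73.78)
print(round(trip_km, 2))
```

The straight-line distance ignores the street grid and traffic, so it is only a rough proxy for the distance actually driven.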


In my previous post, I wrote about how to scrape search results for a particular query string from Medium. In this post, we will analyze the data scraped for the search term “Data Science”, group posts into different levels of popularity based on their number of claps and responses, and try to understand what makes these posts popular.

The data scraped from Medium search results was a JSON file with extensive data about each search result. To explore the structure of the JSON file, I used Notepad++ with the JSON plugin. …


I wanted a way to look at what people are writing on Medium about Data Science and here’s how I did it.

Medium is a great tool for posting and discovering content on the latest topics. Being a data enthusiast, I wanted to understand what people are writing about Data Science and what kind of articles are well read. So I decided to build a crawler using Scrapy, a Python library.

To build any crawler, it is imperative to understand what requests are made to the server to fetch the data. To get this information, I used the “Network” tab in Chrome’s “Developer Tools” to see how requests are made, and set the “header” and “cookie” information on that basis. Clicking on a request in the Network tab also shows what the response looks like. This proved very useful: a request to Medium returns a JSON object containing the information about the post, so all I had to do was write the JSON output to a file, which can then be processed and stored in Excel or a database. …
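The last step, writing the JSON returned in a response to a file, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the crawler from the article: the helper `save_json_response`, the sample body and the file name are hypothetical, and the sketch assumes the response body is already a valid JSON string.

```python
import json
import os
import tempfile


def save_json_response(body_text, out_path):
    """Parse a JSON response body and write it to out_path.

    Hypothetical helper: assumes body_text is valid JSON text.
    Returns the parsed object so the caller can inspect it.
    """
    payload = json.loads(body_text)
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False, indent=2)
    return payload


# Made-up response body standing in for a real search result.
sample_body = '{"posts": [{"title": "Hello", "claps": 3}]}'
out_path = os.path.join(tempfile.mkdtemp(), "medium_search.json")
saved = save_json_response(sample_body, out_path)
print(saved["posts"][0]["title"])
```

Parsing before writing means a malformed response fails early with a `json.JSONDecodeError` instead of silently producing an unusable file.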

About

Aiswarya Ramachandran
