Last week, I gave you all a little insight into using the MLB-StatsAPI to gather daily up-to-date baseball data.
In doing so, I became more comfortable with using the API that has given me trouble for quite some time now. The funny part of it all is that after digging deeper into the queries, I realized that it may not be as difficult as it seems.
Once you get comfortable with some of the team/player parameters, you can figure out the rest intuitively. …
For a while now, I’ve been looking for an API to get good daily data for MLB games and up-to-date season stats. I’ve grown tired of scraping Baseball Reference for menial tasks and have been wanting to build a script that can bring in daily data without too much effort. It’s been a while since I’ve tried to use the MLB API, because it has never been that user-friendly due to poor and/or unavailable documentation. But today I thought I’d give it another shot. In this short post, I’ll give you a couple of snippets to get you…
Ah yes, one of my favorite probability problems. The Birthday Paradox is a counterintuitive and powerful problem that tests the minds of those who hear it. When I first heard of it, I too didn’t understand the intuition behind it. So I decided to write a brief post describing the math behind solving it.
The Birthday Paradox problem is this: given a room full of n people, what is the probability that at least two of them share a birthday?
The answer: the probability of this occurrence goes up as more people enter the room (as expected). Once the number of people…
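As a rough illustration of the problem stated above, here is a minimal sketch of the usual calculation. It assumes 365 equally likely birthdays, no leap years, and independent birthdays; the function name is made up for this example and is not taken from the post. It computes the chance that all n birthdays differ, then takes the complement.

```python
def birthday_collision_probability(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday.

    Assumes every day is equally likely and birthdays are independent.
    """
    if n > days:
        # Pigeonhole principle: more people than days guarantees a match.
        return 1.0

    # Probability that all n birthdays are distinct:
    # (days/days) * ((days-1)/days) * ... * ((days-n+1)/days)
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days

    # Complement: at least one shared birthday.
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    # Print the probability for a few room sizes to see how fast it grows.
    for n in (5, 10, 20, 30, 50):
        print(n, round(birthday_collision_probability(n), 3))
```

Working with the complement keeps the calculation to a single loop, instead of counting every way a match could happen.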
Oftentimes when cleaning text data, you need to find certain words or phrases within the body of the text. You can do this in a number of ways, one of the most popular being RegEx. But what if I told you there was a Python library that could do the job more quickly and is much easier to work with? Well, there is, and it’s called FlashText.
In this brief piece, I will show you the simplest way to extract keywords from text. Let’s get started.
Easily install FlashText with the simple pip command
pip install flashtext and import into…
A Moneyball spin-off. In this week’s blog, I will try to use linear regression models to predict how many runs each Major League team will score during the course of the 2018 regular season. This was a project I completed earlier this year when the season was just beginning.
The data used for this project came from baseball-reference.com. The initial dataset consisted of each team’s seasonal hitting statistics from 1990–2018 (excluding 1994–1995 due to the MLB strike). The data was split into two groups: 1990–2017 for training and 2018 for testing.
In this piece, I explain a recommendation system that my classmates Akshay (@akshay.sharma8426), Fhel (@FhelDimaano), and I built during our time at The Flatiron School in New York City. Although I won’t give details of the math behind singular value decomposition and collaborative filtering, I will show you the steps we took to build out our system. This project took roughly three days.
There’s a link to the demo of our whiskey recommendation system at the bottom of this article.
We collected our data from whiskybase, which is a website devoted to whiskey enthusiasts. Users on the site can rate…
This week I thought I would share with you a short story of how I decided that tech was right for me and give you a brief look into some lessons I have learned throughout the past year. Throughout my life it has seemed like I have had many different plans. Plans that ranged from sports to school to business to travel and finally to tech. As a 23-year-old and a recent graduate of Michigan State University, if you had told me five years ago that today I would be in New York City studying computer programming…
Hey! I post short blogs every Friday. If you have not done so already, check out last week’s blog “News Article Clustering Using Unsupervised Learning” here.
This past week I spent some time looking at text data from goodreads.com. My goal was to scrape some quotes off of the website and analyze them using Natural Language Processing techniques and supervised classification models.
Using Beautiful Soup 4 I successfully scraped 2,016 (coincidence? I think not..) quotes from goodreads.com that came from 5 different presidents.
Last week I made a post about an extractive text summarization tool I built with Python using NLTK and cosine similarity scores. That article can be found HERE. This week’s blog will focus on another piece of that project where I use unsupervised learning algorithms to cluster news articles followed by supervised learning algorithms to classify recent articles.
The data I am using is roughly 97k news articles that come from the years 2013–2017 and range from roughly 2k-15k characters in length. I store the dataset in a pandas dataframe for analysis, preview shown below.
Being on the go and living in New York City go hand in hand. With everyone constantly rushing to and fro, it seems that everyone is short on one main thing: time. With a shortage of time and a surplus of tasks it would be nice to be able to minimize certain daily activities in order to be more productive.
When I look at the New York Times front page I see articles on articles, but too many for me to read before I exit the 5 train at Bowling Green. Because of this, I decided to create…
Data Scientist at Trusted Media Brands | Weekend Web Hacker