Natural Language Processing with scikit-learn’s CountVectorizer, or with libraries such as spaCy and Gensim, can provide powerful insights into text data, allowing us to extract topics that can then be added as features and regressed on to generate predictions.
However, what if we want to use the sparse matrix that CountVectorizer produces as a feature alongside the other categorical or numerical features in the dataset?
The answer is ColumnTransformer, and I’ll demonstrate its usage on some Yelp review data.
spaCy’s lemmas are now very clean and, once I join them back into strings, can be processed by TF-IDF into a sparse vector matrix. …
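To make the idea concrete, here is a minimal sketch of how a ColumnTransformer can combine a TF-IDF text column with categorical and numeric features. The column names (`lemmas`, `city`, `stars`) and the tiny toy dataframe are hypothetical stand-ins for the Yelp data, not the actual dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Yelp data: 'lemmas' holds lemmatized
# review text re-joined into strings, 'city' is categorical, 'stars' numeric.
df = pd.DataFrame({
    'lemmas': ['good food great service', 'bad food slow service'],
    'city': ['Austin', 'Dallas'],
    'stars': [5, 1],
})

preprocessor = ColumnTransformer(
    transformers=[
        # TfidfVectorizer expects a 1-D column, so the name is a string
        ('text', TfidfVectorizer(), 'lemmas'),
        # OneHotEncoder expects a 2-D selection, so the name goes in a list
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
    ],
    remainder='passthrough',  # numeric columns like 'stars' pass through as-is
)

X = preprocessor.fit_transform(df)
print(X.shape)  # sparse TF-IDF terms + city dummies + the numeric column
```

The key point is that ColumnTransformer stitches the sparse TF-IDF output together with the other feature columns into one matrix that any downstream estimator can consume.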
I just completed a Lambda School Kaggle competition involving water pumps in Tanzania. The data comes from Taarifa (a web API that aggregates citizen feedback) and the Tanzanian Ministry of Water, and was put together by DrivenData.
The goal is to use a training set of 59,400 observations and 40 features to predict one of three labels for each water point.
The potential use case for this model is to prioritise maintenance for pumps that are either non-functional or in need of repair.
I did limited data exploration before throwing together my first model; basically, I searched for the simplest features and settled on geographical location and the amount of water available. …
In my last post about movie budgets, one of the key feature categories was the release dates of films. Dates are an interesting linguistic structure because different cultures represent them differently. Take 2019-04-21, or YYYY-MM-DD: this format is totally intuitive, running from the biggest unit to the smallest, year then month then day.
Of course, in the USA the standard date format is MM-DD-YYYY, which makes absolutely no sense to me but has to be dealt with in most US-sourced datasets.
Let’s see how this shows up in my movie dataset:
Note that Release date is listed as an ‘object’; in fact, all the features are. This means that pandas read them all in as strings: because each column contained a mix of characters, numbers, and symbols, the pandas parser could not identify a clearer datatype. …
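Converting those ‘object’ date strings into real datetimes is a one-liner with `pd.to_datetime`. A small sketch, using made-up US-format dates rather than the actual movie dataset:

```python
import pandas as pd

# Hypothetical example: release dates stored as strings in US MM-DD-YYYY order
df = pd.DataFrame({'Release date': ['04-21-2019', '12-25-1997']})

# format= makes the US ordering explicit instead of letting pandas guess
df['Release date'] = pd.to_datetime(df['Release date'], format='%m-%d-%Y')

print(df['Release date'].dt.year.tolist())  # [2019, 1997]
```

Once the column is a true datetime, the `.dt` accessor exposes year, month, day, and more as separate features.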
In my last post about Hollywood movies, I didn’t specifically address the engineering challenge of scraping all the data for that analysis. Below, I will go through the process to identify data for scraping using Beautiful Soup.
The first step is to identify the data you want to scrape. It helps if the data is already in some sort of HTML structure that Beautiful Soup can use to locate it on the page.
As you can see, the target page I went after already has a rough table structure, which means there are HTML tags we can find with Beautiful Soup.
Ok, let’s get started! …
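The basic pattern looks something like this. The HTML snippet below is a toy stand-in for the table-like page structure described above (a real scrape would fetch the page with `requests` first); the movie titles and figures are illustrative only:

```python
from bs4 import BeautifulSoup

# Toy stand-in for the table-structured target page
html = """
<table>
  <tr><td>Avatar</td><td>$237,000,000</td></tr>
  <tr><td>Titanic</td><td>$200,000,000</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Walk the rows, pulling the text out of each cell
rows = [[td.get_text() for td in tr.find_all('td')]
        for tr in soup.find_all('tr')]

print(rows)  # [['Avatar', '$237,000,000'], ['Titanic', '$200,000,000']]
```

Finding the right tags (`<tr>`, `<td>`, or a distinguishing `class` attribute) is usually the whole game; after that it’s just looping and collecting.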
That’s the proportion of movies, from a list of 5,193 Hollywood films, that actually broke even in the past 100 years. So out of 5,193 films, only 1,190 made a profit.
You can see that recent decades have been increasingly brutal, with the proportion of break even films falling to 20% in the 90s and staying around that level to the present day.
But what is driving this high failure rate? Are larger-budget studio films sucking the dollars away from smaller independent cinema? Is it a budgeting arms race that is concentrating film profits into fewer movies?
If this were the case, we would expect the return on investment (ROI) of movies to be shifting so that a few films make huge returns, and the majority of films make little or no money. …
Dealing with non-NaN strings that represent NaN values.
In our first post we looked at using the pandas .dropna() function to get rid of NaN (not-a-number) values. However, some sources, like the UCI Machine Learning Repository (a home for many toy datasets that are useful for a beginning data scientist), do not provide a pre-formatted dataframe (with or without NaNs), so their NaN values (missing observations, etc.) will reach you with some sort of string encoding, like a ‘?’, ‘unknown’, or ‘Unknown’ (all of which are different values to a pandas dataframe).
Here’s an example so you can see what I…
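A minimal sketch of the fix, assuming a made-up column (the name `workclass` and its values are illustrative, not from the actual dataset): map every string that really means “missing” onto a true NaN, so that `.isna()` and `.dropna()` behave as expected.

```python
import numpy as np
import pandas as pd

# Hypothetical example: the same missing value encoded three different ways
df = pd.DataFrame({
    'workclass': ['Private', '?', 'unknown', 'Unknown', 'State-gov'],
})

# Collapse all the string-encoded missing markers into real NaNs
df = df.replace(['?', 'unknown', 'Unknown'], np.nan)

print(df['workclass'].isna().sum())  # 3
```

Many loaders let you skip this step entirely via `pd.read_csv(..., na_values=['?', 'unknown', 'Unknown'])`, which converts the markers at read time.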
As a beginning data scientist, I’m learning that most of my time is spent preparing data for analysis. Much as writing is about clarifying and polishing ideas, before we can tell any compelling stories with data, it must be thoroughly cleaned and prepared for analysis.
This might not seem very glamorous, but it is necessary if we want to extract any interesting stories from the data.
Here are some example datasets for us to work with:
We’ve created a dataframe with 4 columns and 100 rows (zero is the first index label, so 99 will be our last row), populated with random integers between 0 and 100. …
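A dataframe like that can be built in a couple of lines; the column names `a` through `d` and the seed are my own choices for the sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducibility; any seed works

# 100 rows x 4 columns of random integers between 0 and 100 inclusive
# (randint's upper bound is exclusive, hence 101)
df = pd.DataFrame(np.random.randint(0, 101, size=(100, 4)),
                  columns=['a', 'b', 'c', 'd'])

print(df.shape)  # (100, 4)
```

Note the off-by-one detail in the comment: `np.random.randint(0, 101, ...)` is needed because the upper bound is exclusive.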