
Today, we will take the contents of a Wikipedia article and prepare it for natural language processing. We will use spaCy to process the text and Power BI to visualize our graph.
Open up a Jupyter notebook, and let's begin!
Let’s get our imports out of the way:
# for manipulating dataframes
import pandas as pd

# for web scraping
from requests import get
from bs4 import BeautifulSoup

# for natural language processing
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()
Then, we’ll issue a get request to Wikipedia like so:
url = 'https://en.wikipedia.org/wiki/QAnon'
response = get(url)
To get an idea…
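As a sketch of the next step, we can pull the article's paragraph text out of the HTML with BeautifulSoup before handing it to spaCy. The snippet below uses a tiny hardcoded HTML fragment in place of the live `response.text` so it runs offline; the sample sentences are illustrative, not taken from the article:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; in the article, the real page is fetched with get(url).
sample_html = """
<html><body>
  <p>QAnon is a conspiracy theory.</p>
  <p>It spread widely on social media.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Join the text of every <p> tag into one string for spaCy to process.
text = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
print(text)
```

With the real response, you would pass `response.text` to BeautifulSoup and then feed the joined text to `nlp(text)`.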

Knowing how to deal with geographic data is a must-have for a data scientist. In this post, we will play around with the MapQuest Search API to get zip codes from street addresses along with their corresponding latitude and longitude to boot!
In 2019, my friends and I participated in the CivTechSA Datathon. At one point in the competition, we wanted to visualize the data points and overlay them on a map of San Antonio. The problem was, we had incomplete data. Surprise! All we had were street numbers and street names: no zip codes, no latitudes, no longitudes. …
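A minimal sketch of what that lookup involves, assuming a MapQuest-style geocoding endpoint. The API key placeholder, the example address, and the trimmed sample response below are illustrative stand-ins (a real call requires a free key from MapQuest's developer site), not values from the article:

```python
from urllib.parse import urlencode

# Hypothetical request; MapQuest's geocoding service takes a key and a location.
BASE_URL = "https://www.mapquestapi.com/geocoding/v1/address"
params = {"key": "YOUR_API_KEY", "location": "100 Military Plaza, San Antonio, TX"}
request_url = f"{BASE_URL}?{urlencode(params)}"

# A trimmed-down stand-in for the JSON the service might return.
sample_response = {
    "results": [{"locations": [{
        "postalCode": "78205",
        "latLng": {"lat": 29.4241, "lng": -98.4936},
    }]}]
}

# Dig out the fields we were missing: zip code, latitude, and longitude.
location = sample_response["results"][0]["locations"][0]
zip_code = location["postalCode"]
lat, lng = location["latLng"]["lat"], location["latLng"]["lng"]
print(zip_code, lat, lng)
```

In practice you would loop this over every street address in the dataset and append the recovered fields as new columns.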

As a data scientist, you’ll need to learn to be comfortable with analytics tools sooner or later. In today’s post, we will dive headfirst and learn the very basics of Power BI.
Be sure to click on the images to better see some details.
The dataset that we will be using for today’s hands-on tutorial can be found at https://www.kaggle.com/c/instacart-market-basket-analysis/data. This dataset is “a relational set of files describing customers’ orders over time.” Download the zip files and extract them to a folder on your local hard drive.
If you haven’t already, go to https://powerbi.microsoft.com/desktop …

In this post, we’ll go through the process of creating a forecast in Power BI.
You can download the dataset that I used here. It contains daily female births in California in 1959¹. For a list of other time-series datasets, check out Jason Brownlee’s article.
Let’s load the data into Power BI. Open up Power BI and click on “Get data” on the welcome screen as shown below.

Every once in a while, I would come across an article that decries online data science courses and boot camps as pathways toward getting a data science job. Most of these articles aim not to discourage but to serve as a reminder to take a hard look in the mirror first and realize what we’re up against. However, a few detractors have proclaimed that the proliferation of these online courses and boot camps has caused the degradation of the profession.
To the latter, I vehemently disagree.
Data science has captured the popular imagination ever since Harvard Business Review dubbed the data scientist as…

In a previous post, we set out to explore the dataset provided by the Trump Twitter Archive. This is part three of the Exploring Trump series.
In this post, we’ll continue our journey, but this time we’ll be using PyCaret:
PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.¹
PyCaret does a lot more than NLP. It also does a whole slew of both supervised and unsupervised ML including classification…

This weekend, I decided to restore my MacBook Pro to factory settings so I could have a clean start at setting up a programming environment.
In this post, we’ll work through setting up oh-my-zsh and iTerm2 on the Mac.
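The core installation steps can be sketched as the commands below. This assumes Homebrew is already installed and uses oh-my-zsh's standard install script; the theme name at the end is just one popular choice, not necessarily the one used here:

```shell
# Install iTerm2 (assumes Homebrew is already installed).
brew install --cask iterm2

# Install oh-my-zsh via its official install script.
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

# Then pick a theme by editing ~/.zshrc, e.g.:
# ZSH_THEME="agnoster"
```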
This is what the end result will look like:

Since my article about my journey to data science, I’ve had a lot of people ask me for advice regarding their own journey towards becoming a data scientist. A common theme started to emerge: aspiring data scientists are confused about how to start, and some are drowning in the overwhelming amount of information available in the wild. So, what’s one more guide, right?
Well, let’s see.
I urge aspiring data scientists to slow it down a bit and take a step back. Before we get to learning, let’s take care of some business first: the fine art of reinventing yourself. …

In a previous post, we set out to explore the dataset provided by the Trump Twitter Archive. This is part two of the Exploring Trump series.
In this post, we’ll continue our journey, but this time we’ll be using spaCy.
For this project, we’ll be using pandas for data manipulation, spaCy for natural language processing, and joblib to speed things up.
Let’s get started by firing up a Jupyter notebook!
Let’s import pandas and also set the display options so Jupyter won’t truncate our columns and rows. Let’s also set a random seed for reproducibility.
# for manipulating data
import pandas…
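A sketch of that setup cell, assuming pandas display options and NumPy's seeding; the specific seed value and the choice of `None` (no truncation) are my own illustrative picks, not necessarily the article's:

```python
# for manipulating data
import numpy as np
import pandas as pd

# Show every column and row instead of truncating the display.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# Seed NumPy's RNG so any sampling below is reproducible.
np.random.seed(42)
```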
I remember a brief conversation with my boss’s boss a while back. He said that he wouldn’t be impressed if somebody in the company built a face recognition tool from scratch because, and I quote, “Guess what? There’s an API for that.” He then went on about the futility of doing something that’s already been done instead of just using it.
This gave me an insight into how an executive thinks. Not that they don’t care about the coolness factor of a project, but at the end of the day, they’re most concerned about how a project will add value…

Data Scientist. Breaking things to solve problems.