Getting Started

Pipelines can be hard to navigate; here’s some code that works in general.

Photo by Quinten de Graaf on Unsplash

Introduction

Pipelines are amazing! I use them in basically every data science project I work on. But, easily getting the feature importance is way more difficult than it needs to be. In this tutorial, I’ll walk through how to access individual feature names and their coefficients from a Pipeline. After that, I’ll show a generalized solution for getting feature importance for just about any pipeline.

Pipelines

Let’s start with a super simple pipeline that applies a single featurization step followed by a classifier.

from datasets import list_datasets, load_dataset, list_metrics
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
# Load a dataset and print the first examples in the training set
imdb_data =…


As an NLP engineer, I’m going to be out of a job soon 😅

Photo by C. Cagnin from Pexels

Introduction

I remember when I built my first seq2seq translation system back in 2015. It was a ton of work, from processing the data to designing and implementing the model architecture. And all of that was to translate one language into one other language. Now the models are so much better, and the tooling around them is leagues better as well. HuggingFace recently incorporated over 1,000 translation models from the University of Helsinki into their transformer model zoo, and they are good. …


How to train a chatbot to sound like anyone with a phone, including your deceased relatives.

Photo by Manea Catalin from Pexels

Introduction

I grew up reading science fiction where people would try to embed their consciousness into machines. I always found these stories fascinating. What does it mean to be conscious? If I put a perfect copy of myself into a machine, which one is me? If the biological me dies but the mechanical copy survives, did I die? I still love stories like this and have been devouring Greg Egan lately; I’d highly recommend his book Diaspora if you find these questions interesting (it’s only $3).

But I digress. With today’s technology, it’s possible to make a rough approximation of a person’s speaking style with only a few lines of code. As Covid-19 has burned through the world, I started to worry about the older people in my life and wondered if it would be possible to preserve a little piece of them somewhere. This tutorial is my feeble attempt at capturing and persisting some conversational aspects of a person beyond the grave. …


An introduction to active learning.

Photo by RUN 4 FFWPU from Pexels

Introduction

Imagine back to your school days, studying for an exam. Did you randomly read sections of your notes, or randomly do problems in the back of the book? No! Well, at least I hope you didn’t approach your schooling with the same level of rigor as deciding what to eat for breakfast. What you probably did was figure out which topics were difficult for you to master and work diligently at those, only doing minor refreshing of ideas you felt you already understood. So why do we treat our machine students differently?

We need more data! It’s a clarion call I often hear working as a data scientist, and it’s true most of the time. The way this normally happens is that some problem doesn’t have enough data to get good results. A manager asks how much data you need. You say more. They hire some interns or crowdsource some labelers, spend a few thousand dollars, and you squeak out a bit more performance. Adding a single step where you let your model tell you what it wants to learn more about can vastly increase your performance with a fraction of the data and cost. I’m talking about doing some, get ready for the buzzword, active learning. …
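The core loop is small enough to sketch. This is a minimal pool-based uncertainty-sampling example (the dataset and model here are stand-ins, not the post’s actual setup): train on a small labeled seed, then ask the model which unlabeled points it is least sure about and send those to the labelers first.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-class problem standing in for a real labeling task
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Start with a small labeled seed; everything else is the unlabeled pool
labeled = list(range(20))
pool = list(range(20, 500))

model = LogisticRegression().fit(X[labeled], y[labeled])

# Uncertainty sampling: pick the pool points whose predicted
# probability is closest to 0.5, i.e. where the model is least sure
proba = model.predict_proba(X[pool])[:, 1]
uncertainty = np.abs(proba - 0.5)
query = [pool[i] for i in np.argsort(uncertainty)[:10]]

print("Next points to send for labeling:", query)
```

In a real loop you would label `query`, move those indices from `pool` to `labeled`, refit, and repeat.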


There is a dearth of good psychotherapy data. Let’s work on that.

Photo by Polina Zimmerman from Pexels

Introduction

This past year I was applying NLP to improve the quality of mental health care. One thing I found particularly difficult in this domain is the lack of high-quality data. Sure, you can scrape Reddit and get some interesting therapeutic interactions between individuals, but in my opinion this is a poor substitute for an actual interaction between a client and a therapist. Don’t get me wrong, there are datasets. They are just, more often than not, proprietary or pay-to-play.

My hope with this post is to introduce a dataset of reasonably high-quality therapist responses to mental health questions from real patients. I will discuss the data source, cover what is in the dataset, and show some simple models we can train using this data, culminating in training a chatbot! I am unaffiliated with counselchat.com, but I think they are doing good work and you should check them out. …


Figuring out what words are predictive for your problem is easy!

Photo Credit: Kendrick Mills

Introduction

Nowadays NLP feels like it’s just about applying BERT and getting state-of-the-art results on your problem. Oftentimes, though, I find that grabbing a few good informative words can help too. Usually, I’ll have an expert come to me and say that these five words are really predictive for this class. Then I’ll use those words as features, and voila! You get some performance improvements or a little more interpretability. But what do you do if you don’t have a domain expert? …


It’s only ~100 lines of code, but the tweets are infinite.

The Author’s rendition of a Twitter bot

Introduction

I love generative models. There is something magical about showing a machine a bunch of data and having it draw a picture or write a little story in the same vein as the original material. What good are these silly little models if we can’t share them with others, right? This is the information age, after all. In this post we’ll:

  1. Walk through setting up a Twitter bot from scratch
  2. Train one of the most cutting edge language models to generate text for us
  3. Use the Twitter API to make your bot tweet!

When you’re done with the tutorial, you’ll be able to create a bot just like this one, which tweets out generated proverbs. All code for this project can be found in this repository. …
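Step 3 can be sketched without committing to a specific client library. Assuming a tweepy-style client object with an `update_status` method (the names below are illustrative, and the fake client exists only so the sketch runs offline), the posting glue looks like this:

```python
class FakeTwitterClient:
    """Stand-in for a real authenticated client (e.g. tweepy.API)."""
    def __init__(self):
        self.sent = []

    def update_status(self, status):
        # A real client would call the Twitter API here
        self.sent.append(status)

def post_generated_tweet(client, generate_text, max_len=280):
    """Generate text and post it, truncating to Twitter's length limit."""
    text = generate_text()[:max_len]
    client.update_status(status=text)
    return text

# A trivial generator standing in for a trained language model
proverb = lambda: "A watched model never converges."

client = FakeTwitterClient()
posted = post_generated_tweet(client, proverb)
print(posted)
```

Swapping `FakeTwitterClient` for a real authenticated client is the only change needed to go live, which also makes the tweeting logic easy to test.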


BERT can get you state-of-the-art results on many NLP tasks and it only takes a few lines of code.

BERT as a Transformer (Image by Author)

Introduction

Getting state-of-the-art results in NLP used to be a harrowing task. You’d have to design all kinds of pipelines, do part-of-speech tagging, link entities to knowledge bases, lemmatize your words, and build crazy parsers. Now just throw your task at BERT and you’ll probably do pretty well. The purpose of this tutorial is to set up a minimal example of sentence-level classification with BERT and scikit-learn. I’m not going to talk about what BERT is or how it works in any detail. I just want to show you, with the smallest amount of work, how to use this model really easily with scikit-learn. …
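The glue pattern is the whole trick: encode each sentence to a fixed-size vector, then hand those vectors to any ordinary scikit-learn estimator. In this sketch the `embed` function is an explicit stand-in (a tiny bag-of-words encoder) so the example runs without downloading weights; in the real setup it would call a BERT model via the `transformers` library instead:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

train_texts = ["great film", "awful film", "great story", "awful story"]
train_labels = [1, 0, 1, 0]

def make_encoder(corpus):
    """Build a stand-in sentence encoder. A real BERT encoder (e.g.
    transformers' feature extraction) would replace this entirely:
    same signature, one fixed-size vector per sentence."""
    vocab = {tok: i for i, tok in enumerate(
        sorted({t for s in corpus for t in s.lower().split()}))}

    def embed(sentences):
        out = np.zeros((len(sentences), len(vocab)))
        for i, s in enumerate(sentences):
            for tok in s.lower().split():
                if tok in vocab:  # unseen tokens are simply dropped
                    out[i, vocab[tok]] += 1.0
        return out

    return embed

embed = make_encoder(train_texts)

# The BERT-as-featurizer pattern: encode once, classify with sklearn
clf = LogisticRegression().fit(embed(train_texts), train_labels)
preds = clf.predict(embed(["great acting", "awful acting"]))
print(preds)
```

Because the classifier only ever sees vectors, you can swap logistic regression for an SVM, a random forest, or anything else in scikit-learn without touching the encoder.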


The determinant is related to the volume of a parallelepiped spanned by the vectors in a matrix; let’s see how.

Recently I was asked to create a video demonstrating how converting a matrix into reduced row echelon form (RREF) uncovers the determinant. I know this sounds implausible, but it happened. The goal of this post is not to describe all of the properties of a determinant, nor to explain how Gaussian elimination works (there are plenty of other resources for that), but to show a nifty demonstration of how the geometry of a matrix is related to the determinant, and how converting a matrix into RREF uncovers the volume of this geometric object. First off, let’s think about how to view a matrix geometrically. …
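A sketch of the idea in code (not the video’s exact derivation): run Gaussian elimination down to echelon form, and the determinant falls out as the product of the pivots, with a sign flip for every row swap. Its absolute value is the volume of the parallelepiped spanned by the matrix’s vectors.

```python
import numpy as np

def det_by_elimination(A):
    """Reduce A to row echelon form with partial pivoting; the
    determinant is the product of the pivots, times -1 per row swap."""
    U = A.astype(float).copy()
    n = U.shape[0]
    sign = 1.0
    for col in range(n):
        # Swap in the largest remaining entry for numerical stability
        p = col + np.argmax(np.abs(U[col:, col]))
        if p != col:
            U[[col, p]] = U[[p, col]]
            sign = -sign  # each row swap flips the determinant's sign
        if U[col, col] == 0:
            return 0.0  # a zero pivot: the parallelepiped is flat
        # Eliminate everything below the pivot (doesn't change det)
        for row in range(col + 1, n):
            U[row] -= (U[row, col] / U[col, col]) * U[col]
    return sign * float(np.prod(np.diag(U)))

# The rows of A span a parallelepiped; |det| is its volume
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 1.0]])
print(det_by_elimination(A), np.linalg.det(A))
```

Only the row-replacement steps leave the determinant untouched; swaps flip its sign and scalings multiply it, which is exactly the bookkeeping the function does.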


GANs can seem scary but the ideas and basic implementation are super simple, like ~50 lines of code simple.

Botanical drawings from a GAN trained on the USDA pomological watercolor collection.

Introduction

Generative Adversarial Networks (GANs) are a model framework in which two models are trained together: one learns to generate synthetic data from the same distribution as the training set, and the other learns to distinguish true data from generated data. When I was first learning about them, I remember being kind of overwhelmed by how to construct the joint training. I haven’t seen a tutorial yet that focuses on building a trivial GAN, so I’m going to try to do that here. No image generation, no fancy deep-fried conv nets. We are going to train a model capable of learning to generate even numbers in about 50 lines of Python code. …
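Here is a hedged, from-scratch sketch of that even-number setup (plain numpy with hand-derived gradients; the post’s own implementation may differ). Numbers are represented as bit vectors, the discriminator is a logistic regression over the bits, and the generator maps noise to "soft bits" through a sigmoid:

```python
import numpy as np

rng = np.random.default_rng(42)
N_BITS, NOISE_DIM, LR = 7, 4, 0.1

def to_bits(n):
    # Bit 0 is the ones bit, so even numbers have bits[0] == 0
    return np.array([(n >> i) & 1 for i in range(N_BITS)], dtype=float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Discriminator: logistic regression over the 7 bits
w_d, b_d = rng.normal(size=N_BITS) * 0.1, 0.0
# Generator: noise -> sigmoid "soft bits"
W_g, b_g = rng.normal(size=(N_BITS, NOISE_DIM)) * 0.1, np.zeros(N_BITS)

def generate(z):
    return sigmoid(W_g @ z + b_g)

for step in range(2000):
    # One real sample (a random even number's bits) and one fake
    x_real = to_bits(2 * rng.integers(0, 2 ** (N_BITS - 1)))
    x_fake = generate(rng.normal(size=NOISE_DIM))

    # --- discriminator update: label real = 1, fake = 0 ---
    for x, y in ((x_real, 1.0), (x_fake, 0.0)):
        d = sigmoid(w_d @ x + b_d)
        w_d -= LR * (d - y) * x  # logistic-loss gradient
        b_d -= LR * (d - y)

    # --- generator update: wants D(fake) -> 1 ---
    z = rng.normal(size=NOISE_DIM)
    pre = W_g @ z + b_g
    g = sigmoid(pre)
    d = sigmoid(w_d @ g + b_d)
    # d(-log D(G(z)))/d pre, chained through D and the sigmoid
    grad_pre = -(1.0 - d) * w_d * g * (1.0 - g)
    W_g -= LR * np.outer(grad_pre, z)
    b_g -= LR * grad_pre

sample = (generate(rng.normal(size=NOISE_DIM)) > 0.5).astype(int)
print("sample bits:", sample, "even:", sample[0] == 0)
```

The joint training is just the two alternating updates in the loop: the discriminator nudges its weights to tell real bits from soft bits, and the generator follows the discriminator’s gradient in the opposite direction.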

About

Nicolas Bertagnolli

Sr. Machine Learning Engineer interested in NLP. Let’s connect on LinkedIn! https://www.linkedin.com/in/nicolas-bertagnolli-058aba81/
