Getting Started

Four Python tools for identifying the language of your text and a speed and accuracy test


Most NLP applications tend to be language-specific and therefore require monolingual data. In order to build an application in your target language, you may need to apply a preprocessing technique that filters out text written in non-target languages. This requires proper identification of the language of each input example. Below I list some tools you can use as Python modules for this preprocessing requirement, and provide a performance benchmark assessing the speed and accuracy of each one.

1) langdetect

langdetect is a re-implementation of Google’s language-detection library from Java to Python. Simply pass your text to the imported detect function and it…

Port forwarding, remote jupyter notebook access, and tmux auto-start




Data science work often requires the use of advanced big data analytic techniques against huge datasets and parallel or distributed computing for fast model training and tuning. While most of the data science workflow, especially in the exploratory and development stages, can be carried out on one’s laptop or desktop, it is often impractical or impossible to rely exclusively on a local development environment due to its limitations in processing power, memory, and storage.

To this point, the role of cloud computing technologies such as those provided by Amazon Web Services’ Elastic Compute Cloud (EC2) has never been more important…

Cool transitivity alternations you didn’t know existed in English


As a writer, you’ve probably come across the concept of voice in the context of your high school English teachers pestering you to avoid using the passive voice in your essays: “It is dull, it is too formal, it minimizes the sense of mystery, it kills emotion, it destroys a writer’s flow.” [1]

Consistent use of the active voice does generally help make your writing more clear and direct, but we all know by now that there are situations in which it is more rhetorically effective to choose the passive voice over the active voice — for example, when we…


Are you still using os.path?


0. Basics

One of Python’s most popular standard utility modules, os has long provided us with useful methods for managing large numbers of files and directories. But Python 3.4+ gave us an alternative, arguably superior module for this task — pathlib — which introduces the Path class. With this module, we work with path objects instead of path strings and can write much less clunky code.

Below are some of its most useful methods and attributes.

>>> from pathlib import Path
>>> Path.home()
>>> Path.cwd()

A Path object, instantiated with a path string argument, can be a directory or a…
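To illustrate the “path objects instead of path strings” point, a small sketch (the file names here are hypothetical):

```python
from pathlib import Path

# The / operator joins path segments, replacing os.path.join
p = Path("data") / "raw" / "corpus.txt"

print(p.name)      # 'corpus.txt'
print(p.suffix)    # '.txt'
print(p.stem)      # 'corpus'
print(p.parent)    # Path('data/raw') -- a Path object, not a string
print(p.exists())  # False unless the file actually exists on disk
```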

Yenthoroto village, Western Province, Papua New Guinea

A story and some insights

I want to preface this post by saying that I have no intention of using this space to minimize the role of a quantitative degree in the work and success of most data scientists, nor is it my objective to “over-encourage” (or discourage) anyone without a solid quantitative background who nevertheless has a serious interest in pursuing data science as a career. As a data scientist coming from an “unorthodox” background as someone put it (I have a Ph.D. …

Using 2020 primary debate transcripts


The goal of this post is two-fold.

First, as promised, I’ll be following up on a previous post in which I compared the speech properties of twenty-one 2020 Democratic primary presidential candidates. I identified a range of linguistic features that would distinguish our presidential hopefuls at the descriptive level.

In this post, I’d like to use those features to build a classification model that can predict who will qualify for the Dec 19th debate. Of course, we now know who has qualified for the debate, but I guess the real motive behind this task is to come away with a more general understanding of how much it…


Five steps using scikit-learn

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, RandomizedSearchCV # or GridSearchCV
from sklearn.pipeline import Pipeline

1) Prepare the datasets.

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["label"])

train_data = train_df["text"]
train_target = train_df["label"]
test_data = test_df["text"]
test_target = test_df["label"]

2) Set up a training pipeline using Pipeline.

The pipeline should consist of a series of transforms and a final estimator.

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e., if the…
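Putting the imports from above together, a pipeline along these lines might look as follows (the toy corpus and the choice of LogisticRegression as the final estimator are illustrative assumptions, not the article’s actual setup):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression  # final estimator; any classifier works
from sklearn.pipeline import Pipeline

# Toy data standing in for train_data / train_target from step 1
train_data = ["good movie", "great film", "bad movie", "awful film",
              "wonderful story", "terrible plot"]
train_target = [1, 1, 0, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                            # transform: text -> tf-idf matrix
    ("svd", TruncatedSVD(n_components=2, random_state=42)),  # transform: dimensionality reduction
    ("clf", LogisticRegression()),                           # final estimator
])

pipe.fit(train_data, train_target)  # fits each step in turn, as described above
preds = pipe.predict(["great story", "awful plot"])
print(preds)
```

Because the last step is a classifier, the whole pipeline exposes predict; swapping in RandomizedSearchCV around `pipe` would tune hyperparameters across all steps at once.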

Data preparation and feature engineering for predictive modeling using real-world data


As we head into the new year reeling from a climate of deep political polarization, I realized that I’ve never paid so much attention to the politics of this country as I have this whole past year — and that means I actually listen to The Daily podcast every morning now and even watched two or three live Democratic primary debates in their entirety.

The day after each debate, my newsfeed gets flooded with commentary, footage, and analysis of what the candidates talked about and even how much they talked. …

With a guide to question type extraction with spaCy


Table of contents

  1. Ruminations of a Linguist
  2. A solution: Rule-Based Syntactic Feature Extraction
  3. An example: Question Type Extraction Using spaCy

🌳 Ruminations of a Linguist

When I first started exploring data science towards the end of my Ph.D. program in linguistics, I was pleased to discover the role of linguistics — specifically, linguistic features — in the development of NLP models. At the same time, I was a bit perplexed by why there was relatively little talk of exploiting syntactic features (e.g., level of clausal embedding, presence of coordination, type of speech act, etc.) …
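As a taste of what rule-based question type extraction might look like, here is a toy sketch (the rules, word lists, and function are my own illustration, not the article’s code; spacy.blank gives a tokenizer-only pipeline, so no pretrained model download is needed):

```python
import spacy

# Tokenizer-only English pipeline; sufficient for this first-token heuristic
nlp = spacy.blank("en")

WH_WORDS = {"who", "what", "when", "where", "why", "which", "how", "whose"}
AUXILIARIES = {"do", "does", "did", "is", "are", "was", "were",
               "can", "could", "will", "would", "should", "have", "has", "had"}

def question_type(text: str) -> str:
    """Classify a sentence as a wh-question, polar (yes/no) question, or neither."""
    if not text.rstrip().endswith("?"):
        return "not a question"
    first = nlp(text)[0].lower_  # first token, lowercased
    if first in WH_WORDS:
        return "wh-question"
    if first in AUXILIARIES:
        return "polar question"
    return "other"

print(question_type("What did the candidates discuss?"))  # wh-question
print(question_type("Did they talk about healthcare?"))   # polar question
```

A production version would use the dependency parse from a full spaCy model rather than surface word lists, but the rule-based spirit is the same.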

With an emphasis on feature engineering and training

One of the most essential characteristics of a machine learning code pipeline is reusability. A reusable, shareable, and extensible pipeline ensures process and code integrity by enforcing consistent use of intuitive structural elements in program flows, and can therefore enhance the data scientist’s development process, which is iterative by nature. In this article, I will demonstrate how to build a custom machine learning code pipeline from scratch using scikit-learn, with an emphasis on the following two components:

  1. A featurization pipeline that enables flexible definition and selection of features
  2. A training pipeline that incorporates the output of the featurization pipeline…
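One way these two components might fit together in scikit-learn is a FeatureUnion feeding a Pipeline (the custom TextLength feature, the toy data, and the classifier choice here are illustrative assumptions, not the article’s actual code):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

class TextLength(BaseEstimator, TransformerMixin):
    """A hypothetical hand-rolled feature: character length of each document."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(doc)] for doc in X], dtype=float)

# 1) Featurization pipeline: add or remove features by editing this list
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", TextLength()),
])

# 2) Training pipeline that consumes the featurization output
model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression()),
])

docs = ["short", "a much longer document indeed",
        "short again", "quite a long text over here"]
labels = [0, 1, 0, 1]
model.fit(docs, labels)
preds = model.predict(["tiny", "a very long document once more"])
print(preds)
```

Keeping the feature list as a flat, named collection is what makes the featurization step flexible: features can be toggled without touching the training code.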

Jenny Lee

Sr Data Scientist at Chegg | NLP & Linguistics
