Most NLP applications tend to be language-specific and therefore require monolingual data. In order to build an application in your target language, you may need to apply a preprocessing technique that filters out text written in non-target languages. This requires proper identification of the language of each input example. Below I list some tools you can use as Python modules for this preprocessing requirement, and provide a performance benchmark assessing the speed and accuracy of each one.
langdetect is a re-implementation of Google’s language-detection library from Java to Python. Simply pass your text to the imported detect function and it…
Data science work often requires the use of advanced big data analytic techniques against huge datasets and parallel or distributed computing for fast model training and tuning. While most of the data science workflow, especially in the exploratory and development stages, can be carried out on one’s laptop or desktop, it is often impractical or impossible to rely exclusively on a local development environment due to its limitations in processing power, memory, and storage.
Given these limitations, the role of cloud computing technologies such as those provided by Amazon Web Services’ Elastic Compute Cloud (EC2) has never been more important…
As a writer, you’ve probably come across the concept of voice in the context of your high school English teachers pestering you to avoid using the passive voice in your essays: “It is dull, it is too formal, it minimizes the sense of mystery, it kills emotion, it destroys a writer’s flow.” 
Consistent use of the active voice does generally help make your writing more clear and direct, but we all know by now that there are situations in which it is more rhetorically effective to choose the passive voice over the active voice — for example, when we…
One of Python’s most popular standard utility modules, os has provided us with many useful methods for managing a large number of files and directories. But Python 3.4+ gave us an alternative, probably superior, module for this task: pathlib, which introduces the Path class. With this module, we work with path objects instead of path strings and can write much less clunky code.
Below are some of its most useful methods and attributes.
>>> from pathlib import Path
>>> Path.home()
A Path object, instantiated with a path string argument, can be a directory or a…
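To give a feel for the less clunky style, here is a quick sketch of some of those methods and attributes (the file names below are hypothetical):

```python
from pathlib import Path

# The / operator joins path segments -- no os.path.join needed.
p = Path("data") / "raw" / "corpus.txt"

# Useful attributes
name = p.name        # "corpus.txt" -- final path component
suffix = p.suffix    # ".txt" -- the file extension
stem = p.stem        # "corpus" -- name without the extension
parent = p.parent    # Path("data/raw")

# Useful methods
home = Path.home()   # the current user's home directory, as a Path
exists = p.exists()  # True/False, without raising an exception
```

Because every operation returns another Path object, these calls chain naturally (e.g. p.parent.parent).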
I want to preface this post by saying that I have no intention of using this space to minimize the role of a quantitative degree in the work and success of most data scientists, nor is it my objective to “over-encourage” (or discourage) anyone without a solid quantitative background who nevertheless has a serious interest in pursuing data science as a career. As a data scientist coming from an “unorthodox” background as someone put it (I have a Ph.D. …
First, as promised, I’ll be following up on a previous post in which I compared the speech properties of twenty-one 2020 Democratic primary presidential candidates. I identified a range of linguistic features that would distinguish our presidential hopefuls at the descriptive level.
In this post, I’d like to use those features to build a classification model that can predict who will qualify for the Dec 19th debate. Of course, we now know who has qualified for the debate, but I guess the real motive behind this task is to come away with a more general understanding of how much it…
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, RandomizedSearchCV # or GridSearchCV
from sklearn.pipeline import Pipeline
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["label"])
train_data = train_df["text"]
train_target = train_df["label"]
test_data = test_df["text"]
test_target = test_df["label"]
The pipeline should consist of a series of transforms and a final estimator.
Calling fit on the pipeline is the same as calling fit on each estimator in turn: each intermediate step transforms the input and passes it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the…
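To make this concrete, here is a minimal sketch of such a pipeline built from the imports above. The final estimator (LogisticRegression) and the toy data are my own assumptions for illustration, not the model from the post:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                             # text -> sparse tf-idf matrix
    ("svd", TruncatedSVD(n_components=2, random_state=42)),   # dimensionality reduction (LSA)
    ("clf", LogisticRegression()),                            # final estimator
])

# Hypothetical toy data standing in for the candidates' speech features
texts = [
    "strong debate performance",
    "poor polling numbers",
    "strong fundraising totals",
    "weak debate showing",
]
labels = [1, 0, 1, 0]  # 1 = qualifies, 0 = does not

# One fit call fits every transform and the final estimator in sequence
pipe.fit(texts, labels)
preds = pipe.predict(texts)
```

Because the last step is a classifier, the fitted pipeline exposes predict (and predict_proba), and the whole object can be dropped into RandomizedSearchCV for tuning.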
As we head into the new year reeling from a climate of deep political polarization, I realized that I’ve never paid so much attention to the politics of this country as I have this whole past year — and that means I actually listen to The Daily podcast every morning now and even watched two or three live Democratic primary debates in their entirety.
The day after each debate, my newsfeed gets flooded with commentary, footage, and analysis of what the candidates talked about and even how much they talked. …
When I first started exploring data science towards the end of my Ph.D. program in linguistics, I was pleased to discover the role of linguistics — specifically, linguistic features — in the development of NLP models. At the same time, I was a bit perplexed by why there was relatively little talk of exploiting syntactic features (e.g., level of clausal embedding, presence of coordination, type of speech act, etc.) …
One of the most essential characteristics of a machine learning code pipeline is reusability. A reusable, shareable, and extensible pipeline ensures process and code integrity by enforcing a consistent, intuitive structure on program flow, and can therefore enhance the data scientist’s development process, which is iterative by nature. In this article, I will demonstrate how to build a custom machine learning code pipeline from scratch using scikit-learn, with an emphasis on the following two components:
Sr Data Scientist at Chegg | NLP & Linguistics