NLP-in-Production Series

Along with some insights to prevent common mistakes in computing sentence embeddings

Photo by Katya Austin on Unsplash

We often need to encode text data, including words, sentences, and documents, into high-dimensional vectors. Sentence embedding is an important step in various NLP tasks such as sentiment analysis and extractive summarization. A flexible sentence-embedding library is needed to prototype fast and to tune for various contexts.

In the past, we mostly used encoders such as one-hot, term frequency, or TF-IDF (i.e., normalized term frequency). However, these techniques did not capture the semantic and syntactic information of words. Recent advancements allow us to encode sentences and words in more meaningful forms. The word2vec technique and the BERT language model are two important examples.
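To make the contrast concrete, here is a minimal sketch, assuming scikit-learn and the spaCy en_core_web_lg model are installed: TF-IDF only sees shared tokens, while a word2vec-based sentence vector places paraphrases close together.

```python
# A minimal sketch contrasting TF-IDF with a word2vec-based sentence vector.
# Assumes scikit-learn and the spaCy en_core_web_lg model are installed.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["The movie was fantastic.", "The film was great."]

# TF-IDF: the two sentences share only function words, so similarity stays low.
tfidf = TfidfVectorizer().fit_transform(sentences)
print(cosine_similarity(tfidf[0], tfidf[1]))   # low score

# spaCy averages its 300-d word vectors into a sentence vector, so the two
# paraphrases land close together despite sharing no content words.
nlp = spacy.load("en_core_web_lg")
print(nlp(sentences[0]).similarity(nlp(sentences[1])))  # high score
```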


NLP-in-Production Series

Owl API is a powerful word similarity service using advanced text clustering techniques and various word2vec models

Photo by Zdeněk Macháček on Unsplash

In natural language processing, or NLP, most tasks are built on extracting the semantics of words, sentences, or documents. For example, we build extractive summarization by extracting the semantics of sentences and clustering them based on their significance in a document. Or, we build topic-modeling solutions by extracting the word groups that best characterize a set of documents.

The first element of language that carries meaning is a word. So, you can guess how important it is to correctly extract the semantic relations of words in NLP tasks. One of the most powerful tools to extract word semantics is the word2vec family of models. These word2vec models are trained for different contexts and provide high-dimensional vector representations of words. …
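As a minimal taste of what such models offer, here is a query against the pre-trained Google News word2vec model through Gensim's downloader; note the model file is large, so the first call takes a while.

```python
# A minimal sketch of querying word semantics with a pre-trained word2vec
# model via Gensim's downloader (the first call downloads a large file).
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# Words that appear in similar contexts to "king" in the training corpus.
print(model.most_similar("king", topn=5))

# The classic analogy: king - man + woman ≈ queen.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```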


NLP-in-Production Series

Along with an introduction to the newly released word similarity API named Owl

Photo by Harley-Davidson on Unsplash

The recent developments in natural language processing, or NLP, have introduced language models that must be used cautiously in production. For example, the spaCy large English model, en_core_web_lg, contains more than 600 thousand 300-d vectors. Or, the pre-trained word2vec-google-news-300 model contains 3 million 300-d vectors for words and phrases. When you want to calculate a metric across these high-dimensional vectors, the solution can easily run into computational bottlenecks.

In this article, I want to share how I sped up the spaCy library for use in the word similarity API. Contrary to Gensim, spaCy does not support an efficient most-similar method. I recently published a word similarity API named Owl. This API lets you extract the words most similar to target words using various word2vec models, including spaCy's. Given a word, the API returns a list of groups of words that are similar to the original word in predefined contexts such as News or General. …
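As a rough sketch of the general idea, the lookup can be vectorized with NumPy so that each query becomes a single matrix-vector product over spaCy's vector table. The most_similar helper below is an illustrative name, not spaCy's API, and not necessarily how Owl implements it.

```python
# A rough sketch of a vectorized most-similar lookup over spaCy's vectors.
# most_similar is an illustrative helper, not part of spaCy's API.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

# Normalize the whole vector table once, so each query is one matrix-vector
# product instead of a Python loop over the vocabulary.
table = nlp.vocab.vectors.data                      # shape: (n_rows, 300)
norms = np.linalg.norm(table, axis=1)
norms[norms == 0] = 1.0                             # guard all-zero rows
unit = table / norms[:, None]
row2key = {row: key for key, row in nlp.vocab.vectors.key2row.items()}

def most_similar(word, top_n=10):
    vec = nlp.vocab[word].vector
    vec = vec / (np.linalg.norm(vec) + 1e-12)
    sims = unit @ vec                               # cosine with every row at once
    top = np.argpartition(-sims, top_n)[:top_n]
    top = top[np.argsort(-sims[top])]
    return [(nlp.vocab.strings[row2key[int(i)]], float(sims[i]))
            for i in top if int(i) in row2key]

print(most_similar("apple"))
```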


A great tool to collect YouTube data and run interesting data science projects

Photo by Rachit Tank on Unsplash

YouTube is one of the main sources of education, entertainment, advertisement, and much more. It holds so much data that a data scientist can use it to run interesting projects or build products. Whether you are a novice or an expert data scientist, you have definitely heard about sentiment analysis, one of the main applications of natural language processing. Sentiment analysis is used, for example, to monitor social media or customer reviews.

When you search online, you can easily find several sentiment analysis datasets built from Amazon product reviews or IMDB movie reviews. However, there are not many API services that let you work with live online data. …


If you recognize its advantages, you will adopt it today.

Photo by Michael Van Kerckhove on Unsplash

To build a machine learning, or ML, product, you must train a large number of models with different parameter configurations. These models are trained on a training dataset that evolves over time. Plus, the metrics used to evaluate model performance can also change with the situation. Nevertheless, a data scientist or machine learning engineer must manage this complex process and always be able to deploy the best model for each scenario. A technology that helps us manage this process is essential.

In software development, Git, along with various workflow structures, is used to track code changes and manage software versions. This technology expedites the development process and enables us to collaborate easily. However, we still do not have a widely adopted management system in machine learning, mostly due to its complexity. …


A list of lessons that I learned over the years to manage its risks

Photo by sebastiaan stam on Unsplash

If you are a data scientist or an AI executive, I am certain this sentence is familiar to you: "We need more data to build a better machine learning model". This is a factual statement; however, it must be applied with caution. Before conducting a large-scale data collection, you must answer several important questions such as "How confident are we that new data will improve the model performance?" or "How can we gain more confidence in the quality of newly recorded data?". Large-scale data collection is an expensive process, so you must make a solid plan before conducting it. If the data science team cannot take advantage of newly recorded data in a reasonable time, for any reason, company dynamics may be affected. …


Along with some insights to design your own ensemble classifier

Photo by Zany Jadraque on Unsplash

I wrote this article to describe the significance of ensemble methods in industry through my own story. I stay away from explaining the classical methods, since various resources already cover them. Instead, I want to show you the strength of ensemble methods and how you can harness it. I do this by sharing the story of an ensemble method that we invented years ago and that is still alive under Facebook ownership. An ensemble method, whether a classic one or one you design yourself, can work much better than you expect. I promise.
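Before the story, here is a minimal taste of a classical ensemble: a soft-voting classifier built with scikit-learn. The dataset and base learners are illustrative only, not the models from the story below.

```python
# A minimal soft-voting ensemble sketch with scikit-learn; the dataset and
# base learners are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Diverse base learners make errors in different places, which is what
# lets the ensemble outperform its individual parts.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=5000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",  # average predicted probabilities instead of hard labels
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```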

First, let me tell you how the story started. I was the lead of the machine learning team at a start-up company working on a smart gadget. The gadget aimed to identify hand gestures based on muscle signals; it had eight sensors sitting on the forearm and recording those signals. We were at an early stage, and the product was not ready. Our prototype surprised us many times with its data-recording behavior. Plus, we were not certain which gestures should be targeted. Nevertheless, we had to design a gesture recognition engine from the early days, since we had demos here and there. …


An important data science skill that you may have overlooked before.

Photo by Masha Kotliarenko on Unsplash

When you want to measure the similarity between two objects, whether documents or images, you always encounter important questions such as "What is the right way of defining similarity?", "How can we measure the similarity?", and "How should we analyze the similarity metric?". I have been fascinated with the concept of similarity for a long time, which is why I want to share some insights that I have gained over the years. I cannot describe all the details in one article, but I will try to shed as much light on this topic as I can.

— Anything less than being identical is subject to interpretation.

We humans have a complex system for interpreting similarity. The concept of similarity, regardless of how it must be measured, varies across contexts and problems. If two objects are completely similar, there is no room for interpretation. However, anything less than being identical is subject to interpretation. An object can be anything, such as a time series, a document, or an image. …
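Even the choice of metric shapes what "similar" means. Below is a minimal sketch with toy vectors, which could stand in for document embeddings: cosine similarity ignores magnitude, while Euclidean distance does not.

```python
# A minimal sketch showing that "similarity" already depends on the metric;
# the vectors are toy stand-ins for document or image embeddings.
import numpy as np
from scipy.spatial.distance import cosine, euclidean

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                 # same direction, very different magnitude

print(1 - cosine(a, b))    # cosine similarity: 1.0, identical direction
print(euclidean(a, b))     # Euclidean distance: ~33.7, far apart
```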


A deeper look into the most common questions in clustering.

Photo by Vladimir Mokry on Unsplash

When you want to cluster a dataset with no labels, one of the most common questions you encounter is "What is the right number of clusters?". This question often arises when you work with, for example, the k-means algorithm, which requires you to fix the number of clusters. I have encountered this question many times, and I am sure you have as well.

The problem becomes more controversial when you work with high-dimensional data. Real-world data is often high-dimensional, and you need to reduce its dimensionality to visualize and analyze it. The clustering results can differ between the original space and the dimension-reduced space: you used the same algorithm, yet you see a discrepancy. …
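One common heuristic for the first question is to compare silhouette scores across candidate cluster counts. Here is a minimal sketch on synthetic data; the blobs and the range of k are illustrative only.

```python
# A minimal sketch of picking k by comparing silhouette scores; the
# synthetic blobs and the candidate range of k are illustrative only.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # the peak suggests k = 4
```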


Through 3 optimization methods: dynamic programming, simulated annealing, and 2-opt.

Photo by Caleb Jones on Unsplash

I am sure you have already heard about the traveling salesman problem, or TSP. There are many applications of this problem and also many solutions with different performances. Here, I want to share my recent experience solving the traveling salesman problem, especially with the 2-opt method, which is one of the easiest, yet effective, methods for this problem.

If you just want to read about 2-opt, you can jump directly to the end of this article. You can also use the Python package that I developed below to solve a TSP.

In this article, I want to share my experience solving a TSP with 120 cities to visit. The problem had to be solved in less than 5 minutes to be used in practice. I aimed to solve this problem with the following…
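The teaser cuts off there, but as a generic preview of the 2-opt idea, here is a bare-bones sketch: keep reversing tour segments as long as a reversal shortens the route. This is an illustration of the method, not the author's package.

```python
# A bare-bones 2-opt sketch: reverse tour segments while any reversal
# shortens the route. A generic illustration, not the package above.
import math
import random

def tour_length(tour, pts):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                # Reverse the segment tour[i:j] and keep it if shorter.
                candidate = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(candidate, pts) < tour_length(tour, pts):
                    tour, improved = candidate, True
    return tour

random.seed(0)
points = [(random.random(), random.random()) for _ in range(30)]
route = two_opt(list(range(30)), points)
print(tour_length(list(range(30)), points), "->", tour_length(route, points))
```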

About

Pedram Ataee, PhD

An AI architect who is on a mission to inspire the next generation of innovators. Author of “Artificial Intelligence: Unorthodox Lessons” at amzn.to/3jgoKBN
