Natural Language Processing (Part 26): Word by Word and Word by Doc

Coursesteach
5 min readJan 14, 2024


📚Chapter 3: Vector Space Model

Introduction

In this tutorial, you’ll learn how to construct vectors based on a co-occurrence matrix. Depending on the task you are trying to solve, several designs are possible. You will also see how to encode a word or a document as a vector.

Let me show you how to do this. To get a vector space model using a word-by-word design, you will build a co-occurrence matrix and extract vector representations for the words in your corpus. You can get a vector space model using a word-by-document design with a similar approach. Finally, I’ll show you how, in a vector space, you can find relationships between words and vectors, also known as their similarity.

Sections

Word-by-Word Design
Word-by-Document Design
Vector Spaces

Section 1: Word-by-Word Design

The co-occurrence of two different words is the number of times they appear together in your corpus within a certain word distance k. For instance, suppose your corpus contains two sentences. The row of the co-occurrence matrix corresponding to the word data, with a k value of 2, would be populated with the following values. For the column corresponding to the word simple, you’d get a value of 2, because data and simple co-occur in the first sentence within a distance of one word, and in the second sentence within a distance of two words. If you consider the co-occurrences with the words simple, raw, like, and I, the vector representation of the word data would be [2, 1, 1, 0]. With a word-by-word design, you get a representation with n entries, where n ranges from one to the size of your entire vocabulary.
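The counting step above can be sketched in a few lines of Python. The two example sentences below are illustrative assumptions (the original lesson shows them in a figure); they are chosen so that they reproduce the counts described in the text:

```python
from collections import defaultdict

def cooccurrence(sentences, k=2):
    """Count how often each pair of words appears within distance k."""
    counts = defaultdict(int)
    for sent in sentences:
        tokens = sent.lower().split()
        for i, word in enumerate(tokens):
            # Look only ahead up to k positions; record the pair both ways.
            for j in range(i + 1, min(i + k + 1, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

# Hypothetical corpus consistent with the example counts in this section.
sentences = ["I like simple data", "I prefer simple raw data"]
counts = cooccurrence(sentences, k=2)

# Row of the co-occurrence matrix for the word "data".
row = [counts[("data", w)] for w in ["simple", "raw", "like", "i"]]
print(row)  # [2, 1, 1, 0]
```

Note that "data" and "I" never fall within two words of each other in either sentence, which is why the last entry is 0.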

Section 2: Word-by-Document Design

For a word-by-document design, the process is quite similar. In this case, you count the number of times that words from your vocabulary appear in documents that belong to specific categories. For instance, you could have a corpus consisting of documents on different topics, like entertainment, economy, and machine learning. Here, you’d count the number of times that your words appear in the documents belonging to each of the three categories. In this example, suppose that the word data appears 500 times in documents from your corpus related to entertainment, 6,620 times in economy documents, and 9,320 times in documents related to machine learning. The word film appears in each document category 7,000, 4,000, and 1,000 times respectively. Can you get a sense of where this is going already?
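Using the counts from the example, the word-by-document matrix can be represented directly, and each category column read off as a two-dimensional vector. This is just a sketch of the bookkeeping; the dictionary layout is one possible choice, not a prescribed data structure:

```python
# Word-by-document counts from the running example.
counts = {
    "data": {"entertainment": 500, "economy": 6620, "machine_learning": 9320},
    "film": {"entertainment": 7000, "economy": 4000, "machine_learning": 1000},
}

categories = ["entertainment", "economy", "machine_learning"]

# Each category column becomes a vector: (count of "data", count of "film").
vectors = {c: (counts["data"][c], counts["film"][c]) for c in categories}
print(vectors["entertainment"])  # (500, 7000)
```

Reading the matrix by rows instead would give you vector representations for the words data and film, which is the view the next section builds on.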

Section 3: Vector spaces

Once you’ve constructed the representations for multiple sets of documents or words, you’ll get your vector space. Let’s take the matrix from the last example. Here, you could take the representations for the words data and film from the rows of the table. However, I’ll take the representation for each category of documents by looking at the columns. The vector space will have two dimensions: the number of times that the words data and film appear in each type of document. You’d have one vector representation for the entertainment category, one for the economy category, and one for the machine learning category. Note that in this space it is easy to see that the economy and machine learning documents are much more similar to each other than either is to the entertainment category.

Coming up soon, you’ll make comparisons between vector representations using the cosine similarity and the Euclidean distance in order to get the angle and distance between them. So far, you’ve seen how to get vector spaces with two different designs, word by word and word by document, by counting either the co-occurrence of words with each other or the occurrence of words in the documents of a corpus. I’ve also shown you that in vector spaces you can determine relationships between types of documents, like similarity. Now you’re becoming more and more familiar with these vector spaces. You’ve seen several possible designs that you can use to solve a specific task, and you’ve seen how you can encode words or tweets as vectors. In the next tutorial, you’ll learn about a new similarity metric that will allow you to compare two vectors: the Euclidean distance.
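The Euclidean distance is covered formally in the next tutorial, but as a quick numeric check of the claim above, you can already compare the three category vectors from the example. This is a minimal sketch; `math.dist` computes the Euclidean distance between two points:

```python
import math

# Category vectors: (count of "data", count of "film") per category.
vecs = {
    "entertainment": (500, 7000),
    "economy": (6620, 4000),
    "machine_learning": (9320, 1000),
}

d_econ_ml = math.dist(vecs["economy"], vecs["machine_learning"])
d_econ_ent = math.dist(vecs["economy"], vecs["entertainment"])

# Economy and machine learning documents are closer to each other
# than economy is to entertainment.
print(d_econ_ml < d_econ_ent)  # True
```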

Please follow and 👏 clap for Coursesteach to see the latest updates on this story.

If you want to learn more about these topics: Python, Machine Learning, Data Science, Statistics for Machine Learning, Linear Algebra for Machine Learning, Computer Vision, and Research.

Then log in and enroll in Coursesteach to get fantastic content in the data field.

Stay tuned for our upcoming articles where we will explore specific topics related to NLP in more detail!

Remember, learning is a continuous process. So keep learning and keep creating and sharing with others!💻✌️

Note: if you are an NLP expert and have good suggestions to improve this blog, please share them in the comments and contribute.

If you want more updates about NLP, or want to contribute, then follow and enroll in the following:

👉Course: Natural Language Processing (NLP)

👉📚GitHub Repository

👉 📝Notebook

Do you want to get into data science and AI and need help figuring out how? I can offer you research supervision and long-term career mentoring.
Skype: themushtaq48, email:mushtaqmsit@gmail.com

Contribution: We would love your help in making the Coursesteach community even better! If you want to contribute to some courses, or if you have any suggestions for improving any Coursesteach content, feel free to contact us and follow.

Together, let’s make this the best AI learning Community! 🚀

👉WhatsApp

👉 Facebook

👉Github

👉LinkedIn

👉Youtube

👉Twitter

Source

1- Natural Language Processing with Classification and Vector Spaces
