How to chunk text into paragraphs using Python
In this article, I want to show the approach we are going to use in our podcast summarization project. To summarize a text correctly, we first need to split it into meaningful parts: paragraphs.
· General approach
· Step 1: Embedding
· Step 2: Dot product/cosine similarity
· Step 3: Identifying split points
· Algorithm at work
· Step 4: Create a paragraphed text
· Final result
General Approach
We need to turn text into something that machines can understand: vectors. In Natural Language Processing, a vector representation of a text is called an embedding. There are two ways to create one:
- Create our own embedding, trained on texts similar to the ones we expect our function to accept (I will cover this in the next article);
- Use a pre-trained embedding.
The second option is faster and can already give us reasonable results. There are a number of pre-trained embeddings available, and in our case we are going to choose one of these.
Right now we will simply go for the embedding with the best overall performance: "all-mpnet-base-v2".
First things first — we load all the necessary packages and then start our process.
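A minimal setup sketch for the snippets in this article, assuming the sentence-transformers, scikit-learn, scipy, numpy and matplotlib packages (my choice of tooling, not necessarily the only one that works):

```python
# Install once: pip install sentence-transformers scikit-learn scipy matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import argrelextrema
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Load the pre-trained embedding model
model = SentenceTransformer('all-mpnet-base-v2')
```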
Step 1: Embedding
Now that we have a pre-trained model, the process is pretty straightforward. Let's say we have some random text:
Let me tell you a little story. When I was a little kid I really liked to play football. I wanted to be like Messi and play at Camp Nou. However, I was really bad at it and now I’m not training at Camp Nou. I’m writing a medium article on chunking text.
Sad story, but it is not the point of this article. We want to turn this text into a vector representation:
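A sketch of the embedding step, continuing from the model loaded above (the naive split on '. ' is only good enough for this toy example):

```python
text = ("Let me tell you a little story. When I was a little kid I really liked to play football. "
        "I wanted to be like Messi and play at Camp Nou. However, I was really bad at it and now "
        "I'm not training at Camp Nou. I'm writing a medium article on chunking text.")

# Naive sentence splitting; a proper tokenizer would be better for real texts
sentences = [s.strip() for s in text.split('. ') if s.strip()]

# Encode every sentence into a 768-dimensional vector
embeddings = model.encode(sentences)
print(embeddings.shape)   # (5, 768)
```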
Magic happened: we just turned our 5 sentences into a 768-dimensional world! How is this useful? Well, now the sentences are vectors, and we can check how close (i.e. similar) those vectors are in the 768 dimensions. There is a very simple way to do that: the dot product.
Step 2: Dot product/cosine similarity
In simple words, a dot product will show how much one vector goes in the direction of another. If two vectors (sentences) point in the same direction we assume that they are similar. But let’s check this in practice.
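A quick check with NumPy, comparing the last sentences of the toy text (the exact numbers will depend on the model version; the value quoted below is from the author's run):

```python
# Dot products between neighbouring sentence vectors
print(np.dot(embeddings[2], embeddings[3]))   # Camp Nou vs. not training at Camp Nou
print(np.dot(embeddings[3], embeddings[4]))   # football vs. writing an article
```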
This is an impressive result, given we only used a few lines of code. We can see that the 5th sentence goes in a different direction than the 4th one (-0.07). We successfully distinguished the meaning of the last sentence, about writing this article, from the sentences about football.
But, of course, there is a much better way to see all sentence similarities at once: a similarity matrix. Sklearn has a handy cosine_similarity function for this. Why not use the dot product? Good question. When the vectors are normalized to the same length (magnitude), there is no difference between the dot product and cosine similarity; I only showed the dot product to explain how it works under the hood.
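A sketch of the similarity matrix and its heatmap (the colormap and axis labels are my choices for illustration):

```python
# Pairwise cosine similarity between every pair of sentence vectors
similarities = cosine_similarity(embeddings)

# Visualise the matrix as a heatmap: bright squares mark groups of related sentences
plt.imshow(similarities, cmap='RdBu_r')
plt.colorbar()
plt.xlabel('sentence index')
plt.ylabel('sentence index')
plt.show()
```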
There is an interesting pattern we can spot here. The red square in the middle is the part where I talk about football. Now, how would it look if we changed topics twice? Let's build up our text and plot the results.
Let me tell you a little story. When I was a little kid I really liked to play football. I wanted to be like Messi and play at Camp Nou. However, I was really bad at it and now I’m not training at Camp Nou. I’m writing a medium article on embeddings. In this article, I want to show how are we going to split a text into parts. We first embed sentences. Then we compute sentence similarities. After that, we detect the split point in the text. After finishing this process we will go play chess with friends.
You are probably already starting to see the pattern. We can see two different topics and their split points.
Step 3: Identifying split points
Now, something that is easy for humans to see is not necessarily easy for computers, so we need to turn this pattern into something a computer can detect. Here is what we do:
1. Take each diagonal to the right of the main diagonal; these hold the similarity of each sentence to the sentences that follow it;
2. Each sentence has a different number of sentences after it, so pad each diagonal with zeros at the end so that they are all the same length;
3. Stack those diagonals into a new matrix, so that we can apply an activation;
4. Apply activation weights to each row, so that the closest sentences get the biggest weight when determining similarity. In this case, I use a reversed sigmoid activation with a zero-padded tail (it will be clearer in the code);
5. Calculate the weighted sum of each row to create a single value per sentence, describing how similar it is to the sentences closest to it in the text.
This is a much easier-to-understand representation of the flow of our text. Once again we can see that the 4th sentence (index 3) is our splitting point. Now we do the final part:
6. Find the relative minima of our vector.
“Graphically, relative extrema are the peaks and valleys of the graph of a function, peaks being the points of relative maxima and valleys being the points of relative minima. The combination of relative maxima and minima is called the relative extrema.” More on the link.
Here is the code for completing all of the steps:
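The sketch below is one way to implement steps 1 through 6, continuing from the similarities matrix computed earlier; the helper names (rev_sigmoid, activate_similarities) and the p_size window are illustrative choices rather than fixed parts of the method:

```python
def rev_sigmoid(x: np.ndarray) -> np.ndarray:
    # Reversed sigmoid: weights near 1 for the closest sentences, falling towards 0
    return 1 / (1 + np.exp(0.5 * x))

def activate_similarities(similarities: np.ndarray, p_size: int = 10) -> np.ndarray:
    """Weighted similarity of every sentence to the sentences that follow it."""
    n = similarities.shape[0]
    # Reversed-sigmoid weights for the p_size nearest following sentences,
    # zero-padded so that more distant sentences do not contribute
    x = np.linspace(-10, 10, p_size)
    weights = np.pad(rev_sigmoid(x), (0, max(n - 1 - p_size, 0)))[:n - 1]
    # 1. Each k-th diagonal above the main one holds similarities of sentences
    #    to the sentence k positions ahead
    diagonals = [similarities.diagonal(k) for k in range(1, n)]
    # 2. Pad the diagonals with zeros so they all have the same length
    diagonals = [np.pad(d, (0, n - len(d))) for d in diagonals]
    # 3. Stack them into a matrix of shape (n - 1, n)
    diagonals = np.stack(diagonals)
    # 4. Weight each row: closer sentences get larger weights
    diagonals = diagonals * weights.reshape(-1, 1)
    # 5. Weighted sum per sentence
    return diagonals.sum(axis=0)

activated = activate_similarities(similarities, p_size=5)

# 6. Relative minima of the activated similarities are the candidate split points
minima = argrelextrema(activated, np.less, order=2)[0]
print(minima)
```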
Algorithm at work
Now, let’s change from small text to something that we are going to do in reality — chunking transcripts of long videos and podcasts.
Working with long texts
There is one important thing I have not mentioned yet: when you work with long texts, very short sentences create unexpected split points. The shorter a sentence is, the lower its possible similarity to other sentences. Generally speaking, the shorter a text is, the less information it contains, and so the fewer similarities can be found.
There are lots of smart ways to deal with this problem, but for the sake of demonstration we will use the simplest solution: we will shorten very long sentences and merge very short ones with their neighbours.
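A minimal sketch of this balancing step; the outlier thresholds (mean plus or minus two standard deviations of sentence length) and the comma-splitting rule are illustrative choices, not the only reasonable ones:

```python
# Sentence lengths in characters
lengths = [len(s) for s in sentences]

# Treat sentences far above or below the average length as outliers
long_limit = np.mean(lengths) + 2 * np.std(lengths)
short_limit = np.mean(lengths) - 2 * np.std(lengths)

balanced = []
for s in sentences:
    if len(s) > long_limit:
        # Shorten overly long sentences by splitting them at commas
        balanced.extend(part.strip() for part in s.split(',') if part.strip())
    elif balanced and len(s) < short_limit:
        # Merge overly short sentences into the previous one
        balanced[-1] = f'{balanced[-1]} {s}'
    else:
        balanced.append(s)

sentences = balanced
```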
Now we follow our steps.
1. Embed sentences and calculate cosine similarity;
2. Identify splitting points (a short recap of both steps in code follows below).
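Put together, continuing with the helpers defined earlier and assuming sentences now holds the length-balanced sentences of the transcript (the parameter values are illustrative):

```python
# The same pipeline as before, now on the full transcript
embeddings = model.encode(sentences)
similarities = cosine_similarity(embeddings)
activated = activate_similarities(similarities, p_size=10)

# Relative minima of the activated similarities are our splitting points
split_points = argrelextrema(activated, np.less, order=2)[0]
```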
Let’s zoom in on some parts so that we can really see what is happening.
Step 4: Create a paragraphed text
Once we have our splitting points, we are left with the easiest but most important part: applying them to the text.
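A minimal sketch, assuming split_points and sentences from the previous step:

```python
# Insert a paragraph break before every sentence marked as a split point
paragraphed_text = ''
for i, sentence in enumerate(sentences):
    if i in split_points:
        paragraphed_text += f'\n\n{sentence}. '
    else:
        paragraphed_text += f'{sentence}. '

print(paragraphed_text)
```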
Voilà! We have a paragraphed text of a thousand sentences.
Final result
Let's look at some of the splits we made to check whether they make sense. I'm not sure I can publish the whole text because of rights, so I took a few small parts.
In the year 1625, an Italian nobleman named Pietro de la Valet went on a tour of the Middle East .... At this time, travel in this region couldn't have been more dangerous. The Ottoman and Persian empires were at war, fighting over who would rule in Baghdad .... good bricks, most of which were stamped with certain unknown letters which appeared very ancient.-------------------------------------------------------------------
--------------------------------------------------------------------They departed in the dead of night and fled to safety. Across the desert .... What did the symbols on those broken pieces of clay mean. And if such a great city had once stood there, what in all the world could have happened to it.
My name is Paul Cooper, and you're listening to the Fall of Civilization's podcast. Each episode I look at a civilization of the past that rose to glory and then collapsed into the ashes of history. I want to ask what did they have in common. What led to their fall.
-------------------------------------------------------------------
-------------------------------------------------------------------
Each paragraph is separated by a new line, each new place in the text is marked by a dashed line, and the content of the paragraphs is shortened with "…".
We can see that the first two paragraphs are nicely separated even though they follow the same thought, and the next two paragraphs are separated precisely where the author starts to introduce himself. So overall I would say it has done a pretty good job.
Even though it's not always perfect, sometimes it misses a splitting point by one or two sentences like here:
Then I awoke, like a man drained of blood, who wanders alone in a waste. Thank you once again for listening to the Fall of Civilization's podcast. I'd like to thank my voice actors for this episode. Re Brignell, jake Barrett Mills, shem Jacobs, Nick Bradley and Emily Johnson. I love to hear your thoughts and responses on Twitter, so please come and tell me what you thought. You can follow me at Paul M.
This podcast can only keep going with the support of our generous subscribers on Patreon. You keep me running, you help me cover my costs, and you help keep the podcast ad free. ... If you enjoyed this podcast, please consider heading on to Patreon.com.
Here, in the first paragraph, we can see that the first sentence got there by mistake. However, the next two paragraphs are very well separated: the second one thanks the voice actors and the third one thanks the Patreon supporters.
What is next
- There are more sophisticated ways to do the steps mentioned above, which I will cover in upcoming articles;
- Now that we have paragraphs ready, we can turn them into “Chapters”. This will be the topic of my next article, so please don’t forget to follow! =);
- In the end, we will finish the summarization of the Sumerians video and publish the results with timestamps on YouTube.
Thank you all for reading! Please follow me on Medium and LinkedIn, and feel free to ask any questions. Star our podcast summarization project on GitHub if you liked the solution =).
This is the full code of the finished solution in a Jupyter notebook.