Summarizing Text using In-database NLP through the Integration of Hugging Face with MindsDB

Zubeen
6 min read · Jan 19, 2023

Over the last decade, the amount of available data has grown exponentially, in forms and types that few could have imagined. A large part of this data is textual, and with NLP gaining popularity, processing it the right way opens up applications that were previously inconceivable. Text summarization is one such application: it condenses information and knowledge so that they can be used more effectively.

So how do you summarize data?

To accomplish this task, you can use MindsDB’s NLP engine, which is powered by Hugging Face, a leading open-source NLP library that provides a wide range of pre-trained models for extracting insights from textual data. But before getting into the nitty-gritty of summarization, let's first have a look at MindsDB, the tool that you will be using to achieve this.

MindsDB is an open-source tool that allows users to build, train, and deploy machine learning models for making predictions based on data. It integrates artificial intelligence and machine learning into databases to help teams that work with everyday data identify patterns, forecast trends, and train models. It is simple to use and easy to integrate into a wide range of applications. Accessing MindsDB is simple, and there are two ways of doing it; this tutorial uses MindsDB Cloud.

Pre-requisites

For the purpose of this tutorial, you can use the MindsDB Cloud. To do so, you first need to create an account on MindsDB Cloud by following the steps in the tutorial below.

Condensing text documents into summaries that contain the key pieces of information not only saves a user's time but also increases efficiency. Generating summaries of news articles, research papers, legal documents, and other long-form documents makes them more easily digestible.

For the next step, upload your dataset to the MindsDB Cloud environment. You can use the CNN-DailyMail dataset, which consists of news articles for which you can generate summaries. You can access this publicly available dataset from here.

Connecting your Data

To upload the dataset you can use the following steps:

  • Log in to your MindsDB Cloud account.
  • Navigate to the Add data section by clicking the Add data button located in the top right corner.
  • Choose the Files tab.
  • Click on the Import File option.
  • Upload the file (e.g., news_summaries.csv), name the table that will store the file data (newsummaries in this tutorial), and click the Save and Continue button.
Uploading your Dataset to MindsDB Cloud

Once the dataset has been uploaded, the editor will provide you with some sample queries to verify your upload. Running the following query fetches the first 10 records of your newly uploaded newsummaries table.

SELECT * FROM files.newsummaries LIMIT 10;

The output would look something like this:

The dataset consists of two columns, text and ctext.
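Since the model built later in this tutorial reads only the ctext column, it can help to look at that column on its own first. The query below is a minimal sketch, assuming the table was saved as newsummaries as described above:

```sql
-- Preview only the full-article column that will be summarized.
-- Assumes the uploaded file was stored as files.newsummaries.
SELECT ctext
FROM files.newsummaries
LIMIT 3;
```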

The next step is to create a model. MindsDB’s NLP engine is currently powered by Hugging Face, and for text summarization, you can employ either of the two models:

To use the Google Pegasus model, you can use the following query:

CREATE MODEL mindsdb.hf_peg_sum_20
PREDICT SUMMARY
USING
engine = 'huggingface',
task = 'summarization',
model_name = 'google/pegasus-xsum',
input_column = 'ctext',
min_output_length = 10,
max_output_length = 20;

The above query names the model hf_peg_sum_20 and sets the input column to ctext, which contains the complete text of the news articles. This is the column your model reads to generate the summaries. The min_output_length and max_output_length parameters set the lower and upper bounds on the length of each generated summary (Hugging Face models generally measure this length in tokens, so the resulting word count may differ slightly).
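If the summaries turn out too terse, one option is to create a second model with wider length bounds. The sketch below reuses the same parameters as the query above; the model name hf_peg_sum_60 and the values 30 and 60 are hypothetical choices for illustration, not recommendations from MindsDB or Hugging Face:

```sql
-- Hypothetical variant of the model above with longer summaries.
-- Only the model name and the two length bounds differ.
CREATE MODEL mindsdb.hf_peg_sum_60
PREDICT SUMMARY
USING
engine = 'huggingface',
task = 'summarization',
model_name = 'google/pegasus-xsum',
input_column = 'ctext',
min_output_length = 30,
max_output_length = 60;
```

Keeping the length bounds in the model name, as the tutorial does with hf_peg_sum_20, makes it easier to tell the variants apart later.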

Upon successful execution of the above query, you can verify the model created, using:

SELECT *
FROM mindsdb.models
WHERE name = 'hf_peg_sum_20';

Running this query would give you details about your model and the output should look something like this:

The status of the model will initially be generating, and processing takes a couple of minutes. You can re-run the query to check the status again. Once processing is done, the status should change to complete.
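When re-running the status check, selecting just a few columns keeps the output readable. This is a sketch that assumes the mindsdb.models table exposes status and error columns; if your MindsDB version names them differently, fall back to SELECT *:

```sql
-- Narrow status check for the model created above.
-- The status and error column names are assumptions about mindsdb.models.
SELECT name, status, error
FROM mindsdb.models
WHERE name = 'hf_peg_sum_20';
```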

To validate your model you can run a query with a sample text to see whether the summaries are being generated in the appropriate column or not:

SELECT *
FROM mindsdb.hf_peg_sum_20
WHERE ctext = 'In a shocking turn of events, it has been revealed that the
CEO of a major tech company was involved in a widespread fraud scheme. The
CEO, who has not yet been named, is accused of embezzling millions of
dollars from the company and using the money to fund lavish vacations
and luxury purchases. The companys stock price has plummeted in the wake
of the revelation, and the CEO has been fired. Many are shocked by the news,
as the company had always been seen as a leader in the industry and the
CEO was highly respected. The investigation is ongoing, and it remains to
be seen what the full extent of the fraud will be.';

Once the above query is executed, you will see a new column named SUMMARY in the output, along with the complete text of the news article as shown below.
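To focus on the result, the same kind of query can select just the input and the generated column. The short sample text below is made up for illustration; note that a single quote inside the text would need to be escaped or left out so that it does not end the SQL string early:

```sql
-- Return only the input text and the generated summary.
-- The sample sentence is an invented example.
SELECT ctext, SUMMARY
FROM mindsdb.hf_peg_sum_20
WHERE ctext = 'The city council voted on Tuesday to approve a new budget that increases funding for public transport and road repairs, while cutting spending on administrative costs.';
```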

The next step is to use this model to iterate over your newsummaries table and generate summaries for the news articles. You can use the following query to test it on a single record:

SELECT *
FROM (SELECT * FROM files.newsummaries LIMIT 1) AS input
JOIN mindsdb.hf_peg_sum_20 AS model;

The generated summary of the first record of your dataset will appear under the column named SUMMARY. The output should look something like this:

Summary generated for the news article in the ctext field
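The same join pattern extends to more records by raising the limit in the subquery. The sketch below summarizes ten articles and returns only the article text and its summary; the limit of 10 is an arbitrary choice, and larger batches will take proportionally longer to process:

```sql
-- Summarize the first 10 articles instead of a single record.
-- input.ctext is the full article; model.SUMMARY is the generated text.
SELECT input.ctext, model.SUMMARY
FROM (SELECT * FROM files.newsummaries LIMIT 10) AS input
JOIN mindsdb.hf_peg_sum_20 AS model;
```

Starting with a small batch like this is a cheap way to check summary quality before running the model over the whole table.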

Conclusion

In the above example, you saw how easy it is to generate summaries from complete text fields with the help of some simple SQL queries. You can extend this to generate summaries over the entire dataset. You can also apply these steps to data stored in any of your databases by integrating MindsDB with them to fulfill your predictive needs. MindsDB’s open-source technology simplifies the process of using machine learning for those who may not have expertise in the field.

MindsDB is aggressively expanding its portfolio to support more and more models. If you wish to explore other NLP use cases, such as sentiment analysis, zero-shot classification, or translation of textual data, feel free to explore more at MindsDB.com. Sign up for a free cloud account and get started in under 5 minutes! If you are stuck and need help, feel free to reach out on Slack or through the GitHub community.
