Greenplum for Data Science Blog Series Part 2: Advanced Text Analytics & Text Search on Greenplum with GPText.

Published in

Greenplum Data Clinics

9 min readMar 20, 2023

This blog is a follow-up (Part 2) to the Greenplum for Data Science Blog Series Part 1: Big Data Analysis with SQL and Python post. This article covers how to use Greenplum Text Analytics functions to tackle data science projects from experimentation to massive deployment.

Context

Greenplum reduces data silos by providing a single, scale-out environment for converging advanced analytic and operational workloads.

In the previous blog, we have seen how to harness the power of Greenplum and combine its SQL & MADlib with Python for Data Exploration.

Today, we will discover Greenplum's Advanced Text Analytics capability to process and search mass quantities of textual data using: Greenplum Text (or GPText).

Greenplum Text joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search.
It includes powerful text search as well as support for text analysis.
Enables text analytics for multiple document sources: Greenplum Text can index documents stored in Greenplum Database tables or records retrieved from external stores, such as HTTP or FTP servers, Amazon S3 or other S3-compatible storage, or Hadoop HDFS.

Dataset Overview

In the entire series, a Daily Financial News for stocks dataset from Kaggle with Licence CC0: Public Domain has been inspired to use. We will use Greenplum and follow the data science cycle (Data Exploration — Data Preparation — Data Modeling — Model Evaluation — Model Deployment) to train an NLP model to process Sentimental Analysis for news.

Previously in Part 1, we cleaned and preprocessed data and stored them in the ds_demo.financial_news table:

%%sql
SELECT * FROM ds_demo.financial_news LIMIT 1;

So, let’s get started by showing you how to index data and perform simple and advanced (complex) searches with GPText.

GPText Indexing

GPText brings text analytics capabilities to the Greenplum parallel platform. It uses the industry standard Solr libraries for analysing and indexing text. All of the functionality is exposed through SQL and can be combined with all the other features of the Greenplum Database engine.

Create an empty Index

Firstly, we create a GPText index on our table using the create_index() database function:

%%sql
SELECT * FROM gptext.create_index('ds_demo', 'financial_news', 'docid', 'content');

2. Enable terms table

Then, to go beyond just searching our documents, we will enable and extract the term vectors from the content field:

%%sql
SELECT gptext.enable_terms('warehouse.ds_demo.financial_news', 'content');

3. Populate the Index

Then, we will populate our data index:

%%sql
SELECT * FROM gptext.index(
    TABLE(select * from ds_demo.financial_news), 
    'warehouse.ds_demo.financial_news'
);

4. Commit the Index

Finally, we will commit the transaction:

%%sql
SELECT * FROM gptext.commit_index('warehouse.ds_demo.financial_news');

With the GPText Index populated and committed, we can now search our documents inside the database using a fully featured search engine instead of just regular expressions.

Simple GPText search options

To process a search, you can run queries with the gptext.search() function, which uses the following syntax:

gptext.search(<src_table>, <index_name>, <search_query>, <filter_queries>[, <options>])

For example, try searching for the word hsbcin the dataset:

%%read_sql search_example

SELECT t1.docid
    , t1.content
    , t2.score -- the TF-IDF score
FROM 
  ds_demo.financial_news t1,
  (
    SELECT *
    FROM gptext.search(
            TABLE(SELECT * FROM ds_demo.financial_news)
            , 'warehouse.ds_demo.financial_news' -- source table
            , 'hsbc' -- index name
            , NULL -- filter queries, if none, set this parameter to NULL
        )
  ) t2
WHERE t1.docid::text = t2.id
ORDER BY score DESC
;

Or, exercise a far more powerful feature of GPText, and extract the term vectors from the text search engine into a database table:

%%sql 
CREATE TABLE ds_demo.stock_market_terms AS
  SELECT * FROM gptext.terms( 
    TABLE( SELECT 1 SCATTER BY 1 )
    , 'warehouse.ds_demo.financial_news'
    , 'content'
    , '*:*'
    , NULL 
  );

Advanced GPText search options

Use AND to search for more than one terms

In the following example, we search in theds_demo.financial_news index for news containing both search terms complaint and hsbc. The rows=5 argument is a Solr option that specifies the top 5 results to be returned from each segment. On a Greenplum Database cluster with 24 segments, this query returns up to 24 * 5 = 120 rows.

%%read_sql df_advanced_search
SELECT a.docid, a.date, a.title 
FROM 
  ds_demo.financial_news a
  , gptext.search(
      TABLE(SELECT 1 SCATTER BY 1)
      , 'warehouse.ds_demo.financial_news'
      , '"complaint" AND "hsbc"'
      , NULL
      , 'rows=5'
    ) q
WHERE q.id::text = a.docid::text
LIMIT 5;

2. Use OR to search for either of two (or more) terms

This search finds news in thends_demo.financial_news index that contains either the search term jp morganor the termhsbc:

%%read_sql df_advanced_search
SELECT a.docid, a.date, a.title 
FROM 
  ds_demo.financial_news a
  , gptext.search(
      TABLE(SELECT 1 SCATTER BY 1)
      , 'warehouse.ds_demo.financial_news'
      , '"jp morgan" OR "hsbc"'
      , NULL, 'rows=5'
    ) q
WHERE q.id::text = a.docid::text
;

3. Using AND & OR with parentheses

The following example searches the ds_demo.financial_news table for News that contains the text stock either outperform or bullish, and filters out the Apple (AAPL) stock:

%%sql
SELECT t.docid, t.date, q.score, t.content
FROM ds_demo.financial_news t,
     gptext.search(
       TABLE(SELECT * FROM ds_demo.financial_news)
       , 'warehouse.ds_demo.financial_news'
       , '(stock AND (outperform OR bullish))'
       , '{stock:AAPL}', 'rows=5'
      ) q
WHERE t.docid = q.id::int4 
ORDER BY score DESC;

4. Using multiple AND with parentheses

Search for news mentioning the Apple x Shazam acquisition:

%%sql
SELECT t.docid, t.date, q.score, t.title
FROM ds_demo.financial_news t,
     gptext.search(
       TABLE(SELECT * FROM ds_demo.financial_news)
       , 'warehouse.ds_demo.financial_news'
       , '((merger and acquisition) AND (apple) AND (Shazam))'
       , NULL, 'rows=5'
      ) q
WHERE t.docid = q.id::int4 
ORDER BY score DESC
LIMIT 5;

5. Faceted field search

The faceted field search function gptext.faceted_field_search() breaks query results into multiple categories, with a count of the number of documents in the index for each category.

Here is an example where we count the number of news per publisher which mention the AAPLstock more than 40 times:

%%sql
SELECT * FROM gptext.faceted_field_search(
    'warehouse.ds_demo.financial_news', '*:*', '{stock:AAPL}', '{publisher}', -1, 40
);

6. Highlight search terms

GPText highlighting inserts markup tags before and after each occurrence of the search terms in a query. For example, if the search term is AAPL, each occurrence of AAPLin the field is marked up as:

<em>AAPL</em>

In the example below, the gptext.highlight() function returns News with the text vmware highlighted in the content field, up to one row from each segment. On a Greenplum database cluster with 24 segments, this query returns up to 24 rows, aka 24 * 5 = 120 rows.

%%sql
SELECT t.docid, gptext.highlight(t.content, 'content', s.hs)
FROM ds_demo.financial_news t,
     gptext.search(
       TABLE(SELECT 1 SCATTER BY 1)
       , 'warehouse.ds_demo.financial_news'
       , '{!gptextqp}vmware', NULL
       , 'rows=1&hl=true&hl.fl=content' 
      ) s
WHERE t.docid = s.id::int8
LIMIT 3;

7. Search non-default fields (Include and exclude terms)

This example searches for news that has crypto in the content column but that does not have bitcoin in the title column:

%%sql
SELECT a.docid, a.date, a.title, q.score
FROM ds_demo.financial_news a,
      gptext.search(
        TABLE(SELECT 1 SCATTER BY 1)
        , 'warehouse.ds_demo.financial_news'
        , 'crypto and -title:bitcoin'
        , NULL, NULL
      ) q 
WHERE q.id::int8 = a.docid
ORDER BY score DESC
LIMIT 5;

8. Proximity Search

Proximity search queries find documents that have search terms within a specified distance. The distance is measured as the number of term moves needed to make the terms adjacent.

The following search query finds News with the terms hsbcand barclays within five terms of each other:

%%sql
SELECT t.docid, s.score, t.content
FROM ds_demo.financial_news t,
      gptext.search(
        TABLE(SELECT 1 SCATTER BY 1)
        , 'warehouse.ds_demo.financial_news'
        , '"hsbc barclays"~5'
        , NULL, NULL
      ) s
WHERE s.id::int8 = t.docid
ORDER BY score DESC
LIMIT 2;

9. Surround query parser

The Surround query parser allows ordered and unordered proximity searches. The W operator specifies an ordered search, and the N operator determines an unordered search. The maximum distance between the terms is specified by prefixing the W or N operator with an integer, for example, 2W.

The following search finds news with the terms market, followed by bearish or bullish within 2 terms:

%%sql
SELECT a.content, q.score
FROM ds_demo.financial_news a, 
      gptext.search(
        TABLE(SELECT 1 SCATTER BY 1)
        , 'warehouse.ds_demo.financial_news' 
        , '{!gptextqp} market* 2W (bearish OR bullish)'
        , NULL, NULL
      ) q
WHERE a.docid = q.id::int8
LIMIT 1;

Named Entity Recognition with GPText

Greenplum Text includes Apache OpenNLP components that allow you also to use Named Entity Recognition (NER). Named entities examples include the names of people, organisations, and locations.

Enabling NER for GPText

You can refer to GPText Documentation to learn how to add NER support to a Tanzu Greenplum Text index.

2. Create and commit the Index

After adding NER support to the GPText index, let’s index the table ds_demo.financial_news and commit the index.

%%sql
SELECT * FROM gptext.index(
  TABLE(SELECT * FROM ds_demo.financial_news)
  , 'warehouse.ds_demo.financial_news'
);
SELECT * FROM gptext.commit_index('warehouse.ds_demo.financial_news');

3. Query Search: NER person

This query retrieves an array of locations for NER person terms in the news that contain NER persons.

%%sql
SELECT * FROM gptext.search(
  TABLE(SELECT 1 SCATTER BY 1)
  , 'warehouse.ds_demo.financial_news'
  , '_ner_person', NULL
  , 'hl=true&hl.fl=content&rows=10&sort=score desc'
);

4. Query Search: NER organisation

This query retrieves the content of news in the ds_demo.financial_news table with terms tagged _ner_organization highlighted.

%%sql
SELECT news_demo.docid
       , gptext.highlight(news_demo.content, 'content', hs) AS content
       , s.score 
FROM ds_demo.financial_news news_demo
     , gptext.search(
          TABLE(SELECT 1 SCATTER BY 1)
          , 'warehouse.ds_demo.financial_news'
          , '{!gptextqp} _ner_organization', NULL
          , 'hl=true&hl.fl=content&rows=2&sort=score desc'
      ) s 
WHERE news_demo.docid = s.id::bigint 
ORDER BY s.score desc
LIMIT 1;

5. Query Search: NER person and time

This query retrieves news containing NER person and time terms (forward proximity search).

This query performs a proximity search to find news with a person term followed by a time term within the next seven terms.

%%sql
SELECT news_demo.docid
       , gptext.highlight(news_demo.content, 'content', hs) AS content
       , s.score 
FROM ds_demo.financial_news news_demo
     , gptext.search(
        TABLE(SELECT 1 SCATTER BY 1)
        , 'warehouse.ds_demo.financial_news'
        , '{!gptextqp} (_ner_person 7W _ner_time)', NULL
        , 'hl=true&hl.fl=content&rows=10&sort=score desc'
      ) s 
WHERE news_demo.docid = s.id::bigint 
ORDER BY s.score desc
LIMIT 1;

6. Query Search: Specified & any NER organisation

This query retrieves news with a specified NER organisation (AAPL) and any NER organisation (unordered proximity search).

Like the previous example, this query performs a proximity search, but the terms can appear in the document in either order and must be within ten terms of each other.

%%sql
SELECT news_demo.docid
       , gptext.highlight(news_demo.content, 'content', hs) AS content
       , s.score 
FROM ds_demo.financial_news news_demo
     , gptext.search(
          TABLE(SELECT 1 SCATTER BY 1)
          , 'warehouse.ds_demo.financial_news'
          , '{!gptextqp} (_ner_organization_Apple 10N _ner_organization)', NULL 
          , 'hl=true&hl.fl=content&rows=10&sort=score desc'
      ) s 
WHERE news_demo.docid = s.id::bigint 
ORDER BY s.score desc
LIMIT 1;

Conclusion

To conclude, Greenplum Text includes powerful functions for text search, text analysis and Named Entity Recognition (NER) analysis. All the functions are exposed through SQL and can be combined with all the other features of the Greenplum Database engine.

Note: GPText also supports JSON index and will support rich documents (PDF and Word) in the next release.

In a later blog, we will discover how The Greenplum PL/Python extension can run machine learning and deep learning workloads in a Greenplum Database.

Thanks for reading! Any comments or suggestions are welcome! Check out other Greenplum articles here.