Term Frequency Statistics in PostgreSQL Full-Text Searches
By Digoal.
Term Frequency (abbreviated TF)-how frequently an expression occurs in a document or other body of text-and Inverse Document Frequency (IDF)-a measure for determining whether a term is common or rare in a given document or corpus-are common terms in the text analysis and text mining fields. The two of these are often used in cooperation with each other.
Previously, I wrote about how the capabilities of PostgreSQL when combined with the cosine and other linear correlation algorithms frequently used in text, image, and array similarity fields can serve as a basis of text and keyword analysis, and more specifically term frequency and inverse document frequency.
PostgreSQL supports the full-text search, and the text vector type, tsvector.
Now you may be asking how do you find terms or analyze term frequency in a pile of texts using PostgreSQL? Well, in this article, we will look at how you can use ts_stat
and MADlib
for term frequency and Inverse document frequency analysis.
Method 1: Using ts_stat
The first method that we will cover in this blog comes from the built-in function of PostgreSQL, ts_stat, which is used to generate lexeme statistics. For example, in a Q&A knowledge base, you could look into which particular term appears most and in how many texts does it appear?
The function ts_stat
is described in detail in these following documents:
Now that we’re all on the same page, let’s move on to gathering some document statistics. The function ts_stat
is useful for checking your configuration and for finding stop-word candidates, which are words that extremely frequent in a text or just in general-think 'the', 'of' and 'one', for example.
ts_stat(sqlquery text, [ weights text, ]
OUT word text, OUT ndoc integer,
OUT nentry integer) returns setof record
In the above code, sqlquery
is a text value containing an SQL query that must return a single tsvector
column. ts_stat
executes the query and returns statistics about each distinct lexeme (or word) contained in the tsvector
data. The columns returned are:
word text
: the value of a lexeme (that is, a term or word).ndoc integer
: the number of documents (tsvectors
) the word occurred in.nentry integer
: the total number of occurrences of the word.
If weights are supplied, only occurrences having one of those weights are counted. For example, to find the ten most frequent words (excluding your stop-word candidates) in a document collection, you can do this method:
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
This next one is the same, but with counting only word occurrences with a weight A or B:
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
Testing This Method
To test out this method, you’ll want to first create a function to generate random strings;
create or replace function gen_rand_str(int) returns text as
$$
select substring(md5(random()::text), 4, $1); $$
language sql strict stable;
Then, you’ll want to create a function to generate several random terms.
create or replace function gen_rand_tsvector(int,int) returns tsvector as
$$
select array_to_tsvector(array_agg(gen_rand_str($1))) from generate_series(1,$2); $$
language sql strict;
postgres=# select gen_rand_tsvector(4,10);
gen_rand_tsvector
-----------------------------------------------------------------------
'21eb' '2c9c' '4406' '5d9c' '9ac4' 'a27b' 'ab13' 'ba77' 'e3f2' 'f198'
(1 row)
Next, create a test table and write test data.
postgres=# create table ts_test(id int, info tsvector);
CREATE TABLE
postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(4,10);
INSERT 0 100000
Following this, check the term frequency to see how many times the term has appeared and in how many texts (records) it has appeared.
postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
word | ndoc | nentry
------+------+--------
e4e6 | 39 | 39
9596 | 36 | 36
a84c | 35 | 35
2b44 | 32 | 32
5146 | 32 | 32
92f6 | 32 | 32
cd56 | 32 | 32
fd00 | 32 | 32
4258 | 31 | 31
5f18 | 31 | 31
(10 rows)
Next, write another batch of test data, and check the term frequency to see how many times the term has appeared and in how many texts (records) it has appeared.
postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(2,10);
INSERT 0 100000
postgres=# SELECT * FROM ts_stat('SELECT info FROM ts_test')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
word | ndoc | nentry
------+------+--------
30 | 4020 | 4020
a7 | 4005 | 4005
20 | 3985 | 3985
c5 | 3980 | 3980
e6 | 3970 | 3970
f1 | 3965 | 3965
70 | 3948 | 3948
5e | 3943 | 3943
e4 | 3937 | 3937
2b | 3934 | 3934
(10 rows)
Method 2: Using MADlib
Next, we will look how you can use MADlib. MADlib, in reality, also provides training functions for term frequency statistics, which you can find here. You’ll need to understand these functions to be able to use this method.
Next, when it comes to Term frequency, tf(t,d)
is to the raw frequency of a word/term in a document, such as the number of times that word or term t
occurs in document d
. Note that for this function, 'word' and 'term' are used interchangeably-as they can both be understood as lexemes. Also, note that the term frequency is not normalized by the document length.
This is the main part of what we need tro use MADlib.
term_frequency(input_table,
doc_id_col,
word_col,
output_table,
compute_vocab)
To understand the code above, let’s look at the arguments in detail.
input_table
: The name of the table storing the documents. Each row is in the form<doc_id, word_vector>
wheredoc_id
is anid
, unique to each document, andword_vector
is a text array containing the words in the document. Theword_vector
should contain multiple entries of a word if the document contains multiple occurrence of that word.id_col
: The name of the column containing the document id.word_col
: The name of the column containing the vector of words or terms (lexeme) in the document. This column should of type that can be cast to.output_table
: The name of the table to store the term frequency output. The output table contains the following columns:id_col
: This the document id column (same as the one provided as input).word
: A word or term (that is, a lexeme) present in a document. This is either the original word present inword_col
or an id representing the word (depending on the value ofcompute_vocab
below).count
: The number of times this word is found in the document.compute_vocab
: This argument is Optional, and by default it is set toFALSE
. Flag this argument to indicate if a term is to be created. IfTRUE
, an additional output table is created containing the vocabulary of all words, with an id assigned to each word. The table is calledoutput_table_vocabulary
(suffix added to theoutput_table
name) and contains the following columns:wordid
: An id assignment for each wordword
: The word/term
References
- http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
- https://www.postgresql.org/docs/devel/static/functions-textsearch.html
- https://www.postgresql.org/docs/devel/static/textsearch.html
- https://www.postgresql.org/docs/devel/static/textsearch-features.html
- http://madlib.incubator.apache.org/