A Statistical Interpretation Of Term Specificity And Its Application In Retrieval

Sparck Jones, K.

Hafidz Jazuli
Classic Information Retrieval
Sep 10, 2017


Mrs. Jones gives an easy-to-understand and excellent explanation of both the problems and their solutions. Her explanations convinced me that we should treat each term in a postings list as a statistical product. Regardless of its semantic meaning, we can build term-weight variations measured by a term’s level of importance in a specific collection. For example, the term “farm-” might have weight 2.0 in an agriculture collection but only 1.0 in a computer science collection.

Old Definitions of Exhaustivity and Specificity

Exhaustivity represents the coverage of a document’s various topics as defined by its assigned terms. If more terms are assigned to a document, its exhaustivity increases; with a constant indexing vocabulary, the chance of the document matching various queries also increases. Intuitively, this can break our relevance-rank calculation, because some of the newly added terms may not have a reasonable level of importance. A solution is to define an optimum level of indexing¹ exhaustivity for a given document collection. The idea is that the average number of descriptors per document should be adjusted to avoid false drops² and maximize the number of relevant documents retrieved.

Specificity represents the level of detail of a term, i.e., a semantic property of index terms: a term is more or less specific as its meaning is more or less detailed and precise. Intuitively, we know that a general term such as “computational statistics” should have a larger collection distribution than specific terms such as “Monte Carlo”, “Markov chain”, or “local regression”. But deciding which term fits best based mostly on assumptions is not enough.

New Definitions of Exhaustivity and Specificity

Our job would be complicated if we interpreted index terms as a semantic property, because that effort needs highly skilled people and high-quality assumptions grounded in experience. Our job is easier if we can build a controlled framework capable of analyzing natural language in any document category based on statistics. Some researchers have done this very well, for example Salton (1968) and Zunde and Slamecka (1967).

So, we need to redefine exhaustivity and specificity:

Exhaustivity is the number of significant terms a document contains. Significant terms are terms that contribute strongly to describing the main content of a document. For example, terms like ‘a’ and ‘the’ clearly are not significant and should not be counted.

Specificity is how good a term’s discrimination value is. Intuitively, a good term should not be common across the document collection. For example, terms like ‘inform-’ and ‘comput-’ are meaningless in a computer science collection because they are used very commonly and are distributed broadly.
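
To make these redefinitions concrete, here is a minimal Python sketch (the toy collection and stop-word list are my own illustration, not from the paper): exhaustivity is estimated as the count of significant terms per document, and specificity is estimated from how many documents in the collection contain each term.

```python
from collections import Counter

# Toy collection and stop-word list for illustration only (not from the paper).
docs = [
    "the cat sat on the farm",
    "the farm grows corn and wheat",
    "a program computes statistics on the farm data",
]
stopwords = {"a", "the", "on", "and"}

# Exhaustivity: the number of significant (non-stop-word) terms in each document.
for i, doc in enumerate(docs):
    significant = set(doc.split()) - stopwords
    print(f"doc {i}: exhaustivity = {len(significant)}")

# Specificity: a term that appears in many documents discriminates poorly.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()) - stopwords)
print(doc_freq.most_common(3))  # 'farm' appears in all three docs: least specific
```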

Problems

Table 1: Statistics of the Cranfield, Inspec, and Keen document collections.

Table 1 shows the statistics of the three document collections. We can conclude that:

  • Queries tend to use combinations of frequent terms. This is shown by the fact that the “number of documents per term” is likely smaller than the “number of documents per request term”.
  • We need to provide alternative substitutes for given query terms, through classification or statistical analysis, to find associated terms. This is needed because the “number of retrieving terms per document” is likely smaller than the “number of terms per request”.
  • We need controlled term identification that exploits the good features of both very frequent and infrequent terms while discarding the bad terms each produces. This solution is based on the recall/precision graph (Figure 1) for frequent and infrequent terms in the Cranfield collection.

Term Weighting as a General Solution

The idea of term weighting is not new; great research was conducted by Salton (1968) and Artandi and Wolfe (1969). Both showed that automatically assigned term weights perform at least as well as manually assigned weights. Unfortunately, neither included the relative weight of a term across the entire collection. To calculate a term weight based on its relative frequency, we can follow Zipf’s law; the term weight then becomes log_2(N) - log_2(n) + 1, where N is the total number of documents and n is the number of documents that contain the given term.
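
As a quick sketch of this calculation in Python (the function name and example numbers are mine; the paper derives the weight from powers-of-two frequency buckets, while this sketch uses the continuous log form given above):

```python
import math

def term_weight(N: int, n: int) -> float:
    """Collection-frequency weight: log2(N) - log2(n) + 1,
    where N is the number of documents in the collection and
    n is the number of documents containing the term."""
    return math.log2(N) - math.log2(n) + 1

# A term occurring in every document gets the minimum weight of 1.0;
# rarer terms get progressively larger weights.
N = 1000
for n in (1000, 100, 10, 1):
    print(f"n = {n:4d}  weight = {term_weight(N, n):.2f}")
```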

Experimental Results

Figure 2 shows the recall/precision graph for unweighted and weighted terms in the Cranfield document collection. From Figure 2, we can conclude that precision increases when we apply a weighting scheme based on the relative frequency of a term in its collection.
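
A minimal sketch of that comparison (the toy documents and query are my own, not the Cranfield data): each document is scored first by a simple match count and then by the sum of the collection-frequency weights defined above.

```python
import math

# Toy collection and query for illustration (not the Cranfield data).
docs = [
    {"wing", "flow", "pipe"},
    {"supersonic", "flow", "boundary"},
    {"flow", "pipe", "measurement"},
    {"wing", "pipe"},
]
query = {"wing", "supersonic"}
N = len(docs)

def weight(term):
    n = sum(term in d for d in docs)  # number of documents containing the term
    return math.log2(N) - math.log2(n) + 1

for i, d in enumerate(docs):
    matches = query & d
    unweighted = len(matches)                   # simple term-match count
    weighted = sum(weight(t) for t in matches)  # frequency-weighted score
    print(f"doc {i}: unweighted = {unweighted}  weighted = {weighted:.2f}")

# docs 0, 1, and 3 tie on the unweighted count, but weighting ranks doc 1
# highest because 'supersonic' is rarer (more specific) than 'wing'.
```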

___

¹ An optimum level of indexing can be achieved by determining the average value over the possible term combinations expected to appear in queries.

² A false drop relates to the relative level of importance of certain terms. For example, the term ‘programming’ may not be important for ‘computer science’ documents but may be important for ‘mathematics’ documents.
