A Statistical Approach to Mechanized Encoding and Searching of Literary Information

Luhn, H. P.

Hafidz Jazuli
Classic Information Retrieval
2 min read · Sep 2, 2017

--

This is a very brief introduction to Hans Peter Luhn’s paper on applying statistics to mechanized (computer-based) literature searching. In the paper, he does not provide any mathematical formulation of his idea; instead, he describes the problems of literature searching in narrative form. His main reason for introducing mechanized encoding with a statistical approach is that the intelligence of experts is wasted on repetitive work, such as encoding many documents by topic. If experts could translate their knowledge of a specific field into a thesaurus and store it in a programmed computer that performs the literature search, the work would be more efficient, and the experts could spend more time on creative tasks such as configuring the computer to do a better job.

Although Luhn used a statistical approach, he viewed the representation of information from the author’s point of view (figure 1). He assumed that an author’s idea could be represented by how often each word is used in the document. Since different authors have different language styles, two or more documents can express the same idea in different words. To resolve this, Luhn described a ‘notion’ as the element that groups documents under the ‘same communicative idea’. Notions should be related to each other within a sentence. In other words, we can identify the similarity between documents based on their distribution of notions. To identify notions, an expert builds a thesaurus-type dictionary that groups terms by their relationship or meaning (semantics).

Figure 1. Communication of idea
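
The idea of comparing documents by their distribution of notions can be illustrated with a small sketch. Luhn’s paper gives no formula, so everything here is an assumption made for illustration: the tiny `THESAURUS` mapping, the function names, and the use of cosine similarity as the comparison measure.

```python
from collections import Counter
from math import sqrt
import re

# Hypothetical thesaurus: each word maps to the notion (concept group)
# it belongs to. A real one would be built by a domain expert.
THESAURUS = {
    "motor": "machine",
    "engine": "machine",
    "generator": "machine",
    "current": "electricity",
    "voltage": "electricity",
    "power": "electricity",
}

def notion_counts(text):
    """Count how often each notion occurs in a text.

    Words that are not in the thesaurus are ignored.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(THESAURUS[w] for w in words if w in THESAURUS)

def cosine_similarity(a, b):
    """Cosine similarity between two notion-count distributions.

    This measure is an illustrative choice, not one taken from the paper.
    """
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two documents that share no thesaurus word but share notions.
doc_a = "The motor draws current and the voltage drops."
doc_b = "An engine needs power; the generator supplies power."

similarity = cosine_similarity(notion_counts(doc_a), notion_counts(doc_b))
print(round(similarity, 3))
```

The two example sentences have no thesaurus word in common, yet they score as similar because their words fall into the same notions. This is the effect the thesaurus is meant to achieve.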

Luhn suggested a computer with internal memory to store the thesaurus, already programmed to perform a linear search. The computer could be operated by a non-expert user, which is an economic advantage. The scanning results, including desired data such as the distribution of terms (notions) in the collection, would be stored on magnetic tape. After a scan ends, the experts analyze the result and then either order a repeat scan after modifying the configuration, if needed, or judge the result to be good enough.

Sometimes a thesaurus alone is not enough to identify the similarity between documents, and further configuration is needed to produce better results. That is, the expert’s intelligence is needed to judge whether a term is a good discriminator; if it is not, they should find another similar term that discriminates better. For example, in the electrical engineering field, the term (notion) ‘electricity’ may appear throughout the whole document collection, which makes it worthless as a discriminator, while the term ‘cortex’ may be rare. The solution is to expand the general term ‘electricity’ into more specific related terms such as ‘power’, ‘generator’, ‘motor’, etc., and to unify the very specific term ‘cortex’ into a bigger semantic group such as ‘brain’, ‘neural’, etc.
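
One way to support that expert review is to count how many documents each notion appears in and flag the extremes. The sketch below is only an illustration: the thresholds `too_common` and `too_rare`, the function names, and the sample collection are all assumptions, since the paper leaves this judgment to the expert and gives no numbers.

```python
from collections import Counter

def document_frequency(collection):
    """Return the fraction of documents that contain each notion.

    `collection` is a list of sets of notions, one set per document.
    """
    counts = Counter()
    for notions in collection:
        counts.update(set(notions))
    total = len(collection)
    return {notion: count / total for notion, count in counts.items()}

def review_notions(collection, too_common=0.8, too_rare=0.25):
    """Flag notions that are poor discriminators.

    The thresholds are hypothetical defaults, not values from the paper.
    'split' marks a notion found in almost every document (expand it into
    more specific terms); 'merge' marks a notion found in very few
    documents (unify it into a broader semantic group).
    """
    flags = {}
    for notion, fraction in document_frequency(collection).items():
        if fraction >= too_common:
            flags[notion] = "split"
        elif fraction <= too_rare:
            flags[notion] = "merge"
    return flags

# Hypothetical scan result for five electrical engineering documents.
collection = [
    {"electricity", "motor"},
    {"electricity", "generator"},
    {"electricity", "motor"},
    {"electricity", "cortex"},
    {"electricity", "generator"},
]

flags = review_notions(collection)
print(flags)
```

With this sample, ‘electricity’ appears in every document and is flagged for splitting, while ‘cortex’ appears in only one and is flagged for merging. The expert would still decide which replacement terms to use before repeating the scan.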
