Several More Good Information Retrieval Papers Published Before 2002

A committee of senior SIGIR members recently met to choose the best SIGIR papers published before the first Test of Time Award (pre-2002). A similar effort went into Readings in Information Retrieval; that collection is now out of print, but its papers are worth reading. Finally, the SWIRL attendees assembled a reading list for information retrieval students.

I want to take the opportunity to point out papers that would make good company in this crowd. A few caveats:

  1. Papers are drawn from those in my citation database, started when I began graduate school. I’m no doubt missing references in areas I have yet to publish extensively in (e.g. efficiency, user studies).
  2. I’m focusing on information retrieval papers, as opposed to those from other communities.
  3. Some papers are what I consider to be the first publication using a technique. Some papers are what I consider fundamental developments for a technique. Some papers are just neat.
  4. Unlike the pre-2002 committee, I’m including non-SIGIR material.
  5. I have been able to find digital copies of all of these papers except for the first Fairthorne paper, which can be found in his collection “Towards Information Retrieval” (Butterworths, 1961).

Here we go:

1948

  • Text classification and retrieval.
    R. A. Fairthorne. The mathematics of classification. In The Proceedings of the British Society for International Bibliography, volume 9, 35–42, 1948.

1956

  • Boolean representation of queries. 
    R. A. Fairthorne. The patterns of retrieval. American Documentation, 7(2):65–70, 1956.

1958

  • Information filtering.
    H. P. Luhn. A business intelligence system. IBM J. Res. Dev., 2(4):314–319, October 1958.
  • Lattice representation of Boolean queries.
    Calvin N. Mooers. A mathematical theory of language symbols in retrieval. In Proceedings of the international conference on scientific information, 1327–1364, 1959.

1962

  • Cosine similarity. 
    Gerard Salton. Some experiments in the generation of word and document associations. In Proceedings of the December 4–6, 1962, Fall Joint Computer Conference, AFIPS ’62 (Fall), 234–250, 1962.

1963

  • Query expansion based on a term-similarity matrix.
    Vincent E. Giuliano and Paul E. Jones. Linear associative information retrieval. Technical report CACL-2, Arthur D. Little Inc., 35 Acorn Park, Cambridge, Massachusetts, November 1963.

1965

  • Rocchio algorithm.
    J. J. Rocchio. Relevance feedback in information retrieval. Scientific Report 23, Number 9. The National Science Foundation, August 1965.

1967

  • Interactive information-seeking behavior in libraries.
    Robert S. Taylor. Question-negotiation and information seeking in libraries. Studies in the man-system interface in libraries 3, Center for Information Science, Lehigh University, July 1967.

1968

  • Term-similarity.
    Michael E. Lesk. Word-word associations in document retrieval systems. In Gerard Salton, editor, Information Storage and Retrieval, number IRS-13, Chapter IX. Cornell University, Ithaca, NY, 1968.
  • Expected search length.
    William S. Cooper. Expected search length: A single measure of retrieval effectiveness based on the weak ordering action of retrieval systems. American Documentation, 19(1):30–41, 1968.

1971

  • The cluster hypothesis.
    N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7:217–240, 1971.

1972

  • Inverse document frequency.
    K. Spärck-Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.

1973

  • Bayesian information retrieval.
    Jean Tague. A Bayesian approach to interactive retrieval. Information Storage and Retrieval, 9(3):129–142, 1973.
  • Simulation-based information retrieval evaluation.
    Michael D. Cooper. A simulation model of an information retrieval system. Information Storage and Retrieval, 9(1):13–32, 1973.
  • Clustering in relevant and nonrelevant documents.
    C. J. van Rijsbergen and Karen Spärck-Jones. A test for the separation of relevant and non-relevant documents in experimental retrieval collections. Journal of Documentation, 29(3):251–257, September 1973.

1976

  • Theoretical analysis of inverse document frequency.
    S. E. Robertson and K. Spärck-Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146, 1976.

1977

  • Pseudo-relevance feedback.
    R. Attar and A. S. Fraenkel. Local feedback in full-text retrieval systems. J. ACM, 24(3):397–417, July 1977.

1980

  • Formal specification of simulation-based evaluation.
    Jean Tague, Michael Nelson, and Harry Wu. Problems in the simulation of bibliographic retrieval systems. In SIGIR ’80: Proceedings of the 3rd Annual ACM Conference on Research and Development in Information Retrieval, 236–255, 1980.

1981

  • Spreading activation for information retrieval.
    Scott Preece. A spreading activation network model for information retrieval. PhD thesis, University of Illinois, Urbana-Champaign, 1981.

1983

  • Session-based information retrieval as sequential decision-making.
    A. Bookstein. Information retrieval: a sequential learning problem. Journal of the American Society for Information Science, 1983.

1985

  • Query-time clustering.
    Peter Willett. Query-specific automatic document classification. In International Forum on Information and Documentation, Volume 10, 28–32, 1985.

1986

  • Nearest-neighbor cluster-based retrieval.
    A. El-Hamdouchi and P. Willett. Hierarchic document classification using Ward’s clustering method. In SIGIR ’86: Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 149–156, 1986.
    Alan Griffiths, H. Clair Luckhurst, and Peter Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, 37(1):3–11, 1986.

1988

  • Query suggestion.
    D. Harman. Towards interactive query expansion. In Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 321–331, 1988.
  • Large-scale Bayesian user feedback.
    Peter J. Lenk and Barry D. Floyd. Dynamically updating relevance judgements in probabilistic information systems via users’ feedback. Management Science, 34(12):1450–1459, December 1988.

1989

  • Information retrieval with neural networks.
    R. K. Belew. Adaptive information retrieval: using a connectionist representation to retrieve and learn about documents. In SIGIR ’89: Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 11–20, 1989.
    K. L. Kwok. A neural network for probabilistic information retrieval. In SIGIR ’89: Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 21–30, 1989.

1990

  • Meta-search.
    P. Thompson. A combination of expert opinion approach to probabilistic information retrieval, part 1: the conceptual model. Inf. Process. Manage., 26(3):371–382, 1990.
    P. Thompson. A combination of expert opinion approach to probabilistic information retrieval, part 2: mathematical treatment of CEO model 3. Inf. Process. Manage., 26(3):383–394, 1990.
  • Cross-lingual retrieval.
    Thomas K. Landauer and Michael L. Littman. Fully automatic cross-language document retrieval using latent semantic indexing. In Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, 31–38, 1990.

1991

  • Signal detection analysis of the probability ranking principle.
    Michael D. Gordon and Peter J. Lenk. A utility theoretic examination of the probability ranking principle in information retrieval. Journal of the American Society for Information Science, 42(10):703–714, December 1991.

1992

  • Iterated relevance feedback.
    IJsbrand Jan Aalbersberg. Incremental relevance feedback. In SIGIR ’92: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 11–22, 1992.
  • Demonstrated suboptimality of the probability ranking principle.
    Michael D. Gordon and Peter J. Lenk. When is the probability ranking principle suboptimal? Journal of the American Society for Information Science, 43(1):1–14, January 1992.

1993

  • Krovetz stemming.
    Robert Krovetz. Viewing morphology as an inference process. In SIGIR ’93: proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval, 191–202, 1993.
  • Combining multiple Boolean queries from expert searchers.
    Nicholas J. Belkin, C. Cool, W. Bruce Croft, and James P. Callan. The effect of multiple query representations on information retrieval system performance. In Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval, 339–346, 1993.

1994

  • Local latent semantic indexing.
    David Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval, 282–291, 1994.
  • Upper bound on performance of Boolean queries.
    Robert M. Losee Jr. Upper bounds for retrieval performance and their use measuring performance and generating optimal Boolean queries: can it get any better than this? Information Processing & Management, 30(2):193–203, 1994.
  • Corpus-driven thesaurus.
    Yufeng Jing and W. Bruce Croft. An association thesaurus for information retrieval. UMass Tech Report IR-47, 1994.
  • Passage-based pseudo-relevance feedback.
    James P. Callan. Passage-level evidence in document retrieval. In SIGIR ’94: proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval, 302–310, 1994.

1995

  • Distributed information retrieval.
    James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In SIGIR ’95: proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval, 1995.
  • Building hypertext on the fly.
    James Allan. Automatic hypertext construction. PhD thesis, Cornell University, 1995.
  • A correction on active learning for text classification.
    David D. Lewis. A sequential algorithm for training text classifiers: corrigendum and additional data. SIGIR Forum, 1995.

1996

  • Language modeling for information retrieval.
    Thomas Kalt. A new probabilistic model of text classification and retrieval. CIIR Tech Report 78, University of Massachusetts Amherst, 1996.

1997

  • Non-relevance feedback.
    Mark D. Dunlop. The effect of accessing nonmatching documents on relevance feedback. ACM Trans. Inf. Syst., 15(2):137–153, 1997.

1998

  • Improving pseudo-relevance feedback with strategic initial queries.
    Mandar Mitra, Amit Singhal, and Chris Buckley. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 206–214, 1998.
  • Novelty.
    Jaime Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, 1998.
  • Web query log analysis.
    Bernard J. Jansen, Amanda Spink, Judy Bateman, and Tefko Saracevic. Real life information retrieval: a study of user queries on the web. SIGIR Forum, 32(1):5–17, 1998.
    Craig Silverstein, Monika Henzinger, Johannes Marais, and Michael Moricz. Analysis of a very large AltaVista query log. Technical report SRC-TN-1998-014, HP Labs Technical Report, 1998.

1999

  • Information foraging.
    Peter Pirolli and Stuart Card. Information foraging. Psychological Review, 106(4):643–675, October 1999.
  • Information retrieval as machine translation.
    Adam Berger and John Lafferty. Information retrieval as statistical translation. In SIGIR ’99: proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, 222–229, 1999.
  • Local match kernels.
    Owen de Kretser and Alistair Moffat. Effective document presentation with a locality-based similarity heuristic. In SIGIR ’99: proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, 113–120, 1999.
  • Document expansion.
    Amit Singhal and Fernando Pereira. Document expansion for speech retrieval. In SIGIR ’99: proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, 34–41, 1999.
  • Economics of information retrieval.
    Hal R. Varian. The economics of search. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, 1999.

2000

  • Web query similarity.
    Doug Beeferman and Adam Berger. Agglomerative clustering of a search engine query log. In KDD ’00: proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, 407–416, 2000.

2001

  • Information theoretic analysis of inverse document frequency. 
    Kishore Papineni. Why inverse document frequency? In NAACL, 2001.
  • Temporal summarization.
    James Allan, Rahul Gupta, and Vikas Khandelwal. Temporal summaries of new topics. In SIGIR ’01: proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, 10–18, 2001.
  • Query-based sampling.
    Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Transactions on Information Systems, 2001.
  • Relevance models.
    Victor Lavrenko and W. Bruce Croft. Relevance based language models. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, 120–127, 2001.
  • Performance prediction.
    Ian Soboroff, Charles Nicholas, and Patrick Cahan. Ranking retrieval systems without relevance judgments. In SIGIR ’01: proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, 66–73, 2001.