The Birth of the Web and Rise of Full-Text Search
In 1993, full-text search was like mRNA vaccines — long researched, partially implemented, and waiting for critical mass to catapult it into mainstream society.
The perfect brew? The rise of storage capacity, the first search engine, and the web.
Today, the most difficult part of searching is deciding which of 1 million results we want to read — even then, we hardly ever bother to go past the first page of results.
[Bad joke alert: Want to know where to hide a dead body? Page 2 of Google.]
The quality of search results is a real issue and the subject of ongoing research, but it is a different problem from finding information — a problem that’s been around since at least 2000 BC when ancient civilizations in the Middle East were archiving information.
1950s: Information Retrieval
We don’t need to go back quite that far to understand some of how full-text search came to be. The idea of searching text with a computer for information retrieval (IR) was first explored in the 1950s, “when IR as a research discipline was starting to emerge … with two important developments: how to index documents and how to retrieve them,” according to “The History of Information Retrieval.”
In the early 1960s, Gerard Salton developed the vector space model of information retrieval, which turns text into a numerical representation that is more easily processed using algorithms. Those three developments (indexing, retrieval, and numerical representation) remain the foundation of information retrieval today.
Research in information retrieval in the 1970s and 1980s advanced work done in the 1960s with the notable addition of term frequency theories — how often a word appears in a document helps determine what that document is about.
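To make those ideas concrete, here is a minimal sketch of the vector space model using raw term frequencies (not Salton's original formulation, and with an invented query and documents): each text becomes a vector of word counts, and cosine similarity scores how closely a query and a document overlap.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VectorSpaceSketch {

    // Count how often each term appears in a document (term frequency).
    static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : document.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine similarity between two term-frequency vectors: documents that
    // share many terms, in similar proportions, score closer to 1.
    static double cosineSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocabulary = new HashSet<>(a.keySet());
        vocabulary.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String term : vocabulary) {
            int x = a.getOrDefault(term, 0);
            int y = b.getOrDefault(term, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termFrequencies("kung pao chicken recipe");
        Map<String, Integer> doc1  = termFrequencies("A spicy recipe for kung pao chicken with peanuts");
        Map<String, Integer> doc2  = termFrequencies("How to file an inter-library loan request");
        System.out.println("doc1 score: " + cosineSimilarity(query, doc1));
        System.out.println("doc2 score: " + cosineSimilarity(query, doc2));
    }
}
```

Production engines weight these counts (for example, with inverse document frequency) rather than using raw frequencies, but the underlying geometry is the same.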
These methods were developed and tested using relatively small document sets because limits on storage capacity and processor speed naturally curtailed how many documents were computerized. This was such a concern that international research groups formed the Text Retrieval Conference (TREC) to build larger text collections, which helped refine full-text theories against different types of content. This was to prove useful with the rise of the web.
The Internet Connects Researchers — But Few Others
Research was also progressing on another front. In 1973, the U.S. Defense Advanced Research Projects Agency (DARPA) began to research how to exchange data through networks, according to The Internet Society.
The result was the Internet, a system of networks and communication protocols connecting endpoints — or infrastructure for data exchange.
By the late 1980s, the Internet was used to connect a relatively small number of universities, researchers, and government research partners, such as Rand Corporation, IBM, and Hewlett Packard. One of the earliest Internet communication protocols was file transfer protocol (FTP), which “required a minimum of handshaking, and even more crucially was tolerant of temporary interruptions during long file transfer sessions,” according to The Register.
Users accessed the Internet with a terminal that looked like a DOS command line — in a manual process that required use of an FTP client to connect to a remote FTP server. They would then ask for a list of files on the server, look through the file names, and download the files to see if they contained the information that they needed — over network connections that were thousands of times slower than what is commonly used today.
In other words, users had to rummage around files in the endpoint, though some FTP administrators added a downloadable directory list of sorts with file names and one-line descriptions of the files to make rooting about slightly more efficient.
Keep in mind that these FTP sites had content on them curated by a select group of “geeks.” It wasn’t every academic who knew what an FTP site was, let alone how to add content to one. Every FTP site reflected what its administrators thought was interesting or whatever occurred to them to add.
The process was not very different from asking for an inter-library loan, where you would ask a librarian to order a book from another library. The requested book would come in a week or two, and it might or might not contain information useful to you.
Those lucky enough to have an Internet connection were giddy because they were able to get whatever was on these FTP sites immediately instead of waiting weeks for an inter-library loan. Suddenly, the process of discovering if a document was relevant was exponentially faster.
Nevertheless, it was laborious — one had to look through file lists, download the files that seemed suitable, then see whether they contained what was needed. Rinse and repeat, as needed.
The First Search Engine
In 1989, Alan Emtage, a graduate student at McGill University, was working to find software downloads on FTP sites. In an effort to make this easier, he developed Archie, a software program that indexed file names on the FTP sites he connected to.
Archie, which is still available, indexes the downloadable files of FTP sites and allows users to search the file names for exact word matches. Indexing makes a list of where every word or phrase can be found in a set of text — making search faster and more efficient.
Archie has “a crawling phase, a retrieval phase where you pull the information in, and an indexing phase, where you build the data structures that allow the search, and then you have the ability to search,” Emtage said in an interview with Digital Archeology.
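In the same spirit, here is a minimal sketch of that index-then-search idea (illustrative code, not Emtage's actual implementation): an inverted index maps each word in a file name to the files containing it, so an exact-word search becomes a single lookup rather than a scan of every name. The file names are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FileNameIndex {

    // Inverted index: each word maps to the list of file names that contain it.
    private final Map<String, List<String>> index = new HashMap<>();

    // "Indexing phase": record where every word can be found.
    void addFile(String fileName) {
        for (String word : fileName.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(fileName);
            }
        }
    }

    // "Search phase": exact word match, answered by one lookup instead of
    // rescanning every file name.
    List<String> search(String word) {
        return index.getOrDefault(word.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        FileNameIndex archieLike = new FileNameIndex();
        archieLike.addFile("gnu-chess-4.0.tar.gz");
        archieLike.addFile("xv-3.10a-image-viewer.tar.Z");
        archieLike.addFile("chess-openings-notes.txt");
        System.out.println(archieLike.search("chess")); // [gnu-chess-4.0.tar.gz, chess-openings-notes.txt]
    }
}
```

Modern engines index the full text of documents rather than just their names, but the data structure at the core is still an inverted index.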
Archie was, by all accounts, difficult to use, but it laid the foundation for all subsequent search engines and, perhaps, the web.
The Rise of the Web
History will long remember 1989 — it’s also the year that Tim Berners-Lee invented the web. In November 1990, he launched the first website, which described his vision: “The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents.”
Berners-Lee developed the web to help scientists at CERN share information more easily. He also hoped to democratize the Internet and “meet the demand for automated information-sharing between scientists in universities and institutes around the world,” according to CERN.
In 1993, CERN released the underlying web code into the public domain, and websites began to proliferate. Suddenly, the problem of small document sets disappeared.
A few search engines emerged between 1989 and 1993 to help web users find documents that matched their searches. The field was crowded with competitors by 1998, when Google launched.
Coinciding with, and perhaps enabling, the growth of the web, storage capacity grew, with the number of bits of information packed into a square inch of hard drive surface increasing from 2000 bits in 1956 to 100 billion bits in 2005, according to Mark Sanderson and W. Bruce Croft in The History of Information Retrieval.
Because we had more, and more varied, documents to choose from, what we searched for changed from the Archie days. Before the web, Internet users were a small population looking for a constrained set of documents. “There is no way to discover new resources, so resources have to be manually discovered,” Emtage said. What people expected of search, and how much they needed it, was very different, and the universe of information was far smaller.
Search, Storage, Behaviors Change
How we searched also changed. The basic process of a modern search engine is not so different from Archie’s: retrieve information, index it, and allow people to search. But the web allows for discovery, something earlier Internet users didn’t have. “With the invention of the web, you have the ability to discover things that you didn’t previously know, because of hypertext links between websites,” Emtage said.
Those web links allowed users to journey across sites without having to use an engine at all. The web spawned the potential for instantaneous serendipity in information retrieval, a capability that had never existed before. Who hasn’t gone down the rabbit hole of following link after link to suddenly find oneself in an alien corner of the web, reading about something wholly unrelated to what one first went looking for?
But serendipity can take you only so far. When you really want to use your research time efficiently, you want to focus on a narrow range of things directly relevant to your immediate need. The down-the-rabbit-hole process of following links, while valuable in a broader sense, doesn’t help you find or compare recipes for kung pao chicken, for example.
Between Archie and the modern search engine, several technological problems had to be solved.
- A scale problem: We had to solve the problem of indexing actual content, not just file names, at scale because there were 1 billion websites by 2014.
- A UI problem: We had to make it possible for largely untrained users to type in queries in natural language text, not in a rigid syntax of terms and Boolean operators. Engines also had to advance to be able to divine what relevant results would best answer the questions implied by those natural language queries.
- A quality problem: It’s easy to make an engine that returns 10,000 results, but such an engine isn’t going to be usable by people who work under limits of time and available attention. Search results must be ranked or otherwise limited in topical scope to be useful at all. Indeed, search engine optimization (SEO) as a business sector arose purely to make sure that a commercial entity’s desired answers (“buy product X”) would show up in the first few links search engines presented to users in response to topical searches.
- A language problem: How do you support search across every human language? The answer was Unicode, a standardized way to encode the characters of every writing system in the world (a short sketch after this list shows why that matters).
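To illustrate one piece of that language problem, the sketch below uses Java's built-in text normalizer (chosen here purely for illustration) to show that the same word can arrive as two different Unicode code point sequences and only matches reliably once the text is normalized before indexing.

```java
import java.text.Normalizer;

public class UnicodeSketch {
    public static void main(String[] args) {
        // The same visible word can arrive as different code point sequences:
        String composed   = "caf\u00E9";  // 'é' as a single precomposed code point
        String decomposed = "cafe\u0301"; // 'e' followed by a combining acute accent

        System.out.println(composed.equals(decomposed)); // false: the sequences differ

        // Search engines normalize text to one canonical form before indexing,
        // so that both spellings match the same query.
        String a = Normalizer.normalize(composed,   Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true after normalization
    }
}
```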
Web-based search engines index the full text of documents and employ many of the early information retrieval techniques developed in the latter half of the 20th century. They use term frequency, statistical analysis of word relationships, clustering, and different kinds of faceted search to help return relevant results.
In 1998, Google introduced PageRank, an algorithm that tackles the problem of relevance determination by using the pattern of web links between pages as a proxy for each page’s importance. While Google’s ranking has long since moved beyond PageRank alone, its appearance changed the game in the commercial search space.
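The sketch below shows the core idea behind PageRank as a textbook power iteration with a damping factor; it is not Google's production implementation, and the four-page link graph is invented. Each page repeatedly shares its score across the pages it links to, so pages that attract many links accumulate higher scores.

```java
import java.util.Arrays;

public class PageRankSketch {

    // Power iteration: each page repeatedly shares its current rank equally
    // among the pages it links to; the damping factor models a surfer who
    // occasionally jumps to a random page.
    static double[] pageRank(boolean[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int from = 0; from < n; from++) {
                int outDegree = 0;
                for (int to = 0; to < n; to++) if (links[from][to]) outDegree++;
                if (outDegree == 0) {
                    // Dangling page: spread its rank evenly over all pages.
                    for (int to = 0; to < n; to++) next[to] += damping * rank[from] / n;
                } else {
                    for (int to = 0; to < n; to++) {
                        if (links[from][to]) next[to] += damping * rank[from] / outDegree;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical four-page web: links[i][j] is true if page i links to page j.
        boolean[][] links = {
            {false, true,  true,  false},
            {false, false, true,  false},
            {true,  false, false, false},
            {false, false, true,  false},
        };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
        // Page 2 ends up with the highest score because the most links point to it.
    }
}
```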
Search continues to improve with machine learning-aided approaches representing the state of the art.
Full-Text Search for the Enterprise
The field of information retrieval began on a small scale, providing government and academic searchers with access to small document sets. The web brought the volume of content needed to make full-text search work across the great variety of digital human knowledge. Today, full-text search has come full circle as enterprises strive to serve internal and external searchers with highly relevant content.
Lucene was the first widely available open source full-text indexer, and it put full-text indexing on everyone’s map. Created in 1999 by Doug Cutting, it is a software library that builds text indexes to allow for faster searching. Lucene can index and search text from many different file types, including .doc, PDF, XML, and HTML — making it ideal for enterprises with many types of data. Search engine applications can — and often do — embed Lucene to gain its capabilities.
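As a rough sketch of what embedding Lucene looks like, the example below indexes two short strings and runs a free-text query against them. It assumes a recent Lucene release with the core and query parser modules on the classpath, and exact class and method names vary somewhat between Lucene versions.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory directory = new ByteBuffersDirectory(); // in-memory index for the example

        // Indexing: add each document's text to the index.
        try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
            for (String text : new String[] {
                    "Archie indexed file names on FTP sites",
                    "Lucene is a library for full-text indexing and search"}) {
                Document doc = new Document();
                doc.add(new TextField("content", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Searching: parse a free-text query and print the ranked matches.
        try (DirectoryReader reader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("full text search");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.storedFields().document(hit.doc).get("content"));
            }
        }
    }
}
```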
Lucene, on which Solr and Elasticsearch are built, has been foundational for building search applications in industries for many years. But there are other options emerging that don’t require you to roll your own full-text search.
While enterprises can leverage the advances in search perfected by the likes of Google, they still face problems of:
- having documents stored in many applications that don’t communicate directly
- content in different file formats
- multiple versions of documents, with no indication of which version is authoritative
- security concerns around who can access what content
- content sets that grow constantly and have to be indexed accordingly
Employees often have to conduct multiple searches in different data applications to find what they’re looking for — if they ever do.
Introduction of Cloud Native Search
Cloud-native search allows companies to index structured and unstructured data from different sources. The addition of machine learning models allows tracking of what users do in each search session, comparing that to the behavior of similar users, and presenting search results tailored to that behavior.
Just as full-text search for the web improved with larger and larger document sets, enterprise search improves with more and more user information. By capturing user signals, full-text search can be enriched by the “wisdom of the crowd.” This allows for predictive search — understanding what the user wants and providing them with recommendations.
Though information retrieval has been around for millennia, full-text search enables humanity to learn, discover and enjoy human knowledge in ways that were unimaginable even 30 years ago.
The future of search may well lie in images, and in predicting what users are likely to need and presenting it to them before they have to ask — opening new frontiers in human information retrieval that go well beyond PageRank.