Making the world’s problem solvers 10% more efficient
Ten years after a Google engineer empowered researchers with Scholar, he can’t bear to leave it
Anurag Acharya is the key inventor of Google Scholar, but the real origin of the project lies in his college years at the Kharagpur campus of the Indian Institute of Technology. The IIT is India’s version of MIT and Stanford combined, and has produced a long list of now-celebrated engineers and executives at Internet companies here and abroad. But even in that elite school, it was difficult for students to get hold of relevant scholarly materials. For Indian high schoolers, it was nearly impossible. “If you knew the information existed, you would write letters,” he says. “That’s what I did. Roughly half of the people would send you something, maybe a reprint. But if you didn’t know the information was there, there was nothing you could do about it.” Acharya was haunted by the realization that great minds were deprived of inspiration, and that wonderful works did not have the impact they could have had, because of their limited distribution.
The eventual solution to this problem would be Google Scholar, which celebrates its tenth anniversary this November. Some people have never heard of this service, which treats publications from scholarly and professional journals as a separate corpus and makes it easy to find otherwise elusive information. Others have seen it occasionally when a result pops up in their search results, and may even know enough to use it for a specific task, like digging into medical journals to gather information on a specific ailment. But for a significant and extremely impactful slice of the population (researchers, scientists, academics, lawyers, and students training in those fields), Scholar is a vital part of online existence, a lifeline to critical information, and an indispensable means of getting their work exposed to those who most need it.
But Acharya’s path towards its creation was a twisted one. He came to America for his doctorate and became an assistant professor of computer science at the University of California at Santa Barbara. He was successful but vaguely unsatisfied. He felt the problems he was tackling were not hard enough, were insufficiently sweeping to make a real difference. One day in 1999 he visited a colleague who had taken a temporary leave to work at an odd startup in Palo Alto, called Google. The visit came at a time when Acharya was reexamining his career, asking himself whether he was really grappling with hard, meaningful problems. The problem of search — essentially fulfilling Google’s mission at the time of organizing and granting access to the world’s information — seemed to be a problem worth solving. Especially since it resonated with his experience in his home country.
“Information had very strong geographical boundaries,” he says. “I come from a place where those boundaries are very, very apparent. They are in your face. To be able to make a dent in that is a very attractive proposition.”
He joined Google in 2000, and for several years took charge of the technology of Google’s indexing. This is the system that “crawled” through all the Web, gathering all its contents so the company could provide the equivalent of a back-of-the-book index of the world’s biggest tome. Part of his job was expanding the index, convincing not only web administrators but also publishers, businesses and government agencies to allow Google to crawl their data. He was also in charge of keeping the index fresh, a massive task that involved pushing computer science to the limit. The job was high pressure. The system was nowhere near as stable as it would become, and after a few years Acharya was burnt out.
“Either I have to leave the company or I have to do something that can be interesting to me, but is lower pressure,” he now recalls as his mindset.
So he got permission to work with another engineer, Alex Verstak, to create Google Scholar, a free and widely accessible service that would live alongside search to address the problem that so thoroughly vexed him as a student. There were a number of challenges. The ranking signals that worked so well in general search were not always the best for researchers seeking knowledge.
On the other hand, there were advantages to introducing search to this particular body of knowledge. Unlike general search, Scholar does not have to make tough guesses about a user’s intent. Obviously, there’s no chance that someone is using Scholar to look for a good Mexican restaurant or the directions to someone’s house; he or she is seeking an article or authors from a bounded set of sources that match the query. What matters in Scholar are the sources of scholarship, the subject matter and the identity of the authors. The sources were fairly easy to identify (though not always to crawl). “Scholarship is not an undifferentiated mass as in Web search,” he says. “If everybody in the scholarly field believes something is a scholarly source, then it is a scholarly source, because we are trying to present to the users what they are looking for.”
Also, the nature of academic papers presented some opportunities for more powerful ranking, particularly making use of the citations typically included in academic papers. Those same scholarly citations had been the original inspiration for PageRank, the technique that had originally made Google search more powerful than its competitors. Scholar was able to use them to effectively rank articles on a given query, as well as to identify relationships between papers.
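To make the idea concrete, here is a minimal sketch of citation-based ranking. It is not Scholar’s actual method, which Google has not published. The paper data, the title-matching rule and the function names are all hypothetical, chosen only to show how citation counts can order the results for a query and how citation links can surface related papers.

```python
from collections import defaultdict

# Hypothetical toy data: paper id -> (title, ids of the papers it cites).
PAPERS = {
    "p1": ("Web search ranking with link analysis", []),
    "p2": ("Scalable web search indexing", ["p1"]),
    "p3": ("Crawling strategies for web search", ["p1", "p2"]),
    "p4": ("Protein folding simulations", []),
}


def citation_counts(papers):
    """Count how many papers in the collection cite each paper."""
    counts = defaultdict(int)
    for _, cited in papers.values():
        for target in set(cited):  # ignore duplicate references
            if target in papers:
                counts[target] += 1
    return counts


def rank(query, papers):
    """Ids of papers whose title contains every query word, most cited first."""
    words = query.lower().split()
    counts = citation_counts(papers)
    matches = [
        pid
        for pid, (title, _) in papers.items()
        if all(word in title.lower().split() for word in words)
    ]
    # Ties are broken by id so the ordering is deterministic.
    return sorted(matches, key=lambda pid: (-counts[pid], pid))


def related(pid, papers):
    """Papers connected to `pid` by a citation in either direction."""
    cites = set(papers[pid][1])
    cited_by = {other for other, (_, cited) in papers.items() if pid in cited}
    return sorted((cites | cited_by) - {pid})


print(rank("web search", PAPERS))
print(related("p2", PAPERS))
```

A raw citation count is the crudest possible signal; a real system would presumably weigh who is doing the citing as well, which is the intuition PageRank borrowed from scholarship in the first place.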
After a number of tests and tweaks, the team showed the prototype to Larry Page. The co-founder’s reaction: “Why is this not live yet?” On November 18, 2004, Scholar was indeed live.
Google Scholar was revolutionary for a number of reasons. Acharya and his team worked hard to get academic publishers to allow Google to crawl their journals. Since many of the articles unearthed by Scholar were locked behind paywalls, simply locating something in a search would not mean that a user could read it. But he or she would know that it existed, and that makes a tremendous difference. (Imagine setting off on a research project and finding out months later that someone had done the same work.) Google also pushed the paywall publishers to allow users to see abstracts of the work. The world’s biggest online archive of journal articles, JSTOR, offered only scans of articles, and had no way to separate the abstract from the whole piece. (Those accessing JSTOR through subscribing institutions could see full text.) So Scholar convinced JSTOR to allow its users to see the first scanned page of the article for free. “Often the first page has the abstract, or in older articles you have the introduction,” says Acharya, whose job title at Google is Distinguished Engineer. “That at least allows you to get a sense of it so you can decide whether you should put in additional effort.” Google Scholar will then provide the information that will help users get the complete text, whether online for free, downloaded for a fee, or in a nearby library.
(All Google users benefited from all that newly crawled information, too, as the company included those articles and books in its general search index.)
At launch, Google Scholar won wide acclaim, even from those generally skeptical about the company. Two well-known library scientists, Shirl Kennedy and Gary Price, wrote, “When big announcements come from Google and web engines, we often get nervous…. Not this time, however. This is BIG news and something that should have been around for years.” (There was some criticism, though. One complaint was that Google Scholar had no API to allow other services to access it. Others said that since Google didn’t share information like its ranking algorithm and all its sources, it fell short of a “scholarly” standard.)
Some in the research community favorably contrasted it to Google’s more controversial Book Search, which was launched at the same time. Scholar avoided the sort of copyright controversy that Book Search generated, despite the fact that the scholarly publishing world is a war zone, with an increasing number of academics lodging protests against the powerful publishers who control the major journals. This is a conflict pitting profit against public good. It was the principle of open research that led Internet activist Aaron Swartz to download a corpus of JSTOR documents legally provided to MIT; the government prosecution of that act ended only with Swartz’s suicide. Google Scholar does not officially take a stand on the issue, but its implicit philosophy seems to endorse an egalitarian spread of information. In any case, when possible, Scholar tries to help non-subscribers navigate around paywalls by linking to articles in multiple locations; often, authors of paywalled works have free versions on their personal websites.
Originally some of the biggest publishers, determined to keep a tight grip on the academic work they typically don’t pay for (and then sell for huge sums), refused to let Google crawl their contents.
Over the years Acharya has worked hard to change their minds. “It is knocking on one door after another,” he says. “Elsevier took three, four, five years. The American Chemical Society was somewhat slower, but largely it is knocking on door after door after door.”
Acharya has kept knocking on doors, because from the very moment Scholar launched, he has been devoted to improving the product. “The first version worked well, but I was not happy with it,” he says. Working with Verstak and a small team, he has consistently added features (one particularly useful addition identifies articles related to the ones ranked for a specific search) and even expanded Scholar’s reach to ambitious new realms, most notably judicial case law in 2009. (This was described as “a shot across the bow of the multi-billion dollar legal publishing business” which previously controlled that public information.) Acharya’s role spans not only engineering but operations, partner relations, library liaison, contracts, and evangelism.
The engineering isn’t an afterthought, though. A lot of artificial intelligence is necessary to keep improving the system. For instance, Acharya and Verstak got a patent for “Identifying a primary version of a document.” (By the way, I found out this factoid by using Google Scholar.)
Another innovation of Scholar has been its ability to correctly identify the authors of books and papers, an important feature for those interested in the work of a specific researcher. “Scholarship tends to have a lot of authors named ‘Jay Smith’; there are a lot of Jay Smiths out there,” he says. “And if you think that’s an easy problem, think of the name Huang. There are about 200 Chinese last names that cover 95% of authors.” Google tackles this problem by creating clusters of papers that are likely to be written by the same individual and, for the last step, asks the actual authors (who almost inevitably use the service) to identify which groups of papers are theirs. Asking users directly to create search results seems very un-Googley, but as Acharya says, “We can’t automatically solve this problem entirely, so we just give you a list of clusters, you say, ‘These are mine,’ and you are done. The rest is automated.” Knowing who the authors are, Google can create profiles of where they fit into academia: who their coauthors are, whom they have cited, and who has cited them.
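The two-step approach Acharya describes (automatic clustering, then a human claim) can be sketched in a few lines. This is an illustration under stated assumptions, not Google’s implementation: the records are invented, and shared coauthors are used as the only clustering signal, which is a stand-in chosen here and not something the company has described.

```python
from collections import defaultdict

# Hypothetical records: (paper id, author name as printed, coauthor names).
RECORDS = [
    ("a", "J. Smith", {"L. Chen", "R. Patel"}),
    ("b", "J. Smith", {"R. Patel"}),
    ("c", "J. Smith", {"M. Okafor"}),
    ("d", "J. Smith", {"M. Okafor", "T. Huang"}),
    ("e", "J. Smith", set()),
]


def cluster_by_coauthor(records, name):
    """Group papers bylined `name` so that papers sharing a coauthor end up together."""
    papers = [(pid, co) for pid, author, co in records if author == name]
    parent = {pid: pid for pid, _ in papers}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # coauthor name -> first paper where it appeared
    for pid, coauthors in papers:
        for co in coauthors:
            if co in first_seen:
                parent[find(pid)] = find(first_seen[co])  # merge the two groups
            else:
                first_seen[co] = pid

    clusters = defaultdict(list)
    for pid, _ in papers:
        clusters[find(pid)].append(pid)
    return sorted(sorted(group) for group in clusters.values())


def claim(clusters, chosen):
    """The manual last step: an author says which clusters are theirs."""
    return sorted(pid for index in chosen for pid in clusters[index])


clusters = cluster_by_coauthor(RECORDS, "J. Smith")
print(clusters)
print(claim(clusters, [0, 2]))
```

The point of the design is the division of labor: the machine only has to produce groupings that are probably right, and the one person who knows for certain settles the rest with a few clicks.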
Acharya’s continued leadership of a single, small team (now consisting of nine) is unusual at Google, and not necessarily seen as a smart thing by his peers. By concentrating on Scholar, Acharya in effect removed himself from the fast track at Google. He was one of a number of amazingly talented Ph.D. engineers who joined the company around 2000, and some of them are still doing work vital to Google’s core, pushing boundaries of computer science and artificial intelligence. He has the engineering chops to work with them. But he can’t bear to leave his creation, even as he realizes that at Google’s current scale, Scholar is a niche.
“I didn’t have the confidence that if I left it behind it would continue to be what I want it to be,” he says. “Normally you leave projects behind, because you do the next interesting thing. This seemed just too important to let my desire for a new project drive what I did next.”
Only at Google, of course, would the world’s most popular scholarly search service be seen as a relative backwater. Acharya isn’t permitted to reveal how big Scholar’s index is, though he does note that it’s an order of magnitude bigger than when it started. He can also say, “It’s pretty much everything: every major to medium size publisher in the world, scholarly books, patents, judicial opinions, small, most small journals…. It would take work to find something that’s not indexed.” (One serious estimate places the index at 160 million documents as of May 2014.) But like it or not, the niche reality was reinforced after Larry Page took over as CEO in 2011, and adopted an approach of “more wood behind fewer arrows.” Scholar was not discarded (it still commands huge respect at Google which, after all, is largely populated by former academics), but it was clearly shunted to the back end of the quiver. Not only was Scholar missing from the list of top services (Image Search, News, etc.), it was also bumped from the menu promising “more” services like Gmail and Calendar. Its new place was a menu labeled “even more.”
Asked who informed him of what many referred to as Scholar’s “demotion,” Acharya says, “I don’t think they told me.” But he says that the lower profile isn’t a problem, because those who do use Scholar have no problem finding it. “If I had seen a drop in usage, I would worry tremendously,” he says. “There was no drop in usage. I also would have felt bad if I had been asked to give up resources, but we have always grown in both machine and people resources. I don’t feel demoted at all.”
Acharya is now 50. He’s excited about adding new features to Scholar, such as improving the “alerts” function and other forms that help users discover information important to them that they might not know is out there. Would he want to continue working on Scholar for another ten years? “One always believes there are other opportunities, but the problem is how to pursue them when you are in a place you like and you have been doing really well. I can do problems that seem very interesting to me, but the biggest impact I can possibly make is helping people who are solving the world’s problems to be more efficient. If I can make the world’s researchers ten percent more efficient, consider the cumulative impact of that. So if I ended up spending the next ten years doing this, I think I would be extremely happy.”
That satisfaction seems plenty for Acharya, especially when he thinks of the millions of people — everywhere from rural India to Mountain View, California — who have the world’s scholarship at their fingertips, for free. But will Google itself spring for at least a doodle on November 18, when Scholar turns ten?