BY GRANT TIMMERMAN — in collaboration with Akash Gupta, Kexiang Xu, and Mingjing Zhu
During my senior year in the University of Washington’s Computer Science Department, I had the honor of taking the Internet Systems Capstone class (CSE454) taught by AI expert, Dan Weld. The crux of the 12 week class was designing and implementing a system uses any of the following crucial components to almost all modern internet sites:
- Information Extraction / NLP
- Machine Learning
- Network Effects
We chose to build a tool called AlgoDB which allows any software developer to easily search for an algorithm and find copy-and-pastable code samples in a variety of languages. The following is a summary of our results.
The goal of our project is to create a search engine that would enable programmers to quickly look up source code implementations for any algorithm. The existing search engines are inconvenient for developers searching for algorithm implementations. Popular search engines like Google are too broad, source-code agnostic, and contains too much noise. Furthermore, sources such as Wikipedia or an algorithms textbook mainly give abstract explanations or pseudocode instead of copy-and-pasteable code. This is not ideal for programmers who would rather quickly discover a robust library or snippet of code than translate the algorithm from pseudocode which is prone to errors. Our goal is to eliminate time wasted from looking for online implementations, allowing programmers to focus more on the problems that they are trying to solve.
Our project is composed of two major parts, the algorithm and implementation database and the search engine front-end. The database stores all of our algorithms, categories and implementations, while the search engine takes user queries and responds with a list of relevant algorithms and implementations. In addition, the front-end has various functionalities to facilitate information discovery such as language filters, dedicated urls and sorted indices of algorithms and languages.
The creation of the algorithm and implementation database was done offline. To create the algorithm database, we needed to scrape data from Wikipedia, npm and Rosetta Code. We used a variety of techniques and database choices to make this process efficient.
For this project, we decided to use Wikipedia as the source of truth when it comes to algorithms. This makes accurately extracting Wikipedia algorithm concepts paramount to the success of the project. We scraped Wikipedia recursively starting from the Wikipedia category “Category: Algorithm”. From there, we indexed every subcategory and pages up to a cutoff depth of 5. We store pages and subcategories into our Elasticsearch database of algorithm and algorithm categories, respectively. Storing algorithm categories along with algorithms could pave way for future research involving relations between algorithms.
As we delved deeper into the subcategories of Algorithms, we started noticing wikipedia entries which are no longer algorithms. In order to filter them out, we employed the heuristic that a Wikipedia page is an algorithm if its first paragraph (usually called “summary”) contains the word “algorithm”. Later, we realized that many Rosetta Code implementations weren’t getting linked because the associated Wikipedia pages didn’t contain the word “algorithm”, so we relaxed this restriction and deemed that a Wikipedia page must be algorithm-related if it’s associated with a Rosetta Code implementation.
The next step for making our algorithm database required us to link algorithm names to implementations. Wikipedia provides a clean list of algorithm names, but rarely provides implementations. On the other hand, Rosetta Code and npm have a lot of implementations, but they might not necessarily be for algorithms. So, the main part of the algorithm database stage is to link algorithms to their implementations.
We tried a couple of different ways for linking. Initially, we tried using exact string matching for the NEL. The main problem with exact string match was that it was too strict. Differences in spaces (merge sort vs mergesort) or slight variations in name (Rabin-Karp vs Karp-Rabin) would fail to link, despite being very similar. We then moved onto matching with n-grams, which had better performance. Finally, we used Elasticsearch to do our linking, since Elasticsearch does n-gram tokenization and other processing techniques automatically, so it had the best performance.
Rosetta Code Linking
We chose Rosetta Code as our starting point for algorithm implementations because it contains thousands of copy-and-pasteable code of common algorithms implemented in a wide range of languages. There are a total of 780 tasks in Rosetta Code. Each task can either be implementing an algorithm, such as merge sort or binary search, or handling some specific tasks such as displaying a game or reading a file. In order to link the tasks to wikipedia algorithms while filtering out the tasks not implementing an algorithm, we extracted the wikipedia links in task description and the task name, and tried 4 different strategies to link the tasks to algorithms in wikipedia: exact string matching, fuzzy string matching, auto-suggestion of wikipedia, and suggestion of Crosswikis. We took combinations of these strategies and tuned parameters such as the fuzziness for fuzzy string matching and auto-suggestion of wikipedia, and then evaluated the linking with precision and recall, so that we would be able to choose a most stable version of NEL on Rosetta Code.
We also chose to index implementations in npm registry. npm is one of the many flourishing language package registries, containing 199,389 packages. Compared to Rosetta Code, npm is a lot less structured. In npm, like with many other package registries, each package a short description, a readme in markdown format and a list of keywords. From these data, we need to determine what algorithms it implements. If we can link npm packages accurately, we can easily generalize the approach to other package registries.
npm NEL is a challenging problem because around 90% of packages are utilities that do not associate with any algorithm. In addition, there’s no clean and easy way to filter them besides to design a clever linker that avoids these. In reality, it’s hard to improve the number of correctly linked packages without dragging in junk packages at the same time.
We tried 4 different strategies for npm linking: simple full-text search, keyword extraction, keyword extraction with Crosswikis, and reverse matching. While keyword extraction variants are able to index most packages with relatively high recall, it suffers from low precision, which is what we wanted to optimize for in our search engine. Thus we settled upon an alternative approach, reverse matching for the final deliverable.
In reverse matching strategy, each algorithm name is used as a search term to be matched against all npm registry packages using Elasticsearch. We then consider up to the top 30 matching packages for each algorithm above a relative score — the frequency of the matching term normalized by the length of the document. This way, we ensured that most matching package has high relevancy with the algorithm, at the cost of less relevant packages being omitted from the index.
To optimize the linking and search engine efficiency, we indexed Crosswikis, a dictionary mapping from queries to wikipedia concepts, and also the inverse Crosswikis, a dictionary mapping from wikipedia concepts to queries. We used Crosswikis to suggest the most possible Wikipedia page given a task name, and the inverse Crosswikis for npm linking experiments. We found in our experiments that using Crosswikis decreases the precision but increases the recall dramatically.
Search Engine Front-End
The search engine front-end is website built on Node with Express.js that acts as a simple search engine for algorithms. A user can type a free-form query about an algorithm they are looking for such as “Dijkstra’s Algorithm” or “Flood Fill”. Generic descriptions of algorithms also perform well such as “minimum spanning tree” (Kruskal’s algorithm), “greatest common divisor” (Euclid’s algorithm) will also provide relevant results.
After entering a query, your search is added to a permanent link in the URL and the most relevant search results that surpass a minimum ranking score are shown in a list of algorithms with descriptions and expandable details. The search queries the Elasticsearch database, providing a relative score for each algorithm based on fuzzy term frequency — inverse document frequency on the title, tag line (first sentence of description), and full text description. The ranker also weighs a title match three times as relevant than a description match.
Once the relevant algorithm ids are discovered, the backend must query the implementations, bucket them by language, and then serve a random sampling of those to the user in case the user presses “See more”. Some algorithms have hundreds of implementations and rendering all the implementations would cause a significant delay in browser responsiveness (a few seconds).
To get more details about a particular algorithm, you can click on the title of a search result to go to the dedicated algorithm page. This page expands to show the algorithm categories and a list of hyperlinked languages to their syntax-highlighted code snippets. The code snippets are either from Rosetta Code or from npm and are highlighted using highlight.js for the target language. Implementations from npm render the project’s README in HTML (which can include code, lists, 3rd party images and links).
In addition to normal searching and browsing results, on the footer and header of the website you can browse three other ways of looking through our dataset. The first way is by selecting a specific language. Clicking on a language will bring up to 50 random algorithms that have implementations in the language you selected. This is useful for understanding the syntax of a language. Besides the list of languages, you can view the full list of hyperlinked algorithms as well. Lastly, you can go to a random algorithm by clicking the “???” button in the header. This was useful for finding bugs in our NEL linking.
Experiments and Results
We evaluated the NEL for Rosetta Code and npm with the following procedure. We took 100 random samples from the set of candidate implementations, and then manually verified the correctness of their linked algorithms in each NEL version. A sample is a true positive if it has a linked algorithm that we’ve verified to be correct, and a false positive otherwise. If a sample did not have a linked algorithm, we verified if it is a true negative by determining if it does not have a corresponding Wikipedia entry.
This approach became problematic for later iterations of npm NEL, where the positive samples only made up 10% of the total population. In later iterations, each sample may only yield 0 or 1 true positives and false positives, making it virtually impossible to compute precision. Thus we evaluated precision based on 100 random samples from the set of indexed candidates instead. The recall values were still computed with the original procedure elaborated above.
NEL for Rosetta Code
In order to link the tasks in Rosetta Code to algorithms in Wikipedia, we extracted the Wikipedia links in task description and the task name. Given the two kinds of data, we tried 4 different strategies to link the tasks to algorithms in Wikipedia: exact string matching, fuzzy string matching, auto-suggestion of Wikipedia, and suggestion of Crosswikis.
Since we noticed that approximately 50.3% of the tasks have Wikipedia links in their task description, it would be optimal for us to simply link the task to the page that the task author has manually linked, but the reality is that there are sometimes irrelevant Wikipedia links for other purpose. In order to get the most relevant link, we tried to use exact string matching and then fuzzy string matching between the link title and the task name, but it would fail to link when an algorithm has alternative names. Another strategy is that given the task name, we can query for a possible Wikipedia page by using the auto-suggestion of Wikipedia’s API or our Cassandra database where Crosswikis is indexed.
We took combinations of these strategies and tuned parameters such as the fuzziness for fuzzy string matching and auto-suggestion of Wikipedia. We chose 6 most representative versions to evaluate on. Following the NEL evaluation procedure described above, we get the precision and recall for each of the 6 versions. We finally chose version 4.3 for our in-class demo.
NEL for npm
We experimented with 4 approaches to npm linking: simple full-text search, keyword extraction, keyword extraction with Crosswikis, and reverse matching. We used reverse matching for the final deliverable.
In the simple full-text search experiment, we tried for each package performing full-text searching on the database of algorithms using its text features. For text features, we used the entirety of the package description, up to the first 20 lines of the readme excluding code snippets and artifacts of markdown syntax. They are matched against the name, short tagline and summary of algorithms. We then compute the total score of each algorithm match with the sum of the relative search score of each text feature, and considered the algorithm matches above a fixed threshold value. We were able to index 19183 packages with this approach, yielding us 58.3% precision and around 40% recall. It became apparent that the indexed packages contained too many false positives for a search engine use.
During the experiment, we noticed that keywords were the biggest contributor to the correct algorithm matches. However, some packages didn’t provide accurate enough keywords, and others simply didn’t have any. That led to the idea of extracting keywords from the readme document, which can then be used alongside the original ones. This approach allowed us to index many more positive samples, yielding a much higher recall value of 87.5%, but a much lower precision of 8%.
The next iteration used Crosswikis to address the problem of some algorithms having alternative names. It also allowed us to exactly match package keywords and algorithms’ Crosswikis entries, further boosting the scores of the relevant algorithms. However, no improvements were observed in terms of recall and precision values, which lowered to 62.5% and 7%, respectively.
To achieve the high precision values required for search engine use, the next and final iteration used a reversed matching approach. Rather than matching against algorithms with features of each package, we tried matching the name of each algorithm against all packages (hence the name “reverse”), then considered up to the top 10 matching packages for each algorithm above a fixed threshold. This way, we ensured that most matching packages have high relevancy, at the cost of less relevant packages being omitted from the index. We were able to index 1750 packages with a whopping precision of 93%, at a cost of having virtually no recall because the fixed sample set didn’t contain any false or true positives. Relaxing the criteria to top 30 matching packages yields 2325 indexed packages with a precision of 94% and recall of 5%.
Full system results
To create our ground truth, we took a list of algorithms from scriptol.com, because it contained algorithms from different fields, such as cryptography, scientific computing, and graph theory. For each of the 363 algorithms, we searched the first 10 results from Google using only the Wikipedia corpus. We removed any result that did not contain “algorithm” in the opening paragraphs and anything that wasn’t a real Wikipedia article, such as the talk and category pages. After the filtering, we removed any algorithms that did not have at least 5 entries, ensuring we had a set of 123 good algorithms
We defined a document matching the correct result if the URLs matched and if the document contained the correct implementation, which was done by manual inspection. For queries with no relevant documents, we assignment them a score of 0. The two tables below show our results below. The left table considers only our search performance and ignores implementation, while the right table displays our final results.
The left table indicates that our highest ranked document matched the correct answer majority of the time. Since MRR stops increasing after the top 3 documents, our database does not contain the rest of the algorithms in the test set. Thus, one of the major improvements to our system would be to index more algorithms from other sources, such as GitHub or Stackoverflow. The final full results show that npm gives around a 10% increase in MMR and the high precision module still maintains most of the algorithms. Since full MRR is 20% lower, another improvement would be continue working on npm NEL to get more relevant packages.
There were a few surprises we had at the end of the project. Firstly, Crosswikis data (20GB) didn’t improve our Rosetta Code linking as we expected, but it gave us the opportunity to add algorithm aliases if we had more time. Secondly, we discovered that it is not trivial to make very efficient custom queries on a NoSQL database where you are trying to join many documents that have many-to-many relationships. We had to spend extra time optimizing our query flow by using caching, adding result limits, and using parallel querying to ensure search results would come in less than half a second. Next time, it would be better to think more carefully about our database schema. Lastly, we were pretty surprised how well the search worked; our MRR was relatively constant around 0.69. The reason why it wasn’t higher was because 30% of the algorithms used in the test set were missing from our corpus.
Overall, our end system is in a usable state and provides useful copy-and-pasteable code for a variety of languages, especially for common algorithms. This was very apparent in our test set, where common algorithms, such as the sorting and graph algorithms, almost always had correct links and implementations. It seems that our system relies on Rosetta Code for good results. In summary, our system would be an useful aid for a computer science student in an introductory data structures and algorithms course like CSE 332.
If we had more time, we would improve our data quality and search UI in a couple ways. A vast majority of algorithms that missed appropriate linking were not in our Wikipedia database. Rosetta Code does not have enough algorithm implementations for every algorithm name (for example “Rayleigh quotient iteration”, “Lanczos algorithm”). On the other hand, we have the opposite problem for our npm packages. Of the roughly 200,000 npm packages, many of them could be considered implementing an algorithm, but they don’t link because we don’t have the Wikipedia algorithm names (for example “binary-search-tree”, “tf-idf”). In addition, the way in which we find Wikipedia algorithms could be improved. Some algorithm pages in Wikipedia aren’t under the “Algorithm” subcategory and some pages under that category aren’t in fact algorithms.
In addition, we would improve the disambiguation of implementation titles with algorithm titles. For example, the implementations for task “Fork” groups many correct pieces of code around the system call concept of forking, however the description explains piece of cutlery meaning of fork. With regards to the UI, the current website displays a lot of information, much of which could be removed for a cleaner experience with less information overload. There are also some small bugs with links to languages and algorithm titles with non-ascii characters and room for further deduplicating languages and algorithm titles.
Build it yourself!
Project source code and documentation can be found at github.com/xkxx/throwtable. The README.md contains instructions on how to build and run the database and website.
Grant Timmerman is a senior in the University of Washington’s Computer Science Department interested in machine learning, programming languages, and mobile applications. While he’s not building 3D printers, Grant actively contributes to open source, gives talks about the hottest in web frameworks, and practices speaking Mandarin Chinese. For more interesting projects and experiments, visit Grant’s personal website grant.cm.