GitHub Releases Dataset of Six Million Open-Source Methods for Code Search Research

Published in SyncedReview, Oct 1, 2019

Regular web search engines like Google may be great for finding a restaurant, but they are lousy for locating a snippet of code. In a bid to help software developers and foster innovative code search research, GitHub last week announced the CodeSearchNet Challenge in a joint effort with California-based machine learning development tools startup Weights & Biases. A large dataset and several baseline models showing the current state of the art in code search have been released to help scientists build models for the challenge.

Faced with unsatisfactory code search results from natural language processing engines, researchers have in recent years been applying machine learning techniques to improve code search. They quickly realized, however, that unlike natural language processing, which has the GLUE benchmark, code search has no standard dataset suitable for evaluation. The CodeSearchNet Challenge aims to fill that gap by providing a common way to assess the effectiveness of different code search methods.

The challenge is detailed in an arXiv paper from GitHub and the Deep Program Understanding group at Microsoft Research that covers the background of the initiative and how the dataset was created. To encourage researchers and practitioners, GitHub will host a leaderboard to track progress on the challenge, ranking the code search methods based on normalized discounted cumulative gain.
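For readers unfamiliar with the leaderboard metric, here is a minimal sketch of normalized discounted cumulative gain. It assumes the common exponential-gain variant with a log2 rank discount; the exact variant used by the leaderboard is not stated in this article, and the function names `dcg` and `ndcg` are illustrative only. As a simplification, the ideal ordering is computed from the same list of scores that was returned.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance scores given in ranked order."""
    # Rank 0 is the top result; its discount is log2(2) = 1.
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the best possible ordering of the same scores."""
    ideal = dcg(sorted(relevances, reverse=True))
    # If every result is irrelevant, the ideal DCG is 0; report 0.0 by convention.
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevance of four returned snippets, best-ranked first (0 = irrelevant, 3 = exact match).
print(ndcg([3, 2, 1, 0]))  # already in ideal order
print(ndcg([0, 1, 2, 3]))  # worst possible order of the same results
```

A ranking that places the most relevant snippets first scores 1.0, and the score drops as relevant results are pushed further down the list.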

CodeSearchNet is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language. The challenge consists of 99 natural language queries with over 4,000 expert relevance annotations of likely results from the CodeSearchNet Corpus. The corpus itself contains about six million functions from open-source code spanning six programming languages: Go, Java, JavaScript, PHP, Python, and Ruby.

The fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3, and includes:

Six million methods overall;
Two million methods with associated documentation (docstrings, JavaDoc, etc.);
Metadata that indicates the original location (repository or line number, for example) where the data was found.
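To give a feel for working with such a corpus, the sketch below counts how many methods carry documentation, per language. It assumes the download is stored as gzipped JSON Lines and that each record has `language` and `docstring` fields; the file layout, the field names, and the helper names are assumptions made for illustration, not details confirmed by this article.

```python
import gzip
import json
from collections import Counter


def iter_records(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def has_documentation(record):
    """True when the (assumed) 'docstring' field holds non-blank text."""
    return bool((record.get("docstring") or "").strip())


def summarize(records):
    """Count all methods and documented methods per (assumed) 'language' field."""
    total, documented = Counter(), Counter()
    for record in records:
        language = record.get("language", "unknown")
        total[language] += 1
        if has_documentation(record):
            documented[language] += 1
    return total, documented
```

With a downloaded shard at a hypothetical local path, `summarize(iter_records("python_train_0.jsonl.gz"))` would return the two per-language counters.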

In a press release, GitHub machine learning engineer Hamel Husain explained that GitHub has also released a data preprocessing pipeline that can be used as a starting point for applying machine learning to code.

To evaluate the various code search models, GitHub first collected an initial set of code search queries, then used a standard Elasticsearch installation and its baseline models to obtain ten likely results per query from the CodeSearchNet Corpus. Programmers, data scientists, and machine learning researchers were then asked to annotate the proposed results for relevance to the query on a scale from zero (totally irrelevant) to three (exact match).
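The annotation procedure can be sketched roughly as follows. The article does not say how results from the different systems were merged or how multiple ratings for the same result were combined, so the union of each system's top results and the arithmetic mean used here are assumptions, as are the function names.

```python
from collections import defaultdict
from statistics import mean


def pool_candidates(rankings_by_system, k=10):
    """Merge the top-k results of each system, keeping first-seen order."""
    pooled, seen = [], set()
    for ranking in rankings_by_system.values():
        for candidate in ranking[:k]:
            if candidate not in seen:
                seen.add(candidate)
                pooled.append(candidate)
    return pooled


def average_ratings(annotations):
    """Average 0-3 relevance scores per (query, candidate) pair."""
    scores = defaultdict(list)
    for query, candidate, score in annotations:
        if not 0 <= score <= 3:
            raise ValueError(f"score {score} is outside the 0-3 scale")
        scores[(query, candidate)].append(score)
    return {pair: mean(values) for pair, values in scores.items()}
```

Pooling before annotation means annotators only rate results that at least one system considered likely, which keeps the labeling effort manageable.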

“We want to expand our evaluation dataset to include more languages, queries, and annotations,” Husain wrote. “As we continue adding more over the next few months, we aim to include an extended dataset for the next version of CodeSearchNet Challenge in the future.”

The CodeSearchNet dataset is available for download on GitHub. The paper CodeSearchNet Challenge: Evaluating the State of Semantic Code Search is on arXiv.

Author: Yuan Yuan | Editor: Michael Sarazen


Written by Synced