Facebook Releases AI Code Search Datasets

Facebook AI released a dataset containing coding questions paired with code-snippet answers, intended for evaluating AI-based natural-language code search systems. The release also includes benchmark results for several of Facebook's own code-search models and a training corpus of over 4 million Java methods parsed from over 24,000 GitHub repositories.

In a paper published on arXiv, researchers described their technique for collecting the data. The training data corpus was collected from the most popular GitHub repositories of Android code, ranked by number of stars. Every Java file in the repositories was parsed, identifying the individual methods. Facebook used the resulting corpus in their research on training code-search systems. To create the evaluation dataset, they started with a question-and-answer data dump from Stack Overflow, selecting only questions that had both "Java" and "Android" tags. Of these, they kept only questions that had an upvoted answer that also matched one of the methods identified in the training data corpus. The resulting 518 questions were manually filtered to a final set of 287. According to the researchers:

Our data set is not only the largest currently available for Java, it’s also the only one validated against ground truth answers from Stack Overflow in an automated (consistent) manner.

Facebook has recently published several papers on neural code search, a machine-learning technique for training neural networks to answer "how-to" coding questions. Software developers often turn to Stack Overflow to learn how to solve a particular coding problem, such as how to fix a bug in an Android app. However, this isn't an option when working on code that uses proprietary APIs or less-common programming languages; in these cases there are few (or no) experts outside the programmer's own organization. Facebook and others have instead explored the idea of using source code itself as training data to produce natural-language processing (NLP) systems that can answer coding questions.

Last year, Facebook published a paper on an unsupervised-learning method called Neural Code Search (NCS), which was trained on the data collected from GitHub. This technique extracts words from source code and learns embeddings, which map each word to a vector in a high-dimensional space. Embeddings often have the property that vectors that are "close" to each other in the vector space represent words with similar meanings, and that relationships between words can be expressed with vector arithmetic. A well-known example is a word2vec model trained on Wikipedia, which, when given the vector expression "Paris - France + Spain," returns "Madrid."
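
The vector-arithmetic property can be illustrated with a small sketch. The four-dimensional vectors below are invented by hand for this example and are not taken from word2vec or from Facebook's models; real embeddings are learned from data and typically have hundreds of dimensions. The `analogy` helper is a hypothetical name, and cosine similarity is assumed as the measure of closeness.

```python
import math

# Hand-made toy vectors (NOT learned embeddings). The dimensions loosely mean:
# [French, Spanish, German, is-a-capital-city].
EMBEDDINGS = {
    "France":  [1.0, 0.0, 0.0, 0.0],
    "Paris":   [1.0, 0.0, 0.0, 1.0],
    "Spain":   [0.0, 1.0, 0.0, 0.0],
    "Madrid":  [0.0, 1.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 1.0, 0.0],
    "Berlin":  [0.0, 0.0, 1.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(a, b, c, embeddings):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the input words themselves, as analogy demos usually do.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


result = analogy("Paris", "France", "Spain", EMBEDDINGS)
print(result)
```

With these toy vectors, subtracting "France" removes the country component from "Paris" and adding "Spain" replaces it, so the nearest remaining word is the Spanish capital.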

After the embeddings are learned, each Java method in the corpus is converted to a vector in the embedding space using a "bag of words" model; each word in the code is converted to a vector via the embeddings, and a weighted sum of vectors is assigned to the method as its index. This maps each Java method to a point in the embedding space. To answer a coding question, the question is similarly mapped to a point in the embedding space by converting each word in the query via embeddings and producing a weighted sum. The "answer" to the question is the Java method whose index is the closest to that point. The key idea is that both queries and code use the same embedding, and the training does not need any questions in the input data; it learns only from the source code.
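
The indexing and lookup steps described above can be sketched roughly as follows. This is a minimal illustration and not Facebook's implementation: the embedding values and the two Java snippets are invented, the weights in the weighted sum default to 1.0 as a placeholder, and cosine similarity is assumed as the meaning of "closest." The names `tokenize`, `embed`, and `search` are hypothetical.

```python
import math
import re

# Toy word embeddings, invented for illustration. In NCS these would be
# learned from the source-code corpus.
EMBEDDINGS = {
    "close":      [1.0, 0.0, 0.0, 0.0],
    "read":       [0.0, 1.0, 0.0, 0.0],
    "database":   [0.0, 0.0, 1.0, 0.0],
    "connection": [0.0, 0.0, 1.0, 0.0],
    "file":       [0.0, 0.0, 0.0, 1.0],
}

# Hypothetical Java methods standing in for the corpus.
METHODS = {
    "closeDatabaseConnection":
        "void closeDatabaseConnection() { connection.close(); }",
    "readFile":
        "String readFile(File file) { return reader.read(file); }",
}


def tokenize(text):
    """Split text into lowercase words, breaking camelCase identifiers."""
    return [w.lower() for w in re.findall(r"[A-Za-z][a-z]*", text)]


def embed(words, embeddings, weights=None):
    """Bag of words: weighted sum of the vectors of all known words."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        if word not in embeddings:
            continue  # words without an embedding contribute nothing
        weight = weights.get(word, 1.0) if weights else 1.0
        for i, value in enumerate(embeddings[word]):
            total[i] += weight * value
    return total


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Index: every method becomes one point in the embedding space.
METHOD_INDEX = {
    name: embed(tokenize(body), EMBEDDINGS) for name, body in METHODS.items()
}


def search(query, method_index, embeddings):
    """Map the query into the same space and return the closest method."""
    query_vector = embed(tokenize(query), embeddings)
    return max(method_index, key=lambda n: cosine(query_vector, method_index[n]))


print(search("How do I close a database connection?", METHOD_INDEX, EMBEDDINGS))
```

Note that `embed` silently skips any word that has no embedding, so a query made up entirely of words absent from the code vocabulary would map to the zero vector and match nothing meaningfully.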

One downside of this technique is that it does not learn embeddings for words that are not in the source code. Facebook researchers found that on Stack Overflow, fewer than half of the words in questions were also in the source code. This prompted the researchers to extend NCS with supervised learning, "to bridge the gap between natural language words and source code word." The resulting system, called Embedding Unification (UNIF), learns a separate embedding for query words. For this training process, the team extracted a set of question titles and code snippets from Stack Overflow using a process similar to that used to collect the benchmark dataset. This training dataset contains 451k question-answer pairs, none of which are in the benchmark. The UNIF system trained on this data outperformed NCS slightly when evaluated on the benchmark. Both systems returned the "correct" answer as the top result about one-third of the time, and returned it in the top-five results about half the time.
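
The top-one and top-five figures quoted above are the kind of measurement a benchmark with known answers makes possible. The sketch below shows one way such a score might be computed; the function name `answered_at_k`, the question IDs, and the ranked results are all hypothetical and do not come from Facebook's benchmark or its evaluation scripts.

```python
def answered_at_k(ranked_results, ground_truth, k):
    """Fraction of questions whose correct answer is in the top k results."""
    if not ranked_results:
        return 0.0
    hits = sum(
        1
        for question, results in ranked_results.items()
        if ground_truth[question] in results[:k]
    )
    return hits / len(ranked_results)


# Invented data: for each question, the methods a search system returned,
# best match first.
RANKED_RESULTS = {
    "q1": ["closeCursor", "openCursor", "readRow"],
    "q2": ["hideView", "showView", "hideKeyboard"],
    "q3": ["parseJson", "readFile", "writeFile"],
}

# Invented ground truth: the method accepted as correct for each question.
GROUND_TRUTH = {
    "q1": "closeCursor",    # returned at rank 1
    "q2": "hideKeyboard",   # returned at rank 3
    "q3": "startActivity",  # not returned at all
}

top_1 = answered_at_k(RANKED_RESULTS, GROUND_TRUTH, 1)
top_5 = answered_at_k(RANKED_RESULTS, GROUND_TRUTH, 5)
print(f"top-1: {top_1:.2f}, top-5: {top_5:.2f}")
```

In this toy data, one of the three questions is answered by the first result and two of the three within the first five.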

Although Facebook hopes its NLP research will boost programmer productivity, a user on Reddit noted:

[O]ften times the most difficult part of programming is to describe exactly what I want to do in concise natural language.

Both the training and evaluation datasets are available on GitHub.
