OpenAI recently announced Codex, an AI model that generates program code from natural language descriptions. Codex is based on the GPT-3 language model and can solve over 70% of the problems in OpenAI's publicly available HumanEval test dataset, compared to 0% for GPT-3.
The OpenAI research team described the model in a paper published on arXiv. Codex, which is based on the same technology that powers GitHub's Copilot, is a GPT-3 model that has been fine-tuned using publicly available Python code. To benchmark the system's performance, the team manually created HumanEval, an open-source test dataset of 164 programming problems, each consisting of a prompt for the model and a set of unit tests to check the validity of the generated code. When Codex generated a single solution for each problem, the unit tests passed for 28.8% of the problems; when allowed to generate 100 solutions for each problem, Codex generated at least one correct result for 72.3% of the problems.
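As a rough illustration of what "a prompt plus unit tests" means in practice, the sketch below runs two candidate completions against a small test function. The problem, the `add` function, the candidate strings, and the `passes_tests` helper are all hypothetical stand-ins, not taken from HumanEval or from OpenAI's evaluation harness.

```python
# Hypothetical HumanEval-style problem: the prompt is a function signature
# plus a docstring, and the model is expected to supply the function body.
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''

# Stand-ins for model output; a real system would sample these from the model.
CANDIDATES = [
    "    return a - b\n",  # functionally wrong
    "    return a + b\n",  # functionally correct
]


def check(candidate_fn):
    """Unit tests that decide whether a completion is accepted."""
    assert candidate_fn(1, 2) == 3
    assert candidate_fn(-1, 1) == 0


def passes_tests(prompt, completion, entry_point="add"):
    """Return True if prompt + completion defines a function that passes check()."""
    namespace = {}
    try:
        # Caution: generated code is untrusted; a real harness would sandbox this.
        exec(prompt + completion, namespace)
        check(namespace[entry_point])
    except Exception:
        return False
    return True


results = [passes_tests(PROMPT, c) for c in CANDIDATES]
print(results)
```

The point of the design is that a completion is judged only by whether it behaves correctly, not by how closely its text resembles a reference solution.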
In 2018, OpenAI first published a paper on generative pre-trained transformers (GPT), an unsupervised learning model that achieved state-of-the-art results on several NLP tasks; a 1.5B-parameter model called GPT-2 was released in 2019. Last year, OpenAI announced a 175B-parameter model, GPT-3, and during experiments discovered that it could "generate simple programs from Python docstrings," even though the model was not explicitly trained for code generation.
To develop Codex, OpenAI started with a pre-trained GPT-3 model. The team then collected Python code files from 54M public GitHub repositories, filtering down to a final dataset of 159 GB. The model uses the same text tokenizer as GPT-3, although the researchers found this to be suboptimal, as the word distribution in code differs from that of natural language. In addition, Python code contains significant whitespace, so the team introduced an additional set of tokens to represent whitespace "runs."
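To make the idea of whitespace "run" tokens concrete, here is a minimal sketch that collapses each run of consecutive spaces into a single placeholder token. The token spelling (`<|sp4|>`), the `MAX_RUN` cap, and both function names are assumptions made for illustration; the article does not describe the actual token format Codex uses.

```python
import re

MAX_RUN = 8  # hypothetical cap on the run length covered by one token


def encode_whitespace_runs(code):
    """Replace each run of 2..MAX_RUN spaces with one placeholder token."""
    def to_token(match):
        return f"<|sp{len(match.group(0))}|>"
    return re.sub(r" {2,%d}" % MAX_RUN, to_token, code)


def decode_whitespace_runs(text):
    """Expand placeholder tokens back into the spaces they stand for."""
    return re.sub(r"<\|sp(\d+)\|>", lambda m: " " * int(m.group(1)), text)


source = "def f(x):\n    if x:\n        return 1\n"
encoded = encode_whitespace_runs(source)
print(encoded)
```

With a scheme like this, an indentation of eight spaces costs one token instead of several, which matters because indentation appears on almost every line of Python.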
Previous code generation models in the literature are often benchmarked using a fuzzy match of the output against a reference output, such as a BLEU score. By contrast, the OpenAI team chose to use functional correctness for their evaluation, arguing that this is how human developers judge code. The particular metric used is pass@k, meaning that the model generates k code samples, and if any sample passes the unit tests, the model has solved the problem. The 12B-parameter Codex model achieved scores of 28.8% for k=1 and 72.31% for k=100, compared with 2.58% and 7.59% for TabNine's largest free model, and 11.6% and 27.74% for GPT-J.
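The pass@k definition above can be sketched in a few lines. The first function is the literal definition; the second is a standard combinatorial estimate that draws n ≥ k samples, counts the c that pass, and computes 1 − C(n−c, k)/C(n, k). The function names and sample data are hypothetical, and the article does not say how the reported figures were actually computed, so the estimator is shown here as an assumption.

```python
from math import comb


def solved_at_k(sample_results):
    """Literal definition: a problem is solved if any of the k samples passes."""
    return any(sample_results)


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed the unit tests.

    This is the probability that a random subset of k of the n samples
    contains at least one passing sample: 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # every subset of size k must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_problem_counts, k):
    """Average pass@k over problems, given (n, c) pairs for each problem."""
    scores = [pass_at_k(n, c, k) for n, c in per_problem_counts]
    return sum(scores) / len(scores)


# Hypothetical counts for three problems: (samples drawn, samples that passed).
counts = [(10, 0), (10, 3), (10, 10)]
print(benchmark_pass_at_k(counts, k=1))
```

For k=1 the estimate reduces to the fraction of samples that pass, which is why pass@1 is much lower than pass@100 for the same model.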
Besides adding Codex to its own API, OpenAI worked with GitHub to incorporate the model into GitHub's Copilot code generation tool. Although neither Codex nor Copilot is open-source, there are several similar open-source projects. Last year, Microsoft open-sourced CodeBERT, a code-generation model based on BERT. In August of this year, NovelAI released Genji, a model based on GPT-J.
In a discussion on Hacker News about Codex, one commenter pointed out that despite performing well on benchmarks, the model performs poorly on interview and code competition questions:
"This suggests to me that Codex really doesn't understand anything about the language beyond syntax. I have no doubt that future systems will improve on this benchmark, but they will likely take advantage of the AST and could use unit tests in a RL-like reward function."
OpenAI's HumanEval dataset is available on GitHub.