Large Language Models for Code by Loubna Ben Allal at QCon London

At QCon London, Loubna Ben Allal discussed Large Language Models (LLMs) tailored for coding. She covered the lifecycle of code completion models, from pre-training on vast codebases to the fine-tuning step, focusing on open-source models facilitated by platforms like Hugging Face. Available resources include over 1.7k models on the HF hub and tools like StarCoder2 and Defog-SQLCoder. Customization techniques such as instruction tuning offer tailored solutions, but come with challenges like data bias and privacy concerns.

In recent years, LLM-based code completion tools have caused a significant shift in software development practices. GitHub Copilot, introduced in 2021, was one of the first tools to demonstrate that LLMs can improve developer productivity. However, it is only available as a paid service, with no access to the trained model or the data used to train it. This is why open-source alternatives, such as CodeLlama, BigCode, and DeepSeek Coder, have emerged. BigCode is an open scientific collaboration that takes an open and transparent approach to both the datasets used to train its models and the models themselves. The BigCode Slack channel counts more than 1,100 researchers, engineers, lawyers, and policymakers, and model weights trained as part of BigCode are released under a commercial-friendly license.

Ben Allal explained that the dominant backbone of these models is the transformer architecture, which is effective at understanding and generating human-like text. Training starts with an untrained model that ingests a vast corpus of code, often sourced from public repositories such as GitHub. She noted that Hugging Face hosts large datasets for this purpose, such as The Stack and The Stack v2, which contain 6.4 TB and 67.5 TB of code respectively.
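These corpora are accessible through the Hugging Face datasets library. As a minimal sketch (assuming you have accepted the dataset's terms of use on the Hub and are logged in), the Python subset of The Stack can be streamed instead of downloading the full 6.4 TB:

    from datasets import load_dataset

    # Stream the Python subset of The Stack rather than downloading it in full.
    # The dataset is gated on the Hub: accept its terms and log in first,
    # for example via `huggingface-cli login`.
    ds = load_dataset(
        "bigcode/the-stack",
        data_dir="data/python",
        split="train",
        streaming=True,
    )

    # Each row is one source file; the "content" column holds the raw code.
    for example in ds.take(3):
        print(example["content"][:200])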

After pretraining, the model undergoes supervised fine-tuning, where it is refined on a smaller, more focused dataset to improve the accuracy and relevance of its code suggestions. This phase is critical for aligning the model with specific coding languages or frameworks. Subsequently, Reinforcement Learning from Human Feedback (RLHF) is employed to align the model even more closely with human preferences. Ben Allal highlighted several useful tools and papers during her talk:

  • StarCoder2 and StarChat2: these newer models are aware of the repository context and can follow instructions given by the user. You can experiment with them online, or run them locally as in the sketch after this list.
  • Defog-SQLCoder: A model outperforming GPT-4 in generating SQL queries, showcasing the potential for specialized models.
  • The llm-vscode extension: it lets you use open models as an alternative to GitHub Copilot inside your IDE.
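For those who want to try such a model locally, the following is a minimal sketch using the transformers library; the checkpoint name bigcode/starcoder2-3b (the smallest StarCoder2 variant) and the prompt are illustrative, and a GPU is assumed for reasonable generation speed:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the 3B-parameter StarCoder2 checkpoint; larger variants exist.
    checkpoint = "bigcode/starcoder2-3b"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Ask the base model to complete a function body from its signature.
    prompt = "def fibonacci(n):\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))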

Customization of code completion models can range from prompt engineering to continued pretraining on specific datasets. This enables adapting models to niche domains or particular coding styles where needed. Ben Allal highlighted several papers that can help here, such as "Magicoder: Source Code Is All You Need" and "OpenCodeInterpreter". She also explained that not all seemingly obvious optimization techniques work in practice, as described in the SantaCoder paper: in the end, the team settled on several file-type filters, but avoided very aggressive filtering.
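Full continued pretraining is expensive, so parameter-efficient fine-tuning is a common middle ground for this kind of customization. The sketch below (an illustration, not taken from the talk) uses the peft library to wrap a base model with LoRA adapters so that only a small fraction of the weights is trained; the hyperparameters and the attention-projection module names are assumptions that may need adjusting per architecture:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-3b")

    # LoRA attaches small trainable low-rank matrices to selected layers;
    # r, alpha, and the target modules below are illustrative values.
    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumed attention projections
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora)
    model.print_trainable_parameters()  # only a small share is trainable

The resulting model can then be trained on a domain-specific dataset with a standard trainer, and the adapter weights can be shared separately from the base model.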

Last but not least, it is important to evaluate your model. There are currently multiple leaderboards for code generation. The Big Code Models Leaderboard looks solely at the performance of open-source models, while the EvalPlus leaderboard also takes closed-source models into account. A third one to keep an eye on is the LiveCodeBench leaderboard: this benchmark tries to prevent leakage of test data into training sets by frequently adding new problems.
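These leaderboards typically report pass@k on unit-test benchmarks such as HumanEval. As a minimal sketch of what such an evaluation does, the code_eval metric from the Hugging Face evaluate library executes candidate completions against test cases (the toy problem below is illustrative):

    import os
    from evaluate import load

    # code_eval executes untrusted, model-generated code, so it must be
    # enabled explicitly; only run it in a sandboxed environment.
    os.environ["HF_ALLOW_CODE_EVAL"] = "1"

    code_eval = load("code_eval")

    # One problem: a test case plus a list of candidate completions.
    test_cases = ["assert add(2, 3) == 5"]
    candidates = [["def add(a, b):\n    return a + b"]]

    pass_at_k, results = code_eval.compute(
        references=test_cases,
        predictions=candidates,
        k=[1],
    )
    print(pass_at_k)  # {'pass@1': 1.0}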
