Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Large Language Models for Code: Exploring the Landscape, Opportunities, and Challenges

Large Language Models for Code: Exploring the Landscape, Opportunities, and Challenges



Loubna Ben Allal discusses Large Language Models (LLMs), exploring the current developments of these models, how they are trained, and how they can be leveraged with custom codebases.


Loubna Ben Allal is a Machine Learning Engineer in the Science team at Hugging Face working on Large Language Models for code & synthetic data generation. She is part of the core team behind the BigCode Project and has co-authored The Stack dataset and StarCoder models for code generation.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Allal: My name is Loubna. I'm a machine learning engineer at Hugging Face. I work on large language models for code. I will show you how these models are trained and how you can leverage them for your own use cases. I work at Hugging Face. I did my studies in Paris, on engineering and deep learning. I mainly now work on LLMs for code and synthetic data.

How it Started: GitHub Copilot in 2021

Let's see a little bit how all of this started. Large language models for code have been there for a while. This topic became very trendy in the AI world when GitHub Copilot was introduced in late 2021. This is a VS Code extension by Microsoft, that autocompletes your code and it uses a model by OpenAI called Codex. This was a very huge breakthrough in the field, because this model was so much better than all the other code completion models before it, which were so much smaller and much less performant. This is important because it improves the productivity of the engineers who use it. Not all of them but some of them. For example, it can help you write unit tests or documentation, or even do more complex tasks with the recent LLMs that we have now. For example, in this blog post by Google AI, I think it was in 2022, they found a 6% reduction in code iterations, if you scale that to hundreds of thousands of engineers, this is a significant gain of time and money. This was very exciting, GitHub Copilot, AWS CodeWhisperer, and other models, but the issue is that they were only available through an API. You don't have the model checkpoints. You can't use the model to fine-tune it on your own use case. You also don't have information on the data that was used to train these models, so there isn't a lot of data transparency around. Also, the code to do the training and data processing is not available. All of this makes these results not reproducible. If you go to the Hugging Face Hub, where you can find all the open source models, and you search for the type code, you can see that now we have over 1700 models that are trained on code. Some of them are LLMs that include code in the training data, but a lot of them are pure code completion models. We've made a lot of progress in this field.

How Did We Get Here?

You might be wondering, how did we get here? This is the product of the community's work to train code models that not only can autocomplete your code, but that's going to also follow instructions, and thus we call them, instruction-tuned. If you go, for example, to this code leaderboard that we made, where you rank code completion models and compare them on different programming languages, you can see that on some benchmarks, the scores are pretty high. For example, solving more than 80% of the problems in the benchmark for some of the instructional models for code that are out there. There's a lot of progress in this field. There are a lot of interesting models that you can use.

How are Code LLMs Trained?

How are these models trained? If you want to train an LLM in general or an LLM for code from scratch, you should know that this requires a lot of resources. First, you need to have a lot of GPUs to be able to train the models in a reasonable amount of time. This can go from hundreds of GPUs to thousands, or even more. You also need to have a lot of data, because generally these LLMs, they require terabytes of data to train them efficiently. We should be able to scrape all these data, to filter it, and then to train on it. You should also be able to scale the performance. Because once you have the data, it's not like you're going to take the model to figure out which filtering makes sense for your dataset, which hyperparameters make sense for the model. That will require a team who's able to dedicate effort and time to do all of these experiments. Training your models from scratch are not for everyone. It requires a lot of resources. If we look at the technical details, these models, they usually all have the same architecture, which is a transformer model. First, you start from a model that is untrained. Then you show it a lot of data, it becomes a pretrained base model. Then you show it some data where you have ground truth. This is called supervised fine-tuning. After that, you can do another step called RLHF, which is alignment to get the model to hallucinate less, to generate less biased content and toxic content. The aim is to align it with human preferences. This is when you get a chat model, for example, like ChatGPT. ChatGPT went through all of these steps. For us, we want code models. We're not going to train on just the web, we need to train the model on code. Where do we get that? We get the dataset from GitHub. Then if you want to build a model that can follow instructions, then you need to build a dataset for the SFT step. It looks like this. For example, you have an instruction, write a function or solve a bug. Then you have the solutions which are the ground truth. Then you train your instruction-tuned model. That's how you get a chat LLM but specific to code.

The Landscape of Code LLMs

We need a lot of resources to train these models from scratch, and not everyone can do that. We're very proud to be part of these people who train these models for the community through a project called BigCode. In this project, we release the stack dataset, which is the largest open dataset of source code. We also released two families of code generation models, StarCoder and StarCoder2, in different sizes, along with an instruction-tuned model called StarChat2. Other players in this field are Meta with their Code Llama models, and DeepSeek Coder with the DeepSeek models. There are also other models, for example, from StabilityAI, CodeGen from Salesforce, and other LLMs. All of these models are open. How open depends on the license that they are released under. You can find the checkpoints, and you can adapt them to your own use cases.

I've told you some of the things that we developed in BigCode. What is BigCode, actually? It's a collaboration that is between Hugging Face and ServiceNow. Our goal was to have an open project where everyone can join and can help us train these models in an open and transparent approach. For example, in our Slack channel, we have over 1000 people who joined, these are researchers, engineers, but also lawyers and policymakers, because we really also care about the data governance and privacy aspect. We wanted to invest some time into that. The pillars of this project are three, full data transparency. For example, if you want to use our models, you can know exactly on which data they were trained. This data is public, and you can inspect it. We'll also open source the code for processing the datasets and also training the models to try to make our work reproducible. The model weights are released under a commercially friendly license.

The motivation behind this project was to try to encourage better practices in the field of AI and LLMs in general. These practices are not always respected in the closed source field. For example, you might have a model whose weights are public, but the details about the training data are not disclosed. This might be because they're afraid of lawsuits, but also to not give up their competitive edge to others. Sometimes the model weights are not public, for example, which is the case for ChatGPT. If you want to use these models, so for example, even use GitHub Copilot, the issue is that you will have to send your data to a third party, because you're just calling an API. In a lot of cases, your data might be sensitive so you don't want it to be sent to third parties. That's when you want something that is deployed on premise, and that is secure. All of this makes this work not reproducible, and doesn't encourage progress in the open source field. What we're trying to achieve with this project is to have, for example, public data. The data that we trained on is available and people can inspect it. If they want to be removed and not be included in our future trainings, they can just fill a form and opt out. The model weights are public for fine-tuning. You can also deploy them in premise. This is the dataset we released. How we built it is that we basically scraped all of GitHub, and then we filtered the datasets for licenses we can use. Then we did additional filtering, like removing files that look similar in a step called deduplication. We also have a tool called, Am I in The Stack, where you can go, you can just type your GitHub username, and you can check if any of your repositories are in our dataset. If you don't want to be in the dataset and in the model trainings, you can just fill a form and we'll make sure to not use your data.

This is the first model that we trained. It was released last April. It's called StarCoder. It has 15 billion parameters. It was trained on 500 A100s for 24 days. When it was released, it was the best code generation model. Last month, we released a new dataset called The Stack v2, which is much larger than the first version of the stack, and also still the largest open dataset of source code, in collaboration with Software Heritage. We also trained a new model called StarCoder2, which is much better than StarCoder, and also better than a lot of other models there. For example, if you want a 15B model, the best model is StarCoder2. It even outperforms other models like Llama 34B. It's good on code, but also math. Another thing we added in this model compared to StarCoder1 is RepoContext. Before, when we were training, we would just grab files from GitHub and randomly concatenate them and train on them so we lost the repository level structure. For StarCoder2, we made sure to keep files that are in the same repository next to each other during the training. This StarCoder2 model is aware of repository context. If you were to use it, for example, in a VS Code extension, and concatenate files from your repositories, it can give you completions that are in other files and not necessarily in the file that you're editing. We also built an instruction-tuned version of StarCoder2, called StarChat2 in collaboration with the Hugging Face H4 team. This model is available on a space that's Hugging Face, You can query it. It can also complete your code. It can also follow instructions, not just on Python, but also on other programming languages, because StarCoder2 was trained on more than 600 programming languages. This creates a BigCode ecosystem where people would take the dataset that we released because it has so many programming languages, and develop new models on top of it. This can be new pretraining from scratch, for example, like StableCode or CodeGen 2.5, and other models. Or people would just start from the models with these, for example, StarCoder and StarCoderBase, and they would fine-tune them for their own use cases. For example, here you can see that there's WizardCoder, which is an instructs tuned version of StarCoder. There's also Defog SQLCoder, which is very interesting because it outperforms GPT-4 on SQL, and they just started from StarCoder, and they fine-tuned this on a lot of human annotated data for SQL. They were able to outperform GPT-4 on that. This shows the power of open source, when you release models and tools for the community, they can use them and build new things that even you haven't thought of.

Systems Surrounding LLMs

Let's see how you can go from a model to an API or a tool like VS Code extension. First, you have the model and then you wrap it around, then inference the endpoint with an API, then you have a chat interface. Each time you go to another level, you need to add new parameters. For example, when you go from the model to the API, you need to add moderation and compute. When you go to the chat interface, you need to have a system prompt for the model to control its behavior. You also need to have some hyperparameters to have a very nice user experience. This is for chat models in general. If we go to code models, you can maybe just swap the last component of a chat model with the code interface, that could be a VS Code extension, or JetBrains, or something like that. Hugging Face released two things for that, HuggingChat, which is like ChatGPT but only uses open source models. It's free to use. We also have an extension called llm-vscode. We deployed StarCoder there and we also deployed other open access models, like Code Llama and DeepSeek. You can all find them there if you want to use an alternative to GitHub Copilot that uses open models.

Current Customization Techniques for LLMs

Now let's go back to this slide. I said that to train code models from scratch, you need a lot of resources. The good news is that if you only want to customize existing models to your code base, or to new datasets, you don't need as much resources. This is possible thanks to fine-tuning. The reason why we would like to fine-tune and not just train from scratch might be related to the resources. If you don't have enough GPUs, or if you don't have enough data, because training from scratch requires a lot of data to get good performance. Or if the model that is already out there is good at your task, but you just want to improve the performance a little bit, you want to reduce the hallucinations, you want to include information that is more up to date, that's when you would go to other solutions than just training from scratch. If you want to adapt an existing code LLM or LLMs in general to a new task, there are some less intensive ways and other ways that are more intensive. The easiest one would be to just do prompt engineering. You would take a chat model and just tweak the prompt to get it to solve your task. Another thing that is a little bit more complex is in-context learning where you would add examples in the prompts to teach the model new behavior. For example, there are a lot of papers who are trying to use models, for example, ChatGPT, and try to teach it new libraries that probably were not included during the training, just by adding documentation to the context to see if the model is able to pick up new skills that it wasn't trained on. Another thing is tool use. I'm going to show later how to do that. Then there's fine-tuning. Then there's continued pretraining, which is like fine-tuning, but it is so much longer.

For prompt engineering, as I explained, you just need to change the prompt and add instructions to get the model to follow what you want. Some examples are, for example, few-shot where you add examples of the task you want. There's also chain-of-thought, where you try to get the model to solve a problem using step by step reasoning. This is, for example, very useful for math problems where instead of just asking the model to solve it, you ask it to reason step by step and split the problem to smaller, easier problems. The other thing that is also very interesting to use with code LLMs is tool use. For example, if you pick a general code LLM or just LLM in general, they might be very bad at arithmetic. If you can plug in a calculator to do the computations for you, or just add an interpreter to interpret the code, that might help you improve the performance a lot. Other techniques that are used are, for example, RAG, retrieval. You can do that with chat code LLMs, if you wanted to retrieve some documentation and add it to the context, or if you want to add more up to date information to the context. For fine-tuning, you would start from a base pretrained model, and then you can show it some SFT data. For example, for LLMs, this can be any domain, finance, medical domain, or summarization, and then the model is able to pick up this task. The only drawback is that, you need to have this SFT data that has pairs of questions and answers to be able to train the model on that. If you have it, you might be able to adapt the model to your own use case.

Customize Code Models: Code Completion

Now let's see what are some good practices if you have a code base, and you want to take an existing code LLM and adapt it to your own code base. Here, I'm going to put this blog post which is very interesting, by engineers at Hugging Face who tried to build a personal copilot. They took some of the Hugging Face libraries that were released after StarCoder training. They built a new code base, and got the model to learn these new libraries and to follow our practices for coding. In this blog post, they go through the steps of data processing, model fine-tuning, and also provide the code to be able to reproduce their results. If you want to do that, I think the first step is you need to prepare your dataset. The steps for preparing the dataset are very similar to the steps that we use for pretraining. When we were pretraining StarCoder, we first gathered all this data from GitHub. Then we did some filtering to remove files that wouldn't help the model during pretraining. We did another filtering that is very important, it's called deduplication, where you remove files that look similar. Because for LLMs, this really hurts their performance if you show them files that are very similar early in the training. It's very important to remove duplicates before you start the training. For filtering, I think if you have a custom code base, maybe you already are cleaning it. If it's not the case, you might want to have some filters to remove autogenerated files, and maybe filter configs, filter SSV data because that's not something that's going to teach the model how to code. After you've done all this filtering, then you can tokenize the dataset. I forgot another step, which is PII detection. Because in public repositories, although people shouldn't put their SSH in the API keys, you might be surprised that there's still a lot of them on public repositories on GitHub. We try to filter all of them and we release the tools that we used. We also try to remove names and emails. You might also want to do that before you train your model. Otherwise, you might risk having a model that would spit out an API key or someone's name when you use this during inference. I put here some links that you can use to check how to train these kinds of models. We also have resources for data deduplication with a library called datatrove. We also have published all the code that we used for our own data preprocessing for StarCoder models.

Let me go back to this PII reduction step because this might be interesting for you. First, we started by using RegExs to detect these keys in names and emails. The RegExs were not very good at catching some types of keys, so we tried to hire human annotators to annotate the dataset for secrets and PII. We released this dataset. It is gated, you have to fill a form because we don't want to expose people's information. Then we trained the named entity recognition model. You just show it a file and it is able to detect where is the name, where is an email, and where is the key. Then we run all of these on our dataset. It was quite intensive, because it's still an NER model. It's a neural network. If you want to run it on terabytes of data, you need a lot of GPUs. For us, I think it cost around 800 GPU hours on A100 machines. If your database is smaller, maybe it will be faster to run this pipeline. The code is also available on the BigCode project GitHub repository. One other thing that you might want to do is fine-tuning, but it's not just fine-tuning on 10,000 samples or 100,000 samples, it's fine-tuning on dataset that is large itself. Why would you want to do that? For example, if you want to take a model that is very good at Python, but maybe not very good at Swift. I think most models now focus on Python, because it's the default programming language for machine learning. There's a lot of hype around it. Maybe you want to adopt it for a low resource language. If you have a dataset for that, you can train the model on this dataset. In the stack, we also have a lot of programming languages, so you can take one of these languages and continue training the StarCoder models, for example, on them. A lot of people try doing that and it seems to help for some low resource languages. There are even benchmarks to test the performance, not just on Python, but also on these languages. There are even datasets that were developed for these low resource languages like MultiPL-T dataset.

Customize Models: Chat Assistant

We saw how to fine-tune on your custom code base, and also how to do continued pretraining on a dataset that is a little bit larger. Now we'll see how to build a chat assistant. If you have a code model, and you want to turn it into a chat assistant that could help you with your coding related questions, you need to follow what we call instruction-tuning. You need to have a dataset of instructions and all the answers to these instructions. This could be used for bug fixing, or for just code completion too, or a lot of other use cases. Here there's a paper called Magicoder, where they try to use just normal GitHub code from the stack, and they used another LLM, in their case it was GPT-4, to generate instructions and the answers to these instructions, just from some code files. This is a good approach. Because if you were to just select few topics, maybe you were losing diversity, but because you start from a general core dataset, you get files for different topics. You just need an LLM that can generate good instructions and solutions that actually work. I think they even tried to execute the solutions to see if they are correct, and they don't generate errors. They have a refinement pipeline to only keep the instructions and solutions that are actually relevant, and that will help them during the training. There's also the CodeInterpreter paper where they also try to do some code interpretation to increase the quality and remove the files and the instructions that fail. If you see their evaluations, they are able to get some pretty good performances with their approach. For example, in our StarChat2 model, how we built the instruction-tuning dataset, we just took a lot of instruction datasets, not all of them were made for code. Some of them are just general instructions, for example: write me a poem, who is the president of the United States? These are instruction datasets that are used for LLMs. It also helps to include them with other instruction datasets that are just dedicated to code. This way, your model doesn't lose its ability to know normal facts and also English. The model is available in this demo, as well as all the datasets that we included in this, even an alignment handbook where you can use these techniques to do the fine-tuning. One other good thing about fine-tuning techniques is that, recently, they are very cheap. You can find techniques like LoRA and PEFT. You can fine-tune 7B models in just a few hours on one GPU. Compared to two years ago, it was so much expensive to train these models, because you basically needed to train all of the weights, but now it's really manageable.

How are Code LLMs Evaluated?

We saw how these models are trained, how you can fine-tune them, but you might be asking yourselves, ok, I trained the model, how do I know it's good? How the models are evaluated in the code domain is that we have some benchmarks. We test on these benchmarks to see what is the score that you would get. For example, there's a benchmark called HumanEval. It is basically some function signatures. You ask the model to complete the function implementation. Then in the benchmark, you have unit tests for each problem. Then you execute the solution that your model generated against these unit tests to see if any of them pass. Then you report an average of the problems that pass. This is a lot intensive compared to just natural language evaluation because you need to first generate the solutions, and then you need to execute them, which might take some time. Luckily, the benchmark is not very big, but that's also a drawback because you only test on a small benchmark. It might be hard to see the differences between models that are very close just because you don't have a lot of problems in your benchmark. There are some other benchmarks that are larger, but they all require this generation and then execution step. One thing to be also aware of is that what you're going to execute is code that is generated by an LLM. It might contain malicious code, so we don't want to execute it on your machine. It's better to do it on a sandbox, or at least a Docker, to make sure it doesn't alter your system. This benchmark is not just for Python, I think it was translated to 18 other programming languages. We can find this, for example, even for Lua, Swift, and Rust. What I showed you was a leaderboard to rank these models to compare them. There are a lot of leaderboards for code now. There's also this one called EvalPlus, where they compare not just open models between each other, but they also compare to other closed models, like Claude 3 and GPT-4. There's also this LiveCodeBench. It is very interesting because a lot of people argue that maybe some of the code models are doing very well on the benchmarks, because they were trained on them. This is what we call data leakage or data contamination. Because everyone could have taken a benchmark and put it on GitHub, so if you just scrape GitHub and you train your model, there's a very high chance that you probably trained on the testing benchmark. It's very normal to have a model that is good at the benchmark you're testing on. Usually, we try to do decontamination. For example, for the StarCoder models, we checked all of our training data to make sure that none of the test benchmarks are included. This is very hard to check if people don't say that they explicitly did that in the paper. There are a lot of models who say nothing about their data processing pipeline. This leaderboard tries to only use very recent repositories on GitHub, and build a benchmark out of them. There's a very low chance that the current code LLMs saw what is in this benchmark. You could also compare the open models, but also the closed models.

If you train a code LLM on your custom code base and you want to evaluate it, I think you should first start by the standard benchmarks to make sure you're not losing performance on the language you're training it on. If you did the fine-tuning for a specific use case, for example, you want the model to follow a certain style when programming, you probably want to build a new benchmark that would test this ability. For that you would need to have human annotators, or you can use other powerful LLMs, maybe even closed, to build this benchmark. A lot of people are doing this because it's not enough to just run on the standard benchmarks that don't necessarily test what you want to implement. Another thing is to do just wild checks, test the model and see what it generates. Maybe the most efficient one would be to deploy the model, and then have some users test it. For example, you could deploy the model in a VS Code extension, and then have a number of software engineers test it. Then you can use metrics like the acceptance rate. For example, how many times is the code generated by the model accepted by the user? If this acceptance rate is high, it means that it's probably the generations are useful, and the people want to keep them. You can do multiple iterations and see when the acceptance rate declines or improves to judge the quality of the models you just trained and deployed. This is still an intensive approach. If you want to really test if your fine-tune is working, you need to go through that.

How are Code LLMs Served?

The last thing we're going to see is how these LLMs are served. You can train them, fine-tune them, evaluate them, but then you need to serve them and maybe deploy them to hundreds of users, to hundreds of thousands. Depending on your use case, you might choose some options, or others. We have some inference endpoints that you can use. If you don't have an engineering team that will do all the MLOps side of stuff, you can just purchase an endpoint and we'll take care of the deployment. You can just query this endpoint in the pay as you go. Otherwise, we have an open source library called Text-Generation-Inference. This library tries to take the most popular models and deploy them in a very efficient approach. You have your GPUs. You can use this library in the model and then be able to serve it. What's interesting in TGI is that it has tensor parallelism implemented. If the model is very large, and it cannot fit in one GPU, it can split the model on multiple shards and be able to load it. For example, we use this with very large models like Falcon 100 and 80 billion parameters in which it was able to work efficiently. It also has token streaming. It means that when you send the query or a prompt, you have the option of not waiting until the generation has ended and then showing. Token streaming is like, you show the tokens in the text as it is generated. This is very useful because it improves the user experience. A user doesn't have to wait until the full generation has ended, but it can also see the generations as they come. If the user doesn't like what is being generated, they can just stop. This reduces the perceived latency. There's also metrics for monitoring in TGI, and it is production ready. Once you launched it, and you tested it, you're sure that it's not going to fail in production. It is what is being used now in our inference endpoints. It also has techniques for quantization and optimization. It supports most of the popular Code LLMs, but also general LLMs. There are other open source libraries by the community that are similar, for example, vLLM, which also offers similar implementations. You can test both and see which one is good for your use case. If you want to explore more models and datasets related to code, you can go and search on the hub and find the ones that are relevant for your use case. You can also build demos to showcase your models and test them, and use GPUs if you don't have them.

Future Directions + Beyond LLMs

Let's see what the future directions of this field are. I think what we really need is to have better open source models and high-quality datasets. Even though we made a lot of progress, we're still, for example, behind GPT-4 in code completion, and other models like Claude 3. We still need to investigate a little bit more what is missing to catch up and close the gap to closed source.

We also need high quality datasets. For example, the stack was a good step towards democratizing training code models, because it is now available, and everyone can pretrain code models from scratch if they have the resources. We also need better data transparency and governance. That means telling the users exactly which data was used for the training, and also alerting about the possible biases in the dataset regarding privacy and security. We also need better evaluations that not only focus on the high resource languages, but also the low resource ones. Currently we focus on functional level, so there are some benchmarks to test that the repository context works, that the model can actually retrieve information from another file and generate it. We also need better evaluations that test also class implementations, not just functions. Overall, just evaluation that catch things that are more complex. Now we're making some progress towards that, but I think there's still room for improvement. The last thing is that we need better smaller specialized models, because not everyone has enough resources to deploy 15 billion model or 7B model. If we could generate smaller models that are better, that would be good too. Here, I tried to show you a little bit how you go from just data on GitHub, to actually products that you can use, for example, HuggingChat and the VS Code extension that we have. We saw that between the two, you have to do a lot of things, not just train on a lot of GPUs, but you have to do a lot of data curation, data governance. You have to also work on inference to be able to serve a lot of users, and evaluation, and also fine-tuning.

Questions and Answers

Participant 1: All the code that you suck in somehow from GitHub, you say, you take care that the licenses are ok, so there will be no GPL code in all those large language models. What about other licenses? What is a permissive license? Because almost all the licenses I know, say that you must leave intact copyright header, that you must add the license file to the code that you're producing? Is there any discussions in the community how to deal with that?

Allal: If you use MIT or Apache, that doesn't mean that you shouldn't attribute the author if the model generates exactly the same code. That's something we thought about. We implemented code attribution tools. For example, if you use the VS Code extension with StarCoder, we have membership tests that when the model generates something, we go and check the dataset. If we find that it is an exact copy of something that was in the dataset, we have like a red alert. If you click on the link, you can find exactly which repository that was from, and then you can attribute the author. That tries to help a little bit with the code attribution side. We tried to develop some tools to help the users who are using the models to attribute the authors if the model ever generates exact copies of what was in this training data.

Participant 2: In training the model, do you care about the quality of the code, say like if the code is neat enough and the algorithm's computational complexity.

Allal: Yes, so for when we were training on the dataset, I think you can implement a lot of filters to only keep the code files that you believe have a higher quality. We did a lot of experiments and a lot of ablations to try to find the adequate filters. We found that you can filter but you shouldn't filter too aggressively. Otherwise, you will end up losing a lot of data and you wouldn't have enough to train your model. For example, we have a paper called, SantaCoder, don't reach for the stars! That's because we tried to use stars as a filtering approach. For example, we kept only files that had more than five stars. This significantly reduced the size of the dataset, and we ended up with a model that was like the worst of all the models we trained. Some filters might seem good to you, but maybe for the model training, it's not worth using them. For example, filtering on stars is not a good idea. Although we might be thinking that repositories with a lot of stars probably have a higher quality. Now we use some basic filters to remove autogenerated files. For example, we count the average line length, and if we find that is very long, for some specific programming languages, maybe it is autogenerated. We also try to remove data, for example, CSV and JSON. We only keep a smaller subset, because that is not code, just data. We have other filters like that, but they are not too aggressive.

Participant 3: My question is regarding obviously, the legal and ethical considerations around scraping data from GitHub. For example, I know in the past there, there's been legal cases. One notable one would be the legal case between LinkedIn and hiQ, where hiQ was scraping data from LinkedIn using fake accounts. I think LinkedIn took an injunction against them. What considerations have been made in that sense? Is there some agreement between yourself and GitHub, on the scraping of the data, or did you just go ahead and scrape the data?

Allal: There's no agreement between us and GitHub, because we only use the repositories that are public. Then once we scrape them, we filtered out, for example, licenses that don't allow commercial use or GPL code. Then we trained on a subset of the data that we scraped. We have also this opt-out tool, so we can consider, for example, giving users a choice to decide whether they want their code to be included in the pretrainings or not. In this opt-out tool, users can fill a request. If they see that some of their GitHub repositories are in the dataset, and they want them removed, they can ask to remove that. You also have the code attribution tools. These are the three things we try to consider. We don't have an agreement with GitHub because the code was public.

Participant 4: Do we have any mitigation against the risk of moral collapse, where you put generated code back into the stack and where you train the new models on AI generated code?

Allal: You're talking about using synthetic data, what the model generated to train again on it.

Participant 4: Yes. Then, with a couple of cycles, the model collapses and then just spit out random things.

Allal: Yes. I think maybe if we're talking about the same study, in that study, they used a very small model, I think it was OPT 125 million, and they found that it collapses. Now a lot of people are training on synthetic data, a little bit models that are larger, and we haven't seen that that happens. The worst thing that happens is that the model's performance just does not improve on the task you're training on. Maybe in the future, when we have a lot of these cycles, that will happen. Now that hasn't really happened.

Participant 5: If we come to a future where most of the code is AI generated, and this is for sure being used to train more of these models, what do you think will happen?

Allal: I think we're going to have to wait and see what happens. If we see like the code that is being generated, a lot of it actually might be higher quality than if you take the average of what is on GitHub. It's not just garbage code. I think maybe something we can do is to have things to detect and distinguish AI generated code from code that was used by humans. For example, I saw a very recent study where they tried to see which papers have been written by ChatGPT, and they used the word delve. They found that after ChatGPT was released, the number of papers that used the word delve just increased. ChatGPT has the tendency to generate that. That's a silly filter, but it still detects what was generated by a model. Maybe for code, if we can have some watermarking approach or something that will help us distinguish AI generated from non-AI generated, that would be very helpful. I think that's still under exploration and we haven't made much progress on that. It's hard to predict what will happen when we will have more AI generated code than human generated code.

Participant 6: Looking forward to the future, can you talk about how we might get to the point where you could generate multifile or multi-project size completions with one of these tools?

Allal: That's something everyone's looking forward to. Because, for example, when GitHub Copilot was introduced, I personally was using it just for documentation in tests, it was not very good at generating new code, coming up with new ideas. Now it has improved a lot, so you can use it for other tasks that are more complex. I think we're still a little bit far from having something like an agent that can work on your whole code base, and just, you give it a question. You don't need to say where exactly it has to change the code. I think that is the end goal, to have something that can take in multiple files and do the required changes. One step forward to that is training on repository context, for example, which we did for StarCoder2 compared to StarCoder1. If we see things, for example, like Devin, if you saw, it is this AI agent that can change anything. I think that means that we're making progress towards that. We just need to find the right recipe for that. I think that means that you need to have good instruction-tuning data that not only asks the model instructions about specific code snippets, but something about the code base. You also should start from a base model that is already very good, because that will significantly impact the performance of what you get. I think that means having better base models and better datasets that you train on, and also better evaluations to be able to track if you're moving forward or not with regards to that aspect.

Participant 7: Training Your LLM seems to be quite an involved process. Do you think there'll be a time where when I'm in an enterprise and I've got a couple of Git repos that I can just say, "Model, please ingest these models." Then the model trains itself on my code without me going through this whole song and dance here?

Allal: I think there are a lot of startups who are trying to do that, because that's a use case, I don't want to be involved in all of this training, just give me a service that I can also train in. I think there's one probably called Poolside. They're trying to have that. Because, basically now we have all the components. There's the model, there's the fine-tuning scripts, and there's your data, and we know which filters will help. We just need someone who will combine these components. Maybe that's a product that someone can develop, and you can use.

Participant 8: I think you touched in your talk, but I didn't get the guidance, with respect to fine-tuning, you talked about your libraries, that you did fine-tuning for the libraries. Can you give some guidance of how many million lines of code would you need to fine-tune? Let's say I have my own frameworks and libraries internally in my enterprise and I want to fine-tune a model, roughly, what are we talking about? Do we need a million, 5 million, 10 million, 50 million lines of code?

Allal: I can give you a number, but that number does not really apply. It depends on your use case, which model you started from, what you're trying to adapt it. I think usually for fine-tuning on new programming languages, and not a new one, one that the model has already seen, but you want to improve the performance a little bit, maybe tens of gigabytes of data should do the trick. If you want to just do like instruction-tuning or get the model to follow instructions, you need much less. I think people are just using tens of thousands of samples. Maybe a few gigabytes of data could work if it is just taking a model that has already seen that language and you just want to adopt it. It should work. For example, if you go to the blog post that I mentioned, I think they had even less because they just took the Hugging Face libraries. I don't know how many we have, but they just compiled them into datasets, and they found that it works to train on that. One other thing that made it that you need less data is these new fine-tuning techniques that don't take the whole model and change the whole way it's doing the fine-tuning, they only change specific weights and specific layers that they add. This means that you don't need a lot of data for the adaptation. You will need to try and see at what threshold your model gets better.

Participant 9: I have a question around if you've done any work on deprecation of libraries or stuff like CVE. I've got loads of examples, where ChatGPT will generate a package JSON or dependency file within the library or API call that it's referenced and the next bit of code is deprecated, or doesn't exist in the v2 version? Have you done any work around how to avoid that happening in code generation models?

Allal: I haven't personally worked on that. That is also hallucination for LLMs, when you would ask the model, who's the president of the U.S.? They would say Obama, because this is something that was deprecated. This is old. One of the techniques to solve that is to use, for example, retrieval, RAG, to retrieve information that is more recent and add it in the context. In your case, you could try to retrieve the documentation for the new version of the library and add it to the context to tell the model, these are the new things that are changed, or you can also add the logs of the changes. That is something that is worth exploring. Otherwise, you would need to fine-tune on recent code. These are two things, either retrieval or new fine-tuning.

Participant 10: We've had a truly explosive growth in AI over the last few years, do you think we're close to plateauing or is this just the start?

Allal: I think we've made a lot of progress. We still have a long way to go, first by matching the performance of the closed models, but then trying to address some issues that even these closed models still have, which are the hallucinations, the biases, and also all the data governance issues. We think we're off to a good start, but we still have a lot of work. Like you asked, we want code models that can act as agents on programming code bases and not just complete a simple function. We still have a long way to go, but I think we're off to a good path.

Nardon: Do you have an experience on using code generated to improve an existing code base, and how this is working well or not with code LLMs?

Allal: I think, yes, this can happen implicitly when you have some engineers who are using the model, and then they use the code that it generated for what they will push, for example, to production. That's an implicit improvement. Otherwise, you can maybe try to use the chat models, and maybe integrate parts of your code base and try to get feedback on these different components. It's still, I think, maybe early stage because we still don't have models that can act on the code base level, but they still all act on the file level. Maybe it's early stage still.


See more presentations with transcripts


Recorded at:

May 23, 2024