Live from the venue of the QCon London conference, we are talking with Omar Sanseviero. He talks about his work at Hugging Face, the limitations and biases of machine learning models, and the carbon emitted when training large-scale machine learning models. We also discuss democratizing good ML practices: making it easy for everyone to train and use large-scale ML systems.
Key Takeaways
- The Hugging Face open source ecosystem for transformer models already hosts over 30,000 models and supports TensorFlow, PyTorch, spaCy, and other machine learning libraries.
- Model cards describe how a model was trained, how many parameters it has, and what kind of limitations it may have.
- Models can have inherent limitations and biases depending on what data the model is trained on. Working together with a diverse group of people can give a more diverse dataset and can result in a better model.
- Training and using large machine learning models can lead to significant carbon emissions. Although only the very largest models have a big impact, it is important to be aware of the effect machine learning has on the environment.
- Hugging Face helps democratize good ML practices. This means making it easy for everyone to train models, explore models, and use them for a wide variety of applications.
Transcript
Roland Meertens: Welcome to the InfoQ podcast. My name is Roland Meertens and today I am interviewing Omar Sanseviero. He is a machine learning engineer at Hugging Face, and he works on the intersection of open source and product. He gave a talk at QCon London, and I am actually speaking to him in person at the venue of the QCon London Conference. In this interview, I will talk with him about Hugging Face and the work he's doing there. But, we will also touch on some societal topics such as the impact deep learning models have on carbon emissions and what it means to democratize good machine learning. As I said before, Omar presented at QCon London, and he's also presenting at QCon Plus. If you registered for any of these conferences, you can view his video there. We will also release his presentation on the InfoQ website. Keep an eye on that. Now, on to the interview itself.
Welcome, Omar, to QCon London. How are you enjoying the conference so far?
Omar Sanseviero: Hey, it's going great. I'm enjoying it quite a bit. The talks have been great. It has been an amazing experience.
Roland Meertens: You gave a talk yesterday. Can you talk a bit about who you are? What your talk was about?
Omar Sanseviero: Sure, of course. I'm Omar. I'm an ML engineer at Hugging Face, which is an open source startup that is trying to democratize good machine learning. The talk yesterday was about open machine learning, which is about how we can do machine learning in a way that everyone can use these models. The research that is being done should be accessible to everyone, and very transparent. It's about transparently communicating what these models do, how they work, and what the impacts of these models could be. We talked quite a bit about this, about good practices, about finding models created by the community, about collaboration. It was a fun experience.
Roland Meertens: What kind of models are there right now on Hugging Face? I think you guys are mostly specialized in transformers, or do you have all kinds of different models?
Omar Sanseviero: That's a great question. Maybe for a bit of context on how we began a couple of years ago: we have a platform called the Hub. The Hub allows anyone to upload any ML model or dataset or demo. By now we have over 30,000 models, and the growth has been a bit wild. About a year ago, we had 6,000 models on the Hub.
Roland Meertens: 6,000 models? That's already big.
Omar Sanseviero: In one year it has multiplied by five. It's quite exciting. Originally, a couple of years ago, three years ago, we launched an open source library in Python called Transformers, which allows anyone to load a transformer model from the Hub, which is the central platform, with a single line of code. And then they were able to use these models. That's how we began. Transformers have their origins, their roots, in NLP. They come from the natural language processing domain. But they have expanded quite a bit to other domains.
Now they are being used in computer vision, in audio or speech, and also in reinforcement learning and time series. So it has expanded quite a bit. Most of the models on the Hub right now, maybe about 20,000, two thirds, are transformer models. But now we're also supporting other open source libraries in the ecosystem. So you have your scikit-learn, TensorFlow, PyTorch, spaCy, or really any other machine learning library. You can just upload your model to the Hub and use it as a central platform with which you can collaborate with the community.
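As a rough illustration of loading a model from the Hub with a single line of code, as described above, a minimal sketch with the Transformers library might look like this; the model ID bert-base-uncased is just an example, and any Hub model ID could be used:

```python
# Minimal sketch: load a pretrained model and its tokenizer from the Hub.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Run a quick forward pass to get hidden states for one sentence.
inputs = tokenizer("Hello from QCon London!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, number of tokens, hidden size)
```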
Roland Meertens: If I download a model from Hugging Face, does this then work with all these different backends? Or, do you have specific models for specific machine learning libraries?
Omar Sanseviero: We have different models for different libraries. What we have on the platform is tags, so you can easily discover and filter models depending on your use cases or interests. Some libraries have interoperability. For example, there's a library called Sentence Transformers, and another called Adapter Transformers. These two libraries are able to load models that are meant for Transformers. But scikit-learn, of course, cannot simply load a transformer model. So in the case of certain libraries, you can. What is really important to us is that people can easily find the best model for their use cases. We are working on a good search experience, so people can find, among these 30,000 models, which are the best models for their use cases.
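For readers who want to explore this programmatically, a sketch of searching the Hub with the huggingface_hub client library could look like the following; the search term and result limit are arbitrary example values:

```python
# Sketch: list a few models on the Hub that match a search term.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(search="sentiment", limit=5):
    # Each result carries metadata such as the model ID and its tags.
    print(model.modelId, model.tags)
```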
Roland Meertens: What are popular models on there right now?
Omar Sanseviero: That's an excellent question. I think what is quite powerful about transformers is that you can easily pick a model from the Hub and then do something called fine-tuning or transfer learning. And then you can just modify this model for your own use case. Modify is probably a bit of an oversimplification, but you can really pick a model from the Hub and, with just a bit of your own data, you can modify or fine-tune, or just tune this model a bit more, for your own specific use case.
What many people do is go and pick a very classic model such as BERT. BERT is a very common language model, and then they can just do something like text classification. But there are many, many popular models. In the audio domain, there's a model called Wav2Vec2. This is from Facebook, and it is being used quite a bit for audio. You can see it the same way as BERT: BERT is a very common pretrained model in NLP, and Wav2Vec2 is a very common pretrained model for audio.
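A minimal sketch of the fine-tuning workflow mentioned here, using the Trainer API, might look like this; the IMDB dataset, the BERT checkpoint, and the hyperparameters are illustrative choices rather than recommendations:

```python
# Sketch: fine-tune a pretrained BERT checkpoint for text classification.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb", split="train[:2000]")  # small slice for illustration
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="bert-imdb",
                         num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=dataset).train()
```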
Roland Meertens: So you get the right features, or at least a nicely calibrated feature space, once you enter an audio file in this case.
Omar Sanseviero: Yeah, exactly. What Wav2Vec2 does in this case is that once you pass an audio file, it will map this to an embedding, or hidden states, that you can then use for your end task. You can use Wav2Vec2 to do automatic speech recognition, where you pass an audio file and then you get the transcription of that speech. You can see it as speech to text. But there is also work to do TTS, which is text to speech: you generate synthetic audio. And there are many, many other applications, for example audio classification. You pass an audio file, and then you might classify which language is being spoken, or which speaker is speaking, things like this.
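As a sketch of the speech-to-text use case, a pretrained Wav2Vec2 checkpoint can be run through the pipeline API; the checkpoint name and the audio file path are example values:

```python
# Sketch: transcribe a local audio file with a pretrained Wav2Vec2 model.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")
result = asr("sample.wav")  # path to an English audio clip
print(result["text"])
```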
Roland Meertens: Also, I think I saw speaker verification being used, where you have two vectors that you can compare in the latent space to see if it's a similar person, or actually the same person in this case.
Omar Sanseviero: Exactly. Something that is very fun about all of this is that since we're mapping to these feature spaces, which are, in the end, just vectors or arrays of numbers, you can then start to compare vectors of different modalities. You can start to compare images, or the embedding of an image, with the embedding of a text or a paragraph or a document or an audio file. And then you can start to get cross-modality search. You can start to have search systems in which you might pass an image and retrieve all of the text relevant to this image, or the other way around. You search for funny cats, and then you will get funny cat images.
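A sketch of this kind of cross-modal comparison, using a CLIP checkpoint to score one image against a couple of captions; the model ID, image file, and captions are illustrative assumptions:

```python
# Sketch: compare one image against two captions in a shared embedding space.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # any local image
texts = ["a photo of a funny cat", "a photo of a serious dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))  # similarity of the image to each caption
```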
Roland Meertens: I really liked that about the CLIP model from OpenAI: you can easily build a natural language image search system, where you just align the embeddings of your text to the embedding space of the images. That's really great. But you said what you're working on is this open source ecosystem. How does this work? Are many people publishing models, or are many people collaborating to make models better?
Omar Sanseviero: There are many, many, many efforts going on. I've been working at Hugging Face for about a year right now, and we have many efforts. In the team, we really don't train models from scratch. We do have a science team, and we are training some models and doing some open science work. We do have research that is trying to push the state of the art, with a very big effort called Big Science. We can talk a bit more about that. But from the open source team, what we are doing is building collaborations with the open source ecosystem. This goes in many, many ways. One is that as soon as there is new research related to transformers, we are adding these new architectures to the Transformers library, so people can easily train these models and use them.
Because what happens very often is that researchers share a GitHub repository. If they do, the code might not be super clean or might not follow good practices, or it's not intended for production. And that's totally okay; that's how academic research works. But once you want to start using these models in a real-world setting, you really need to have good practices. We have collaborations with many open source libraries, as we were talking about before. We have collaborations with about 15 or 20 different open source teams from other companies, or just from volunteers that are working external to Hugging Face.
That is quite exciting. And then there are people that are just training models and pushing the state of the art. For example, there is a very cool group called Somos NLP, which is focusing on NLP for Spanish. This is a group of volunteers that just like to push the boundaries of the state of the art in NLP in Spanish. They are training very interesting models, organizing events, creating content in Spanish. It has been quite exciting. From the Hugging Face side, we try to support them in whatever they need. So, those are a few examples.
Roland Meertens: So it's basically really the state of the art models, the best of the best published by the big companies, but then replicated in a more reproducible way by volunteers and other people.
Omar Sanseviero: In the team, we do have people that are implementing these models in the Transformers library; it's not that it's done just by people outside us. But yeah, we are trying to replicate these model implementations in a good, standardized way, with common interfaces that work across similar models for the same task. If you want to use a model for text classification and you want to try different architectures, you can very, very, very easily change from one model to the other. Even if they are different models, you can do that. That's quite powerful.
Roland Meertens: How does it work in terms of the data? Do you also host the data sets or do people download their own data sets or are all the data sets by OpenAI, for example, public?
Omar Sanseviero: Yeah, that's a great question. For many of these very, very famous large models, such as GPT-3, neither the model nor the datasets are public. The papers, for example, do explain quite a bit about how the data was gathered. There is information about it, but the data itself was not published. Most of the time, this is data scraped from the web. They go through Reddit, for example, but it's not just Reddit; it's many, many websites.
They go to many websites, Wikipedia, Reddit, and other platforms, and they scrape data from the web and use it to train these models, which will then gain a statistical understanding of the language. From the Hugging Face side, what we do is we have something called Datasets. This is a big part of the Hub. We have an open source library in Python called Datasets, and what it allows is to load datasets with a single line of code, like load from the Hub. You can load the dataset with the split you want. It's quite nice because you can work with huge datasets of many terabytes. You can easily do filtering or mapping or any operations you want to do on it.
And then what we have is the Datasets Hub. The Datasets Hub is in the Hub. It's a platform that allows people to share their datasets. What most people do is share a script to load data from other sources. What usually happens is that people might have their dataset stored in Google Cloud or Amazon Web Services, and then they will have a script on the Hub, the Hugging Face Hub, that will load this dataset and do the splitting, for example, or any post-processing that is required. They also write the documentation for their dataset on the Hub.
But some people also share their datasets directly on the Hub. That means if their dataset is CSV files or JSON or whatever, they will upload those files to the Hub, and they can also use that if they want. So you can store the dataset on the Hub. What is very nice is that, on the Hub, you have a dataset preview feature, which allows you to explore and look at the dataset directly in the browser. Without having to run any code or download the dataset, you can just explore this dataset a bit, directly in the web browser.
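A minimal sketch of the single-line loading plus the filtering and mapping described here, using the Datasets library; the dataset name and the transformations are arbitrary examples:

```python
# Sketch: load a dataset from the Hub, then filter and transform it.
from datasets import load_dataset

ds = load_dataset("imdb", split="train")
short = ds.filter(lambda example: len(example["text"]) < 500)        # keep short reviews
lowercased = short.map(lambda example: {"text": example["text"].lower()})
print(len(ds), len(short), lowercased[0]["text"][:80])
```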
Roland Meertens: I think the other thing you mentioned in your talk were these model cards, where if you have a model, you have some basic information about how it's trained, what it's trained on, and what the model does. Can you maybe tell a bit about that?
Omar Sanseviero: Yeah, sure. The concept will sound quite obvious to anyone that has done a lot of software development: you document your code, so you should also document your models. But in practice, almost nobody was doing this. Model cards are a concept that was coined at Google, by Margaret Mitchell specifically and her research group, three years ago in 2019. The idea here is that you have an artifact that documents your model. Of course, how you document a model is completely different from how you document your software. But usually you have things such as: what this model is, what it is supposed to do, how it was trained, which data was used to train it, how it was evaluated, and maybe a snippet of code showing how to run inference, because if you don't do that, people won't be able to use your model.
And then you might have things such as limitations and biases. Most, if not all, models have some biases, and we can talk a bit more about this in a second. It's very important to document that this model might not work for a certain minority group, for example. If not, what can happen is that a company or anyone can go and pick this model and try to use it for their use case without knowing about these limitations, and that will just lead to many different issues.
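As a rough sketch of what such a model card might contain, the snippet below writes a README with the Hub's YAML metadata block followed by the kind of sections listed above; every value here is an illustrative placeholder, not a real model's figures:

```python
# Sketch: compose a minimal model card (metadata header plus documentation sections).
model_card = """---
language: en
license: apache-2.0
tags:
  - text-classification
datasets:
  - imdb
co2_eq_emissions:
  emissions: 1200  # placeholder, in grams of CO2-eq
---

# Example fine-tuned model

## Intended uses and limitations
Trained on English movie reviews; may underperform on other domains, dialects, or groups.

## Training data and procedure
Fine-tuned from bert-base-uncased on the IMDB dataset for one epoch.

## How to use
A short inference snippet belongs here so others can actually run the model.
"""

with open("README.md", "w") as f:
    f.write(model_card)
```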
Roland Meertens: How much do people know about what the limitations of a model are? Because personally, when I try something, I frequently discover the weirdest things not working, which you don't really expect. Do people really also explore what the model captures in the latent space?
Omar Sanseviero: This is something that some people are doing, but not all people. What we're trying to do, and we have a research group completely focused on this, is create tools to allow people to easily explore and really dive into the model and how it works, both models and datasets. We are trying to create tools to enable researchers, but also practitioners, to really explore how their model works.
There are many explainability libraries that try to show, for example, the attention mechanism and how the model is working and which words are the important ones for the model. But in practice, this is not super actionable, let's say. We're trying to create tools that give actionable feedback that will help people easily know what is going on. But what people do is provide a high-level description of which biases the model might have. This is something that does exist already.
Roland Meertens: Yeah. What do you mean in terms of biases? Like specific things that it's good at recognizing, or that it's better at recognizing than others?
Omar Sanseviero: This depends quite a bit on the use case, but let me give you an example with GPT. GPT is a very famous model that predicts the next word. If you say, I don't know, "Today is a very X day," it might say sunny or nice, something like that. It will predict the next word. For example, if you say "This man is working as a...", the next word that the model predicts might be architect, doctor, something like that. But with a woman, if you say "This woman is working as a...", one of the predictions might be prostitute, for example.
How is this happening? Well, the model was trained with lots of data scraped from the web. What is quite powerful about these huge transformer models is that you don't require any labeled data. That means you don't need to spend a lot of money labeling data; you can just grab as much data as you want, pass it to the model, and the model will, in an unsupervised way, learn the patterns in the data. And then, as we were talking about before, people sometimes just go grab this model and do fine-tuning or transfer learning to train a new model. What is interesting is that these biases are also carried over with the transfer learning. The fine-tuned model will also have the biases from the source or general model. Even if you didn't have any data that would be concerning in the fine-tuning stage, if the first model had biases, the fine-tuned model will also have biases. This is a very clear example of what biases mean here.
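One rough way to see this kind of bias for yourself is to sample completions from a small generative model for two otherwise identical prompts; GPT-2 here is an illustrative stand-in, and the generated text will vary between runs:

```python
# Sketch: compare sampled completions for two prompts that differ only by gender.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(0)

for prompt in ["The man worked as a", "The woman worked as a"]:
    outputs = generator(prompt, max_length=15, num_return_sequences=3, do_sample=True)
    completions = [o["generated_text"][len(prompt):].strip() for o in outputs]
    print(prompt, "->", completions)
```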
Roland Meertens: I guess that in this case, if you mostly download Reddit data, for example, that's not a very accurate representation of the world. Maybe even, in the worst case, it's not an accurate representation of what you actually try to predict.
Omar Sanseviero: We have an effort called Big Science. Big Science started as an effort that Hugging Face proposed with the French government. The French government is giving us a couple of millions' worth of compute power from their supercomputer in Paris. The idea is to train a very, very, very, very large language model, but in a fully transparent, open science and open source way. This is something that Hugging Face is pushing, but it's not an effort from Hugging Face alone. Right now, it's a collaboration between 700 researchers from different universities. Anyone that is a researcher in academia or industry can just join the effort. It's quite interesting because what we're trying to do is put all the questions up front. Which biases might this model have? How are we collecting this data? We had a lot, a lot, a lot of work.
This effort started in March of last year. We just started to train the model last month, like three weeks ago. We spent eight, nine months just exploring the data, defining which datasets we want to have, which languages we want to work with, cleaning all of this data, and exploring it to make sure that there are no biases. It was quite interesting, and it was an effort between many people from different backgrounds. For us, that's quite important. When you want to train these very large, powerful models, if you just go and get all the data from the web, you will have many issues. It's very hard to find these issues if you don't really work with the data.
Roland Meertens: Are there any practical things you can do as a machine learning practitioner when setting up a data set? Are there specific tips and tricks to try to prevent these biases? Or, is it just something you have to keep in mind when deploying your model that they are there?
Omar Sanseviero: I think you need to do both. The second thing you mentioned, about just keeping it in mind, is something that you will always need to do in machine learning. It's very, very easy for your model to have biases. Especially when your models are impacting humans' lives, you need to be especially careful about how you're using the model. Of course, if your model is just doing something very silly, or it's a feature that won't have a direct impact on a human life, maybe that's okay. But when you're impacting, I don't know, something related to healthcare, something related to loans, then you need to be extremely careful. And in terms of training a model where you want to be careful about this, you need to really dive into your data before going and training a model.
And it can be quite easy, really. You can just go and look at your data. Many people train models without really spending even one hour just looking at their data. Even unrelated to biases, this is a good practice in machine learning in general. Because, for example, if you have different ways in which people are doing spaces or formatting in your data, you might end up with terrible results just because you didn't really look into your data. Once you have your model trained, there are many things you can do. For example, and maybe I will go a bit too deep on this, you can do things such as looking at data where the model prediction and the label diverge quite a bit.
So, if your model is predicting 0.99 but the label was zero, that means the model is terribly wrong or the label is wrong. That's an easy way to get some data that might be interesting to look at. If you're using a threshold, you might want to look at data near the threshold, because those are samples where the model might be struggling a bit. You should probably take a look, because there might be an issue with the model or an issue with the data.
And then you can do other things. For example, you can look at data based on how much loss it causes. Once you train the model, you can pass a lot of data through it and see what the loss was for each of these samples. The ones with the largest loss are the ones making the biggest change in the model. You probably want to take a look at those samples as well. Maybe those are too practical, but yeah, those are nice ways, and it helps not just for biases, but to find interesting data in your dataset.
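A small sketch of these two inspection ideas, finding samples where the prediction and label diverge and samples with the largest loss; the probabilities and labels below are made-up placeholders standing in for your own model's outputs:

```python
# Sketch: rank samples by prediction/label divergence and by per-sample loss.
import numpy as np

probs = np.array([0.99, 0.10, 0.55, 0.02])  # model's predicted probability of class 1
labels = np.array([0, 0, 1, 1])             # ground-truth labels

divergence = np.abs(probs - labels)  # large values: confident but wrong
per_sample_loss = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

for name, scores in [("divergence", divergence), ("loss", per_sample_loss)]:
    worst = np.argsort(scores)[::-1][:2]  # indices of the two most suspicious samples
    print(f"{name}: inspect samples {worst.tolist()}")
```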
Roland Meertens: I think at least it will probably give you a hint on what extra data to collect and how to do it. Another thing I sometimes read is that if you know where your bias might be coming from, you can give it as an explicit feature to your model. If you know that maybe people from certain areas are actively being biased against on school results, just adding the area as a feature might actually teach the model that this feature is unreliable, so that you get a different representation for these people.
Omar Sanseviero: I think in terms of biases there are many things that you can do, but I don't think there is a single solution for every problem. Right now we are talking a bit about text data. But once we talk about image data, for example, then it gets complicated as well. Or if you're talking about tabular data, so tables for example, it happens a lot that you have some columns or some features that are strongly correlated with, I don't know, gender for example, or race. Even if you're not feeding the model gender explicitly, say a column where women are zero, you might have a feature that is strongly correlated with it. I think really exploring your data is extremely important if you want to avoid biases, which, if you are deploying models in production, is something that you should do to be responsible.
Roland Meertens: Oh, indeed. The other thing I think you mentioned in your talk was carbon emissions. If you're training massive models, carbon emissions can be quite large; your compute power can be quite large. Can you say anything about that?
Omar Sanseviero: Sure. I think this is linked to the model cards as well. When you share models and you publish the model card, which documents how this model works, you should also document the ecological implications of your model. Right now, on the Hub, in the model card you can specify in the metadata, which is some information about the model, what the CO2 emissions are. At least for Transformers, we have added callbacks and functions that allow people to very easily track the CO2 emissions while the model is being trained.
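One open source tool that can be used for this kind of tracking is codecarbon; a minimal sketch of wrapping a training run with it might look like the following, with the training loop itself left as a placeholder:

```python
# Sketch: estimate the CO2 emitted by a block of training code.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()
tracker.start()
# ... training loop goes here ...
emissions_kg = tracker.stop()  # estimated kilograms of CO2-eq
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2-eq")
```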
Now, maybe to take a step back and talk a bit about what this ecological impact means, or why we are talking about CO2 emissions now: these models are very, very, very large. They are really huge. We are talking about hundreds of billions of parameters in some of these cases. There are models that are just two or three billion parameters, or some millions. But training these models can have quite a bit of impact. Right now, some of the largest models might have the impact of maybe an intercontinental flight, or maybe the lifespan of a car in the US, for example. But this is for very large models.
These very large models are not something that everyone is training. Most people are training very small models that really don't have as much ecological impact. But it's very important to document this. There is something called the scaling laws, which show that as you grow the model, the better the performance will be. The trend that has been happening for the last four years is that people are just going with larger and larger and larger models. Just two days ago, today is Wednesday, so on Monday, Google released a model with 500 billion parameters, I think. And of course, if we keep seeing this trend, and I think we will keep seeing this trend, the models will just consume more CO2 just to train them. And then, when you put them in production, they will also consume more compared to a scikit-learn model, which is super small.
Roland Meertens: Like GPT-3 models: you can't run them on your laptop. You can't even run them on a single machine. You really have to have a data center and an API to interact with them.
Omar Sanseviero: Yeah, exactly. Exactly. Even if some of these models are open source, as you're saying, you need many GPUs just to load the model. So it's very important to document them. There were some other talks yesterday about this; it's quite interesting, and I would suggest you look at that talk. But I think something important is that this doesn't mean we should not train these models. It means we should make sure that we are training models that solve relevant issues, and that we are careful about reusing work that already exists. This concept of fine-tuning or using pretrained models really saves, not only money, but also ecological impact. You can just go fine-tune a model, and it will take you a couple of minutes or hours on a GPU and that's it. But if you want to train one of these large models, it will take a lot of money, hundreds of thousands of dollars, and many GPUs.
I think it's quite important to try to leverage the existing work. If you train models, even if they are small, starting to document the CO2 emissions is, I think, a good practice that we'll start to see more and more, because models are becoming larger. When we are comparing or benchmarking models, measuring not just the accuracy or the precision or just a single number, but also looking at metrics such as how fast this model is, whether it can be deployed, or what the CO2 emissions to train this model were, are things that are important to compare between different models. If you have a model that is extremely small and has 2% less precision, which can happen when you do distillation, maybe you can take that hit. Maybe that 2% less precision is completely okay if it means you will be saving on CO2 emissions, and a lot of money in compute as well.
Roland Meertens: Yeah, that's indeed a good point. The last thing you mentioned in your talk was version control for deep learning models. Would that be a solution, where you always start from existing features and only have to train tiny bits?
Omar Sanseviero: Having version control is a good practice. What I was doing in my talk was comparing how we do software to how we do machine learning. When you're working on a software project, you won't just go and release your large software project, or your, I don't know, open source library, in one go. You will work and iterate on it with an iterative approach. We try to do the same with these models. For example, you might train a model for 100 epochs or 100 iterations. So, instead of just pushing the last model, you might want to save every 10 epochs, for example. This is a good practice because maybe your last model was not the best one, but a previous one was. Having this is a nice way to be able to just go to different versions and really compare them.
What some people have been doing already for many, many years is that they might have some callback, for example in Keras; I think many people do this, and you just save the best model. But then we're going again to this world in which you just look at one or two numbers. You just look at the best model by precision. But there are really many numbers that you should look at when you want to push a model to production. So, having version control is important. If you are training a model for a few weeks, for example, you might want to just keep pushing models to the Hub, and then you might want to keep updating the metrics. At some point, you will be able to compare all these metrics across time. That's quite useful. And then, if your model just breaks, you can also just go back to a previous version.
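A sketch of this checkpointing pattern with Keras callbacks, keeping one checkpoint per epoch alongside the single best model; the file paths and the monitored metric are illustrative choices:

```python
# Sketch: save a checkpoint every epoch and also keep the best model by validation loss.
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("checkpoints/epoch-{epoch:03d}.h5",
                                       save_freq="epoch"),
    tf.keras.callbacks.ModelCheckpoint("checkpoints/best.h5",
                                       monitor="val_loss",
                                       save_best_only=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=callbacks)
```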
Roland Meertens: For example, if you would normally train your model for a hundred epochs and you update your dataset, you might want to continue training from the 50th epoch. So your latent features are already relatively good, but you still have enough room to learn, right?
Omar Sanseviero: Yeah, yeah, exactly.
Roland Meertens: Do you also see that on Hugging Face, where people keep building on top of existing models to build more advanced models? Or is this something which is relatively new?
Omar Sanseviero: It is happening. This is happening. I would not say it's happening that much, but I'm seeing this more and more. I think this trend will just keep going up.
Roland Meertens: Yeah. And then, maybe as a last thing, you talked a bit about democratizing good machine learning. Can you say something about what you mean by that?
Omar Sanseviero: Sure. Yeah. Really, I think this is a very complex topic and we could talk a few hours about this. But something important to keep in mind with machine learning is that, as a software engineer, it's very easy to see machine learning as: okay, we're solving this problem and that's it. But at the end, we are impacting humans' lives. It's very important to always keep that in mind. When you are doing machine learning, there are many aspects to this. For example, from the research perspective, when you are doing machine learning research and you are proposing or publishing these huge models that nobody can train except these large institutions, then you're making science or research not accessible to everyone. If other researchers want to be able to evaluate your model, they cannot do that.
And when we are talking about these huge models, think of people that are working on languages with low digital resources. For example, if there are not that many datasets in Spanish, people won't be able to go and train a model. These are problems that you always need to keep in mind. And not just that: you need to involve the people that are being impacted. Something I talked quite a bit about in my talk was demos, building demos so that people can just interact and play with the models. Something we are seeing is that people are creating demos for their models and then sharing those demos with people that will be impacted by the systems. This helps because, most likely, in a software-related team, if you are a team of 10 people, nine of those will be men, or eight of those, statistically speaking, and most of them will be white men.
And then you will have a set of people with a very narrow view of the world. When you do these kinds of demos and you share them with people from different contexts, different backgrounds, even if they are not ML engineers, you can start to get interesting perspectives. A couple of weeks ago, someone released an image captioning model and it was working extremely well for white people, but for Black people it was producing labels such as, I think, gorilla. It was similar to what happened with Google about five years ago. It's very important to really make sure that people from different contexts play with these models, just because of this. Involving the people that will be impacted, or any stakeholder, is extremely important. That's a huge part of good machine learning, of democratizing good machine learning.
We try to involve people both from the side of: okay, we have a model, so let's validate it and see how it works with different groups. That's something that we do. We try to do these open science efforts to really enable people to do things. And then we are doing many different things to enable people from underrepresented groups to be able to use machine learning. We had a very fun effort with Yandex in July of last year; it was published at NeurIPS this year. The idea was to do distributed computing.
These people didn't have GPUs. What we wanted to do was train a very large model for their local language. What we did was something you could see as distributed computing: each of these people provided a bit of compute power from their own computer. And then we had 40 or 50 people with their computers open all the time. At the end, we had an open source model owned by them, and it was published in a paper that was also owned by them. It's in an organization on Hugging Face that they own.
Everyone collaborated on this. Even if they were not writing any code or anything, they were just running some cells of code, they were contributing to this larger model. This is an interesting effort because it could enable people from languages or from universities or communities that really don't have access to GPUs or any fancy compute power to train these larger models. And this was state of the art. So, this was quite interesting.
Roland Meertens: I think especially the underlying data can be so important. For GPT-3, you can see the underlying language distribution. And when you're playing around with the model, it works perfectly for English. But once you get into more obscure languages like Swedish or Dutch, the amount of content on the internet in that language is maybe a bit lower, but also harder to find. There are not that many Dutch-speaking subreddits. Everything is influenced by that, but it's good to try to involve those people. I think you mentioned the Spanish-speaking community, who know exactly where the Spanish language resources are.
Omar Sanseviero: Exactly. Yeah, exactly. Even then, Spanish is quite interesting because Spanish, I think, is the second most spoken language; it has 500 million speakers, or probably much more than that. But it's interesting because the Spanish from Spain will be completely different from the Spanish from Argentina or the Spanish from Mexico. You need to involve people not only from Spain, but also people from different countries in Latin America, to be able to get a very distributed, very fair dataset distribution. So you get data from these different sources, from these different Spanish variations.
If you just train Spanish models on the Spanish from Spain, then it will be a biased model, because it will just be focusing on the Spanish from Spain. That's a key topic as well: really involving the people to get these datasets. So, if you just speak English, or you just live in an English-speaking country, and you want to train a large model for Italian without involving anyone that speaks Italian, you probably cannot do that. You will get some numbers that will show you that the model is doing great. But if you don't have any idea of how this model is really working, that's not great.
Roland Meertens: Also just getting a feeling for what the relevant resources are. Every time I'm learning a new language, I'm trying to figure out where the content is, where the good quality content is. As a non-native speaker, you have no clue about all the cultural aspects or the social aspects of a new country. Involving a lot of people is so important.
Omar Sanseviero: For example, with Big Science, which is this effort with 700 researchers I was talking about, we are targeting seven languages, I think. I'm not that involved in the project, but among those languages, I think there was French, German, and Italian. Something that they did was a dataset catalog. Even before creating any dataset, they just explored many, many, many potential sources of data.
For these sources, they then analyzed a bit the licensing and data quality, and whether there were tools to access this data from the publisher of the data. This was a distributed effort, because of course for each of these languages there was a group of native speakers and people living in those countries exploring which are the best sources for this data. And then we collected it and did all the next relevant parts. But, as you were saying, really finding the right sources for these languages needs to involve people from these languages.
Roland Meertens: So, if there's people listening who want to get involved in this, is there any way to get started?
Omar Sanseviero: You mean to Big Science?
Roland Meertens: Yeah. Or any other projects which you can recommend people to check out, to help democratize these systems?
Omar Sanseviero: For the Big Science effort, you can just search for Big Science. For that project in particular, you need to be a researcher; you need to be either at a company or at a university. But really, there are many, many ways in which you can contribute. As a first step, I always recommend just going to the Hub and seeing which datasets are available, because there are 3,000 datasets, and seeing which models are available. And many, many times, if you are already doing machine learning, you can potentially already open source your work.
Of course, it really varies with what you are doing within the machine learning ecosystem. But there are many things that you can do to contribute back to the community. Everything we do is open source. If you're interested in Datasets or in Transformers, you can go to these libraries and contribute a new model architecture or a new dataset, if that's something you would like. I think there are many potential venues in which you can contribute to the ecosystem.
Roland Meertens: Sounds good. Thank you very much for this talk. Enjoy QCon London. Have a good day.
Omar Sanseviero: Thank you. Thank you for the invitation.
Roland Meertens: That was the interview with Omar. Thanks again, Omar, for participating. I hope you enjoyed listening to this in-person interview recorded at QCon London and thank you very much for listening to the InfoQ podcast.