Open Machine Learning: ML Trends in Open Science and Open Source



Omar Sanseviero discusses the trends in the ML ecosystem for Open Science and Open Source, the power of creating interactive demos using Open Source libraries and BigScience.


Omar Sanseviero is a Machine Learning engineer with 7 years of experience. Currently, he works at Hugging Face in the Open Source team democratizing the usage of Machine Learning. Previously, Omar worked as a Software Engineer at Google in the teams of Assistant and TensorFlow Graphics. Omar is passionate about education and co-founded AI Learners.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Sanseviero: My name is Omar Sanseviero. I will talk about open machine learning. I would love to begin talking a bit about the history, at least over the last few years. Three years ago, OpenAI released GPT-2. GPT-2 is a large language model. It has 1.5 billion parameters. It was trained on 8 million webpages, and GPT-2 is trained with a single simple objective, predict which is the next word. OpenAI iterated on GPT, they launched GPT-3. GPT-3 was extremely impressive, it was able to generate entire website layouts. Also, it was able to generate code based on network strength. If you've played with Copilot, this is the technology backing it up. Although the results are quite impressive, OpenAI decided not to release the model due to concerns about malicious applications of the technology. This is understandable, but at the same time, this caused a big issue in the science world. You cannot reproduce the results. This lack of reproducibility makes it extremely challenging to scientifically, rigorously evaluate the model.

Two years ago, a community called EleutherAI, it's a collective of volunteers, they decided to build an open source GPT. You would expect that'd be a group of academics, but this was actually a Discord server, and the initiative started as a joke. Then it turned out to be serious. EleutherAI has been able to do very interesting open science and open source work. They have released large GPT models such as GPT-J and GPT-NeoX. They have been able to open source huge datasets, such as The Pile, and there are projects around art generation and much more. It became a huge collaboration. They have published over 10 papers. It has been a quite exciting alternative for open science, compared to GPT-2 or GPT-3.

BERT 101 - State of the Art Model Explained

Another very famous language model is BERT. BERT was created and publicly released by a research lab at Google. Training a large BERT model from scratch was a quite compute-intensive task and required lots of data and lots of compute resources, that is, money. Although Google released the code for training the model, very few institutions had the resources to train it from scratch. Something interesting happened then. BERT was considered state of the art, so many research groups were training a BERT model to have a baseline to compare their new research proposals against. Each time a research lab wanted one of these baselines, they were training a BERT model from scratch. This led to a lot of issues. For example, people sometimes used different languages, different scripts, different hyperparameters, so it was hard to reproduce the exact same results as the ones published in the original paper. This also has an ecological and financial impact. If you're training the model again and again, you are, in the end, having a direct impact. This also means that only the research labs that have significant compute resources can actually create these baselines, so universities that don't have the compute resources to train a BERT model from scratch couldn't do it.


There was an open source alternative created after some time. At Hugging Face, we created a library called Transformers with the idea of easily sharing pre-trained Transformer models. Transformer is the architecture of BERT. GPT-2 is also a type of transformer model. The idea with the library is quite powerful. It enables people to load models from a shared place with only a line of code, hence solving the reproducibility issue and making transformers available in the hands of everyone. The project was launched almost three years ago, it was launched in 2018. The open source adoption has been extremely wild. Right now, it has almost 60,000 stars in GitHub, thousands of organizations and users are using transformers to share their work.
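As a rough sketch of what loading a shared model looks like, assuming the `transformers` library is installed and the machine can reach the Hub: the `pipeline` helper takes a task name and a model id and downloads the checkpoint on first use. The wrapper name `load_text_generator` is made up for this illustration, and `gpt2` is picked only because it is a small, publicly shared checkpoint.

```python
def load_text_generator(model_id: str = "gpt2"):
    """Return a text-generation pipeline for a model shared on the Hub.

    Hypothetical helper for illustration only. The import sits inside the
    function so the file can be read without transformers installed.
    """
    from transformers import pipeline

    # This is the "one line": task name plus the id of a shared model.
    return pipeline("text-generation", model=model_id)
```

Calling `load_text_generator()("The moon's orbit around Earth has", max_new_tokens=20)` should return a list of dictionaries, each carrying a `generated_text` field. Swapping the model id is enough to try a different shared checkpoint for the same task.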

News Summarizer

Let's say you want to build a news summarizer. The summarization task is quite simple. The idea is that you provide a large text, think of a news article, a blog post, any large text, and the machine learning model will make a summary of it, that is, a shorter version of the original text. Before going and training a large model for this, the first thing you will want to do is collect and clean a dataset. You might have to spend hundreds of thousands of dollars to collect, clean, correct, and explore the dataset. Then after that you can use the dataset to train the model. How do you train the model? You need to pick the right one. There might be many different architectures. You might need different teams trying different explorations. You might want to train a model from scratch. Even then, once you get the model, the metrics might not be good enough and you might need to go and collect more data.

Is There an Easier Way?

So, can we do better? Is there an alternative out there in the open? What I just described in this news summarization use case can be much simpler. If you've used GitHub before, your approach when doing software-related projects is most likely to explore and find repositories with tools that solve some of the problems you have. You don't want to reinvent the wheel. You will collaborate with your team by having a shared repository. Once you're ready, once you have a project that might depend on other open source projects, you might open source your work for the whole ecosystem to use. We can do exactly the same with machine learning. Why do we have hundreds of people training the same model for the same thing, again and again? How many summarization models might there be out there? What if we could instead have a central platform through which people could collaborate, explore, and discover models and datasets? That's where Hugging Face comes in. The Hugging Face Hub is a free open source platform with over 30,000 models and 3,000 datasets in which people can easily collaborate in their machine learning workflows.

Transfer Learning

Before going a bit more into this, something that is quite important in the transformers world is transfer learning. Transfer learning has actually been quite impactful in the last few years, not just for NLP, but also for computer vision and other domains. What's the whole idea in transfer learning? In classic supervised learning, which is probably the most common kind of machine learning, you will grab a dataset with labeled samples, and you will train a model to generate the predictions. Let's say now that you have a second domain. You will again train a model from scratch to solve this particular task. With transfer learning, the idea is to extract knowledge from a source task, from a domain A, and then be able to apply it to a different task. For example, let's say that you train a large language model that will get a statistical understanding of a language, let's say Spanish. Then you can fine-tune this model to solve a particular Spanish-related task, let's say summarization. Since you are using transfer learning, you will transfer the knowledge of this base, original, large model. Then, since you already have this pre-trained model, you don't need that much data, and you don't need as many compute resources. It's actually much cheaper, and takes a few minutes or a few hours at most. It can be quite powerful. This is not just for NLP, this is also for computer vision. This is what people do with Convolutional Neural Networks. You can also do it for speech, biochemistry, time series. Reinforcement learning is starting to use Decision Transformers as well. We don't know yet, but there will likely be other domains in which people can apply transfer learning and transformers.
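To make the idea concrete, here is a deliberately tiny sketch, not a transformer: a one-dimensional linear model is "pre-trained" on a source task with plenty of data, then fine-tuned on a related target task with only five samples. Everything here (the tasks, the helper names `mse` and `train`, the learning rate, the step counts) is an assumption chosen for illustration. The point is only that starting from pre-trained weights gets further than starting from scratch with the same small training budget.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on the given samples."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(w, b, xs, ys, steps, lr=0.1):
    """Plain gradient descent on mean squared error, starting from (w, b)."""
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Source task (domain A): plenty of data for the rule y = 3x + 1.
source_xs = [i / 50 - 1 for i in range(101)]
source_ys = [3 * x + 1 for x in source_xs]

# Target task: a related rule, y = 3x + 2, with only five labeled samples.
target_xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
target_ys = [3 * x + 2 for x in target_xs]

# "Pre-training": a long, expensive run on the source task.
pre_w, pre_b = train(0.0, 0.0, source_xs, source_ys, steps=200)

# Same small budget on the target task, with two different starting points.
ft_w, ft_b = train(pre_w, pre_b, target_xs, target_ys, steps=10)  # fine-tune
sc_w, sc_b = train(0.0, 0.0, target_xs, target_ys, steps=10)      # scratch

fine_tuned_loss = mse(ft_w, ft_b, target_xs, target_ys)
scratch_loss = mse(sc_w, sc_b, target_xs, target_ys)
```

After ten steps each, `fine_tuned_loss` ends up well below `scratch_loss`, because the slope learned on the source task carries over and only the offset has to be adjusted.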

Brief Tour Through Open Source Family

We'll take a brief look at the open source ecosystem. You can observe some of the GitHub repositories that we have. We'll dive into some of these. Across our libraries we have over 1,000 different contributors, without whom the libraries wouldn't exist. That's very important, because this is not a couple of libraries maintained by a single company, but by a huge community of different contributors. Across all of these libraries, we have over 60,000 stars in GitHub. Let's go with the first important pillar of this platform, of the hub. One key aspect is sharing models. The model hub is a free open source central repository for models across different frameworks. People can share transformer-based models here, but also models from other NLP frameworks such as AllenNLP, Flair, or spaCy, from vision frameworks such as timm (PyTorch Image Models), or even from other fields such as speech, with ESPnet, SpeechBrain, and PyAnnote. All of these are different open source libraries. The hub actually just passed a milestone: there are 30,000 models shared by the community. The hub serves as a central sharing point for all these models. It shines in multiple aspects. It enables reproducibility. With just a single line of code, people can go and load one of these models. By sharing checkpoints, users can simply load the model, test, and evaluate it. This enables people to actually reproduce the results shared in their papers. Then, with the concept of transfer learning or fine-tuning, people can easily pick a model from one domain for a certain task, and adapt it for their own particular needs. This is not just for English, there are over 180 languages on the hub.

A Small Tour Through the Features

Let me just quickly show you a bit of this. This is going to the Hugging Face website. As you can see, on the left, people can filter for things such as question answering, summarization, Fill-Mask, which are different tasks that machine learning models can solve. If you click the +14, you can find the different domains for which you can apply these models. You can search for image classification models or reinforcement learning models or translation. Then, you can also filter based on the library, based on the dataset, based on the language. Here at the right, you can see the different models out there. I see this one, GPT-2. Actually, OpenAI released a small version of GPT-2. It was not a huge model, but a smaller version of it. What you are seeing here is the model repository. The concept is quite similar to GitHub, you have Git based repositories that have version control. If you click files and versions here, you can see all the files in this repository. You can click here the history, and you will be able to explore all the history of this model repository.

Then, going back to the model card. What's the model card? A model card is an excellent way in which people can document what the model does. It has things such as the model description, which are the intended use cases and limitations. It might have snippets of how to use this model. Even a section on limitations and biases. These models were trained many times with lots of data from the web, such as unfiltered content from Reddit. The models can have, of course, many biases and generate very concerning quirks. You need to be very careful in how you use these models. It has model data on how the model was trained, which data was used for training, which are the evaluation results. All of this is actually quite exciting. This is a way in which anyone can go here, read this, understand the model. They can go and see the actual files for this. Or if they want, they can click here, using transformers, and with these three lines of code, they can load this model in the transformers library.

This was the first thing mentioned here, model card, version control. The third one is interactive widgets. You can actually play with the model directly in the browser. It may take a few seconds the first time because it's loading the model. You can actually play with the model directly in the browser without having to run a single line of code. This is quite powerful, especially if you are exploring which model to use. It has things such as TensorBoard hosting, so if people share TensorBoard logs, they can freely host their TensorBoard logs directly in the browser. This is a nice way to track how different models have been. The last example I would like to show is the evaluation results. Evaluation results allow people to self-report metrics on a specific dataset. Thanks to a nice integration with Papers With Code, you can actually compare all of the different models for a given dataset for a certain task. This is quite powerful. People can even do things such as report CO2 emissions. You can use this with any library that you love. If you're a TensorFlow user, if you are a PyTorch user, if you are using a higher-level library or a more specialized library, you can use this as well.

The usual workflow is that you can go, you can find an open source model in the model hub, which was published either by a practitioner or by a researcher. You can find and pick this pre-trained model. You can then go and take a dataset. You can go and search for a dataset that will be interesting for your use case. Then you can do the fine-tuning of your model with this particular dataset. The open source philosophy is that once you are done with this, you will open source your model for anyone else to use. This will contribute back to the community that helped you by having pre-trained models that you were able to use.

Datasets Hub

We were talking about datasets. The next question is, how do you get these datasets? Actually, the hub also contains datasets. It is a catalog of datasets shared by the community, which you can also load with a single line of code. It contains well-known datasets such as SQuAD or the GLUE benchmark, as well as many other datasets for classification, question answering, summarization, language modeling, and more. As of now, the Datasets Hub has over 3,000 datasets. As with the model hub, each dataset comes with versioning using Git, so you get reproducibility, you can load different versions of the dataset, and more. There are two key components to the Datasets Hub. There is the platform, the web UI, which has the largest hub of ready-to-use datasets for hundreds of languages and different tasks. There's also an open source Python library called datasets, which allows you to load a dataset with a single line of code. There are no RAM limitations, so if there is a huge dataset, which is terabytes, there is something called streaming that allows people to just load the data as needed. It allows for very fast iteration and querying. If you don't want to share your dataset with the whole world, you might want to host your datasets through Amazon Web Services, or other places. This is also enabled, so people can also use the datasets library with datasets that are not hosted on the hub. That's totally fine.
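A minimal sketch of the two loading modes described here, assuming the `datasets` library is installed and the Hub is reachable. The helper names (`take`, `load_squad_train`, `first_streamed_examples`) are hypothetical, and `squad` is used as the dataset id only because SQuAD is mentioned above.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def load_squad_train():
    """Download and load the SQuAD training split in one call."""
    from datasets import load_dataset  # local import: needs `datasets`

    return load_dataset("squad", split="train")


def first_streamed_examples(name, n=3, split="train"):
    """Stream a dataset and read only its first n examples.

    With streaming=True the data is fetched lazily while iterating,
    so the full dataset never has to fit on disk or in RAM.
    """
    from datasets import load_dataset

    stream = load_dataset(name, split=split, streaming=True)
    return take(stream, n)
```

For example, `first_streamed_examples("squad", n=2)` would pull just two records instead of the whole training set, which is the pattern that matters for datasets in the terabyte range.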


I would like to talk about the third and last pillar, which is probably the most interesting. By using open source libraries such as Streamlit or Gradio, people can easily create interactive demos that allow anyone to try out a machine learning model. This increases the reproducibility of research. By having them shared on the web, anyone can just go and try them out. Here you can see a small gif of a demo in which the user selects an image and clicks submit. Finally, they get a classification result that says it's an alligator. How do you build such a demo? How do you share your model as a web app? Until recently, people had to learn some new tools. Many ML engineers actually don't have that much experience in web technologies, for example, and you probably cannot expect people who are focused on writing papers to know about Flask, Docker, JavaScript, CSS, and more. This can actually discourage people who don't know how to use these technologies. For each part of this stack, there are different tools. This makes things even more complicated. After you train a model, you need to deploy it with tools such as Docker. Then, you might want to store incoming samples with SQL. Then you will want to build an interactive user interface with frontend technologies such as HTML, JavaScript, CSS, or Svelte, or any other advanced frontend framework.

Gradio is an open source Python library that allows you to handle each of these steps in a single place, in a single pipeline, which makes things extremely easy. You might expect the code to be super-complicated, but what you see at the left has just three key parts. The first one is this classify_skin_image, that's a prediction or classification function. The idea is that this function will take an input and return an output. Then you will have the types of the input and the types of the output. In this example, the input is an image and the output is a label. Then once you have this, you can launch this interface from anywhere you want. This can be from the command line. This can be from a Jupyter Notebook. This can be from a Colab notebook. The result can be seen here at the top right. You have a web interface that anyone can go and try out. You put in an image, you click submit, and then you get the response saying whether the skin photo looks benign or cancerous.

Let's build a Gradio demo ourselves, actually. This is really quite simple. I will do this right now live. Here, let's see these three parts. First, we will install Gradio. That's already done. Then you will have a prediction function. In this case, I'm not loading any ML model. I'm just doing a Hello World or Hello name. The idea is the same. You have a function that takes an input and returns an output. Then the last part is the interface, which does all the UI. Let's see first, what is that? My name is Omar. I will click Submit, and I get Hello Omar. What's the code for this? First, you have a Gradio interface that's actually quite simple. You have this prediction function, in this case, it's greet. This prediction function takes an input and returns an output. The input in this case is a single text. The output is, again, another text. That's it. Once you click launch, you get this. You can take screenshots, if you want. You can clear. You can do many things. You can adapt the UI if you want something fancier. You can have multiple inputs, multiple outputs. You can run many models in parallel. There are actually quite a few alternatives you can try within this. This is the simplest example. What is very nice is that in the end, the syntax is extremely simple. All the complexity lies in this prediction function. Let's say that you have a TensorFlow model. What you can do is just load the model beforehand. You can load a model here. Then you can just write the inference within the prediction function. This shows you how generic or how flexible this is. No matter if you have a spaCy model, a TensorFlow model, a PyTorch model, no matter what you are using in the Python world, you can very easily use a prediction function to create a nice web interface.
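The demo described above can be sketched roughly as follows, assuming Gradio is installed (`pip install gradio`). The function name `greet` comes from the talk; the `build_demo` wrapper is added here for illustration and is not something shown on screen.

```python
def greet(name: str) -> str:
    """Prediction function: takes one text input, returns one text output."""
    return "Hello " + name


def build_demo():
    """Wrap the prediction function in a Gradio interface.

    The import is local so that greet can be used and tested even when
    Gradio is not installed.
    """
    import gradio as gr

    # "text" is Gradio's shortcut name for a plain textbox component.
    return gr.Interface(fn=greet, inputs="text", outputs="text")
```

Running `build_demo().launch()` from a script or a notebook cell starts the web interface. To serve a real model instead, the usual pattern is to load it once at module level and call it inside the prediction function, so only the body of `greet` changes.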

Let's see a very quick second example. This example is actually using GPT-J, which is this model from EleutherAI. The interface is a bit nicer. As you can see, it has a title. At the bottom, you can see a couple of examples. For example, I can click here. "The moon's orbit around Earth has," you can click Submit, and then at the right I will get which is the output of this interface. The moon's orbit around Earth has one of several shapes. Then after this, all of the text is completely generated by the GPT-J model. In terms of how this works, this specific syntax is using an already existing model in the hub. It's using something which is called the inference API. It is using that under the hood. The idea is pretty much the same, you have an input, you have an output, and we'll have a prediction function that will take care of everything in between.

Let me show you a couple of demos to give you an idea of what things you can do, because this is not limited to text. You can do things such as face stylization with JoJoGAN. JoJoGAN does one-shot face stylization. In this case, you're seeing a picture of me being stylized as a Disney character. This second one is a demo for the use case we were talking about earlier. Let's say that you want to do a summarization of a news article. What you can see is that at the left, the user can paste a URL for a news article. At the right, the output will be a short summary of the article from the link. This third one is a voice authentication model and demo from Microsoft. This is using WavLM plus X-Vectors. The idea here is quite fun. You upload two different audio files. Then the model will determine if this is the same person speaking in the two audio files, or if these are different persons. This is quite nice. This shows you that these demos are not limited to NLP or to computer vision but also extend to things such as audio. This is the last example I would like to show. This is a demo of GLIDE from OpenAI. This allows people to write some text, and the image on the right is fully generated by this model called GLIDE.

The Spaces Platform

We've been able to create demos. We've been able to run them in Colab. All of this is quite nice. We built a demo. Now the question is, how do you share this with the community? For this, there is the third pillar in Hugging Face, which is called Spaces. Spaces is a central platform through which people can upload their own machine learning demos and share them with the rest of the ecosystem. It was launched in October of last year. Again, all of this is open source and free. This has, by today, over 2000 different spaces created and shared by the community. Let's say that we want to create our own space, so we can just create a new space. You can put any name you want. You can use Streamlit, which is another Python library for creating demos. You can use Gradio, or if you want to go a bit hardcore, and you do want to use HTML and JavaScript or something such as TensorFlow.js, you can go with Static, which is just custom HTML and JavaScript. You can make this public or private.

Again, all of this is based in Git, so you can actually go and do git clone in your computer. Here are some instructions of how you can do it. If you don't want to download them, you can do everything directly in the web browser as well. You can just create a file, which is something like Let's do the same demo we had over here. Let me copy these lines, so the prediction function. Let me import Gradio. Finally, let's copy the interface. As you can see, the code is exactly the same. I have not changed anything. You have the import gradio. You have the prediction function. Finally, you have the interface which specifies the prediction function, the input, and the output. This takes a couple of seconds the first time because this is loading. At the end, you have this web interface that you can try it out. My name is Omar. We have the exact same result as we were having before. You can do things, again, since this is based in Git, you have version control. You can see the logs if you want to see the logs. You can make it public or private. You can even automatically embed this in other websites, if that's something that you would be interested in. As you can see, this was quite easy to do. You can even specify your own requirements, so if you're using other third-party dependencies, you can use that without any issue.

The Machine Learning Turning Point

We have seen that these demos have become quite popular, and have enabled people who are not necessarily from the machine learning world to access and play with these models. For example, a couple of months ago, there was a new space for AnimeGANv2, and everyone on social media, on TikTok, in different places, was trying out AnimeGANv2. This was quite interesting because it increased the audience of the machine learning model. We think we are at a turning point in the usage of machine learning. Until now, people who wanted to try out a model were normally ML engineers, ML researchers, or software engineers, and if people were sharing models, or let's say scripts, people had to go to GitHub, people had to open Colab, people had to run actual code. This was already a huge barrier to entry for other people. Now with Spaces and with these nice demos, anyone who can use a graphical user interface or a browser can access these demos. This is quite powerful. If you have a model that might have some biases, this open, transparent, public open source approach will enable the community and a more diverse set of users to try out your model. This will allow you to find biases in your model, issues, and other things.

From a research perspective, this will also make your work public to everyone to try out. If you compare just a paper, or a paper with a nice, open interactive demo that anyone can use, the interactive demo with the paper will be potentially much more impactful and will help people understand what the model is doing. This will also help avoid cherry picking, increase reproducibility, and much more. Through these three pillars, models, datasets, and demos or Spaces, this enables anyone in the community to share their work, collaborate with others, work in a team, just with the same Git-based workflows that people are used to for software engineering with things such as GitLab, or GitHub. People can do this for machine learning as well and share their models with the ecosystem. There won't be a single company that will solve NLP or computer vision, it will be a community thing. These are things in which everyone needs to get involved. I hope you were able to learn a bit more about the open source alternatives and tools that you can use right now to share the amazing work that probably many of you are doing.

Questions and Answers

Breviu: Why is OpenML so important? You've talked a little bit about the history and a lot of the challenges and things that Hugging Face has really made so easy that used to be so hard. I'm curious like more of your perspective on OpenML and where this is going.

Sanseviero: Our goal as a company is to really enable collaborative machine learning. This is very ingrained in the vision of the founders and other team members. That sounds extremely broad, and maybe a bit ambiguous. What we want to do is to really enable anyone to work and collaborate in machine learning. What that means is that if you want to use models, you can easily share your models with others and others can easily use your models, which is extremely important for transfer learning. This also means that if you want to access datasets, they should be public, and you can use them. This is not a rule that everything should be open source, or not all datasets should be open source. There's, of course, medical datasets that have lots of very sensitive data. The idea is that, for example, there's lots of research that might publish datasets, or very interesting models, but it's very hard to actually use this data, or very hard to replicate the results of these papers. What we're doing is really making this extremely accessible to everyone. Then we launched Spaces which also makes it extremely easy to access demos and to show demos to people that might not have a technical background even, or a ML background.

Now we are starting to reach not just a technical audience, but also to help non-technical people understand what machine learning is. For example, something that we did a couple of months ago is this thing which is called Task Pages. Task Pages are a very nice way to understand different tasks. This is more for non-ML people. For example, what is question answering? You can go here and you'll get a small schema of what question answering is. I think we are starting to increase our scope. Originally, we were mostly focusing on NLP. Now we are really focusing on the whole machine learning ecosystem, which is huge. Now we are doing computer vision, and audio, and speech, and reinforcement learning, and other things.

Breviu: I love that you have that task, because most models are like they're trying to answer a question. Every model ends with a question, a pointed question. The democratization of it where people can come in, and you don't really have to know what you're doing, you can do this applied AI thing where you can grab these models and just start using them. It's so powerful. You talked about all those issues, and how you've gotten over those with this tooling.

Are the biases declared upfront as a model is checked in? How do you make sure they are ethically sound?

Sanseviero: There is this tool called model cards. Model cards might show sections such as intended uses and limitations, how to use, and they might have a few other things. These files, under the hood, are really just Markdown files. This is just plain Markdown, it's not any fancy text. Right now, we don't have any automatic validation, but we do have a few projects ongoing on how to do automatic bias evaluations. We are starting to work on those to analyze which biases models might have. Right now, what we do is encourage everyone to add these kinds of sections, such as intended uses and limitations. Meg Mitchell is one of the most prominent researchers in the ethics for machine learning field. She joined the company a couple of months ago. We are doing both technical work and also research in the more high-level aspects of this. Again, we don't have any automatic thing. Right now, anyone can share ML models. What usually happens is that people just use models that are properly documented. Similar to code, you don't go and take any Python library, but instead you really look for documented code and documented models. Most researchers nowadays, when they publish any model, are already creating nice model cards that document the biases that the models might have.

Breviu: I like that there was a talk about ethics too, because I think so frequently in machine learning ethics were an afterthought. If you think about all the issues that have happened, it's like, look what we can do, but we didn't really think about, who could this hurt? What do we need to consider there with our models? There's actually a talk on this track where we're going to talk about different tooling, where the person developing the tool, or the model can actually look into the ethics of the model and take on that ownership because you are posting it. It's almost like a GitHub, like you're posting your model, your code, and everything out there for other people to use. When you do that on GitHub with regular code versus this is machine learning, you do that same thing. You put a ReadMe in. You have information on how to use it, and those types of things. It's your code. It's on your user name too. I feel like putting that ownership on the people that are submitting things is part of it too. You guys are creating this platform for awesomeness. I think it's great to think that you guys are hiring ethics people to make sure that you can automate some of that, but then it's also on, I feel like the individual as well.

Sanseviero: You could say something a bit similar about GitHub. It's not like all the repositories uploaded by users are being analyzed to see whether they meet ethical criteria. We are working very practically right now on having very clear guidelines on what things can be shared and how people should document these biases, because most of the time models will have biases. It's extremely important to have very clear documentation to really help users easily create better documentation for their own models.

Breviu: Can you compare OpenAI with OpenML?

Sanseviero: OpenAI is a company. It's a very large company that has been creating some very powerful models. GPT-3, for example, the inferencing that GPT-3 has. Very popular in the last couple of weeks is DALL·E, the DALL·E 2 model that has been creating these extremely amazing images. This is OpenAI, which is a company that was founded a couple of years ago. When they began, they did open source some of their models. The original GPT was open source, but after that, with GPT-2, they decided not to release the models due to concerns about ethical issues. That was the public explanation given by OpenAI. What they have been doing, though, is making all of the research public at least, so that means that the papers needed to reproduce the results are out there. This has many issues, because it means that the research is out there, but you cannot reproduce it unless you're willing to spend a couple of million dollars. At least from the research perspective, it's an improvement over previous practice, in which no one was open sourcing anything at all, nor making their research public.

There are, though, efforts from the community to replicate things. For example, with GPT, I mentioned GPT-J, and now there's OPT, which was launched by Meta and is a very large GPT-style model as well. Similarly with DALL·E, there's an ongoing effort called DALL·E mini by a community member called Boris. It's pretty much the same concept: you write an input, for example, astronaut riding a horse on the moon, and these images are generated by the model. This is a smaller version of DALL·E. Right now, there are two very large efforts in the community to create a DALL·E model on the same scale as the DALL·E from OpenAI. I think OpenAI is a company, while OpenML is really a culture of making things open. It's making open source code, open source models, and really a collaborative mindset.

Breviu: I think that the DALL·E model is super cool. Then using large language models in the computer vision space, I also think is really interesting because CNNs or Convolutional Neural Networks have really been the main way to do computer vision for so long. Then, now the way that they're starting to use transformer models to solve computer vision problems, or create computer images is just so cool.

Sanseviero: We're starting to see transformers being used. There are Vision Transformers for computer vision, so we see transformer models being used for image classification and object detection, but also for reinforcement learning, where there are Decision Transformers. There are also models for audio and speech. Wav2Vec 2.0, for example, is a very good model for automatic speech recognition, which means taking audio and outputting the text. They are even being applied to tabular data and time series. There are very interesting efforts in applying these large transformer models in other domains.

Breviu: If we use pre-trained models, can we get good results in a specific domain with more limited data, like fine-tuning a large model with a small or medium sized dataset in a specific domain?

Sanseviero: Yes, that's exactly what transfer learning tries to achieve. Actually, transformer models are extremely good at this, at least in NLP. If you pick BERT, for example, which is a very large language model, and then you fine-tune it with your own dataset, even if you just have a couple of hundred samples, which is really very little data, you can already get some very cool results. Even if you don't have any data at all, there is a task called zero-shot classification. You can see here on the right there is this small widget. What zero-shot classification does is that you input a text, for example, "I have a problem with my iPhone that needs to be resolved as soon as possible." This model was not trained to classify based on a fixed set of labels such as urgent, not urgent, phone, tablet, computer. The model was never trained with these labels. What you can do with zero-shot classification models is specify the labels that you want at inference time. You never trained the model directly with these labels, but it is able to use them at inference. That's really just an extra piece of information, but what you said is perfectly correct. That's what most people are using transformers for nowadays. They pick these large transformer models and fine-tune them with their own small or medium dataset for their own specific domain. For example, now I want to use BERT for research, or now I want to use this even for code. There are some large pre-trained models for code that you can fine-tune for your own specific programming language, for example.
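
The zero-shot idea of supplying labels at inference time can be sketched with the `transformers` library's `zero-shot-classification` pipeline. This is only an illustration under assumptions: the talk does not say which model backs the widget, so the `facebook/bart-large-mnli` checkpoint is a stand-in, and `classify_zero_shot` and `CANDIDATE_LABELS` are hypothetical names. Running it requires `transformers` to be installed and downloads the model on first use.

```python
from typing import Dict, List

# Labels are supplied at inference time; the model was never trained on them.
CANDIDATE_LABELS: List[str] = ["urgent", "not urgent", "phone", "tablet", "computer"]


def classify_zero_shot(text: str, labels: List[str]) -> Dict[str, float]:
    """Score `text` against arbitrary labels and return {label: score}."""
    # Imported lazily so this module loads even where transformers is absent.
    from transformers import pipeline

    # Assumed checkpoint; other NLI-based zero-shot models could be swapped in.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    # multi_label=True scores each label independently, since urgency and
    # device type are separate questions and should not compete.
    result = classifier(text, candidate_labels=labels, multi_label=True)
    return dict(zip(result["labels"], result["scores"]))
```

Calling `classify_zero_shot("I have a problem with my iPhone that needs to be resolved as soon as possible", CANDIDATE_LABELS)` would return one score per label. Changing the label list needs no retraining, which is the point made above.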

Breviu: That's actually the demo that I did. I used Hugging Face Hub, and I used a model that Microsoft open sourced, a distilled BERT model. It was much smaller, but it was task agnostic. Then I used a dataset to train it into a sentiment analysis model. You can take this distilled, much smaller model and fine-tune it with that dataset. Then I used quantization to make it even smaller, and then deployed it to the web. I actually used the Hugging Face tooling that they have to do exactly that. It's so powerful. Because before, you'd have to start with nothing, and it would take so much longer, and you'd have to have so much more tooling. Then, also, the model that was trained is now open source. You could just go grab that model and use it again. I think having these model zoos, where you can just go and grab pre-trained models, makes the applied AI space grow even more too. Because you can just go in there, use what's there, and apply it to your problem without having to understand all the different operators and machine learning frameworks.
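
To give a rough sense of why quantization makes a model smaller, here is a minimal sketch of symmetric 8-bit quantization on a handful of toy weights. It is one common scheme chosen for illustration, not necessarily what the demo's tooling does, and the function names and weight values are made up. Each 32-bit float is replaced by an 8-bit integer plus one shared scale, at the cost of a small rounding error.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where the scale would be 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 8-bit integers."""
    return [v * scale for v in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # toy values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The rounding error per weight is at most half of `scale`, which is why accuracy often degrades only slightly while storage per weight drops from 32 bits to 8.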

Is it better to fine-tune the word embedding, or just take the general word embedding and train just the transformer?

Sanseviero: When you train the whole transformer, you are also training the embeddings. You just input the text or the tokens, and the model will then learn the embeddings too. It will learn how to map the text to an embedding space. This actually opens up some very interesting applications. There's a very famous library called SentenceTransformers, which allows us to create embeddings not just for words or tokens, but for whole sentences. What this means is that you can now map one full sentence, or paragraph, or document, or essay directly into an embedding of 256 or 512 numbers. This is really a vector, and now you can do comparisons between vectors. This is extremely powerful if you want to do semantic search, for example, and people are using this in production systems right now. Or if you want to do retrieval, or clustering, or paraphrase mining, or image search, this kind of stuff, it's something very interesting. In any case, the transformer models are the ones that learn the embeddings while you're training them. It's not something that you need to pre-train beforehand.
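
The comparison between vectors mentioned above is commonly done with cosine similarity. The sketch below ranks a tiny corpus against a query that way. The 4-dimensional vectors are hand-written toy values standing in for real sentence embeddings (a real system would obtain them from a model and they would have hundreds of dimensions), and the helper names are hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: List[float], corpus: Dict[str, List[float]]) -> List[str]:
    """Return corpus texts ordered from most to least similar to the query."""
    return sorted(
        corpus,
        key=lambda text: cosine_similarity(query, corpus[text]),
        reverse=True,
    )


# Toy vectors standing in for embeddings produced by a sentence model.
corpus = {
    "How do I reset my phone?": [0.9, 0.1, 0.0, 0.2],
    "Best pasta recipes": [0.0, 0.8, 0.6, 0.1],
    "My phone will not turn on": [0.8, 0.0, 0.1, 0.4],
}
query = [0.85, 0.05, 0.05, 0.3]  # pretend embedding of a phone-related question
ranking = rank_by_similarity(query, corpus)
print(ranking)
```

With these toy values the two phone-related sentences rank above the recipe sentence, which is the behavior semantic search relies on.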



Mar 10, 2023