Modern NLP for Pre-Modern Practitioners

Summary

Joel Grus discusses the latest NLP research breakthroughs and how to incorporate NLP concepts and models into a project.

Bio

Joel Grus is a research engineer at the Allen Institute for Artificial Intelligence in Seattle, where he works on AllenNLP, a deep learning framework for AI researchers. He wrote the beloved O'Reilly book Data Science from Scratch and the beloved blog post "Fizz Buzz in Tensorflow". In his spare time he does stand-up comedy and makes livecoding videos. Oh, and he doesn't like notebooks.

About the conference

QCon.ai is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.

Transcript

Grus: I'm Joel and I'm going to talk to you about "Modern NLP for Pre-Modern Practitioners." You know what they say: "True self-control is waiting until the movie starts to eat your popcorn." But they don't say it to computers, because the computer thinks that sentence is about something called true self-control that's waiting. What is it waiting for? It's waiting for the movie to start to eat your popcorn. When it parses the sentence, it says, "Oh, this is a sentence about a movie that's starting to eat your popcorn." It does that because natural language understanding is hard.

Part of what I'm going to talk to you about today is why it's hard and how it's hard and how we're getting better at it. We are getting better at it with an asterisk, and that asterisk is we're getting better at it as measured by performance on tasks we're getting better at. The tasks we're getting better at also have an asterisk, and that asterisk is that these are tasks that would be easy if we were good at natural language understanding. Therefore, if we're getting good at them, maybe that means we're getting good at natural language understanding. Correlation and causation.

I'm Joel, I work at the Allen Institute for Artificial Intelligence on a team called AllenNLP, where I build deep learning tools for NLP researchers. I'm not a researcher myself, I'm a software developer, but I work closely with researchers. I know what kind of problems they think about, I know what kind of solutions they're working on, and that's what I'm going to tell you about today. I did write a book, "Data Science from Scratch." It's not as timeless as I wanted it to be, because there's a second edition that's coming out in a couple of weeks. The first one wasn't timeless enough, and this is what O'Reilly books are going to look like starting around now, so it's a new visual.

The book really represents a deep belief I have, which is that I can't tell you how to use a tool if I think you don't understand what that tool is doing. That's kind of the premise of the book, to some degree, that's the premise of this talk, too. By the end of the talk, I want to tell you how to take the ideas that modern NLP researchers are thinking about, and how to put them into practice in your work, but my own personal code of honor does not allow me to tell you how to put those things into practice if I don't think you understand what they're actually doing.

A lot of this talk is going to be about what are the researchers thinking about and what are their tools and models actually doing. Then at the end, we'll get into how can you actually take that and put it into practice. Hopefully, that's a good flow. Finally, I co-host a podcast called "Adversarial Learning." It's not about adversarial learning, it's about learning things and it's kind of adversarial, but it's good and you should check it out.

I mentioned I work on this library called AllenNLP. It's primarily aimed at NLP researchers. It's designed to help them run experiments quickly, but one thing that might be of interest to non-researchers is that we have a pretty comprehensive demo of a lot of state-of-the-art NLP models. You can go and play with them, put in your own inputs, check the outputs, see what these models are good at. And more interestingly, see what they're not good at. A number of these slides will have examples of these models, showing where they do really well, and then where they fall over.

Tasks That Would Be Easy If We Were Good at Natural Language Understanding

Before I interrupted myself, I was going to talk about tasks that would be easy if we were good at natural language understanding. I'm just going to take you through a few of these tasks, I don't want to go through all of them. One is parsing, we'd like to parse sentences, we'd like to know that the movie is not actually going to start to eat your popcorn. We need to understand what are the different parts of the sentence and what do they mean and how do they fit together?

Another problem we'd like to solve is named-entity recognition. I have some text, and I want to extract the entities from it and say what type they are. Here's the sentence, it has a date, it has a person, it has an organization, which isn't quite right but it's close enough, it has a facility, it has a geographic entity. These are not just academic problems. If you can solve this named-entity recognition problem, and you have a bunch of contracts, you can extract the important terms. If you have a bunch of invoices, you can figure out what information you need on there, and so on, so there are a lot of real practical applications to a lot of these NLP problems.

A related problem is co-reference resolution. Once we've identified these entities in our text, which ones refer to the same thing? Here, this model said that all the blue boxes are referring to Paul Allen, all the green boxes are referring to Seattle, all the pink boxes are referring to Bill Gates, and so on. This is another problem that people work on.

Another one with real obvious practical applications is machine translation. I was in Japan last week for spring break and in Japan, everything's in Japanese, it turns out. Machine translation is super useful, or it would have been if it had worked well, which it didn't. This is me pointing the Google Translate camera at a menu at a sushi restaurant I went to. Then when I tried to order the balm vinegar, they looked at me like they had no idea what I was talking about. If it had worked well, I bet I would have had a really good meal instead of what I had, which I'm not sure what it was.

Another problem with very obvious applications is summarization. There's a lot of text out there and who has time to read it? There's Twitter to check and there's Facebook to check and all that. We want to build computer systems that can take in a bunch of text and say, "Here are these four paragraphs, but what they're really saying is, 'Attend QCon.ai.'" This is not actually a computer model, I did this one myself, but this is another problem where, if we can solve it, it has a lot of real obvious applications: writing headlines for articles, or just taking documents and saying, "Here are the important things you need to know."

Another one that's obvious is text classification. Here's the news article, is it real or is it fake? You can think of like lots of examples of where you would need to classify text into one class or another. If you can come up with a real good algorithm for determining whether news is fake or not, the world will beat a path to your door.

Another way of checking whether computers are really understanding language is to give them some text and ask them a question about it. Here is a sentence that says I'm speaking at QCon.ai. You can ask it, who is speaking at the conference? This is a real model trained on the SQuAD data set for the AllenNLP demo, and it correctly identifies that I am speaking at it. Well, does it really understand the sentence, or has it just learned some clever tricks? Let's ask it a question that doesn't really have an answer in the sentence: how many hotels are there in San Francisco? Well, the sentence doesn't tell us, but it will dutifully guess. It looks like 55, because that's a number and it's a number question. Maybe there are 55 hotels in San Francisco, I didn't actually count, so it could be right by accident, but there's something still missing from what it's doing.

Another problem is textual entailment. We give a premise, "Joel is giving a keynote about modern NLP," and a hypothesis, "The audience is enthralled by his talk." We want to predict, is it the case that the first sentence entails the second, or do they contradict each other, or are they just completely unrelated? I asked the model, and it said, 95% chance unrelated, which is not ideal for me, but it's better than contradiction, so I'll take what I can get.

Here's another really interesting one, it's called Winograd schemas, and it's a mix of co-reference resolution and common sense reasoning. Here are two very similar sentences. “The conference organizer disinvited the speaker because he feared a boring talk,” and, “The conference organizer disinvited the speaker because he proposed a boring talk.” And the task is to figure out, who does “he” refer to in each of these sentences? Well, in the first one, it's probably the conference organizer who fears the boring talk and in the second one it's probably the speaker who proposed a boring talk. To figure out which is the right answer, you'd think it would require a somewhat subtle understanding of the relationships between conference organizers and speakers and how they relate and what their jobs are, so this is actually a really hard problem for computers to solve.

Finally, one that you hear about a lot is language modeling, which is basically: given some text and a model of what language looks like, predict what the next word should be. Here I gave it, "Is artificial intelligence dangerous? The answer is clearly", and I wanted to find out what this AI thinks, and it gave me kind of a mixed answer. It said "no" with 26% likelihood, but "yes" with 25% likelihood. I don't know exactly what to conclude from that, but “no” won out by a little bit, which is good.

There are a lot of other tasks too, and I could be up here for 40 minutes talking about NLP tasks, but then I’d never get to the interesting part, which is how do we solve them and what can you do with those solutions? So that's what I want to get to now.

If you are good at natural language understanding, which most of you I think probably are, then you'd be pretty good at those tasks, and if you look at those tasks, those are probably things that you can do. So the hope is that if computers get good at each of these tasks, then, a little bit like the archetypal drunk looking for his keys under the streetlight, we get good at natural language understanding. That's me being unfair, because, one, these tasks are valuable on their own merits. If you can translate language, that's a really valuable thing. If you can identify the entities in a sentence, that's a valuable thing too. The other thing is that, as you'll see, when we talk about what people are doing to solve these problems, it is likely that they're getting us closer to actual natural language understanding. They are resulting in tools and techniques that are useful, both for researchers and for practitioners.

Pre-Modern NLP

Before I get to modern NLP, I'm just going to talk a little bit about what I'll call pre-modern NLP, which is just kind of old school stuff. If you looked in an NLP textbook in the '80s, it's what you would have seen. There was a lot of linguistics, like focusing on what the rules of language are, and if you're of a certain age, which I am, then maybe you had to diagram sentences like this in school, and this is giving you elementary school nightmare flashbacks, but this is the kind of thinking that went into a lot of it.

Things like formal grammars, where we'd actually write out, here is what language looks like. A sentence has a noun phrase and a verb phrase, and a noun phrase can be decomposed into an adjective and a noun and so on, so being very explicit and formal about some of these things. Also, a lot of modeling using hand-crafted features, so you would say, “I would like to use unigrams and bigrams, and maybe I'll do some stemming,” and you would sit down and sort of plot out, "Here are the features of the text I want to use in order to make predictions about it."
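As a rough illustration of what hand-crafted features of this kind look like, here is a minimal sketch that counts unigrams and bigrams in a piece of text. The function names and the whitespace tokenization are assumptions made for the example, not something shown in the talk.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count unigrams up to max_n-grams in lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


features = bag_of_ngrams("the movie starts the movie")
```

Each distinct n-gram becomes one feature, which is why these representations end up as very large, mostly empty vectors.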

Then finally, you would see rule-based systems, where you have really elaborate rules that will tell you, “Here's how I classify text, here's how I assign tags, here's how I make sense of things.” You can see that these are very labor intensive and fragile and not always data-driven.

Modern NLP

That brings us to what I want to talk about, which is modern NLP. What are people who are researching and doing NLP in this century thinking about and focusing on? I identified five themes that really encompass a lot of the current direction in NLP. One theme is neural nets and low-dimensional representations. Before, you would have had these hand-crafted features, and you'd use one-hot vectors of them. You'd have huge vectors of 10,000 features of zeros and ones, and then maybe put a linear model on top of them. Now, you'll see a lot of denser, lower dimensional representations that capture a lot more, and neural nets used to either learn features automatically or replace linear models or things like that, so we'll see quite a bit of this.

A second theme is putting things in context. A lot of the real innovation that's going on is around taking words, represented as vectors or however, and coming up with a way to give them a context to capture what they really mean in relation to the surrounding words. The third theme is big data. If you think about it, there's so much text that's available out there in digital form right now, there's all of Wikipedia, there's all of Twitter, there are millions of web pages, there are billions of blog posts. What can we do with all this text data that's just sitting there ready for us to make sense of it and learn from it that we didn't have 20 years ago?

A fourth theme: use all the computing power. BERT, which we'll talk about in a bit, was trained on 16 cloud TPUs for four days. If you went back 10 years ago, there was no such thing as a cloud TPU, at least not publicly. We have so much computing power now that we never had before and this is driving a lot of the things that we're able to do, as well. Then finally, the fifth theme is transfer learning, which is the idea that we can take a model that's trained to do one thing, and maybe it's trained on a lot of cloud TPUs and a lot of data and use it to solve a lot of other problems. If someone who has more computing power and more data than us has already done the work to train it on the big problem, then we can do just a very small amount of work to get it to work on our problem. Those are the five broad themes in what I'll talk about.

Word Vectors

One of the foundations of what I call modern NLP, and one that you're probably all familiar with, is this idea of word vectors, so word2vec or GloVe. As I mentioned, that means representing words not as a vector of zeros and ones where the one says which word it is, but instead as dense vectors in some lower dimensional space, where hopefully we set it up so that similar words get similar vectors. It turns out that when you do this, a lot of times you get some really interesting properties, where if you take the vector for “king” and subtract off the vector for “man” and add in the vector for “woman”, then you get a vector that's close to the vector for “queen”. Similarly, you find that the difference between the vectors for “walking” and “walk” is similar to the difference between the vectors for “swimming” and “swim”.
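To make the arithmetic concrete, here is a small sketch using hand-made two-dimensional vectors. The numbers are invented for illustration (real word vectors are learned from data and have hundreds of dimensions, and the analogy only holds approximately), and the `analogy` helper is a hypothetical name.

```python
import math

# Hand-made toy vectors, assumed for illustration only. The first coordinate
# loosely stands for "royalty" and the second for "gender".
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman")
```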

So not only are you embedding your words in this lower dimensional space, but the actual orientation within the space captures some aspects of their meaning. Using these vectors was a really big innovation. How do you get these vectors? Well, basically, you cast it as a machine learning problem. We have a bunch of text and we want to solve the following problem: given a word, predict which words are most likely to appear near it. When you build a machine learning model to do this, using embeddings and then a linear model on top of those embeddings, what happens is that when you get good at predicting what words are going to appear near it, those embeddings make really nice representations for the words.
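The training data for that prediction problem can be sketched as pairs of (word, nearby word). The window size and the function name here are assumptions; the talk doesn't say which variant of this setup it has in mind.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["dog", "bites", "man"], window=1)
```

A model trained to predict the second element of each pair from the first ends up with an embedding per word as a by-product.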

When you have word vectors, what can you do with them? You want to feed them into a model and do something. For instance, you might want to predict the part of speech for a given word. Here's part of a sentence, and I hid some of the words from it, but “Something official, something heads, something, something, something.” I turned “official” into the word vector for official, and I turned “heads” into the word vector for heads.

I'd like to use those vectors to predict the part of speech of those words, which is a thing you might want to do. What's the part of speech of official? It could be a noun, or it could be an adjective. What's the part of speech of heads? It could be a verb, or it could be a noun, so knowing the vector for that word might not help you that much. If the sentence is “U.N. official Ekeus heads for Baghdad”, well, then “official” is a noun and “heads” is a verb, but if the sentence is “The official department heads all quit”, then “official” is an adjective and “heads” is a noun. Clearly, there's something more going on here than just having one vector per word.

That's where we get into this idea of context. We can give an even simpler example, which I know you've heard of. I have the word bites, the word man, and the word dog. Each of these has a word vector as well. And now I want to understand something about that sentence. But is that sentence "Man bites dog" or is that sentence "Dog bites man"? Because those are two very different sentences and one is news and one is not.

Imagine for a second that we had a way to somehow give context to these word vectors, so here we have this green context layer that sees not just each word, but also the words and context that came before it. Now we can say that “heads”, when we've already seen “U.N. official Ekeus”, is likely to be a verb, and we don't just have to use this word vector in isolation like that.

A lot of the innovation in NLP has been about how we add context to word vectors like that. Similarly, if I want to classify a sentence, is this fake news or not, I don't just want to take the word vectors and add them together or take their average, because "Man bites dog" is probably fake news and "Dog bites man" is probably real news, to use the same example again. If we have some way of adding context to these vectors, so that we know that “dog” comes after “man bites”, then that's probably fake. If “man” comes after “dog bites”, then it's probably not fake.
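The problem with adding or averaging can be seen in a few lines. The vectors below are made up for the example; the point is only that an average ignores word order, so the two sentences become indistinguishable.

```python
# Toy word vectors, assumed for illustration only.
word_vectors = {
    "man": [1.0, 0.0],
    "bites": [0.0, 2.0],
    "dog": [0.5, 1.0],
}


def sentence_average(sentence):
    """Average the word vectors of a sentence, one dimension at a time."""
    vectors = [word_vectors[w] for w in sentence.lower().split()]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


news = sentence_average("Man bites dog")
not_news = sentence_average("Dog bites man")
```

Both calls return the same vector, so any classifier built on top of the average has no way to tell the two sentences apart.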

Recurrent Neural Networks

Probably the most common way of adding this kind of context is using recurrent neural networks. After we have some word vectors, we have this hidden layer, where at each step, we take a combination of the previous hidden layer, or the previous context, and the next input word. We build up this state, one step at a time, taking the words and knowing what went before. Here, in this particular example, you have a matrix that you multiply the word vectors by and a matrix that you multiply the last hidden state by, and you add them. This is a very simple RNN that you would never want to use because, basically, it overwrites the whole hidden state at each step, and that makes it very hard to learn long-range dependencies.
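A minimal sketch of that simple recurrence follows. The matrices are arbitrary numbers chosen for the example, and the `tanh` nonlinearity is an assumption (it is a common choice, but the talk only mentions the two matrix multiplications and the sum).

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(W, U, x, h_prev):
    """One step of a simple RNN: h_t = tanh(W x_t + U h_{t-1})."""
    wx = matvec(W, x)
    uh = matvec(U, h_prev)
    return [math.tanh(a + b) for a, b in zip(wx, uh)]


def run_rnn(W, U, inputs, hidden_size):
    """Feed the inputs through one at a time, returning every hidden state."""
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = rnn_step(W, U, x, h)
        states.append(h)
    return states


# Arbitrary toy parameters (hidden size 2, input size 3), assumed for illustration.
W = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.4]]
U = [[0.7, 0.2], [-0.5, 0.6]]
man, bites, dog = [1, 0, 0], [0, 1, 0], [0, 0, 1]

states_a = run_rnn(W, U, [man, bites, dog], hidden_size=2)
states_b = run_rnn(W, U, [dog, bites, man], hidden_size=2)
```

Unlike an average, the final hidden state depends on the order of the words, so `states_a[-1]` and `states_b[-1]` differ.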

You have these variations called LSTMs and GRUs, which use a lot more parameters. Basically, they learn at each step which part of this input should I remember, which part of my hidden state should I forget, which part of my hidden state should I pass on to the output. These are more complex and slower, but they actually do a better job of learning long-range dependencies.

Then you might be thinking, "Okay, why am I only learning context in this forward order?" Every example I've given with this context is, what were the words that came before me? If I want to translate a sentence, I probably also care about the words that come after me, because that's really important for understanding the meaning of a sentence too. One thing you can do that's pretty easy is you have one LSTM that adds context in the forward direction, and you have another one that adds context in the backward direction, and you just concatenate them, and you get the left context and the right context. This helps out with a lot of things as well.
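The forward-plus-backward idea can be sketched independently of the particular recurrent cell. Here a running sum stands in for an LSTM purely to keep the example short; that stand-in and all the names are assumptions.

```python
def run(step, inputs, initial_state):
    """Apply step(state, x) left to right and return the state after each input."""
    state = initial_state
    states = []
    for x in inputs:
        state = step(state, x)
        states.append(state)
    return states


def bidirectional(step_forward, step_backward, inputs, initial_state):
    """Concatenate, per position, the left-to-right and right-to-left states."""
    forward = run(step_forward, inputs, initial_state)
    # Run over the reversed sequence, then flip back so positions line up.
    backward = run(step_backward, inputs[::-1], initial_state)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def running_sum(state, x):
    """Stand-in for a recurrent cell: the state is the sum of inputs seen so far."""
    return [state[0] + x]


contexts = bidirectional(running_sum, running_sum, [1, 2, 3], [0])
```

Each position ends up with two numbers: one summarizing everything to its left and one summarizing everything to its right.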

One cute thing you can do with this, that people like to do and share, is feed in characters instead of words, and build a little model of what words look like in some domain, like startup names or rock bands, and then generate new examples of them, so generate startups, generate Eminem lyrics. If you want to build a model from which hilarity ensues, these are often a good choice. Yes, hilarity ensues.

Another way of adding context is using convolution. If you're doing image processing, you use convolution to move a pattern over an image and see what's going on in the neighborhood of, say, a certain pixel. You can do the same thing with sequences of word vectors, and use convolution to see what's going on in the neighborhood of a certain word. This also allows you to develop this kind of local context.
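Applied to a sequence of word vectors, the convolution idea might look like the sketch below. It uses a single filter with hand-picked weights and no bias or nonlinearity, all of which are simplifying assumptions.

```python
def conv1d(sequence, kernel):
    """Slide a filter over a sequence of vectors.

    `kernel` holds one weight vector per window position. Each output is the
    sum of dot products between the filter and the vectors in that window.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(
            w * x
            for weights, vector in zip(kernel, window)
            for w, x in zip(weights, vector)
        )
        outputs.append(total)
    return outputs


word_vectors = [[1, 0], [0, 1], [1, 1]]  # three toy word vectors
kernel = [[1, 0], [0, 1]]                # a filter that looks at two words at a time
local_context = conv1d(word_vectors, kernel)
```

Each output only sees a small neighborhood, which is why several layers have to be stacked before distant words can influence each other.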

When you're doing machine translation, you have a sequence that comes in and you want to produce a sequence that comes out. You might have a sequence of Chinese characters, and you want to generate the corresponding sequence of English words. One way that this can be done is, again, you build up this context going forward, you take the final context, and you feed that into a different RNN basically, that takes that context and generates words one step at a time. There's something that's kind of unappealing about this idea, and that's that you're collapsing the entire input sequence into a single vector, and then trying to use that to capture the entire output.

Then we have this notion of attention, where, instead of saying, “The first step of my decoder is going to look at the last step of my encoder,” I get to look at all the steps of the encoder. I get to learn weights that say, "How much attention should I pay to each of them?" so that each word pays attention to different parts of the input sequence, and this causes things to work quite a bit better.
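One common way to compute those weights is to score each encoder state against the current decoder state and normalize the scores with a softmax. The sketch below uses a plain dot product as the score, which is an assumption; the talk doesn't specify a scoring function, and other choices exist.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return attention weights over the encoder states and their weighted sum."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
```

The decoder then uses `context`, a blend of all encoder states, instead of relying on the final encoder state alone.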

Large Unsupervised Language Models

A lot of the focus recently has been on these large unsupervised language models. I mentioned this as one of the tasks; you take a text input. “When Joel started talking about language models, the audience got extremely…”, and then you predict what's the next word, so maybe “excited”, “emotional”, “confused”, “intrigued”, “angry”, “agitated,” a couple of you, it seems like, so the model knows what's going on.
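The task itself can be illustrated with something far simpler than the neural models discussed here: a count-based bigram model that predicts the next word from the previous one. The tiny corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Return the estimated probability of each word that followed `word`."""
    following = counts.get(word, Counter())
    total = sum(following.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in following.items()}


corpus = [
    "the audience got excited",
    "the audience got confused",
    "the audience got excited again",
]
model = train_bigram_model(corpus)
probs = next_word_probs(model, "got")
```

The large models in this section solve the same prediction problem, but condition on the whole preceding text rather than a single word.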

What we can do with these large models, and what we have done, is produce basically word embeddings that depend on context. Imagine the sentence, "The Broadway play premiered yesterday." You want to get a word vector to represent "play." You can also imagine "The Seahawks play football today," and you want to get a word vector to represent "play." The “play” in “play football” and the “play” in “Broadway play” are very different words, different parts of speech, and they mean almost completely different things, so it's not ideal to have the same vector representing both of them. A lot of this innovation is, “Okay, how can we make it so that play gets a word vector that depends on all the words around it?” And so “Seahawks play football” gets a very different word vector from “Broadway play”.

ELMo

That brings us to ELMo. There have been a number of proposed ways of doing contextual embeddings. CoVe was one of the first, and there's ULMFiT, ELMo, the OpenAI transformer model. I'm going to talk about ELMo, because, one, it was developed by my colleagues at Allen AI, and, two, because I'm the one who suggested the name ELMo for it, which is probably going to be like the biggest impact I'll ever have on the field of NLP. It's almost certainly going to be the biggest impact I'll ever have in the field of NLP.

What ELMo does is it trains a forward language model, so basically predict the next word, and it trains a backward language model, predict the previous word, on a lot of data. When you give it a sentence, you run each word through each of those language models and take the corresponding hidden states as the contextual embedding. This is a little app I built for our hackathon last year, to explore what the ELMo embeddings are actually doing. When you're talking about word2vec or GloVe, you can say, "Given a word, what are the other words that are closest to it?" With these contextual embeddings, you can ask, "Given a word in context, what are the other words in context that are closest to it?"

Here's the sentence: "You can drink the whole can of soda if you like." So “can” appears twice in that sentence, but meaning two very different things. If you go to a word vector model, and say, "What are the words that are similar to ‘can’?" Well, there's “could”, there's “should”, there's “must”, “will”, “may”. So those are very similar to the first “can”, “you can drink”, but they're not very similar at all to the “can” of soda, so if you use that word vector to represent “can of soda”, you're missing a lot.

What I did in this demo is I basically took Paul Allen's autobiography, "Idea Man", and pre-computed all the ELMo vectors for it, so then I can say, "Given the ‘can’ in ‘you can drink an entire can of soda’, what are the most similar words in context?" They're all “you can catch”, “you can take”, “you can get”, “you can draw”, “you can helicopter”. Those similar words all appear to be in similar contexts, which is the easy case. It's more interesting to ask, what are the similar words in context to the “can” in “can of soda”? They are “can of corn”, “can of popcorn”, then “pack of cigarettes”, “jar of Tang”, “cup of afternoon tea”, “pack of gum”. These contextual embeddings say that the “pack” in “pack of cigarettes” is much closer to the “can” in “can of soda” than the “can” in “you can do” would be. That's really interesting, and a pretty fun result.

What my colleagues discovered is that when you take models for solving NLP problems that were trained on, say, GloVe vectors, and you replace the GloVe vectors with ELMo vectors, you get huge gains in performance. These contextual embeddings represent a huge step up from just vanilla word vectors. This has led to what some people are calling NLP's ImageNet moment. ImageNet is a data set for image classification. What people discovered is that if you take a model that does really well on classifying those ImageNet images, it can be fine-tuned to solve all sorts of other problems involving image processing. The dream is we'd like to do the same thing for NLP: find some model that we've trained on some problem and use it to solve all sorts of other problems. Here with ELMo, all we did is replace the word vectors, so it's not quite there, but it's definitely pointing us in that direction.

One other concept I need to introduce is this idea of self-attention, which is just another way of adding context, besides RNNs and CNNs. Attention, as we said in that translation example, allows each word in a sentence to attend over another sequence with various weights, but you can also apply that to the sentence itself. When you have “The animal didn't cross the street because it was too tired”, you'd hope that the "it" would attend in a way that assigns a lot of weight to the animal. When you have “The animal didn't cross the street because it was too wide”, you'd hope that the "it" would attend over the sentence in a way that assigns a lot of weight to the street. If you compute the self-attention, you get a contextual representation of the sentence that looks at every other word in the sentence at once, rather than one step at a time.

This is a cute illustration that I found online. If you take an RNN, you're basically feeding one element through at a time and building a context that way. If you take a convolutional network, you're looking at a little window, so if you stack a bunch of those up, eventually each element can see all the other elements. But if you take self-attention, in just one step, you can see all the other elements of the sequence. It makes it much easier to learn and capture these long-range dependencies.

The Transformer

This led to a very famous paper called "Attention is all You Need," which introduced a new model called the transformer. What the transformer did is it said, "You know what? We don't need to use RNNs or LSTMs or GRUs or any of these. We'll just use self-attention everywhere," which is why it's called "Attention is all You Need." It has an encoder, which takes your input sequence, applies self-attention, and just stacks up a bunch of those layers. It has a decoder, which applies self-attention in a masked way, because it doesn't want to cheat and look ahead at positions it hasn't generated yet, and then does attention over the encoding.
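The difference between the encoder's self-attention and the decoder's masked self-attention can be sketched as follows. This is a deliberate simplification, not the transformer itself: the vectors serve as their own queries, keys and values, whereas the real model learns separate projections, scales the scores, and uses multiple heads.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors, causal=False):
    """Let every position attend over the sequence; return (outputs, weights)."""
    n = len(vectors)
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        # With a causal mask, position i may only look at positions 0..i.
        visible = i + 1 if causal else n
        scores = [
            sum(q * k for q, k in zip(query, vectors[j])) for j in range(visible)
        ]
        # Masked-out positions receive exactly zero weight.
        weights = softmax(scores) + [0.0] * (n - visible)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
        all_weights.append(weights)
    return outputs, all_weights


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, full_weights = self_attention(sequence)
_, masked_weights = self_attention(sequence, causal=True)
```

In `full_weights` every position puts some weight on every other position, while in `masked_weights` the first position can only attend to itself.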

This model got state-of-the-art results on machine translation, as well as on some other problems. OpenAI said, "We could take the decoder of this,” which does this kind of masked self-attention, “and just build a language model out of it." What they found was that this language model they trained had a really exciting property: it could solve a lot of different problems. A lot of these problems, classification, entailment, you can think of as language modeling problems if you're clever, or not necessarily language modeling, but you can feed a sequence into a transformer to get some sort of contextual representation and then do something with it. For classification, they add a start and a delimiter, feed the text in, get the final state of the transformer, and classify it. With entailment, you put the two sentences together with a separator, feed them into the transformer, and classify it. What they found is that this transformer model, with a small amount of fine-tuning, was very good at each of these tasks. Not necessarily state-of-the-art, but it's pretty impressive. This is now very close to what we call the ImageNet moment.

People started thinking maybe we should have benchmarks that track the ability of models to perform on all sorts of different tasks at the same time, so here's the GLUE Benchmark, general language understanding, I'm not going to remember what the E is, but it's basically testing, can your model solve all these different things?

BERT

Finally, the big thing that came out at the end of last year is BERT. It probably wouldn't have been called BERT if ELMo hadn't been called ELMo, and ELMo wouldn't have been called ELMo if I hadn't suggested ELMo for the name, so this is probably my second biggest contribution to NLP. If you think about how the OpenAI transformer model works, it uses this masked self-attention, where it was building up context only in the forward direction, because the mask prevented elements from looking at elements ahead of them, so you have context, but it's only one-sided.

In ELMo, you had a forward context and a backward context, but they never really interacted with each other until the end, so BERT said, "What if we took the transformer encoder, and built up this bidirectional context at every step of the way?" You'd think that might be cheating, because if you can look at elements ahead of you, it's easy to predict elements ahead of you, so they did something really clever. Instead of making the task predict the next word, it randomly masks out some of the words and asks the model to predict the word that is missing. Take a sentence like this, and it predicts that the missing word might be “interesting”, “exciting”, “derivative”, “pedestrian”, “newsworthy”, possibly. Then similarly, “at a” could be followed by “conference”, “meetup”, “rave”, “coffeehouse”, “WeWork”. By masking out these tokens, we can look at the full context on both sides. There's no way for us to cheat by seeing what we're supposed to predict, because it's simply not there.
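Building training examples for this masked-word task can be sketched as below. The 15% masking rate and the `[MASK]` token follow the published BERT setup rather than anything stated in the talk, and the sketch leaves out BERT's extra step of sometimes substituting a random word or keeping the original instead of the mask token.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens; return the masked sequence and the hidden targets."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked = []
    targets = {}  # position -> original token the model must predict
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = token
        else:
            masked.append(token)
    return masked, targets


tokens = "joel is giving a talk at a conference".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

The model sees `masked` in full, left and right context included, and is scored only on recovering the entries in `targets`.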

They also did something else interesting, which is that they trained the model at the same time on a second task, next sentence prediction. Given a lot of text, they can either take two consecutive sentences or pick two random sentences, and ask the model to predict: did these sentences actually appear next to each other? So “Joel is giving a talk, the audience is enthralled”: feed it to BERT and it predicts a 99% chance that this is the next sentence. “Joel's giving a talk, the audience is falling asleep,” and it predicts 99% that it is not the next sentence. This allows the model to learn not just language, but also some notion of relations between sentences. BERT, they found, was able to get state-of-the-art on all sorts of tasks, and it was really kind of groundbreaking.
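Building training pairs for next sentence prediction might look like the sketch below. The even split between real and random pairs is an assumption of this sketch, `make_nsp_examples` is a hypothetical helper, and the extra example sentences are invented; it needs at least three distinct sentences so a random non-successor always exists.

```python
import random


def make_nsp_examples(sentences, rng=None):
    """Build (first, second, is_next) examples from an ordered list of sentences."""
    rng = rng or random.Random()
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            examples.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence other than this one or its successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), False))
    return examples


text = [
    "Joel is giving a talk.",
    "The audience is enthralled.",
    "The coffee is cold.",
    "Lunch is at noon.",
]
for first, second, is_next in make_nsp_examples(text, rng=random.Random(1)):
    print(is_next, "|", first, "|", second)
```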

GPT-2

The last model I'll talk about is this GPT-2, which is the next version of OpenAI's transformer model. They made it much bigger and they trained it on 8 million web pages, which they scraped from Reddit using some karma filter to make sure the web pages were all high quality. This is a lot more language to train on than the original GPT had, which was just some books. It had 1.5 billion parameters, which is 10 times as many as the original GPT and 5 times as many as BERT. They released this example of text that was generated by this model, and everyone I know who does NLP was really impressed by it, because it captures these long-range dependencies in a pretty incredible way. The prompt mentions the Andes Mountains, and then several paragraphs later, it talks about exploring the Andes Mountains. It came up with this Dr. Jorge Perez, and then it keeps referring back to Perez, Perez.

It's actually really impressive that the model was able to generate language like this, but it's not perfect. The second paragraph here doesn't actually make any sense. It starts talking about “After two centuries, a mystery is finally solved.” But this two-centuries-old mystery was never really mentioned, so it's not perfect. They ran it 10 times and picked the best one, so it's not like it does this every time.

One of the things that they said was, "We think this might be dangerous because it could be used to produce deceptive, biased, or abusive language, at scale," so they didn't actually release the full model. They released only a small model, which I've been using to produce deceptive and abusive language, not at scale, but just at a smaller scale. I gave it, "Scientists have proven that vaccines cause…" and, before you get too disheartened, you can tell it “It is a myth that vaccines cause” and it also says “autism”. So what it's doing is not saying, "Here's what I think I know," it's saying, "I've seen a lot of sentences and here's how I think that sentence ends." It sees a lot of sentences about “vaccines cause autism” and sees a lot of sentences about “It's a myth that vaccines cause autism”. It's going to give that answer for both of them.
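A toy version of "here's how I think that sentence ends" can be built from nothing but counts. This is emphatically not how GPT-2 works internally: it is a tiny count-based stand-in that only looks at the last two words of the prompt, and the four-sentence corpus is invented purely for illustration.

```python
from collections import Counter

# Made-up corpus, for illustration only.
corpus = [
    "some people claim that vaccines cause autism",
    "it is a myth that vaccines cause autism",
    "studies show no evidence that vaccines cause autism",
    "vaccines cause mild soreness in some patients",
]


def next_word_counts(corpus, prompt):
    """Count which word follows the last two words of the prompt in the corpus."""
    context = prompt.lower().split()[-2:]
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - 2):
            if words[i:i + 2] == context:
                counts[words[i + 2]] += 1
    return counts


for prompt in ["Scientists have proven that vaccines cause",
               "It is a myth that vaccines cause"]:
    print(prompt, "->", next_word_counts(corpus, prompt).most_common(1))
```

Both prompts end in the same two words, so both get the same most frequent continuation, even though two of the corpus sentences are denying the claim. A real language model conditions on far more context, but the point carries over: it completes text based on what it has seen, not on what is true.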

I decided to try a different approach, and say, "The earth is…" and “flat” beat out “round” by quite a bit. This is the disinformation that I'm going to go with and, hopefully, I'll get the Flat Earth Society to respond and retweet me, although I haven't tried yet.

How Can You Use This Kind of Heavy Machinery in Your Work?

Very quickly, how can you use this kind of heavy machinery in your work, now that you kind of understand what's going on? If you just want to use pre-trained word vectors, word2vec or GloVe, there are libraries called spaCy and Gensim, which make it really easy to do that. Better still, these pre-trained contextual embeddings, the BERTs and the ELMos, make your models a lot better. You can use AllenNLP; it's not really a production system, it's only a research system, but you can use it in production, and some people do. There's a company called HuggingFace that implements a lot of these models in PyTorch and releases them. TensorFlow releases a lot of them on TensorFlow Hub, so you can use either of these.

Pre-trained BERT allows you to build really great classifiers with a little bit of fine-tuning. I actually met a startup whose whole business model was that: you give us your data, we'll fine-tune BERT on it and give you the classifier. HuggingFace has this as well, TensorFlow has it, and you can use it on Google Colaboratory, but don't tell anyone that I told you to go use a notebook on Google Colaboratory because I'm not supposed to do that. Then use GPT-2 small, if you dare; it's very likely someone will release the big version or manage to implement the big version soon, and then you'll be able to use that, if you dare.

In conclusion, NLP is cool, and modern NLP is solving really hard problems and is changing really, really quickly. A lot of the stuff I talked about here just didn't exist two years ago. Lots of really smart people with lots of data and lots of compute power have trained models that you can just download and use. That should be your real takeaway from this: the pre-trained BERT model is out there, the pre-trained ELMo model is out there, the pre-trained OpenAI GPT model is out there. You can take these models that they've trained and done all the work on, and with very little extra work, fine-tune them and get really great results on your own problems.

Take advantage of their work. Now you can be a pre-modern practitioner, fine-tune a transformer model, and get really good results.

Thank you, I will tweet out these slides. Here's my blog, the links for AI2 and AllenNLP, the little app I built for exploring the GPT-2 predictions, and my podcast. Thank you for coming.


Recorded at:

May 22, 2019

Joel Grus
