Transcript
[For more information, you can find Emmanuel Ameisen on Twitter @mlpowered]
Ameisen: What I want to talk about today is practical NLP for the real world, which is maybe a lofty goal. What I really want to cover are lessons learned from building a lot of NLP projects, and practical tips that can help you succeed and go beyond the standard blogs or papers that you might read.
Why am I talking about this? I work at Insight Data Science, and I'll tell you a little bit more about that, but my role has mainly been helping a lot of brilliant fellows build applied machine learning projects. A lot of them are on topics other than NLP, but quite a few of them are NLP projects, so I put a few on the right here. In 2017, we had this automatic review generation project for Yelp where we managed to generate reviews that were really hard to detect. We had another project where we were classifying support requests automatically; I'll tell you a bit more about how these work and what works in practice.
What's Insight? Basically, all of these projects were done by Insight alumni, who come from all over the U.S. This map is outdated; we now have offices in L.A. and in Toronto. These fellows come to Insight, build these projects, and then go on to work at some of the companies listed here and others, and while they're at Insight working on these projects, the Insight staff and mentors help them.
Let's dive right into practical NLP. What I want to do is talk very briefly about the "Why?" I think most of you probably have a sense for why you would use NLP, so we won't spend too much time there. Then talk about the theory, and then talk about where maybe some of that theory breaks down in practice and some of the challenges that are there.
Practical NLP: Why
What's NLP and why would you need it? Here are practical examples of things you could do with it. We had a project on medical understanding where you automatically extract diagnosis keywords from recent medical publications and use them to update a database that doctors can reference when treating patients, so they always stay up to date with the newest treatments. We had another project that applied NLP to code, where you simply read somebody's code and their answer to a coding screen question and try to automatically assess whether they'll get hired or not. It turns out that a machine learning model is pretty good at that, which, I think, tells you a lot about whether that's a good interview practice or not. Then there's support ticket triaging, routing support tickets to the right place; there are many versions of systems like that, but I'd say it's a very common use case of NLP that can also deliver a lot of value.
Why focus on NLP? I think images get all the hype, but when I talk to my friends who are ML engineers or data scientists, it's very rare that their day-to-day tasks include identifying stop signs, unless they explicitly work on that problem. Most useful data is in text, whether it's public data like tweets or Reddit, your proprietary data, or a mix, such as reviews or comments about your company. Maybe more importantly, since you're all familiar with NLP in general: compared to computer vision, it's much easier to deploy NLP models; they're usually shallower, usually easier to debug, and usually more affordable to maintain. That depends, and we'll go into some pretty complicated models a little later, but usually, that holds true.
In Theory
In theory, how will NLP save the world? In theory, it just works. If you read corporate blogs or papers, you do your end-to-end approach: whatever your NLP problem is, translate this to that, write code automatically, write reviews automatically. You put in your inputs, you put in your outputs, you train for a couple of months on 200 GPUs, and then you're done. Your data is easy: either you have a standard data set that's used in academia and you can just use it and see if you can get that extra percent of performance, or you work at a massive company that has absolutely infinite data just one query away, or you have the money to have somebody label 9 million photos, or sentences, if that's what you need.
Finally, I'd say the other "in theory" assumption is that you build your model, you get a really good model, and then that's it. That's where most papers stop, that's where most blogs stop, but how do you know that your model is actually good? How do you know how to deploy it? How do you know when to update it? How do you know when to change it? These are some of the things that I wanted to talk about.
A lot of this theory, I say, comes from the promise of deep learning, which is that it's going to automate a lot of this work for you. You'll be able to just feed in raw data and through some models that will automate feature extraction for you, you won't have to do a lot of that work. I think in practice, deep learning is very useful, but it doesn't solve nearly all of the practical problems that come up.
I wanted to illustrate this with an example, so this is another project that we did at Insight. This was by a fellow in the summer of 2018, the project is simple, the idea is you want to learn how to paraphrase sentences, so say the same thing in a different manner. Why would you want to do this? A big example is, let's say, you're on the Alexa team or the Google Home Team, and you want to capture all the ways that somebody could ask you, "Hey, Alexa, play this on the TV."
We built this model that learns to paraphrase. We gathered a massive corpus of sentences that roughly mean the same thing, which was a whole endeavor in and of itself, and then trained a fairly simple encoder-decoder, which is a pretty standard deep learning architecture for text, and we got some reasonable paraphrases; if you look on the right, that seems reasonable. Here are a few observations about it. The model is powerful, so that's good: it's pretty hard to do this, and it generates reasonable suggestions. It's data hungry: I would say most of the work was just looking at the data, looking at the data again, cleaning the data up, realizing that the sentences weren't aligned, realizing that there weren't enough sentences, realizing that there were a few things missing.
It is resource intensive, the final model took over a week to train, and it's brittle. Here I gave you, I think, a decent example of it working pretty well, but it's very brittle and it's going to be hard to see how brittle it is. What I mean by that is I was just showing this model to a friend after building it; my friend's name is Tosh, he's great. We put in a sentence that said something like, "Oh, Tosh is going to the grocery store." It turns out that the model, because of some data it had seen, was 100% confident that Tosh was a curse word and just spat back insults. No matter what we did, no matter where we put Tosh, it would just spit out insults. My friend was fine with it. He was like, "Oh, that's deep learning." I was like, "Well...", but in the real world, if you were ever to use this, it would be a pretty bad experience: you find yourself insulting somebody in your living room and all of a sudden, Alexa starts playing music, and you'll be a little confused. There are a lot of issues that come with what happens before you have a model, so getting a data set that works, and with what happens after you have your model, in between "Hey, I trained the model, the curve looks really good" and "we have a product that we can actually use."
In Practice
The way I like to think of machine learning in practice is that it's just like regular bugs, but worse, because they're often harder to see. Machine learning code can entirely run, no errors, no warnings, but everything is wrong and your model is terrible. It can entirely run with no errors, no warnings, and good-looking accuracy, but everything is wrong. I think that's a lot of the challenge here. Some of you have probably seen this quote, "In theory, there is no difference between theory and practice, but in practice there is." As I was preparing for this talk, I was Googling who gave that quote; in theory, it's Benjamin Brewster, but in practice, it's very disputed.
Here are the real challenges and the way that we try to think about them. One, there are very many ways that you could frame an NLP task. NLP is broad. If you think about just understanding text, or understanding sequential data in general, that's a pretty broad domain. There are a few tasks that work really well; you don't have to transform everything into these tasks, but if you can, you're in a pretty good spot. Mainly, these are classification, where you take examples and give them one or multiple categories; named entity recognition; and information extraction, where you take some sequence and try to extract the salient information: somebody said, "I won't be able to make my appointment tomorrow," and you say, "Ah, they're talking about an appointment and the date is tomorrow."
Those tasks usually work pretty well and are pretty well understood, and anything outside of those has to do with embeddings, which is finding a good representation of your text so that you can use it later for recommendations, for search, for anything like that. The other thing, which we talked about already and which I'm going to focus most of the talk on, is the debugging step: once you have a model, how do you look at it, how do you validate it, and how do you do a deep dive?
We'll walk through a practical example, and this practical example is actually from a blog post that I wrote over a year ago that was pretty popular. I basically took that blog post, which was a standard machine learning pipeline, and did a deep dive on it, which I hadn't done in the original post, to see exactly what's going on. Here's what the practical example is: this is a professionally curated dataset where contributors looked at a little over 10,000 tweets, almost 11,000, that contained disaster words. The question is, is this about a real disaster? Meaning a disaster that somebody would want to know about if they were emergency responders or the police, or just actual bad things happening in the world that you'd generally want to know about, versus somebody who was just very angry about a sushi restaurant and was using extremely strong language. Can we build a model to separate the two?
The reason I chose this task is that I think it's really interesting, because by design you're trying to separate things that use the same words: this whole data set was curated by looking for these words here, "ablaze, quarantine, pandemonium," etc. You're building your task so that it's a little harder, because you can't just discriminate on words when a lot of these tweets share the same vocabulary.
Here's what we're really going to talk about: one, vectorization, or embeddings in a way; two, visualization; three, the model (because so many resources are dedicated to how you train models, and how you train good models, we won't spend too much time on that); and four, a deep dive into the model to actually analyze it.
Vectorize
How do you feed your model data? This is something where I wasn't sure how familiar all of you would be with this; I'm happy to dive deeper or stay at a high level depending on how people feel. Models usually can't take in raw strings and train or predict; most machine learning models need numerical data, so all models that work on text will need to find some way to represent text as a number or as a set of numbers, as a vector.
You can think of it on the left: you can simply transform your text to a vector yourself, using some heuristics we can talk about, and then feed it to some simple model, like a logistic regression. Or, if you like really modern NLP, this is a diagram from ULMFiT, which was one of the first papers that started the transfer learning for NLP phase. You can have a super complicated model, but if you think about it, fundamentally all the model does is give you, in a complicated way, a really good vector for a sentence. Once you have your vector, then basically you have what is equivalent to a logistic regression and you pass the vector to it. In practice, there are some nuances there, but fundamentally, that's what's happening.
Who hasn't seen this diagram? At this point, I feel like every presentation about NLP has to have the word2vec plot, so there's my contribution to that rule. I'm just putting it up to remind you that ever since 2013, 2014, there are ways to vectorize words, to find embeddings for words, to find ways to represent words and sentences as a set of numbers that are semantically meaningful; meaning that on average, your hope is that words or sentences that talk about the same things, that mean the same things, will have vectors that are close to each other.
How do you do this? I feel like there are enough talks about this; there's word2vec, GloVe, and there are very recent approaches like BERT. The main takeaway here is that we find a way to take our tweets, our sentences, and make them vectors, and there are a lot of pre-trained models online. Here, for this part, how good the model is, is not that important, so we're just going to start with something that gives us pretty good vectors.
One simple way to do this is to not even use any of these complex models, BERT, GPT-2, all that stuff. You take your sentence, you take all the words, so, "We love machine learning," and you transform them into vectors using a pre-trained model that you can find online, which is basically a dictionary mapping each word to a vector. You take the average, you also take the max along each dimension to preserve salient words that have a strong meaning, or you concatenate both, and that gives you a pretty good vector; this is from a paper that you can see at the bottom. This is definitely not the best-performing way to embed sentences, but it gives you pretty good results and it's really simple, so we're going to do that.
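As a rough illustration, here is a minimal sketch of that average-plus-max pooling idea, assuming you have pre-trained word vectors in word2vec format loaded with gensim; the file path and the tweets list are placeholders:

import numpy as np
from gensim.models import KeyedVectors

# Placeholder path to a pre-trained word2vec-format file.
word_vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)
dim = word_vectors.vector_size

def embed_sentence(sentence):
    # Keep only the words the pre-trained model knows about.
    tokens = [w for w in sentence.lower().split() if w in word_vectors]
    if not tokens:
        return np.zeros(2 * dim)
    vectors = np.array([word_vectors[w] for w in tokens])
    # Average pooling plus max pooling along each dimension, concatenated.
    return np.concatenate([vectors.mean(axis=0), vectors.max(axis=0)])

embeddings = np.array([embed_sentence(t) for t in tweets])  # tweets: list of raw strings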
Visualize
The traditional machine learning way is that somebody will still be like, "Well, I have this great new method. I embed the vectors, then I feed them to my classifier and done, we get 80% accuracy, this is great, we're ready to deploy." But let's actually look at our vectors. Here, for the same dataset, I used a dimensionality reduction technique, just PCA. You can think of PCA as just a way to project data: those vectors are very large, 300 dimensions, let's say, so I wouldn't be able to show them directly, and instead I'm going to show them in a 2D plot. There are a variety of dimensionality reduction techniques, PCA is one of them, and it just helps you project from 300 dimensions down to 2 so that we can actually look at them.
Here, this is a simple embedding technique that's even simpler than the one I've shown. If our classes are pretty separated, we'd expect the embeddings to spread out, so we'd have blue on one side, which is disaster, and orange on the other side, which is irrelevant. That isn't happening here, which is fine, so we're going to try a slightly better embedding. This is TF-IDF, which normalizes words a little better, and this is starting to look a little more separated. Then this is using the word vectors from that famous slide. This is looking a lot better, so we can have hope that our vectors are pretty reasonable; maybe if we feed this to a classifier, we'll be in a good spot, so we do.
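For reference, a minimal sketch of that kind of projection with scikit-learn and matplotlib, assuming the embeddings array from the earlier sketch and a NumPy array of 0/1 labels:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the high-dimensional sentence vectors down to two dimensions for plotting.
projected = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(projected[labels == 1, 0], projected[labels == 1, 1], s=8, label="disaster")
plt.scatter(projected[labels == 0, 0], projected[labels == 0, 1], s=8, label="irrelevant")
plt.legend()
plt.title("Tweet embeddings projected to 2D with PCA")
plt.show()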
Model
This was a very long blog post that I'm summarizing in a few slides, but essentially, we get a classifier. We use a simple one, logistic regression, and get 77% accuracy on relatively balanced classes. It's basically a two-class problem; there's a third category here, "Unsure," as you can see, but there are five examples in the whole data set so we sort of ignore that. We get good results, so again, 77% accuracy. In the blog post, I go on to try more complex models, a CNN, an RNN, etc., and that gets us up to 80% accuracy. Now we're done, we have our model, we have 80% accuracy, we're ready to deploy it. We're going to give it to the FBI and all of the police crawling Twitter and just be like, "Yes, here, just use this. It's great," or not; or we're going to dive a little deeper and see what's actually going on.
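That baseline is only a few lines with scikit-learn; here is a sketch under the same assumed embeddings and labels as before (the talk reports roughly 77% accuracy for this kind of setup):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=40)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

# Accuracy plus a confusion matrix, which shows where the errors go.
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))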
What we're going to do is inspect the model, inspect the data, and then inspect the errors, which I like to think of as the combination of the model and the data. The way we're going to do this is first we're going to just look at the model. Because we used a simple logistic regression, we can just look at the coefficients of the logistic regression and see what words it finds important. Here, for disaster, the first few words are "Hiroshima, fires, bombing," which seems pretty relevant; for irrelevant, it's "see, this, never." It seems like the disaster words that the model picked up on to use for its decisions are pretty relevant. That's looking good so far, but let's dive a little deeper.
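One way to get that list of important words is sketched below, with a TF-IDF bag-of-words model so that each coefficient maps back to a single word; the tweets and labels variables are the same placeholders as before, and this assumes a recent scikit-learn:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)                      # tweets: list of raw strings
model = LogisticRegression(max_iter=1000).fit(X, labels)

words = np.array(vectorizer.get_feature_names_out())
coefficients = model.coef_[0]
# Largest positive weights push toward "disaster", largest negative toward "irrelevant".
print("disaster:", words[np.argsort(coefficients)[-10:][::-1]])
print("irrelevant:", words[np.argsort(coefficients)[:10]])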
One thing that I found super useful, and that I recommend to fellows at Insight all the time, is to use vectorization (again, just transforming your data into a vector) and dimensionality reduction techniques to inspect your data and validate it. Oftentimes, you'll hear, especially at companies that have the resources, "Just label some data." That's often uttered by somebody who never had to label any data. If you work in machine learning, I would say one of the most instructive experiences you could have is to spend three hours labeling data; it will change your life, maybe not in a good way, but it's really enlightening. There are very many things that are extremely hard about labeling data, and one of them is just the numbness that comes with doing it a lot. The other thing is that once you've labeled 100 examples, whether the 101st tweet is about a disaster or not becomes a very uncertain concept, and so you just start guessing. You get into a flow state, but a very weird one of just guessing left and right. This was a professionally labeled data set, so it's easy to sometimes treat it as ground truth: if we have a model that performs at 100% on this dataset, then we have a perfect model.
Deep Dive
What I'm going to do is I'm going to do a deep dive on these labels. Here, similar to the plots I was showing before, we have a plot of all the labels with the relevant ones and the not relevant ones. This is a UMAP plot, which is a different dimensionality reduction technique. I chose a different one just to show you that these techniques are great at giving you a view of what your data looks like, but they make a lot of approximations. In fact, this looks very different from the other ones, but it's the same data. You just want to be a little careful about making too many assumptions, but they allow you to actually look at different parts of your dataset and actually look at the individual examples.
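A sketch of that kind of interactive view, using umap-learn and plotly so that hovering over a point shows the underlying tweet; the embeddings, labels, and tweets variables are the same placeholders as before:

import pandas as pd
import plotly.express as px
import umap

# UMAP gives a different 2D view of the same vectors than PCA does.
projected = umap.UMAP(n_components=2, random_state=40).fit_transform(embeddings)
frame = pd.DataFrame({"x": projected[:, 0], "y": projected[:, 1],
                      "label": labels, "text": tweets})

# Hovering over a point shows the tweet itself, which is what makes outliers easy to read.
px.scatter(frame, x="x", y="y", color="label", hover_data=["text"]).show()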
What we're going to start with is finding outliers. What are these points that are super far from the center? Is there a reason that they're super far? Are these tweets maybe messing with our model, or are they really complicated? At this point, I spent about 45 minutes trying to debug this visualization tool, because I kept seeing the same thing. I was like, "Ugh, the thing that shows the text when I hover is obviously wrong, because it keeps showing me 20 copies of the same tweet instead of one, so I must have some loop that's wrong somewhere."
It turns out that 10% of the data is basically duplicates or near duplicates. And it's usually duplicates or near duplicates of the worst things, because a lot of the duplicates are people who tweet things for contests. They'll tweet the exact same sentence with some extra stuff. This one says, "One direction is my pick for Army Directioners." The idea is that you have dozens and dozens of these repeated tweets that are going to be super heavily weighted by your model, but maybe not for a good reason; maybe this is not really what you care about. Then there are questionable labeling decisions. This person says, "I got drowned five times in the game today," which is unfortunate, but probably not a natural disaster, yet it's labeled as one. Then "China's stock market crash" is labeled as irrelevant. That one, I think, is actually even more interesting, because is a stock market crash a disaster? Maybe, maybe not. You can see how you'd get into that situation after labeling for a while, where you're just like, "Well, I don't know."
Then there's the even better version: this is about the movie about the Chilean miners who were trapped, and you have two tweets about them, one labeled relevant, the other labeled not relevant. They're the same tweet, so you can imagine how feeding that to a machine learning model would be extremely risky. What I wanted to show is that by just removing these duplicates, so removing a thousand duplicates and cleaning up the data, we get a much better model. In fact, this model performs on par with, if not slightly better than, the most complicated models I used in the blog post, even though it's the simplest model, just trained on cleaner data.
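A sketch of the kind of cleanup step being described, assuming the data sits in a pandas DataFrame with "text" and "label" columns; the normalization here is deliberately crude and just an illustration:

import re
import pandas as pd

def normalize(text):
    # Crude normalization so near duplicates (extra handles, hashtags, URLs) collapse together.
    text = re.sub(r"http\S+|@\w+|#\w+", "", text.lower())
    return re.sub(r"[^a-z0-9 ]", "", text).strip()

df["normalized"] = df["text"].apply(normalize)

# Drop exact and near duplicates before any train/test split,
# otherwise copies of the same tweet leak across the split.
deduplicated = df.drop_duplicates(subset="normalized")

# Also worth surfacing: near-identical tweets that carry conflicting labels.
conflicts = df.groupby("normalized")["label"].nunique()
print(conflicts[conflicts > 1].head())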
There's a little thing that I want to ask you about. If you look here, we have a better model, the metrics are better, the confusion matrix looks better, so we have cleaner data and a better model. It seems reasonable, but our accuracy only increased a little, which I was a little saddened by initially. After thinking about it, though, our accuracy should have dropped. Does anyone know why? What I'm saying is that after we cleaned our data set, removing all these examples that I showed you, our new model on cleaner data should be performing more poorly.
Participant 1: Typically, it should not treat the label the same way, therefore, correcting these labels. If you remove them, that means that you're losing part of the labeled data that should have been labeled correctly, therefore, you're losing things that you assumed were correct.
Ameisen: Yes, that's exactly right. All of the duplicates were actually really easy cases, especially because if you've seen two, then you can guess the next 20. Even more, we had severe data leakage because we weren't controlling for the duplicates, so if you had 30 copies of the same example, you probably had five in your training set and 25 in the set you used to validate. We would actually expect, since we removed all these duplicates, to have a much, much harder task. The fact that our model's metrics have actually improved shows that our model is not a little better, it's much better, because it's doing something much harder. The metric is only as good as the data.
How can you find out about the quality of your data? The easy way is to inspect it like we did; the hard way is to just deploy it to production. If you deployed that model to production, it would fail horribly, because it would no longer be dealing with this easy data set full of leakage, and then you'd know you did something wrong, but you'd have to go back to the first step anyway and look at the data.
Complex Models
I want to talk about how you would do that for complex models. We did this for a simple model, but is there a way we can go deeper, where we basically look at the data to see what's mislabeled? Once we've trained the model, can we see what's particularly tripping it up? Complex models are better, but they're often harder to debug and validate.
Here, I'll skip a little bit of the detail on complex models; I'm happy to talk about that in the questions. These are just a few that I've seen work well: CNNs, language models, and transfer learning. Once we have a complex model, as you now know, it's not the end; there are a few things that we can do to debug them. I'd like to narrow it down to two things. One is LIME. LIME is one framework out of many that's basically a black-box explainer. A black-box explainer makes no assumptions about what model you're using; it tweaks the input in various ways, sees how your model responds, and then fits a surrogate model around that, which tries to give you explanations. In this example of something that was relevant, it removes words and it says, "Oh, well, when I remove this word, your model says that it's not relevant. When I remove that one, it says it's more relevant," and so on, and so it gives you an explanation. No matter what your model is, you can get an explanation; here it's for a pretty complicated model, and we have a reasonable explanation of what's going on. You can then use LIME on, let's say, 1,000 examples picked at random in your data set, and average which words it thinks are important, and that will give you important words for any model. That's model-agnostic explanation, and it's really useful for high-level debugging.
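A sketch of that LIME usage with the lime package's text explainer; predict_proba_fn is a placeholder for whatever function maps a list of raw strings to class probabilities for your model:

from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["irrelevant", "disaster"])

# predict_proba_fn: takes a list of strings, returns an array of class probabilities.
explanation = explainer.explain_instance(
    "There is a huge fire spreading near my neighborhood",
    predict_proba_fn,
    num_features=6)

# Each word with its weight for or against the predicted class.
print(explanation.as_list())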
This is a trick that I don't see done enough in practice, so this is really the one that I want to share: visualize the topology of your errors. You have a model, you've trained it, and you'll have this confusion matrix where you say, "Ah, here's our true positive, here's our false positive, here's our accuracy." But what's actually happening? What are these errors? Is there any rhyme or reason to them? What you can do is the same plot as before; you'll notice that it looks pretty different, because here in orange it's the predictions that our model got right, and in blue it's the ones that it got wrong. You take all your data, take all your model's predictions, and see what it gets right, what it gets wrong, and whether there's any structure.
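A sketch of that error plot, under the same assumptions as the earlier sketches (the trained classifier, embeddings, and labels): color each point by whether the model's prediction matched the label, rather than by the label itself.

import matplotlib.pyplot as plt
import umap

predictions = classifier.predict(embeddings)   # predictions for every example
correct = predictions == labels                # boolean mask of right vs. wrong

projected = umap.UMAP(n_components=2, random_state=40).fit_transform(embeddings)

plt.scatter(projected[correct, 0], projected[correct, 1], s=8, label="predicted correctly")
plt.scatter(projected[~correct, 0], projected[~correct, 1], s=8, label="predicted wrongly")
plt.legend()
plt.title("Topology of the model's errors")
plt.show()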
By looking at that and zooming in, here on the bottom right side, you can see that there are labels in conflict. Here there are two tweets that are basically the same thing, similar sentences, where one is labeled as irrelevant and the other as relevant. You can see that there are still even more duplicates; apparently, our duplicate finder was not perfect. And here are data gaps; that one I thought was a little cheeky, so I added it.
There are a lot of examples of a joke that was apparently going around Twitter, which is, "I just heard a loud bang nearby. What appears to be a blast of wind from my neighbor's behind," so sort of a crass joke. Then there's, "Man hears a loud bang, finds a newborn baby in a dumpster." One of them is really bad, the other is just a joke, but our model has seen so much of that joke that it has associated a loud bang with, "Ah, certainly, that's a joke," and so it said that this horrible baby story was a joke and that it's fine.
Looking at the errors of your model can also make you see, "Ah, this is very clear, because we have a gap in our data." This is oftentimes what happens when a recommender system goes wrong or search goes wrong: sometimes a malicious actor has found some gap where your model isn't good and has exploited it. Using this visualization of "What does our model actually get wrong?" is super helpful for finding these.
Clear Priorities
After seeing hundreds of these projects, what I've learned, and I know this saddens most data scientists I talk to, is that your priorities, in order, should be to resolve all the duplicates and the conflicting data, then fix all the inaccurate labels that you learn about while doing this. After you've done this about 30 times, you can usually look at your errors and more definitively say, "Ah, perhaps, to understand the context of the joke, we should use a language model." That's usually the step that is useful, but only after you've done all of the many steps before. A better dataset solves more problems than any model.
In practice, what I recommend is to just find a way to vectorize your data. This seems like a simple tip, but it's so important, because debugging a data set by just looking at thousands of examples, especially for NLP, is extremely hard and mind-numbing. Vectorize it, organize it in some fashion, use different methods, and then visualize and inspect each part of it, and then iterate. That's usually the fastest way that we've seen to build models that work in practice.
You can find me on Twitter @mlpowered, if you want to know more about these projects, there are a bunch of ones on our blog. You can apply to Insight, or you can come partner with us.
Questions and Answers
Participant 2: This is more of a curiosity question based on the example you had given, where it took your friend's name Tosh and kept thinking it was a bad word, or something of that sort. What was the reasoning behind that, using the same techniques? What was it that led the model to that conclusion?
Ameisen: The more complicated a model is, the harder that is to say, so that one was actually quite puzzling to me. I don't know that I have a good answer even now; my best answer right now is that the model wasn't working on words, it was working on sub-word units called byte pair encodings. Think of a few curse words, which I won't mention here, that are a few letters away from Tosh: those curse words were in the data set, and so that's what I think happened.
Participant 3: Very interesting talk, thanks very much. One of the questions I have has to do with the models you've used in your career. I'm new to the NLP field, I've used Naïve Bayes a couple of times, I haven't gone as far as to use recurrent neural networks. Would you say that neural networks or recurrent neural networks are the best model to try every time, or do you think it's a case-by-case decision?
Ameisen: The best model to just try every time is a bit of a complex question, because there's what the best model will be at the end: if your model's implementation were perfect and your data were perfect, then usually a transformer or an RNN, if you have enough data, will give you the best results. In practice, that never happens, even if you have a large data set. In practice, what happens is that for whatever task you want, your data set is imperfect in some way. The best model to use, in my opinion, is actually even the scikit-learn tutorial approach of count vectors plus logistic regression, or word2vec plus logistic regression, and then doing at least one pass of this deep dive, because you know you'll be able to code that model in 10 minutes. Then, by looking at the data, you'll get much, much more value than anything else, by simply saying, "Oh, 10% of my examples are mislabeled; no matter what model I use, it'll be wrong in the end." So usually the best model is the one that you can implement in five minutes and then explore. Once you've done that a few times, you can go for the artillery of models, RNNs and transformers.
Participant 4: More of a curiosity: a lot of people type emojis, and in a lot of our data you end up having a million smiley faces and sad faces. Is that part of the vectorization? How do you deal with that kind of unknown-word language?
Ameisen: It depends on your data set and on what you want to keep. For this one, the emojis were removed from the data set, which I personally think is a terrible decision for this dataset, because you probably want to capture that: if there are 12 angry faces in one of those tweets, it's probably a big indicator. You can sometimes pre-process them away; if you were just trying to see whether a person ordered a pizza or a burger, you don't need emojis, or maybe you actually do for pizza versus burger, but it depends on the use case, as far as how you would represent them and how you'd recognize them. I don't know if they are in a lot of pre-trained word vector models, because those are based on Wikipedia and Google News, and those don't have that many emojis, but you can train your own word vectors.
I skipped over that slide, but fastText is actually a really good solution for training your own word vectors, and then you can embed any set of characters. You can use emojis; as long as there are enough in your dataset to learn what a frowny face maps to, you'll be able to just use them like a regular word.
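A sketch of what that looks like with the fasttext Python package, assuming "tweets.txt" is a plain-text file with one tweet per line, emojis included:

import fasttext

# Trains subword-aware vectors on your own corpus, so emojis and slang
# get embeddings as long as they appear often enough.
model = fasttext.train_unsupervised("tweets.txt", model="skipgram", dim=100, minCount=2)

print(model.get_word_vector("😡")[:5])      # the vector for an angry-face emoji
print(model.get_nearest_neighbors("😡"))    # tokens used in similar contexts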
Participant 5: Thanks for the talk. Regarding conflicting labels or conflicting interpretations, have you explored whether you can leverage the conflicting interpretations and maybe serve different models based on whether the consumer of the output aligns more with one labeler versus another?
Ameisen: For a lot of these conflicting labeling cases, as you said, it's hard to determine whether it goes one way or the other, so using the user's preference is a good idea. For this particular project, no, but here's how that's done a lot in practice. What we essentially have here is vectorized representations of all these tweets, and from that, we tell you whether a tweet is relevant or not. If you wanted to take user input into account, what you could do is vectorize the user as well, so find a vector representation of the user, and then use it as an input to your model.
That's what YouTube does, or at least, according to their 2016 paper, that's what they used to do, where based on what you've watched, etc. they have a vector that represents you. Then based on what you search, they have another vector, then, they feed both of those to their models. That means that when I search the same thing that you search, maybe because our viewing history is entirely different, we get different results, and so that allows you to incorporate some of that context. That's one of the ways, there are other ways as well.
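As a rough illustration of that idea, and my own sketch rather than YouTube's actual setup: concatenate an assumed per-example user vector with the text vector and let the classifier see both.

import numpy as np
from sklearn.linear_model import LogisticRegression

# text_vectors: the sentence embeddings from before; user_vectors: one embedding per
# example describing the user (history, preferences). Both are placeholder assumptions.
features = np.concatenate([text_vectors, user_vectors], axis=1)

personalized_model = LogisticRegression(max_iter=1000).fit(features, labels)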
Participant 6: Great talk. I was curious: if you have a classification task, by default your instinct is to use something like word embeddings, but is there any use, in your experience, for character-level embeddings or sentence-level embeddings?
Ameisen: It always depends on your data set, basically, is the answer; so maybe if you have a lot of emojis, you do something slightly different. If it's really important that you keep the order of the words in a sentence, because you have a data set where a lot depends on how the sentences are formulated, then you want to use something like an LSTM or BERT, something a little more complicated. Essentially, all of those boil down to finding a vector for your sentence, and what the best vector is is task dependent. However, what we found in practice is that for your initial exploration phase, usually some simple pre-trained word vectors work. I would say the point here is just to find something where you can get reasonable vectors really quickly so that you can inspect your data, and then worry about what the best implementation is. I wouldn't say there's an overall best one; it really depends on the task and what your data looks like. Sometimes you want character-level because just capturing a vocabulary won't work, because there's a lot of variance; sometimes you do want word-level, especially if you have small data sets, so it really depends.