Understanding Deep Learning



Jessica Yung talks about the foundational concepts about neural networks. She highlights key things to pay attention to: learning rates, how to initialize a network, how the networks are constructed and trained, and understanding why these parameters are so important. She ends the talk with practical takeaways used by state-of-the-art models to help us start building neural networks.


Jessica Yung is a research masters student in ML at University College London. She was previously at the University of Cambridge and was an NVIDIA Self-Driving Car Engineer Scholar. She applied machine learning to finance at Jump Trading and consults on machine learning. She is keen on sharing knowledge and writes about ML and how to learn effectively on her blog at

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Yung: How many of you have heard of deep learning? Good. How many of you know vaguely what a neural network is? Cool. How many of you have trained a neural network? Great. And finally, how many of you have used neural networks quite a lot or use them day-to-day? A couple, cool. This is perfect.

Some of you who have trained these neural networks might then empathize with the story that I'm about to tell you and if not, then well, welcome to the world of deep learning. Early on in my experience, when I first started my relationship with deep learning as it were, so I was working in finance and we were trying to predict stock prices which is a fairly challenging thing to do. I had some ideas as to what kinds of models I wanted to train and I thought was suitable for this problem. So I trained a couple of models, got a few results back, but quite frankly, the results were pretty bad. What do you do when your model is doing badly and you don't have strong evidence yet that the direction you're going in is completely wrong? You then go on and try a couple of different alternative models trying to see if some of them are going to work better, right?

But the problem in deep learning and with neural networks is that there are a lot of different parameters in a neural network. For example, the number of layers, the number of units in your layer, the type of layer. What? Not the brightness of a screen. The learning rates, how to initialize every neuron, and so on and so forth. And the thing is these parameters are often continuous, so you can try hundreds or even thousands of different combinations and still get really bad results.

Really, at the time I knew how to code a neural network, obviously, but I really didn't know what was going on in the model. I did the typical things you would do; you Google, "Why is my neural network so bad?" Then you read blog posts and look at what people have done before to see how you can better improve your model. But even so, I felt I was walking in half blind. It's sort of like I think debugging code, but the problem is here, you don't even know what the code is saying and you're trying to debug it, which is pretty insane. Why would anyone do that? So what I did at the time was I ran over 1,000 different experiments trying to find an architecture that was good for my problem.

With deep learning, you're always going to have to do some trial and error. But, if you do it without any sense of direction, then you're going to waste a lot of time and compute. This is really bad when training models can take hours, days, or even weeks. So the real aim of this talk is for you to avoid the mistakes I made. That's why we're here; so you guys can understand the foundational concepts behind deep learning so you can then use this insight to better design, debug and then deploy into production hopefully these models. The aim is then so you can then reduce this model development time from, say, weeks, or maybe infinitely long if you never get to that model, to maybe hours or even minutes of your precious time.

Why Deep Learning

Underlying this is the question, "Why deep learning?" I'm not going to talk much about this because you guys are already here. Deep learning is not good for every single application, but it has shown superhuman performance in things like image classification. This is a notable case where models can diagnose eye diseases from scans of the retina better than, or as well as, some of the best eye doctors in the world. Other examples are Google Translate or, if you were in the track earlier, your phone predicting what word you're going to say next, or generating speech from text.

Here's an outline of what we're going to talk about today. We're going to first look at a single layer, then we're going to put these layers together, then we're going to look at how you train these models. I'll talk about things like backpropagation, learning rates. Then we're going to go on to look at deeper models and the state of the art problems with training really deep models and how you tackle them. And finally, I'll end on some practical tips about how to quickly build powerful models. Some of you who have experience with neural networks I suspect are going to look at this list and going to go like, "Okay, that looks really basic," especially if you look at tutorials, they often skim or skip over this and go straight to the code. And that really is a reason why I think giving this talk, this topic in particular, is so exciting, is because there are a lot of things you can look at, but understanding is not something that you can Google. Understanding is like the hard part.

Anyone can go online and Google, "What's the state-of-the-art model for natural language processing?" Anyone can go on GitHub and clone an open source implementation of what someone has done, and it's really great we have those things. But when you actually use that implementation and run into problems, and your model is not working even though they said that they produced a state-of-the-art result, what's going to help you is this understanding of these foundational concepts. Not everyone is going to spend the time understanding and doing that, which is why I think that's really exciting. And honestly, at least from my experience of working and talking with people, this is what makes a really big difference.

Another reason I'm really excited to talk about this at this conference with you guys is that machine learning is very engineering-driven. Now, obviously the research is really important as well, otherwise, we wouldn't be here. But the truth is there are also a lot of parts that I'm not going to talk about today, things like preprocessing the data, things like setting up the infrastructure to train your models, things like deploying your models and even the process of training and tuning models itself. It's very much like software, very much engineering-driven so I think that you guys are really well-placed to make the most of this kind of technology.

Single Layer

With that, let's begin. A lot of you said that you've seen neural networks before, so you might've seen people have drawn them like this or like this. They look quite modular and that's because they are, so you can see that these things they're made up of layers like this. We're going to look at the most basic layer to start with, and the most foundational layer rather, which is a fully connected layer. It's called a fully connected layer because as you can see here every single input influences every single output. It has two major components; it has a linear layer and a non-linearity.

The first question to ask is, "Why do we have these two components?" It looks like this. Let's first look at the linear layer because it's more familiar. And that's the equation for a linear layer, so it's basically like a weighted sum of the different inputs. Why do we have that? It's because it's very quick to calculate, and it can be very powerful in many cases. So what can we model with a single linear layer? Quick quiz, can we model this? Yes, because it's a line. Can we model this? No, because we can only model lines because we only have one line, obviously. What if, say, because we're talking about deep learning, we stack all of these linear layers together? What can we do? It's just going to collapse into a single line because, well, if you do the math and multiply all the matrices together, it's just going to be a single matrix.
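The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch, not code from the talk: the weight matrices `w1` and `w2` and the input `x` are made-up values, and biases are left out for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_layer(w, x):
    """Apply a linear layer with weight matrix w to the input vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


# Two hypothetical linear layers with no non-linearity between them.
w1 = [[1.0, 2.0], [0.0, -1.0]]
w2 = [[0.5, 1.0], [3.0, 0.0]]
x = [2.0, 3.0]

# Passing x through both layers in turn...
stacked = apply_layer(w2, apply_layer(w1, x))
# ...gives the same result as one layer whose weights are the product w2 * w1.
collapsed = apply_layer(matmul(w2, w1), x)
```

However many linear layers are stacked this way, the product of their matrices is still one matrix, so the stack can only represent what a single linear layer can.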

What can we do to actually make depth work for us? The answer is we need a nonlinear layer, something that is not linear in order to put between these linear layers so then we can make more powerful representations. So these are some of the nonlinearities people use and we're going to go into more depth into them very shortly. This is the Sigmoid and the ReLU. But before that, because we want to analyze what an entire model does, we're going to quickly look at output layers. Again, this might be something that sounds really simple, but I've literally had a head of research come up to me and ask, "What's wrong with my neural network?" When the only thing that was wrong was the output layer and what was outputting. So this is something that we should still go over.

Let's have a problem. Suppose we were trying to model the amount of coffee we need in the foyer outside as the time of day varies. What we want to output is a continuous number. If we used one of these nonlinearities like the sigmoid, if you notice, look at the graph over there the axis, the output is only between zero and one. If we ended our network on that, we'd only have between zero and one liters of coffee, which is definitely not enough for this many people. Usually what you want to do is you want to add a linear layer after that, so you can then use that W, that weight matrix there to increase the range of your sigmoid so then you can model in theory as many values as you'd like.
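As a rough illustration of why the final linear layer matters for the coffee problem, here is a small sketch. The function `coffee_model` and all of its weights are hypothetical values picked for the example, not numbers from the talk.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def coffee_model(hour, w_in, b_in, w_out, b_out):
    """One sigmoid unit followed by a linear output layer (toy model)."""
    hidden = sigmoid(w_in * hour + b_in)  # always between 0 and 1
    return w_out * hidden + b_out         # rescaled to whatever range is needed


hours = [8, 10, 12, 14, 16]

# Ending the network on the sigmoid: every prediction is stuck in (0, 1).
hidden_only = [sigmoid(-1.0 * hour + 12.0) for hour in hours]

# Adding the linear output layer: predictions now span roughly 5 to 45 liters.
liters = [coffee_model(hour, w_in=-1.0, b_in=12.0, w_out=40.0, b_out=5.0)
          for hour in hours]
```

The output weight stretches the sigmoid's range and the output bias shifts it, which is what lets the network predict an unbounded quantity.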

What about if instead, we wanted to classify whether an image was that of a cat or not? Then in this case, a sigmoid would work really well, because we'd want a zero if it's not a cat and one if it is the cat. So the values in between would be how certain the model is that there is a cat in the image. To generalize this, so suppose we have this dataset where we want to predict what digit is in an image. Who here has seen the MNIST dataset of handwritten digits? Great. This is the kind of 101 or like to-do list of machine learning. You're predicting what digit is in these images.

In this case, we'd want a generalized version of the sigmoid over there. So we'd use something called a Softmax, and it's called a Softmax because it's like a soft version of a max operator. If you see there, the numbers at the top, the bigger they are the surer you are that the image is that of a digit of, say, the index. The image is a zero, one and a two, so you're really sure that it's a two. And in a max operator, you'd output 001, but here you do it a little bit less strongly than that.
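To make the "soft version of a max" idea concrete, here is a minimal sketch comparing the two. The scores are made-up values for three classes, and `hard_max` is a helper name introduced only for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    biggest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_max(scores):
    """One-hot vector with a 1 at the position of the largest score."""
    best = scores.index(max(scores))
    return [1 if i == best else 0 for i in range(len(scores))]


scores = [1.0, 2.0, 5.0]       # hypothetical scores for classes 0, 1 and 2
probabilities = softmax(scores)
one_hot = hard_max(scores)     # [0, 0, 1]
```

The softmax puts most of its weight on the last class but keeps a little probability for the others, whereas the hard max commits completely.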

Now we've finished talking about output layers and we're going to go on to talk about models. There are two simple ways in which you can increase the modeling capability of your models. The one people always talk about is you make your network deeper. The other one is to make each network layer wider which is to increase the number of output units there. It's actually a really interesting question as to why people don't call it “wide learning” and as to why going deeper makes it work so much better, but that's just how it turned out.

Now we are going to go back to a coffee problem and then look at how these models can model things. Suppose this is the curve that we want to model, so that's amount of coffee we need outside varying by the time of day. During talks it's lower because you guys are in talks usually, and during breaks, it would be higher, and more in the morning, less in the afternoon, sort of. Suppose we have a very simple network, literally like one hidden layer. You can see the linearity, the sigmoid. And we put linear layer at the end because we don't want our outputs to be just between zero and one. That's a sigmoid.

What comes out of this, notice, is always between zero and one and is always this shape. But depending on what goes into that input, if you had a lot of layers before, what goes into that input could actually already be a very sophisticated feature. So if you are dealing with images, then that could be for example whether or not there's a cat ear in the image. It's not very obvious how that comes about, but it just does if you use a convolutional neural network.

Just as a caveat, because we're in a talk of understanding deep learning, you might be like, "Wait, you're meant to understand deep learning. Why aren't you explaining to us step by step why that network can learn the fact that there is a cat ear?" Well, unfortunately, we don't know exactly how your neural networks work. What I mean by understanding deep learning is knowing the foundational concepts so you can then direct how you debug and design these models. By the way, if any of you understand exactly how neural networks work, please talk to me at the end, I would really love to hear from you. And everyone in the world would love it, too, I'm sure.

That is what would happen if we had sophisticated features. But since we only have one hidden layer with one linear unit, it's literally going to be a scaled version of the time of the day which is our input going into that sigmoid. We're going to get curves like this, or because we have a linear layer after that, scaled or shifted versions. So more coffee in the morning, less coffee all day, more coffee all day and so on. That's a bit boring, so let's try to make our model a bit wider here, still sticking with a single layer. Now you can see we have two sigmoids. And before the sigmoid, we have something that is like two parallel versions of the model that we had before. So there's two sigmoids. The interesting thing is that these two sigmoids can interact at the linear layer at the end.

Let's do some maths, see what happens if we add these two together. We get this sort of weird shape. Then if we subtract these two, at the very start it's zero minus zero, and in the middle the black one's above, so it's one minus zero, so that's one. At the end it's one minus one, which is zero. So we get something like a bump, which is a unit that we can actually work with to try to fill in these curves later on.

Going back to this, with two units, we had two sigmoids and so we had one bump. So if we have four units, we can have four sigmoids and two bumps and so on. You can imagine that if we have more and more sigmoids and more and more units, with only one layer, we'd be able to model this curve by filling it in with bumps like this. Now, here we only have six units, so I haven't filled it in completely, but you can imagine that when a single layer gets wider and wider you can actually model whatever you want.
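The bump construction described above can be sketched in a few lines. This is an illustration under assumed values: the `steepness` parameter, the bump positions and the heights are all made up, and `fill_with_bumps` is a hypothetical helper standing in for the final linear layer.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def bump(t, left, right, steepness=10.0):
    """Difference of two shifted sigmoids: near 1 between left and right, near 0 elsewhere."""
    return sigmoid(steepness * (t - left)) - sigmoid(steepness * (t - right))


def fill_with_bumps(t, segments):
    """Weighted sum of bumps; each segment is (left, right, height)."""
    return sum(height * bump(t, left, right) for left, right, height in segments)


# Two sigmoids make one bump between t = 2 and t = 4.
before, inside, after = bump(0.0, 2.0, 4.0), bump(3.0, 2.0, 4.0), bump(6.0, 2.0, 4.0)

# Six sigmoids make three bumps of different heights.
segments = [(0.0, 2.0, 30.0), (2.0, 4.0, 10.0), (4.0, 6.0, 20.0)]
approximation = [fill_with_bumps(t, segments) for t in (1.0, 3.0, 5.0)]
```

With narrower bumps and more units, the same sum can follow a target curve as closely as needed.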

But as I said in practice, we found that just making networks really wide doesn't work nearly as well as making them deep. And there's some sort of complex maths behind it and people trying to explain it. But again, that's the main takeaway for now. Although obviously, you would want to have a balance between the two, so you wouldn't want 1,000-layer deep network with only one unit, so people usually use networks with width about different powers of two ranging from 32 to 512, and depth going from, well now probably thousands, but even two or three can work quite well.

There are some other kinds of layers that you may have heard of. Who here has heard of a convolutional or a recurrent layer? Great. Quite a few of you. Here is a convolutional layer. This is the kind of stuff that works with images and works really well with images. Here are two key properties of a convolutional layer. I'm not going to go into this in depth. One is translation invariance. That means that if I want to know whether there is a cat in an image, it doesn't matter where the cat is, I should still classify it as a cat. The second one is locality. The idea is that if I want to use a feature, like the idea of, "Is there a cat ear in this photo image?" The only information I need is in a small local area of the image. I don't need to process the entire image at the same time to get one bit of information and then I can then combine all this information.

Just very briefly, the way that each convolutional layer works is that you multiply a patch of an image by the same weights, and you go across the image and then that becomes your output for the next layer. What is less talked about, I guess, is how convolutional layers can actually be seen as a special case of this fully connected layer we've already talked about. What you do is basically you reshape the input so that in every column of a matrix you have a patch of the image, and then you put it through this linear layer, reshape it and then put it through the non-linearity. The details are not my point here. My point is rather that all these layers are often built upon this fully connected layer, which is why we're talking about it today.
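The reshaping trick can be sketched on a one-dimensional signal, which keeps the bookkeeping short; an image works the same way with two-dimensional patches. The signal, the kernel and the function names below are assumptions made for illustration, and the non-linearity is omitted.

```python
def conv1d_direct(signal, kernel):
    """Slide the same weights across the signal, one patch at a time."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def patches_as_columns(signal, k):
    """Build a matrix with k rows whose i-th column is the i-th patch."""
    n_patches = len(signal) - k + 1
    return [[signal[i + j] for i in range(n_patches)] for j in range(k)]


def conv1d_as_linear(signal, kernel):
    """The same convolution written as one row of weights times the patch matrix."""
    columns = patches_as_columns(signal, len(kernel))
    n_patches = len(columns[0])
    return [sum(kernel[j] * columns[j][i] for j in range(len(kernel)))
            for i in range(n_patches)]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]

direct = conv1d_direct(signal, kernel)
as_linear = conv1d_as_linear(signal, kernel)
```

Both routes give the same numbers; the second one is just a linear layer applied to a rearranged input.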

The other thing is recurrent layers, which are really good with sequences, things like text and sentences, especially if you went to Jay's talk on NLP just now, you've been dealing with texts. Often when you deal with a sentence, you have a lot of different words. So at every time step, you would usually put in one word at a time, which raises the question, how do you deal with the words that came before that you aren't putting in at this time step?

You would have some kind of state or some kind of memory saved inside these layers. This memory will be passed from previous versions of your layer to the current time step. In this sense, the recurrent layer is like a fully connected layer, a stack of them, but stacked not vertically, but stacked horizontally, going backwards in time, which is really cool. Then you'd also input your current word in there and then get your output and so on.
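A very small sketch of this idea follows. It assumes the simplest kind of recurrent layer, with a single scalar state and a tanh non-linearity; the talk does not specify a particular variant, and the weights here are made up.

```python
import math


def rnn_step(state, x, w_state, w_input, bias):
    """One time step: mix the previous state with the current input."""
    return math.tanh(w_state * state + w_input * x + bias)


def run_rnn(inputs, w_state=0.5, w_input=1.0, bias=0.0):
    """Feed the inputs in one at a time, carrying the state forward."""
    state = 0.0
    states = []
    for x in inputs:
        state = rnn_step(state, x, w_state, w_input, bias)
        states.append(state)
    return states


# The same inputs in a different order leave a different final state,
# because the state remembers what came earlier.
signal_first = run_rnn([1.0, 0.0, 0.0])
signal_last = run_rnn([0.0, 0.0, 1.0])
```

In `signal_first` the state is still non-zero at the later steps even though the inputs there are zero, which is the memory being passed along.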

How to Train Models

Now that we've talked about the layers, how exactly do we train these models? By training these models, it means how are we adjusting the parameters so that we actually get it to do what we want it to do? This is like a really crucial part of the performance of the models, even more so than the first part. Suppose, we want to predict the digits in this dataset, and our objective is for the model to recognize these digits and we use a metric such as something like prediction error. In reality, something a bit more fuzzy. Suppose it's prediction error for now. Before we go into the algorithm specifically, let's think about what the information we have and what the constraints of our problem are.

It's very much like being on a mountain range, where your latitude and longitude and your position are like your parameters, and the height is like your negative loss or reward. Basically, you want to go to the top of the mountain, or not the mountain but to the highest point in this mountain range, so that you can then, say, have a really nice view. The reality though, is that you can't actually see the mountain because then you would have to evaluate how the model would perform on all possible combinations of parameters, which is super exponential time and compute. So you can only really evaluate the performance of the model where you are now, and you can also evaluate the gradient, which is the direction in which this loss is changing the fastest.

Given that kind of information and if you want to picture it, it's like being a person who has a blindfold on and had a walking stick and is just poking around and can only figure out what the slope is in that mountain range. So given that limited information, one good strategy would be to maybe use this information somewhat greedily and so say, "I want to go in that direction since I think it's the steepest." Let's use a very simple case to see what might happen.

We find a gradient and we go in that direction, go in that direction again and then perfect. We reach the top of the mountain. You can probably see that this is a really simplistic case, and you can see that I've made it work just right. Because if our step size was a bit bigger, we actually would have gone over the top onto the other side. And if the step size was too big, we would never actually reach the top of that mountain. Yes, so extreme case, if it were a cliff we would've gone way down there and would've had a lot of problems. This is what it would've looked like on our mountain range.

The question is how do we decide how far we step in that direction? This is the official equation that people use - not official - but the formal way people do this gradient descent. In our case, it was gradient ascent but usually, you want to decrease the loss, not increase the height, so it's called descent. That's why there's a minus sign. So these are the parameters of your model and this is the gradient which is a direction in which the loss is increasing the fastest. Then this alpha is the step size over here. I must stress this parameter, this learning rate, is really important and can make a huge difference. The question then is how can you identify whether the alpha or learning rate that you choose is good or not?
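The slide with the equation is not reproduced in this transcript. In standard notation, and assuming the symbols θ for the parameters, L for the loss and α for the learning rate, the gradient descent update described here is usually written as:

```latex
\theta_{t+1} = \theta_t - \alpha \, \nabla_{\theta} L(\theta_t)
```

The minus sign moves the parameters against the gradient, so the loss goes down, and α controls how far each step goes.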

This is an example of a loss curve that is pretty good. So this is the loss of the model as we train the model. That's the X-axis. An epoch is one time the model sees all of your data. You can see here that the loss is decreasing and it's decreasing stably. Here, there are two curves. One is called the training loss. That is the loss on examples that your model has seen and is allowed to update its parameters on, whereas the test loss is the loss on examples that your model hasn't seen. And really when you're choosing models, you want to look at the test or validation loss, because presumably, you're using your model on data that you haven't seen before. So you want to see how good it is on that.

Question: what we're going to do now is see what the loss curve is going to look like if our learning rate is too high. Suppose we have this loss landscape on the left and we plotted the loss on the right, and then we're following the gradient, want to minimize this loss this time, have a big step size, go way over there, the loss increases. Then we do this again. Notice this time we go even further back than we did before, because now our gradient is quite steep over there. So then if we do this over and over, we're going to notice that this loss curve is fluctuating a lot, because we have this minimum there that we're trying to hit but we can't because the step size is too big. This is one way you can identify that your learning rate is too high.

On the other hand, if your learning rate is too low, then you have a curve that is more like this. It's very flat because you're hardly decreasing anything at all. This one is slightly easier to identify. Just to put this concretely, this is the background for the curves that I was showing, the dataset where you classify different kinds of irises. It's a bit misleading because the dataset isn't actually images, but these are pretty flowers. So why not? Here we have a two-layer neural network with the linear layer afterwards, and we're classifying into three classes, remember. We have a Softmax that outputs three numbers, each indicating how strongly we believe that the example is of a certain type of iris.

Here's a quiz for you guys. Do you think that the learning rate here is too high, just right, or too low? If it's too high then raise your hand. Just right? Too low? Okay. People who raised their hands were correct. This is the very high learning rate of two, so never use something that high basically. You can see that because it's fluctuating a lot. It's not very obvious in this case that it's doing badly in a way, because it seems it's going near the bottom of the graph. But if you look at the axis over there, it's actually being very cheeky and it's actually one, so the loss is pretty high for this task.

For this one, it's a bit trickier. Who thinks the learning rate is too high? Who thinks that's just right? Good one. Who thinks it's too low? Yes, this one's really tricky, but thank you for raising your hands. This one is too low. That's because if you look at the axis over there on the left, the numbers are really small. So it's actually hardly decreasing at all. Yes, on the top, top left corner, again when you plot this kind of thing, you have to read all the axes, there's actually a loss of about one. And a better way of doing it is to plot it with a test loss, so you can see the curve is actually basically flat. With this one who thinks it's too high? Who thinks it's just right? Great. Who thinks it's too low? Great, no one.

This one's pretty good. If we plot all of these together, then you can see more clearly the difference in performance the learning rate can make. This is one of the things that I didn't get when I started out, and I spent ages testing different kinds of architectures when my learning rate was bad, and that was a nightmare. But anyway, often people use learning rates of like 10 to the minus 2 or slightly smaller, usually not 10 to the minus 8, but it depends on how complicated your model is; often the more complicated it is, the lower the learning rate you probably need. And you often make these smaller as you continue training, with optimizers like Adam or RMSprop.
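The effect of the learning rate can be reproduced on a toy problem. The sketch below is not the iris experiment from the talk: it runs plain gradient descent on the made-up loss L(theta) = theta squared, with three learning rates chosen for illustration. On this toy loss a too-high rate makes the loss grow outright as the parameter jumps back and forth across the minimum, rather than merely fluctuate.

```python
def train(learning_rate, steps=20, start=1.0):
    """Gradient descent on the toy loss L(theta) = theta**2.

    Returns the loss after each step.
    """
    theta = start
    losses = []
    for _ in range(steps):
        gradient = 2.0 * theta                    # dL/dtheta
        theta = theta - learning_rate * gradient  # gradient descent update
        losses.append(theta ** 2)
    return losses


too_high = train(1.1)      # overshoots the minimum further on every step
too_low = train(0.001)     # loss curve is almost flat
reasonable = train(0.3)    # loss drops quickly and stays down
```

Printing or plotting the three lists side by side shows the same three shapes discussed above: unstable, flat, and steadily decreasing.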

Question, how would you calculate this gradient? Fortunately, we have things like TensorFlow and other kinds of software like PyTorch or Autograd which don't have creative commons logos. I didn't put them on. Yes, it's too bad. The question is, so these are calculated for us, but then how do they actually go about doing this? Because we want to find out what might be the potential problems, because otherwise we're just going to get numbers spat back out at us and we won't know what to do with them. Here we have how the loss changes with respect to how the parameters in, say, the first layer change. We want to update the parameters in our first layer in a model of 10 layers. If you write it out like this, it's very complicated. We're not going to try to do that especially if you have 100 layers, just don't.

Fortunately, we can actually look at it in more of a recursive way. We can try to break down this problem and instead of differentiating it all at once and calculating the gradient all at once, we're going to calculate how much the output of the 10th layer changes when the input to the 10th layer changes, which is equal to the output of the 9th layer and so on. So we have this pretty expression here that is very modular again and factorized. This is called backpropagation, which you may have heard of. It's the backbone of deep learning as we know it today. We're going to look at this expression, so how the output of our model changes with respect to the input of our model and see what could go wrong there. And we're going to look at a three-layer model.
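The slide's expression is not reproduced in this transcript. Assuming the notation h_k for the output of layer k and θ_1 for the parameters of the first layer, the factorized gradient for the 10-layer example would look like this chain-rule product:

```latex
\frac{\partial L}{\partial \theta_1}
  = \frac{\partial L}{\partial h_{10}}
    \cdot \frac{\partial h_{10}}{\partial h_{9}}
    \cdot \frac{\partial h_{9}}{\partial h_{8}}
    \cdots
    \frac{\partial h_{2}}{\partial h_{1}}
    \cdot \frac{\partial h_{1}}{\partial \theta_1}
```

Each factor only involves a single layer, which is what makes the computation modular.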

Look at how to calculate one of these terms specifically zooming in again. You may remember that the single layer is composed of a linear layer and a non-linearity. You would multiply these two together. And let's look at the non-linearity, because the linear layer, the gradient is pretty much going to be that weight matrix W. Not that much to say at the moment. So the gradient of this is basically you can say the slope of this curve. If you notice in the middle, we have quite a large slope which is quite good. Because remember when we go back to this expression, we want to multiply our gradients with how the loss changes with respect to how our model's output changes.

If we think that changing this kind of output is going to improve our performance a lot, we really want to get that signal across. When we have a good gradient, we multiply this big-ish number by the signal we get back from our loss. We can then update the parameters of our model relatively effectively. Whereas if you have your weight over there and then have a gradient that's really, really small, you can hardly update the parameters of your model at all even if the loss is telling you, "Hey, you really should not be doing this." So how can we get around this?
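The difference between the middle and the tails of the sigmoid can be put into numbers. This is a small sketch; the `upstream` value stands in for the signal coming back from the loss and is an assumed value.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Slope of the sigmoid at z, using the identity s'(z) = s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


upstream = 1.0  # hypothetical signal arriving from the loss

in_the_middle = upstream * sigmoid_grad(0.0)   # 0.25, the largest slope the sigmoid has
far_out = upstream * sigmoid_grad(10.0)        # tiny: almost no signal gets through
```

The same upstream signal is scaled by 0.25 in the middle of the curve and by a few hundred-thousandths far out in the tail, so a unit sitting in the tail barely learns.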

You initialize the weights with small variance. What does that mean? So that's an example of what you would use, Glorot initialization. What I mean is the weights over there is the W. You have this W and you initialize it with a small variance so the weights will be roughly in this range over there. If your weights are very small, when you multiply it by the output X then it's also likely to be relatively small so then you can get the good gradient over here. Whereas if your weight is very big, and then you multiply by an X, then it's likely to be very big. So it's going to be far out over there and so you're going to get a gradient that's very small, it's not going to be as good. That's why you should initialize the weights with a small variance. Again, this is something that is not mentioned very much in tutorials but that is super, super important. Again, you can import a Glorot initialization layer to do it for you.
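To see the effect of the initial weight scale, the sketch below compares Glorot (Xavier) uniform initialization, which draws weights from a uniform range with limit sqrt(6 / (fan_in + fan_out)), against deliberately large random weights. The layer sizes, the random input and the standard deviation of 5 for the "large" case are assumptions chosen for illustration.

```python
import math
import random


def sigmoid_grad(z):
    """Slope of the sigmoid at z, written so that large |z| cannot overflow."""
    e = math.exp(-abs(z))
    return e / (1.0 + e) ** 2


def glorot_uniform(fan_in, fan_out, rng):
    """Weights drawn uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


def large_normal(fan_in, fan_out, rng, std=5.0):
    """Weights with a deliberately large spread, for comparison."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


def mean_sigmoid_grad(weights, x):
    """Average sigmoid slope over all units for the input x."""
    grads = [sigmoid_grad(sum(w * xi for w, xi in zip(row, x)))
             for row in weights]
    return sum(grads) / len(grads)


rng = random.Random(0)  # fixed seed so the run is repeatable
fan_in, fan_out = 100, 100
x = [rng.uniform(-1.0, 1.0) for _ in range(fan_in)]

glorot_weights = glorot_uniform(fan_in, fan_out, rng)
small_init_grad = mean_sigmoid_grad(glorot_weights, x)
large_init_grad = mean_sigmoid_grad(large_normal(fan_in, fan_out, rng), x)
```

With the small-variance weights, the pre-activations stay near the middle of the sigmoid and the average slope is close to its maximum of 0.25; with the large weights, most units land far out in the tails where the slope is close to zero.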

Deeper is Better

Now we're going to look into more state-of-the-art kind of work. Who has seen ImageNet before or has heard of ImageNet before? Okay, a couple there. This is a huge dataset of over 1 million images with 1,000 different kinds of labels. One interesting thing about machine learning and deep learning is that a lot of the progress is made off the back of datasets that are made available. So when a big dataset comes out, you often see a lot of progress. If there are no new datasets, then not so much progress.

The correct label for this image is leopard, but there are also many similar labels such as cat or a snow leopard, a saber-toothed tiger, that can make it quite challenging, as well as some other labels that are less relevant. But if we look at this, and the title Revolution of Depth is not from me, it's a fantastic title, this is the depth of the models that achieve the best performance on ImageNet from 2010 when the dataset was released up until 2015, so still a few years ago. But you can see that the depth here is increasing exponentially, which is incredible. The general consensus is that if you can train models that are deeper, going deeper is better.

But it's a really big caveat, being able to train models that deep, because you would have thought that if people knew that training deeper models was better, why did they not train a very deep model in 2014? In order to find out why they couldn't do that, let's look at, again, our gradient descent equation. So we're going to specifically look at the gradient that we calculate here. And as you recall, this is how the loss changes when a parameter changes and you can factorize it like this, and we'll look at that expression again. Let's say we have a 100-layer-deep model. What could go wrong?

Imagine if all these numbers were really, really small, the number that you get is going to be tiny and you're not going to be able to propagate the signal through to your 100th layer. This is called vanishing gradients and is a really challenging problem, especially with the recurrent neural networks we were talking about before. Because in that case, even if your network is very shallow, you have all of these horizontal, let's say layers going back in time, remember? If you have a sentence or a paragraph of length 100, even if you're training a very shallow network, you're still going to get this kind of problem.

Then on the other end, if they're all very big, or mostly very big, you get exploding gradients. This is much less common because you'd usually have a small number somewhere in there to offset it, but it can still happen. In this case, with exploding gradients, you have this massive parameter change, and then because your weight is very big, your output WX often goes out to there, and then you can't change your parameters and you're stuck. Question: how can we solve this problem of vanishing gradients?

This is one example in machine learning that I really like, because it's a very elegant solution, but it's not obvious, of course. In the ideal world, we would be able to just take out one of these 0.1's and make it a 1 or something. If we could turn all of these into 1's, or at least not-so-small numbers, then we would have solved our problem, because then this number, 10 to the minus 100, wouldn't be that small. How do we do this? You literally make the gradient one. You add a path where you take a copy of your input, skip over one or two layers, which is why these are called skip connections, and add it to your output.

In this case, your network is not learning the output, but the difference between your input and your ideal output. Even though your gradient might be very small if you go through those dense layers, if you go through this path that gives you a copy of the input, your gradient is going to be one. Because of this, they were able to train a network that was 152 layers deep, as opposed to 22 before with GoogLeNet and 19 with VGG in 2014. This is a massive step, and pretty much everyone uses this nowadays when they want to train really deep networks.
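As a rough illustration of why the extra path helps, here is a toy scalar sketch. It assumes each layer is just a multiplication by a small weight `w`, so a plain layer has gradient `w` and a layer with a skip connection has gradient `1 + w`. The function names and the value of `w` are assumptions for this example, not taken from the ResNet paper.

```python
def plain_layer_grad(w):
    # Gradient of (w * x) with respect to x.
    return w

def skip_layer_grad(w):
    # Gradient of (x + w * x) with respect to x:
    # the copied input contributes exactly 1.
    return 1.0 + w

def gradient_through(depth, layer_grad, w):
    """Chain the per-layer gradients across `depth` identical layers."""
    grad = 1.0
    for _ in range(depth):
        grad *= layer_grad(w)
    return grad

w = 0.01  # assumed small per-layer weight
plain = gradient_through(100, plain_layer_grad, w)
skip = gradient_through(100, skip_layer_grad, w)

print(f"without skip connections: {plain:.1e}")
print(f"with skip connections:    {skip:.2f}")
```

With these assumed numbers the plain stack's gradient all but disappears, while the stack with skip connections keeps a gradient of order one.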

Practical Tips

With that, we're nearing the end. Of course, that is why things like learning rate and initialization are really important. Here are some practical tips that can be really helpful. One is that you should overfit first. Just now in the machine learning open space, we had some questions about, "How do you go about choosing a model," or, "What if I don't want to use a lot of compute and want to choose between different configurations?" Actually, this is more about just seeing if there are any glaring problems in your model without having to train it on your entire dataset, which might be 1 million data points and might take weeks. You really don't want to do that.

So what you do is you put a few examples into your model, maybe three or something, and then see if your model can reach a very low training error and basically just memorize those examples. The idea is not that this is the best model, but that if your model can't do that, then there's definitely something wrong with it. If it can, there's no guarantee. But again, it's something that you can do very quickly to test whether your model is working. Just to labor the point, if you show this model examples it hasn't seen before, it's probably going to do very badly, because it has memorized exactly what the previous examples were saying.
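The check might look something like the sketch below. To keep it self-contained it uses a plain linear model trained by gradient descent on three made-up examples; the data, the `overfit_check` name, the learning rate, and the threshold are all assumptions for illustration. With a real network you would swap in your own model and training step.

```python
import numpy as np

def overfit_check(X, y, lr=0.3, steps=200):
    """Fit a linear model to a handful of examples; return the final training MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = X @ w - y
        grad = (2.0 / len(y)) * (X.T @ residual)
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

# Three made-up training examples with four features each.
X_tiny = np.array([[1.0, 0.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0, 1.0]])
y_tiny = np.array([1.0, -2.0, 0.5])

final_loss = overfit_check(X_tiny, y_tiny)
if final_loss > 1e-6:
    print("Could not memorize three examples: something is probably broken.")
else:
    print("Memorized the tiny set; no obvious bug, but no guarantee either.")
```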

The second thing is funny, it's called Batchnorm. This is basically a hack that you can use to make your models perform a lot better. These are layers you can import that go right before your fully connected layer, just to normalize the inputs so they have a mean of zero and a variance of one. Again, this is a very interesting story, because when the paper first came out, the authors had an explanation for why they thought that Batchnorm worked. But just last December at NeurIPS, which is the huge machine learning conference you might have seen in the newspapers, there was a paper that showed that the original authors were wrong about why this worked. That's just an example of how this field is continually evolving as more and more work in theory is being done. But yes, that is a very useful technique to use.
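The normalization step itself is short. The following is a minimal sketch of the training-time computation only, assuming a batch laid out as rows of examples and columns of features. In a real Batchnorm layer `gamma` and `beta` are learned and running statistics are kept for inference, both of which are omitted here.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) to roughly zero mean and unit variance."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero
    return gamma * x_hat + beta              # optional rescale and shift

# Made-up batch: three examples, two features on different scales.
batch = np.array([[1.0, 2.0],
                  [3.0, 6.0],
                  [5.0, 10.0]])
normalized = batch_norm(batch)

print(normalized.mean(axis=0))  # close to 0 for each feature
print(normalized.var(axis=0))   # close to 1 for each feature
```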

The third one is transfer learning. Who's heard of transfer learning? Oh my gosh, Siri's transfer learning or whatever. The idea of this is that training networks often takes a lot of data and a lot of compute, and you really don't want to have to train things from scratch every single time. You want to leverage things that other people have already done, as with open source software: use what people have done.

For example, in a convolutional neural network on images, you often find that the earlier layers... Again, those are not Creative Commons graphs, so I had to draw them myself laboriously, but it was quite fun. The lower layers often detect edges, then in the middle layers you can start to see shapes, and then there are the higher layers. This one was trained on cars, so you can see that the one on the left is like a wheel, and the ones on the right are like the images we see over there.

The idea is that you want to take these mid or high-level features, which people have learned by training networks on a lot of examples with a lot of compute, and use them for your own applications. One example is ResNet, and you can often find these networks trained on ImageNet, this dataset with 1,000 kinds of labels. Those you can use with a lot of different types of images, but it's often best to look and see if there is a dataset specific to your application, for example, medical imaging or images of buildings or cars and that sort of stuff.

The idea is you chop off the last few layers of this network, and you can just download the weights off GitHub and so on. That gets you these mid to high-level features that we were talking about. Then, using a few layers of your own, you can repurpose these mid to high-level features for your own uses and get better output. This is a really good way to be able to train really good networks on images, or on words as well, if you've heard what Jay's taught before, with relatively little compute.
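In outline, the recipe is: keep the downloaded layers fixed, and train only a small new piece on top. The sketch below shows that structure with NumPy. `PRETRAINED_W` is a made-up stand-in for downloaded weights, the data is invented, and the new head is a single linear layer, so treat all of these as assumptions rather than a real pretrained model.

```python
import numpy as np

# Stand-in for weights downloaded from someone else's trained network.
# These values are made up for illustration.
PRETRAINED_W = np.array([[0.5, -0.2, 0.1],
                         [0.3, 0.8, -0.5]])

def frozen_features(X):
    """The reused layers, with the original output layer removed. Never updated."""
    return np.maximum(0.0, X @ PRETRAINED_W)  # linear layer followed by ReLU

def train_new_head(X, y, steps=200):
    """Train only a new linear head on top of the frozen features."""
    F = frozen_features(X)
    head = np.zeros(F.shape[1])
    # Conservative step size derived from the feature scale, so the loss cannot increase.
    lr = 0.5 * len(y) / np.sum(F ** 2)
    losses = []
    for _ in range(steps):
        residual = F @ head - y
        losses.append(float(np.mean(residual ** 2)))
        head -= lr * (2.0 / len(y)) * (F.T @ residual)
    return head, losses

# A small made-up dataset for the new task.
X_small = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_small = np.array([1.0, 2.0, 3.0, 4.0])

head, losses = train_new_head(X_small, y_small)
print(f"loss before: {losses[0]:.3f}, loss after: {losses[-1]:.3f}")
```

Only `head` is ever updated. The reused weights stay exactly as they were downloaded, which is what keeps the compute cost low.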

Again, just to give some idea, the state-of-the-art networks in 2014 took weeks to train on a couple of GPUs. Now, depending on how much compute you have, there are papers on how you can train on ImageNet in, I don't know, a few minutes or 15 minutes or something. But then you probably use at least hundreds of GPUs there. It does take a lot of compute, which is why transfer learning is really helpful.

And the final one is one I really like, and this one is actually from Karen Simonyan, who was one of the co-creators of the state-of-the-art network in 2014, VGGNet. If you look at this architecture, it looks a bit complicated at first, but when you look at it closely, you can see it's really quite elegant, because you have these modules one after another that comprise convolutional layers followed by a pooling layer, which reduces the dimension of your image by a factor of two. You can see that in each block all the conv layers have the same number of hidden units, and in successive modules, the number of units in each convolutional layer is multiplied by a factor of two. So the benefit of this, apart from it being elegant, is that you can reduce the number of parameters that you have to choose. And importantly, you are spending less time on parameters that matter less, and more time on parameters that matter more. The exact number of units in each layer is not a parameter that matters very much, nor is the exact depth.
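That halve-and-double rule is simple enough to write down as a small generator of layer settings. The sketch below is an illustration of the pattern rather than the exact published VGG configuration; the function name, the dictionary keys, and the optional `max_channels` cap are assumptions for this example.

```python
def vgg_style_blocks(input_size, base_channels, num_blocks,
                     convs_per_block=2, max_channels=None):
    """Describe blocks of conv layers, each followed by a pooling layer.

    Pooling halves the image size; the channel count doubles from block to block.
    """
    blocks = []
    size, channels = input_size, base_channels
    for _ in range(num_blocks):
        size //= 2  # pooling layer at the end of the block
        blocks.append({"convs": convs_per_block,
                       "channels": channels,
                       "output_size": size})
        channels *= 2  # next block uses twice as many channels
        if max_channels is not None:
            channels = min(channels, max_channels)
    return blocks

layout = vgg_style_blocks(224, 64, 5, max_channels=512)
for block in layout:
    print(block)
```

The whole architecture is then described by a handful of numbers (input size, starting width, number of blocks) instead of a separate choice for every layer.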

With that in mind, I think another really cool thing, reflecting off the back of this, is that a lot of the ideas that we've talked about today, things like the linear and nonlinear layers, backprop, and learning rates, are sort of simple. Things like Batchnorm and skip connections are all ideas that are quite simple. It's really cool how really simple ideas have often been shown to be the most effective in machine learning.

Today we've gone through a single dense layer, we've put these layers together, we've learned how to train these models, and we've gone to the state of the art in terms of depth and talked about more practical tricks. Just going back to the thing at the start, I really hope that you can take the insight from the talk today and apply it not only to deep learning, but also to your general understanding of models and how you approach them, so you can use that insight to design, debug, and deploy better models and avoid the mistakes that I made. Thank you.




Recorded at:

May 14, 2019