
Reinforcement Learning: Not Just for Robots and Games



Jibin Liu presents one of his projects at eBay, where the team used RL to improve the crawling of targeted web pages, starting from the basics of RL and moving on to why and how to use it to power web crawling.


Jibin is a former Software Engineer at eBay, where he worked on using Reinforcement Learning to improve the efficiency of web crawling. Prior to eBay, he worked at Esri, a pioneer in geographic information system, mapping, and spatial data analytics technology. Before that, he was an Environmental Consultant at AKRF, Inc. in NYC.

About the conference: a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.


Liu: You probably know about all these applications of reinforcement learning and deep reinforcement learning. On the top left is the famous AlphaGo from DeepMind, which beat top human players at the game of Go. On the bottom left, also from DeepMind, is AlphaStar; if you play StarCraft II, you know it is beating the top players in that game as well. On the top right, researchers are using deep reinforcement learning to teach robotic arms to reach a target location, which you can imagine being applied in manufacturing or other industrial settings. On the bottom right, OpenAI Five is playing Dota 2; if you read the news about OpenAI Five, just last Friday it played a match against OG, the Dota 2 champion of The International 2018, and the agent won. Those are the cutting edge of reinforcement learning and deep reinforcement learning in the real world right now.

In this talk, I want to show that besides video games and robotics, RL can also be applied to many other fields, and one of them is web crawling. If you have never done web crawling before, it is basically a way to visit and collect information from the web. We can imagine a website as a graph or a tree; traditional web crawling is usually done by using BFS or DFS to do an exhaustive search. This does the work, however, it wastes time and bandwidth in the case where you only want certain kinds of pages from a website.

In this slide, if we only want to collect information from the blue pages, we waste resources if we crawl the ones in the red box, so we are trying to avoid those. The spider says, "I want to be smarter," so how can we do that? Here is where reinforcement learning can come to help, because this problem involves decision making and also reward signals.

Coming back to the outline of this talk, we are going to go through a quick introduction to reinforcement learning, then move into how to apply it to web crawling, and lastly, by comparing it with other machine learning techniques, we shall see what kinds of problems reinforcement learning can help with.

Reinforcement Learning Demystified

In the big picture, as Mike mentioned before, reinforcement learning is a different concept from deep learning. Reinforcement learning is usually defined as one of the three major categories of machine learning, together with the other two, supervised learning and unsupervised learning. Deep learning is a technique applied on top of those categories: you can apply deep learning to supervised learning, as in image classification with neural networks, or you can apply deep learning and neural networks to reinforcement learning, which becomes deep reinforcement learning.

One of the best examples to introduce reinforcement learning is dog training. Say you have a new puppy at home and you are trying to teach her some commands. When she hears the "sit" command for the first time, she probably won't understand what it means, and will make some random movements. Eventually, she sits down and you reward her with food or just a pat. The more she practices, the more accurately she will act on the commands. This is exactly what we are doing in reinforcement learning as well.

On the left, in the dog training example, the dog receives signals from voice, gestures, or from noticing the owner's emotions, and her brain tries to interpret the input and pick a certain action as a reaction; if she gets it right, she receives a treat. In the reinforcement learning world, the dog is what we call the agent: the one that learns how to interact with the environment. The state is a representation of the environment that the agent sees. In the AlphaGo example, the state would be the Go board and all the pieces: where they are and how many there are. In the robotic arm case, you can imagine the state as the torque and angle of each joint, as well as the target location: where it is and how far the arm is from reaching it. That is the information the agent sees from the environment; those are called states.

Besides the basic terms of agent, environment, and state, there are four main sub-elements in a reinforcement learning system. The first one, which is very important, is the policy. The policy is just a mapping from state to action; you can think of it as a factory that takes a state as input and outputs actions. The mapping can be as simple as a linear function or a lookup table, or it can be very complex, such as the neural networks used in AlphaGo. The second one is the reward signal, which defines the goal of the reinforcement learning problem. As humans, we can think of the reward signal as something like the experience of pleasure or pain: when you do something and feel pleasure, you might want to do it over and over again, but if something causes you pain, you will probably try to avoid it.

The reward signal is the primary basis for altering the policy, that is, for how we can perform better; we shall soon see how. The value function evaluates how good a state or action is in the long term: compared with the reward signal, which defines how good it is in the immediate sense, the value defines how good it is in the long term. If you make an investment today, depositing some money in the bank or in stocks, the reward you get today might be low, because you take money out of your pocket and put it into something else. However, the value of that action can be huge if you make a good investment. That is the difference between the reward and the value.

The last sub-element is the model, which is something that mimics the behavior of the environment and can tell the agent how the environment transitions from one state to another. I put the word "optional" there because a model is not necessary to solve a reinforcement learning problem. For example, the early versions of AlphaGo were trained model-free; bringing the model in increased the training speed, but it was not necessary to solve the problem. In summary, the reinforcement learning system goes like this: at time t, the agent observes the environment and receives a state, St. Using the value function, the agent chooses the action that maximizes the value it might get, denoted At. It then receives a reward, Rt, which the agent can use to update its policy, so that the next time it is in a similar situation, it knows how to pick the action that maximizes the value. At the same time, the agent also receives the next state, St+1, upon which it will take its next action, and the cycle goes on. We call this the SARSA sequence (state, action, reward, state, action): the experience that the agent learns from.
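That observe-act-reward cycle can be sketched in a few lines of code. This is a minimal illustration, not code from the project; `env` and `agent` are hypothetical objects standing in for any concrete environment and policy.

```python
# Minimal sketch of the agent-environment loop described above.
# `env` and `agent` are illustrative stand-ins for any implementation.

def run_episode(env, agent):
    state = env.reset()  # initial state S_t
    total_reward = 0
    done = False
    while not done:
        action = agent.act(state)  # pick A_t given S_t
        next_state, reward, done = env.step(action)  # receive R_t and S_t+1
        agent.update(state, action, reward, next_state)  # learn from (S, A, R, S')
        state = next_state
        total_reward += reward
    return total_reward
```

Any RL method discussed in this talk, Q-learning included, plugs its own `act` and `update` logic into this same loop.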

There are many ways to solve a reinforcement learning problem. For the sake of this talk, we will focus on Q-learning, which is one of the most famous methods. The previous slide mentioned that the value function can estimate either state values or action values; the action value is the one used in Q-learning. What the action value tells the agent is basically how good it is, when the agent is at a given state, to take a particular action; this is denoted Q(s, a). This is also what the Q in Q-learning means: Q is the action value.

We also learned about the experience sequence, SARSA; how can we utilize it to update our policy? We can use the equation shown here in the middle. The new Q value is calculated from two parts: the old value and the learned value. The learned value is calculated from the immediate reward and the estimated maximum value of the next state. That future value is multiplied by a discount factor. The discount factor gamma is a value between zero and one which is used to balance short-term and long-term views on returns. When the value is closer to one, the agent takes more consideration of future, long-term rewards, while when it is closer to zero, the agent is more short-term focused. Depending on the problem you are trying to solve, you might want the agent to look further ahead or nearer.

The learned value is multiplied by a learning rate, which is the same idea as in deep learning, where the learning rate is introduced so that the agent won't overshoot the optimal solution. I know that code tells everything, so let's look at the code. It is very simple: Q here is just a mapping, a lookup table from state to actions. When we try to get the max Q, we take all the actions out of the mapping and get the maximum of their action values. The next step is to calculate the future return using the formula I showed earlier, and the last step is to update the action value using the current value plus the learned value scaled by the learning rate.
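The steps just described can be reconstructed roughly as follows. This is a sketch based on the description above, not the slide's exact code; the table shape and constant values are illustrative.

```python
# Sketch of the Q-learning update described above.
# Q is a nested lookup table: Q[state][action] -> action value.

ALPHA = 0.1  # learning rate (illustrative value)
GAMMA = 0.9  # discount factor (illustrative value)

def max_q(Q, state):
    """Maximum action value available from `state` (0 if unseen)."""
    actions = Q.get(state, {})
    return max(actions.values()) if actions else 0.0

def update_q(Q, state, action, reward, next_state):
    """Move Q(s, a) toward the learned value by the learning rate."""
    old = Q.setdefault(state, {}).get(action, 0.0)
    learned = reward + GAMMA * max_q(Q, next_state)  # immediate + discounted future
    Q[state][action] = old + ALPHA * (learned - old)
```

For example, updating an unseen state with a reward of 10 and an empty next state moves its action value from 0 to 1.0 with these constants.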

We have gone through how to update the Q value; now, how are we going to pick an action? We use the so-called epsilon-greedy solution: we introduce another parameter called epsilon, which ranges from zero to one. The smaller the value, the greater the probability that the agent will choose the action that maximizes the return; the larger the value, the greater the probability that the agent will randomly select an action out of all the choices. You may wonder why we need this, and why we can't just choose the maximum one all the time. This is the trade-off between exploration and exploitation.

A simple example in real life would be choosing which restaurant to go to. You can keep going to the one you know best and keep tasting the dishes there, but in that case, you may lose the chance to experience something better at a new place. The epsilon-greedy solution resolves this dilemma by introducing some randomness into the search for actions, so that the agent sometimes collects all the rewards along the same path, going deeper and deeper, and at other times looks around at the possibilities it hasn't tried yet: "I'm going to try this one and see how it works out."

I also have a code example for that one. Q is just a mapping from state to actions, so every time we try to pick an action, we first draw a random sample from 0 to 1 and compare it with epsilon. If it is greater, we take the action that maximizes the values; otherwise, we take a random choice out of all of them.
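A minimal sketch of that selection logic, assuming the same nested Q lookup table as before (names and the epsilon value are illustrative, not the slide's exact code):

```python
import random

EPSILON = 0.1  # exploration rate (illustrative value)

def pick_action(Q, state, actions):
    """Epsilon-greedy choice among `actions` at `state`."""
    if random.random() > EPSILON:
        # Exploit: the action with the maximum value (unseen actions default to 0).
        values = Q.get(state, {})
        return max(actions, key=lambda a: values.get(a, 0.0))
    # Explore: a uniformly random action.
    return random.choice(actions)
```

Over many picks, the agent mostly exploits the best-known action and occasionally explores an alternative.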

RL Concepts on Web Crawling

Up to this point, we have enough information about reinforcement learning: how we update the values, how the agent learns from the rewards, and how it takes actions given specific states. Let's see how this can be applied to web crawling. This is the same graph as before, but now we have the agent as a smart spider. First, we need to define what the basic reinforcement learning terms mean in web crawling. We can define the state as a given web page, including information about the URL and all of its links; the links are just like the actions, because the main focus in web crawling is which link to pick next so as to get more reward out of it. For each of those links, we also want to extract its attributes. Why do we need those? They are what provide the context about the link. Imagine that all you can see of a site is raw URLs: no graphics, no attributes, no colors at all. How would you navigate the website? The attributes help the agent navigate the website and determine what value can be calculated for a particular link.

I talked about the action, which is the link the smart spider chooses to visit next. The reward is also simple, since we are crawling for target pages: for the pages we are looking for, we give the smart spider a reward of positive 100; for every other page, it gets negative 10. Those numbers were just picked by experiment; between small scales and large scales, these are the ones that worked better in practice. So that we have a better idea about these terms, I'm going to show an example.

I'm going to use the QCon AI website as an example; we are going to crawl the QCon AI site today. Each page, for example this home page, is a state. All the hyperlinks we see here are the actions the smart spider can pick; it will think about which link it wants to follow next. The attributes associated with each link are things like the class or title on the link element, which are often obvious to a human as well. I would imagine that, after the smart spider has been trained, an attribute like this one would have a larger score, because it leads to the tracks. The target in this exercise is to crawl all the presentation pages. We don't know where they are, but we want the smart spider to find them; we just define the mechanism in the reward to tell the smart spider whether it got what we want or not.

If you get what we want, you get 100; otherwise, you get negative 10. On the home page it gets negative 10, and the smart spider might say, "OK, I want to go to this one. This seems promising, something might turn up," and then it goes there, or it might go to another one because it sees there is a session-card or session-title attribute on the link. Even though it receives a negative 10 reward, it might choose to follow that link, and finally, landing on a presentation page, it gets a positive 100 reward. It can then use those rewards to update the action values, so that next time, when the smart spider sees the same or similar attributes, it knows, "OK, I tried that one before and it worked really well, I might go there."

At this point it is obvious that the action value can be defined as: at a given web page, what is the value of visiting a particular link A, B, or C? That is what the smart spider needs to determine. From there, we can define the first value function, which I write as QL: the value of a given link on a specific web page. However, we cannot get this value just by looking at the URL itself; we have to look at all the attributes. So instead we use another value function, written QA, which tells us the value of a given attribute at a specific level. When you think of the website as a giant tree or graph, you can traverse it in level order; think about it that way.

This may look weird to you: why do we want to separate the attributes by level? The reasoning is that a particular attribute may have one value near the home page, the starting point, and a smaller or larger value closer to the target pages, and we want to differentiate between those two cases. That is why we introduce the level as a dummy state, so that we can differentiate between attributes at different depths. Also, considering that most websites are well structured by level, this actually works pretty well in practice. All this information, including the states, actions, rewards, and the QA values, is stored in a SQL database as a lookup table.

Bearing that in mind, we can use Q-learning to update the Q values and then pick actions. First, how are we going to update the action values? Imagine the case where, at level L, we have a page P and it receives a reward. We first fetch all the attributes of the links that lead to page P, then update their values using the Q-learning equation I showed before. How we pick the next link is also obvious at this point: we get the current level, so we know the possible attributes out there that might have different scores; then we retrieve the QA value for that level and each attribute, and we calculate QL based on QA. A link might have a hundred attributes, each with a different score. One way to combine them is to take the maximum value; another is to take the average of all of them. It really depends on which works better; in our experiments, we took the maximum value associated with each link.
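Scoring links from attribute values might look like the following sketch. The (level, attribute) table shape and the function names are assumptions based on the description above, not the project's actual code; it uses the max-over-attributes variant the talk settled on.

```python
# Sketch of scoring links by their attributes, as described above.
# QA maps (level, attribute) -> learned attribute value.

def link_value(QA, level, link_attributes):
    """Q_L of a link: the maximum Q_A over its attributes at this level."""
    return max((QA.get((level, attr), 0.0) for attr in link_attributes),
               default=0.0)

def best_link(QA, level, links):
    """`links` maps URL -> list of attributes; returns the highest-valued URL."""
    return max(links, key=lambda url: link_value(QA, level, links[url]))
```

Swapping `max` for a mean inside `link_value` gives the averaging variant mentioned above.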

Now that we have the QL values, we use epsilon-greedy to pick the next action. As for results, there are two metrics we looked at to compare the smart spider with the traditional spider. The first is the metric you probably look at for every reinforcement learning model: the cumulative reward. In the smart spider's case, it increases and then maintains that rate of increase after 500 to 1,000 episodes. Before that, the cumulative reward might even dip below zero, because the spider is trying out the possible actions and collecting all the negative 10 rewards out there. However, as soon as it finds one of the positive 100 pages, we see the trend where the cumulative reward turns positive and keeps climbing at that pace.

In the traditional spider's case, we didn't see the cumulative reward increase; it just fluctuated around some value, around zero or the tens, depending on how sparsely the target pages are spread through the website. The other metric we looked at is the ratio of the number of target pages to the number of total downloads. Why introduce this one? Because it is the fundamental value we want from the smart spider: the fewer downloads it makes and the more target pages it crawls, the more resources, time, and bandwidth it saves, so the larger this value, the better the system performs. For the smart spider, we saw values in the range of 0.2 to 0.4, depending on the site. That means that out of 10 downloads, two to four are target pages. In comparison, for the traditional spider the value is usually below 0.1, which means you have to make 10 downloads to get one target page.

Engineering Overview

For folks who are interested in the engineering part, this is the basic setup as microservices, as we put it into production. We used a custom framework written in Python; we didn't use TensorFlow or PyTorch because there were no neural networks involved yet. The application is all Docker containers with REST endpoints for intercommunication. The user, a human being, puts all the information into the smart spider, which is the communicator between the two: we tell the smart spider which websites we want to crawl and how to define the rewards. From the smart spider to the extraction service, the smart spider passes the link it picks (the action) to extraction; extraction does the download, comes back with all the information, and returns the state and the reward to the smart spider. These pieces are all within the cycle shown in the graph before, so the smart spider gets all the information and can update the values for each website. The states and the attribute Q values are stored and updated in a SQLite database.

What Are the Problems RL Can Help With

Lastly, I want to compare reinforcement learning with two other machine learning techniques and see what kinds of problems reinforcement learning can help with. There are many ways to compare them, but here let's look at training and usage. On the training side, we all know that supervised learning needs labeled data and direct feedback; the direct feedback comes straight from the labeled data, since we have to know exactly which image belongs to which category. For unsupervised learning, examples are k-means and other clustering techniques, and GANs also fall under unsupervised learning. There, we don't need labels, but there is also no feedback; it is up to the model itself to figure out how to split or divide the data.

In reinforcement learning, training is done by trial and error. As we saw in the examples, whether AlphaGo or web crawling, at first the agent just tries out some actions, sees what kind of reward it gets, and slowly learns: "OK, I did that right, I did this correctly, I get 100. I did that wrong, I get negative 10." Slowly the agent learns what the action space looks like and what the action values look like. There are reward signals in reinforcement learning; in supervised learning there is direct feedback that says exactly what you are looking for, the one right answer. In reinforcement learning, the signal just tells the agent how good an action was; there is no accurate or inaccurate, just good or bad.

On the usage side, supervised learning is good at getting the best answer: in classification, the model predicts which category an image belongs to, and it is used for many other kinds of prediction as well. Unsupervised learning's strength is finding hidden structure in the data, such as how to split it in the clustering case. In reinforcement learning, the agent learns how to take actions, and actions don't have to be episodic or happen just once; the agent can keep acting continuously. For example, you could train an agent to predict the stock market, which is like a continuous prediction.

There is also delayed reward in reinforcement learning. The action the agent takes right now affects not only the next state but potentially everything else down the path it follows. So reinforcement learning can be applied to problems where you see short-term versus long-term tradeoffs. In the Dota 2 games, the agent or hero can't just take the most aggressive action all the time, otherwise it will die most of the time; it has to follow some strategy: ganking, protecting the towers, hitting the smaller creeps to gain experience and gold, so that it can win the game in the long term.

If that balance between short term and long term exists in your problem, you might give reinforcement learning a try. Those are the characteristics of RL compared to other machine learning techniques. Application of reinforcement learning and deep reinforcement learning is still really rare, especially in production environments, compared to other machine learning techniques. I hope this talk can help you come up with ideas for applying reinforcement learning in different fields.

Learning Resources

What I appreciate the most in the AI and machine learning community is that everyone is very open to sharing, from tutorials to talks to papers. If you are interested in learning more about reinforcement learning, here are the resources that I personally used when I started out, and that really helped. The book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto is the Bible of the field; if you want to know the ins and outs, this is definitely the book to read. The lectures below by David Silver, who was the lead on AlphaGo and co-lead on the AlphaStar project, use the book as their textbook and are an awesome starting point as well.

If you want to get your hands dirty, the environments I tried myself and would recommend are, first, OpenAI Gym, which has a lot of classical reinforcement learning and deep reinforcement learning problems you can try. The Unity Machine Learning Agents Toolkit is a 3D environment that also involves a lot of multi-agent collaboration (playing soccer, playing tennis, and so on), so that's more advanced; if you want to try it out, you can do that too. There are also many other resources out there, and I wish you a great journey into reinforcement learning if you are interested.

Questions and Answers

Participant 1: Looking at the Q-equation, with the learning rate and the epsilon-greedy rate, I noticed in your metrics you did not actually mention what those values were or how they varied. How did they affect your final results? Different learning rates are going to give you different metrics at the end of the day.

Liu: You're wondering how the action values were applied to the actions?

Participant 1: No. You said you had a download ratio between 0.2 and 0.4, so if you change your learning rate, you're going to change how quickly or slowly you get there. What was the actual point where you got the best learning for the spider, the best download ratio?

Liu: For the learning rate, we experimented between 0.1 and 0.5; that is the range we tried, and the best results we saw were in the range of 0.1 to 0.2. We don't want to make it too large, because that would introduce a lot of noise: the larger the learning rate, the more noise, or the more new information, it brings to the agent. Especially if the website is very large or there are too many attributes associated with each link, the agent cannot really learn anything at all, so you want to keep the learning rate small and go really slowly.

Participant 1: Did you try reducing the learning rate as you kept going, or did you start with exploration and stick with it?

Liu: The question is whether we decrease the learning rate over time. For now, the project just maintains the same learning rate the whole time. You can use a decaying learning rate as well, and the same goes for epsilon-greedy: especially in deep reinforcement learning, people usually use a decayed epsilon. The reason is that, at the beginning, you want the agent to try out all the possible actions, because bad actions have less impact early in learning. Later on, when it has a better idea of the action space, you slow it down, lowering the value of epsilon so that the agent more often tries the specific actions with maximum value.
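A common decayed-epsilon schedule looks like the sketch below. This is an illustration of the general technique, not the talk's setup (which kept epsilon fixed); the start, floor, and decay constants are arbitrary.

```python
# Illustrative exponential epsilon decay: start fully exploratory
# and anneal toward a small floor as the agent gains experience.

EPS_START = 1.0    # explore everything at first
EPS_END = 0.05     # never stop exploring entirely
EPS_DECAY = 0.995  # per-episode multiplicative decay

def epsilon_at(episode):
    """Exploration rate to use for the given episode number."""
    return max(EPS_END, EPS_START * (EPS_DECAY ** episode))
```

Early episodes then pick mostly random actions, while later ones are almost always greedy, matching the intuition described above.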

Participant 2: I've got a related question about how you choose these hyperparameters. One of them you mentioned is the carrot and the stick: a reward of 100 and a penalty of minus 10. Where do 100 and minus 10 come from? In general, how does one choose those numbers?

Liu: Those numbers are really just from experiment. We wanted to see how fast the accumulated reward increased under different reward schemes. For this problem we also tried positive 10 and negative 1, and positive 1 and negative 1; it really comes down to the scale between the positive and negative numbers. Here I showed the reward as a pure number; however, when you are training a robotic arm, or training a robot to walk on a stage, the reward can actually be a function you calculate. It is not a static number but can be computed over time; it really depends on how complex the problem is.

Participant 3: I'm curious whether you tried to formulate this as a supervised learning problem, and did you have any baseline comparisons? Also, from your experience training supervised learning systems versus this RL system, what are the tradeoffs in performance and complexity? I just want to hear your thoughts.

Liu: For now we haven't done a comparison with supervised learning on this problem, because we were trying to solve a problem where we have targeted web pages and want to save time and money by not crawling the others; each link we crawl basically costs the company some money. That is what we were trying to save, and that is where we thought of using reinforcement learning. We haven't reached the point of comparing with supervised learning yet, but that's a good point.

Participant 4: I know you mentioned that you used hand-labeled data to train this model. How much data did you need, and how long did all that labeling take?

Liu: The only label we provide is the reward. For the reward, we have both automatic and manual work; it's probably about 60/40 on each side. It's a selector on the page, like an XPath, that the extraction is specifically looking for. If it finds that one, it returns a positive reward; otherwise, it returns a negative one. What makes the whole system somewhat different from other reinforcement learning models is that the smart spider is learning and doing the work at the same time. The system starts fresh: all the action values are initialized at zero, so it tries all the links, and it updates those values during the same period as it is collecting all the targeted pages. It's learning by doing; the training and the application happen at the same time.

Participant 5: How do you manually define the reward function in that case? What does that mean?

Liu: For example, we can use this attribute or selector as the reward signal for the smart spider. We extract all the information, the links and attributes, and as soon as the extraction sees the selector I just showed you in the response, it says, "OK, you get a reward of 100 because I saw the selector on this page"; otherwise, it returns negative 10. That is the only thing we do, and if you ask how we pick a particular attribute or selector on a page, we use both automatic and manual work.
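A selector-based reward check of this kind might be sketched as below. The selector, the +100/-10 values from the talk, and the use of the standard library's limited XPath support are all illustrative; the actual system's selector and parser are not shown in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical selector marking a target (presentation) page.
TARGET_SELECTOR = ".//div[@class='presentation-body']"

def reward_from_page(page_xml):
    """Return +100 if the page contains the target element, else -10."""
    root = ET.fromstring(page_xml)  # assumes well-formed markup
    found = root.find(TARGET_SELECTOR) is not None
    return 100 if found else -10
```

In practice an HTML-tolerant parser (e.g. lxml) would replace `ET.fromstring`, since real pages are rarely well-formed XML.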

Participant 6: You cannot reuse smart spiders across domains, so if you want to crawl Kijiji or Craigslist, those will basically be two different smart spiders?

Liu: Yes. For each website we train a different smart spider, but we can reuse a smart spider on the same website when new pages come out, because all the Q values are saved in the database and can be retrieved the next time; it doesn't have to start learning from scratch.

Participant 6: Did you have any plans at the time for ways to reuse them, using NLP or something based on what the HTML looks like?

Liu: We have ideas about whether we could introduce NLP, or use the images shown on a page, as something the spiders could share. It really depends on how we split or group the different domains or websites; some of them could share, but some probably not.

Participant 7: In one of your code examples you have the state in the Q-table. Assuming the state has to be numeric, how do you take those fields, the attributes and everything, and turn them into a numeric representation?

Liu: Here we define the QA numbers; the QA number is the actual value stored in the database, and that value is calculated with the Q-learning equation based on the rewards. We store all those values in the table, and when we are updating it or picking a link, we just fetch the corresponding rows and get all the values out. In the equation shown here, where we fetch the action values, you can think of the state as a tuple of numbers. For robotic arms, you might have four joints, and each joint has a torque and an angle; all those numbers ordered in a tuple can serve as your state, and you can do the lookups in the lookup table with that.




Recorded at:

Jun 11, 2019