In this episode of the InfoQ podcast, Dr. Phil Winder, CEO of Winder Research, sits down with InfoQ podcast co-host Charles Humble. They discuss: the history of Reinforcement Learning (RL); the application of RL in fields such as robotics and content discovery; scaling RL models and running them in production; and ethical considerations for RL.
Key Takeaways
- Reinforcement Learning came out of experiments around how animals learn. Most intelligent animals learn through reinforcement, the idea of providing positive or negative feedback to actions that were performed in the past. More intelligent animals are able to learn more complex sets of actions that lead to higher order behavior.
- Unlike typically myopic Machine Learning approaches, Reinforcement Learning aims to optimize and solve problems based upon sequences of decisions over time.
- A fundamental mathematical concept for Reinforcement Learning is the Markov Decision Process developed by Richard Bellman in the 1950s. It comprises an agent, an environment and a reward. Rewards are ideally simple, easy to understand, and mapped directly to the problem you're trying to solve.
- One area that has seen relatively wide industry adoption of RL is robotics. There has also been success with recommendation systems for content discovery. However, it remains challenging to operate Reinforcement Learning models at scale, at least in part because RL models are inherently mutable.
- The ethical considerations for RL are similar to those for other Machine Learning approaches, but inherently more complex given the nature of the model. Observability/auditability is key. There are also approaches such as safe RL, where the algorithms are trained to exist within a constrained and confined set of states.
Transcript
01:52 Introductions
01:52 Charles Humble: My guest on the InfoQ Podcast today is Dr. Phil Winder. Phil is a software engineer and a data scientist. He is the CEO of Winder Research, which is a cloud native data science consultancy, where he helps startups and enterprises improve their data science processes, platforms, and products. Phil specializes in implementing production-grade cloud native machine learning, and was an early champion of the MLOps movement.
02:18 Charles Humble: More recently he's authored a book on reinforcement learning for O'Reilly, and reinforcement learning is the main thing that I want to focus on in our conversation today. Phil, welcome to the InfoQ Podcast.
02:30 Phil Winder: Thank you very much for having me.
02:32 Charles Humble: And congratulations on the book which I think is fantastic.
02:34 Phil Winder: Thank you.
02:35 Can you talk about the history of Reinforcement Learning?
02:35 Charles Humble: One of the things I really like about the book is that it has a lot of very practical, real-world examples, whereas so many of the examples of reinforcement learning you can find on the internet are rather contrived; I think this is partly a side effect of the relatively low levels of adoption in industry. I'm also fascinated by the history of reinforcement learning and how it relates to studies of how humans and animals learn, which is an area I have a layman's curiosity about. And so, I wondered if you would perhaps start by talking about that history.
03:05 Phil Winder: Yeah, definitely. Some of the earliest experiments into how we learn, which ultimately led to how machines learn, were started by scientists who were interested in psychology, or the field that later became psychology. The classic example is Pavlov's dogs, where he was attempting to train behavior and experimenting to find out how dogs learn. And that had a ripple effect through science in general and prompted a lot of other researchers to really ask the question, "How do we learn? How do we learn to ride a bike? What's involved? How do animals learn?"
03:37 Phil Winder: And so since then there's been a whole host of experiments with various animals, and they found that most animals tend to learn through reinforcement. It's through this idea of providing positive or negative feedback to actions that were performed in the past that intelligent animals are able to learn whether an action is good, and so worth doing again, or bad. And generally it seems that the more intelligent the animal, the longer the string of actions it can perform in response to that feedback.
04:06 Phil Winder: For example, I like talking about chickens a lot because I've got a couple of chickens in the back garden, and they're actually pretty smart birds. They can be trained to do tricks, but the tricks that they can do are quite low level, basically single-action type tricks. You can get them to pick up things or tap on certain colors or recognize letters and shapes and things like that, but it's a singular thing. More intelligent animals are able to learn more complex sets of actions that lead to higher order behavior.
04:33 What is it that distinguishes reinforcement learning from machine learning?
04:33 Charles Humble: What is it that distinguishes reinforcement learning from machine learning and how do they both fit into the broader discipline of data science?
04:43 Phil Winder: I consider machine learning to be part of the bigger field of data science. Data science is doing science with data, doing things with data. Machine learning is a part of that, and in turn I consider RL to be part of machine learning.
04:57 Charles Humble: And what's the reason for that?
04:59 Phil Winder: The reason for that is that RL actually depends on a lot of machine learning techniques and ideas and theory and in fact, the research that goes into ML can usually be directly applied to RL as well. But the major difference is the way in which these two methodologies make decisions. For ML or machine learning, they're always optimizing for a single decision. They're optimizing to make the one right decision at a single point in time, and that's it, it doesn't take into consideration any previous decisions, any future decisions. The word is myopic. So it's a very single point in time view of the world.
05:37 Phil Winder: Reinforcement learning on the other hand, it aims to optimize and solve problems based upon sequences of decisions, sequential decisions, through many decisions over time. So a decision that it makes right now actually depends on the decisions that we've made in the past, and also it may be making quite a strange decision in order to get to a better position in the future.
06:01 Could you describe the Markov Decision Process?
06:01 Charles Humble: Before we go too much further we should probably clarify some of the terminology that we're using. I wonder if you could describe the Markov Decision Process that Richard Bellman developed in the 1950s, which is central to a lot of what we're talking about and is one of the earliest algorithms, I guess, for reinforcement learning.
06:18 Phil Winder: That's right. It's a fundamental mathematical concept that defines the RL framework. It consists of two major entities, an environment and an agent. The environment is the world around you, the world around the agent, everything, all of the context that is required to provide information to the agent to make a decision. The environment can change, it can mutate and it is principally the agent that is mutating the state of that environment.
06:45 Phil Winder: The agent is the thing that is learning over time to make optimal decisions. So that could be some software, it could be a human being, it could be an animal. That's the thing that is attempting to use the information provided by the environment to make a better decision in the future. And it does this via a fairly simple feedback loop, simple but producing quite complex behavior, where you've got actions being generated by the agent that are fed into the environment.
07:14 Phil Winder: For example, if I'm riding a bike, an action could be turn left or turn right. The environment then changes, it's mutated, and that produces a new state, represented by an observation, which is sent back to the agent; this is the view of the world being sent back to the agent. Going back to riding the bike, the state, or the observation, would be what I can see, what I can feel, things like that.
07:40 Phil Winder: And then finally, there's another signal that comes from the environment called the reward. And this is the thing that tells the agent whether that action that was just made or the previous set of actions was a good one or a bad one. And so that's it, there's just three links, action, state, and reward.
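To make that action, state, and reward loop concrete, here is a minimal sketch in Python. The Environment and Agent classes and their toy dynamics are invented for illustration and do not come from any particular RL library.

```python
# Hypothetical sketch of the agent-environment feedback loop in an MDP.
# Environment and Agent are placeholder classes, not a specific library API.

class Environment:
    def reset(self):
        """Return the initial observation of the world."""
        return 0.0

    def step(self, action):
        """Apply the agent's action, mutate the state, and return
        the new observation plus a scalar reward."""
        observation = action * 0.9   # toy dynamics
        reward = -abs(observation)   # toy reward: stay near zero
        return observation, reward

class Agent:
    def act(self, observation):
        """Choose an action given the current observation."""
        return -observation          # toy policy

    def learn(self, observation, action, reward):
        """Update internal estimates from the feedback (omitted here)."""
        pass

env, agent = Environment(), Agent()
obs = env.reset()
for _ in range(10):                       # one short episode
    action = agent.act(obs)               # agent -> environment: action
    next_obs, reward = env.step(action)   # environment -> agent: state + reward
    agent.learn(obs, action, reward)
    obs = next_obs
```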
07:56 How do you design a reward in the context of a business problem?
07:56 Charles Humble: Just to try and unpick that a little bit further, how do you design a reward in the context of a business problem?
08:05 Phil Winder: Good question. Rewards are ideally simple, ideally easy to understand, and ideally mapped directly to the problem you're trying to solve. One of the key issues with machine learning models is that because they're only operating on single decisions, they tend to be optimized for technical metrics that only refer to that single decision. For example, you often see technical metrics like accuracy or precision or F1 score, something that is completely incomprehensible to the business. Reinforcement learning allows you to take a longer view, because you're not optimizing for that individual moment, you're optimizing for potentially all of time.
08:48 Phil Winder: And that means that you can introduce reward signals like: my real goal is to maximize the amount of money that someone spends on my e-commerce website. Or my real goal could be to increase engagement and increase the number of users on my platform. So you can link that goal directly back to the actions that are being made by the agent. Ideally the reward is mapped to the problem you're trying to solve, but it also has to be quite simple, because more often than not rewards can create very strange situations which may appear optimal to the agent but in fact end up being suboptimal in the long run.
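As a rough illustration of tying the reward to a business goal rather than a technical metric, here is a hypothetical session reward for an e-commerce agent; the data shape and field names are invented for the example.

```python
def session_reward(session):
    """Hypothetical reward for an e-commerce recommendation agent:
    the money the customer actually spent during the session,
    rather than a proxy metric such as click-through rate."""
    return sum(item["price"] for item in session["purchases"])

# Example: two purchases in the session give a reward of 42.5
print(session_reward({"purchases": [{"price": 30.0}, {"price": 12.5}]}))
```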
09:27 Can you give an example of where a reward ultimately is suboptimal for an agent?
09:27 Charles Humble: Can you give an example of where a reward ultimately is suboptimal for an agent?
09:32 Phil Winder: Imagine the situation where a robot is trying to navigate through a maze to get to a goal. Quite often in robotics tasks you use a distance-to-goal metric, where you're measuring the Euclidean distance between the robot and the goal, the as-the-crow-flies distance. And if you've got a dead end that's quite close to that goal, quite often the robot can get stuck in that dead end, because it is physically quite close to the goal but it's not at the goal, and it never learns to go back and try anything else because, as far as the robot is concerned, that's suboptimal.
10:05 Phil Winder: So you can get yourself basically trapped in these dead ends that are almost optimal but not quite optimal. And so the whole field of reward engineering is akin to feature engineering: there's a lot of work that needs to go into it to make sure that it makes sense, it's simple, and it fits the problem you're trying to solve.
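A sketch of the kind of distance-to-goal reward described here. The dead end becomes a local optimum because stepping back out of it temporarily makes this number worse; the coordinates are made up for the example.

```python
import math

def distance_reward(position, goal):
    """Negative Euclidean ('as the crow flies') distance to the goal.
    Higher is better, so a position physically close to the goal scores
    well even if a wall means the robot can never reach the goal from it."""
    return -math.dist(position, goal)

goal = (10.0, 10.0)
print(distance_reward((9.0, 10.0), goal))   # dead end next to the goal: -1.0
print(distance_reward((7.0, 10.0), goal))   # backing out to find a way round: -3.0
```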
10:22 How do you translate state into something that's usable by the policy?
10:22 Charles Humble: You've talked a little bit about the role that state plays and something that I was also curious about is how you translate state into something that's usable by the policy.
10:33 Phil Winder: Yeah, good question. Ideally what you're trying to do is learn a model of the actual, true internal state of the environment, but more often than not you can't actually view, see, or even comprehend the real state of the environment. For example, if we're working in the real world, actually outside of lockdown, outside of our homes, if you're actually outside and you're trying to train a robot to do something, then the environment is the world. That's not something that you can model.
11:06 Phil Winder: And so what you need to do is take a representation of that state. They call it an observation because, from the agent's point of view, it's just a single observation of the world. It could be one of many that it could have made, but it's got this particular observation. And how you take that and then map it internally, inside the agent, to make a decision is actually a multi-stage process.
11:30 Phil Winder: You could have models to do some feature engineering based upon the observational data that you've captured. You also need to design and train a model to actually predict and choose actions based upon the data that you have. So typically I recommend trying to split that process as much as you can because trying to do both at the same time actually makes it quite a computationally expensive thing to do.
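A schematic of the split described here: one stage turns the raw observation into features, and a separate stage maps those features to an action. Both functions are hypothetical placeholders rather than a recommended architecture.

```python
import numpy as np

def extract_features(raw_observation):
    """Stage 1 (hypothetical): compress a raw observation, e.g. camera
    pixels, into a small feature vector. In practice this might be a
    pretrained encoder or hand-crafted feature engineering."""
    pixels = np.asarray(raw_observation, dtype=float)
    return np.array([pixels.mean(), pixels.std()])

def policy(features):
    """Stage 2 (hypothetical): map the feature vector to an action.
    In practice this is the model the RL algorithm actually trains."""
    return 1 if features[0] > 0.5 else 0

raw_obs = np.random.rand(64, 64)            # stand-in for a camera frame
action = policy(extract_features(raw_obs))  # the two stages stay separable
```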
11:54 Why has Reinforcement Learning seen relatively low industry adoption?
11:54 Charles Humble: One of the things we mentioned at the top of the podcast is that reinforcement learning hasn't seen anything like the levels of, say, industry adoption, or indeed press attention, that some of the one-shot machine learning techniques have. There are exceptions, things like AlphaGo Zero, I guess, but in general that level of both press attention and adoption is relatively low. I wondered if you had any thoughts as to why that was.
12:18 Phil Winder: I think one of the main reasons is the fact that the market size for these techniques and these tools is just different, is smaller. So for example, software engineering has huge applicability and therefore it has a potentially large market. ML has somewhat large applicability, but it's certainly less than pure software, so the market size is smaller. Reinforcement learning, I don't think it's the same size as the ML market; I don't think there are as many problems for RL as there are for ML, therefore the market size is inherently smaller.
12:51 Phil Winder: But with that said, you're right, it certainly hasn't been exploited as much as it could be. And again, I think the reason for that is just time. Software engineering was invented, potentially, hundreds of years ago now; ML was created in the '30s, '40s, '50s. Reinforcement learning only really started to gain traction in about the '90s. So in the timeline of things RL is behind ML and software engineering. For ML it took maybe 60 years before industry really took it up, so you might think we have to wait for about another 30 years before RL is at the same level as ML, but no, no, no. I think it's just the media attention, and also attention from industry as a whole. Once that starts to ramp up, then I think you'll see an increase in adoption.
13:46 What is it about robotics that makes it a suitable candidate for reinforcement learning?
13:46 Charles Humble: One area that has seen relatively wide adoption is robotics which we touched on briefly when we were talking about the maze example and the challenge of having a suboptimal reward. What is it about robotics that makes it a suitable candidate for reinforcement learning?
14:03 Phil Winder: I think there was a bit of a snowball effect, really, I think that's the main reason. It's because they truly had a very difficult problem to solve, and that is: how can you tell a motor to move in a very complex sequence of movements in order to generate complex behavior? It's very easy to think of a simple rule to go from point A to point B, but as soon as the problems become more complex, like walk forward or make coffee or something like that, it suddenly becomes such a high-level, complex task that it becomes increasingly difficult, to the point where it's nearly impossible to hand-code a solution.
14:43 Phil Winder: So researchers began looking for ways that they could solve it, and there were two ways they attempted to. One was through modeling and one was through a data-driven approach. The first one was: see if I can build a model of the task that I'm trying to solve. And that works to an extent. You see kinematic models used in robotics; these are the things that are used to balance and tell a robot how to move its motors to move from one position to another. And that works great, but again, it only works to a point, the point where the task that you're trying to solve is simple enough to model theoretically. As soon as you introduce some level of complexity, like trying to interact with an external entity, it becomes practically impossible to model.
15:27 Phil Winder: Therefore they went back to the data-driven approach and thought, "Well, how can we actually learn how the robot should move? Can we use data? Can we use experience and experiments to learn how to teach a robot to move?" And that's what brought them to reinforcement learning because RL as a process is a process of attempting to learn optimal behaviors, optimal actions, over a period of experimentation time. So it was basically a perfect fit for the problem that they had at the time but it just turned out that actually that problem exists in many other domains as well, it's just maybe not quite as obvious.
16:02 What would be the simplest experiments we could devise that would enable a four-legged dog-like robot to walk?
16:02 Charles Humble: Hypothetically, if we had a four-legged robot, maybe a dog-like robot, one of those Boston Dynamics robots that I think many listeners would have seen footage of on the internet. If we had one of those and we wanted to use reinforcement learning as a way of training it how to walk, what would be the simplest experiments we could devise that would enable a four-legged dog-like robot to walk?
16:29 Phil Winder: The simplest set of experiments you could do would look like a process of trial and error. It would look like the robot is flailing around all over the place, trying to learn how to stand up, trying to learn how to move forward. And actually, this is one of the main reasons why, as far as I know, the robots that you see online are not using RL at such a low level; they still use classic techniques to actually move the robot. But I'm sure someone will correct me if I'm wrong. The main reason for not doing that is that when you do train the robot like that, it ends up learning some very odd ways of moving.
17:05 Phil Winder: For example, you can do a lot of simulations of this in a simulator. And when you look at some of the ways the agent has learned how to move, it's really, really fascinating. I think one of the classic examples is something called a 2D Cheetah, and it's done in 2D to make it simpler to compute. Basically it's a model of a dog shape but in 2D, just the profile of it. So you've got two legs and you've got a back and a head. And with many rudimentary RL algorithms it learns to do cartwheels; it learns to roll over its head and roll and roll and roll before it learns to walk. And that's just because of the environment it is brought up in and the objective the algorithm is being optimized for: you must move forward at all costs. It doesn't matter if it bangs its head on the floor and does a cartwheel.
17:53 Phil Winder: So in reality, a lot of the newer robots are attempting to, well, there are two approaches. One is to merge the model-driven approach and the RL approach. That means accepting that the kinematics and the model-driven approaches are actually pretty good at what they do, the simple walking movements, so you start with that and then learn more complex movements. That's one approach.
18:16 Phil Winder: The second approach is to attempt to decompose the problem into a series of steps. First we're going to learn how to move a leg, then we're going to learn how to walk forward. Then we're going to learn how to jump, then we're going to learn how to fetch a ball, and so on. Building up these skills over time tends to lead to slightly more, well, maybe not reliable or stable, but I think more expected behaviors. So we're kind of enforcing a curriculum on the agent so that it learns in a way that we expect, because otherwise, like I said with the simple reward example, you can end up with behaviors that were not expected.
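A sketch of what enforcing that curriculum might look like in code; the task names and the train_on function are invented for illustration, standing in for full RL training runs.

```python
# Hypothetical curriculum: train the same agent on progressively harder
# tasks, carrying the learned parameters forward each time.
curriculum = ["move_one_leg", "stand_up", "walk_forward", "jump", "fetch_ball"]

def train_on(task_name, params):
    """Placeholder for an RL training run on one task, starting from
    the parameters learned on the previous tasks."""
    print(f"training on {task_name} ...")
    return params  # a real system would return updated parameters

params = None  # start from scratch
for task in curriculum:
    params = train_on(task, params)
```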
18:56 Can you make a reward negative?
18:56 Charles Humble: Can you make a reward negative? So if it rolls over and bashes its head that hurts and so it doesn't do it again, which would be kind of how a baby learns to crawl and stand and things.
19:08 Phil Winder: Exactly. Yes, you can. You can add negative rewards, you can have positive rewards, you can have any value of reward that you like. It just has to indicate to the agent whether you want to reinforce that behavior or prevent that behavior from happening again. But the problem is that if you included that negative reward for never banging your head, then a robot or a child would never, ever attempt a handstand or a headstand; it would never attempt to lie down in bed. It would try and stand up in bed, it would never sleep, and so on. That negative reward could lead to unintended consequences, and you would probably end up having to add another thing and another thing and another thing.
19:51 Phil Winder: In fact, the general recommendation is to either keep it as simple as possible, if that's possible, so you don't get yourself into that wild goose chase, or to use some new and quite novel ways of learning rewards for your problem. For example, I saw one piece of research, going back to the robotics example, that was using motion capture on a real dog in order to learn a reward for acting more like a dog. And so that forced the robot to actually learn how to move like a dog, and not just move.
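A very rough sketch of the idea behind that motion-capture work: the reward mixes progress on the task with similarity to a reference animal pose, so the robot is pushed to move like a dog rather than just to move. The weighting, the similarity measure, and the pose representation here are all made up for the example.

```python
import math

def imitation_reward(forward_progress, robot_pose, reference_pose, w=0.7):
    """Hypothetical blended reward: part task progress, part similarity
    to a motion-captured reference pose (e.g. from a real dog).
    Poses are simplified to flat lists of joint angles."""
    pose_error = math.sqrt(sum((r - m) ** 2 for r, m in zip(robot_pose, reference_pose)))
    imitation_term = math.exp(-pose_error)      # 1.0 when the poses match exactly
    return (1 - w) * forward_progress + w * imitation_term

print(imitation_reward(0.3, [0.1, 0.5, -0.2], [0.1, 0.4, -0.2]))
```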
20:25 Can reinforcement learning be applied to the problem of content discovery?
20:25 Charles Humble: That's brilliant, I love that, that's absolutely ingenious, a fantastic approach. Something that I've spent quite a lot of time over the last two or three years thinking about and researching is the problem of content discovery. When all content is freely available on the internet, how do you enable people to find the content that they want? That's typically done through a recommendation engine, and the recommendation engine is typically using fairly simple principles to decide what content to show you. The standard thing that you optimize for, obviously depending a bit on your business goals, is click-through rate: how likely is this reader or this listener or this watcher to click on a given title or headline and go and consume that bit of content?
21:17 Charles Humble: And what you tend to find is that it's relatively easy to do that and relatively difficult to optimize for longer-term engagement. Does this piece of content satisfy the reader? Will he or she come back and consume more of our content? There are ways that you can do that but it isn't all that commonly done. I was curious as to whether you thought reinforcement learning might be something that could be applied effectively in that particular problem space.
21:43 Phil Winder: 100% yes, and a lot of researchers have spent quite a lot of time looking into this problem especially at some of the biggest companies in the world because their entire platforms depend on user retention, they make no money if the user leaves after the first click. So it's vitally important to their entire businesses and I'll give you an example in a second. But yeah, going back to the original question, the reason why reinforcement learning helps here is because you're avoiding that clickbait.
22:10 Phil Winder: These single-shot ML algorithms are always going to be optimized to the point where you're basically creating these clickbaity titles. And there's nothing that turns you off more than clicking on what you think might be an interesting article only to be pummeled with adverts and really horrible text, and it's just not a very insightful or content-rich article. So yeah, reinforcement learning can certainly help with that. It can learn to recommend articles that are optimal in terms of the long-term goal you have for that individual, whether that's spending money or retaining the readership or getting them to come back. Yeah, definitely.
22:47 Phil Winder: And the one example that comes to mind that I thought was really testament to the approach was when a group of researchers at Google attempted to build an RL-driven recommendations engine for YouTube. They did it, they tested it, and they actually put it into production for a short period of time for the research. And they found that the retention rate and the metrics that they were most interested in improving not only matched the performance of the current recommendations algorithm, they actually improved on it by several percentage points.
23:18 Phil Winder: I think the thing to take away from that is not only is RL useful in these problems, but that recommendations algorithm at YouTube is possibly one of the most highly-tuned recommendations algorithms on the planet. And out of the box this RL algorithm was able to surpass that in terms of the metrics that they're interested in with just one try. So I can only imagine that they probably started to use this in their production systems. I haven't seen any research to say that they have yet but it would be stupid not to.
23:50 When you have a working reinforcement learning model, how do you go about scaling it up?
23:50 Charles Humble: When you have a working reinforcement learning model, how do you go about scaling it up? Typically with a lot of conventional machine learning algorithms it's basically a horsepower problem: you throw more CPU or more grunt at it and you can improve training speed and you can improve performance. But I don't think that's necessarily true in the reinforcement learning case. So, how do you scale a model up once you've got a working system?
24:16 Phil Winder: So there are two parts to this problem; one has basically been solved, and the other hasn't. The first is the training side of the problem. The process that you go through to develop an RL algorithm includes a phase where you're attempting to train the feature extractors that you've designed, or maybe the actual model that lives inside the agent and makes the decisions. And it goes through a period of training. This is largely a one-shot process that is, depending on the algorithm and depending on the framework, relatively scalable. You basically spread it across machines and you can scale to pretty much whatever size you need in order to get that training time down to a reasonable amount. And that's all well and good, and the output of that is a snapshot of a model. It's like a single point in time that gives you a model for the thing that you're trying to do.
25:08 Phil Winder: The problem is then you need to try and get that into production, you need to productionize that thing so that real users can use it. And one of the outstanding problems in RL, I think, and it's largely because not many people are actually using it in production yet, is that it's inherently mutable. It's inherently learning over time to make better decisions. So the more data it gets, the more it can learn, the better job it can do. But that goes against every single law of software engineering ever: it has to be immutable, it has to be scalable and replicable and so on. And unfortunately that's just a lot harder for RL because it is an inherently mutable, learning, unstable thing. So the frameworks just don't really exist to handle that problem yet.
25:55 Phil Winder: And I've read quite a lot of research where people are trying to solve this. Some are solving it by storing the mutable part in something like a database, something that is actually meant to store state, and then having another component that is the runtime, the part that is actually immutable. Other methods attempt to have a snapshot of a model that doesn't change, which is what actually serves the users, and that feeds back to a shadow copy which is not serving users but can learn and does mutate over time. Other approaches store the results in a log and learn from the log itself. So yeah, there are lots of approaches, but I don't think there's one accepted way of doing RL in production yet.
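One way to picture the snapshot-plus-shadow-copy pattern mentioned above. Everything here is a hypothetical sketch rather than an existing framework: an immutable snapshot serves traffic and writes a log, a mutable shadow learner updates from that log, and a fresh snapshot is promoted when ready.

```python
class PolicySnapshot:
    """Frozen copy of the agent's policy that serves live traffic."""
    def __init__(self, weights):
        self.weights = dict(weights)

    def act(self, observation):
        return observation * self.weights.get("gain", 1.0)  # toy policy

class ShadowLearner:
    """Mutable copy that never serves users; it learns from the log of
    (observation, action, reward) records written by the serving path."""
    def __init__(self, weights):
        self.weights = dict(weights)

    def update_from_log(self, log):
        for obs, action, reward in log:
            self.weights["gain"] += 0.01 * reward  # stand-in for a real update
        return self.weights

serving = PolicySnapshot({"gain": 1.0})
shadow = ShadowLearner(serving.weights)

log = [(0.5, serving.act(0.5), 1.0)]          # live requests are logged
new_weights = shadow.update_from_log(log)     # shadow learns offline from the log
serving = PolicySnapshot(new_weights)         # promote a fresh immutable snapshot
```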
26:38 What are the ethical considerations for this approach?
26:38 Charles Humble: We are unfortunately coming towards the end of our time, and I just want to touch briefly on one other topic, which is the ethical considerations of this kind of work. You have a section towards the end of your book where you talk a little bit about ethics, and I just want us to pick that up a little bit. We were talking before about YouTube, and obviously YouTube is an example of a content distribution platform: basically anybody can upload anything they like and then the recommendation algorithm will suggest content.
27:11 Charles Humble: And there are now fairly well-known, well-documented and studied examples of some of the undesirable side effects, where people who are vulnerable to certain kinds of suggestions can get led down a particular path, whether that's towards radicalization of one kind or another, or perhaps towards conspiracy theories, or we might think about the sort of anti-vax material that's on YouTube in the context of the current pandemic. And I wonder, with a reinforcement learning system that's learning and mutating all the time, what you think the ethical considerations are there, because even more so than with other machine learning techniques it's really hard to audit and monitor and understand what it's optimizing for.
27:57 Phil Winder: Yeah. So it's a really important topic, and I don't think that I'm going to be able to provide an easy answer, just as ML hasn't been able to provide a single easy answer to solve all its problems; the problem still exists in ML. And what we're doing with RL is making the problem exponentially more difficult, because looking at one decision at one point in time may not seem like a bad thing, but in the context of a history of lots of decisions you can maybe see how it's leading down a certain path, for example.
28:25 Phil Winder: And to just talk technically for a moment, there are a couple of key approaches. One is the observability and the understandability of algorithms. Just like ML has gone through a bit of a renaissance to try and provide explainability for its models, there's a similar thing going on for RL as well. The thing that makes it a little bit more difficult is that we're not talking about single decisions anymore. We're not talking about static statistics; we're talking about trajectories, multiple decisions over time. So it actually becomes quite difficult to visualize and comprehend in your head.
29:01 Phil Winder: But that's one angle. The other angle is something called safe RL, or safety-conscious RL. And this is where, fundamentally, the algorithms are being trained to exist within a constrained and confined set of states. There are two main approaches to that. The first one is to build it into the algorithm: there's a certain set of algorithms that can be mathematically proven to be safe, to the point where, if you have the correct constraints applied to the algorithm, then it is safe within those constraints. That's useful in certain circumstances. The other is to take a slightly more external approach and have supervisory control over the decisions that are being made, and again just apply rules and constraints to that process to make sure that it's not making stupid decisions. So that's the second general theme.
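A sketch of that second, supervisory approach: a simple wrapper that checks each proposed action against a hand-written constraint before it reaches the environment. The constraint, the clamping rule, and the logging are invented for the example.

```python
def safety_shield(proposed_action, max_speed=1.0):
    """Hypothetical supervisory layer: clamp any action that would take
    the system outside its allowed operating envelope, and record that
    an intervention happened so it can be audited later."""
    if abs(proposed_action) > max_speed:
        safe_action = max(-max_speed, min(max_speed, proposed_action))
        print(f"intervention: {proposed_action} clipped to {safe_action}")
        return safe_action
    return proposed_action

# The agent proposes an unsafe action; the shield overrides it.
action = safety_shield(3.2)   # prints an intervention and returns 1.0
```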
29:52 Phil Winder: But in general, I totally agree. I think the best thing to do is to evaluate and accept whether it's going to be a problem in your implementation, and actively start to, maybe not solve the problem, but at least attempt to make the problem visible and make it known. Explain the problems, explain the reasons why it's bad, make it publicly available. Be very clear and open and up front. And you can start to do things like creating a really good audit trail of everything that goes on; then, not if but when something does happen, at least you've got the audit trail to go back through and provide you with the post-mortem data so you can figure out what went wrong and prevent it from happening in the future. But again, no easy answer, no simple solution; it's tricky engineering.
30:42 What are the risks from malicious actors within the system?
30:42 Charles Humble: Along similar lines, what are the risks from malicious actors within the system or adversarial type attacks on a reinforcement learning system? And are there ways that you can guard against malicious attacks?
30:56 Phil Winder: There's a great example that I gave in a presentation a few months ago. It's an example where an algorithm is being trained in simulation, but it's still useful: a simulated humanoid robot is taught to kick a ball and score a goal. And there's also another agent, another humanoid agent, which has been trained to save the ball from going in the net. And over time both of these robots learn to do both things pretty well. It can kick and score, and these are really complex, full anatomical models by the way, so they're really hard to train in the first place, let alone to kick a ball into a goal.
31:32 Phil Winder: And the goalkeeper is doing pretty well. But then the researchers introduce an adversarial attack on the robot that is trying to score the goal: the goalkeeper just falls over, it just lies down on the floor. And because the kicker robot has never, ever observed those states before, it's never seen a goalkeeper just lie on the floor, it doesn't know what to do and it just stumbles around like a drunk person and can't kick the ball anymore. So it completely fails when it sees these unobserved states.
32:03 Phil Winder: And so just like ML, this is still a data-driven approach. So if your data is not representative of what is seen in real life then it's not going to do very well. And there's all sorts of technical tricks and generalizations that you can do to try and improve upon that problem but fundamentally it's still a data-driven approach so you need to have representative data in order to get a representative and reasonable result. So yeah, that certainly is a problem with adversarial attacks and I'm sure that these things can be manipulated to do strange things.
32:34 Charles Humble: That's a great place to end, I think. Dr. Winder, thank you very much indeed for joining us this week on the InfoQ Podcast.
32:41 Phil Winder: No worries, thank you very much.
Additional resources
- Reinforcement Learning by Phil Winder
- A Markovian Decision Process by Richard Bellman
- Kinematic Models
- Learning Agile Robotic Locomotion Skills by Imitating Animals by Xue Bin Peng et al
- A Comprehensive Survey on Safe Reinforcement Learning by Javier García et al