Serverless IoT @iRobot

Summary

Ben Kehoe discusses why serverless is a natural fit for the Internet of Things when you're producing connected devices: from a technology perspective, serverless is often very event-driven, and so is IoT. Kehoe also talks about the ability to do all this in a very rapid and lean fashion.

Bio

Ben Kehoe is a Cloud Robotics Research Scientist at iRobot, and uses the internet to enable robots to do more and better things. His interests include the Internet of Things, the Connected Home, scalable, developer-friendly cloud architecture, and stamping out the scourge of servers.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Thank you, so I'm Ben Kehoe and I'm here to talk about serverless IoT at iRobot. I'm a serverless evangelist, there's a lot of buzzwords here, and it is in a lot of places on the life cycle. And that's about connecting robots to the internet to help them do more and better things. I will talk a little bit today about what serverless is because, you know, as everybody says, it is very nebulous, and it gets used in a lot of different ways. It is meaningless, it is a bad term, all of those things are true. I will talk about iRobot's journey to the cloud and to serverless, and about how we do architecture, deployment, and operations, and what it takes to be a serverless organization.

What Is Serverless?

And so, what is serverless? There are servers, of course, everyone feels the need to make that caveat. But that feels like the wrong question to ask, what it is. A better place to start is to ask what we are trying to do with serverless, what is the point of what we are trying to do? We want to make things cheaper, faster, leaner, and better. We want to make them cheaper in terms of paying less for resources we are not using, we want to be faster in terms of feature velocity, and we want leaner DevOps teams. We want to be better both in the systems that we are making, and in focusing better on delivering business value.

And serverless is also not a bunch of things. To start with, it is not just FaaS. FaaS is an integral component of serverless, and it is where you bring your code into a serverless system. A lot of people use the two terms synonymously, and they are not: that's where you bring your code, and the rest of it exists as managed services that you are bringing in to do all the rest of the parts of your system. It is not just event-driven compute. Functions as a service works naturally with event-driven compute, but you can do request-response paradigms on serverless architectures. It is not zero cost at idle. If your data is sitting idle, you are probably paying for it. Storage is cheap, but not free. And you may also be willing to pay some for idle to reduce your cold-start costs.

Containers are sort of a tricky topic. You can certainly run a server in a container. A container that runs with an externally-defined life cycle and, you know, serves multiple requests inside it, that is a server. You are running it in a container, it is not serverless. On the other hand, you can have a container that is started up in response to an event, and its life cycle goes for the length of processing that call, it disappears afterward, and that's a container used for serverless. So containers are orthogonal to serverless; they can be in there, but they may not be there.

This is probably the most controversial thing, I think, on this slide at least. Serverless doesn't have to be public cloud. So you can run a function as a service system in a private cloud on prem, and as long as there is enough separation, and the organization is large enough, that the users of that system and the people who maintain it are very separate, like, separate enough that it wouldn't be that different if it was running in the public cloud, then you are still, you know, kind of serverless.

But if the team that is using functions as a service is also running the service that hosts those functions, then you are not, you are not serverless in that case. And so the real truth is that serverless is a spectrum. It is not a black-and-white thing where something is serverless, or it is not.

And so, you know, it is shades of gray, and you are moving towards serverless as you get more service-full (you heard that term earlier today), where you are using managed services for everything that is not the core of what you do as a company. And for the core, you are using ephemeral compute, things that do not stick around very long, to glue all of these services together. And it is more serverless as the resources you are billed for conform more closely to the resources that you actually use. And that can take, you know, a number of different steps. Recently, AWS moved from per-hour billing on their instances to per-second, so you can use the instances in a more disposable, faster manner. The start-up time is still long and all that, but you can imagine that that line may blur in the future.

And finally, having smaller and more abstract control planes: the fewer knobs that you have to turn on a system to get performance out of it, to get scaling out of it, the more serverless you are. There can be knobs that are available, but if you don't have to use them, you are making that surface smaller. And on being more abstract, take the difference between provisioning database nodes and provisioning throughput: throughput is more serverless, because it abstracts away more of the details underneath.

And so, going back again to this cheaper, faster, leaner, and better, I want to show a couple of graphs that indicate the trend that you are going through when you go serverless. So first, cheaper. So again, this is a trend along that spectrum of -- you are paying less for your cost at idle. You may not be paying nothing, but it is going down. And for the faster and leaner piece, it is a little more complicated of a story. In terms of your effort, your development and your operations effort, in sort of the traditional terms, are going down.

But, as you go to serverless from platform as a service, you get a few extra new things that you have to do. And I will talk a little bit later about what those are. And similarly, in your code base size, the amount of business logic in your system doesn't change. And ideally, that is all you have to write.

But, in the real world, you have a bunch of infrastructure and a bunch of other code around it that supports the things that you're doing. And as you move towards serverless, that starts to decrease. But the infrastructure definition for all of these managed services starts to increase as you go from infrastructure as a service to platform as a service to serverless, even though the overall size of the things you are handling is going down. There are no units on the axes here, it is just an abstract picture, and what better means will come later.

iRobot’s Journey

So iRobot, we make the Roomba. Who here has one? That is pretty good. That is why you are all here? In 2015, we launched the Roomba 980, the first connected Roomba. We have a history of building robots; we go back 25 years. We have built space robots, underwater robots, oil well robots, industrial robots, some really creepy dolls. And we have even built networked robots. So this is PackBot, the first successful defense robot. And you can see there's a yellow spool on there; that's the fiber optic cable. If you have a fiber optic cable, you don't have wireless problems. And we did mesh networking and stuff, so we have experience in connecting robots over the network.

We have cloud experience too, with a robot that you can drive around remotely and video conference in through. So that was connected through the cloud, and it was an enterprise product. And the cloud for it was single-tenant hardware that was installed in a rack somewhere. So it was not scalable, elastic, public cloud infrastructure. So, that was then. And we're now connected across our entire range of Roombas, and among our hard floor products, the Braava is Bluetooth connected.

So we are scalable and connected in that way. In the future, we are moving into the smart home, where that connectivity lets us take the spatial context that our navigating robots build up about the home, their understanding of the space, and use it with other smart home products to provide users with more compelling experiences.

But if we go back to 2015, we were in a place where, at launch, we were launching with a cloud provider that provides a full solution for IoT. And they were chosen some years before launch for reasons that were pretty valid at the time. But, by launch, even beforehand, we had realized that they were not going to scale to the volumes that we needed, and they didn't really have the extensibility that we wanted.

AWS IoT

You know, we now have an Alexa integration, and a Google Home integration, and those would have been hard to do on their platform. So we knew that we wanted to switch. We launched with them, we had robots in the field that we knew we would need to transition, so we searched around for another connectivity provider. And we settled on AWS IoT.

And so, AWS IoT is a service that includes a number of different things. The main entry point here is the device gateway; it allows a mutually-authenticated TLS channel between the device and the cloud, and over that, it uses MQTT, which is a lightweight pub/sub protocol for IoT. In that pub/sub system, you have a rules system where you can write SQL against all of the messages passing through, and then you can hook it into other AWS services. And then you get the device shadow, an asynchronous mechanism for communicating with the device. So the robot owns its schedule so that, you know, if it cannot contact the cloud, it will still clean when you told it to clean. If the robot is offline and you want to view the schedule from the app, the robot has reported that to the cloud and you can see a copy of it. If you want to change it while the robot is offline, the app puts that as a desired schedule in the device shadow. When the robot comes back online, it sees that, changes its schedule, and reports back out to the cloud. So that is a part of the service that makes the system really easy.
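The desired/reported exchange described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not iRobot's code or the AWS SDK: the `state`, `desired`, and `reported` keys follow the shape of an AWS IoT shadow document, while the `schedule` contents and the helper names `shadow_delta` and `merge` are assumptions made up for the example.

```python
from typing import Any, Dict


def shadow_delta(desired: Dict[str, Any], reported: Dict[str, Any]) -> Dict[str, Any]:
    """Return the parts of `desired` that differ from `reported`."""
    delta: Dict[str, Any] = {}
    for key, want in desired.items():
        have = reported.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            nested = shadow_delta(want, have)
            if nested:
                delta[key] = nested
        elif want != have:
            delta[key] = want
    return delta


def merge(state: Dict[str, Any], changes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `state` with `changes` merged in recursively."""
    merged = dict(state)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical shadow document: the robot last reported one schedule, and
# the app changed Monday's start time while the robot was offline.
shadow = {
    "state": {
        "reported": {"schedule": {"mon": "09:00", "wed": "09:00"}},
        "desired": {"schedule": {"mon": "10:30", "wed": "09:00"}},
    }
}

# On reconnect the robot only needs to look at what differs...
delta = shadow_delta(shadow["state"]["desired"], shadow["state"]["reported"])
# ...apply it locally, and report its new state back to the cloud.
shadow["state"]["reported"] = merge(shadow["state"]["reported"], delta)
```

Once the robot has reported back, desired and reported agree and the delta is empty, which is what lets the app and the robot communicate without ever being online at the same time.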

So you will see this a number of times, this is the AWS icon for IoT. And the service is serverless: there are literally no knobs to turn. It scales to the number of events that you pass into it, it is very event-driven, it is pub/sub, you can listen to the messages, you can send them to Lambda, to Kinesis, to Elasticsearch, to Firehose, you can do a lot of things with it. And it integrates into the production process, which is important for a device manufacturer, because all the logistics that go into hardware also need to go into the IoT system.
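To make the pub/sub part concrete, here is a small sketch of how an MQTT topic filter with the standard `+` (one level) and `#` (all remaining levels) wildcards selects messages. It is a simplified illustration, not the broker's implementation: it ignores the special handling of topics that start with `$`, and the topic names are hypothetical.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Check an MQTT topic against a filter using '+' and '#' wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # '#' matches the parent and everything below it, and is only
            # valid as the last level of a filter.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


# Hypothetical topic layout: robots/<robot id>/events/<event type>
mission_events = "robots/+/events/mission"
everything_for_one_robot = "robots/r-123/#"

matched = topic_matches(mission_events, "robots/r-123/events/mission")
not_matched = topic_matches(mission_events, "robots/r-123/events/battery")
```

A subscriber, or a rule feeding some downstream service, names a filter like these and only receives the messages whose topics match it.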

So that's the connectivity layer; it gives us the cloud connection and then we need to build an application behind it that does all of the functionality that we needed out of that cloud system that had been originally provided by this other full-solution provider.

IoT + Serverless

And, for that, we decided to go serverless. And this is a natural fit, you know, that chocolate and peanut butter thing, because IoT is very event-driven: you get sensor events, user inputs, firmware updates you want to send out. It needs to be scalable. We sell millions of robots a year, and that was already happening before we connected them. So we knew the volumes we were going to be selling afterwards, and they were all going to be connected and we had to be ready for that. It is important for us to be lean. We are historically a device company, not a cloud application company. And so we didn't want to have to go through the process of building up expertise in cloud infrastructure and cloud operations.

Serverless enabled us to skip all that, and you will see some of those numbers later. There are now products that allow the reverse. AWS Greengrass is one of these, and just today another was announced; it is called, like, Azure IoT Edge, I think. What these allow is for cloud computing models, like AWS Lambda, to be run on a device, and to help manage the connection back to the cloud. What that means is if you have expertise in the cloud, but you want to be on a device, you don't have to build up all of that expertise in what embedded programming looks like. And both of these mean that you are just focusing on your business value, right, that you are not learning to do all the things that are undifferentiated heavy lifting.

And so, as an example here, when you have a Roomba that wants to connect to the internet, it has a certificate that is signed by our certificate authority, and we want to connect that to AWS IoT. And so what we're able to do with AWS IoT is install our certificate authority cert on AWS IoT; we do that, and we prove that we own it, through an API call. And the robot establishes a connection. If you know about TLS, there's a key exchange, and with that come certificates. On the web, that is used to verify the authenticity of the server. What is not often used on the web is that the client can also provide a certificate, and the server can verify the client's identity. So this robot's certificate is signed by our CA, AWS knows that CA and says that anybody who comes in with a certificate signed by it should be allowed in. That's how we do authentication with the AWS IoT service.
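As a rough illustration of the client side of mutual TLS, the sketch below uses Python's standard `ssl` module. It is not the robot's firmware: the function name and the file paths in the comments are assumptions, and a real device would then open its MQTT connection over a socket wrapped with this context.

```python
import ssl
from typing import Optional


def make_device_tls_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a client-side TLS context that can also present a device certificate."""
    # Verifying the server is the part everybody already uses on the web.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Presenting a client certificate is what makes the channel mutually
    # authenticated: the server can check that it chains to a CA it trusts.
    if cert_file is not None:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


# Hypothetical usage on a device (paths are placeholders):
#   context = make_device_tls_context("device-cert.pem", "device-key.pem")
context = make_device_tls_context()
```

The server side of the exchange, where the presented certificate is checked against the registered CA, is handled by the managed service and does not appear in device code at all.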

And what is important is that we did not tell it anything about this robot beforehand. So when we look at where we build these robots, it is behind the Great Firewall, and these robots have keys on them; they are given an identity in the factory. We installed the CA certificate once on AWS IoT, and once on a hardware security module in the factory. An HSM is a tamper-proof box that performs cryptographic operations and self-destructs if you mess with it or look at it wrong.

And this logistical chain is very short, it is measured in feet, as opposed to a lot of other IoT providers that want to provision identities for your devices, which are batch-transferred from the provider to you, probably in the U.S., and then from you to the factory, across the Great Firewall. That is a long logistical chain measured in thousands of miles, and if any part of it gets disrupted, it has the potential to hold up production. And that is a thing you don't want. So we actually do collect those certificates for analytics purposes, but we don't have to as part of that connectivity.

So, that's AWS IoT, right. And behind that, we built up an application, and that's built around about two dozen other AWS services, these are a bunch of icons that represent services. The only thing that you need to know is that there is no EC2, there is no Docker, none of that is happening in there. In the other parts of the organization, we have other inputs that provide other things. There are servers in there, and we don't like to talk about those.

Long Story Short: Success!

And so the long story short was that this was really a success, our production cloud is fully serverless, we will have two million devices connected by next year, the analytics platform that is in the stack is mostly serverless, there is some Spark in there and stuff that we will get rid of as time progresses and this gives a future for the data-powered platform.

iRobot Scale

And in terms of scale, we have 100-plus Lambda functions per deployment. We use, again, two dozen-ish AWS services, with no EC2 and no Docker. And for our footprint, we probably have a dozen AWS accounts, we have thousands of Lambda deploys per day, and we do this with a couple of people running it. It mostly runs itself, and our development team is in the, you know, 10 to 15 people range. So we have been able to stand this up without having to build up a large amount of organizational infrastructure to do it.

To make this long story less short, I'm going to talk about our architecture, deployment, and operations, and that organizational piece, which is really critical.

Before Serverless

So, for architecture, in the dark ages of servers, you put a bunch of code on there, and you are using your programming language to make calls between functions and do all this stuff, and you may have some internal state, maybe a 12-factor app that is completely stateless, shared nothing, all of that, which is great, but a lot of people have a decent amount of internal state within a service. And when you go serverless, you take that and you expand it out, you take the functions you are putting, or the major functions that you have, and you put them in different lambda functions. And all of that state has to go to an external store. And so your infrastructure diagram becomes more complicated, there are more bits to it. But the code within each of these functions goes down. And actually, it gets really complicated. So this is one tiny little piece of our application where robots register with the cloud.

And so, they take their certificate, this bit here, and they pass it into API Gateway over HTTPS. That hands over to a Lambda function that does some initial checking on it, asking AWS IoT to make sure it is valid, and puts it on a queue. There's a Lambda function to read from the queue, which dispatches to a third Lambda that creates the shadow I talked about and installs permissions. It emits a life cycle event in case we want to do more afterwards, and if the process fails enough times, that information goes from the queue to the dead letter queue so that we can take action on it later. All of this is connected to a logging service. So this seems like a lot of stuff. It seems very complicated when all I want is to create something and add, like, you know, a couple of policies to that entity.
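The queue and dead letter queue behavior in this flow can be simulated in memory. This is a sketch of the retry pattern only, with no AWS services involved; `drain_with_dead_letter`, `register_robot`, the robot names, and the limit of three attempts are all assumptions for the example.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain_with_dead_letter(
    messages: List[str],
    handler: Callable[[str], None],
    max_receives: int = 3,
) -> Tuple[List[str], List[str]]:
    """Process messages, moving any that keep failing to a dead letter list."""
    queue: Deque[Tuple[str, int]] = deque((message, 0) for message in messages)
    processed: List[str] = []
    dead_letters: List[str] = []
    while queue:
        message, receives = queue.popleft()
        receives += 1
        try:
            handler(message)
            processed.append(message)
        except Exception:
            if receives >= max_receives:
                # Give up, but keep the message so someone can act on it later.
                dead_letters.append(message)
            else:
                queue.append((message, receives))
    return processed, dead_letters


attempts: List[str] = []


def register_robot(robot: str) -> None:
    """Hypothetical registration step that always fails for one robot."""
    attempts.append(robot)
    if robot == "robot-2":
        raise RuntimeError("could not create shadow")


processed, dead_letters = drain_with_dead_letter(
    ["robot-1", "robot-2", "robot-3"], register_robot
)
```

The point of the pattern is that a failing registration neither blocks the robots behind it nor disappears silently.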

Serverless Architecture

But the truth is that what is really happening here is that the call graph in your code is instead moving into the component graph in your infrastructure. And so the complexity of the actual code that you are putting in those Lambda functions is very small. It is primarily the business logic that you're intending to run and, you know, the bulk of it is error handling. What do you do when something goes wrong? This means that you are thinking about distributed systems from the start. Traditionally, at system boundaries you think about distributed systems, and within a service you often ignore them, because you are working in code within one programming language and everything seems to be fine. And this makes you build systems that are robust by design, because you are treating failure systematically, and that's where you get better with serverless.

So an example of how this shows up in the architecture is that, when you look at building something serverlessly, you have to accept the options that you are given in terms of services that are available. And so we do file upload from robots, and the problem there is that S3, which is Amazon's file storage, their blob storage, uses the AWS standard authentication mechanism, which uses an access key, a secret key, and all of this stuff. The point is that the robots don't have those credentials; what they have is the certificate that authenticates them with AWS IoT. We could have set up a service that accepts certificates for authentication, and then proxies over to S3. We could have added all of the authentication pieces that would be needed on a robot to talk to S3 directly. We could have sent the files up through AWS IoT.

The problem there is that the message size is limited, so we would have to chunk it, there are no ordering guarantees, and it becomes complicated. So, given the services that were available, we took this approach: asking, through IoT, for an upload URL. That request goes to a Lambda function that says, I can talk to S3, and I can get a pre-signed URL. This is a URL where the credentials, those AWS credentials that the robot doesn't have, are actually provided in the query string. And they are time-limited, they are scoped to just the action that you want to take, and then you pass that down to the robot.

And so then the robot is able to take that and make a plain HTTPS PUT, with no other authentication involved, and that lands in S3. And the second piece of this is, we want that file, when it gets uploaded, to be encrypted and decryptable by the cloud.

And, for reasons that I'm not going to detail here, the server-side encryption option available on S3 doesn't really protect against the threats that we're looking at for this. The channel the robot uploads over is already encrypted; it is a TLS channel. But we also want the object passing over that channel to be encrypted so that, when it lands in S3, it is encrypted there.

And so we take the approach of, well, you know, we need a shared key: the robot encrypts the file with it, and the cloud can decrypt it. So AWS KMS provides us with an encryption key, and the robot's certificate has a public key on it, which we have through AWS IoT, so we can encrypt the symmetric key with that public key and send it down, and the robot encrypts the file and uploads it at the end. And so this is how we build out of the services that are available, rather than looking at it and saying, a server is going to be a lot easier. If we were proxying in front of S3, we could have done the certificate authentication and the encryption in place, but the amount of extra operations it would take to run the service, and the maintenance to make sure that it was running correctly over time, patching the servers, is not worth it.
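The idea behind a pre-signed URL, a signature in the query string that is time-limited and scoped to a single action, can be shown with a deliberately simplified sketch. This is not the AWS signing algorithm and would not work against S3; it uses a plain HMAC over the method, path, and expiry, and the secret, path, and function names are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(secret: bytes, method: str, path: str, expires_at: int) -> str:
    """Sign exactly one action: this method, on this path, until this time."""
    message = f"{method}\n{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def presign(secret: bytes, method: str, path: str, expires_at: int) -> str:
    """Cloud side: issue a URL that carries its own proof of authorization."""
    query = urlencode(
        {"expires": expires_at, "signature": _signature(secret, method, path, expires_at)}
    )
    return f"{path}?{query}"


def verify(secret: bytes, method: str, url: str, now: int) -> bool:
    """Storage side: accept the request only if the signature and time check out."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        supplied = query["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    expected = _signature(secret, method, parsed.path, expires_at)
    return hmac.compare_digest(expected, supplied)


# The secret never leaves the cloud; the robot only ever sees the URL.
secret = b"held-by-the-cloud-only"
upload_url = presign(secret, "PUT", "/uploads/robot-123/log.bin", expires_at=1000)
```

Because the method and path are part of what is signed, a URL issued for one upload cannot be reused to read the object or to write somewhere else, and it stops working once it expires.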

Serverless Deployment at iRobot

So we talked about what we put in the cloud; now we will talk about how it gets there. So this is, again, 2015, and way back then, serverless was very nascent. Lambda had just come out and people were really just figuring out what it could do, what was possible, and the tooling was almost non-existent. There was one tool, called JAWS, which is now called the Serverless Framework. They quickly figured out that JAWS was not a great name, for legal reasons.

And there were a couple of others at the time, and we looked at all of them and said the complexity of what we're trying to do here is just not supported by these tools. And so we had to roll our own, which is really, you know, ideally not something that you're doing when you are serverless. Serverless is all about how do I only do the things that are relevant to my business, and building cloud tooling is not relevant to what iRobot does as a business, but it was worth it for the total cost of going serverless.

Deployment Tool: CloudR

And so we built this tool called CloudR. The reason it is called that is that a group of cats is called a clowder, and managing cloud resources is like herding cats. We do this with CloudFormation; this is AWS's infrastructure management service. It is a bit like Terraform, except it operates cloud side. This is attractive to us: we want deployments to be started from the developer's machine, and once you hand it off, it is going to keep going even if you get disconnected.

But CloudFormation supports what AWS decides it supports, which, especially early on in 2015, was not everything we needed. IoT was not supported at launch, and Cognito, their mobile identity solution, was not supported at the time. It has gotten a lot better; it is closer to the set of services that are available. But it is never going to be everything that you might possibly need. And, for that, we use custom resource Lambdas.

So in addition to the built-in resources, you can build custom resources that rely on a Lambda function behind them. When we looked at that, we said, that is hard right now. And we built a library to make it easy, and that has enabled us to always say, yeah, we want to manage that with CloudFormation, let's just do it. That works as long as it is something that has a create, update, and delete life cycle, whether it is a third-party service or an AWS service that is unsupported in CloudFormation. I said to Jared Short, who talked earlier today, with custom resource Lambdas, you can control anything with CloudFormation. And he said, all right, IoT light bulbs. It took me a half hour, most of which was wrangling with the light bulb SDK. So that's fun.

And so the way that we do this is, in the source, right, there's a bunch of functions defined, Lambda functions. And those get zipped up, and then there's a CloudFormation template. And that template defines, you know, the functions and the services that are used and all the interconnections between them.
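The create, update, and delete life cycle behind the custom resource Lambdas mentioned above could look roughly like the following. This is a sketch, not iRobot's library: the `RequestType`, `ResourceProperties`, `PhysicalResourceId`, and response field names follow the CloudFormation custom resource request and response format, but a real handler would also have to send the response as JSON to the pre-signed `ResponseURL` in the event, which is left out here, and the light bulb "service" is an in-memory stand-in.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Properties = Dict[str, Any]
Result = Tuple[str, Dict[str, Any]]  # (physical resource id, output data)


def build_response(
    event: Dict[str, Any],
    status: str,
    physical_id: str,
    data: Optional[Dict[str, Any]] = None,
    reason: str = "",
) -> Dict[str, Any]:
    """Assemble the body that CloudFormation expects back for a request."""
    return {
        "Status": status,
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def handle(
    event: Dict[str, Any],
    create: Callable[[Properties], Result],
    update: Callable[[str, Properties], Result],
    delete: Callable[[str], None],
) -> Dict[str, Any]:
    """Dispatch one life cycle request to the matching callback."""
    try:
        request_type = event["RequestType"]
        if request_type == "Create":
            physical_id, data = create(event["ResourceProperties"])
        elif request_type == "Update":
            physical_id, data = update(
                event["PhysicalResourceId"], event["ResourceProperties"]
            )
        elif request_type == "Delete":
            physical_id, data = event["PhysicalResourceId"], {}
            delete(physical_id)
        else:
            raise ValueError(f"unknown request type: {request_type}")
        return build_response(event, "SUCCESS", physical_id, data)
    except Exception as exc:
        fallback_id = event.get("PhysicalResourceId", "failed-to-create")
        return build_response(event, "FAILED", fallback_id, reason=str(exc))


# Hypothetical third-party "service": light bulbs kept in a dict.
bulbs: Dict[str, Properties] = {}


def create_bulb(props: Properties) -> Result:
    bulb_id = f"bulb-{len(bulbs) + 1}"
    bulbs[bulb_id] = dict(props)
    return bulb_id, {"Color": props.get("Color", "white")}


def update_bulb(bulb_id: str, props: Properties) -> Result:
    bulbs[bulb_id] = dict(props)
    return bulb_id, {"Color": props.get("Color", "white")}


def delete_bulb(bulb_id: str) -> None:
    bulbs.pop(bulb_id, None)


create_event = {
    "RequestType": "Create",
    "StackId": "example-stack",
    "RequestId": "example-request",
    "LogicalResourceId": "LivingRoomBulb",
    "ResourceProperties": {"Color": "blue"},
}
response = handle(create_event, create_bulb, update_bulb, delete_bulb)
```

Anything that can be expressed as those three callbacks can then be declared in a template next to the built-in resource types.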

And so the zipped functions get uploaded to S3, and the locations that we uploaded them to in S3 then get written into the CloudFormation template. Because the functions that you are defining in your infrastructure, the source needs to come from somewhere. We then upload that template into S3 to create an artifact, and then we tell CloudFormation, okay, go deploy this template. And that becomes a stack, and that's the instantiated template.

And, of course, we have to bring those custom resource Lambdas, deploy them, stitch them into the template, and tell CloudFormation where they are. And so this works well. It is something that we hope to eventually move away from. The Serverless Framework has come a long way, AWS has their own tool called SAM, and there are a bunch of serverless frameworks out there that can do a lot of things. We're not ready to replace all of it yet. There's still a lot of stuff that is lacking.
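The step where uploaded artifact locations are written into the template can be sketched as a pure function over a template held as a Python dict. `AWS::Lambda::Function` and its `Code` property with `S3Bucket` and `S3Key` are part of CloudFormation; the function name, bucket, key, and resource names are assumptions for the example, and the zip and upload steps themselves are not shown.

```python
import copy
from typing import Any, Dict, Tuple


def point_functions_at_artifacts(
    template: Dict[str, Any],
    artifacts: Dict[str, Tuple[str, str]],
) -> Dict[str, Any]:
    """Return a copy of the template whose Lambda functions reference S3 artifacts.

    `artifacts` maps a logical resource name to the (bucket, key) it was uploaded to.
    """
    result = copy.deepcopy(template)
    for name, resource in result.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Lambda::Function":
            continue
        if name not in artifacts:
            raise KeyError(f"no uploaded artifact for function {name}")
        bucket, key = artifacts[name]
        resource.setdefault("Properties", {})["Code"] = {
            "S3Bucket": bucket,
            "S3Key": key,
        }
    return result


# Hypothetical template fragment and upload locations.
template = {
    "Resources": {
        "RegisterRobot": {
            "Type": "AWS::Lambda::Function",
            "Properties": {"Handler": "register.handler", "Runtime": "python3.12"},
        },
        "RegistrationQueue": {"Type": "AWS::SQS::Queue"},
    }
}
artifacts = {"RegisterRobot": ("example-artifact-bucket", "builds/abc123/register.zip")}

deployable = point_functions_at_artifacts(template, artifacts)
```

Keeping this as a transformation that returns a new template means the source template stays free of environment-specific locations, and the rewritten copy is what gets uploaded and deployed as the artifact.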

Red/Black Deployments

And so once we've got that up there once, we want to update it over time. And the terminology for phased roll-outs is very ambiguous; it is used in a lot of different ways. For this talk and this talk only, I will define it in a certain way so it is clear to you. So when I say blue/green, I mean that behind the load balancer, workers are updated. To the client, the end point stays the same. And for red/black, you are standing up a new copy, including a new load balancer, and the client has an end point that it switches between. Blue/green has traditionally been an area where serverless offerings are lacking. It is getting better, but it is not there yet. So we went red/black.

Hosting Multiple Versions

And there is no overhead to it. I can stand up an entire new copy of our application, and if there is no traffic through it, it doesn't cost us much. IoT makes things tricky for red/black in a number of different ways. AWS IoT makes things tricky for various technical reasons, but also robots are just more finicky clients than web browsers. It is harder to ensure that they are switched over. The other piece is that data stores, external resources, often have a separate life cycle from, like, the code in your application. And you need to manage that differently.

Deployed System Architecture

So the way that we do that piece is that we deploy them differently. They live elsewhere, mostly managed by CloudFormation, but not in the same stacks that our applications live in. And what we do then is stand up in CloudFormation a set of proxy resources. These are, again, custom resources, we use a Lambda for that, and we say to it, here, pretend to be this resource. And it goes over, it reads all the information from the real resource, and it provides all of that information back to CloudFormation the way it would if it was the real thing.

And so, when CloudFormation tells it to create, update, or delete, it doesn't do anything. But it allows us to tell the application, here, go and look at the proxy stack and get all of your information about your databases, your Kinesis streams that flow out to the analytics stack, the hooks into our user authentication system for admin access that live there, things like that. And that forms a manifest of all the pieces, which allows us to re-use it when we have a new deployment that is using the same resources.

And now, when something gets updated, say there's a new IAM role that we need to use, we create another one of these stacks and point it to the appropriate resource, and then we deploy the application pointing to that. But, of course, if the previous versions of the application are still in existence, their stack of proxy resources still exists, all of the resources are still managed in their own life cycle, and so all of this can co-exist at the same time. We don't have to do any in-place updates of anything. And of course, as you keep adding versions over time, this just continues to function in the same way.
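A proxy resource of this kind might be sketched as a custom resource handler that never changes anything and only reports what it reads from the real resource. This is a guess at the shape of the idea, not iRobot's implementation: the `TargetName` property, the `describe` callback, and the placeholder attribute values are hypothetical, and sending the response back to CloudFormation is omitted.

```python
from typing import Any, Callable, Dict


def proxy_handler(
    event: Dict[str, Any],
    describe: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Answer CloudFormation on behalf of a resource that lives in another stack."""
    target = event["ResourceProperties"]["TargetName"]
    # Create and Update only read from the real resource; Delete does nothing,
    # because the real resource has its own, separate life cycle.
    data = {} if event["RequestType"] == "Delete" else describe(target)
    return {
        "Status": "SUCCESS",
        "PhysicalResourceId": f"proxy-{target}",
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data,
    }


# Stand-in for looking up a real, separately managed resource.
real_resources = {
    "registrations-table": {
        "Name": "registrations-table",
        "Arn": "arn:example:registrations-table",  # placeholder, not a real ARN
    }
}

event = {
    "RequestType": "Create",
    "StackId": "proxy-stack-v2",
    "RequestId": "example-request",
    "LogicalResourceId": "RegistrationsTable",
    "ResourceProperties": {"TargetName": "registrations-table"},
}
manifest_entry = proxy_handler(event, real_resources.__getitem__)
```

The application stack reads the attributes in `Data` from the proxy just as it would from the real resource, so several application versions can each point at their own proxy stack while sharing the same underlying data stores.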

Serverless Operations at iRobot: Monitoring

And this transitions well into how we do operations. So the first part of operations is what are you running, and how do you know what is going on there. We use Sumo Logic for this; we send it CloudWatch Logs and CloudTrail across all of the accounts that we have. There's an account per developer, plus all of the staging accounts, and all of that is pulled into a central place. We use it for alerting, we use it for notifications, and we use it for diving deep into the logs to find out where problems occurred.

DevOps

And so the DevOps piece for serverless that we found is that developers are exercising a platform much more than your production application is, simply because they are making more API calls. They are deploying the application, those thousands of lambda deploys per day, almost none of them are production deployments, they are all developers iterating on their code.

And so, this tests your account limits, this tests your metrics, and so for everything that you are trying to launch, you can make sure that the information flows from it. Unlike, you know, in sort of more traditional environments, to your provider there is no difference between your dev and your prod environment, right? It is the same thing that they are providing; they don't know that one of them is production and another one is not. And, because the developers are exercising the platform APIs all the time, you can actually see problems in the platform before they hit production. Because often, problems occur when you are trying to change something. And you are changing more things in development.

Visibility

And so the visibility piece, those metrics, you want as many as you can. Always be greedy, ask for more. AWS IoT now has a ton of metrics, but at launch, it had a handful. And so there's a lot of things that took us a long time to debug because we couldn't see into the system well enough. But, over time, that improves, and so that is always something to keep an eye on, am I getting to see the things I want to see.

AWS Enterprise Support

This is a straight plug: we have found AWS Enterprise Support to be useful. It is great, worth the money. They understand what you are trying to do, they learn from what you are doing, so there's a two-way flow of information there. Everybody in your company can create tickets, which is useful. It comes with a bunch of stuff.

The Future of Improved AWS Visibility

And then, in the future, always ask for metrics. Tell them what you want, ask, even just through feature requests on Twitter; #awswishlist is a good one for us. And the service health dashboard is helpful, but at hyperscale it takes a lot to change any service from a green check mark. With a per-customer health dashboard, issues can be communicated to you in a timely manner. And if something is happening and it is not your code, you need to be able to prove that to the higher-ups in your organization, and that goes to how we, you know, need to change in our organizations to go serverless.

Serverless Organizations - Conway’s Law

So Conway's law; this is a different Conway over here, this is Conway's game of life. But it is a life lesson that says your software architecture is going to match information flow in your organization. That is just true, it has always been true, is always going to be true. And you have to flip that for serverless, if you want to go serverless, that means you are going to change how the information flow happens in your software architecture. If you don't change the information flow in your organization, you are not going to be successful. So you have to make sure that you are setting yourself up to succeed by working within the organization to make sure that you are setting expectations correctly.

DiffOps

Because it is not all rosy. It was presented early on that serverless is no ops; that is just flat out false. It is a bit like the change from on prem to cloud. Maybe you have taken that step, but some are still working on that. Once you are in the cloud after being on prem, it is so much easier, but not everything is strictly easier. Most things are easier, some things are still hard, and there are some new things that crop up that you didn't have to do when you were on prem, especially monitoring your provider. Right?

If you have done outsourcing, you know it doesn't mean that you do zero work; it means that the work reduces and changes. And this is something that you need to establish inside of your organization: the nature of your involvement with technology is changing a bit, from ownership to more of a management and partnership role.

The Cloud Has Weather

Because the cloud has weather, right? No provider is immune to problems, and small effects are more common than big outages. Big outages get the headlines: the internet is on fire. But small blips happen too: latency increases, or a small spate of 500s comes out. And as you are transferring your stack outside of your organization, you will realize that all of the problems that used to be your problems are now coming from outside. And this comes with the territory. You want your architecture to be robust to those problems as much as you can, but sometimes there is nothing you can do about them.
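
One common way to make a client robust to a short spate of 500s is to retry with capped exponential backoff and jitter. The talk does not say how iRobot handles this, so the following is only a minimal sketch: `TransientError`, `call_with_backoff`, and all the default values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a 5xx or throttling response from a provider."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying TransientError with capped exponential backoff.

    The sleep and rng parameters are injectable so the function can be tested
    without actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: wait a random fraction of min(max_delay, base * 2^attempt).
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
```

Retries like this only help with brief blips. For a longer platform incident, the retries run out and the error still reaches the caller, which is the "sometimes there is nothing you can do" case.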

Visibility

And the biggest problem is visibility. You only know what your provider tells you. They are really good at the things that they do, but you don't know how they do them. You know that you don't know things about them, and there are unknown unknowns about them. The thing that keeps you up at night is the things that you don't know that they don't know they don't know. That is what is going to come out of the blue for you.

Reacting to Incidents

And when there is an incident, the first thing to do is to gather data: where is it happening and when is it happening? Then you can root cause. Is it your code that is causing this problem, this impact on your customers, or is it the platform? Suppose it is the platform. The first thing you do is own that impact to your customers. You might not be responsible for the incident, but you are accountable for it. It is your choice to be using technology in the way that you do, and you are accountable to your customers for the performance of your application. And on top of that, you want to check whether your application is handling that incident in an optimal way. You might not have the right settings to be robust to what is happening, so while it is happening, you want to see if there is anything you can tweak to mitigate this problem, and then look and see if your architecture is right and appropriate for the type of incident that happened.
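
The "where and when" step can be as simple as bucketing error events by source and by minute. This is only a sketch under assumed inputs: the event shape (`time` as an ISO 8601 string, `source` as a label such as `"platform"` or `"own_code"`) and the function name `summarize_errors` are hypothetical, and deciding which source an error belongs to is the hard part that this code does not solve.

```python
from collections import Counter
from datetime import datetime


def summarize_errors(events):
    """Count error events per (source, minute) to show where and when failures cluster."""
    counts = Counter()
    for event in events:
        # Truncate the timestamp to the minute so nearby errors land in one bucket.
        minute = datetime.fromisoformat(event["time"]).replace(second=0, microsecond=0)
        counts[(event["source"], minute.isoformat())] += 1
    return counts
```

A spike concentrated in one source and one time window is a starting point for the root-cause question, not an answer to it.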

And in the aftermath, the push-back you get after an incident is always, well, why can't we own it ourselves? Because that feeling, that there's an outage impacting your customers and there is nothing you can do about it, is very uncomfortable. And the truth is, right, that if you did that in-house, you would have more control and you could take steps to fix it. But you would be worse at it than this managed service that you're using. And so those incidents would happen more frequently and more severely than what you are getting when you are using a managed service. Obviously that is not always true: the bigger the organization, the more vertical integration can make sense, for certain scales and certain business cases. Sometimes it is in your interest to own the technology further down. But, when you are going serverless, your default is to say, how can I not own this?

Summing up

And so, to sum this up, serverless is cheaper, faster, leaner, and better. And we want to be serviceful: we want to use managed services and tie that all together with ephemeral compute. We want the resources we are billed for to be closer to the resources we are actually using, and we want the control planes to be smaller and more abstract, so there is less we are obligated to do to get good performance out of our systems. At iRobot, we transitioned from a cloud provider to a cloud we built on AWS, and that let us skip building elastic cloud infrastructure, which helped enable us to be lean and do this quickly.

Lessons Learned

The lessons we learned are that serverless deployment still has problems; people are figuring it out, and support from providers is still an ongoing effort. Your call graph is your component graph, so the complexity in your system moves from one place to another, and visibility is always going to be your biggest operations obstacle.

In your organization, you want to keep Conway's law in mind. The cloud has weather, so set expectations about that internally, and make sure you are focused on the total cost of ownership. It is an easy mistake to look at your AWS bill and forget about your operations salaries. That's all I got. Any questions?

We have a few minutes for questions. All right. Who raised his hand?

What is a robot doing with the cloud?

You asked what does the robot do with the cloud? The connectivity enables a few different things. The most basic is a better UI, so scheduling your robot through the app is much easier than messing with a bunch of, like, four buttons on a robot that doesn't have a screen. The level-up from that is, you know, you can clean remotely, and additionally, we collect product telemetry that allows us to find out, for example, how much cleaning are we actually doing for our average customer. Does that mean that we should make our batteries bigger, or smaller? I don't know about you, but I have never sent in a product warranty card, where you fill out the little surveys. And so, before having connected robots, you know, we had beta customers and surveys that we did. But we didn't have good data on how our customers were using the products, data that we can use to make them better for them. And then the smart home pieces are there, too. This guy was looking pretty skeptical the whole time.

Why is it 2015? What are you doing now?

So what we are doing now: 2015 is when the robot launched, and we transitioned at the beginning of 2016. Yeah. And then we launched other connected products in the interim and released features on that, so we can get maps from your robot, and you can now do Alexa integrations, so you can tell your robot to give the cat a ride, that's one of the things through Alexa. And a lot of this is stuff that we didn't learn at launch, right? As the robots are coming online, we are learning more about how to ingest those volumes into analytics. So you can natively shuffle data from IoT into Elasticsearch, but at a certain volume, that ceases to work well and you have to start batching it through Kinesis, so there are things that we've learned. And of course, the organizational piece has been a continual learning process.
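
The talk does not describe the actual ingestion pipeline, so the following only illustrates the generic idea behind batching: group individual records into fixed-size chunks so that the downstream store receives fewer, larger writes. The function name `batch_records` and the default batch size of 500 are placeholders, not values from the talk.

```python
def batch_records(records, max_batch_size=500):
    """Yield successive lists of at most max_batch_size records.

    Because this is a generator, the size check runs when iteration starts.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:
        # Emit whatever is left over as a final, smaller batch.
        yield batch
```

Each yielded list could then be handed to one bulk write instead of one write per record.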

Thanks, it was a nice presentation. So each device has its own topic to communicate? So what is one of the challenges that you faced … does it demand a standard around that, did you put in a standard to make the registration easy --

So there are topics that are used to communicate with individual robots, and there is also a special hierarchy of topics that AWS reserves for themselves. There are best practices that AWS provides for how you structure your topics. It is a little bit like the guidance on S3 keys: there are best practices around how you order the elements in them to get the best performance, those kinds of things.
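
As a rough sketch of per-device topics, the helper below builds one topic per robot and channel and recognizes the reserved hierarchy. The `robots/<id>/<channel>` layout and both function names are hypothetical, not iRobot's actual scheme. The two facts it relies on are that MQTT uses `/` as the level separator with `+` and `#` as wildcards, and that AWS IoT reserves topics beginning with `$`.

```python
def robot_topic(robot_id, channel):
    """Build a per-robot topic using a hypothetical 'robots/<id>/<channel>' layout."""
    for part in (robot_id, channel):
        # Reject empty parts and MQTT separator or wildcard characters.
        if not part or any(ch in part for ch in "/+#"):
            raise ValueError(f"invalid topic element: {part!r}")
    return f"robots/{robot_id}/{channel}"


def is_reserved(topic):
    """Return True for topics in the '$'-prefixed hierarchy that AWS IoT reserves."""
    return topic.startswith("$")
```

Putting the robot ID at a fixed level keeps it easy to scope a device's permissions and subscriptions to its own subtree, which is one reason to settle on a layout early.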

Thank you.

Last question.

Hi, how about automated testing; how do you make sure that it is serverless, and that the functions are integrating well?

So testing is a great topic for serverless. There's a number of slides that I took out because they wouldn't have fit. So one piece is that integration testing is only ever worth doing on a deployed system. As soon as you are past the Lambda DB stage, you cannot debug locally. You might be able to debug locally with SAM Local, that's something that AWS has, but for integration testing, you need to deploy it. There is no such thing as a local copy of a serverless application. And so, what that means is that you want unit tests to be much more comprehensive, and you also don't want your unit tests to depend on the cloud, even though most of the code is making SDK calls. And so there are mocking frameworks for the SDKs that allow you to stub out the SDK calls in those unit tests so that your unit tests can run without cloud connectivity, and the more of that you do, the more you are going to discover before you move to your integration tests, and the easier it is going to be. It is certainly a very different experience and process and workflow that doesn't have all the kinks worked out yet.
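
Stubbing out an SDK call in a unit test can be sketched with nothing more than Python's standard `unittest.mock`. The talk does not name a specific mocking framework, so this is an assumption for illustration: `save_mission` is a hypothetical piece of handler logic, and the `put_item(Item=...)` call is modeled on the boto3 DynamoDB table interface, with a `Mock` standing in for the real object.

```python
from unittest import mock


def save_mission(table, robot_id, mission):
    """Hypothetical handler logic: write one record through an SDK-style table object."""
    item = {"robot_id": robot_id, "mission": mission}
    table.put_item(Item=item)
    return item


def test_save_mission_writes_item():
    # A Mock stands in for the SDK object, so no network or credentials are needed.
    fake_table = mock.Mock()
    result = save_mission(fake_table, "r123", "clean")
    # Verify the code made exactly the SDK call we expect, with the expected payload.
    fake_table.put_item.assert_called_once_with(
        Item={"robot_id": "r123", "mission": "clean"}
    )
    assert result == {"robot_id": "r123", "mission": "clean"}
```

Passing the table object in as an argument, rather than creating it inside the function, is what makes it easy to substitute the stub. A test like this checks that the call is shaped correctly, but only a deployed integration test shows whether the real service accepts it.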

Okay, thank you. All right. Give him a hand, if you have other questions, please come up.

Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx.

 

Recorded at:

Mar 30, 2018
