Putting Node.js Serverless Apps into Production without the Pitfalls

Summary

Eoin Shanaghy covers the highs and lows of building Node.js apps with Serverless. The open source, JavaScript-based SLIC Starter project is presented, showing how it is used to quickly adopt best practices and get to a successful production deployment without pain. He also shows how TypeScript-based Infrastructure as Code is the way forward!

Bio

Eoin Shanaghy is the CTO and co-founder of fourTheorem, a technology consultancy focused on modern applications and machine learning. He is the author of "AI as a Service", a book on building serverless AI applications from Manning Publications. Prior to fourTheorem, he worked in software architecture and product engineering in both startups and enterprises globally.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Shanaghy: This talk is really going to be in two parts. The first part is going to be a little bit higher level, looking at modern applications, what they are, and what the best ways are to build them today. I'm going to talk a lot about serverless because most of the applications I build these days are using serverless in some shape or form. Then I'm going to look at some of the more technical details. What does it mean to build serverless applications with JavaScript? Then what are the challenges? There's always challenges with new technologies and adopting new patterns. What are they? How can we overcome them? I'm going to try and show then some varying recipes for doing stuff effectively with serverless.

The Modern Application

What is a modern application? What does it look like? These days, modern applications tend to have to be scalable. You don't have to have a billion users, but you might scale in some dimensions. It might be a stream of data coming from IoT devices, or it might be that you've just got large variances in scale. These days, we're expected to be able to handle sudden bursts in user consumption. We're also a lot more user focused. Users are a lot savvier. We keep them in the loop a lot more. We don't try and write all the requirements upfront and predict them. Instead, we work a lot more with users. We try and create hypotheses about user's behavior. Then we try and build stuff as quickly as possible so that we can test that hypothesis and either proceed with it or ditch it. Then our systems also have to be very reliable and secure in today's environment. We also need to look at incorporating intelligence into our applications, whether that's machine learning, or data science, or complex analytics. Really, I suppose the key things about modern applications is not necessarily the features or the technical constraints around them, it's more really about getting first to market. Because everybody's got the tools to deploy software very quickly, really. Your competitors do. Time to market and speed of iteration is very important. The ability to experiment is really critical, actually. Software companies that succeed the most today are the best at doing low cost experimentation, and putting experiments out into production, and measuring them effectively.

The Best Way to Build a Modern Application

Given all of that, I assume that everybody's come to QCon to up their game to find out what is the latest and greatest in building a modern application? Let's look at, what are the tools? What are the methods you could use to build the ideal modern application?

Sixty percent of websites on the internet are apparently built with WordPress. It's proven. It's open source. It's been stable for a long time. It's not very trendy. It's not great for the CV to build with WordPress. That might be a bit of a downside.

You might go a little bit more extreme and pick Haskell. Haskell is the purest of pure languages. If you want to get something into production that you know works, it's mathematically proven that you got no bugs in it, then you can go with an approach like that. For some reason, despite the guarantees around correctness that you get with Haskell, it hasn't been adopted by the masses. Because it's seen as a little bit scary, and difficult to take on.

You might go for something a little bit more pragmatic, something like Rails, or Django. Rails and Django are popular because they are pragmatic. They enable good developer productivity. They allow you to get stuff into production very fast. For a while it was all the rage, and people were doing great productive things with Rails. It was really a perception around performance and scalability that stopped Rails becoming adopted by the masses in the extreme.

Then we moved into looking at scalable distributed microservices. If you wanted to go along that route, you might do something like build your system with Erlang. We talk about microservices all the time. We're figuring our way through microservices as if it's the first time this has ever been done. Erlang was doing fault tolerant, distributed, scalable microservices in the 1980s. WhatsApp built their system on Erlang. They were able to scale to 2 million concurrent connections on one single server.

It's interesting to understand and look at this, and say, why haven't we adopted technology like this? I think it's largely down to the fact that it's too much of a reach into the past. People like to take stepping stones from the technology they're working with today. Instead, people move from doing monolithic Java applications, or Go, or C#, .NET applications into doing Erlang-like patterns with microservices. That's what brings us to, I suppose, the predominant patterns today, which is using container orchestration systems, and doing microservices at scale with Kubernetes. Of course, with Kubernetes, I suppose the emergent problems coming out of that space are that you've got a lot of infrastructure to maintain, and you end up spending a lot of time thinking about that, and not necessarily the function that sits on the top.

Then there's the idea I'm putting forward today, which is to build all your systems using serverless and to code them in JavaScript. Let's just think about that for a second. The idea there is that you componentize your software into the tiniest, smallest pieces, and you distribute them all over the network, so you introduce as much latency as possible. Then you relinquish control to the cloud provider for the majority of your system. Not only that, but you take a language that was invented in 1995, and developed in 10 days for Netscape, and you decide to use that language to implement it all in. Why would you do this? Why would you choose something that's so unclean? It feels so unfit for purpose. Why wouldn't you use a safer, purer language, and an ecosystem with more safety guarantees? This was because we've learned over time that perfect systems don't really work out very well. I spend a lot of time trying to architect elegant systems that people can use in a generic way, and which will provide guarantees around productivity, and speed to market, and cater for architectural growth into the future as well.

The Pursuit of Perfection

It doesn't really work like that. When you're trying to build perfect systems, and then you look at the reality of how developers actually take them, and adopt them, and use them, it never really works out well. The theory behind these theoretical frameworks and underpinnings never really plays out in reality. There's always trade-offs. It's about accepting those trade-offs. When we're building software systems, it's not just about having elegant architecture and elegant code. It's about how productive are you with it? The availability of skills. The community around it, the tooling available to it. Perfect software doesn't always build perfect systems. We're not trying to build beautiful looking code, we're actually trying to build meaningful user experiences.

I wanted to give one story about the pursuit of perfection. In 1975, in Köln in Germany, there was a young jazz fan called Vera Brandes. She was really into jazz and she wanted to encourage the popularity of jazz in Germany. She decided to book some concerts, and she booked an up and coming, but already quite famous solo improvisational pianist called Keith Jarrett. She managed to sell 1700 tickets for the Opera House for his concert. Keith Jarrett was a total perfectionist. This was a guy who used to give cough drops out to his audience so they wouldn't cough and interrupt. He requested a very specific model of grand piano for his performance. He arrived on the day of the performance, completely exhausted from his European tour and turned up at the Opera House with his manager. They checked out the piano, and it was a total disaster. The piano wasn't even a grand piano, it was a baby grand. It wasn't even going to carry the sound to the back of the Opera House. As well as that, it was a rehearsal piano. It was in awful condition. It was damaged. It was out of tune. Felt was missing from some of the hammers, so some of the high notes sounded just awful. He said, "No way. I'm not playing this." He walked out of the theater. This 17-year-old budding jazz promoter was left there with 1700 tickets sold. She didn't know what to do. She chased him. She ran after him out onto the street and begged him, pleaded with him to give his concert. Eventually, he took pity on her and relented. He went back inside. They managed to tune the piano, but they couldn't get another one. He had to proceed with the tools he was given. He had a sound recording team with him. They were recording all of his concerts. They were saying, "There's no point in recording this. Let's do it anyway. It will be evidence of what disaster looks like if nothing else".

He went ahead and he gave his concert. Because he had to work with the tools he was given, he was pushed way outside of his comfort zone. What that meant was that he had to resort to all sorts of things to try and work with the instrument, he was standing up hammering on the keys to try and make sure that the sound would carry. He had to avoid the upper register, and do things he wouldn't normally do. Something happened in the creative process about the recording of that album. Not only was it released, it actually went on to become the number one selling jazz solo performance in history. It also became the number one selling piano solo album in history. I think it's an interesting parallel. It might seem very far from the topic. Ultimately, software engineering is quite a creative discipline, more so probably than an engineering one. We have to think about how the tools not only enable us to do our work, but how sometimes when they push us outside of our comfort zone, they can actually stimulate us into new levels of creativity, that force us to do new things, and innovate in ways that working with perfect underlying systems doesn't.

Cloud

What that actually means is, if you look at serverless, as a concept, it was developed by accident. Nobody decided, let's go and design all these components to make what we now call serverless. It emerged from the failings and the successes of the technologies that went before it. The development of cloud over time, first as utility instances. Then we started doing microservices. Microservices got us comfortable with the idea of very small components. We started adding infrastructure as code in order to manage that. As we did that, we ended up with more and more infrastructure complexity: container orchestration, service discovery, circuit breakers, all of this stuff that you have to do in order to run microservices at scale, emerged. Then Functions as a Service came out as an afterthought, almost. Initially, it was about an ability to execute code without bringing up any containers or any instances. It wasn't necessarily about let's suddenly move all of our business logic into functions. It was really there as a cloud utility. AWS Lambda was released almost as an experiment to see how the users would adopt it. What happened was that everybody rushed towards it. People started using it for simple things around the edges like doing backups, cron jobs. Suddenly, they realized that it made them a lot more productive for doing a lot more than just that. At the same time, managed services grew in their numbers and in their capability, so that we could actually start ditching a lot of the code that we write, a lot of the custom code that we do to do all the undifferentiated stuff that everybody has to do when you're deploying software in the cloud.

Features of a Serverless System

The features of a serverless system then are, number one, I would say managed services. People almost equate serverless and functions, but that's not really how I perceive it. Serverless is about a wide variety of managed services. On-demand managed compute in the form of a Lambda function is one of those services, it's not necessarily the defining characteristic of a serverless system. Serverless systems are then also very much event driven. Rather than having one service call another, explicitly, and having synchronous chains of calls, and having that tight coupling between systems, we try to develop components as very isolated components, and tie them together in an asynchronous manner. We also ensure you pay only for what you use. This is a defining characteristic of serverless. Also, you don't have any idling infrastructure. If you're not using it, there's nothing running. Ultimately, what that leads to then is less code. That's a really serious goal in software development. Less code means fewer bugs. Less code means less cost of maintenance. Less code means you don't tend to have that continuation bias, when you've written a large body of code, and you feel like you put so much effort into it that you just have to keep maintaining it, keep adding to it. When you've got less code, small systems, it means you're much more comfortable with actually throwing it out and redoing it, which is a fantastic thing.

JavaScript - The Success of Node.js

How does JavaScript fit into the world of serverless, then? In many ways, JavaScript has no right to be the top language for serverless. On the face of it, it seems quite unsuitable. If we step back and look at why JavaScript is as popular as it is today on the back-end, let's look at the success of Node. Node.js was really good for developer productivity. It was a bit like Jarrett's piano, with all its flaws. It took people out of their comfort zone, but also gave them a lot of creative freedom, because it doesn't put a lot of barriers, or friction, or unnecessary ceremony in your way. Node was initially really successful because of event-driven I/O. That was its initial selling point. It was about solving the problems that people saw with Rails from before, and the inability to scale to large numbers of concurrent connections. Node allowed you to scale to large volumes, concurrent connections, and had a very elegant solution to do that, using an event loop and asynchronous event-driven I/O to manage it. Then the fact that it was on a single thread means you don't have to worry about concurrency issues, which is another piece of friction that can tie people up in knots in other languages. That helps developer productivity. Then, of course, the module ecosystem, just helped the whole platform to blossom. Developers love small components. I think we've all got tired of using large frameworks that are very unwieldy to adopt, and to manage, and to configure. The ability to pick and choose small modules makes developers productive and it gives you that control and creative freedom at the same time. It's easy to understand because each component should have a single specific purpose, easy to find, create, destroy, and replace. It means the ability to do that low cost experimentation and move very quickly.

Is JS a Good Fit For Serverless?

Why is it a good fit for serverless then? In one way, it's not necessarily. I talked about the ability for Node to scale to multiple concurrent connections. It's a bit amusing, really. In a serverless context, you have an event coming in, and one Node process handles one event at any given time. It seems completely unsuitable from that perspective. You don't even need an HTTP server in a Node context. Instead, typically with a Lambda function, you get an event in, and that event can be driven by an HTTP API on the front of it. As well, its lack of types might also make it unsuitable, because if you want to test your Lambda function in a realistic cloud environment, you have to deploy it. If you have to deploy it before you run it in order to get feedback on it, then a lack of types might mean that your feedback loop on the correctness of your code is quite slow. At the same time, Node is very fast to start. It's one of the fastest runtimes available to start. It has very good performance and low memory overhead as well. You don't have a compilation step. When you're developing these small components and iterating quickly, taking a compilation step out of it is actually a big advantage. Of course, you might have a transpilation step, which is another story altogether. You still get to take advantage of that huge module ecosystem, all of the knowledge that's out there. All of the collective knowledge that's used both on the front-end and in back-end Node development is available to you when you're deploying systems into the cloud. Specifically, with npm, the majority of those modules are very small, which makes them very well suited for deployment in a Functions as a Service environment. You've got the familiarity and the ubiquity of the language. The skills are available. Development is much the same as before, apart from things that you can now dispense with, such as your Express or your Hapi server. You don't need to bother with that. You've still got the rest of the ecosystem that you can still leverage and use in a Lambda function. It's still highly productive. The last one there I think is the most underrated actual benefit of using Node.js in back-end development, particularly in microservices, and in web applications. The best language for processing JSON is JavaScript. There's no ceremony to it. It maps perfectly to JavaScript objects. Then if you're doing the things you do very frequently in Functions as a Service, like mapping data from one format to another, or doing some translation on JSON objects, JavaScript has the lowest ceremony overhead of anything to do that.
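
As a rough illustration of that low-ceremony JSON mapping, here is a minimal sketch. The payload shape and the `toSummary` function are hypothetical and not taken from the talk; the point is only that parsing, destructuring, and reshaping need no extra types or mapping libraries.

```javascript
'use strict'

// Hypothetical incoming payload, as it might arrive in an event body.
const incoming = JSON.stringify({
  id: '42',
  owner: { first: 'Ada', last: 'Byrne' },
  entries: [
    { title: 'Milk', done: true },
    { title: 'Eggs', done: false }
  ]
})

// Parse the JSON, pull out the fields of interest with destructuring,
// and return a differently shaped plain object.
function toSummary (json) {
  const { id, owner: { first, last }, entries } = JSON.parse(json)
  return {
    listId: id,
    ownerName: `${first} ${last}`,
    total: entries.length,
    completed: entries.filter(entry => entry.done).length
  }
}

console.log(JSON.stringify(toSummary(incoming)))
```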

We know that Node is the leader in deployments of serverless functions. This study came out a couple of weeks ago from New Relic, and it shows that over 52% of their monitored Lambda functions were running in Node and Python. It's probably no coincidence that Python, another dynamic language, is in second place. This might change, actually, because of performance benefits and improvements in the other runtimes. Node is still a significant leader in this space.

Given all of that, then, should you now choose JavaScript for your serverless deployments? Not necessarily, I would say. One of the main reasons I do it actually is because that's what I was using before. It makes me very productive. If you're not a JavaScript developer, or if you hate JavaScript with a passion, as I know many people do. While you have to learn all of the other new things around serverless, there's no point in forcing yourself into adopting JavaScript just because most other people do.

What Does A Node.js Function Look Like In A Serverless Context?

Now we're actually looking at what a Node.js function looks like in a serverless context. I've got a very simple example here to start things off. I built this simple API that will find accommodation for you to stay in Ireland. I found an open public dataset, which is the CSV file of all the hotels and B&B's in Ireland. We're going to build an API to query that and serve results back. We're going to filter that dataset by county. This is the Lambda function itself. The function is a lookup function. Like all Lambda functions, the first argument it takes is the event. With JavaScript, you can use nice things like destructuring to extract fields out of that. We're taking the county out of the query string parameters. We then pass that into the findAccommodation library, and return the result, which is a JavaScript array serialized as a JSON array, along with a 200 status code. Then the actual implementation itself is an interesting, quirky one I picked for this example. It's not necessarily how you would do it in production. What we're doing is we want to filter rows out of the CSV file. In real life you might use a database. In this case, I'm actually fetching the object from S3. The CSV file just sits in a location on an S3 bucket. I'm using something called S3 Select, which has an ability to query CSV files, or Parquet files, or JSON files in S3 and have that query performed within the bucket itself so that you don't have to transfer the dataset back. This CSV file is quite small. It's like half a megabyte or something. S3 Select is really useful if you've got big volumes of data, and you're trying to pull data back for processing. We initialize the AWS SDK, the JavaScript SDK. We create an S3 client. Then we're invoking SelectObjectContent with a SQL query. That gives us back a stream. It gives us back a Node.js stream, and we're collecting the data in that stream, parsing it, and returning it as a JavaScript array. That's all of the code. This is how we deploy it. This is the Serverless Framework we're using in this case. It's the most popular framework by far for deploying serverless applications. There are many other alternatives.
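
The handler described above might look roughly like the following sketch. The slides are not part of this transcript, so this is a reconstruction under assumptions: `findAccommodation` here is a hypothetical in-memory stand-in for the S3 Select-backed library, the sample records are invented, and the 400 response for a missing county is an addition for illustration.

```javascript
'use strict'

// Hypothetical in-memory stand-in for the talk's findAccommodation library,
// which queries a CSV file in S3. The records below are made up.
const SAMPLE_DATA = [
  { name: 'Harbour View B&B', county: 'Cork' },
  { name: 'Lakeside Hotel', county: 'Kerry' },
  { name: 'City Lodge', county: 'Cork' }
]

async function findAccommodation (county) {
  const wanted = String(county).toLowerCase()
  return SAMPLE_DATA.filter(row => row.county.toLowerCase() === wanted)
}

// Lambda handler: the first argument is the triggering event.
// For an API Gateway proxy event, query string values arrive in
// event.queryStringParameters, which is null when no query string is sent.
async function lookup (event) {
  const { county } = event.queryStringParameters || {}
  if (!county) {
    // Assumption: the talk does not say how a missing county is handled.
    return { statusCode: 400, body: JSON.stringify({ error: 'county is required' }) }
  }
  const results = await findAccommodation(county)
  return { statusCode: 200, body: JSON.stringify(results) }
}

// In a real handler module this function would be exported,
// for example with: module.exports = { lookup }
```

Calling `lookup({ queryStringParameters: { county: 'Cork' } })` resolves to an object with a 200 status code and the matching records serialized in `body`.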

At the bottom, we've defined our function. It's a lookup function. It can be found in the handler module. The function itself is called lookup. We're passing in some environment variables. Then we're specifying what events trigger this function. In this case, it's an API Gateway with an HTTP GET request. At the top, then, we're defining our general AWS cloud provider configuration. We're saying this is going to be a regional endpoint. It's going to be optimized generally for people within the EU region. We've turned on tracing for both the API and the Lambda function. Serverless Framework is going to create CloudWatch log groups, so all our logs go in there. We're setting the retention on that to 7 days, because it costs money, and you don't want to keep them forever unnecessarily. Then the most important part probably is that we're defining our principle of least privilege permissions, which is this function only needs to access one object in one bucket. That's all it has access to. We run serverless deploy against that environment. Then we get an API endpoint. We can make GET requests against that, and it returns our results.

Some of the things you get with that deployment then. I'm showing a few examples of the things that you automatically get when you deploy a simple function like that. On the top left, we have the API logs in CloudWatch. On the bottom, you get the distribution of the response times in your API so you can map how long your requests are taking. On the top right, we've got the service map. This is coming from X-Ray. It's basically building a map of our components and showing the traffic, and the data as it flows through the system along with the response times. That's quite a simple example. As with all simple examples, they're very misleading.

Challenges

When you go to deploy serverless at scale in production, there's a whole other set of problems. It's a very different way of thinking than building software using other paradigms. Different to monoliths, also different to building microservices. There's a learning curve there. It's still quite early days, so best practices have yet to be established. It's a moving target. It's evolving fast all the time. That can be a benefit as well as a challenge. With that comes then organizational change because expertise in the particular cloud provider you're using is really essential in your team. That means potentially changing the structure of your teams and ensuring you've got good DevOps practices. There's hundreds of cloud resources. There's complexity in configuring each one of them, and that knowledge has to be there somewhere.

Serverless Adoption Rollercoaster

As with adopting any new technology, you've got this rollercoaster effect where you start off and everything seems wonderful when you deploy your first system, and you're delighted with yourself. You think you're invincible. Then you deploy to production and other real-world effects start happening. Then you've got this despair, and you realize how difficult things are to actually optimize. You discover that every single service you use actually has failure modes, and you have to understand those failure modes and react to them. That's a rollercoaster, and it's inevitable. At our company fourTheorem, we've been doing serverless systems for a while now and we've ridden that rollercoaster a few times, and managed to smooth out the bumps over time. One of the ways we did that was by taking all of the learnings we had and putting it into a template project. We're saying that one of the great benefits of serverless is that you've got creative freedom. That means you've got a lot of decisions to make in terms of how you assemble your system, how you put things together. It's actually quite daunting at the start, and it's something that can slow you down. We decided to take this opinionated approach for 80% of those decisions to allow us to bootstrap these projects quicker. We also decided then to open source this and make it available to everybody, and try and make it replicate a production environment, as close to a production environment as we could.

SLIC Starter

These are some of the things that the project aims to provide an answer for. These are all considerations you need to make at one stage or another when you're building production grade serverless applications. There's quite a lot in there. The project, which is called SLIC Starter has an answer for an element of each one of these things. You can find it on GitHub. You can search for it just by SLIC Starter. You can go to slic.app, and it will take you straight to the GitHub page. The application itself, it comes with a full demo application with a front-end and a back-end, and you can deploy it and use it. It's an application for managing checklists. You can create checklists. You can sign up for an account, create lists, create entries in those checklists, and mark them off. The intention is basically to provide you with an application that you can quickly disassemble, start adding your own features to, and really just pick and choose the components that are relevant to you. You can also just take it and just use it as a learning resource. You don't necessarily have to base your project on that. A typical workflow is that people will take the repo, clone this or import the code into their own repo and start chopping bits out and adding new features in. You can use all the features that were there just as an example, if you like. On the front-end side, you can sign up, log in, create and manage these checklists. Go in, edit the checklists, check them off. There's some event-driven flows in there as well, like when you create a list, it will send you an email congratulating you for creating a list. It targets AWS only, right now, just because it's easier to maintain. It's what we deal with most of the time. It's easier for us to be productive on the project by just building on top of AWS.

The Architecture of SLIC Starter

This is what the architecture diagram for the whole project looks like. There's about seven or eight services in the project. They're all single responsibility, single purpose components. For example, you've got a service dedicated to sending that welcome message. That reacts in an asynchronous way to an event. When we're building events into serverless applications, generally, we try to leverage cloud managed event services for everything, not just events. In terms of AWS, that generally boils down to SQS, or SNS, Kinesis, and EventBridge. Basically, for point-to-point communication, we'd use SQS. The email service accepts messages from other services on a queue, and it owns that queue. Once a service has dropped a message to be delivered onto that queue, it's essentially guaranteed to be delivered at some point in the future. When you've got Pub/Sub messaging, we tend not to use SNS so much anymore, but use EventBridge instead. The difference with EventBridge, is that EventBridge doesn't have to belong to any specific service. It's always there. You can create events and push them onto the Event Bus. You don't have to do anything with them until you're ready to consume them. Then other services will use essentially pattern matching to pick up those events and react to them. It's very powerful, and very flexible.

In this case, the checklist service just emits lifecycle events. Somebody has created a list, somebody has deleted a list. If other services want to react to those lifecycle events, they can say, I want to listen to all events relating to checklists, where the event type starts with created, for example. They essentially create that pattern matching rule. Then the service responds to that and it triggers a Lambda function. The other option, then, that I referred to is Kinesis. If you're doing high volume, real-time events, where you want a time-ordered stream of events at high throughput rates, that's where you use Kinesis. Much in a similar way to where you would use Kafka. I suppose it's a lot simpler and more bare bones than Kafka.
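
To make the pattern matching idea concrete, here is a simplified sketch of content-based matching. It is not EventBridge's implementation and supports only exact values and `{ prefix }` matchers; the source name `checklist-service` and the event type strings are hypothetical.

```javascript
'use strict'

// Simplified illustration of content-based matching. This is NOT how
// EventBridge is implemented; it handles only exact values and { prefix }.
function matchesValue (candidate, actual) {
  if (candidate !== null && typeof candidate === 'object' && 'prefix' in candidate) {
    return typeof actual === 'string' && actual.startsWith(candidate.prefix)
  }
  return candidate === actual
}

function matchesPattern (pattern, event) {
  return Object.entries(pattern).every(([field, expected]) => {
    const actual = event === null || event === undefined ? undefined : event[field]
    if (Array.isArray(expected)) {
      // An array lists alternatives: any one of them may match.
      return expected.some(candidate => matchesValue(candidate, actual))
    }
    // A nested object descends into the corresponding event field.
    return actual !== null && typeof actual === 'object' && matchesPattern(expected, actual)
  })
}

// Hypothetical rule: checklist events whose type starts with "created".
const rule = {
  source: ['checklist-service'],
  'detail-type': [{ prefix: 'created' }]
}

const createdEvent = { source: 'checklist-service', 'detail-type': 'created.list', detail: { listId: 'abc' } }
const deletedEvent = { source: 'checklist-service', 'detail-type': 'deleted.list', detail: { listId: 'abc' } }

console.log(matchesPattern(rule, createdEvent), matchesPattern(rule, deletedEvent))
```

The consuming service owns the rule, so the checklist service can keep emitting events without knowing who, if anyone, reacts to them.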

Typically, the way we would deploy this is using the best practices of separating each environment into separate accounts. Separate one for staging, development accounts, and then your tooling account. You can also deploy this into a single account just to get going and see how it performs. The services themselves are deployed using the serverless framework. Then for some elements, we use something called the CDK, which is really interesting. In AWS, there's a couple of different ways to create the resources you're using. One is just clicking around through the console. It's not a very scalable or reliable way to do it. You can also use the command line, or the SDK to create things. Then there's CloudFormation. Realistically, if you're doing an AWS-only application, CloudFormation is really the way to do it. You can use other things like Terraform. The advantage with CloudFormation is it collects all of your resources that belong together into a single stack, which is deployed and rolled back as a unit. That's quite powerful when you're managing deployments. You do that in JSON or YAML. This is just an example for an S3 bucket. As your deployment grows, the amount of YAML you have to maintain becomes very unwieldy and very difficult to maintain. It's also very error prone, you've no validation in there. That's where the CDK comes in.

Creating CI/CD Pipeline Resources Using TypeScript

This is an example of some TypeScript we use to create some of the resources for the CI/CD pipeline. CDK is written in TypeScript itself, but they've got a tool that actually creates bindings for Java, C#, and Python. The great benefit of it is that it gives you a type-safe, programmatic, imperative way to build your resources. If you've got any dynamic behavior within your resources you can do for loops. You can do if statements, and you can create resources conditionally. In this case, we're actually creating our deployment pipeline. We're saying, let's create a pipeline for each module, which is a dynamic array of modules, because you can add modules to your system, and create each one for both staging and production. Then you don't have to repeat that code. If you're doing it in YAML, you'd probably have to do copy paste, or some ugly hack to get that to be deployed. The really nice thing then is that you don't have to switch between CloudFormation documentation and your editor. Everything leverages the autocomplete and the in-line documentation of your editor. It's a much more productive way to build resources. Even if you ultimately want to use YAML as the source of truth for your cloud resources, you could use CDK to generate that in the first instance, and maintain it thereafter.
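
The loop-and-condition style described above can be sketched without the CDK itself. In this illustration, plain objects stand in for CDK constructs, and the module names, stage names, and the `manualApproval` flag are all hypothetical rather than taken from SLIC Starter.

```javascript
'use strict'

// Hypothetical module and stage names; real values would come from the project.
const modules = ['checklist-service', 'email-service', 'frontend']
const stages = ['stg', 'prod']

// Plain objects stand in for CDK constructs here. With the CDK, each
// iteration would instantiate a pipeline construct instead.
function definePipelines (moduleNames, stageNames) {
  const pipelines = []
  for (const moduleName of moduleNames) {
    for (const stage of stageNames) {
      const pipeline = { name: `${moduleName}-${stage}-pipeline`, moduleName, stage }
      // Conditional resources are ordinary if statements.
      if (stage === 'prod') {
        pipeline.manualApproval = true
      }
      pipelines.push(pipeline)
    }
  }
  return pipelines
}

const pipelines = definePipelines(modules, stages)
console.log(pipelines.map(pipeline => pipeline.name))
```

Adding a module to the array yields its staging and production pipelines with no copied definitions, which is the repetition that hand-written YAML tends to force.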

Continuous Deployment

Continuous deployment is essential in serverless. In a lot of environments, you can't really get away without it these days, but particularly so in serverless, when you're talking about having lots of small units of deployment. This is a graphical architecture diagram for the CI/CD pipeline we generate using TypeScript. It creates the module pipeline for each module in your service, and then creates an orchestrator pipeline to manage all of them. We typically deploy that from a monorepo, for simplicity and to avoid the overhead of managing multiple repositories, which can introduce its own troubles. The way we do that is by having a change detection script at the start of the pipeline, which figures out which modules have changed. It is then able to run each of those modules in parallel.
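
A change detection step like the one described could look roughly like the following. This is a minimal sketch under stated assumptions, not the script used in SLIC Starter: it assumes each module lives in its own top-level directory of the monorepo, that the list of changed file paths has already been obtained (for example from a diff against the last deployed commit), and the function and module names are made up for illustration.

```typescript
// Hypothetical sketch of monorepo change detection.
// Assumption: every module is a top-level directory in the repository.
function detectChangedModules(changedFiles: string[], moduleDirs: string[]): string[] {
  const knownModules = new Set(moduleDirs);
  const changed = new Set<string>();
  for (const file of changedFiles) {
    // "checklist-service/handler.js" -> "checklist-service"
    const [topLevelDir = ""] = file.split("/");
    if (knownModules.has(topLevelDir)) {
      changed.add(topLevelDir);
    }
  }
  // Return the changed modules in their declared order.
  return moduleDirs.filter((dir) => changed.has(dir));
}

const changedModules = detectChangedModules(
  [
    "checklist-service/handler.js",
    "checklist-service/package.json",
    "README.md",
    "frontend/src/App.js",
  ],
  ["frontend", "checklist-service", "user-service"]
);
console.log(changedModules);
```

The resulting list is what an orchestrator would fan out over, starting one module pipeline per entry and skipping modules that did not change.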

Observability

One of the other things I wanted to talk about and demonstrate in a serverless context is observability. To clarify the difference between monitoring and observability, monitoring is about having insight into your system so that you can detect when known problems occur. Observability is about having the outputs from your system rich enough so that you can answer any arbitrary question about your system, including asking questions when unknown problems occur. What that means is collecting very rich structured data throughout your system and having it correlated to events.

A very simple first step in that is producing structured logs. We use Pino for that. I think it's the best fit in JavaScript for creating structured JSON logs. It's very simple. We also integrate Middy, which is a very simple middleware framework for serverless that allows you to automatically add before and after handlers to your Lambda functions. It can log your events. It can log errors automatically for you, and do a lot of other cool stuff. It integrates into the logger to make sure you can get info-level logging in production, but maybe also some sampling on debug logs too.
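
To make the idea concrete, here is a hand-rolled sketch of structured JSON logging with sampled debug output, plus a wrapper that adds before, after and error logging around a handler. It deliberately does not use Pino or Middy, so none of the names below are part of either library's API; `createLogger`, `withLogging` and the 1% sample rate are assumptions for illustration only.

```typescript
// Hand-rolled illustration only: this is not the Pino or Middy API.
type Level = "debug" | "info" | "error";
type Fields = Record<string, unknown>;

interface Logger {
  debug(msg: string, fields?: Fields): void;
  info(msg: string, fields?: Fields): void;
  error(msg: string, fields?: Fields): void;
}

// Info and error entries are always written. Debug entries are written only
// if this logger instance was sampled (decided once, at creation time).
function createLogger(sampleDebug: () => boolean, write: (line: string) => void): Logger {
  const debugSampled = sampleDebug();
  const log = (level: Level, msg: string, fields: Fields = {}): void => {
    if (level === "debug" && !debugSampled) return;
    write(JSON.stringify({ ...fields, level, time: Date.now(), msg }));
  };
  return {
    debug: (msg, fields) => log("debug", msg, fields),
    info: (msg, fields) => log("info", msg, fields),
    error: (msg, fields) => log("error", msg, fields),
  };
}

type Handler<E, R> = (event: E) => Promise<R>;

// Wraps a handler with "before" and "after" logging, plus error logging.
function withLogging<E, R>(handler: Handler<E, R>, logger: Logger): Handler<E, R> {
  return async (event: E): Promise<R> => {
    logger.info("event received", { event });
    try {
      const result = await handler(event);
      logger.debug("handler succeeded", { result });
      return result;
    } catch (err) {
      logger.error("handler failed", { error: err instanceof Error ? err.message : String(err) });
      throw err;
    }
  };
}

// Example: sample debug logs for roughly 1% of logger instances.
const logger = createLogger(() => Math.random() < 0.01, (line) => console.log(line));
const handler = withLogging(async (event: { id: string }) => ({ ok: true, id: event.id }), logger);
handler({ id: "abc" }).catch(() => undefined);
```

Every line written is a single JSON object, which is what makes the logs easy to query and aggregate later.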

Then we collect those in a centralized log repository. You can use a third-party repository, but CloudWatch Logs Insights is automatically provided for you. You pay for it by query, which is the one thing you have to understand about it: you pay by the amount of logs you scan. It's a very powerful way to extract data from structured logs and do aggregations on them.

Then coming down to metrics. One of the trade-offs that's really important with serverless is that when you deploy and use a service, you have to understand how it works and what its failure modes are. What that means is that for each service, you should have a look at the metrics, and what each one of those metrics means. If you're going to deploy a critical application in production, you have to really understand what these are, particularly the important ones. For example, if you've got a Lambda function that's triggered by a Kinesis stream, looking at the iterator age will tell you how far behind the latest event your Lambda function has become. If you need to have events processed within 2 seconds, and you find that the iterator age has become 10 seconds, then you've got a problem. You can go through all of these metrics, and decide what level is comfortable for us, and when do we need to start creating alarms, and alerts, and responding, and how are we going to respond?
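
The iterator age example can be expressed as a simple threshold rule. The sketch below is a hypothetical evaluator written for illustration, not an AWS API: it assumes the datapoints are already ordered in time, and the `consecutive` parameter (how many breaching datapoints in a row are needed before alarming) is an assumption added to avoid reacting to a single spike.

```typescript
// Hypothetical alarm rule: fire when the iterator age stays above the
// threshold for a number of consecutive datapoints.
interface IteratorAgeDatapoint {
  timestamp: number; // epoch milliseconds
  iteratorAgeMs: number;
}

function shouldAlarm(
  datapoints: IteratorAgeDatapoint[],
  thresholdMs: number,
  consecutive: number
): boolean {
  let breaches = 0;
  for (const point of datapoints) {
    // Any datapoint at or below the threshold resets the run.
    breaches = point.iteratorAgeMs > thresholdMs ? breaches + 1 : 0;
    if (breaches >= consecutive) return true;
  }
  return false;
}

// Events must be processed within 2 seconds; the stream has fallen ~10 seconds behind.
const laggingStream: IteratorAgeDatapoint[] = [500, 1200, 10000, 10000, 9000].map(
  (iteratorAgeMs, i) => ({ timestamp: i * 60000, iteratorAgeMs })
);
console.log(shouldAlarm(laggingStream, 2000, 3));
```

Choosing the threshold and the number of consecutive breaches is exactly the "what level is comfortable for us" decision described above.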

Application and Service Metrics

Beyond those service metrics, then, you should also look at your business metrics, and what business metrics you need to capture in the system too. That could be the number of play button clicks on a video service. It could be the number of abandoned carts in an e-commerce application, or the number of products added to a cart. Every business will have a baseline expected level of behavior for those events. By creating those business metrics, you can actually correlate them with your service metrics and understand your system's behavior. Apart from just looking at the technical details of the services, you can say, if the number of products added to a cart is showing an anomaly, if it's falling outside of the normal range, then we need to do something about it. That's a good way to separate your observability from the code-level metrics.

This is one really neat way of adding metrics. This is a fairly new feature. Most of the metrics we collect use CloudWatch Metrics. You can put metrics into CloudWatch Metrics. Then it will allow you to build dashboards on them. This is a very low overhead way of doing it. We're using a Node module that AWS released for creating metrics logs. What that does is it creates a structured JSON payload with the metric in it. The CloudWatch service will then asynchronously parse that and create a metric for it. It doesn't affect the performance of your application. It's also nice because you can then see it in your logs, and you know exactly what's happening.
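
To show what such a structured metric log line looks like, the sketch below builds the payload by hand instead of using the AWS module mentioned above. The layout follows the CloudWatch Embedded Metric Format as I understand it (an `_aws` metadata object describing the namespace, dimensions and metric definitions, with the actual values as top-level fields); the namespace, dimension and metric names are hypothetical.

```typescript
// Hand-built illustration of an embedded metric log entry.
// In practice the AWS-provided module would produce this for you.
function buildEmbeddedMetric(
  namespace: string,
  dimensions: Record<string, string>,
  metricName: string,
  unit: string,
  value: number,
  timestamp: number = Date.now()
): Record<string, unknown> {
  return {
    _aws: {
      Timestamp: timestamp, // epoch milliseconds
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          // One dimension set, listing the names of the dimension fields below.
          Dimensions: [Object.keys(dimensions)],
          Metrics: [{ Name: metricName, Unit: unit }],
        },
      ],
    },
    // Dimension values and the metric value live at the top level of the entry.
    ...dimensions,
    [metricName]: value,
  };
}

// Hypothetical business metric: number of words in a checklist entry.
const metricEntry = buildEmbeddedMetric(
  "SlicStarterExample",
  { ServiceName: "checklist-service" },
  "EntryWordCount",
  "Count",
  42
);
// Writing the entry to standard output is all the function has to do.
console.log(JSON.stringify(metricEntry));
```

Because the metric is just a log line, emitting it adds no network call to the request path, and the same entry remains visible and queryable in the logs.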

This is an example from the SLIC Starter application where we're tracking the number of items and checklists, and the number of words in each entry. Then you can build up statistics and check what they're like, like the average number of words or the 95th percentile.

Then there's doing queries in CloudWatch Logs Insights, which is very powerful. One of the things you get in every Lambda function's log is a report at the end that tells you how much memory your function used, how much memory you had provisioned, and how long it took to execute. Using the query mechanism in CloudWatch Logs Insights, you can actually collect all of those report metrics from every Lambda function invocation over a period of time. You can parse them, and then build statistics on them. In this case, this is a query that we use to figure out how to optimize our Lambda functions. When you create a Lambda function, you assign a certain amount of memory from 128 Megs all the way up to a maximum of 3008. When you do that, the CPU and the IOPS of the function are allocated in proportion to the memory. If you need more compute power, you need to increase the memory. In this case, we've provisioned 976 Megs, but the maximum that's used over this period of time is about 166 Megs. We've over-provisioned significantly. We can look at this and say, 95% of functions executed within 100 milliseconds, and 100 milliseconds is the billing unit in Lambda. If we make it slower then we might end up paying twice, because they round up in billing units. You could also say, actually, that function, maybe it's more I/O bound, maybe it's just doing HTTP requests and waiting, so it doesn't need that extra compute. We could reduce this down to a 256 Meg function. It might execute in the same amount of time, and we pay much less for it.
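
The same analysis can be sketched outside CloudWatch by parsing the report lines directly. The code below is an illustration under assumptions: the regular expression matches the `REPORT` line layout as Lambda commonly writes it, the sample lines are invented, the percentile uses the nearest-rank method, and the 100 millisecond billing unit is taken from the talk and left as a parameter so it can be changed.

```typescript
interface LambdaReport {
  durationMs: number;
  billedDurationMs: number;
  memorySizeMb: number;
  maxMemoryUsedMb: number;
}

const REPORT_PATTERN =
  /Duration: ([\d.]+) ms\s+Billed Duration: (\d+) ms\s+Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB/;

// Returns null for log lines that are not REPORT lines.
function parseReportLine(line: string): LambdaReport | null {
  const match = REPORT_PATTERN.exec(line);
  if (match === null) return null;
  return {
    durationMs: Number(match[1]),
    billedDurationMs: Number(match[2]),
    memorySizeMb: Number(match[3]),
    maxMemoryUsedMb: Number(match[4]),
  };
}

// Nearest-rank percentile: the smallest value with at least p% of values at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

// Durations are rounded up to the next billing unit.
function billedDurationMs(durationMs: number, billingUnitMs: number = 100): number {
  return Math.ceil(durationMs / billingUnitMs) * billingUnitMs;
}

// Billed compute in GB-seconds, which is what the price is applied to.
function billedGbSeconds(memoryMb: number, durationMs: number, billingUnitMs: number = 100): number {
  return (billedDurationMs(durationMs, billingUnitMs) / 1000) * (memoryMb / 1024);
}

// Invented sample log lines for illustration.
const sampleLines = [
  "REPORT RequestId: 1a2b\tDuration: 85.20 ms\tBilled Duration: 100 ms\tMemory Size: 1024 MB\tMax Memory Used: 166 MB",
  "REPORT RequestId: 3c4d\tDuration: 42.00 ms\tBilled Duration: 100 ms\tMemory Size: 1024 MB\tMax Memory Used: 150 MB",
  "START RequestId: 5e6f Version: $LATEST",
];
const reports = sampleLines
  .map(parseReportLine)
  .filter((r): r is LambdaReport => r !== null);

console.log("p95 duration (ms):", percentile(reports.map((r) => r.durationMs), 95));
console.log("max memory used (MB):", Math.max(...reports.map((r) => r.maxMemoryUsedMb)));
console.log("GB-seconds at 1024 MB:", billedGbSeconds(1024, 85.2));
console.log("GB-seconds at 256 MB:", billedGbSeconds(256, 85.2));
```

If the function really is I/O bound and still finishes inside the same billing unit at 256 MB, the billed GB-seconds fall to a quarter. If the smaller size pushes the duration just past a billing unit boundary, the rounding shown in `billedDurationMs` is what can double the billed time.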

Distributed Tracing

I showed a brief example of distributed tracing. The idea here is because you've got all these loosely coupled components, you need to be able to make sense of them in some way. Distributed tracing allows you to monitor the emergent architecture of the system. This is how you turn it on in the Serverless Framework. Then the code example shows how you use the X-Ray SDK to wrap the AWS SDK. What emerges then is a service map, where you can see all of the traffic and data flow through your system. If you've used a system like Zipkin, or OpenTracing, it's the same idea here. You can also annotate each trace. You can put in business annotations on traces, and then search for traces relating to a specific order or a specific business event. You can look at the performance characteristics of them. You can look at whether any errors have occurred. It's really quite powerful. You can also correlate them against the logs that happened at the same time and the metrics that were relevant at the same time.

Resources

Myself and Peter, we've written a book on building serverless applications in AWS. It uses a lot of AI managed services. It's particularly about building AI enabled applications. It covers a lot of these serverless topics. There's a chapter in it all around building observability capability into serverless applications in AWS. I think we have some free code. If anybody's interested in knowing more about that, I'd be happy to share with you. Let me know afterwards or send me a tweet, or an email, and that'll be no problem.

Wrap-up

Serverless isn't about being a perfect system. It's about being productive and agile. I think it helps when you understand that you don't have to look for perfection, and that you can get outside your comfort zone. There's definitely plenty of capability to do that in serverless, getting outside your comfort zone. It gives you a lot of freedom. It also puts constraints in your way, so you have to find creative ways around them. There are plenty of pitfalls. If you check out SLIC Starter, hopefully, it'll help you on your way. Please do feel free to contribute, send us critique, and pull requests, and open issues. We'd love to see them.

Recorded at:

Jun 17, 2020

Eoin Shanaghy
