
With Great Scalability Comes Great Responsibility


Summary

Dana Engebretson covers the contextual pros and cons of a number of architectural patterns given real world scalability constraints; from orchestrating Lambdas with AWS step functions to multiprocessing with S3 triggers to rate limiting with queues like SQS.

Bio

Dana Engebretson is a Performance Engineer at SPS Commerce, where she uses her background in Data Science to analyze software performance of distributed systems. She is also the founder of PyLadies Twin Cities.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Thank you. I'm so glad that you came, after a long conference, it is difficult to stay awake. I appreciate that. I wanted to ask you, how many of you have built something with serverless, can you raise your hands for me? A lot of you, over half for sure. And how many of you have built a data pipeline with serverless? Okay, less. So that's around 10 or 20 percent, or so. Great.

And then, the next question I have for you is, how many of you recognize this? Yay, so many of you don't, which makes me really sad. There's half of you, half of you, your homework assignment is to watch Wonder Woman after this talk, it is amazing. And I hope I don't have spoiler alerts in here, but we'll see.

Right now, I want you to imagine that you are the queen of the Amazons. Okay? And these warriors are out in production working, and you want to make sure that they stay healthy. You want to monitor them in some way and analyze their health over time, as changes happen.

So I'm Big Dana on Twitter; if you have questions throughout this talk, you can tweet at me. I will give my background before we move into this problem that we're trying to solve. I worked on wind turbines for a while, keeping them healthy; I saved some in Hawaii that were on lava ash. I run the PyLadies Twin Cities group. Then I moved to data classification on food terms, trying to classify food terms given a product description. And now, in performance engineering, I think of myself as an Amazon queen who is trying to evaluate all of our performance in production. I'm a performance engineer at SPS Commerce right now.

Building a Data Pipeline

And so I'm going to walk through how I got to the analysis part. I had to build a data pipeline first -- we send all of our data to a vendor that manages our monitoring services for us, and I needed to get that data back from them. So I needed to build a data pipeline in order to analyze it. And the analysis is things like how many requests went through the service, the duration of the requests, and the errors.

And so I'm going to walk through today the different attempts I made to get to a successful data pipeline. First, I did Python multiprocessing on an EC2 instance, then I moved to a spot instance, then I tried lambdas orchestrated by step functions, then S3 event triggers, and then I came to a successful solution: rate limiting with queues and CloudWatch-triggered lambdas.

And back to the problem. So, we're going to consider this vendor of ours, this monitoring vendor, the Oracle. The Oracle has all of the information that we want, and we want to ask her for it. But it is not our service, so we can't communicate with her directly. I'm not able to just ask her for all of the data; I have to communicate with her through Hermes, the messenger god, which is our vendor's API. So let's look at an ideal version of communicating with this API, versus the reality.

In an ideal world, I can just ask, Hermes, uh, fetch me all of the data for all of my devices, for all of my Amazon warriors, please. And Hermes would go to the database and come back and say, here you go, here is all your data. But, in reality, it is not that easy. There's a number of questions I have to ask in order to get the data. Okay, so I first say, Hermes, fetch me all of my Amazon warrior troops, please, which are grouped by microservices that we define, these groups.

And so he goes to the database and comes back and says, here are all of your Amazon warrior troops. And then I have to ask, for troop one, fetch me my Amazon warriors, please, which are the devices that are out there. So he goes and gives me these Amazon warriors. And now I have to ask, for Amazon warrior one, fetch me my data groups, please. So he goes to the database and comes back. And he gives me my data groups for that warrior. Here are your data groups. And then I say, okay, Hermes, for Amazon warrior one, and for group -- data group one, fetch me the data, please. Finally, I can get the data for one device and one data group of that device.

And this results in 35,000 to 65,000 API calls in order to get the data to analyze on our own services. So we are sending data to this vendor for monitoring, and then I'm making all of these calls to get the data back, in order to do some sort of analysis on our load tests and things. So, wow. I'm from Minnesota. Okay, so I had to ask myself, how should I write the code to communicate with Hermes?
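In code, the call chain described here looks roughly like the sketch below. The base URL and endpoint paths are made-up placeholders, not the vendor's real API; the point is the nesting that multiplies into tens of thousands of requests.

    import requests

    BASE = "https://vendor.example.com/api"  # hypothetical base URL, not the real vendor API

    def fetch_everything():
        # troops -> warriors -> data groups -> data: four nested levels of calls
        for troop in requests.get(f"{BASE}/troops").json():
            warriors = requests.get(f"{BASE}/troops/{troop['id']}/warriors").json()
            for warrior in warriors:
                groups = requests.get(f"{BASE}/warriors/{warrior['id']}/data_groups").json()
                for group in groups:
                    data = requests.get(
                        f"{BASE}/warriors/{warrior['id']}/data_groups/{group['id']}/data"
                    ).json()
                    yield warrior["id"], group["id"], data

    # ~400 troops x dozens of warriors x a few data groups each adds up to
    # 35,000-65,000 requests per run.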

Python Multiprocessing on an EC2 Instance

And this was my initial attempt: Python multiprocessing on an EC2 instance. I mean, to be fair, my very first attempt was running it on my computer -- I had to leave it on, and it would go, brrrrrr, overnight while I fell asleep. So I put it on an EC2 instance. It was easy to implement, but it was very expensive. I spent, or my company spent, $700 on the EC2 instance that I picked, and the process only took eight hours a day, and I was not shutting it off, so it was really inefficient. So I thought, well, maybe we can do horizontal scaling with Kubernetes or ECS or something, but that seemed like overkill for this data pipeline, and like a lot of ongoing maintenance, so let's try to come up with something cheaper. The next iteration was just to shut the instance off when it was done. That was cheaper, but it wasn't cheap enough; it was still pretty expensive.
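A minimal sketch of that first approach, assuming a hypothetical fetch_troop function that does the nested API calls for one troop, fanned out across processes on the EC2 instance:

    from multiprocessing import Pool

    def fetch_troop(troop_id):
        # stand-in for the nested vendor API calls for one troop
        return troop_id

    if __name__ == "__main__":
        troop_ids = list(range(400))        # roughly the number of troops in the talk
        with Pool(processes=16) as pool:    # sized to the cores on the instance
            results = pool.map(fetch_troop, troop_ids)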

Python Multiprocessing on a Spot Instance

And then, so let's move on. I tried spot instances, where you bid on spare capacity, so they are much cheaper. But the catch is, if the spot price becomes higher than what I'm willing to pay, my bid price, then the instance shuts down, and then I have to decide: when it starts up again, do I want to start the whole process of calling the APIs over, or do I want to save the state of where I was when it shut off? Oh, this is gross. And I didn't want to do that.

So I thought, maybe I will try this new thing called serverless. And so, how many of you have used something in the cloud, like EC2, spun up your own VM? So this isn't new. I want to walk you through how I think about serverless and why I want you to try it. Owning your own server was like owning the family car: you have to maintain it yourself. Then it became easier -- you can lease the car and always have the latest version for a couple of years. You still have to maintain it, but that is like using the cloud in that sense; you are not running it in your basement.

And then this thing came along called spot instances, which is similar to renting a car for the weekend. And now you have services like Car2Go, where you can literally say, I have one task, I need to go to the store, and I will rent a car and pay by the minute. That is like lambdas, where you pay by the second and it does one task. It seems like a powerful transition the industry is going into, and I wanted to be part of that.

Lambdas Orchestrated by Step Functions

So we will move to my first attempt at a serverless data pipeline: lambdas orchestrated by step functions. Okay, how many of you have implemented a lambda? So many of you, that's, like, over half I want to say. I don't know, I want to sit and count you. So you know that there are some things to take into account, like, it has to run under five minutes, or within this amount of memory, and there's only so much compute it has. There are other boundaries I ran into, so maybe this will help you.

So one is the dependency file size when you zip your lambdas. I don't know if the lambdas that you put out there have external dependencies, or if you just had some -- but I will walk through the process. So this is my make file; I would use Docker to build, and I will explain this. Why did I use Docker? The first thing that I ran into: I was trying to install, I think it was pandas, a Python library, and when I installed it on my Mac machine, zipped it up, and gave it to the lambda, it ran locally but not in the lambda, because the lambda runs on a Linux machine. I will speak to that later.

And so anyways, I'm using Docker, and in this process I build the Docker image, I run it, and this libs part is where I install the libraries. I was using Conda because I'm in the data science area, where I had data science packages like pandas and things I wanted to work with. And then this process zips all of the files, puts them together with the code, and packages it for lambda. And this is super small, but basically all of this is zipped at the root. You don't zip a folder that has this stuff in it; you zip it so it is at the root level. This is all of the dependencies for requests, and then here, you have the actual lambda code right there for get data.

So I wanted to convert my data into Parquet and use a library called fastparquet, and that was too big and would not fit into the lambda, so that's a process I have to do outside of lambda. So something to think about. There are other things I ran into. When I was writing data to S3, the sample code I found online would write a file to the temp directory of the lambda -- the temp directory is the only place you can write data inside the lambda -- and then from there it grabs the file and sends it to S3. You will have a limit on how much you can write inside the lambda before it has to go into S3. There is also a concurrency limit per account: you can run 1,000 lambdas at the same time. I was in the dev account, and I'm glad I was not in prod, because other services were affected by me spinning up a lot of lambdas.
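A minimal sketch of that write-to-temp-then-upload pattern, assuming a hypothetical bucket name and key; /tmp is the only writable path inside a Lambda and is itself capped in size:

    import json
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        local_path = "/tmp/data.json"       # only /tmp is writable inside the lambda
        with open(local_path, "w") as f:
            json.dump(event, f)
        # then ship the file off to S3 so nothing is lost when the lambda dies
        s3.upload_file(local_path, "my-data-bucket", "raw/data.json")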

And then here are some other things to consider, boundaries you might run into with your specific use case. And I mentioned this, that lambda runs on a Linux box. So I wanted to make the point that you should always pip install your dependencies inside a Linux container; that is where Docker can be helpful if you don't have a Linux machine, if you are using Mac or Windows.

So great, maybe you have made a lambda, but maybe you are trying to level up and you want multiple lambdas that work together in some fashion, so you want to orchestrate those. This is where step functions comes in. It is a workflow management service; you are essentially writing a state machine in code. The simplest example is: the first, blue lambda will run, and you define in the state machine, okay, run the blue lambda, then give the output to the red lambda, then run the red lambda, then give the output to the yellow lambda, and run that process.
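A minimal sketch of that blue-red-yellow chain as an Amazon States Language definition created with boto3; the function ARNs, role ARN, and state machine name are placeholders, not from the talk:

    import json
    import boto3

    definition = {
        "StartAt": "Blue",
        "States": {
            "Blue":   {"Type": "Task", "Resource": "arn:aws:lambda:us-east-1:111122223333:function:blue",   "Next": "Red"},
            "Red":    {"Type": "Task", "Resource": "arn:aws:lambda:us-east-1:111122223333:function:red",    "Next": "Yellow"},
            "Yellow": {"Type": "Task", "Resource": "arn:aws:lambda:us-east-1:111122223333:function:yellow", "End": True},
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="blue-red-yellow",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::111122223333:role/step-functions-role",
    )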

There are other things you can do with step functions. You can have a choice state defined that says, run the blue lambda, and depending on the output of the blue lambda, run the red or the yellow lambda. And you can do parallel processing: run the blue lambda, and then run the red lambda and the yellow lambda at the same time. What I ran into, unfortunately, was that I couldn't do dynamic fan-out parallel processing. If we go back to the speaking-with-Hermes part, I first ask Hermes for the Amazon warrior troops and get back a list of troops, and for each of those troops I ask for the Amazon warriors. So I would want to run the blue lambda, get back a list of troops, and for each of those run the red lambda to get the Amazon warriors for that troop. That is not possible with step functions; you cannot define that. I will walk through alternative ways you could do this. And this was sort of my first attempt at step functions. Oh, whoops, I'm supposed to play it. There we go. She goes for it, oh. So it fell short for me.

Okay, so here is something that you could do; it is a little gross. I know that I'm getting around 400 Amazon warrior troops, and it is pretty consistent. So I could say, okay, I think that six lambdas can handle that. So I'm going to have six of the same lambda run in parallel, and just break up the output of the blue lambda and give it to those six lambdas (see the sketch below). But if your blue lambda is grabbing something variable, and you don't know how big the list is going to be, you don't know if you will run into the five-minute limit for processing -- is it going to be too much? So it is not the cleanest solution.
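A sketch of that workaround: split the blue lambda's output into six roughly equal chunks, one per identical parallel branch. It only works because the troop count (around 400) is known and stable; the function name is made up for illustration.

    def chunk(items, branches=6):
        # split a list into `branches` roughly equal slices
        size = (len(items) + branches - 1) // branches
        return [items[i:i + size] for i in range(0, len(items), size)]

    # e.g. chunk(troops) on ~400 troops gives six lists of ~67 each,
    # one handed to each of the six parallel lambdas in the state machine.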

And another option is, with step functions, you can run lambdas and you can run EC2 instances. So I can say run the blue lambda and give the output to an EC2 instance and do the parallel processing where you grab all of this, that's an option but we are moving out of serverless.

So what did I learn? Step functions is an easy way to implement workflow management for lambdas and EC2 instances. A cool thing is the back-off policy for lambdas, so you can retry them when they fail. But it does not natively support a one-to-many fan-out architecture for lambdas. So I was pretty defeated, I guess, let's say. And so I turned to Wonder Woman for inspiration, and here, I thought this was hilarious. It was like, Wonder Woman's eyelid muscles loosen the tape and her eyelashes. "Oh, my feminine vanity will not let me pull out my lashes; I will have to escape blindfolded." Wow, this woman has way more fortitude than me; I would be ripping those off. So this is new to me, this whole AWS stuff, but I can keep walking into the darkness of my misunderstanding and keep moving on to using event triggers with S3.

Using Event Triggers with S3

So this is an S3 bucket, object storage. Who has used S3 so far? Almost all of you, 80 percent. Great. So I used it sort of as a file system, in the sense that I had leading keys -- key prefixes. So it is a little trickery. I said, okay, I run the blue lambda, I grab all of the Amazon warrior troops, and for each troop I write a file under the leading key troops, so every file will start with troops slash something and then the file name. And then I have an event trigger that says, as soon as a file is written to anything under this leading key, troops slash star, trigger a lambda and give that lambda a reference to that file. And then the next lambda, for that troop, calls the API to get the Amazon warriors, and it writes a file for every Amazon warrior under the leading key Amazon warriors slash whatever the file name would be. This is cool. I will warn you, if you are trying to deploy this, you will run into a circular dependency. Because the bucket has to reference the lambda in order to define the event trigger -- the event trigger is defined as part of the bucket in CloudFormation -- so it will reference the lambda.

But the lambda here is writing to the S3 bucket, so the IAM role of this lambda will need a policy that references the bucket and says, hey, you can write to this bucket. So if I try to deploy that all in one stack, I get this circular dependency; it doesn't work. So there's this little hack you can do: you just hard-code a bucket name. Bucket names have to be unique, so you have to make sure that you pick a unique bucket name to hard-code into your CloudFormation, and then you won't have a circular dependency. You can also deploy the bucket first in a separate stack, and so on.

So the same thing happens: as soon as a file is written under this Amazon warriors prefix, it passes the information about the file to another lambda -- it triggers this yellow lambda -- to go and get the data groups for that Amazon warrior, and then for each data group, it writes a file under the leading key data groups. And then the same thing happens again: every time a file gets written under the leading key data groups, it triggers a lambda to go get the actual data, and it writes that data under this leading key. And what was cool about it is that I was able to store my data in the structure it naturally has -- hmm, how do I say that.
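A minimal sketch of one hop in that chain: a lambda triggered by a file under the troops/ prefix that writes one file per warrior under amazon_warriors/, which in turn triggers the next lambda. The bucket name and the get_warriors helper are hypothetical stand-ins, not from the talk.

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-pipeline-bucket"   # hypothetical bucket name

    def get_warriors(troop):
        # stand-in for the vendor API call that lists the warriors in a troop
        return []

    def handler(event, context):
        for record in event["Records"]:
            troop_key = record["s3"]["object"]["key"]   # e.g. "troops/troop-1.json"
            body = s3.get_object(Bucket=BUCKET, Key=troop_key)["Body"].read()
            troop = json.loads(body)
            for warrior in get_warriors(troop):
                # each put_object fires the next event trigger in the chain
                s3.put_object(
                    Bucket=BUCKET,
                    Key=f"amazon_warriors/{warrior['id']}.json",
                    Body=json.dumps(warrior),
                )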

Some people store the data like a file system, like you click on data and then you click on production, and then you click on whatever the name of the troop was, and then you click on the name of the Amazon warrior. So then you can kind of navigate it as though it was a file system. Okay. And so this reminded me of this scene where she just, like, does not realize how powerful she is. She is like “Whoa. What just happened?” She did not know she was a god yet. AKA, me.

Okay, so reminder, it is like, oh my gosh, this is all happening really, really fast. The first lambda writes to troops, and then all of those lambdas are triggered at the same time, which then write to the Amazon warriors key, and then all of those are triggered at the same time, and then all of these are triggered at the same time. So it ends up being too fast. And this is where I accidentally took down our vendor's service, for an hour. Yeah, it was totally unintentional; I thought, oh, this architecture seems right. And then I had to completely scrap it and start all over, because with this architecture, I could not rate limit. The vendor did not have a rate limit in place -- they put one in place as soon as this happened. It was like -- yeah. Oh, sorry. But, in my defense, I did ask them, I said, how fast can I … ? They said, don't worry about it. So I didn't worry about it.

Okay, so what did I learn? S3 is highly scalable and durable, and if you use nested event triggers, you can create super fast multiprocessing. However, you cannot control the rate at which these lambdas are spawned, and it will only retry the lambda twice after it fails. And I didn't have a back-off policy in place. So their service went down and my lambdas were like, retry! Retry! And what happens is, after it retries twice, it just gives up, and then I lost all of those in-flight tasks -- I had to shut the whole thing down. You can lose in-flight tasks if you are communicating with someone else; you basically have to build your own architecture to account for it, so that after it has tried those two times, maybe you send it to a dead letter queue, and then you try to reprocess it with another lambda. And so I thought, okay, this would be Wonder Woman if she were an event trigger. She was like, why did they bind me with such small chains? It is an insult. These poor event triggers are really cool, but too scalable in some cases.

And then this is the paradox that I ran into, which is like, okay, you can't run into a rate limit if you never scale. So, like, hmm, okay. So this is like the irony here, or the paradox that we always are trying to build for more scalability but, like, I wouldn't have this problem if I hadn't tried to optimize for scalability, right? So it is a sort of paradox.

Rate Limiting with Queues and CloudWatch Triggered Lambdas

And so I had to move on, and I had to completely change my architecture to say, okay, now I have this rate limit I have to deal with, so I have to make my service slower and work within the limits of who I'm working with.

SQS Queues

And so I moved on to try SQS queues. Who has built anything with an SQS queue so far? So about 25 percent of you, or so. So this is how it works: with a lambda, you can grab up to 10 items off of the SQS queue, and I have these tiny little clocks here. What happens is, as soon as the lambda grabs these items from the queue, it does not take them off the queue. They are just invisible to other lambdas for a period of time -- there's a visibility timeout, with a default of 30 seconds, but you can define it based on how long you think the lambda will take to process the items. It is like dating, in a way, but the lambda gets 10 partners, because they are invisible; no other lambdas can process them. I don't know. Or it is like The Bachelorette or something, maybe, I don't know.

So the lambda will process the item, and then delete it from the queue. And you have to write that logic -- you can write it the other way around, but I don't recommend it. Process the item and then delete it from the queue. If you fail to process it, you don't delete it off the queue; it stays in the queue and another lambda can try it. So it goes through, might fail, and it leaves it on the queue. It keeps going, fails again, we are hitting the visibility timeout window, we are getting close, we hit it -- so what happens? The items go back on the queue, or rather they become visible again to other lambdas, and this other lambda swoops in, grabs them, and then these items are going to be processed twice. You have to make sure that you set your visibility timeout window to be longer than what you expect your lambda to take to process. The safe thing is to set it to five minutes, since that is as long as a lambda can run, but you should really figure out how long you think the lambda will run. In my use case, it is fine if another lambda grabs an item and processes it again; it will just rewrite the file. I'm not duplicating anything like a charge to a credit card, I'm just duplicating grabbing data. So this lambda processes them and deletes them, and it goes away.
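A minimal sketch of that process-then-delete pattern; the queue URL and the process function are placeholders, and the visibility timeout is set longer than the expected processing time.

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/troops-queue"   # hypothetical

    def process(body):
        ...  # call the vendor API for this item and write out the result

    def handler(event, context):
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,     # grab up to 10 items at once
            VisibilityTimeout=300,      # hidden from other lambdas while we work
        )
        for msg in resp.get("Messages", []):
            process(msg["Body"])                         # process first...
            sqs.delete_message(                          # ...then delete from the queue
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )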

And CloudWatch rules -- who has implemented anything with CloudWatch rules so far? Okay, so we have a few, maybe five percent or something. So a CloudWatch rule is essentially a cron job: you run something on a schedule. And I'm using both of these things together. So here we go.

So for the first lambda, I decided that this process takes way too long to get anywhere near realtime with the new rate limit, so I'm running it once a day. The first CloudWatch rule is set for once a day; it triggers this lambda, which gets the Amazon warrior troops and sends them to the queue. And then these others -- I played with the timing, but they are running every five minutes. So this one runs every five minutes and says, grab 10 items off of this queue, go through those items, grab the Amazon warriors for each of the troops, and send the results to the next queue. This one is doing the same thing, grabbing from this queue, getting results, sending to this queue, and then the data goes to an S3 bucket.
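A minimal sketch of how those schedules could be wired up with boto3; the rule names, schedules, and lambda ARN are placeholders for the once-a-day rule and the every-five-minutes rules described here.

    import boto3

    events = boto3.client("events")

    # the daily rule that kicks off the whole pipeline
    events.put_rule(Name="fetch-troops-daily", ScheduleExpression="rate(1 day)", State="ENABLED")

    # a five-minute rule that drives one of the queue-draining lambdas
    events.put_rule(Name="drain-troops-queue", ScheduleExpression="rate(5 minutes)", State="ENABLED")
    events.put_targets(
        Rule="drain-troops-queue",
        Targets=[{"Id": "worker", "Arn": "arn:aws:lambda:us-east-1:111122223333:function:get-warriors"}],
    )
    # the lambda also needs a resource policy allowing events.amazonaws.com to invoke it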

And the next thing I added to this architecture: the data wasn't in a format I was super happy about, so I used an event trigger -- every time a file was written to S3, a lambda would spin up, convert that data into a format I liked, and save it under a different leading key, processed. So here I was able to use the highly scalable event trigger, because I'm not interacting with another service; it is just doing a data conversion, and it was great.

So I'm looking at this graph, the queue depth over time. The first is the little blue one -- there are not a lot of items on it, there are about 400 Amazon warrior troops. And then the second queue goes up to around 5,000, and the last queue has a lot of API calls; it goes to 25,000 or so. And then I noticed this thing -- oh, sorry, I almost lasered the screen there. Then I noticed this trailing, here we go. So we have some stale items that are failing, and they are never getting wiped off; the queue is not going back to zero. So I wanted to investigate those items and see, why are they failing continually?

So there's a thing called a dead letter queue, which you can set up -- it is super easy to set up with your queue in CloudFormation. You define a dead letter queue, and then you associate it with a queue and say, if an item has been tried by lambdas five times -- you define how many times -- I want you to send that item to the dead letter queue: give up and move it to another queue so I can investigate and see what is wrong with this item and why it is not processing correctly. So I set that up. And then my graph looked cleaner and prettier. So I don't have the trailing anymore, and I can investigate those items.
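A minimal sketch of attaching a dead letter queue with boto3, assuming both queues already exist; after five failed receives an item is moved aside for investigation. The queue URL and ARN are placeholders.

    import json
    import boto3

    sqs = boto3.client("sqs")

    sqs.set_queue_attributes(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/111122223333/warriors-queue",
        Attributes={
            "RedrivePolicy": json.dumps({
                "deadLetterTargetArn": "arn:aws:sqs:us-east-1:111122223333:warriors-dlq",
                "maxReceiveCount": "5",   # give up after the fifth attempt
            })
        },
    )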

But there was something else gross about this graph that I didn't like. I didn't like that, for a large part of the day, there is nothing on these queues. Especially the first one, the tiny little blue queue -- there's a blip, there are items on the queue for half an hour, and yet I'm running the CloudWatch rules every five minutes, running a lambda just to see that the queue is empty. Why would I do that? It doesn't seem like a great architecture.

So here is a little bit of the code that is inside the lambda. I'm saying, while true -- so keep running this for the whole five minutes, essentially, until I hit the five-minute timeout -- grab 10 items off of the queue, and if you don't get any, set got messages to false, skip the rest of the processing, and keep trying to grab items off the queue.

And this looks like this: each lambda is running for five minutes, dying, and then CloudWatch is triggering the next lambda to run for five minutes, and it dies, and then the next lambda runs. You end up with what is essentially a continuous process running for all twenty-four hours. So I added a tiny bit, these three lines of code. This right here was running for about $8 a day, and then I added these lines, 27-29, which say: if there is nothing on the queue, if the messages are zero, kill the lambda. And then it looks like this -- the lambdas run every five minutes, and when one checks the queue and it is empty, it dies -- and that brought the cost to $1 a day. So that saved a lot, adding those tiny little lines.
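A rough reconstruction of that loop with the early-exit check added; the queue URL and the process function are placeholders, and the real code shown in the talk differs in detail.

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/warriors-queue"   # hypothetical

    def process(body):
        ...  # call the vendor API for this item

    def handler(event, context):
        # loop until we get close to the five-minute lambda timeout
        while context.get_remaining_time_in_millis() > 30_000:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
            messages = resp.get("Messages", [])
            if not messages:
                return          # the cost-saving addition: queue is empty, kill the lambda
            for msg in messages:
                process(msg["Body"])
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])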

So SQS queues: they are highly scalable and resilient, they can hold failed tasks until you deal with them (up to the retention period you configure), and they can send those tasks to a dead letter queue. And they give you control to rate limit the processing of the tasks in the queue, which is really great. Even stepping out of my architecture, using an SQS queue with EC2 instances and an auto scaling group, you can scale up, or you can do something like what I'm doing, where you rate limit. One thing to consider: if you have a lot of items on a queue and a small process, you cannot estimate how long that process will take, because it depends on what else is on the queue. So this small thing could be stuck on a queue behind a lot of items. What that means for the company: we use queues, and we use them for multiple customers. So if one customer starts to inject a ton of items into the system, it affects the other customers. So you should think about that, and about how you could re-route the data if you start to detect one customer being a bully and putting a ton of items on the queue.

And lambdas triggered by CloudWatch rules: this allows for quite specific schedules, which is good. The fastest schedule is to run every minute, currently. This worked for my architecture specifically because I wanted to slow it down and work within the rate limit, but it is not scaling according to what is in the queue. What would be cool is a solution where the queue depth triggers the lambdas -- based on the queue depth, trigger something like an auto scaling group for lambdas. That would be cool, but it does not exist right now. All right.

So I found this quote from Wonder Woman inspiring; she said it takes real character to admit one's failures, and not a little wisdom to profit from them. The only thing I don't like about it is -- I learned from Lean In that women tend to hedge, and she says "not a little wisdom." Just say that you are wise, Wonder Woman, it is okay.

Takeaways

So here are some of the key technical takeaways from this talk that I hope you will have learned, if you didn't know them already. Lambdas are not yet suitable for quick tasks if they require large dependencies -- you might think your task is a small thing, but I was excited about using lambdas for some data science work, and those packages might currently be too big for lambdas. Step functions does not yet provide dynamic fan-out parallel processing. Event triggers don't yet allow you to rate limit. And we can't yet trigger a lambda based off of an SQS queue depth. So these are key architectural things.

Lessons Learned

Stepping back, one of the lessons that I learned from this whole process: when I first asked our vendor, at what rate can I call your API, they said don't worry about it. Now I know not to listen to that, and to really ask, no, really, what rate limit do you have? And if you don't know it yet, what do you think it is?

And I'll just say: be aware that every other service out there has some limit, and with serverless, I can easily hit that limit. So I have to account for that in my architecture, and I want to be able to rate limit. And there were a lot of failures along this data pipeline -- there are a lot of different places the process can fail -- and initially I was only looking at how to get it working; I was not thinking about how to handle the failures. Now, when designing an architecture, I will think about making sure it handles failures, using dead letter queues, for instance.

Reflections for Managers

And then I have this general reflection for managers after this process. I talked to my manager and we did a retrospective: moving forward, how do we iterate faster? Just a general reminder: invest in training for new hires. It helped to decouple learning our company-specific build and deploy pipeline -- which I didn't touch on here, because I know it is different for you -- from getting used to these new serverless principles and this new architecture. Once I decoupled those, and once I got really comfortable with our deploy process, it was easier for me to iterate on the serverless architecture.

And I have an anecdotal example of how what we learned from this has been successful. This is Victoria, she is an intern, she is 17 years old, and she joined us in June. In July, right around this time, I built some boilerplate -- here's a hello-world introduction to a lambda with our specific build and deploy pipeline, so she could learn the pipeline -- and I built one for step functions, too. She came to us not knowing any Python; she had taken one computer science course in high school. She deployed her first lambda on July 12, and now she is maintaining six serverless applications. So, huge success, and I think that is because we decoupled the process for her: she is comfortable with the build and deploy process for the company, and then it is easier to keep up on the new serverless topics. Thank you.

I'm happy to take questions.

On the -- (speaker far from mic) -- they recommend going with DynamoDB to handle throttling. Did you ever consider that?

So you route requests to DynamoDB and then you use a lambda off the stream to ensure that you can throttle the number of concurrent executions.

No, I wasn't aware that you can rate limit with the DynamoDB stream -- is that what you are suggesting? Okay, very cool. Thank you.

Are you saying that you just use DynamoDB for bookkeeping, and then the lambdas can check against that bookkeeping and go back to sleep?

You route to a DynamoDB table and then use the stream to trigger the lambda. And then there's a limitation that there is only one lambda per DynamoDB shard, so you don't have this proliferation of concurrent executions.

In this case, would you lose the -- like, if you said there's a limit, does it save the state and run it after a period of time, or would it just not run? To me, it seems very similar to using the event triggers with S3: it triggers it, and I cannot control the rate at which it is triggered.

Well, you are limiting the number of concurrent executions. So if you have two shards, there are only two lambdas ever running, and they will just process the stream in their own time.
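A minimal sketch of the pattern being suggested here, which is not something from the talk itself: write requests into a DynamoDB table and let its stream drive the lambda, so concurrency is capped at roughly one invocation per shard. The ARNs and names are placeholders.

    import boto3

    lam = boto3.client("lambda")

    lam.create_event_source_mapping(
        EventSourceArn="arn:aws:dynamodb:us-east-1:111122223333:table/requests/stream/2018-01-01T00:00:00.000",
        FunctionName="throttled-worker",    # hypothetical lambda that calls the downstream service
        StartingPosition="TRIM_HORIZON",
        BatchSize=100,                      # records handed to each invocation
    )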

Okay, I will look into that. Thank you.

Any other questions?

Can you share what other types of applications you are using serverless for? You have six serverless applications; one is the pipeline.

Yes, so those are things that our intern is managing right now; I can give you examples. She is a site reliability engineer, and she is working on things like -- one of our services will alert people if something is down, and a lambda will trigger off the document base and put documents in the right place, so people get the alert and the documentation on how to fix it at the same time. She writes things like that. That is one example. Let me think of others. I don't know, I will talk to you after; I'm blanking.

I wanted to find out, how do you deploy your serverless lambdas?

Yeah, good question.

We use CloudFormation, and then Ansible -- and now they have that built out in Ansible, so I don't have to manage it myself. So essentially, our process -- we use Jenkins, too -- our process will make a Jenkins job, and then... oh, you are testing me. We send the lambda to S3 -- we package the dependencies with the lambda code and put that in S3 -- and then the Jenkins job will run the CloudFormation and reference the lambda in S3. I don't know. I'm sorry, I didn't fully study.

So one thing I have tried: I use Terraform to launch it and Gradle to update the artifact in the lambda. I set up the permissions, the IAM role, the event hook, all of that stuff -- if it is triggered by a key or something -- in Terraform, but just the package update of the jar or zip, I do in Gradle. So I have tried that, also. That is one way.

Yeah. And so, yeah, I have a make file that does the whole zip process.

Do you have versioning in lambdas, do you know?

Currently, I'm not using versioning on lambdas, but that does exist, yeah.

Aliases, they are called aliases. And without that, you can't roll back. So it is really important.

First, good talk, thanks for that. So I have a question around those queues. Do you have any metrics on how long the individual jobs took, so you can see over time if they are getting longer and fix that -- basically just logging the run times of the jobs?

Yeah, I currently don't have a metric that I'm tracking on that. But I do have logs -- my logs say when the lambda started and ended -- so I can extrapolate from that, but I don't have it as a metric right now.

I think you get that for free with CloudWatch metrics; you get the execution time of the lambdas.

And it is a graph.

You get it for free.

Thank you.

Any others? Okay. So thank you very much, Dana.

Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx

 


 

Recorded at:

Mar 23, 2018
