The Not-So-Straightforward Road from Microservices to Serverless

Summary

Phil Calçado discusses the fundamental concepts, technologies, and practices behind Microservices and Serverless, and how a software architect used to distributed systems based on microservices needs to change their mindset and approach when adopting Serverless.

Bio

Phil Calçado is the Director of Engineering, Developer Platform for Meetup/WeWork, where he leads the organization's efforts to modernize the most successful platform for people to meet in real life. Before that, he was an architect at Buoyant, working on the pioneering Service Mesh open source software, and director of engineering at DigitalOcean.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Calçado: I have been doing this microservices thing for a while. Most recently I have been at Meetup, which is a company owned by WeWork. You might have heard of them, you might have been to a Meetup. It turns out that I need to make an update - last Friday was my last day at Meetup. If I say anything here that sounds weird, it does not represent Meetup at all, so don't blame them, blame me. Before that, I spent many years at SoundCloud, and prior to that, I was at ThoughtWorks for a few years. I've been through this microservices transition a few times, which really is going from a monolith to a microservice architecture.

Most recently, when I actually joined Meetup, I faced a new kind of paradigm that some of you might be familiar with, which is all this serverless stuff. I use the word traditional microservices out there on Twitter and people made fun of me, and they should. I feel I'm a very traditional person when it comes to service-oriented architectures, and this was a great opportunity for me to check my biases which are based on experience. Experience and bias are really the same thing at this stage. We'll work out what's the new way to build microservices with some of these things that people usually call serverless.

What are Microservices?

The first thing I want to do is to create a shared vocabulary here. I won't spend too much time on this, but I’ve found it to be important in other talks I have given, which is try to create a working definition of what's a microservice, and hopefully, other words we might need. The first one is what's a microservice, or what are microservices? The interesting thing that I found in software engineering in my career is that most things have no actual definition. We reverse engineer some form of definition from what other people have done. Over time, at QCon's and many other conferences, people have come to say, "At Netflix, SoundCloud, Twitter, whatever, we've been trying this kind of stuff and it works," and somebody else is, "I'm going to call it microservices." "Dude, you call it whatever you want. It works." That's what matters.

We were able to reverse engineer the definition based on a few things that are shared by these companies that have been applying this kind of technology. It's very similar to how agile came to be a word. Those folks got together and said, "We are trying to do all these different things, it seems to be working, let's see what's common amongst them all and give it a name." The way I like to refer to microservices is that they seem to be a highly distributed application architecture.

There are a few interesting words here. Highly distributed is one of them. By that, I mean that you don't have one or a few big processes or software systems running; you actually have a boatload of them. You have many different software systems that implement whatever business systems you might have. This seems to be something common across all these different use cases you might have seen. The other interesting part is that I say distributed application architectures, because I'm sure you've been to many talks, at this conference and others, and read articles, and whatnot. People keep talking about distributed systems, which is something everybody is talking about all the time and gets really boring, to be honest.

The main difference I see between what I'm talking about here and what usually is called a distributed system, is that these people often talk about infrastructure pieces. They often talk about databases, consensus, gossip, things like these, all this stuff that nobody really understands but we need to pretend to understand so we get a job at Google. That's not what I'm talking about; what I'm talking about is actual business logic pieces talking to other business logic pieces. That should be hopefully a good enough definition for what we want to do now.

What is Serverless?

Moving on, what is serverless? That's an interesting one because, like I said, I'm not really that experienced with this word. There are many people in this very room right now who have been doing this for way longer than me. I'm a biased traditional person on the microservices side of things. My answer to this is, "Dude, who knows?" I don't know what serverless is. It seems like every single person has a different definition, and it's weird, and it seems to be really whatever a cloud provider wants it to be.

One interesting difference I found between what's generally known as serverless and what became the concepts behind microservices, is that the microservices realm seems to be reverse engineered from practitioners, from people coming to stages like this, writing papers, writing blog posts, saying, "We've actually done this. This is how we managed to get Twitter out of whatever it was back to sanity. We've done these things at Netflix. We are able to deliver all this stuff."

Serverless seems to be the other way around. There are a lot of cloud vendors pushing new technologies they want all of us to use and giving it a name and packaging it in a nice format. There's no problem with that, I used to work for a cloud provider, so I know how the market works a little bit, but it's different. I don't know if I'm qualified to give you a definition. I'm going to actually rely on a definition by a couple of smart people who are based in New York, Mike Roberts and John Chapin. They run a company called Symphonia here. Full disclosure, I'm friends with them but also I have hired them many times to do various forms of consulting work. You should totally hire them, but very conveniently, they wrote a book called, "What Is Serverless?"

There is a definition there that's some kind of litmus test, which I like, but we don't need to go into too much detail for the tale I am going to tell you. Maybe the most interesting part of this definition is that there is a disconnect between the code you work with and the concept of servers, hence serverless. There are a few ways that it manifests itself, but one of them is that you don't really think about CPU or memory or instance size or instance type, things that you might have to do with containers, or VMs, or things like this. I'll go with this, this is good enough. Go check out their book, it's available freely online.

Then, as for examples of what this is, a few things come up. Up to yesterday, this slide only had three logos. The first one on the left is Amazon's Lambda, then there is Google Cloud Functions, and I’ve actually never used that, and the other one is Microsoft Azure Functions - I’ve also never used that, I've only used the AWS flavor. I was here at the conference yesterday talking to some folks and they were mentioning that they're actually using this Kubeless framework or platform, whatever it is. [inaudible 00:07:00] serverless on top of Kubernetes. Never used that, just found it interesting. Those are some of the flavors of what's generally known as serverless frameworks, platforms, whatever you might want.

Keep Your Monolith for as Long as Possible

Trying to tie it back to the story I'm trying to tell you all here, there's a general piece of advice you receive all the time, which is, try to keep your monolith for as long as possible. Most people would tell you, "If you start a new company, or if you are starting a new initiative within a bigger company, don't try to just spread out your systems throughout many different microservices or get all this RPC going on. You should use this. You should use gRPC, HTTP, service mesh, whatever." This is probably not the right time for your company to be thinking about this.

Instead, what you could be thinking about is how to grow your business. You don't know if you are going to have a company or this initiative within your company in two months’ time. Don't focus too much on technology, focus on the business and just keep things on a monolith. Usually, this advice is paired with something like, by the time you need to move away from the monolith, usually what you do is that you have your big piece of code with various different concerns represented by different colors there. You start splitting one over there and one over there, and slowly moving things away from the monolith.

My experience is that you're actually never going to move things away or extract them from the monolith. The code is still going to be there, you just pretend it's not there and it's often ok. If you think about Twitter, they only turned off their monolith a few years ago. SoundCloud still uses the monolith, and I left that company five years ago, when we were well on our way into the microservice journey. Don't worry too much about it. That's the general framework people use when extracting systems.

What if You Waited So Long That There are Other Alternatives?

What happens if you followed this advice so well that it took you 10, 15 years to actually consider moving away from the monolith? Something that has happened is that, as I was saying, all the stuff like serverless and various other options came to become a thing. This wasn't available when I was first doing migrations from monoliths to microservices at SoundCloud back in 2011, 2012, but they are available now. This is a little bit of the situation Meetup found itself in.

Meetup is a very old company. I didn't know that before I joined it. It's 16 plus years old. It's one of the original New York City startups and it's mostly based on one big monolithic application, that for various reasons that are not actually funny, we call Chapstick. Chapstick is a big blob of Java, actually Java the platform, because there's also Jython. There is also Scala. There is probably some stuff that I never really cared to look at. It's just your regular monolith developed over many years. It's not the worst thing I've seen, but obviously it's very hard to work with.

Engineers at Meetup decided that at some point they wanted to move away from it. They tried different things, that was way before I joined, so there is only so much I actually know about this firsthand. They tried moving to your typical microservices setup with containers. There was an attempt at using Kubernetes locally, various different attempts and they all failed for various different reasons. Having had some experience with this, I think a lot of it had to do with trying to go from the Monolith straight to Kubernetes. There are other ways to get there.

The point is that there was a big problem. We had this monolith, and we, the company, had just been acquired by WeWork, I believe in 2017. We knew change was coming and we needed to get ready for that. What does that mean? One engineering team in particular was really passionate about this new flavor of things, what we usually call serverless. They decided to give it a shot and built a few projects that looked a little bit like this. This is actually a screen-grab from one architecture document we had. It might be a little confusing to explain if you are not used to serverless asynchronous architecture, but I will try my best.

The first thing far on the left-hand side is Chapstick, our monolith. Chapstick, again, is Java code writing to a MySQL database. One thing we've done is follow a pattern that was mostly popularized by LinkedIn, to my knowledge, with their Databus project. One interesting thing about some of these databases like MySQL is that whenever you have a master-replica setup, there is something called the Binlog. The master, where you send your writes, uses it to tell all the read-only replicas about every change that has happened. The Binlog contains every state change within the MySQL setup, so all inserts, deletes, updates, whatever. It's all part of the Binlog.

One thing that companies like LinkedIn have done in the past, and we've done at Meetup, was to actually tap into this Binlog, this stream of data, and try to convert it into events. Every time a user was added to the user table, there was some code that would read from the Binlog, create a "user was created" event, and then insert that into a DynamoDB table. DynamoDB is a database by AWS. It's one of the original NoSQL databases, it's very good in my experience, but don't worry too much about it. Just know that it can hold a lot of data, way more than any typical MySQL or Postgres setup.
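As a rough illustration of turning Binlog row changes into events, here is a minimal Python sketch. The `RowChange` shape, the table names, and the event names are all assumptions made for the example; the talk does not describe the actual schema or the code Meetup used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowChange:
    """One row-level change read from the Binlog (hypothetical shape)."""
    table: str
    operation: str  # "insert", "update", or "delete"
    row: dict


@dataclass(frozen=True)
class DomainEvent:
    """An event that would be stored in the events table."""
    name: str
    payload: dict


# Hypothetical mapping from (table, operation) to an event name.
EVENT_NAMES = {
    ("user", "insert"): "UserCreated",
    ("user", "delete"): "UserDeleted",
    ("group_member", "insert"): "UserJoinedGroup",
    ("group_member", "delete"): "UserLeftGroup",
}


def to_event(change: RowChange) -> Optional[DomainEvent]:
    """Translate a row change into an event, or None if nobody cares about it."""
    name = EVENT_NAMES.get((change.table, change.operation))
    if name is None:
        return None
    return DomainEvent(name=name, payload=dict(change.row))


changes = [
    RowChange("user", "insert", {"user_id": 1}),
    RowChange("group_member", "insert", {"user_id": 1, "group_id": 7}),
    RowChange("audit_log", "insert", {"id": 99}),  # no event defined for this table
]
events = [event for event in map(to_event, changes) if event is not None]
```

In this sketch, changes to tables without a mapping are simply dropped; a real pipeline would have to decide explicitly what to do with them.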

We kept pushing all this stuff into DynamoDB as events: User was added, User was deleted, User joined group, User left group. Then we had a lot of Lambda functions that would consume from this database using something that's called DynamoDB Streams, which is actually very similar to the Binlog: for every insert or delete, DynamoDB sends a message. "This user was deleted. This row was deleted. This row was added." That can be consumed by other systems. We would get another Lambda consuming this information, doing some processing with it, and copying it to a different database that would be a very specialized version of what we wanted.

In this particular case here, this is the architecture that used to show the members who belonged to a group. If you go to Meetup right now, /groups, /whatever, /members and you see the list of people who belong to that group - it could be the New York Tech group for example, that's a big one that we have in New York City - this data comes directly from a database that is specialized for that. It went through all of this processing to generate this materialized view that serves that experience only. That was the general design we had for a while.
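The materialized view described here can be pictured as a fold over the event stream. The following is a minimal in-memory sketch in Python, assuming events are plain dictionaries with hypothetical `name`, `group_id`, and `user_id` fields; the real system used Lambda functions and a dedicated database rather than a dictionary.

```python
from collections import defaultdict


def apply_event(view, event):
    """Fold one event into the view, which maps group_id -> set of user_ids."""
    kind = event["name"]
    if kind == "UserJoinedGroup":
        view[event["group_id"]].add(event["user_id"])
    elif kind == "UserLeftGroup":
        view[event["group_id"]].discard(event["user_id"])
    elif kind == "UserDeleted":
        # A deleted user disappears from every group.
        for members in view.values():
            members.discard(event["user_id"])
    return view


def build_view(events):
    """Build the members-by-group view from an ordered list of events."""
    view = defaultdict(set)
    for event in events:
        apply_event(view, event)
    return view


events = [
    {"name": "UserJoinedGroup", "group_id": 7, "user_id": 1},
    {"name": "UserJoinedGroup", "group_id": 7, "user_id": 2},
    {"name": "UserJoinedGroup", "group_id": 8, "user_id": 1},
    {"name": "UserLeftGroup", "group_id": 8, "user_id": 1},
    {"name": "UserDeleted", "user_id": 1},
]
members_view = build_view(events)
```

The page that lists a group's members would then read only from `members_view`, never from the legacy schema.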

One thing that happened is that this project wasn't particularly successful. We are going to discuss some of the issues that made it not successful, but generally, we actually had to revert this approach recently. It went back to just having an API talking to the legacy code base. I want to discuss a few of the challenges we had. Some of them are more interesting than others, and some are particular to Meetup, because there are always some accidental things: it's an old code base, it's an old problem, the team is formed by members who might or might not have some experience with this or that. Some things are probably general enough that we can extrapolate from them a little bit more.

Challenges We Faced

One of the challenges we faced was, first, that solving a bug means reprocessing all the data. Remember that I said the legacy code base writes to MySQL, and then MySQL says, "There was an insert on this table. There was a delete on that table," back to this stream that is converting them to events. There is a user inserted event, a user deleted event, all this kind of stuff. Then another system gets these events and says, "A user has joined the group." When I am creating my materialized view of all users who belong to a group, I need to add or remove that person depending on what goes on.

One challenge is that all this is code we wrote. As with every single piece of code we wrote, there were plenty of bugs. Every now and then, we would find that one piece of this transformation wasn't working very well. We would patch it. That's great, but as opposed to online systems you might have, that also meant that we had to run all the data through this pipeline again, because the copy we had here, in the last database on my right-hand side, was not correct anymore once we had fixed some logic or changed something.

That happened a lot during this project. Every now and then, we would find out that, "Actually, we thought that this field in the legacy schema meant this, it actually means that, we need to remove that." Or, "Oh my God, we cannot add accounts that were disabled automatically by the system." We have a lot of spammers on the platform, as everybody has, and we need to make sure they are not there anymore. We would add the logic to prevent spam, and then we had to run the whole data pipeline again. That was a little complicated. There are a few ways to deal with this in event sourcing and CQRS kinds of architectures, but we struggled with this a lot.
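To make the reprocessing problem concrete, here is a small Python sketch under stated assumptions: the event log is kept around, each event carries a hypothetical `disabled` flag, and the "bug fix" is a changed filter. Once the logic changes, the stored view is stale, and the only way to correct it is to replay every event through the new logic.

```python
def project(events, include):
    """Rebuild group membership from scratch, keeping only joins accepted by `include`."""
    view = {}
    for event in events:
        members = view.setdefault(event["group_id"], set())
        if event["name"] == "UserJoinedGroup" and include(event):
            members.add(event["user_id"])
        elif event["name"] == "UserLeftGroup":
            members.discard(event["user_id"])
    return view


# Hypothetical event log; `disabled` marks accounts the system disabled as spam.
event_log = [
    {"name": "UserJoinedGroup", "group_id": 7, "user_id": 1, "disabled": False},
    {"name": "UserJoinedGroup", "group_id": 7, "user_id": 2, "disabled": True},
    {"name": "UserJoinedGroup", "group_id": 7, "user_id": 3, "disabled": False},
    {"name": "UserLeftGroup", "group_id": 7, "user_id": 3, "disabled": False},
]

# First version of the logic: every join counts, so the spammer shows up.
old_view = project(event_log, include=lambda event: True)

# Patched logic: skip disabled accounts. The stored view cannot be fixed in
# place, so the whole log is replayed through the new logic.
new_view = project(event_log, include=lambda event: not event["disabled"])
```

With four events the replay is trivial; with years of production data, every such fix means a full pipeline run, which is the cost described above.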

Another one is the write path issue, as I call it. I'm showing you the read path. I'm showing you how information gets into the screen that shows you who belongs to this group. I'm not showing you what happens when somebody, maybe an admin from a group, says, "Actually, I want to ban this person from my group." We struggled with this a lot because that pretty much meant that we had to rewrite all the logic that was in the legacy code. Ultimately, the pragmatic decision was, "You know what? Let's just use the legacy code as an API for writes." Now you are like, "Ok, but then why did we go through all this work just to get the read path if the write path is actually going to go through the legacy code base anyway? Are we any more decoupled from what we had?" Those were the questions we had, and these were the challenges we were facing.

Another one, which is a little more common, is that we actually were using Scala. Meetup was heavily a Scala shop, and we had this issue of cold start for JVM Lambdas. Cold start is what happens when the first request hits a Lambda that you just deployed. Amazon, which we know is magic, needs to get your code from some form of disk and load it into memory, and if you've worked with the JVM for a while, you know that the JVM also does a lot of optimizations itself. It takes anywhere from a few milliseconds upward, depending on the function, for that particular function to be up and running. If it's steadily receiving traffic, that's fine, but for the first requests it's always a little challenging. We had a lot of problems with JVM Lambdas based on this. This is one of the reasons why we migrated some of our JVM Lambdas to use Amazon Fargate, which is actually a container platform. We had all these complications where some things were Lambdas, some things were containers. It was a bit crazy.

Another one was DynamoDB costs. DynamoDB is an interesting one because we had a lot of copies of the same data. We were paying over and over again the price for storing the same actual data. There was no one canonical model we all used. We had different specialized copies of it. DynamoDB is interesting in the sense that, very similar to anything else on Amazon, by default you mostly do on-demand provisioning. "I want a new table." "Here is a new table." "I actually don't need this table anymore." "Ok, it's fine. Don't worry about that."

It turns out that if you ever had to deal with Amazon billing, this is the most expensive thing you can do. It could be ok for some use cases, and it's ok for a lot of people, but in our case, this was extremely expensive. The solution was really to do some good old capacity planning and sit down and work out, "Actually we need this much data." The Symphonia guys that I was talking about before helped us with this; we literally saved millions of dollars just by actually working out, "We actually need this much data. Hey, Amazon, can we provision this beforehand as opposed to just asking for it on-demand?"
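The capacity planning mentioned here comes down to estimating how many read and write capacity units a table needs. A minimal Python sketch follows, using DynamoDB's documented unit sizes: one write capacity unit covers one standard write per second for an item of up to 1 KB, and one read capacity unit covers one strongly consistent read per second (or two eventually consistent reads) for an item of up to 4 KB. The workload numbers are invented for illustration and are not Meetup's figures.

```python
import math


def write_capacity_units(writes_per_second, item_size_bytes):
    """WCUs needed: each standard write consumes one unit per 1 KB, rounded up."""
    return writes_per_second * math.ceil(item_size_bytes / 1024)


def read_capacity_units(reads_per_second, item_size_bytes, strongly_consistent=True):
    """RCUs needed: one unit per 4 KB (rounded up) for a strongly consistent read,
    half that for an eventually consistent read."""
    units = reads_per_second * math.ceil(item_size_bytes / 4096)
    return units if strongly_consistent else math.ceil(units / 2)


# Hypothetical workload: 2,500-byte items, 200 writes/s and 1,000 reads/s at peak.
wcu = write_capacity_units(200, 2500)
rcu_strong = read_capacity_units(1000, 2500)
rcu_eventual = read_capacity_units(1000, 2500, strongly_consistent=False)
```

Numbers like these are what you would take to the provider when asking for provisioned capacity; the actual cost comparison depends on current pricing, which is deliberately left out of this sketch.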

To me, as the person who was overseeing the architecture and developer tools team, the biggest problem was that we had no governance whatsoever. It's like, who is calling what? What function is deployed where? The moment when you go on your Amazon console and you list all functions and you see functions with people's names, you know you are in trouble. There is absolutely no way to know what's going on. At one point we turned off some functions that had a VP's name on them, and it turned out they were actually very important for billing. We did not know, nobody knew. She wrote that many years ago. Anyway, it was a little hard to manage all this stuff.

Challenges We Did Not Face

Thinking about me coming as this very biased microservice old school - I can't believe I'm saying that - person, I walked into this scenario, "I could see this was going to happen. Told you so." There were a few things that I 100% thought were going to happen and didn’t. One of them is that getting new engineers productive on Lambda and DynamoDB was actually pretty easy. There is something on top of this, which is we had hired a lot of new engineers from Node.js front-end backgrounds. All this stuff was in Scala. I've used Scala for a fair chunk, and I know that Scala is the kind of language that either you hate, or you love. There is no middle ground, nobody is lukewarm in Scala. I happen to like it, but I've managed 200 people writing Scala and swearing at me for a very long time.

The interesting thing is that if you are actually writing just 200 or 300 lines of Scala, it's not that hard, especially if you use things like IntelliJ that really help you out. The way I think about this, is the same way as I personally deal with makefiles. I've been in this industry for 16 years, I don't know how to write makefiles. I have never learnt, I copy, paste, change, and move on. It's probably like there's been one makefile that somebody gave me and I keep changing these things for 16 years now. I actually have no idea what I am doing, but it works, it's fine. It's the kind of approach that we were taking there. Obviously, we had tests and different support to make sure people were not doing something completely stupid, as opposed to what I do with makefiles. That was the principle, it's very discoverable. You can just change something and move on.

Another one is that I was talking about cold start. I mentioned that we moved some of the things to Fargate. Then we moved them back to Lambdas but we converted them to Node.js Lambdas. We didn't find any of the cold start problems anymore. Obviously, there was a little bit of an issue, but nothing that would hit us to the level that the JVM ones were hitting. Another one is operations allergy. I spent four years at ThoughtWorks, so I'm very indoctrinated into the right way of doing things as any ThoughtWorker would tell you. There is only one, it's the ThoughtWorks way. One of the things we were super big on is the DevOps culture, which is different from having a DevOps team.

The thing here is that I'm so used to building engineering teams and trying to get people to own and operate their own systems. They've been really allergic to this, they really actually hate it. I expected exactly the same thing to happen here. It turns out that although Lambdas can be really weird and opaque at times, it doesn't really happen that much. In my experience, people are reasonably happy being on call and just operating their own system. I think it has to do with the size of things. If I'm responsible for something that, even as a microservice, is kind of big, and I don't know everything that's going on there because I might be more of a front-end person than a back-end person, it's complicated. If it's something that's using the language I'm used to, I have all this tooling, and it's small, it might be ok.

Another one that I was 100% sure was going to happen is local development. If everything is on the cloud, how do you do local development on your machine? What if you take a plane? It turns out most planes just have WiFi, at least in the U.S. That's not so much of a problem. One interesting thing about local development is that most of the tooling we have available right now - and my only experience firsthand is with the SAM framework from Amazon - has enough simulators that you can use within your local development. I don't suggest you do much with them. One thing we've done at Meetup is to make it really easy for people to provision their own accounts so they can have a sandboxed space where they can deploy all their Lambdas and do their other stuff without messing with other people's in-production systems and things like this. The simulator reminds me a lot of Google App Engine when it first came out. It works 80% of the time, but that 20% is what's going to kill you.

The last one is developer happiness. I figured this would make no difference because it's just back-end technology. Most of the engineers don't care about that, they want to develop a product. But it turns out that people are really happy with this whole stuff. They really wanted to keep using that, and it's not just because it's different from what we had as a legacy system, because they could use something else. They could use containers, they could use whatever they wanted really. They actually wanted to keep using that and they were asking for more help and tooling and training on this.

How Can We Keep the Good and Get Rid of the Bad?

The one question that then came to my mind, and that I was working on with my team, was how can we keep the good and get rid of the bad? I was coming in super biased. I had years of microservices behind me: the SoundCloud transition, then I worked for DigitalOcean and went through the same stuff, then I worked for a Service Mesh company for a while. So I have my opinions about how things should be done, but it was pretty clear to me that a lot of this tooling was just good, it works. It's not perfect but it works. I was thinking about this for a while and I realized that maybe the easiest way for me to rationalize this whole thing is not too dissimilar from the way that I perceived we went through microservices and other things.

If you think about it, if you've been in this industry for a while, you usually have a few applications within a company. Here's a silly example where we have the financial reporting application, some form of user manager application, and some kind of point of sale application. They are all different, they are all independent. If you are talking about old school, they might be managed by the same team, but the teams could be completely different. Maybe it's even different consulting companies writing them.

This was the picture that anybody would show you on your first day working for any corporation, maybe 10, 15 years ago. What they didn't tell you is that actually it was kind of like this. It's one company, everybody needs the user data, everybody needs the sales information, whatever it is. There is always some kind of cross pollination, collaboration is probably a good one, across these different databases. We have perceived that this was a little bit of a problem because all these systems were coupling to each other. They rely on the schema information and various other things.

One way we managed to get out of this mess was by creating services. Service Oriented Architecture is a very old term in technology now. We've been talking about this since the late '90s. It means different things to different people, but mostly it means that you expose the kind of features and data that you need to other systems on your network. One thing that I have seen happening a lot is that once you did that, you didn't really have these independent applications that had their own back-end and their own data. You might have been on projects like this. At some point 10 years ago, you were working on some sort of customer portal or admin portal, because you were tying all these things together. Instead of having one application to do this little thing and another application to do this other thing, we actually were having one front-end that would talk to many different services in the back. This was how the industry was evolving. This was generally interesting and good.

One interesting way that I see serverless applying to this whole picture is that even if with microservices you would have smaller boxes and smaller services, you would still only have so many. What serverless tends to do is something more like this, where actually, you start exploding your features into a million different functions. These functions are 100 lines of code, 200 lines of code, and it's very easy to make functions talk to each other. It's very easy for these kinds of super deep-in-your-system features just to access data from another super deep-in-some-other-system feature. These systems don't really exist anymore. The colored lines there are just to illustrate how these things are kind of related, but not really. That's a little bit of what we ended up with at Meetup.

It's funny because a few months ago, I was on Twitter, and I saw somebody describing this, I think, perfectly: Chris Ford, a guy I used to work with at ThoughtWorks. It's a whole paragraph, but pretty much he calls it a pinball machine architecture, where you send a piece of data here and there is a Lambda there and a Lambda there, it gets sent to a bucket, it gets put over there. You don't know what's going on anymore. This is pretty much what we had. As we progressed towards this kind of architecture from something more like this, we just didn't know what was talking to what, what was going on. We were legit victims of this pinball machine architecture syndrome, if you will.

Thinking about this for a little while, it's like, "Here is a challenge we have." We want to keep the good things about serverless architectures we've talked about. We have good provisioning systems, people were generally happy to write code for it, we have good support, we have good tooling, we have all this stuff. What are the things that we need to do that will prevent us from getting to this kind of pinball machine architecture I was talking about?

I think in every presentation I've ever given, I've always fallen back to something that Martin Fowler has written before. If you've ever worked with me, you are going to receive a copy of this PDF at some point, because this is one of my favorite articles in software engineering of all time. I think it's super underrated. This is a piece Martin wrote - I don't even know when, it was many years ago - about the difference between published and public interfaces. The article is good, read it. You will understand it, he has very good examples, but mostly think about it this way: any programming language - I don't know if any programming language, but most programming languages - will offer you some keywords like public and private. Java has that, Go has some convention system that covers these, C#, Ruby. Either there is a keyword or some kind of convention.

The interesting thing is that, for example, if you talk about Java or Ruby, you can go into the runtime library for these languages and you can find this class deep in God knows where, in some weird package you never heard about and you can access it. You can access it because what public and private in a programming language do doesn't really have to do with security. It's the different levels of access control. Even programming languages like C++ and others have package protected things. You can always find a way to access this stuff. The point is that semantically they could be expressed in a better way, but it doesn't actually matter, because what matters is that the person who is providing you with the library or whatever it is, needs to offer you some published interface. A published interface is what you want your users to actually use, what you are willing to support them using to do whatever they want to do.

It's not because something is public in a library that you might be using, that you are allowed to use it. Of course, you can use it, and sometimes it's a very good hack, it allows you to do many different things. But it's not what you should be doing. What you should be doing is using what Martin calls the published interface, which is what the users or what the library owner or the service provider, whatever, provides you as an API. I was reading this a little bit and thinking about other challenges we were having. We found one interesting way to map these concepts to our explosion of Lambdas everywhere.

You remember we had a situation where we had lots of Lambdas talking to a lot of Lambdas, and this doesn't even do justice to the kind of stuff we had, because we had no idea what was calling what. Obviously, yes, you can use this Amazon whatever tool, that Amazon whatever, but as you are writing code, you don't know. It needs to be a report that somebody generates. One thing we've done was to split them out and put some kind of API gateway in front of it. Now that's interesting because at the same time, there is a lot of contention and no contention at all about this design. It's actually recommended by many people that you always use an API gateway in front of your Lambdas.

At the same time, people look at it as, "Ok, but then aren't you just creating services and using Lambdas as your compute units? Isn't that just the same?" To be honest, yes, it's very similar. Again, I'm coming from my bias, I'm trying to adapt this to keep working and to fix things I haven't seen working, but it works well. The thing here is that even though each one of these Lambdas is addressable, it could be called by anybody. You can prevent this with some security configurations, but we didn't even have to do that.

Everybody in the company knows that the only way to access one of these Lambdas is through the API gateway. The API gateway is your typical HTTP interface. It has a well-known domain name that you access and do things. But that's the only way to go from A to B. Instead of having this kind of peer-to-peer communication network, you'd start piping things through an API. It has been the case since forever that this is how services should be built. One interesting thing is that some of the internal Lambda functions in this diagram, and that is on purpose, call other services, and they can call them directly. You don't need to have a boundary object that some people create to make these calls. It's fine if they actually call the API gateway for another service directly. This is something that we've done.
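The arrangement described above can be sketched in memory, without any AWS dependency. This is only an illustration under stated assumptions: the `ApiGateway` class, the `users` and `events` services, and the route paths are hypothetical stand-ins, not Meetup's actual setup.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


class ApiGateway:
    """Published interface of one service: a set of named routes."""

    def __init__(self, service: str) -> None:
        self.service = service
        self._routes: Dict[str, Handler] = {}

    def route(self, path: str, handler: Handler) -> None:
        self._routes[path] = handler

    def call(self, path: str, request: dict) -> dict:
        # Only published routes are reachable from outside the service.
        if path not in self._routes:
            raise KeyError(f"{self.service} does not publish {path}")
        return self._routes[path](request)


# Hypothetical "users" service with one published route.
users = ApiGateway("users")


def _load_profile(request: dict) -> dict:
    return {"id": request["id"], "name": "member-" + str(request["id"])}


users.route("/users/get", _load_profile)

# Hypothetical "events" service. Its internal function calls the users
# gateway directly; no extra boundary object sits in between.
events = ApiGateway("events")


def _describe_rsvp(request: dict) -> dict:
    member = users.call("/users/get", {"id": request["member_id"]})
    return {"event": request["event_id"], "attendee": member["name"]}


events.route("/events/rsvp", _describe_rsvp)
```

The internal functions are still ordinary callables that anyone could invoke; the gateway is the agreed entry point rather than an enforced one, which matches the convention-over-security point made in the talk.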

Use Serverless as Building Blocks for Microservices

What we've done was really just start using serverless as building blocks for the microservices we had. It's really using the platform as a service, part of what serverless provides, and not so much some of the more interesting bells and whistles that we could be using. One interesting challenge is how do you segment? How do you create these lines across services? One thing we've done at Meetup is that we started using AWS accounts for that, which works well except there is one big caveat, which is AWS is really bad at deleting accounts. You make a request to delete an account and it could take days. It might never be fulfilled. We have some kind of organizational unit which is a grouping of accounts. Every time you want to delete an account, you actually move that account to this organizational unit. It's the repair one and it's not active anymore. It's not perfect, but it works well enough for us.
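A rough sketch of the "move it to a holding organizational unit" workaround follows. It assumes a client that behaves like boto3's Organizations client, whose `list_parents` and `move_account` operations take the keyword arguments used here. The function name, the OU and account IDs, and the in-memory `FakeOrganizations` stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
def retire_account(org_client, account_id: str, retired_ou_id: str) -> bool:
    """Move an account into a holding OU instead of waiting for deletion.

    org_client is expected to behave like boto3.client("organizations").
    Returns True if the account was moved, False if it was already there.
    """
    parents = org_client.list_parents(ChildId=account_id)["Parents"]
    current_parent = parents[0]["Id"]  # an account has a single parent
    if current_parent == retired_ou_id:
        return False
    org_client.move_account(
        AccountId=account_id,
        SourceParentId=current_parent,
        DestinationParentId=retired_ou_id,
    )
    return True


class FakeOrganizations:
    """In-memory stand-in for the Organizations client (illustration only)."""

    def __init__(self, parents: dict) -> None:
        self.parents = dict(parents)  # account id -> parent id

    def list_parents(self, ChildId: str) -> dict:
        return {"Parents": [{"Id": self.parents[ChildId], "Type": "ORGANIZATIONAL_UNIT"}]}

    def move_account(self, AccountId: str, SourceParentId: str, DestinationParentId: str) -> None:
        assert self.parents[AccountId] == SourceParentId
        self.parents[AccountId] = DestinationParentId


# Hypothetical IDs: one service account sitting in an "active" OU.
fake_org = FakeOrganizations({"111111111111": "ou-active"})
first_move = retire_account(fake_org, "111111111111", "ou-retired")
second_move = retire_account(fake_org, "111111111111", "ou-retired")
```

Making the call idempotent (the second call returns `False`) keeps the tooling safe to re-run, which matters when account clean-up is slow and may be retried.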

Another interesting thing we found is that API gateway is actually quite expensive. You are going to see many articles online talking about how it's an interesting product, but it's not great. My own experience with AWS is that unless you really need to do something about the cost right now, wait for a little more; talk to your account manager. I don't know how Google and Microsoft would work on this, but talk to your account manager, see if there is any kind of price drop coming.

We didn't have this with API gateway when I was at Meetup. I mentioned that we moved some of the systems to Fargate. Fargate was very expensive, we just had no option. We knew that we were going to increase our bill a lot by using that. It turns out that one week before we moved some of our systems to Fargate, Amazon just slashed the price in half. That's the Amazon policy. That worked well. We could actually keep using Fargate because it was ok. I don't know if that's going to happen to API gateway. I haven't seen massive price drops there, but one of the reasons I wasn't worrying about this is because what API gateway does is actually not a lot for us. If that became a problem, I could actually just get a team on the side and create some load balancing structure that would replace that. It would be a bit ad hoc and not invented here-ish, but we could do that. That was an option for us.

This is the thing that I'm still a grumpy old man about when it comes to serverless architectures. There are a lot of people who tell you that Amazon provides you with everything and it's all going to be awesome. At Meetup, we went full on Amazon. Meetup still runs some stuff on Google Cloud but it's moving to AWS. We had only CloudFormation, we didn't even use Terraform because we wanted to make sure that we were using the right tools that our AWS people could help us with. We did everything the Amazon way and I still had about 10% of the engineering team working on tooling. There were a lot of loose ends. Teams that don't talk to each other - you can clearly see that AWS has many different teams and they kind of hate each other because this feature is available there but not integrated with these other things. Why? This happens a lot and you need to be ready for that.

Is This Really Serverless?

I think ultimately, one interesting question is, is this really serverless? Is this what the future looks like, because it looks very similar to what we do now? Are you sure? My point is that I don't actually care, because that was one way we managed to get from one point to the other. But more important to me - this is also something that I tend to use as a saying, I tend to repeat a lot - you don't move from two out of ten to ten out of ten in one jump. I think some of the mistakes we've made in the past were exactly trying to do that, moving from a legacy application that was big and complicated, and 16 years old, to the bleeding edge of serverless computing with asynchronous workflows using Kafka and God knows what. This is too much of a jump, the gap is too wide. We need to go inch by inch and try to get somewhere.

This is an approach where I can hire somebody who has experience with microservices and they immediately understand what's going on, as opposed to what we had before that was very complicated. Does that mean that CQRS and event sourcing don't work? I'm not saying this right now, but it's just because I never actually worked in a project that did this in the right way, whatever you want it to be. I'm sure it can work well; I haven't seen it working so far.

Going back to the whole you are coming from a traditional microservices background to the Lambda kind of approach. What does that mean? What's the future like? Not being cynical at all, I actually think that serverless definitely is the future. As I mentioned, I used to work for a company called Buoyant. Oliver Gould is going to be giving a talk about some of the work we did there and that is still there. There is an amazing piece of technology that has been built on the platform around Kubernetes mostly, and things like this. The moment I saw how easy it was for an engineer fresh in the company to deploy things using serverless, I said "There is definitely something here. There is a future there."

To be honest, I think AWS is pretty bad at tooling. I've worked for a few years at DigitalOcean here in New York City. The main thing they do is fix the user experience things that Amazon cannot fix. I'm familiar with some of the challenges there, but once they get this through, I really legit think this is the right way to go.

Questions & Answers

Participant 1: On your last slide, you mentioned that serverless looks like the future but it's not there. Can you give a list of features that you think are currently missing?

Calçado: I don't know if there would be features per se because I think the building blocks are there; it's how to get these things to talk to each other. We have various different components that work great in isolation and I was mentioning that we have 10% of my team working on tooling. We are not actually writing any software or any actual software. We are just linking things together, making something that will spit out a CloudFormation template that wires this thing with that thing over there. I don't know if there is a particular feature that I am missing, as much as just the workflow nature of things which I believe the cloud providers are going to get better at and some companies offer some alternatives on top.

Participant 2: On the slide where you showed how you use an API gateway to essentially gate serverless functions, there was one place where you had a call from a gateway to an actual function. Can you elaborate on when it makes sense to do that and when it makes sense to always go through the gateway?

Calçado: I have two answers to this question. One is that there is actually a mistake on the slide. The second one though is that one thing that I have seen mostly happening at the edge - it will happen sometimes downstream as well - is that you need the API gateway to do some form of authorization and authentication, something like this. Usually, it wouldn't be so much of a big jump from one system to the other, but it might be something like this. Actually, it's funny because I haven't mentioned this in this presentation, but I used to believe in really dumb pipes, and now I'm becoming more and more a believer that the pipe should be smart enough, and that includes the API Gateway piece of infrastructure - rate limiting, authentication, authorization, and various other things that you might need to do from that layer, but that one is totally a mistake.

Participant 3: You mentioned just briefly about Service Meshes. I wonder, how does that fit into this landscape here? Would the Service Mesh be East-West then the API gateway North-South?

Calçado: The only way I have been thinking about this right now - Service Mesh wasn't the problem that I was thinking about, but I think it's the same - I was sitting down with somebody from one of the monitoring providers who was trying to sell us some things. We already had some of them, so just negotiating contracts and things like this. "By the way, we have decided we are going to invest more in serverless. If I go with you because I'm signing a two-year contract right now, what does that mean?" Their response was kind of iffy. Eventually they got me to talk to the CTO of the company. It turns out that all they could do was read the Amazon logs and plot them.

Extrapolating from this to your question, this ecosystem is so closed up that I have no idea if anybody is going to be able to offer something at the Service Mesh level or any of these things. It worries me a lot, because I have been an Amazon customer for many years. SoundCloud was the second largest AWS account in Europe, and I know that AWS is great at some things, but I don't like the way they develop software. I don't want them telling me how to develop software. I trust other folks to do that, but not them. Yes, I'm worried about this, but I don't know how it's going to pan out.

Participant 4: You've got a bunch of lines between Lambdas. My limited experience so far is that that's problematic and we still end up doing it sometimes, but we are using a lot of step functions and queues and publishing topics and reading them, and so forth. Can you talk about your experiences there?

Calçado: I tried to simplify this diagram a little bit, but it's funny because when I was preparing this, I thought that it would also not super match the quote I had from Chris there, because he is talking about exactly what I'm talking about; many different step functions, buckets, whatever it is. In my experience, it's what you end up using to communicate between these things, and you are right. It's funny now to match this. I'm not going to call it wrong, but if there is something you feel, that probably should be revisited.

One thing I found in this more actual serverless architecture using different objects or resource types is that unless you apply this rule of tying them all together with something - it could be tags but we use accounts specifically - you are going to have a lot of trouble trying to figure out what's used by what and what's going on. We have a lot of interfaces that are just databases. They get exposed to other services, and not being a natural published interface as I was saying creates a lot of trouble. That's how we build it with the step functions and various other objects and things. I actually don't like step functions that much, to be honest, but buckets and DynamoDB tables all over the place and also the Kafka thing from Amazon. If you have too many fanning out from one function, probably a function is not 200 lines of code.

Participant 5: I wanted to ask in terms of your architecture using multiple AWS accounts, it wasn't clear to me, what's the benefit in doing that? Also, in terms of your deployment, how is that done? If you have multiple accounts, how do you get around that kind of dynamic account ID type thing?

Calçado: The abstract benefit is that you need to group your resources by something. Usually, people do it by team or by environment, but we decided to group them by service. We could use tags for that, but we figured out that one interesting boundary between services was going to be accounts, because you have to explicitly jump from one to the other. It's a bit safer in our experience. One challenge we had at Meetup was that we had one account for everybody, all environments and everybody had access to it. We figured out that people were really afraid of doing anything in production, because any wrong thing they had done could impact the whole company, so containing it further and further was going to be an option. We followed this to the point that we decided that actually the unit was going to be the service itself. Again, it's arbitrary, it could be anything, it could be just tags and things like this.

We have a little bit of a complicated network architecture at Meetup because of the legacy account, the things we need to talk to, because the monolith is still there. Usually for deploy, you do an assume role, whatever the primitive is called, and you deploy to that account. Engineers have access directly to all their accounts, because our main thing is not that people cannot access the other accounts. They need to say, "I am going to use the user service account now," as opposed to just applying something to a generic coarse-grained account. Big caveat, a lot of the tooling mentioned surrounds this. We have a lot of tooling to create accounts, to delete accounts, and jump between accounts as people jump between services. It's the working context or the work space you are working with.
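The "say which service account you are using, then assume a role into it" step might look roughly like the sketch below. It assumes a client that behaves like boto3's STS client, whose `assume_role` takes `RoleArn` and `RoleSessionName` and returns a `Credentials` mapping. The service-to-account table, the `deployer` role name, and the `FakeSts` stand-in are hypothetical; the talk does not describe Meetup's actual tooling at this level.

```python
def role_arn(account_id: str, role_name: str) -> str:
    # Standard IAM role ARN layout.
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def credentials_for_service(sts_client, accounts: dict, service: str,
                            role_name: str = "deployer") -> dict:
    """Pick the account by service name, then assume a role inside it.

    sts_client is expected to behave like boto3.client("sts").
    The returned keys match the keyword arguments of boto3.Session.
    """
    account_id = accounts[service]  # KeyError if the service is unknown
    response = sts_client.assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName=f"deploy-{service}",
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


class FakeSts:
    """In-memory stand-in for the STS client (illustration only)."""

    def __init__(self) -> None:
        self.requested = []

    def assume_role(self, RoleArn: str, RoleSessionName: str) -> dict:
        self.requested.append((RoleArn, RoleSessionName))
        return {"Credentials": {"AccessKeyId": "AKIAFAKE",
                                "SecretAccessKey": "secret",
                                "SessionToken": "token"}}


# Hypothetical mapping: one AWS account per service.
SERVICE_ACCOUNTS = {"user-service": "222222222222"}
fake_sts = FakeSts()
session_kwargs = credentials_for_service(fake_sts, SERVICE_ACCOUNTS, "user-service")
```

Keeping the service-to-account mapping in one place is one way to handle the changing account IDs raised in the question: deploy scripts name the service, and the tooling resolves the account.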

Recorded at:

Sep 24, 2019

Phil Calçado
