Key Takeaways
- With serverless, infrastructure stacks oriented around compute instances are replaced by stacks built from resource types that abstract servers away
- Managing serverless infrastructure stacks requires some planning and discipline to ensure they can be effectively maintained as systems evolve
- Splitting large infrastructure stacks into smaller ones oriented around functional areas and system-wide capabilities helps to make deployments quicker and less risky
- Dependencies between separate stacks have to be understood and managed to make infrastructure management more effective
- Infrastructure management should be integrated into the continuous delivery pipeline so that infrastructure stacks are updated together with any code changes
The smell of new hardware in the morning
I still fondly remember the morning when some new servers were delivered to the office of one of the companies I worked for in the mid-noughties. The excitement we all felt when we first powered them on and could smell that distinct scent new electronic devices give off when first used. That same day, we meticulously started working on some basic configuration tasks. These new servers would become the “next generation” hardware platform for our small company for the years to come.
No surprise, then, that we quickly worked out how we would name our new servers. We went with names based on Tolkien characters, so the first one was named Gandalf, obviously! After some weeks of tweaking the setup, the servers made their way to the data centre facility where they would soon start handling production traffic.
I bet many of you reading this article have similar stories to share. Back then, the hardware was something much more tangible and closer to the minds and hearts of the people responsible for infrastructure management. Any project would involve a hardware infrastructure planning and execution phase. Regular visits to the data centre were commonplace; sometimes hours or days had to be spent installing and testing servers and other network and power equipment. Even during normal system operation, anything involving hardware changes would require careful planning and even more visits to the data centre.
Infrastructure as Code primer
With cloud computing becoming the de facto standard for infrastructure delivery and management, the hardware itself has become a commodity, much like electricity. Except for companies with very particular hardware requirements, most organisations turn to the cloud for their infrastructure needs. And given the dynamic nature of cloud infrastructure and the footprint most organisations require, it’s inconceivable to still treat the hardware infrastructure with the same degree of affection it once received.
Around 2012, Randy Bias first used the phrase “treat your servers like cattle, not pets” in the context of cloud computing to illustrate the philosophy required to effectively manage cloud infrastructure at scale. Fast forward to 2018, and Infrastructure as Code (IaC) is now the standard practice for cloud infrastructure management.
The Infrastructure as Code practice requires all aspects of the infrastructure configuration to be captured and versioned using some form of source code version control, as well as tooling for that configuration to be applied to the actual infrastructure resources (physical or virtual).
The server is dead
Since the dawn of cloud computing (2006), it has been pretty much all about virtual servers (also known as virtual machines or compute instances), so IaC practice and tooling have largely evolved around the server-oriented infrastructure stack, commonly referred to as Infrastructure as a Service (IaaS). But servers in the cloud are slowly becoming a commodity too.
In the last few years a new set of cloud services has enabled moving away from the server (aka compute instance) as the primary resource type. This new paradigm is called serverless. Now, serverless doesn’t mean that servers are not used. In fact, there are servers powering all types of serverless services, but as a user you don’t get to see or interact with them. The cloud platform takes care of managing servers (and containers too) for you and offers services that provide common building blocks for creating cloud-native architectures: compute, storage, API management, messaging, security, monitoring, etc.
Long live cloud resources
Even though the servers are gone from the serverless picture, this doesn’t mean you can forget about infrastructure configuration altogether. Rather than configuring compute instances and many network-related resources, which was commonplace for the traditional IaaS stack, we now need to configure functions, storage buckets and/or tables, APIs, messaging queues/topics and many additional resources to keep everything secured and monitored.
When it comes to infrastructure management, serverless architectures usually require more resources to be managed due to the fine-grained nature of serverless stacks. At the same time, without servers in sight, infrastructure configuration can be done as a single-stage activity, in contrast with the need to manage IaaS infrastructure separately from the software artifacts running on different kinds of servers (web servers, various application runtimes, container orchestration systems, databases, message brokers, etc).
Even with this somewhat simplified way of managing infrastructure resources, one still needs to use specialised tools for defining and applying infrastructure stack configurations. Cloud platform providers offer their proprietary solutions in this area: AWS has CloudFormation, Azure has Resource Manager and Google has Cloud Deployment Manager. Alternatively, a more generic solution like HashiCorp Terraform can be used.
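To make applying a stack configuration a little more concrete, below is a minimal sketch using Python and the AWS SDK (boto3) to create or update a CloudFormation stack from a template file kept under version control. The stack name and template file name are placeholders for this example, and a real deployment script would add waiting and error handling.

```python
import boto3

# The template lives in source control alongside the application code.
with open("learning-platform-stack.yaml") as f:  # placeholder file name
    template_body = f.read()

cloudformation = boto3.client("cloudformation")

try:
    # Create the stack if it doesn't exist yet...
    cloudformation.create_stack(
        StackName="learning-platform",          # placeholder stack name
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_NAMED_IAM"],  # needed when the template defines IAM resources
    )
except cloudformation.exceptions.AlreadyExistsException:
    # ...otherwise reconcile the existing stack with the desired state.
    cloudformation.update_stack(
        StackName="learning-platform",
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
```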
Serverless online learning platform
Let’s use a relatively simple but fairly realistic example of a serverless architecture as the basis for further analysis and discussion. Imagine we need to create a system that delivers online learning courses to our paying customers. Such a system would need to cater for the following functional requirements:
- manage user details and provide authenticated access to the system
- store course details and users enrolled on them
- store course materials, including videos
- serve course materials to enrolled users
- track course progress
- accept credit card payments
- send email notifications
To fully embrace the cloud philosophy and achieve a quick time to market, we can use third-party cloud services for some non-differentiating parts of the system, like user management and authentication, card payment processing and email campaign management. The rest of the system, covering the course management and delivery business domain, can be delivered using a serverless stack on AWS. Other major serverless platforms (like Azure or Google Cloud) could equally be used, but given AWS’ popularity and maturity - as well as my familiarity with it - I’ve opted to use it for the purposes of this exercise.
The very first design of the system, one that perhaps lacks much thought about maintainability and operability, might look something like the one depicted below. All the application components, directly mapped to cloud resources, belong to a single infrastructure stack.
Diagram 1: The initial design using a single function
In an attempt to simplify the system design, the entire backend logic is encompassed by a single AWS Lambda function that is responsible for handling API calls from the API Gateway for all functional areas, storing/retrieving data in a set of DynamoDB tables, as well as interacting with 3rd-party services for actions that can only be securely performed from the backend component. This initial design, perhaps the quickest to implement, suffers from multiple drawbacks:
- A single function has too many responsibilities and its code will quickly become very difficult to maintain effectively, especially if many developers get involved
- Additionally, when working on new functional requirements it may be difficult to coordinate work between developers, and each change will require redeploying the whole function - this will get worse as the system grows
- This one uber-function has access to all system data as well as all 3rd party services, which can introduce security risks
- As this one function plays so many roles it may be challenging to determine the optimal resource allocation (memory + CPU) and configure other properties (timeout, concurrency) appropriately
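To make these drawbacks more tangible, here is a rough sketch of what the uber-function’s handler could end up looking like. It’s a hypothetical Python example with made-up routes and responses, not code from the actual platform.

```python
import json

def handler(event, context):
    """Single entry point for every API Gateway route in the uber-function design."""
    path = event.get("path", "")
    method = event.get("httpMethod", "")

    # Every functional area is wired into the same function...
    if path.startswith("/courses") and method == "GET":
        body = {"action": "list courses"}        # would query the course DynamoDB table
    elif path.startswith("/enrolments") and method == "POST":
        body = {"action": "enrol user"}          # would update the user DynamoDB table
    elif path.startswith("/payments") and method == "POST":
        body = {"action": "charge card"}         # would call the 3rd-party payment service
    elif path.startswith("/progress") and method == "PUT":
        body = {"action": "track progress"}      # would record the user's course progress
    # ...and the branching keeps growing with every new feature.
    else:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}

    return {"statusCode": 200, "body": json.dumps(body)}
```

Every new endpoint, table or external integration adds to this one function’s code, permissions and configuration, which is exactly what the revised designs below try to avoid.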
The diagram above depicts very much a monolithic architecture, both in terms of the coupling between application components and because all resources belong to a single infrastructure stack.
Now, perhaps this initial version is all that’s needed to successfully deliver the online learning platform and any further investment wouldn’t be justified to start with. However, in most cases, after the initial success, users will ask for new features to be added to the platform so it usually pays off to spend a bit more effort to make the solution ready for further growth and evolution.
In order to mitigate some of the problems identified in the first version of the platform, we could for instance break up the single, monolithic function into several smaller functions, each responsible for a specific functional area.
Diagram 2: Separate functions implementing API for each functional area
With the revised design, the system should be more maintainable, easier to secure and easier to configure for specific traffic patterns and resource requirements that may vary for each function. One important problem still remains though: the entire system sits on a single infrastructure stack.
Challenges with large infrastructure stacks
But why is using a single infrastructure stack a problem?
One thing to highlight about the solution diagrams shown above is that they only include the main resources within the architecture. There are many additional resources that have to be configured in order for the solution to function correctly, be secure and operable. Examples of additional resources that will be part of the infrastructure stack are:
- API gateway path, request and response specific resources - each API endpoint exposed with AWS API Gateway requires a set of configuration resources, so for APIs with several endpoints the API Gateway alone will require tens of resources in the infrastructure stack
- security roles and policy resources - almost any interaction between resources in an infrastructure stack requires explicit AWS IAM roles and access control policies to be defined and attached (a sketch of such a policy follows this list)
- DNS specific resources - if the API endpoints or S3 objects should be accessible under a custom domain name, a set of AWS Route53 resources needs to be configured
- CDN specific resources - if S3 objects need to be accessible via the Content Delivery Network (CDN), an AWS CloudFront web distribution needs to be configured
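As an illustration of the security resources mentioned above, this is roughly what a least-privilege IAM policy document for one of the functions might look like, written here as a Python dict for consistency with the other sketches. The table name, region and account id are placeholders.

```python
# Hypothetical policy allowing the Course API function to read and write
# only its own DynamoDB table - one of many such documents in a real stack.
course_table_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:GetItem",
                "dynamodb:PutItem",
                "dynamodb:Query",
            ],
            # Placeholder region, account and table name.
            "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/courses",
        }
    ],
}
```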
In the case of the rather simple online learning system we are discussing here, this may not be a massive challenge (even considering the resources not depicted in the diagrams). However, for larger serverless (or even more traditional) systems, using a single large infrastructure stack comes with numerous challenges:
- The more resources included in the stack, the longer it takes to provision or update it
- When many functional areas or logically separate services depend on the same stack, the entire stack has to be reconciled (moved from its current state to the desired state) for each change, even when changes are unrelated to one another
- Some infrastructure resources can take a lot of time to (re)create/update (CloudFront distributions for instance) so including these in the same stack as application specific resources makes it harder to run deployments
- In case of a problem updating the stack, no other changes can be made until the problem is resolved, even though it may be only affecting a small part of the system
- Since all resources are managed together, it’s easy to keep introducing interdependencies between them, which over time can lead to a tangled web of references between resources that will be difficult to manage
- Similarly, security policies applied between infrastructure resources can become too permissive and introduce security risks
- Lastly, using a single stack affects the continuous delivery (CD) setup, as it forces changes to a possibly large stack to be applied each time the application code changes, or the infrastructure configuration step to be excluded from the pipeline altogether
For these reasons, large infrastructure stacks should be avoided when possible. It’s understandable and justifiable that a serverless system may start its life with a single infrastructure stack. However, after a while it might be beneficial to split the ever-growing stack into a set of smaller ones, supporting separate functional areas within the system. With that approach, dependencies between infrastructure resources can be identified and managed more explicitly. In a later section, I discuss best practices for managing shared infrastructure stacks and cross-stack resource dependencies.
Divide and conquer
When it comes to splitting the infrastructure stack of a serverless system, the principles to go by are similar to any other modularisation effort. There are a number of things to consider, but the primary driver should be minimising coupling and maximising cohesion. Secondly, it’s worth remembering that different parts of the system will change at different rates.
Given the strong relationship between the infrastructure and application layer that serverless architectures exhibit, the modularisation exercise at the infrastructure level will affect the modularity of the application and vice versa. This is quite different from the traditional cloud architectures where the infrastructure stack, predominantly composed of compute instances (optionally with a container orchestration solution sitting on top), can be thought of and evolved completely independently of the application components or services.
Even though it’s definitely possible to use many smaller stacks for a traditional cloud infrastructure, it’s not a pattern I’ve seen much in the wild. Most infrastructure estates are managed using a single carefully crafted, large stack (or a handful of them) that is managed exclusively by the (Dev)Ops team and usually changes rather slowly. Additionally, the recent move towards container orchestration solutions favours large infrastructure stacks that are used to run containerised workloads.
Going back to the online learning platform, we can easily identify some functional areas and the corresponding infrastructure resources that can be separated into independent stacks, in particular: content management, course management, user management and payment processing.
Diagram 3: Separate infrastructure stacks for each functional area
Now each stack can be managed independently and when a new feature requires changes to be made to the infrastructure layer, these can be isolated to one or just a couple of stacks.
Shared infrastructure
Earlier, I touched on DNS specific configuration as an example of an infrastructure resource type that wasn’t explicitly called out in the solution diagrams. It’s also a good example of the kind of infrastructure resource that will most likely be defined at the system level and won’t change much throughout the system’s life, even when the application specific resources keep evolving to cater for new functional requirements.
Our online learning platform should definitely be accessible under its own custom domain name (as opposed to the domain name of the cloud provider, like AWS). Setting it up would require a domain name to be registered in AWS Route53, and then a hosted zone and an API Gateway DNS record to be configured for that domain. DNS specific infrastructure resources don’t belong to any of the previously identified stacks and should be part of a new stack dedicated to the DNS setup. The stacks responsible for platform services can then refer to the resources in the DNS stack if necessary.
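As a rough sketch of the kind of resource the DNS stack owns, the snippet below creates a Route53 hosted zone with boto3; in practice this would be declared in the stack’s CloudFormation or Terraform definition rather than scripted, and the domain name is a placeholder.

```python
import uuid
import boto3

route53 = boto3.client("route53")

# The hosted zone belongs to the shared DNS stack; other stacks only reference it.
zone = route53.create_hosted_zone(
    Name="learning-platform.example.com",  # placeholder domain name
    CallerReference=str(uuid.uuid4()),     # must be unique for each request
)

# This identifier is exactly the kind of value other stacks will need to discover.
print(zone["HostedZone"]["Id"])
```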
Diagram 4: The additional shared stack for DNS related resources
DNS specific infrastructure usually doesn’t change much once initially configured, so making it separate from the resources responsible for the functionality of the system makes a lot of sense. Unless the DNS configuration changes, there is no point in checking whether it needs to be updated for each deployment.
Any time the system needs a set of infrastructure resources that are not immediately involved in supporting the functional requirements and that change at a different rate than other parts of the infrastructure, those resources should be kept separate from the application specific elements of the stack so they can be managed independently.
However, any infrastructure resources that depend on the shared infrastructure have to be able to obtain the physical identifiers of the shared resources at deployment time. AWS uses the Amazon Resource Name (ARN) as the primary resource identifier for infrastructure level references. Shared infrastructure stacks need to allow the identifiers of resources that are likely to be referenced by other stacks to be discoverable at deployment time. This can be done by declaring stack outputs in AWS CloudFormation or by looking up resources against the stack state when using HashiCorp Terraform.
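As a sketch of how a dependent stack’s deployment script might discover such an identifier, assume the DNS stack is a CloudFormation stack named dns-stack that declares an output called HostedZoneId (both names are made up for this example):

```python
import boto3

cloudformation = boto3.client("cloudformation")

def get_stack_output(stack_name, output_key):
    """Look up a single output value declared by another CloudFormation stack."""
    stack = cloudformation.describe_stacks(StackName=stack_name)["Stacks"][0]
    for output in stack.get("Outputs", []):
        if output["OutputKey"] == output_key:
            return output["OutputValue"]
    raise KeyError(f"{output_key} not found in stack {stack_name}")

# Hypothetical stack and output names for the shared DNS stack.
hosted_zone_id = get_stack_output("dns-stack", "HostedZoneId")
```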
Infrastructure stack cross-dependencies
So far the application specific stacks of the online learning platform are fairly isolated and don’t have any dependencies, except for the shared, DNS specific infrastructure discussed in the previous section. Now, let’s consider that we need to record the fact that a user has paid for a course (the payment is finalised using the Payment API), activate the course the user has enrolled on, and send an email notification using the third-party email service.
This new functionality would probably be best implemented by attaching a new Payment Publisher function to the Payment Table (part of the Payment Stack) and sending a message on a Payment Topic (using AWS SNS) so that a Payment Subscriber function (part of the User Stack) can be invoked to update the relevant details in the User Table and send the notification message.
Diagram 5: Cross-stack dependencies between resources
With these changes done, the User Stack now has a dependency on the Payment Stack, so that it can subscribe the Payment Subscriber function to the Payment Topic. This type of cross-stack dependency places ordering constraints on how stack deployments are made, to ensure the dependent resources are provisioned before they are required. If the system is deployed into a new environment (for demo or testing purposes), the Payment Stack has to be deployed before the User Stack.
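A rough sketch of the Payment Publisher function is shown below, assuming the Payment Table has a DynamoDB stream enabled and the Payment Topic ARN is injected through an environment variable; the attribute names and message shape are illustrative only.

```python
import json
import os
import boto3

sns = boto3.client("sns")

# The topic ARN would be resolved from the Payment Stack at deployment time
# and passed into the function's configuration.
PAYMENT_TOPIC_ARN = os.environ["PAYMENT_TOPIC_ARN"]

def handler(event, context):
    """Triggered by the Payment Table's DynamoDB stream; publishes completed payments."""
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # only newly recorded payments are of interest here
        new_image = record["dynamodb"]["NewImage"]
        message = {
            # Illustrative attribute names - the real table schema may differ.
            "userId": new_image["userId"]["S"],
            "courseId": new_image["courseId"]["S"],
        }
        sns.publish(TopicArn=PAYMENT_TOPIC_ARN, Message=json.dumps(message))
```

The Payment Subscriber function in the User Stack would be subscribed to the same topic, update the User Table and trigger the email notification.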
In the case of cross-stack dependencies it’s important to ensure that resource identifiers remain stable over time; otherwise a data flow spanning multiple stacks can break if a resource it refers to suddenly changes its identifier (the ARN in the case of AWS resources).
Infrastructure management in the continuous delivery pipeline
Serverless computing is quite unique in its ability to support a rapid pace of delivery while offering scalability at low cost to organisations using it. But in order to take full advantage of the agility that serverless can bring, it’s essential to employ continuous delivery, effectively plugging infrastructure management into the delivery pipeline.
It’s quite natural for serverless architectures to keep the infrastructure configuration together with the application code in the source code repository. When the delivery pipeline kicks off, it usually has to determine whether the infrastructure has to be created or updated before updating the code of any functions that might have changed.
Keeping the infrastructure stacks small and separated between functional areas of the system helps to ensure that changes to the infrastructure can mostly be done within seconds, or at most a couple of minutes, as the task of comparing the current state of the infrastructure resources against the desired one (something that infrastructure management tools like AWS CloudFormation or HashiCorp Terraform are responsible for) can be limited to the parts of the infrastructure that are meant to be updated.
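For example, a pipeline step could use a CloudFormation change set to find out whether a stack actually needs updating before applying anything. Below is a minimal sketch with placeholder stack and file names, and without the error handling a real pipeline would need.

```python
import time
import boto3

cloudformation = boto3.client("cloudformation")

with open("course-stack.yaml") as f:   # placeholder template kept in the repository
    template_body = f.read()

# Ask CloudFormation what would change, without applying anything yet.
cloudformation.create_change_set(
    StackName="course-stack",          # placeholder stack name
    ChangeSetName="pipeline-run-42",   # placeholder change set name
    TemplateBody=template_body,
)

# Poll until the change set has been evaluated (a real pipeline would use a
# waiter and handle timeouts).
while True:
    change_set = cloudformation.describe_change_set(
        StackName="course-stack", ChangeSetName="pipeline-run-42"
    )
    if change_set["Status"] in ("CREATE_COMPLETE", "FAILED"):
        break
    time.sleep(5)

# Only apply the change set if it actually contains changes.
if change_set["Status"] == "CREATE_COMPLETE" and change_set["Changes"]:
    cloudformation.execute_change_set(
        StackName="course-stack", ChangeSetName="pipeline-run-42"
    )
```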
Diagram 6: Stack deployment as part of continuous delivery pipeline
In the case of the online learning platform it’s possible and even necessary to define a separate delivery/deployment pipeline for each stack, which is responsible for applying changes to the infrastructure as well as deploying new versions of the function code. Changes to a specific functional area (stack configuration or application logic) should trigger the pipeline to execute and apply the changes to this particular stack only.
Additionally, for any stacks that don’t have code deployment requirements (as no functions are included) some of the pipeline steps can be skipped, for instance commit stage testing or code deployment itself.
Summary
The shift away from the infrastructure stacks dominated by servers brings new opportunities as well as new challenges. Serverless architectures employ a wider range of cloud services and make infrastructure stacks more heterogeneous and quite often trickier to organise and manage.
In order to effectively manage cloud infrastructure in the serverless era, the philosophy, practices and tools will have to evolve. As modern cloud architectures adopt serverless technologies, infrastructure management will become an integral part of software delivery, practiced by the many rather than the few.
As always, applying sound design principles to drive towards modularity by minimising coupling and maximising cohesion between infrastructure resources is key. This will help avoid large and difficult-to-manage infrastructure stacks and, instead, support the rapid value delivery that serverless architectures promote.
About the Author
Rafal Gancarz is an IT Consultant, currently working with Starbucks. He is a versatile technologist with over 15 years of commercial experience building high quality distributed systems. Rafal is a technical architect with broad expertise in numerous architectural styles and patterns, as well as an excellent hands-on developer, able to tackle complexity at the heart of any IT solution while providing mentoring and guidance within the technical team. He is also a Certified Scrum Master and an experienced agile practitioner and evangelist, passionate about improving project delivery and building highly performing teams. Rafal has recently spent over a year working on a large-scale project using serverless technologies on the AWS platform and has first-hand experience using the serverless stack for building enterprise-grade distributed systems.