A Recipe to Migrate and Scale Monoliths in the Cloud


Key Takeaways

  • Migrating monoliths to the cloud can be scary, but you don’t have to massively re-engineer your application to do that.
  • A monolithic 3-tier application (frontend, backend, and database) can be safely and easily migrated to AWS with the additional benefit of gaining higher levels of resiliency and scalability.
  • Having some previous experience with AWS or other cloud providers certainly helps to speed up the process, but you can also use a monolith migration project as a way to start learning and taking advantage of the cloud.
  • Adopting the cloud and building a distributed system with replicated instances requires you to shift your mindset a bit. It’s best not to think of your machines as perpetually running resources: you will be spinning machines up and tearing them down dynamically, so you need to embrace automation a lot more, build machine images, and start leveraging observability and CI/CD tools.
  • Taking your monolith to the cloud is a great opportunity to start an innovation journey. Once you have your monolith on the cloud you can start thinking about decomposing some components to their own applications and move toward a microservices architecture. You could also start to use more managed services like Lambda or SQS and take advantage of a pool of services that can speed up the development of new features or products.

As a consulting cloud architect at fourTheorem, I see many companies struggling to scale their applications and take full advantage of cloud computing.

These companies range from startups to more consolidated organizations that have developed a product in a monolithic fashion and are finally getting good traction in their markets. Their business is growing, but they are struggling to scale their deployments.

Their service is generally deployed on a private on-premises server, or on a virtual server managed remotely by a hosting provider. With the increased demand for their service, their production environment is starting to suffer from slowness and intermittent availability, which hurts the quality of the service and the potential for further growth.

Moving the product to a cloud provider such as AWS could be a sensible solution here. Using the cloud allows the company to consume resources on demand and pay only for what it uses. Cloud resources can also be scaled dynamically to adapt to bursts of traffic, keeping the user experience consistently good.

Interestingly enough, some of the companies that I have been talking to believe that, in order to transition to the cloud, they necessarily have to re-engineer the entire architecture of their application and switch to microservices or even serverless.

In most circumstances, re-engineering the entire application would be a prohibitive investment in terms of cost and time, and it would divert focus from building the features that help the business grow. This belief makes businesses skeptical about the opportunities the cloud could bring them, and they end up preferring a shorter-term scale-up strategy, where the current application server is upgraded to a more powerful and more expensive machine.

Of course, there is a limit to how big a single server can get, and eventually the business will be back at square one, having to consider alternative solutions.

In this article, I want to present a simple cloud architecture that can allow an organization to take monolithic applications to the cloud incrementally without a dramatic change in the architecture. We will discuss the minimal requirements and basic components to take advantage of the scalability of the cloud. We will also explore common gotchas that might require some changes in your application codebase. Finally, we will analyze some opportunities for further improvement that will arise once the transition to the cloud is completed.

I have seen a good number of companies succeed in moving to the cloud with this approach. Once they have a foothold in the cloud and their application is stable, they can focus on keeping their customers happy and growing their business even more. Moreover, since technology is no longer a blocker, they can start experimenting and transitioning parts of their application to decoupled services. This allows the company to move toward a microservices architecture and even adopt new technologies such as Lambda functions, which can bring greater agility to the development process and lead to additional growth opportunities for the business.

A fictitious company

Let’s make things a bit more tangible here and introduce a fictitious company that we will use as an imaginary case study to explore the topic of cloud migrations.

Eaglebox, Ltd. is a file storage company that offers the Eaglebox App, a web and mobile application that helps legal practitioners keep all their files organized and accessible remotely from multiple devices. 

To get familiar with what Eaglebox App looks like, let’s present a few specific use cases:

  • A user logs into the application and they see all their previously uploaded legal documents.
  • A user uploads new documents and organizes them by providing specific tags (client id, case number, etc.).
  • A user might search for documents containing specific keywords or tags.

Eaglebox App is developed as a monolithic application written using the Django framework, with PostgreSQL as its database.

Eaglebox App is currently deployed on a server on the Eaglebox premises, and all the customer files are kept on the machine’s drive (yes, they are backed up often!). Similarly, PostgreSQL runs as a service on the same machine. The database is backed up often, but it is not replicated.

Eaglebox has recently closed a few contracts with some big legal firms, and since then, they have been struggling to scale their infrastructure. Their server is becoming increasingly slow, and the disk saturates quickly, requiring a lot of maintenance. The user experience has become sub-optimal, and the whole business is currently at risk.

Let’s see how we can help Eaglebox to move to the cloud with a revisited and more scalable architecture.

The challenge

Based on what the engineers at Eaglebox are telling us, we have identified a few crucial problems we need to tackle:

  • Too much load on one machine makes the whole application slow and unresponsive, sometimes even unavailable.
  • Too many files on a local drive are filling up the disk fast. What happens when there is no more disk space available?
  • The PostgreSQL database runs as a service on the same machine as the application, which puts the server under even more pressure. Database reads and writes are themselves becoming another bottleneck for the application.
  • The single monolithic server is a single point of failure. If it fails for any reason, the entire application goes down.

On top of these technical problems, we also need to acknowledge that the team at Eaglebox does not have experience with cloud architectures and that a migration to the cloud will be a learning experience for them. It’s important to limit the amount of change required for the migration to give the team time to adapt and absorb new knowledge.

Our challenge is to come up with an architecture that addresses all the existing technical problems, but at the same time provides the shortest possible path to the cloud and does not require a major technological change for the team.

A simple and scalable cloud architecture

To address Eaglebox’s challenges, we are going to suggest a simple, yet very scalable and resilient cloud architecture, targeting AWS as the cloud provider of choice.

This architecture will have the following components:

  • An Application Load Balancer (the entry point)
  • A set of EC2 virtual machines (running multiple instances of the application)
  • File storage (S3)
  • Session storage (an in-memory cache: Redis on ElastiCache)
  • A multi-availability-zone PostgreSQL database running on RDS

Figure 1. High-level view of the proposed architecture.

In Figure 1, we can see a high-level view of the proposed architecture. Let’s zoom in on the various components.

Data centers and networking

Before we discuss the details of the various components, it is important to briefly explore how AWS exposes its data centers and how we can configure the networking for our architecture. We won’t go into great detail, but we need to cover the basics to understand what kind of failures we can expect, how we can keep the application running even when things do fail, and how we can make it scale when traffic increases.

The “cloud” is not infallible; things break there too. Cloud providers like AWS, Azure, and Google Cloud give us tools and best practices for designing resilient architectures, but it’s a shared responsibility model: we need to understand what the provider guarantees, what could break, and how.

When it comes to networking, there are a few high-level concepts that we need to introduce. Note that I will be using AWS terminology here, but the concepts should also apply to Azure and Google Cloud.

  • Region: a physical location around the world (e.g. “North Virginia,” “Ireland,” or “Sydney”) where AWS hosts a group of data centers. Regions help to provision infrastructure that is closer to the customers so that our applications can have low latency and feel responsive.
  • Availability Zone: a discrete data center with redundant power, networking, and connectivity in an AWS Region. Data centers in different availability zones are isolated from one another, so a serious outage rarely affects more than one availability zone at a time. It’s good practice to spread redundant infrastructure across different availability zones in a given region to guarantee high availability.
  • VPC: a virtual (private) network provisioned in a given region for a given AWS account. It is logically isolated from other virtual networks in AWS. Every VPC has a range of private IP addresses organized in one or more subnets.
  • Subnet: a range of IPs in a given VPC and a given availability zone that can be used to spin up and connect resources within the network. Subnets can be public or private. A public subnet can run instances that have a public IP assigned to them and are reachable from outside the VPC itself. It’s generally good practice to keep front-facing servers (or load balancers) in public subnets and everything else (backend services, databases, etc.) in private subnets. Traffic between subnets can be enabled through routing tables, allowing, for instance, a load balancer in a public subnet to forward traffic to backend instances in a private subnet.

For our architecture, we will go with a VPC configuration like the one illustrated in Figure 2.

Figure 2. VPC configuration for our architecture

The main idea is to select a Region close to our customers and create a dedicated VPC in that region. We will then use 3 different availability zones and have a public and a private subnet for every availability zone.

We will use the public subnets only for the load balancer, and we will use the private subnets for every other component in our architecture: virtual machines, cache servers, and databases.

Action point: Start by configuring a VPC in your region of choice. Make sure to create public and private subnets in different availability zones.
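If you define your infrastructure as code (a practice we discuss later in this article), this setup fits in a few lines. Here is a minimal sketch using the AWS CDK in Python; the construct names and CIDR range are illustrative assumptions, not prescriptions:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

class NetworkStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # One public and one private subnet in each of three availability zones.
        # Deploy with an explicit account/region so all three AZs are resolved.
        self.vpc = ec2.Vpc(
            self,
            "EagleboxVpc",
            ip_addresses=ec2.IpAddresses.cidr("10.0.0.0/16"),
            max_azs=3,
            subnet_configuration=[
                ec2.SubnetConfiguration(
                    name="public", subnet_type=ec2.SubnetType.PUBLIC, cidr_mask=24
                ),
                ec2.SubnetConfiguration(
                    name="private",
                    subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS,
                    cidr_mask=24,
                ),
            ],
        )

app = App()
NetworkStack(app, "eaglebox-network")
app.synth()
```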

Load Balancer

The load balancer is the “entry point” for all the traffic going to the Eaglebox App servers. We will use an Application Load Balancer (layer 7), which can manage HTTP, HTTPS, WebSocket, and gRPC traffic. It is configured to distribute the incoming traffic to the virtual machines serving as backend servers, and it can check the health of the targets, making sure to forward incoming traffic only to the instances that are healthy and responsive.

Action point: Make sure your monolith has a simple endpoint that can be used to check the health of the instance. If there isn’t one already, add it to the application.
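Since Eaglebox App is a Django application, the health check can be a trivial view, wired into urlpatterns with something like path("health/", health). A minimal sketch (the module and route names are hypothetical):

```python
# views.py: a minimal health check view. Keep it cheap, since the load
# balancer will call it frequently.
from django.http import JsonResponse

def health(request):
    return JsonResponse({"status": "ok"})
```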

Through an integration with ACM (AWS Certificate Manager), the load balancer can use a certificate and serve HTTPS traffic, making sure that all the incoming and outgoing traffic is encrypted.

From a networking perspective, the load balancer is configured to use all the public subnets and, therefore, all the availability zones. This makes the load balancer highly available: if an availability zone suddenly becomes unavailable, the traffic will automatically be routed through the remaining availability zones.

In AWS, Elastic Load Balancers are well capable of handling growing traffic: a single load balancer can distribute millions of requests per second. For most real-life applications, we won’t need to do anything in particular to scale the load balancer. Finally, it’s worth mentioning that this kind of load balancer is fully managed by AWS, so we don’t need to worry about system configuration or software updates.

Virtual Machines

Eaglebox App is a web application written in Python using the Django framework. We want to be able to run multiple instances of the application on different servers simultaneously, so that the application can scale with increasing traffic. Ideally, we want to spread the instances across different availability zones: again, if one availability zone becomes unavailable, instances in the other zones can handle the traffic and avoid downtime.

To make the instances scale dynamically, we can use an autoscaling group. Autoscaling groups allow us to define the conditions under which new instances of the application will automatically be launched (or destroyed in case of downscaling). For instance, we could use the average CPU levels or the average number of requests per instance to determine if we need to spin up new instances or, if there is already plenty of capacity available, we can decide to scale the number of instances down and save on cost. To guarantee high availability, we need to make sure there is always at least one instance available in every availability zone.
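As a sketch of how the autoscaling group could be defined with the AWS CDK in Python (the instance type, AMI ID, and scaling threshold are illustrative assumptions):

```python
from aws_cdk import Stack
from aws_cdk import aws_autoscaling as autoscaling
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

class AppStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, *, vpc: ec2.IVpc, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        asg = autoscaling.AutoScalingGroup(
            self,
            "EagleboxAsg",
            vpc=vpc,
            vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS),
            instance_type=ec2.InstanceType("t3.medium"),
            # A custom image baked with EC2 Image Builder or Packer (see below)
            machine_image=ec2.MachineImage.generic_linux({"eu-west-1": "ami-0123456789abcdef0"}),
            min_capacity=3,  # at least one instance per availability zone
            max_capacity=12,
        )
        # Add or remove instances to keep the average CPU around 60%
        asg.scale_on_cpu_utilization("CpuScaling", target_utilization_percent=60)
```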

In order to provision a virtual machine, it is necessary to build a virtual machine image. An image is effectively a way to package an operating system, all the necessary software (e.g. the Python runtime), the source code of our application, and all its dependencies.

Having to define images to start virtual machine instances might not seem like an important detail, but it is a big departure from how software is generally managed on premises. On premises, it’s quite common to keep virtual machines around forever. Once a machine is provisioned, it’s common practice for IT managers to log in to it to patch software, restart services, or deploy new releases of the application. This is not feasible anymore once multiple instances are around and they are automatically started and destroyed in the cloud.

A best practice in the cloud is to consider virtual machines “immutable”: once they are started they are not supposed to be changed. If you need to release an update, then you build a new image and start to roll out new instances while phasing out the old ones.

But immutability does not only affect deployments or software updates. It also affects the way data (or “state” in general) is managed. We cannot afford to store any persistent state locally in the virtual machine anymore: if the machine gets shut down, we lose all its data. So, no more files saved on the local filesystem and no more session data in the application’s memory.

With this new mental model, “infrastructure” and “data” become well-separated concerns that are handled and managed independently of one another.

As we go through the exercise of reviewing the existing code and building the virtual machine images, it will be important to identify all the parts of the code that access data (files, database records, user session data, etc.) and make the necessary changes to ensure that no data is stored locally within the instance. We will discuss our options here in more depth as we go through the different types of storage that our architecture needs.

But how do we build a virtual machine image?

There are several different tools and methodologies that can help us with this task. Personally, the ones I have used in the past, and have been quite happy with, are EC2 Image Builder by AWS and Packer by HashiCorp.

Database

In AWS, the easiest way to spin up a relational database such as PostgreSQL is to use RDS: Relational Database Service. RDS is a managed service that allows you to spin up a database instance for which AWS will take care of updates and backups.

RDS PostgreSQL can be configured to have read replicas. Read replicas are a great way to offload the read queries to multiple instances, keeping the database responsive and snappy even under heavy load.

Another interesting feature of RDS is the possibility to run a PostgreSQL instance in Multi-AZ mode. This means that the main instance of the database runs in a specific AZ, while standby replicas in other AZs are kept ready in case the main AZ fails. AWS takes care of performing an automatic switch-over in case of disaster, making sure your database is back online as soon as possible and without any manual intervention.

Keep in mind that Multi-AZ failover is not instantaneous (it generally takes 60-120 seconds), so you need to engineer your application to work (or at least show a clear, descriptive message to the users) even when a connection to the database cannot be established.
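In a Django application, one way to degrade gracefully is to catch the database exception at the view (or middleware) level and return a 503 with a clear message. A minimal sketch; the view and model names are hypothetical:

```python
# A view-level sketch: degrade gracefully while the database fails over.
from django.db import OperationalError
from django.http import JsonResponse

from .models import Document  # hypothetical model

def list_documents(request):
    try:
        documents = list(Document.objects.filter(owner=request.user))
    except OperationalError:
        # The database is briefly unreachable (failover takes ~60-120 seconds)
        return JsonResponse(
            {"error": "Service temporarily unavailable, please retry shortly."},
            status=503,
        )
    return JsonResponse({"documents": [d.name for d in documents]})
```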

Now, the main question is: how do we migrate the data from the on-premises database to a new instance on RDS? Ideally, we would like a process that allows us to transition between the two environments gradually and without downtime, so what can we do about that?

AWS offers a dedicated service for this: AWS Database Migration Service (DMS). DMS allows you to replicate all the data from the old database to the new one. The interesting part is that it can also keep the two databases in sync during the switch-over, when, due to DNS propagation, some users might land on the new system while others are still routed to the old server.

Action point: Create a database instance on RDS and enable Multi-AZ mode. Use AWS Database Migration Service to migrate all the data and keep the two databases in sync during the switch-over phase.

File Storage

In our new architecture, we can implement distributed file storage by simply adopting S3 (Simple Storage Service). S3 is one of the very first AWS services and probably one of the most famous.

S3 is a durable object storage service that allows you to store an arbitrary amount of data. Objects are stored in buckets (logical containers with a unique name). S3 uses a key/value storage model: every object in a bucket is uniquely identified by a “key,” which points to the object’s content and metadata.

To start using S3 and be able to read and write objects, we can use the AWS SDK. It is available for many languages (for Python, it is the boto3 package) and offers a programmatic interface to interact with all AWS services, including S3.
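For example, a minimal boto3 sketch for storing and retrieving a document (the bucket name and keys are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "eaglebox-documents"  # hypothetical bucket name

# Upload a user's document under a key that encodes client and case
s3.upload_file("contract.pdf", BUCKET, "client-42/case-7/contract.pdf")

# Download it back
s3.download_file(BUCKET, "client-42/case-7/contract.pdf", "/tmp/contract.pdf")
```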

We can also interact with S3 by using the AWS Command Line Interface (CLI). The CLI has a command that is particularly convenient in our scenario: the sync command, with which we can copy all the existing files into an S3 bucket of our choice.
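Assuming the existing files live under /srv/eaglebox/files and the bucket is named eaglebox-documents (both hypothetical), the initial copy is a one-liner:

```
aws s3 sync /srv/eaglebox/files s3://eaglebox-documents
```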

To transition smoothly between the two environments, a good strategy is to start using S3 straight away from the existing environment. This means we will need to synchronize all our local files into a bucket, and then make sure that every new file uploaded by the users is copied into the same bucket as well.
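For a Django application, one way to route new uploads to S3 is the popular third-party django-storages package. A sketch of the relevant settings (the bucket name is hypothetical; on Django 4.2+ you would use the newer STORAGES setting instead):

```python
# settings.py (sketch): route Django file uploads to S3 via django-storages
INSTALLED_APPS = [
    # ... existing apps
    "storages",
]

DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "eaglebox-documents"  # hypothetical bucket
```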

Action point: File migration. Create a new S3 bucket. Synchronize all the existing files into the bucket. Save every new file in S3.

Session Storage

In our new architecture, we will have multiple backend servers handling requests for the users. Given that the traffic is load balanced, a user’s request might be handled by one backend instance, while the next request from the same user might be served by a different instance.

For this reason, all the instances need access to a shared session storage. Without it, an instance won’t be able to recognize a user session that was initiated through a different instance.

A common way to implement a distributed session storage is to use a Redis instance.

The easiest way to spin up a Redis instance on AWS is to use a service called ElastiCache. ElastiCache is a managed service for Redis and Memcached; as with RDS, it is built in such a way that you don’t have to worry about the operating system or about installing security patches.

ElastiCache can spin up a Redis cluster in Multi-AZ mode with automatic failover. As with RDS, this means that if the availability zone hosting the primary node of the cluster becomes unreachable, ElastiCache will automatically perform a DNS failover and switch to one of the standby replicas in another availability zone. In this case too, the failover is not instantaneous, so it’s important to account at the application level for the fact that a connection to Redis might not be possible during a failover.
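On the application side, Django can store sessions in the shared Redis cache. A sketch of the settings, assuming Django 4+ (which ships a built-in Redis cache backend) and an illustrative ElastiCache endpoint:

```python
# settings.py (sketch): keep sessions in the shared Redis cluster
CACHES = {
    "default": {
        # Built-in Redis backend, available since Django 4.0
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        # Hypothetical ElastiCache primary endpoint
        "LOCATION": "redis://eaglebox-cache.abc123.euw1.cache.amazonaws.com:6379",
    }
}

# Store session data in the cache instead of the local instance
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
```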

Action point: Provision a Redis cluster using ElastiCache and make sure all the session data is stored there.

DNS

The final step in our migration is DNS: how do we start forwarding traffic to our new infrastructure on AWS?

The best way to do that is to configure all our DNS for the application in Route 53. Route 53 is a highly available and scalable cloud DNS service.

It can be configured to forward all the traffic on our application domain to our load balancer. Once we configure and enable this (and the DNS change has propagated), we will start receiving traffic on the new infrastructure.

If your domain has been registered somewhere else you can either transfer the domain to AWS or change your registrar configuration to use your new Route 53 hosted zone as a name server.

Action point: Create a new hosted zone in Route 53 and configure your DNS to point your domain to the application load balancer. Once you are ready to switch over, point your domain registrar to Route 53 or transfer the domain to AWS.
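With the CDK, the alias record pointing the domain at the load balancer could look like this sketch (the domain name and construct wiring are illustrative assumptions):

```python
from aws_cdk import aws_elasticloadbalancingv2 as elbv2
from aws_cdk import aws_route53 as route53
from aws_cdk import aws_route53_targets as targets

def add_dns(stack, alb: elbv2.IApplicationLoadBalancer) -> None:
    # Hypothetical domain; the zone could also be looked up if it already exists
    zone = route53.PublicHostedZone(stack, "Zone", zone_name="eaglebox.example.com")
    route53.ARecord(
        stack,
        "AppAlias",
        zone=zone,
        target=route53.RecordTarget.from_alias(targets.LoadBalancerTarget(alb)),
    )
```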

Other recommendations

As we have seen, this new architecture consists of a good amount of moving parts. How can we keep track of all of them and make sure all our environments (e.g. development, QA, and production) are as consistent as possible?

The best way to approach this is through Infrastructure as Code (IaC). IaC allows you to keep all your infrastructure defined declaratively as code. This code can be stored in a repository (even the same repository you already use for the application codebase), making all your infrastructure visible to all the developers, who can review changes and contribute directly. More importantly, IaC gives you a repeatable process for shipping changes across environments, which helps keep things aligned as the architecture evolves.

The tool of choice when it comes to IaC on AWS is CloudFormation, which allows you to specify your infrastructure templates using YAML. Another alternative from AWS is the Cloud Development Kit (CDK), which provides a higher-level interface for defining your infrastructure in programming languages such as TypeScript, Python, or Java.

Another common alternative is a third-party cross-cloud tool called Terraform.

It’s not important which tool you pick (they all have their pros and cons), but it is extremely important to define all the infrastructure as code, so you can build a solid process around shipping infrastructure changes to the cloud.

Another important topic is observability. With so many moving parts, how do we debug issues and make sure the system is healthy? Discussing observability goes beyond the scope of this article, but if you are curious to start exploring topics such as distributed logs, tracing, metrics, alarms, and dashboards, make sure to have a look at CloudWatch and X-Ray.

Infrastructure as code and observability are two extremely important practices that will help you deploy applications to the cloud and keep them running smoothly.

We are live! What’s next?

So now that we are in the cloud, is our journey over? Quite the contrary, this journey has just begun and there is a lot more to explore and learn about.

Now that we are in the cloud we have many opportunities to explore new technologies and approaches.

We could start to explore containers or even serverless. If we are building a new feature, we are no longer constrained to deploying it into the single monolithic server: we can build it in a more decoupled way and leverage new tools.

For instance, let’s say we need to build a feature that notifies users by email when new documents for a case have been uploaded by another user. One way to do this is to use a queue and a worker: the application publishes to a queue a job describing the notification email, and a pool of workers processes these jobs from the queue, doing the hard work of sending the emails.

This approach allows the backend application to stay snappy and responsive and delegate time-consuming background tasks (like sending emails) to external workers that can work asynchronously.

One way to implement this on AWS is to use SQS (queue) and Lambda (serverless compute).
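As a sketch of this pattern (the queue URL, message fields, and email helper are hypothetical):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/email-jobs"  # hypothetical

def enqueue_notification(case_id: str, recipient: str) -> None:
    """Called from the web application: publish a job instead of sending inline."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"case_id": case_id, "recipient": recipient}),
    )

def handler(event, context):
    """Lambda worker, triggered by SQS: does the slow work of sending emails."""
    for record in event["Records"]:
        job = json.loads(record["body"])
        send_email(job["recipient"], job["case_id"])

def send_email(recipient: str, case_id: str) -> None:
    ...  # hypothetical helper, e.g. implemented with Amazon SES
```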

This is just an example that shows how being in the cloud opens up new possibilities that can allow a company to iterate fast and keep experimenting while leveraging a comprehensive suite of tools and technologies available on demand.

The cloud is a journey, not a destination. This journey has just begun, so enjoy the ride!

Bonus: a checklist to move the monolith to the cloud

  • Create an AWS Account
  • Select a tool for IaC
  • Create and configure a VPC in a region (3 AZs, Public/Private subnets)
  • Create an S3 bucket
  • Update the old codebase to save every new file to S3
  • Copy all the existing files to S3
  • Spin up the database in RDS (Multi-AZ)
  • Migrate the data using Database Migration Service
  • Provision the ElastiCache Redis Cluster (Multi-AZ)
  • Create an AMI for the application
  • Create a security group and an IAM policy for EC2
  • Make the application stateless
  • Create a health check endpoint
  • Create an autoscaling group to spin up the instances
  • Create a certificate in ACM
  • Provision an Application Load Balancer (public subnets)
  • Configure HTTPS, Targets, and Health Checks
  • Configure Route 53
  • Traffic switch-over through DNS
