
Orchestrating Hybrid Workflows with Apache Airflow


Summary

Ricardo Sueiras discusses how to leverage Apache Airflow to orchestrate a workflow using data sources inside and outside the cloud.

Bio

Ricardo Sueiras has spent over 30 years in the technology industry, helping customers solve business problems with open source and cloud. He is currently a Developer Advocate at AWS focusing on open source, where he helps raise awareness of AWS and works towards making AWS a great place to run open source software.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Sueiras: Many customers who are moving to the cloud want to know how to build and orchestrate data pipelines across on-premises, remote, and cloud environments. My name is Ricardo Sueiras. I'm going to show you how you can leverage Apache Airflow to orchestrate workflows using data sources both inside and outside of the cloud. I'll cover why customers care about orchestrating these hybrid workflows and some of the typical use cases you might see, explore some of the options you have within Apache Airflow and the tradeoffs you need to think about, and then dive into some code and a demo of building and orchestrating a hybrid workflow.

Data Orchestration in Hybrid Environments

Apache Airflow has become a very popular solution for customers who want to solve the problem of how to orchestrate data pipelines. Rather than building and relying on a homegrown solution, Apache Airflow provides a proven set of capabilities to help you create, deploy, and manage your workflows. It also has a great community of contributors who are driving and innovating the project forward. For customers who have a strong preference for open source, it has become a key part of their data engineering infrastructure. As customers move to the cloud, they look for help in building hybrid solutions that can integrate with existing legacy systems and leverage the data that resides in them, wherever that data might be. As customers build those data pipelines, however, they encounter a number of challenges. For some data, there may be strong regulatory or compliance controls that limit where that data can be processed or where it can reside, yet customers still want to get insights from that data. They may want to integrate with legacy and heritage systems that can't move to the cloud for whatever reason. They want to do this in a way that's simple and doesn't rely on overly complex solutions. They also want a cost-effective solution in case they need to move large volumes of data.

Reviewing Apache Airflow Operators

It's no surprise that Apache Airflow has a number of operators that allow you to simplify how you orchestrate your data pipelines. Which one of these is best suited for these hybrid scenarios? The reality is that most of the Apache Airflow operators can work as long as you've got network connectivity to the systems that you're connecting to. Architecting solutions that work across hybrid environments requires some planning and consideration of the different strengths of each approach, especially when you're considering compliance and regulatory concerns. For example, say we wanted to build a workflow that performed regular batch processing of some data in MySQL databases deployed across our remote, data center, and cloud environments. We have a number of different options we might want to consider. We could use, for example, a PythonOperator to remotely access our data sources. We could create code in Python, and that code would run on an Apache Airflow worker node. We could reuse that code, and we could package it up as a plugin to make it simpler and more reusable. The code would still be running on the worker node in the cloud, however, so it may not meet our compliance and regulatory requirements. If you've got Kubernetes skills, you may be able to implement your ETL logic within a container image, and then deploy and run this image in a Kubernetes cluster that you might have. This is a very popular option for many customers. You would need to implement a solution locally that would allow you to deploy your container, and you'd then have to make sure you've got the right skills locally to manage and maintain those Kubernetes clusters.
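To make the first of those options concrete, here is a rough sketch of what the PythonOperator approach could look like. This is not code from the talk; the connection details, query, and DAG name are placeholders, and the point is simply that the query result ends up on the Airflow worker in the cloud.

```python
# Hypothetical sketch of the PythonOperator approach described above.
from datetime import datetime

import pymysql  # assumes a MySQL client library is installed on the worker
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_remote_mysql(**_):
    # Placeholder connection details; in practice these would come from an
    # Airflow connection or a secrets backend rather than being hard-coded.
    conn = pymysql.connect(host="mysql.remote-office.example",
                           user="etl", password="change-me", database="localdemo")
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM customers")
        rows = cur.fetchall()
    conn.close()
    # The rows are now in memory on the Airflow worker, i.e. in the cloud,
    # which is exactly the property that may conflict with data residency rules.
    return len(rows)


with DAG("python-operator-remote-mysql",
         start_date=datetime(2022, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:
    PythonOperator(task_id="extract", python_callable=extract_remote_mysql)
```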

You could also use the MySQLOperator if you've got the network connectivity and networking infrastructure in place, although you would also need to put some additional controls in place to enable access to the various components. Again, like the PythonOperator, this would potentially be limited in those use cases where compliance and regulatory requirements mean that you aren't able to process that data centrally. Now, if we're using AWS managed services, we could use a number of operators such as the AthenaOperator, which allows you to create federated queries using an open source SDK that helps you build connectors to whatever data sources you've got. As long as you've got that network VPN connectivity, you create your federated query over a Lambda function, and that accesses the data. Again, this requires the processing to be done in the cloud, and so may not meet your regulatory and compliance needs. This is what it would typically look like when you deploy it. Similar to the Kubernetes operator, the ECSOperator allows you to run container workloads but with a much simpler deployment model. Rather than having to manage and operate a Kubernetes cluster remotely, we can use something called ECS Anywhere, which allows you to deploy and run containers easily via the ECS agent. I'm going to explore how you can use this container-based approach together with that operator for the rest of the presentation.

Enabling Hybrid ECSOperator

ECS Anywhere allows you to extend the ECS data plane from AWS into your own environments. It could be on-prem, in a remote office, or even in other cloud providers. It provides a simplified deployment that makes it really easy to get up and running, and it requires only outbound connectivity. It runs on a number of Linux operating systems, as you can see there, and very recently we added Windows support as well. Once you've installed this agent on a host, you effectively have a managed container instance, which allows you to deploy into that environment via the ECS control plane.

Development Workflow

A common pattern for Apache Airflow is to containerize your ETL logic and then run it via one of the container operators, whether it's the Kubernetes one I talked about or the ECSOperator. We're going to take this approach in the demo. We're going to show how we can develop some simple ETL logic, package it as a container image, push that container image to our repository, and then deploy and run it anywhere. When we're using ECS Anywhere, that means anywhere the ECS Anywhere agent is running. We can then use the Apache Airflow operator to orchestrate the running of those container images anywhere the agent is running.

Demo - Hybrid Data Pipeline

Let's take a look at what we're actually going to build in the demo. This is going to be our hypothetical scenario. We've got our cloud environment and a number of MySQL databases, and we want to store extracts of those in our central data lake. We want to bring that information in from all of those environments, but we don't want to deploy complex VPNs; we want a really simple solution, because we've got limited IT resources at these external locations. We want to create a workflow in Apache Airflow to orchestrate this on a regular basis and centralize the management and operations of those workflows, so that we can execute them and schedule them, but also see the output and make sure they're running as they need to be.

These workflows will use the ECSOperator in conjunction with the ECS Anywhere agent to allow us to deploy our container from AWS into that remote environment, run the ETL logic, and then push that data into our data lake. This is the actual architecture of what we're going to build in the demo. We can break this down, and we can see that these are MySQL databases. The one at the top is the cloud-based one, and the ones at the bottom are local. Then we've got our data lake there. We're going to create our ETL container image and Dockerize it. We're going to push that to a repository. We're then going to run and test it, first locally and then via the ECS control plane. Then we're going to deploy it locally via the ECS agent. Finally, we're going to create a workflow that puts all these pieces together and shows how we can orchestrate and use all those moving parts to manage and kick off those ETL processes, both in the cloud and in our remote location.

I'm going to switch now to code and take it from there. This is the repository that contains all the code I'm going to use; I've checked it out already. We've got a couple of databases. One is in RDS, and it contains a sample table, customers. If we have a look at those, we can see some sample data which I've generated using a sample data generator. We also have an EC2 instance running a similar database schema but with different data. The EC2 instance here is simulating a remote office. It's not connected to anything at the moment; we'll show how we can install the ECS agent and bring that in. We've got the data. Let's come up here. We've got our ETL logic, our code, which we can see in the app here. If we look at this code, it's a sample Python script that takes a number of parameters, which we can see at the bottom here. First of all, there's the S3 bucket where we want to store this, the file name we want to use, the query we want to pass in, and then the AWS parameters. The reason why we need the AWS parameters is that we need to know which database to connect to. If we go to AWS Secrets Manager, where we're storing this information, we can come to this sample here. These values are stored here: we have a database name, a username, a password, and a host, and these are encrypted. This allows you to access that information from your code without having to know the details; all I need to know is the name of the secret itself. That allows everything to be nice and secure.
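The talk doesn't show the script line by line, but as a rough sketch of the shape just described (the argument order, secret field names, and file layout here are assumptions for illustration, not the exact code in the repository), it might look something like this:

```python
# Hypothetical sketch of the containerized ETL script: query MySQL using
# credentials held in AWS Secrets Manager, then upload the result to S3.
import csv
import json
import sys

import boto3
import pymysql


def get_db_secret(secret_name, region):
    client = boto3.client("secretsmanager", region_name=region)
    return json.loads(client.get_secret_value(SecretId=secret_name)["SecretString"])


def main():
    if len(sys.argv) < 6:
        sys.exit("usage: app.py <s3-bucket> <s3-key> <query> <secret-name> <region>")
    bucket, key, query, secret_name, region = sys.argv[1:6]

    secret = get_db_secret(secret_name, region)  # expects host/username/password/database keys
    conn = pymysql.connect(host=secret["host"], user=secret["username"],
                           password=secret["password"], database=secret["database"])
    with conn.cursor() as cur:
        cur.execute(query)
        rows = cur.fetchall()
    conn.close()

    local_file = "/tmp/extract.csv"
    with open(local_file, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    boto3.client("s3", region_name=region).upload_file(local_file, bucket, key)
    print(f"Wrote {len(rows)} rows to s3://{bucket}/{key}")


if __name__ == "__main__":
    main()
```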

We can try this out. I've got some sample queries here. If we run this, two things should happen, because what this code does is run a query and then upload the results to our sample data lake, an Amazon S3 bucket. If we run this, we get a nice result, which is good. We can see it's returned just the data for the Polish users. If we go to our data lake, which is on S3, and refresh, we can see we've got this folder here, which contains the name of the file. If we download it and have a look at this file, we can see that it is our data. We've now got this script working, which is the first part. The next part is to actually containerize this. Within this folder, I've got a setup script, which goes through the process of how you would typically do this. I'm using AWS, so I'm going to build and package this up and store it in Amazon ECR, which is the container repository that Amazon provides. We're going to create a manifest and tag it so that we can then run this. To make this easier, I've created a setup script, and I'm just going to run it here. That's going to go through all the processes. That's now finished; it took about 15 minutes on my slow broadband connection.

Let's go over to the console and the Amazon ECR repository, and we should be able to see that we've now got a new repository called hybrid-airflow. Within it, we've got a container image. We can try running this through docker run, and we should hopefully get it running. It's doing exactly what we expect, which is to report the fact that we haven't provided enough command line arguments; if we look at the script, it's expecting a number of arguments, otherwise it generates this error message. What we can do is take the parameters we used before and append them. Let's see what happens. It's actually failed. This is to be expected. The reason why is that we've created a container and we're now executing that container, and the container is trying to access AWS services, but there are no credentials in it. What we need to do is pass in the AWS credentials. It's a long command line string, and I've actually got this already prepared. What I've done is create two environment variables, which contain my access key and secret key, and pass them in, and they will then be passed into the container. This time, if it works, it should hopefully return exactly the same results.

The next step is to take our container and deploy it in the cloud. For this, we're going to use Amazon ECS, the Elastic Container Service. We're going to provision an ECS cluster, create a task definition, which will take the container that we've just created, and then run that. How do we do that? We're going to create an environment using infrastructure as code. For that, I'm going to use an open source project called CDK, or the Cloud Development Kit. I've already got the code ready to go, which I'll share here, and it's in the repository. We've got some parameters we need to define for our cluster; it's just the cluster name. We need the container image, which we just created in the script, so we define that here, and then the S3 bucket, which we're going to use later on. We then have two stacks: the first creates some networking infrastructure, our Virtual Private Cloud, and the second creates the actual cluster itself, including all the permissions we need, scoped down to the bare minimum so that we can only do what we need to do in order to run this task. To deploy this, from the command line, we go to the CDK folder and do cdk list, which lists all the available stacks we can deploy. We can then do a cdk deploy. I've already deployed the VPC one, so let's deploy the actual ECS cluster. This will prompt me to confirm whether I want to do this, and also alert me to any security settings that are changing or being added. This is going to take about 5 minutes, so I'm going to put this on pause while it deploys. This has now finished, and we can see it's generated some output.
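The stacks themselves live in the repository; as a rough CDK for Python sketch of what the cluster stack might contain, assuming CDK v2 and with construct names, instance size, and image URI as illustrative placeholders rather than the repository's exact code:

```python
# Illustrative CDK (Python) sketch of the ECS cluster stack: a cluster with one
# EC2 capacity instance plus a task definition wrapping the ETL container image.
from aws_cdk import Stack, aws_ec2 as ec2, aws_ecs as ecs
from constructs import Construct


class EcsEtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, vpc: ec2.IVpc, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # Cluster name follows the demo; the capacity instance type is an assumption.
        cluster = ecs.Cluster(self, "Cluster",
                              cluster_name="qcon-hybrid-airflow", vpc=vpc)
        cluster.add_capacity("Capacity",
                             instance_type=ec2.InstanceType("t3.medium"),
                             desired_capacity=1)

        # Task definition wrapping the ETL image pushed to ECR earlier.
        task_def = ecs.Ec2TaskDefinition(self, "EtlTask")
        task_def.add_container(
            "etl",
            image=ecs.ContainerImage.from_registry(
                "<account>.dkr.ecr.<region>.amazonaws.com/hybrid-airflow"),
            memory_limit_mib=512,
            logging=ecs.LogDrivers.aws_logs(stream_prefix="hybrid-airflow"),
        )
```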

If we go to the AWS console and want to see what was actually deployed, we can go to CloudFormation. We should see a number of stacks up here. Here we can see the VPC stack, and this is the one we've just deployed. We can see all the resources that this stack has deployed. We can then go to the console for Amazon ECS. If we refresh, we should now see our cluster. This is the cluster here. If we look at the code, we can see that we gave it a name here and it matches. We can see that this cluster has got some resources; in this instance, it's a virtual machine. When we run and execute one of our containers, it's going to be running in the cloud. The actual configuration for the application, that is, the script we ran earlier on, is defined in a task definition. We can see here it's called Apache Airflow. When we click on it, we can see the configuration of this task. If we go down here, we can see that we've got the container image name that we created, and then we've got the command that we're going to pass into that container.

We can quickly test this out to make sure it's working by running the task and defining where we want it to run. In this instance, we're going to run it in the cloud, so we're going to run it on the EC2 instance. We click on Run Task. It's now passed that task to the resources of the cluster. The cluster is going to take that task and run it on the available resources, that one virtual machine. It's running here, we can see. It's in the pending state because this is the first time the image is running, so it's going to download it from Amazon ECR and then execute it. Subsequent runs will be quick, unless you change the image, of course. That's now finished. We can click on here, and we can see that the task has finished. We can go over to the logs, and we can see the output, which should match what we had before. What we've done here is take that ETL script which we developed locally, containerized it locally, pushed it to the cloud, and we're now running it on the ECS cluster.

The next step is to show how we can create a hybrid version of this, running in your remote location; it could be your data center, a remote office, or even another public cloud. For that, we're going to use ECS Anywhere. I have an EC2 instance which I can connect to. I've got the details here, I've got the IP address, and I can connect to it because I have the key here. It's running Amazon Linux, so I connect as ec2-user. I'm now connected. Even though this EC2 instance is running in AWS, I'm using it to mimic a remote office. On this EC2 instance, I've actually got MySQL running, which I can log into, hopefully, then use the localdemo database and select from customers. We can see it's got a similar set of data to what we saw when I was in Visual Studio Code. We've got this machine. If you recall, when we were in the cluster, under ECS instances we could see that this cluster has one node, and this is the node currently being used to run all the containers. To create a hybrid version, we can use this button here called Register External Instances. When we click on this, it's going to walk us through a couple of screens that allow us to generate a configuration command, which we can then run on this EC2 instance. It will install the ECS Anywhere agent, configure all the permissions and roles, and then integrate that machine into this cluster, allowing us to effectively extend the resource capabilities of this ECS cluster and run that container on that EC2 instance or remote machine.

When I created the ECS cluster via CDK, we created a number of roles. I'm going to use one of those roles here, this one. This is the role that is going to be used on the remote machine to give it just enough permissions to access the ECR repository so it can download containers, as well as access CloudWatch so it can write logs. We specify this one here. We only want one instance. For the activation key, you can leave the default limits there for now. We can see here that because I'm running a Linux machine, I use this command. Very recently, we added Windows support, so if you were running a Windows machine, you could use that one. I click on Copy here, go to my Linux machine, and run this as root. I'm going to put this on pause because it takes a while. During this process, it's downloading the agent, installing and configuring the agent, registering it into AWS Systems Manager, and then into the ECS control plane. I'll come back once it's finished. It took about 5 minutes to run. We can see that it's been successful; it gives us a message saying that it's been successfully integrated.

If we now go back to our ECS cluster and refresh, we should see that we've got two resources. We've got this new one here, which has got a different identifier; the MI stands for Managed Instance. We can see that external instance equals true. What this means is that we can now run our task on this external instance. Again, we can take the task we had and run it. This time, we're going to select the external launch type. The external launch type will match that instance that we've just added via ECS Anywhere. We click on Run. It's going to start running. Again, because it's the first time the container is running on this external instance, it's going to take a few seconds to download that image before it runs. One of the advantages of it being an EC2 instance is that it's got good network speed; when I was doing this on my machine here, it took a while, as my internet speed is not great. Give it a few seconds, and it's run. We can see that it's run here. We can check the logs and see that the script has run. We've now demonstrated that we've run our ETL script, the container code, on the remote machine. If I was in a remote office somewhere, it would have been running there rather than in AWS. So far, though, it's accessed the same database. If we look here, within the code, I've instrumented it so we can see that it's connecting to the database in RDS.
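For reference, what the console's Run Task button is doing here can also be done through the API. A hedged boto3 equivalent might look like the following; the cluster name follows the demo, while the task definition name and region are assumptions.

```python
# Illustrative boto3 call mirroring "Run Task" with the external launch type,
# which targets the ECS Anywhere managed instance rather than the in-cloud EC2 node.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")  # region is an assumption

response = ecs.run_task(
    cluster="qcon-hybrid-airflow",
    taskDefinition="apache-airflow",  # family name; with no revision given, the latest is used
    launchType="EXTERNAL",            # use "EC2" to run on the in-cloud container instance instead
    count=1,
)
print(response["tasks"][0]["taskArn"])
```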

The next step is to change this so that we run against the local database we've got running. To do that, we're going to create a new task definition revision. I click on my task definition and click on New Revision. I can then configure the database I'm going to connect to by changing this value here. If you recall, at the beginning I defined the database environments that I want to connect to via some parameters I store in Secrets Manager. If I go back here, this is the current one, RDS. If I select this one here, it points to and contains the information to connect to the local database. I can go back to the ECS cluster, select this one here, click on Update, and then click on Create. I now have a new revision of my task. If I click on Run Task, select external, and click on Run, this time it should not take very long because the container image has already been downloaded. I click on Stopped.

Has it run? Which one is it? It's this one here. We can check the logs, and we can see that it's connected to the local database and brought back this data. Now we've got ECS to execute this container on that remote instance, running against the local database. Then, if we look at what we're ultimately trying to do here, on our S3 bucket, we can see that we've got some data. If we look back at what we were configuring here in the task, I can't remember which location we were trying to put it into, but we can quickly find that out. It was period1/hq-data. If we look here, period1, we can see the timestamps match as well. This is the data that we've just got from that local machine. We've now demonstrated how we can containerize that ETL script and run it both in the cloud and remotely. It could have been in a remote branch office, in another cloud, or in our data center.

The next stage is to see how we can incorporate that into our workflow within Apache Airflow. To configure and provision an Apache Airflow environment, I am using an AWS service called Managed Workflows for Apache Airflow. To create the environment, again, I've used more infrastructure as code. This code is in the repository and uses another file to configure the key values via parameters; in this instance, that's the name and the location of the S3 bucket where we will store our DAGs, our workflow files. There are two stacks: one configures the VPC network, and the second creates the actual environment itself. The provisioning and creation of this environment takes about 30 minutes, so I've already pre-provisioned it. Again, I can look at what gets provisioned by looking at CloudFormation, so I can see all the different resources that get created. I can go to the Managed Workflows for Apache Airflow console, see my environment here, and then access the Apache Airflow console this way. By default, this code deploys a very simple workflow. I have my environment ready to go; I now need to work on the actual workflows themselves.

I have these within the repository as well, in the dags folder. I've got three that are of interest: one is called ecs-hybrid-ec2, one is called hybrid-external, and the other one is called hybrid-local. What these do is take the things that I've just shown you running through the ECS console, wrap them in an Apache Airflow operator called the ECSOperator, and then get the Apache Airflow worker to execute them on our behalf. Let's walk through this. We have our standard Apache Airflow imports. We're going to be using the ECSOperator, so we import that from the Amazon provider. We then create some default args. We define the name of our workflow. We're not going to put this on a schedule; we're just going to run it on demand. Then we have the parameters of the operator itself. We need to give it a task_id; in this instance, I've called it cloudquery. We need to define the ECS cluster that we want to run the task on. This is the name of the ECS cluster; it's called qcon-hybrid-airflow.

We then need to give it the name of a task definition. The task definition contains the actual task we want to run; it's this screen here. As you can see, it's called Apache Airflow. You can define just the name, and it will always use the latest revision. You'll notice that it has a number; every time you create a new revision, the number increments. If you have no number here, it will always use the latest. If you want to be specific, you can put a colon and a number to pin it to a specific version. We can then pass in overrides to the container; we'll see an example of that. Then here we define the launch type. The first time we ran the ETL script, we ran it with the EC2 launch type; then, when we ran it via the hybrid ECS Anywhere mode, we selected external. Here we're going to be running it on EC2, just so we can see it running. We then define the log group that this ECS cluster has been configured with, as well as the prefix for this particular task, so we can see the output of this task when it runs.
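Pulling those parameters together, a hedged sketch of that first DAG might look like the following. The cluster name and task_id follow the demo; the task definition name, log group, stream prefix, dates, and default args are illustrative assumptions rather than the repository's exact code.

```python
# Illustrative sketch of the EC2 launch type workflow described above.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import ECSOperator

default_args = {"owner": "airflow", "retries": 0}

with DAG("ecs-hybrid-ec2",
         default_args=default_args,
         start_date=datetime(2022, 1, 1),
         schedule_interval=None,  # run on demand rather than on a schedule
         catchup=False) as dag:

    cloudquery = ECSOperator(
        task_id="cloudquery",
        cluster="qcon-hybrid-airflow",
        task_definition="apache-airflow",            # no revision pinned, so the latest is used
        launch_type="EC2",                           # the external variant changes only this value
        overrides={},                                # run the task definition's command as-is
        awslogs_group="/ecs/hybrid-airflow",         # assumed log group name
        awslogs_stream_prefix="ecs/apache-airflow",  # assumed stream prefix
    )
```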

If I look at the external one, it actually looks exactly the same; the only difference is that the launch type is external. Both this one and the EC2 one are running the task as-is, and if we recall, that means running it against the RDS instance. What if we wanted to, for example, pass in new queries or connect to different databases? Here, we can use the overrides. We specify the same information as for the external one, and we override the command we want to pass in. Here, for example, I'm specifying the folder and the file I want to upload, I select a query, and I specify a different MySQL instance. Just to make this interesting, I can change that to a different name. We can then deploy these into our Apache Airflow environment. Typically, you would do this through a CI/CD pipeline; I'm going to just upload them into the folder here. I'm going to upload these QCon files, the local and external ones, and load them here. This is uploaded into the Apache Airflow dags folder, and we can see them here. It will take a few seconds before they actually appear in the Apache Airflow console. We'll give it a few seconds. That took a couple of minutes, and we can now see that the DAGs, the workflows, are in the Apache Airflow UI.
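As an aside on the override mechanism just described, and continuing the sketch above, the local variant might add a task like this; the container name, bucket, key, query, and secret name are all placeholders that follow the pattern in the demo rather than its exact values.

```python
# Illustrative container override for the "local" workflow: the same task
# definition, but the command is replaced so the query uses the local database
# secret, and the task runs on the ECS Anywhere managed instance.
localquery = ECSOperator(
    task_id="localquery",
    cluster="qcon-hybrid-airflow",
    task_definition="apache-airflow",
    launch_type="EXTERNAL",                      # run on the ECS Anywhere managed instance
    awslogs_group="/ecs/hybrid-airflow",         # assumed log group name
    awslogs_stream_prefix="ecs/apache-airflow",  # assumed stream prefix
    overrides={
        "containerOverrides": [{
            "name": "etl",                       # container name in the task definition (assumed)
            "command": [
                "my-data-lake-bucket",           # placeholder S3 bucket
                "period1/hq-data.csv",           # placeholder object key
                "SELECT * FROM customers",       # placeholder query
                "localmysql",                    # secret pointing at the local database (assumed name)
                "eu-west-1",                     # assumed region
            ],
        }],
    },
)
```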

If we open up this one and check the code, we can see that it's the same code that we saw in Visual Studio Code. Let's enable this and trigger it. What we should see, after a short while, is that Apache Airflow has handed this over to the scheduler, and the scheduler runs it on an Apache Airflow worker node. The task is being executed on the worker node. As we can see here, the task has completed. If we look at the logs, we should see the SQL results returned, as we can see here. As we can see from here, this was accessing the database in the cloud on RDS.

Let's now try to do the same thing with the local one. Here, we can look at the code. We can see that we're passing in new parameters; in this instance, we're specifying the local database. Let's trigger this one. We should see exactly the same thing: it gets passed over to the worker node, and the worker node passes it on to ECS. Because it's set to external, it's going to run on ECS Anywhere, on our fake external remote branch. We can see it's run. We can take a look at the logs and see that this time it's only brought back one record. We can see from the source IP that it is actually our remote instance, and it's accessing the database locally there. I've shown these as two separate workflows, but in reality, when you create your own hybrid workflows, you would use these in conjunction with other activities, other tasks, in order to create your data pipeline.

Security and Permissions

One thing I didn't cover in any depth during the demo was security and permissions and how these are configured. There are a number of important policies and roles that we need to create in order to make sure that we only provide access to the bare minimum of services when setting this up. If we start from the bottom of the diagram and the local environment, we have the ECS Anywhere agent. This needs a role and a set of permissions in order for it to access and interact with AWS to do things such as access the ECR repository, download containers, and send logging information to CloudWatch. This is called the ECS Anywhere task execution role, and it has a very narrowly scoped set of permissions for that task to run. Within the ECS cluster environment, we have a similar set of permissions, which is the task execution role. These are the permissions that the ECS cluster has for doing similar things: downloading container images, writing CloudWatch logs, and running tasks. Within Apache Airflow, the worker nodes have a role called the MWAA execution role. That contains a set of policies that allow it to interact with other AWS services, for example, to run and execute ECS tasks.

Then, finally, within the actual containerized application itself, we have the task permissions: within the code that we're running, what permissions does that code need in order for it to work? In this instance, we need to make sure that we've got access to RDS and access to S3, so we can copy the files. That is the same policy that's used both in the ECS cluster when it's running in the cloud and when it's running locally as an ECS Anywhere task. It's still the same policy; the difference is that when it's running on ECS Anywhere, the policy is being used locally, in your local or remote environment, and when it's running in the ECS cluster, it's being used in the cloud.
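As a rough illustration of how tightly scoped that task policy can be, it might look something like the following; the actions and resource ARNs are placeholders based on what the demo's script actually touches (the database secret and the data lake bucket), not the exact policy from the repository.

```python
# Illustrative task role policy document: read the database secret and write
# extracts to the data lake bucket. ARNs are placeholders.
import json

task_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": "arn:aws:secretsmanager:<region>:<account>:secret:<db-secret>*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::<data-lake-bucket>/*",
        },
    ],
}
print(json.dumps(task_policy, indent=2))
```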

Separation of Duties

To further illustrate this approach, consider how different sets of users within your environment need different levels of access to make a solution like this work. Typically, we've got two personas here. At the top, we've got the data engineer. What they care about is creating the code and orchestrating the data pipelines; they don't really want to know much about the plumbing underneath that. Then we've got the system administrator, maybe your DevOps team, who build the infrastructure via infrastructure as code, who manage secrets, and who provision everything so that the data engineers can do their work. Here we can see this solution in the context of the activities that each of those personas would do. The sysadmin would typically deploy ECS Anywhere locally and get that up and running. They would configure and make sure that your ECS cluster is up and running. They would configure and define the secrets. They would set up your RDS database. They would set up the permissions and policies, as well as creating and deploying the Apache Airflow environment.

Then the data engineers at the top would create the ETL scripts and containerize those, typically using your standard software development lifecycle approach. They would also create the DAGs, the workflows within Apache Airflow, and deploy those. Typically, most organizations will have a CI/CD system for both of those, both for creating the DAGs and for the ETL code, the output being a workflow DAG file that gets deployed onto Apache Airflow, or a container image that gets pushed into your container repository. Here, in this example, I'm showing S3 for Apache Airflow and Amazon ECR as the image repository. We can see how this enables a nice, clean separation of duties for both of those activities, so that as a data engineer, I can focus on just writing the scripts that I have and committing my code, and it will automatically deploy those into the right repositories. I can then use the ECSOperator to actually execute those. I don't have to provision any of that stuff, and I don't need to know things such as the secrets for the resources I'm accessing.

Benefits of Hybrid Orchestration

One of the things I like about using containers, and specifically the ECSOperator approach in conjunction with ECS Anywhere, is that it has a couple of strengths. First of all, implementing the solution is super simple. From an infrastructure perspective, there are no complex firewalls to manage; it's a simple local install of the ECS agent, which integrates that system into the ECS control plane and also into AWS Systems Manager. From a development perspective, you can reuse your existing development patterns to create and manage your container images. From a management perspective, you can manage everything from a single pane of glass, that is, the ECS control plane. It's easy to separate the duties, as I've just shown: you've got a clear separation between what the data engineers are doing and what your system admins, and potentially the people who manage your secrets, are doing. Your data engineers have standard operators to work with; they don't have to learn or create new operators. The ECSOperator is a standard operator that is part of the Amazon provider package.

Demo Resources

The code that I used in the demo can be found at https://github.com/094459/blogpost-airflow-hybrid. There's also a link to a blog post, which will walk you through building the whole thing, if you are interested in doing that.

Questions and Answers

Polak: What's the conversation like around hybrid?

Sueiras: I think the conversation really is that customers are moving to the cloud, but they've still got a significant amount of assets, data and applications, in their current environments. These are important applications; you can't just leave them. Quite often, they're still running important business processes. Other customers are in industrial segments, with remote offices that have very little local IT skill. What they're telling me, and why this is a nice solution for them, is that they don't have the expertise to run a remote Kubernetes cluster. They want a really simple solution that they can literally just install, run, and then leave. I think that's the beauty of these kinds of solutions.

Polak: I remember when the conversation about hybrid just started, it was always confusing: do you mean hybrid as in having some of the workloads on-prem and some in the cloud, or hybrid as in across different clouds?

Sueiras: It could absolutely be both. I think it depends on what the customer is trying to do. Sometimes I have had customers who are moving from one cloud to another, so they've got data assets in one cloud, and they want to access them and incorporate them into their data pipelines. I think the key thing, though, is that we've always been able to do this with Apache Airflow, as long as you've got network connectivity, but network connectivity comes with some complexity. I think solutions like these make it super simple, because effectively, it's just an outbound connection from the agent. There are no VPNs required, and it's all encrypted, so it's secure. I think that's the key. The reason why I think we'll still see hybrid is that, especially in Europe, there are lots of regulatory and compliance use cases where you can't do the processing in the cloud using your Apache Airflow operators. You've got to do that locally. That's where solutions like these can help you out.

Polak: How does Apache Airflow compare to AWS Glue?

Sueiras: A better comparison is Apache Airflow and Step Functions; they do a very similar thing, they're orchestrators. I'm a little bit familiar with AWS Glue, but that is used more for the ETL side of things. The stuff that I was doing in the container is typically what you would do with AWS Glue. You can, of course, orchestrate AWS Glue with Apache Airflow, which is a very common pattern.

 


 

Recorded at:

Mar 24, 2023
