The Future of Data Engineering

Feb 16, 2021 31 min read

Key Takeaways

The job of data engineers is to help an organization move and process data but not to do that themselves.
Data architecture follows a predictable evolutionary path from monolith to automated, decentralized, self-serve data microwarehouses.
Integration with connectors is a key step in this evolution but increases workload and requires automation to correct for that.
Increasing regulatory scrutiny and privacy requirements will drive the need for automated data management.
A lack of polished, non-technical tooling is currently preventing data ecosystems from achieving full decentralization.

"The future of data engineering" is a fancy title for presenting stages of data pipeline maturity and building out a sample architecture as I progress, until I land on a modern data architecture and data pipeline. I will also hint at where things are headed for the next couple of years.

It's important to know me and my perspective when I'm predicting the future, so that you can weigh my views against your own and act accordingly.

I work at WePay, which is a payment-processing company. JPMorgan Chase acquired the company a few years ago. I work on data infrastructure and data engineering. Our stack is Airflow, Kafka, and BigQuery, for the most part. Airflow is, of course, a job scheduler that kicks off jobs and does workflow things. BigQuery is a data warehouse hosted by Google Cloud. I make some references to Google Cloud services here, and you can definitely swap them with the corresponding AWS or Azure services.

We, at WePay, use Kafka a lot. I spent about seven years at LinkedIn, the birthplace of Kafka, which is a pub/sub, write-ahead log. Kafka has become the backbone of a log-based architecture. At LinkedIn, I spent a bunch of time doing everything from data science to service infrastructure, and so on. I also wrote Apache Samza, which is a stream-processing system, and helped build out their Hadoop ecosystem. Before that, I spent time as a data scientist at PayPal.

There are many definitions for "data engineering". I've seen people use it when talking about business analytics and in the context of data science. I'm going to throw down my definition: a data engineer's job is to help an organization move and process data. To move data means streaming pipelines or data pipelines; to process data means data warehouses and stream processing. Usually, we're focused on asynchronous, batch or streaming stuff as opposed to synchronous real-time things.

I want to call out the key word here: "help". Data engineers are not supposed to be moving and processing the data themselves but are supposed to be helping the organization do that.

Maxime Beauchemin is a prolific engineer who started out, I think, at Yahoo and passed through Facebook, Airbnb, and Lyft. Over the course of his adventures, he wrote Airflow, which is the job scheduler that we and a bunch of other companies use. He also wrote Superset. In his "The Rise of the Data Engineer" blog post a few years ago, Beauchemin said that "... data engineers build tools, infrastructure, frameworks, and services." This is how we go about helping the organization to move and process the data.

The reason that I put this presentation together was a 2019 blog post from a company called Ada in which they talk about their journey to set up a data warehouse. They had a MongoDB database and were starting to run up against its limits when it came to reporting and some ad hoc query things. Eventually, they landed on Apache Airflow and Redshift, which is AWS's data-warehousing solution.

What struck me about the Ada post was how much it looked like a post that I’d written about three years earlier. When I landed at WePay, they didn't have much of a data warehouse and so we went through almost the exact same exercise that Ada did. We eventually landed on Airflow and BigQuery, which is Google Cloud's version of Redshift. The Ada post and mine are almost identical, from the diagrams to even the structure and sections of the post.

This was something we had done a few years earlier and so I threw down the gauntlet on Twitter and predicted Ada’s future. I claimed to know how they would progress as they continued to build out their data warehouse: one step would be to go from batch to a real-time pipeline, and the next step would be to a fully self-serve or automated pipeline.

I'm not trying to pick on Ada. I think it's a perfectly reasonable solution. I just think that there's a natural evolution of a data pipeline and a data warehouse and the modern data ecosystem and that's really what I want to cover here.

Figure 1: Getting cute with land/expand/on demand for pipeline evolution

I refined this idea with the tweet in Figure 1. The idea was that we initially land with nothing, so we need to set up a data warehouse quickly. Then we expand as we do more integrations, and maybe we go to real-time because we've got Kafka in our ecosystem. Finally, we move to automation for on-demand stuff. That eventually led to my "The Future of Data Engineering" post in which I discussed four future trends.

The first trend is timeliness, going from this batch-based periodic architecture to a more real-time architecture. The second is connectivity; once we go down the timeliness route, we start doing more integration with other systems. The last two tie together: automation and decentralization. On the automation front, I think we need to start thinking about automating not just our operations but our data management. The final trend is decentralizing the data warehouse.

Figure 2: The six stages of data-pipeline maturity

I designed a hierarchy of data-pipeline progression. Organizations go through this evolution in sequence.

The reason I created this path is that everyone's future is different because everyone is at a different point in their life cycle. The future at Ada looks very different than the future at WePay because WePay may be farther along on some dimensions - and then there are companies that are even farther along than WePay.

These stages let you find your current starting point and build your own roadmap from there.

Stage 0: None

You're probably at this stage if you have no data warehouse. You probably have a monolithic architecture. You're maybe a smaller company and you need a warehouse up and running now. You probably don't have too many data engineers and so you're doing this on the side.

Figure 3: Stage 0 of data-pipeline maturity

Stage 0 looks like Figure 3, with a lovely monolith and a database. You take a user and you attach it to the database. This sounds crazy to people that have been in the data-warehouse world for a while but it's a viable solution when you need to get things quickly up and running. The data appears to the user as basically real-time data because you're reading directly from the database. It's easy and cheap.

This is where WePay was when I landed there in 2014. We had a PHP monolith and a monolithic MySQL database. The users we had, though, weren't happy and things were starting to tip over. We had queries timing out. We had users impacting each other - most OLTP systems that you're going to be using do not have strong isolation and multi-tenancy so users can really get in each other's way. Because we were using MySQL, we were missing some of the fancier analytic SQL stuff that our data-science and business-analytics people wanted and report generation was starting to break. It was a pretty normal story.

Stage 1: Batch

We started down the batch path, and this is where the Ada post and my earlier post come in.

Going into Stage 1, you probably have a monolithic architecture. You might be starting to lean away from that but usually it works best when you have relatively few sources. Data engineering is now probably your part-time job. Queries are timing out because you're exceeding the database capacity, whether in space, memory, or CPU.

The lack of complex analytical SQL functions is becoming an issue for your organization as people need those for customer-facing or internal reports. People are asking for charts, business intelligence, and all that kind of fun stuff.

Figure 4: Stage 1 is the classic batch-based approach to data

This is where the classic batch-based approach comes in. Between the database and the user, you stuff a data warehouse that can accomplish a lot more OLAP and fulfill analytic needs. To get data from the database into that data warehouse, you have a scheduler that periodically wakes up to suck in the data.

That's where WePay was at about a year after I joined. This architecture is fantastic in terms of tradeoffs. You can get the pipeline up pretty quickly these days - when I did it in 2016, it took a couple of weeks. Our data latency was about 15 minutes, so we did incremental partition loads, taking little chunks of data and loading them in. We were running a few hundred tables. This is a nice place to start if you're trying to get something up and running but, of course, you outgrow it.

The number of Airflow workflows that we had went from a few hundred to a few thousand. We started running tens or hundreds of thousands of tasks per day, and that became an operational issue because of the probability that some of those are not going to work. We also discovered - and this is not intuitive for people who haven't run complex data pipelines - that the incremental or batch-based approach requires imposing dependencies or requirements on the schemas of the data that you're loading. We had issues with create_time and modify_time and ORMs doing things in different ways and it got a little complicated for us.

DBAs were impacting our workload; they could do something that hurt the replica that we were reading from and cause latency issues, which in turn could cause us to miss data. Hard deletes weren't propagating - and this is a big problem if you have people who delete data from your database. Removing a row or a table or whatever can cause problems with batch loads because you just don't know when the data disappears. Also, MySQL replication latency was affecting our data quality and periodic loads would cause occasional MySQL timeouts on our workflow.
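A minimal sketch of why hard deletes slip past an incremental batch load. It assumes a loader that copies only rows whose modify_time is newer than the last watermark; the function name, the row layout, and the in-memory dicts standing in for MySQL and the warehouse are all hypothetical, not WePay's actual pipeline.

```python
from datetime import datetime, timedelta


def incremental_load(source_rows, warehouse, watermark):
    """Copy rows modified after `watermark` into `warehouse`; return the new watermark."""
    new_watermark = watermark
    for row in source_rows:
        if row["modify_time"] > watermark:
            warehouse[row["id"]] = dict(row)
            new_watermark = max(new_watermark, row["modify_time"])
    return new_watermark


t0 = datetime(2016, 1, 1)
# Hypothetical source table, keyed by primary key.
source = {
    1: {"id": 1, "amount": 10, "modify_time": t0},
    2: {"id": 2, "amount": 20, "modify_time": t0},
}
warehouse = {}

# First run: everything is newer than the initial watermark, so both rows load.
watermark = incremental_load(source.values(), warehouse, datetime.min)

# Between runs: row 1 is updated (modify_time bumped) and row 2 is hard-deleted.
source[1] = {"id": 1, "amount": 15, "modify_time": t0 + timedelta(minutes=15)}
del source[2]

# Second run: the update is picked up, but nothing tells the loader row 2 is gone.
watermark = incremental_load(source.values(), warehouse, watermark)

print(warehouse[1]["amount"])
print(2 in warehouse)
```

The loader depends on every table carrying a trustworthy modify_time, which is the kind of schema requirement mentioned above, and the deleted row stays in the warehouse because a delete leaves no newer row to scan.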

Stage 2: Realtime

This is where real-time data processing kicks off. This approaches the cusp of the modern era of real-time data architecture and it deserves a closer look than the first two stages.

You might be ready for Stage 2 if your load times are taking too long. You've got pipelines that are no longer stable, whether because workflows are failing or your RDBMS is having trouble serving the data. You've got complicated workflows and data latency is becoming a bigger problem: maybe the 15-minute jobs you started with in 2014 are now taking an hour or a day, and the people using them aren't happy about it. Data engineering is probably your full-time job now.

Your ecosystem might have something like Apache Kafka floating around. Maybe the operations folks have spun it up to do log aggregation and run some operational metrics over it; maybe some web services are communicating via Kafka to do some queuing or asynchronous processing.

Figure 5: Stage 2 of data-pipeline maturity

From a data-pipeline perspective, this is the time to get rid of that batch processor for ETL purposes and replace it with a streaming platform. That's what WePay did. We changed our ETL pipeline from Airflow to Debezium and a few other systems, so it started to look like Figure 6.

Figure 6: WePay’s data architecture in 2017

The hatched Airflow box now contains five boxes, and we're talking about many machines so the operational complexity has gone up. In exchange, we get a real-time pipeline.

Kafka is a write-ahead log that we can send messages to (they get appended to the end of the log) and we can have consumers reading from various locations in that log. It's a sequential read and sequential write kind of thing.
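The append-and-read-by-offset idea can be sketched with a toy in-memory log. This is only an illustration of the concept, not Kafka's API; the Log class and the consumer names are made up for the example.

```python
class Log:
    """A toy append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append to the end of the log and return the message's offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Read sequentially, starting at `offset`."""
        return self._messages[offset:offset + max_messages]


log = Log()
for event in ["a", "b", "c"]:
    log.append(event)

# Two independent consumers sitting at different positions in the same log.
offsets = {"warehouse_loader": 0, "search_indexer": 2}
batches = {name: log.read(offset) for name, offset in offsets.items()}
print(batches)
```

Because the log itself never changes once written, a slow consumer and a fast consumer can read the same data without coordinating with each other.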

We use it with the upstream connectors. Kafka has a component called Kafka Connect. We heavily use Debezium, a change-data-capture (CDC) connector that reads data from MySQL in real time and funnels it in real time into Kafka.

CDC is essentially a way to replicate data from one data source to others. Wikipedia’s fancy definition of CDC is "… the identification, capture, and delivery of the changes made to the enterprise data sources." A concrete example is what something like Debezium will do with a MySQL database. When I insert a row, update that row, and later delete that row, the CDC feed will give me three different events: an insert, the update, and the delete. In some cases, it will also provide the before and the after states of that row. As you can imagine, this can be useful if you're building out a data warehouse.
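The insert, update, delete sequence can be sketched as follows. The event shape here (an "op" field plus "before" and "after" row states) is a simplified, hypothetical format chosen for illustration; it is not Debezium's actual message envelope.

```python
def apply_change(table, event):
    """Apply one change event to `table`, a dict keyed by primary key."""
    op = event["op"]
    if op in ("insert", "update"):
        row = event["after"]
        table[row["id"]] = row
    elif op == "delete":
        table.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op}")


# One row's life: inserted, updated, then deleted.
events = [
    {"op": "insert", "before": None, "after": {"id": 7, "status": "new"}},
    {"op": "update", "before": {"id": 7, "status": "new"},
     "after": {"id": 7, "status": "paid"}},
    {"op": "delete", "before": {"id": 7, "status": "paid"}, "after": None},
]

replica = {}
history = []  # snapshot of the replica after each event
for event in events:
    apply_change(replica, event)
    history.append(dict(replica))

print(history)
```

Unlike a periodic batch load, the delete arrives as an explicit event, so the downstream copy can remove the row too.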

Figure 7: Debezium sources

Debezium can use a bunch of sources. We use MySQL, as I mentioned. One of the things in that Ada post that caught my eye was the fact that they were using MongoDB - sure enough, Debezium has a MongoDB connector. We contributed a Cassandra connector to Debezium a couple of months ago. It's incubating and we're still getting up off the ground with it ourselves but that's something that we're going to be using heavily in the near future.

Last but not least in our architecture, we have KCBQ, which stands for Kafka Connect BigQuery (I do not name things creatively). This connector takes data from Kafka and loads it into BigQuery. The cool thing about this, though, is that it leverages BigQuery’s real-time streaming insert API. One of the cool things about BigQuery is that you can use its RESTful API to post data into the data warehouse in real time and it's visible almost immediately. That gives us a latency from our production database to our data warehouse of a couple of seconds.

This pattern opens up a lot of use cases. It lets you do real-time metrics and business intelligence off of your data warehouse. It also allows you to debug, which is not immediately obvious - if your engineers need to see the state of their database in production right now, being able to go to the data warehouse to expose that to them so that they can figure out what's going on with their system with essentially a real-time view is pretty handy.

You can also do some fancy monitoring with it. You can impose assertions about what the shape of the data in the database should look like so that you can be satisfied that the data warehouse and the underlying web service itself are healthy.
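A rough sketch of what such assertions could look like, assuming the rows have already been fetched from the warehouse. The table, column names, and rules are hypothetical.

```python
def check_rows(rows, assertions):
    """Return the names of the assertions that at least one row violates."""
    failed = []
    for name, predicate in assertions.items():
        if not all(predicate(row) for row in rows):
            failed.append(name)
    return failed


# Hypothetical payments rows as they appear in the warehouse.
rows = [
    {"id": 1, "amount_cents": 1250, "currency": "USD"},
    {"id": 2, "amount_cents": -300, "currency": "USD"},
]

assertions = {
    "amount_is_non_negative": lambda r: r["amount_cents"] >= 0,
    "currency_is_three_letters": lambda r: len(r["currency"]) == 3,
}

failed = check_rows(rows, assertions)
print(failed)
```

With a latency of a few seconds, a failing assertion in the warehouse is a fairly prompt signal that something is off in the upstream service.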

Figure 8: Problems at WePay in the migration to Stage 2

Figure 8 shows some of the inevitable problems we encountered in this migration. Not all of our connectors were on this pipeline, so we found ourselves between the new cool stuff and the older painful stuff.

Datastore is a Google Cloud system that we were using; that was still Airflow-based. Cassandra didn't have a connector and neither did Bigtable, which is a Google Cloud equivalent of HBase. We had BigQuery but BigQuery needed more than just our primary OLTP data; it needed logging and metrics. We had Elasticsearch and this fancy graph database (which we're going to be open-sourcing soon) that also needed data.

The ecosystem was looking more complicated. We're no longer talking about this little monolithic database but about something like Figure 9, which comes from Confluent and is pretty accurate.

Figure 9: The data ecosystem is no longer a monolith

You have to figure out how to manage some of this operational pain. One of the first things you can do is to start integration so that you have fewer systems to deal with. We used Kafka for that.

Stage 3: Integration

If you think back 20 years to enterprise-service-bus architectures, that's really all data integration is. The only difference is that streaming platforms like Kafka along with the evolution in stream processing have made this viable.

You might be ready for data integration if you've got a lot of microservices. You have a diverse set of databases as Figure 8 depicts. You've got some specialized, derived data systems; I mentioned a graph database but you may have special caches or a real-time OLAP system. You've got a team of data engineers now, people who are responsible for managing this complex workload. Hopefully, you have a happy, mature SRE organization that's more than willing to take on all these connectors for you.

Figure 10: Stage 3 of data-pipeline maturity

Figure 10 shows what data integration looks like. We still have the base data pipeline that we've had so far. We've got a service with a database, we've got our streaming platform, and we've got our data warehouse, but now we also have web services, maybe a NoSQL thing, or a NewSQL thing. We've got a graph database and search system plugged in.

Figure 11: WePay’s data ecosystem at the start of 2019

Figure 11 depicts where WePay was at the beginning of 2019. Things were becoming more complicated. Debezium connects not only to MySQL but to Cassandra as well, with the connector that we'd been working on. At the bottom is Kafka Connect Waltz (KCW). Waltz is a ledger that we built in house that's Kafka-ish in some ways and more like a database in other ways, but it services our ledger use cases and needs. We are a payment-processing system so we care a lot about data transactionality and multi-region availability and so we use a quorum-based write-ahead log to handle serializable transactions. On the downstream side, we've got a bunch of stuff going on.

We were incurring a lot of pain and had many boxes on our diagram. This was getting more and more complicated. The reason we took on this complexity has to do with Metcalfe's law. I'm going to paraphrase the definition and probably corrupt it: it essentially states that the value of a network increases as you add nodes and connections to it. Metcalfe's law was initially intended to apply to communication devices, like adding more peripherals to an Ethernet network.
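As a rough illustration, Metcalfe's law is usually stated as value growing with the square of the number of nodes, because the number of possible pairwise connections C(n) among n nodes grows that way. The symbols C and V below are just labels for this sketch.

```latex
\[
  C(n) = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad V(n) \propto n^{2}
\]
```

With n = 5 systems there are 10 possible pairings; with n = 10 there are 45.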

So, we're getting to a network effect in our data ecosystem. In a post in early 2019, I thought through the implications of Kafka as an escape hatch. You add more systems to the Kafka bus, all of which are able to load their data in, expose it to other systems, and slurp up the data of other systems in Kafka, and you leverage this network effect in your data ecosystem.

We found this to be a powerful architecture because the data becomes portable. I'm not saying it'll let you avoid vendor lock-in but it will at least ameliorate some of those concerns. Porting data is usually the harder part to deal with when you're moving between systems. The idea is that it becomes theoretically possible, if you're on Splunk for example, to plug in Elasticsearch alongside it to test it out - and the cost to do so is certainly lower.

Data portability also helps with multi-cloud strategy. If you need to run multiple clouds because you need high availability or you want to pick cloud vendors to save money, you can use Kafka and the Kafka bus to move the data around.

Lastly, I think it leads to infrastructure agility. I alluded to this with my Elasticsearch example but if you come across some new hot real-time OLAP system that you want to check out or some new cache that you want to plug in, having your data already in your streaming platform in Kafka means that all you need to do is turn on the new thing and plug in a sink to load the data. It drastically lowers the cost of testing new things and supporting specialized infrastructure. You can easily plug in things that do one or two things really well, when before you might have had to decide between tradeoffs like supporting a specialized graph database or using an RDBMS which happens to have joins. By reducing the cost of specialization, you can build a more granular infrastructure to handle your queries.

The problems in Stage 3 look a little different. When WePay bought into this integration architecture, we found ourselves still spending a lot of time on fairly manual tasks like those in Figure 12.

Figure 12: WePay’s problems in Stage 3

In short, we were spending a lot of time administering the systems around the streaming platform - the connectors, the upstream databases, the downstream data warehouses - and our ticket load looked like Figure 13.

Figure 13: Ticket load at WePay in Stage 3

Fans of JIRA might recognize Figure 13. It is a screenshot of our support load in JIRA in 2019. It starts relatively low, then it skyrockets, and it never fully recovers, although there's a nice trend late in the year that relates to the next step of our evolution.

Stage 4: Automation

We started investing in automation. This is something you've got to do when your system gets this big. I think most people would say we should have been automating all along.

You might be ready for Stage 4 if your SREs can't keep up, you're spending a lot of time on manual toil, and you don't have time for the fun stuff.

Figure 14: Stage 4 adds two new layers to the data ecosystem

Figure 14 shows the two new layers that appear in Stage 4. The first is the automation of operations, and this won’t surprise most people. It's the DevOps stuff that has been going on for a long time. The second layer, data-management automation, is not quite as obvious.

Let’s first cover automation for operations. Google’s Site Reliability Engineering handbook defines toil as manual, repeatable, automatable stuff. It's usually interrupt-driven: you're getting Slack messages or tickets or people are showing up at your desk asking you to do things. That is not what you want to be doing. The Google book says, "If a human operator needs to touch your system during normal operations, you have a bug."

But the "normal operations" of data engineering were what we were spending our time on. Anytime you're managing a pipeline, you're going to be adding new topics, adding new data sets, setting up views, and granting access. This stuff needs to get automated. Great news! There's a bunch of solutions for this: Terraform, Ansible, and so on. We at WePay use Terraform and Ansible but you can substitute any similar product.

Figure 15: Managing a Kafka topic in Terraform - a systemd_logs topic that uses compaction (which is an exciting policy to use with your systemd logs)

Figure 16: Managing your Kafka Connect connectors in Terraform

You can use Terraform to manage your Kafka topics and connectors. Figures 15 and 16 show some Terraform automations. Not terribly surprising.
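The idea underneath this kind of tooling is declarative: you describe the state you want, the tool diffs it against what exists, and then applies the difference. The sketch below shows only that diffing step in plain Python; it is not Terraform or any Kafka client, and the topic names and settings are hypothetical.

```python
def plan(desired, actual):
    """Diff desired topic configs against actual ones and list the changes needed."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "delete": sorted(actual.keys() - desired.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
    }


# Desired state, as it might be declared in version-controlled config.
desired = {
    "payments": {"partitions": 12, "cleanup_policy": "delete"},
    "systemd_logs": {"partitions": 6, "cleanup_policy": "compact"},
}
# Actual state, as it might be reported by the cluster.
actual = {
    "payments": {"partitions": 6, "cleanup_policy": "delete"},
    "old_topic": {"partitions": 1, "cleanup_policy": "delete"},
}

changes = plan(desired, actual)
print(changes)
```

Because the desired state lives in a file, adding a topic becomes a reviewed change rather than a ticket for someone to run a command by hand.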

Yes, we should have been doing this, but we kind of were doing this already. We had Terraform, we had Ansible for a long time - we had a bunch of operational tooling. We were fancy and on the cloud. We had a bunch of scripts to manage BigQuery and automate a lot of our toil like creating views in BigQuery, creating data sets, and so on. So why did we have such a high ticket load?

The answer is that we were spending a lot of time on data management. We were answering questions like "Who's going to get access to this data once I load it?", "Security, is it okay to persist this data indefinitely or do we need to have a three-year truncation policy?", and "Is this data even allowed in the system?" As a payment processor, WePay deals with sensitive information and our people need to follow geography and security policies and other stuff like that.

We have a fairly robust compliance arm that's part of JPMorgan Chase. Because we deal with credit-card data, we have PCI audits. Regulation is here and we really need to think about this. Europe has GDPR. California has CCPA. PCI applies to credit-card data. HIPAA applies to health data. SOX applies if you're a public company. New York has SHIELD. This is going to become more and more of a theme, so get used to it. We have to get better at automating this stuff or else our lives as data engineers are going to be spent chasing people to make sure this stuff is compliant.

I want to discuss what that might look like. As I get into the futuristic stuff, I get more vague or hand-wavy, but I'm trying to keep it as concrete as I can.

The first thing you want to do for automated data management is probably to set up a data catalog. You probably want it centralized, i.e., you want to have one with all the metadata. The data catalog will have the locations of your data, what schemas that data has, who owns the data, and lineage, which is essentially the source and path of the data.

The lineage for my initial example is that it came from MySQL, it went to Kafka, and then it got loaded into BigQuery - that whole pipeline. Lineage can even track encryption or versioning, so you know what things are encrypted and what things are versioned as the schemas evolved.
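A minimal sketch of what a catalog entry and a lineage lookup might look like, assuming each entry records its direct upstream source. The CatalogEntry fields, dataset names, and owners are hypothetical and not taken from any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    location: str
    schema: dict  # column name -> type
    owner: str
    upstream: list = field(default_factory=list)  # names of direct sources


def lineage(catalog, name):
    """Walk upstream links and return the path from the original source to `name`."""
    path = [name]
    entry = catalog[name]
    while entry.upstream:
        parent = entry.upstream[0]  # this sketch follows a single upstream chain
        path.append(parent)
        entry = catalog[parent]
    return list(reversed(path))


columns = {"id": "INT64", "amount_cents": "INT64"}
entries = [
    CatalogEntry("mysql.payments", "mysql://prod/payments", columns, "payments-team"),
    CatalogEntry("kafka.payments", "kafka://bus/payments", columns,
                 "data-eng", upstream=["mysql.payments"]),
    CatalogEntry("bigquery.payments", "bq://warehouse/payments", columns,
                 "data-eng", upstream=["kafka.payments"]),
]
catalog = {entry.name: entry for entry in entries}

path = lineage(catalog, "bigquery.payments")
print(" -> ".join(path))
```

A real catalog would handle multiple upstream sources and schema versions; the point here is only that lineage falls out of each system reporting where its data came from.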

There's a bunch of activity in this area. Amundsen is a data catalog from Lyft. You have Apache Atlas. LinkedIn open-sourced DataHub in 2020. WeWork has a system called Marquez. Google has a product called Data Catalog. I know I'm missing more.

These tools generally do more than one thing, but I want to show a concrete example. I yanked Figure 17 from the Amundsen blog. It has fake data, the schema, the field types, the data types, everything. At the right, it has who owns the data - and notice that Add button there.

Figure 17: An example of Amundsen

It tells us what source code generated the data - in this case, it's Airflow, as indicated by that little pinwheel - and some lineage. It even has a little preview. It's a pretty nice UI.

Underneath it, of course, is a repository that actually houses all this information. That's really useful because you need to get all your systems to be talking to this data catalog.

That Add button in the Owned By section is important. You don't as a data engineer want to be entering that data yourself. You do not want to return to the land of manual data stewards and data management. Instead, you want to be hooking up all these systems to your data catalog so that they're automatically reporting stuff about the schema, about the evolution of the schema, about the ownership when the data is loaded from one to the next.

Figure 18: Your data ecosystem needs to talk to your data catalog

First off, you need your systems like Airflow and BigQuery, your data warehouses and stuff, to talk to the data catalog. I think there's quite a bit of movement there.

You then need your data-pipeline streaming platforms to talk to the data catalog. I haven't seen as much yet for that. There may be stuff coming out that will integrate better, but right now I think that's something you've got to do on your own.

I don't think we've done a really good job of bridging the gap on the service side. You want your service stuff in the data catalog as well: things like gRPC protobufs, JSON schemas, and even the schemas of the databases behind those services.

Once you know where all your data is, the next step is to configure access to it. If you haven't automated this, you're probably going to Security, Compliance, or whoever the policymaker is and asking if this individual can see this data whenever they make access requests - and that's not where you want to be. You want to be able to automate the access-request management so that you can be as hands off with it as possible.

This is kind of an alphabet soup: role-based access control (RBAC), identity and access management (IAM), and access-control lists (ACLs). Access control is just a bunch of fancy words for a bunch of different features for managing groups, user access, and so on. You need three things to do this: you need your systems to support it, you need to provide tooling to policymakers so they can configure the policies appropriately, and you need to automate the management of the policies once the policymakers have defined them.
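To make the RBAC part concrete, here is a minimal sketch in which users get roles and roles get permissions. The role names, users, and resource are hypothetical; real systems such as Airflow or Kafka have their own models and configuration.

```python
# Policymakers define what each role may do ...
ROLE_PERMISSIONS = {
    "analyst": {("payments_summary", "read")},
    "data_engineer": {("payments_summary", "read"), ("payments_summary", "write")},
}
# ... and which roles each user holds.
USER_ROLES = {"alice": {"analyst"}, "bob": {"data_engineer"}}


def is_allowed(user, resource, action):
    """A user may act if any of their roles grants (resource, action)."""
    return any(
        (resource, action) in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(is_allowed("alice", "payments_summary", "read"))
print(is_allowed("alice", "payments_summary", "write"))
```

Keeping the policy as data separates the two jobs: policymakers edit the tables, and automation pushes them to each system.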

There has been a fair amount of work done to support this aspect. Airflow has RBAC, which was a patch WePay submitted. Airflow has taken this seriously and has added a lot more, like DAG-level access control. Kafka has had ACLs for quite a while.

Figure 19: Managing Kafka ACLs with Terraform

You can use tools to automate this stuff. We want to automate adding a new user to the system and configuring their access. We want to automate the configuration of access controls when a new piece of data is added to the system. We want to automate service-account access as new web services come online.

There's occasionally a need to grant someone temporary access to something. You don't want to have to set a calendar reminder to revoke the access for this user in three weeks. You want that to be automated. The same goes for unused access. You want to know when users aren't using all the permissions that they're granted so that you can strip those unused permissions to limit the vulnerability of the space.
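A sketch of how expiring and unused grants might be found automatically, assuming each grant records an optional expiry and a last-used timestamp. The record layout, the user names, and the 90-day threshold are assumptions made for the example.

```python
from datetime import datetime, timedelta


def grants_to_revoke(grants, now, unused_after=timedelta(days=90)):
    """Return (user, resource) pairs whose grant has expired or gone unused."""
    revoke = []
    for grant in grants:
        expired = grant["expires_at"] is not None and grant["expires_at"] <= now
        unused = grant["last_used"] is None or now - grant["last_used"] > unused_after
        if expired or unused:
            revoke.append((grant["user"], grant["resource"]))
    return revoke


now = datetime(2021, 2, 16)
grants = [
    # Temporary grant whose expiry has passed.
    {"user": "carol", "resource": "ledger",
     "expires_at": now - timedelta(days=1), "last_used": now - timedelta(days=2)},
    # Permanent grant in active use.
    {"user": "dave", "resource": "ledger",
     "expires_at": None, "last_used": now - timedelta(days=3)},
    # Permanent grant nobody has touched in half a year.
    {"user": "erin", "resource": "reports",
     "expires_at": None, "last_used": now - timedelta(days=180)},
]

to_revoke = grants_to_revoke(grants, now)
print(to_revoke)
```

Run on a schedule, a check like this replaces the calendar reminder: the expiry is recorded when access is granted and enforced without anyone remembering it.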

Now that your data catalog tells you where all the data is and you have policies set up, you need to detect violations. I mostly want to discuss data loss prevention (DLP) but there's also auditing, which is keeping track of logs and making sure that the activities and systems are conforming to the required policies.

I'm going to talk about Google Cloud Platform because I use it and I have some experience with its data-loss-prevention solution. There's a corresponding AWS product called Macie. There's also an open-source project called Apache Ranger, with a bit of an enforcement and monitoring mechanism built into it; that's more focused on the Hadoop ecosystem. What all these things have in common is that you can use them to detect the presence of sensitive data where it shouldn't be.

Figure 20: Detecting sensitive data

Figure 20 is an example. A piece of submitted text contains a phone number, and the system sends a result that says it is "very likely" that it has detected an infoType of phone number. You can use this stuff to monitor your policies. For example, you can run DLP checks on a data set that is supposed to be clean - i.e., not have any sensitive information in it - and if a check finds anything like a phone number, Social Security number, credit card, or other sensitive information, it can immediately alert you that there's a violation in place.
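The monitoring idea can be sketched with a toy scanner. This is not the Google Cloud DLP API or Macie; it uses two deliberately simple, US-centric regular expressions, and the labels, column names, and sample rows are hypothetical.

```python
import re

# Deliberately simple patterns; real DLP services use far richer detection.
PATTERNS = {
    "phone_number": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "social_security_number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan(rows):
    """Return (row_index, column, info_type) for every suspected match."""
    findings = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            for info_type, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    findings.append((index, column, info_type))
    return findings


# A data set that is supposed to be clean.
rows = [
    {"id": 1, "note": "refund issued"},
    {"id": 2, "note": "call me at 555-253-0000"},
]

findings = scan(rows)
if findings:
    print("policy violation:", findings)
```

Any non-empty result on a data set that is supposed to be clean is the signal to alert on.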

There’s a little bit of progress here. Users can use the data catalog and find the data that they need, we have some automation in place, and maybe we're using Terraform to manage ACLs for Kafka or to manage RBAC in Airflow. But there's still a problem and that is that data engineering is probably still responsible for managing that configuration and those deployments. The reason for that is mostly the interface. We're still getting pull requests, Terraform, DSL, YAML, JSON, Kubernetes ... it's nitty-gritty.

It might be a tall order to ask security teams to make changes to that. Asking your compliance wing to make changes is an even taller order. Going beyond your compliance people is basically impossible.

Stage 5: Decentralization

You're probably ready to decentralize your data pipeline and your data warehouses if you have a fully automated real-time data pipeline but people are still coming to ask you to load data.

Figure 21: Decentralization is Stage 5 of the data ecosystem

If you have an automated data pipeline and data warehouse, I don't think you need a single team to manage all this stuff. I think the place where this will first happen, and we're already seeing this in some ways, is in a decentralization of the data warehouse. I think we're moving towards a world where people are going to be empowered to spin up multiple data warehouses and administer and manage their own.

I frame this line of thought based on our migration from monolith to microservices over the past decade or two. Part of the motivation for that was to break up large, complex things, to increase agility, to increase efficiency, and to let people move at their own pace. A lot of those characteristics sound like your data warehouse: it's monolithic, it's not that agile, you have to ask your data engineering team to do things, and maybe you're not able to do things at your own pace. I think we're going to want to do the same thing - go from a monolith to microwarehouses - and we're going to want a more decentralized approach.

I'm not alone in this thought. Zhamak Dehghani wrote a great blog post that describes much of what I’m thinking. She discusses the shift from this monolithic view to a more fragmented or decentralized view. She even discusses policy automation and a lot of the same stuff that I'm thinking about.

I think this shift towards decentralization will take place in two phases. Say you have a set of raw tools - Git, YAML, JSON, etc. - and a beaten-down engineering team that is getting requests left and right and running scripts all the time. To escape that, the first step is simply to expose that raw set of tools to your other engineers. They're comfortable with this stuff, they know Git, they know pull requests, they know YAML and JSON and all that. You can at least start to expose the automated tooling and pipelines to those teams so that they can begin to manage their own data warehouses.
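To make that first phase a little more concrete, here is a minimal sketch of the kind of automated check that could run on a pull request when another team asks for its own warehouse. Everything in it is an assumption for illustration: the JSON field names (`team`, `warehouse`, `datasets`, `owners`, `region`), the allowed regions, and the function name are hypothetical and not taken from any real tool.

```python
import json
from typing import List

# Hypothetical request format: these field names and allowed values are
# illustrative assumptions, not part of any real tool's schema.
REQUIRED_FIELDS = {"team", "warehouse", "datasets", "owners"}
ALLOWED_REGIONS = {"us", "eu"}


def validate_warehouse_request(raw: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the request passes."""
    try:
        request = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(request, dict):
        return ["request must be a JSON object"]

    problems = []
    # Report every missing field at once so the requester can fix them in one pass.
    for field in sorted(REQUIRED_FIELDS - request.keys()):
        problems.append(f"missing field: {field}")

    # The region is optional and defaults to "us" in this sketch.
    region = request.get("region", "us")
    if not isinstance(region, str) or region not in ALLOWED_REGIONS:
        problems.append(f"unsupported region: {region!r}")

    # A warehouse with no owner has nobody to administer it.
    if "owners" in request and not request["owners"]:
        problems.append("at least one owner is required")
    return problems
```

A check like this could run in CI against the file a team changes in its pull request, so that data engineering reviews the tooling rather than every individual request.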

An example of this would be a team that does a lot of reporting. They need a data warehouse that they can manage, so you might just give them the keys to the castle and let them go about it. Maybe there's a business-analytics team that's attached to your sales organization and they need a data warehouse. They can manage their own as well.

This is not the end goal; the end goal is full decentralization. But for that we need much more development of the tooling that we're providing, beyond just Git, YAML, and the RTFM attitude that we sometimes throw around.

We need polished UIs, something that you can give not only to an engineer who’s been writing code for 10 years but to almost anyone in your organization. If we can get to that point, I think we will be able to create a fully decentralized warehouse and data pipeline where Security and Compliance can manage access controls while data engineers manage the tooling and infrastructure.
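One way to picture that split of responsibilities is to treat the access policy as data that Security and Compliance own, and the enforcement as code that data engineering owns. The sketch below assumes a hypothetical policy that maps a data classification to the groups allowed to read it. The class name, function name, and classifications are all invented for illustration, and a polished UI would sit on top of something like this rather than exposing it directly.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class AccessPolicy:
    """Hypothetical policy owned by Security and Compliance.

    Maps a data classification (e.g. "pii") to the groups allowed to read it.
    """
    readers: Dict[str, FrozenSet[str]]

    def can_read(self, group: str, classification: str) -> bool:
        # Deny by default: an unknown classification has no readers.
        return group in self.readers.get(classification, frozenset())


def grant_requests(
    policy: AccessPolicy,
    requests: List[Tuple[str, str, str]],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split (group, dataset, classification) requests into approved and denied.

    This is the tooling side, owned by data engineering: it applies the
    policy but makes no access decisions of its own.
    """
    approved, denied = [], []
    for group, dataset, classification in requests:
        target = approved if policy.can_read(group, classification) else denied
        target.append((group, dataset))
    return approved, denied
```

With this shape, changing who may read a classification is an edit to the policy data alone, and the enforcement code does not need to change.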

This is what Maxime Beauchemin meant by "... data engineers build tools, infrastructure, frameworks, and services." Everyone else can manage their own data pipelines and their own data warehouses and data engineers can help them do that. There’s that key word "help" that I drew attention to at the beginning.

Figure 21 is my view of a modern decentralized data architecture. We have real-time data integration, a streaming platform, automated data management, automated operations, decentralized data warehouses and pipelines, and happy engineers, SREs, and users.

About the Author

Chris Riccomini is a software engineer with more than a decade of experience at Silicon Valley tech companies including PayPal, LinkedIn, and WePay, a JPMorgan Chase company. Over his career, Riccomini has held titles as data scientist, staff software engineer, and distinguished software engineer. He has managed engineering teams in the distributed systems and payments space. Riccomini is active in open source. He is co-author of Apache Samza, a stream-processing framework, and is also a member of the Apache project management committee (PMC), Apache Airflow PMC, and Apache Samza PMC. He is a strategic investor and advisor for startups in the data space, where he advises founders and technical leaders on product and engineering strategy.
