
Future of Data Engineering


Summary

Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing.

Bio

Chris Riccomini works as a Software Engineer at WePay.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Riccomini: Today I'm going to be talking about the future of data engineering, which is a highfalutin fancy title. In actuality, what I want to cover is: I want to give you a little bit of context about me, and I want to present various stages of data pipeline maturity. I'm going to twist the blog post that Gwen [Shapira] alluded to a bit so that if you've already read it, you'll still get something new, and if you haven't read it, you'll also be getting something new. Lastly, what I want to do is build out an architecture as we progress along through the stages so that we land on something that, in my view, is a modern data architecture and data pipeline, and also maybe a little bit of a hint of where I think we're going in the next couple of years.

Context

I'll get started with context. The reason I want to cover this in a little more detail than I might otherwise is because I think when I'm doing things like predicting the future or telling you how I think things are going to be, it's important to know where I'm coming from, so that you all can get a bit of a sense of my perspective, couch it with your own perspectives, and act accordingly. I also want to give you a little context about what I mean by data engineering because, as it turns out, that can mean a lot of different things. Lastly, I want to do a really brief overview of just why I'm giving this talk and what led to the blog post that Gwen [Shapira] had been talking about.

My name is Chris [Riccomini], I work right now at WePay, which is a payment processing company. We're actually part of JPMorgan Chase at this point; we got acquired a few years ago. I work on data infrastructure there and data engineering. Our stack is Airflow, Kafka, and BigQuery for the most part. Airflow is, of course, a job scheduler that kicks off jobs and does workflow kinds of things. BigQuery is a data warehouse hosted by Google Cloud. You'll get a hint of this - I make some references to Google Cloud in here, but you can definitely swap them out with corresponding AWS or Azure services. For the most part, I think they can be dropped in in the context of this talk.

Lastly, Kafka. Kafka is a big one, we use it a lot at WePay, and that leads into my previous history at LinkedIn, where I spent about seven years. LinkedIn was the birthplace of Kafka, which is, for those of you that might not be aware, a Pub/Sub, write-ahead log. At this point, it's the backbone of distributed infrastructure around logging. While I was at LinkedIn, I spent a bunch of time doing everything from data science to service infrastructure, and so on. I also wrote Apache Samza, which is a stream processing system, so I'm very interested to hear the Netflix Stateful Stream Processing talk that's on this track. I also spent some time with Hadoop, more job schedulers, and more data warehouses at PayPal. That's me.

Data Engineering?

When it comes to data engineering, there are all kinds of different definitions. I've seen people using it when they're talking about business analytics, I've seen people talk about it in the context of data science, so I'm going to throw down my definition. I'm going to claim that a data engineer's job is to help an organization move and process data. On the movement front, we're talking about either streaming pipelines or data pipelines. On the processing front, we're talking about data warehouses, stream processing.

Usually, we're focused on asynchronous, batch- or streaming-based stuff as opposed to synchronous real-time things. I want to call out the keyword help here - I'll tie this in at the end of the talk, so just bookmark that - because in my view we're not actually supposed to be moving and processing the data ourselves, we're supposed to be helping the organization do that. That was the what; this is a little bit of the how. Maxime Beauchemin, for those of you that don't know, is a prolific engineer. He started out, I think, at Yahoo, then Facebook, Airbnb, and Lyft, and over the course of his adventures wrote Airflow, which is the job scheduler that we use, as do a bunch of other companies, and he also wrote Superset.

In his blog post on "The Rise of the Data Engineer" that he published a few years ago, he said that data engineers build tools, infrastructure, frameworks, and services. This is the how - how we go about moving and processing the data.

Why?

Why am I giving this talk? The reason that I got down this path was I came across this blog post. It's from a company called ADA and it's a really nice blog post where they talk about their journey trying to set up a data warehouse. I think they do virtual assistants, I actually don't know that much about the company. They had a MongoDB database and they were starting to run up against the limits of it when it came to reporting and some ad hoc query things.

Eventually, they did some exploration and landed on using Apache Airflow and Redshift, which is, of course, AWS's data warehousing solution. The thing that struck me about this post was how much it looked like this post. This is a post that I wrote about three years ago. When I landed at WePay, they didn't have much of a data warehouse, so we got one up and running and went through almost the exact same exercise that ADA did. We did some evaluation and eventually landed on Airflow and BigQuery, which is Google Cloud's version of Redshift.

The striking thing about the posts is that they're so similar: the images showing the architectures are almost identical, and even the structure of the posts themselves, like what sections are in them, is identical. I thought this was interesting because, from my perspective, this was something we had done a few years ago, and so I threw down the gauntlet and made the claim that, if they are successful and want to continue building out their data warehouse, I know where they might end up. I thought it might be useful to share some of those thoughts, and so I made the claim that one step would be going from batch to real-time, and the next step might be going to a fully self-serve or automated pipeline.

I want to be clear, I'm not trying to pick on that particular blog post or anything, I think it's a perfectly reasonable solution. It just so happens that I think there's a natural progression in the evolution of a data pipeline, a data warehouse, and the modern data ecosystem, and that's really what I want to cover in the talk today. I refined this a little bit and I got cute with it and tried to do a Land, Expand, On Demand thing. I was toying with past-present-future, but was trying to figure out how to categorize this stuff and these thoughts in a way that really resonated and made sense.

The initial idea was: initially you land, you've got nothing, so you need to set up a data warehouse quickly. Then, you expand, you start doing more integrations, maybe you go to real-time because you've got Kafka in your ecosystem, and then finally, you do automation, where you start doing on-demand stuff. That eventually led to this post where I talked about four trends that I see coming down the road. The first one is timeliness, where I see us going from this batch-based, periodic architecture to a more real-time architecture, and the second one is connectivity, where once you go down the timeliness route, you start doing more integration with other systems.

Then, the last two, I think, tie together: automation and decentralization. On the automation front, I think we need to start thinking about automating not just our operations, but our data management - I'll go into that today - and then decentralizing the data warehouse as a whole. What I didn't do in that talk or in that post, and what I want to do today, is put a hierarchy in front of it, and I'm going to walk through these stages sequentially. As I mentioned earlier, I'll build out a little bit of an architectural diagram so we can see where we end up.

The reason I wanted to go down this path is because, as I was thinking about the future, it occurred to me that everyone's future is different because you're all at a different point in your life cycle. If you're ADA, your future looks very different from somebody like WePay, where we may be farther along on some dimensions, and then there are companies that are even farther along than us. I think this lets you choose your own adventure and build out a little bit of a roadmap for yourself.

Step 0: None

I'm going to start with the "None" stage. I wasn't sure what to call this, I couldn't think of anything clever, but you're probably at this stage if you have no data warehouse, you've probably got a monolithic architecture, you're maybe a smaller company and you need a warehouse up and running now. You probably also don't have too many data engineers and so you're doing this on the side.

It looks like this: we're all familiar with our lovely monolith and our database, and this is where you take a user and attach them directly to it. This sounds crazy to people that have been in the data warehouse world for a while, but it's actually a pretty viable solution when you need to get things up and running; the latency of the data that the users are looking at is basically real time because you're querying the database directly, and it's pretty easy and cheap. This is actually where WePay was at when I landed there.

Around 2014 when I joined, we had a PHP monolith and basically a monolithic MySQL database. The users, though, weren't quite as happy - and notice there's more than one of them there - so things were starting to tip over a little bit. We had queries timing out, we had users impacting each other. Most OLTP systems that you're going to be using are not going to be particularly strong on the isolation front, so users can really get in each other's way. Because we were using MySQL, it was missing some of the fancier analytic SQL stuff that our data science and business analytics people wanted, and report generation was starting to tip over. Pretty normal story.

Step 1: Batch

We started to go down the batch path, and this is where the ADA post comes in, and that earlier post I mentioned also comes in. On this path you have a monolithic architecture, probably. You might be starting to trend away from that a little bit, but usually it works best when you have relatively few sources. Data engineering is now probably your part-time job. You've got queries, as I mentioned we were suffering from, that are timing out. You're exceeding the database capacity, so whether it's space, memory, or CPU, you're starting to see queries just not come back.

I mentioned the complex analytic SQL stuff, and reports are becoming more and more of an issue for your organization - those could be customer-facing reports or internal reports - and people are starting to ask for things like charts and business intelligence and all that kind of fun stuff. That's where the classic batch-based approach that I think most people are familiar with comes in. In between the database and the user, you stuff a data warehouse that can handle a lot more OLAP and analytic needs. To get data from the database into that data warehouse, you have a scheduler that will periodically wake up and suck the data in.

That's where we were at maybe about 2016; this is probably about a year after I joined. This architecture is actually pretty fantastic in terms of tradeoffs; you can get the pipeline up pretty quickly these days. At the time I did it in 2016, it took a couple of weeks. The data latency we had was about 15 minutes, so we were doing incremental partition loads where we would take little chunks of data and load them in. We were running, I think, a few hundred tables. If you think back to that land, expand, on demand hierarchy that I was attempting to impose, if you're just trying to get something up and running, this is a really nice place to start off with but, of course, you outgrow it.
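
To make the batch stage concrete, here is a minimal sketch of what one of those incremental-load workflows might look like in Airflow. The DAG name, table list, and the helper that copies a partition from MySQL into BigQuery are hypothetical, not WePay's actual code; it's just meant to illustrate the "scheduler wakes up periodically and sucks the data in" pattern.

```python
# A minimal sketch of the Step 1 "batch" pattern: Airflow wakes up every
# 15 minutes and copies the latest slice of a MySQL table into the warehouse.
# DAG name, table names, and the copy helper are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator  # airflow.operators.python_operator on Airflow 1.x


def load_partition(table: str, **context):
    """Copy one incremental partition of `table` into the data warehouse.

    In a real pipeline this would SELECT from a MySQL replica using a
    modify_time predicate for the run's window and load the rows into a
    BigQuery partition.
    """
    ds = context["ds"]  # logical date of this run, injected by Airflow
    print(f"Loading partition {ds} of {table} from MySQL into BigQuery")


with DAG(
    dag_id="mysql_to_bigquery_incremental",            # hypothetical DAG name
    start_date=datetime(2016, 1, 1),
    schedule_interval=timedelta(minutes=15),            # ~15-minute data latency
    catchup=False,
) as dag:
    for table in ["payments", "accounts", "payouts"]:   # a few of the ~hundreds of tables
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=load_partition,
            op_kwargs={"table": table},
        )
```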

At some point, the number of Airflow workflows that we had went from a few hundred to a few thousand, and we started running tens or hundreds of thousands of tasks per day, so the likelihood that all of those are going to work every day starts to get pretty low, and that became a bit of an operational issue. We also discovered - and this is a little less intuitive for people that haven't actually run complex data pipelines - that in an incremental or batch-based approach, you start having to impose dependencies or requirements on the schemas of the data that you're loading. We had issues with create_time and modify_time and ORMs doing things in different ways, and it got a little complicated for us.

DBAs were impacting our workload, so if they do something that hurts our replica that we're reading off of, it can cause latency issues, which can in turn cause us to miss data. Hard deletes weren't being propagated and this is a big one if you have people that delete data from your database. This is something that I think most people don't want to do but most people do do, whether it's removing a row or a table or whatever it is, that can cause problems with batch loads as well because you just don't know when the data disappears. Latency I mentioned and, of course, more timeouts, this time the timeouts are happening on your workflow, though.

Step 2: Realtime

This is where real-time kicks off, and this is where I'm going to maybe go a little more in depth than I have over the first two stages. Now we're approaching what I would consider the cusp of the modern era of real-time data architecture. You might be ready for this if your load times are taking too long. You've got pipelines that are no longer stable, whether that's workflows failing or MySQL or whatever your RDBMS is having issues serving the data. You've got complicated workflows, and data latency is becoming a bigger problem. What I mean here is, maybe the 15-minute jobs you started off with in 2014 or 2016 are now taking an hour or a day, and the people that are using them aren't as happy about it. Data engineering is now probably your full-time job.

Lastly, you look around the ecosystem, you might have something like Apache Kafka in your back pocket and maybe the operations folks have spun it up to do log aggregation and run some operational metrics over it, maybe some web services are communicating via Kafka to do some queuing or asynchronous processing. It's probably floating around in your ecosystem at this point. From a data pipeline perspective, what we're going to do is get rid of that batch processor for ETL purposes and replace it with a streaming platform.

That's what we did. We wrote up a post where - I shouldn't say we got rid of Airflow - but we changed our ETL pipeline from Airflow to Debezium and a few other systems, so it started to look a little bit like this. This is where we were in about 2017. You can see the little Airflow box now has five boxes in it, and we're talking about many machines, so the operational complexity has gone up, but in exchange for that, we've got a real-time pipeline now. We've got Kafka, and I'm not going to go into too much detail about what that is. I will do a very brief overview, which is that it is a write-ahead log that you can produce messages to - they get appended to the end of the log - and you can have consumers that are reading from various locations in that log. It's a sequential read and sequential write thing.
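
For anyone who hasn't touched Kafka, here is a tiny sketch of that append/consume model, assuming a broker at localhost:9092, the confluent-kafka Python client, and a made-up topic name.

```python
# Minimal sketch of Kafka's log model: producers append to the end of a topic,
# consumers read sequentially from their own position (offset) in the log.
# Assumes a local broker and the confluent-kafka package; topic name is made up.
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("payments.db.payments", value=b'{"id": 1, "state": "captured"}')
producer.flush()  # messages are appended to the end of the topic's log

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "warehouse-loader",       # each group tracks its own offset
    "auto.offset.reset": "earliest",      # start from the beginning of the log
})
consumer.subscribe(["payments.db.payments"])

msg = consumer.poll(timeout=5.0)          # sequential read from the current offset
if msg is not None and msg.error() is None:
    print(msg.offset(), msg.value())
consumer.close()
```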

We use it with these connectors; it has this ecosystem in this framework called Kafka Connect. One of the connectors that we're a heavy user of is Debezium. This is a Change Data Capture connector that reads data from MySQL in real time and funnels it into Kafka, also in real time. I said the magic words there that may not be familiar to many people here - Change Data Capture is essentially a way to replicate data from one data source to others. Wikipedia has got this nice little write-up where they talk about the identification, capture, and delivery of changes made to the enterprise data source; it sounds very fancy.

To give you a concrete example, what something like Debezium will do is: if I have, in our case, a MySQL database, and I insert a row, then maybe I update that row, and at some future time I delete it, the CDC feed - Change Data Capture feed - will give me three different events: the insert, the update, and the delete. In some cases, it will actually give me the before and the after, so if an update occurs, it will show what the row looked like before and what it looked like after. You can imagine this can be useful if you're building out a data warehouse.
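
As an illustration (not Debezium's exact wire format, which depends on the connector version and converters), the three change events for that insert/update/delete might look roughly like this, with the characteristic before/after pair on the update:

```python
# Rough shape of a Change Data Capture feed for one row's lifetime.
# This mirrors Debezium's before/after/op envelope in spirit; exact field
# names and extra metadata vary by connector and serialization format.
insert_event = {
    "op": "c",                                   # create
    "before": None,
    "after": {"id": 42, "status": "new", "amount": 100},
}

update_event = {
    "op": "u",                                   # update: both images are present
    "before": {"id": 42, "status": "new", "amount": 100},
    "after": {"id": 42, "status": "captured", "amount": 100},
}

delete_event = {
    "op": "d",                                   # delete: only the old image remains
    "before": {"id": 42, "status": "captured", "amount": 100},
    "after": None,
}
```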

Debezium has got a bunch of sources; we use MySQL, as I mentioned. If you think back to that ADA post that I referenced at the beginning of this talk, one of the things about that post that caught my eye was the fact that they were using MongoDB, and sure enough, Debezium has a MongoDB connector. I should also call out Cassandra, which is something that I'll be talking a little bit about in the later portions of this talk. It's a connector that we contributed to Debezium just a couple of months ago. It's incubating and we're still getting it off the ground ourselves, but that's something that we're going to be using pretty heavily in the near future.
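
Wiring one of these sources up is mostly configuration. As a hedged sketch (property names vary across Debezium versions, and the hostnames, credentials, and table list here are placeholders), registering a MySQL source connector against the Kafka Connect REST API might look something like this:

```python
# Sketch of registering a Debezium MySQL source connector with Kafka Connect's
# REST API. Hostnames, credentials, and table names are placeholders, and some
# property names differ between Debezium versions - check the docs for yours.
import json

import requests

connector = {
    "name": "payments-mysql-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-replica.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",
        "database.server.name": "payments",              # topic prefix in older versions
        "table.whitelist": "payments.payments,payments.accounts",
    },
}

resp = requests.post(
    "http://kafka-connect.internal:8083/connectors",      # Connect worker REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```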

Lastly, a quick shout out for the data track. This is Gunnar [Morling] right up here in the front row; he's going to be talking later today, so if you're interested in more on Debezium, you should definitely come to his talk. Ok, back to our pipeline. Last but not least, we have KCBQ, which stands for Kafka Connect BigQuery. I do not name things creatively. This is just a connector that takes data from Kafka and loads it into BigQuery. The cool thing about this, though, is that it leverages BigQuery's real-time streaming insert API. When you think about most data warehouses, they tend to be batch-load oriented because they're just assuming that you're going to be doing batch loads.

Going back to my LinkedIn days, HDFS at the time was all batch based; we actually had MapReduce tasks that would spin up, read from Kafka since the last time they ran, load the data into HDFS, and then shut down again, and they would do this periodically. Even though the Kafka feed was real time, HDFS was just not set up for it at that time. One of the cool things about BigQuery is that it has basically a RESTful API that you can use to post data into the data warehouse in real time, and it's visible almost immediately. What that gives us is a data warehouse where the latency from our production database to our data warehouse is about five seconds, give or take.

It's actually a little less, usually more like a couple of seconds. This pattern really opens up a lot of use cases. First off, it lets you do real-time metrics and business intelligence off of your data warehouse. It also allows you to do debugging, which is something that's not immediately obvious, but if your engineers need to see the state of their database in production right now, being able to go to the data warehouse to do that is actually a pretty nice way to expose that state to them so that they can figure out what's going on with their system, and the fact that they're seeing a real-time, within-five-seconds view of that world is pretty handy.
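
As a rough illustration of that streaming-insert path (a minimal sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders), getting a row to show up in the warehouse within seconds is roughly a single API call:

```python
# Minimal sketch of BigQuery's streaming insert path: rows posted through the
# streaming API are queryable almost immediately, unlike a batch load job.
# Project, dataset, and table names here are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-warehouse-project")
table_id = "my-warehouse-project.payments.payments_stream"

rows = [
    {"id": 42, "status": "captured", "amount": 100},
]

# On older client versions you may need client.get_table(table_id) first.
errors = client.insert_rows_json(table_id, rows)  # streaming insert, not a load job
if errors:
    raise RuntimeError(f"streaming insert failed: {errors}")
```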

Lastly, you can do some fancy monitoring stuff with it: you can start to impose assertions about what the shape of the data should look like in the database, so that you know not only that the data warehouse is healthy, but that the underlying web service itself might be healthy. There are, of course, some problems with this. Not all of our connectors were on this pipeline at the point in time when we first did the migration, so we're now in this world where we have the new cool stuff and the older painful stuff. Datastore is a Google Cloud system that we were using that was still Airflow-based.

Cassandra, as I mentioned, didn't really have a connector, and Bigtable, which is a Google Cloud hosted version - I hate to say version - of HBase. We use all these systems as well. In addition to that, we've got BigQuery, but BigQuery needed more than just our primary OLTP data; it needed logging and metrics. We had Elasticsearch in the mix now, and we've got this fancy graph database that we're going to be open sourcing soon that needs data as well, so the ecosystem starts looking more complicated. We're no longer talking about this little monolithic database. Huge hat tip to Confluent for the image here, I think it's pretty accurate.

Step 3: Integration

This is tough; we have to start figuring out how to manage some of this operational pain. One of the first things you can do is start to do some integration so that you have fewer systems to deal with, and for us, we leverage Kafka for that. You might be ready for data integration and, if you think back 20 years to Enterprise Service Bus architectures, that's really all this is in a nutshell. The only difference is that streaming platforms like Kafka, along with the evolution in stream processing that's happened over the last 10 years or so, have made this really viable.

You might be ready if you've got a lot of microservices, you've got a diverse set of databases, as I showed in that last picture. You've got some specialized derived data systems; I mentioned graph databases, but you may have special caches, you might have a real-time OLAP system. You've maybe got a team of data engineers now, enough people that they can start to manage some of this complex workload. Lastly, hopefully, you've got a really happy, mature SRE organization that's more than willing to take on all these connectors for you.

This is what it looks like. You'll see we've still got the base data pipeline that we've had so far: we've got a service with a DB, we've got our streaming platform, and we've got our data warehouse, but now we've got the web services, maybe we've got a NoSQL thing, or we've got this NewSQL thing. We've got a graph database there. Then, I've also got some search thing there to plug in. This is an example; in our case, a concrete instance of this would be around where we were at the beginning of the year. Things are getting even more complicated now.

Now we've got Debezium connected not only to MySQL, but to Cassandra as well - we highlight that. This is, as I mentioned previously, the connector that we've been working on with Gunnar and company to try and get off the ground. You'll see down in the bottom there, we've also got KCW, which stands for Kafka Connect Waltz. Waltz is a ledger that we've built in-house that's Kafka-ish in some ways and more like a database in other ways, but it serves our ledger use cases and our ledger needs. We are a payment processing system, we care a lot about data transactionality and multi-region availability, and so it's a quorum-based write-ahead log that handles serializable transactions.

On the downstream side, as I mentioned, we've got a bunch of this stuff going on. Why are we incurring all this pain? Why are there so many boxes? This is getting more and more complicated. The answer has to do with Metcalfe's law. For those of you that don't know it, I'm going to paraphrase and probably corrupt it quite a bit, but essentially it's a statement that the value of a network increases the more nodes and connections you add to it. It's usually used in the context of social networking, where people are always talking about adding more nodes and edges. It was actually initially intended for communication devices - adding more peripherals to an Ethernet network is what the Wikipedia page said.

This is what we're talking about - we're talking about getting to a network effect in our data ecosystem. This leads me to another post that I wrote earlier in the year where I thought through the implications of Kafka as your escape hatch. When you leverage this network effect in the data ecosystem, what you're doing is adding more and more systems to the Kafka bus that are now able to load their data in, expose it to other systems, and slurp up the data of other systems.

We found this to be a pretty powerful architecture because your data becomes really portable, and so there are at least some advantages. First off, I'm not going to say it'll let you avoid vendor lock-in, but it at least ameliorates some of those concerns, because if your data is portable, that's usually the harder part to deal with when you're moving between systems. The idea that you could, if you're on Splunk, plug in Elasticsearch alongside it to test it out suddenly becomes theoretically possible; the cost to do so certainly gets lowered.

It also helps with multi-cloud strategy, so if you need to run in multiple clouds because you need really high availability, or you just want to be able to play the cloud vendors against each other to save money, you can do that, and you can use Kafka and the Kafka bus as a way to move the data around. Lastly, I think it leads to infrastructure agility. I alluded to this with my Elasticsearch example, but if you come across some hot new real-time OLAP system that you want to check out, or some new cache that you want to plug in, the fact that your data is already in your streaming platform, in Kafka, means that all you really need to do is turn on the new system and plug in a sink for it to load the data. You can at least start to get a feel for how the system is going to behave and how the pipeline might behave.
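
As a sketch of what "plug in a sink" means in practice, standing up a new downstream system can be as small as one more connector config. The connector class here is Confluent's Elasticsearch sink, used purely as an example; the hostnames and topic names are placeholders.

```python
# Sketch: adding a new specialized downstream system is mostly another sink
# config once the data is already flowing through Kafka. Connector class and
# settings are illustrative (Confluent's Elasticsearch sink); endpoints are
# placeholders.
elasticsearch_sink = {
    "name": "payments-elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "payments.payments",                        # data already in Kafka
        "connection.url": "http://elasticsearch.internal:9200",
        "key.ignore": "true",
    },
}
# Register it with the same POST to the Kafka Connect REST API shown earlier;
# tearing it down again is a single DELETE, which is what keeps the cost of
# "testing the water" with new infrastructure so low.
```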

It drastically lowers the cost of testing the water with new things and supporting specialized infrastructure - things that maybe do one or two things really well, where normally you might have to decide on a tradeoff: "Operationally, do we want to support this specialized piece of infrastructure like a graph database, or do we want to use an RDBMS which just so happens to have JOINs?" By reducing the cost, you can start to get a little bit of a more granular set of infrastructure to handle your queries.

The problems here look a little different. What we found ourselves doing when we moved to this architecture, bought in, and did a bunch of the integration was that we were still spending a lot of time doing fairly manual stuff: adding channels for MySQL DBs, adding topics for Kafka, setting up Debezium connectors, creating data sets - you can read the whole list here. In short, we were spending a lot of time administering the systems around the streaming platform - a lot of the connectors, the upstream databases, the downstream data warehouses - and our ticket load started to go something like this.

For those of you that are huge fans of JIRA, you might recognize this. This is a screenshot of our support load in JIRA over the past 300 days or so, and you can see it's happy at the beginning of the year, it's relatively low. Then, around March or May, it skyrockets, and it hasn't ever really fully recovered, although there's a nice trend over the last couple of months that I'll get into right now.

Step 4: Automation

We started investing in automation. This is something you've got to do when your system gets so big. It's a "No, duh," thing, I think most people would say, "Yes, you should have been automating all along." That's like table stakes.

You might be ready for this step if your SREs can't keep up, you're spending a lot of time on manual toil - I'll get into what I mean by toil a little bit - and you don't have time for the fun stuff. I want to add two new layers here. The first one is the automation of operations and this is something, like I said, that most people are going to not be too surprised about. It's just the DevOps stuff that has been going on for a long time, I think a lot of people are very familiar with it. There's a second layer in here that I don't think is quite as obvious and that is the data management automation layer, so I'm going to go into both of these now.

First off, we'll cover automation for operations. I'll do that relatively quickly because I don't think there's a ton of new ground to cover here. There's a great quote from the Google SRE handbook where - I think the chapter is actually on toil - they define toil as manual, repeatable, automatable stuff. It's usually interrupt-driven: you're getting Slacks or tickets, or people are showing up at your desk asking you to do things. They say, "If a human operator needs to touch your system during normal operations, you have a bug." That is not what you want to be doing.

What are normal operations for data engineering? It's all the stuff we were spending our time on. Anytime you're managing a pipeline, you're going to be adding new topics, adding new data sets, setting up users, and granting access. This stuff needs to get automated. Great news, there's a bunch of solutions for this: Terraform, Ansible, and so on. I saw there was a good talk on Terraform yesterday from one of my old co-workers. At WePay we use Terraform and Ansible, but like I said, you can substitute any one of these out if you want to, and it doesn't look terribly surprising. You can use it to manage your topics.

Here's an example where you've got some systemd_log thing where you're logging some stuff and you're using compaction, which is an exciting policy to use with your systemd_logs, but I digress. You can also manage your Kafka Connect connectors. These are the Debeziums and the KCBQs and the KCWs of the world. Not terribly surprising - we should be doing this, and we are doing this. We have Terraform, we've had Ansible for a long time, we're moving now to Terraform and Packer and more immutable deploys, but long story short, we've got a bunch of operational tooling.
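
The slides show this with Terraform, which isn't reproduced here. As an alternative illustration of the same idea - topics get created by automation, not by hand or by ticket - here is a hedged sketch using Kafka's admin API from Python; the broker address, topic name, and settings are placeholders.

```python
# Sketch of automating topic management rather than doing it manually.
# The talk's slides use Terraform for this; the same idea via Kafka's admin
# API looks roughly like the following. Names and settings are placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka.internal:9092"})

topic = NewTopic(
    "systemd_logs",
    num_partitions=12,
    replication_factor=3,
    config={"cleanup.policy": "compact"},   # the compaction policy mentioned above
)

futures = admin.create_topics([topic])
for name, future in futures.items():
    future.result()                         # raises if creation failed
    print(f"created topic {name}")
```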

We're fancy and we're on the cloud, and we have a bunch of scripts that we use to manage BigQuery and automate a lot of what I would call toil - things like creating views in BigQuery, creating data sets, and so on. So why are we still having such a high ticket load? The answer is, we were spending a lot of time on data management. We were answering questions like these: "Who's going to get access to this data once I load it?" "How long am I allowed to keep this data? Hey, Security, is it ok to persist this data indefinitely, or do we need to have a three-year truncation policy?" "Is the data even allowed in the system?" WePay is a payment processor, so we deal with some pretty sensitive information. What geography can it be in? Should there be certain columns that get stripped out or redacted? Stuff like that.

I won't say we're on the forefront of this, but we certainly deal not necessarily with regulation, but with policy and compliance stuff. We have a fairly robust compliance arm that's part of JPMorgan Chase. In addition to that, because we deal with credit cards, we have PCI audits and we deal with credit card data, so we really need to think about this. I don't think we're alone and I think this is going to become just more and more of a theme, so get used to it.

We're going to have to start doing a lot of this stuff, and that's the reality of the situation. If you're in Europe, you're talking about GDPR; CCPA is for California; we have PCI if you've got credit card data; HIPAA for health; SOX if you're public; SHIELD is one I didn't even know about, but apparently that's in New York. We're going to have to really start getting better at automating this stuff, or else our lives as data engineers are mostly just going to be spent chasing people around trying to make sure this stuff is compliant.

I want to talk a little bit about what I think that might look like. Now I'm getting into the more futuristic stuff and so things might get a little more vague or hand wavy, but I'm trying to keep it as concrete as I can. First thing you want to do is probably set up a data catalog. This is something that over the past year or two, I've seen a lot of activity in. A data catalog - I should mention, you probably want it centralized, i.e., you want to have one with all the metadata - it's going to have the locations of your data, what schemas that data has, who owns the data, lineage which is essentially where the data came from.

In my initial examples, it would be like: it came from MySQL, it went to Kafka, and then it got loaded into BigQuery - knowing that whole pipeline. Maybe even encryption or versioning, so you know what things are encrypted and how things are versioned as the schemas evolve. I'm going to do a big shout out to Amundsen, which is a data catalog from Lyft, but I should be clear, there's a bunch of activity in this area; it's gotten very hot over the last year. You have Apache Atlas, you have DataHub, which was recently open sourced - that's from LinkedIn. WeWork has a system called Marquez, Google has a product called Data Catalog, and I know I'm missing at least two or three more from this list.

These things generally do a lot, and they generally do more than one thing, but I wanted to show an example just to make it concrete. We've got an example here with some fake data that I yanked from the Amundsen blog, where they've got the schema, the field types, the data types, everything. They've got who owns the data; and notice that "Add" button there - I want to get back to that in a moment, so just keep it in the back of your mind.

We've got the source, so this is starting to get into a little bit of the lineage: it's telling you what source code generated it, what system generated it - in this case, it's Airflow, that's what that little pinwheel is - some lineage about where the data came from, and they've even got a little preview; it's a pretty nice UI. Underneath it, of course, is a repository - I think you can even use Apache Atlas or Neo4j, if I'm not mistaken, to power it - that actually houses all this information. That's really useful because you need to get all your systems talking to this catalog.

That little plus thing that I mentioned on the owner part - you don't, as a data engineer, want to be entering that data in yourself. That is not where you want to be; that's back in the land of manual data stewardship and manual data management. Instead, what you want to be doing is hooking up all these systems to your data catalog so that they're automatically reporting stuff about the schema, about the evolution of the schema, and about the ownership when the data is loaded from one to the next. First off, you need your systems like Airflow and BigQuery, your data warehouses and stuff, to talk to the data catalog. I think there's quite a bit of movement there.
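
Amundsen's databuilder and DataHub's ingestion framework each have their own APIs; rather than guess at those, here is a deliberately hypothetical sketch of the kind of hook you would add to a pipeline so metadata gets pushed automatically instead of hand-entered. The CatalogClient class and its methods are invented for illustration only.

```python
# Deliberately hypothetical sketch: the point is that the pipeline itself
# reports metadata (schema, owner, lineage) to the catalog on every load,
# rather than a data engineer typing it in by hand. `CatalogClient` and its
# methods are invented names, standing in for Amundsen/DataHub/Atlas APIs.
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    name: str
    schema: dict                                     # column name -> type
    owner: str
    upstream: list = field(default_factory=list)     # lineage: where the data came from


class CatalogClient:
    """Stand-in for a real metadata catalog client."""

    def register(self, table: TableMetadata) -> None:
        print(f"catalog <- {table.name}, owner {table.owner}, lineage {table.upstream}")


def on_load_complete(catalog: CatalogClient) -> None:
    # Called by the loader after each successful load, so the catalog stays
    # up to date without any manual data stewardship.
    catalog.register(
        TableMetadata(
            name="warehouse.payments.payments",
            schema={"id": "INT64", "status": "STRING", "amount": "NUMERIC"},
            owner="payments-team",
            upstream=["mysql.payments.payments", "kafka.payments.payments"],
        )
    )


on_load_complete(CatalogClient())
```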

You then need your data pipeline and streaming platforms to talk to the data catalog. I haven't seen as much yet in that area, although maybe there's stuff coming out that will integrate better; right now I think that's something you've got to do on your own. Then, something that I don't think we've done a really good job of bridging the gap with is the service side: I will claim you want your service stuff in the data catalog as well. This would be things like gRPC protobufs, JSON schemas, and even the schemas of those services' databases.

Once you know where all your data is, the next step is to configure access to it. Right now, if you haven't automated this stuff, what you're probably doing is going to Security or Compliance or whoever the policy maker is and asking whether someone can see this data every time they make an access request, and that's not where you want to be. You want to be able to automate the access request management so that you can be as hands-off with it as possible.

This is kind of an alphabet soup. What we're really talking about here is RBAC (role-based access control), identity and access management, access control lists - a bunch of fancy words for a bunch of different features for managing groups, user access, and so on. You need three things to do this: the first is you need your systems to support it. The second is you need to provide some tooling to Security and Compliance to configure the policies appropriately. The third is you need to automate the management of the policies once they've been defined by your Security and Compliance folks.

I'm going to start with some good news, and that is that I think there has been a fair amount of work done on the systems-support aspect of this. Airflow has RBAC, which is role-based access control; it was a patch we submitted from WePay last year. I should mention Airflow has had a lot more work done on it since then - they now have DAG-level access control - they've really taken it pretty seriously. Kafka also has ACLs and has had those for quite a while, and here's an example of managing them with Terraform.

You can use those tools to start to automate some of this stuff. We want to automate so that when a new user is added to the system, their access automatically gets configured, and when a new piece of data is added to the system, its access controls automatically get configured. We want to start automating service account access, as new web services come online. The last two are less obvious: there's occasionally a need for someone to get temporary access to something, and you don't want to be in a position where you're setting a calendar reminder three weeks in the future to say, "Please remember to revoke the access for this user." You want all of that to be automated.
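
The concrete APIs here depend entirely on your IAM system, so this is another deliberately hypothetical sketch: the shape of the automation is a reconciliation loop that grants what the policy allows, expires temporary grants on their own, and revokes anything the policy no longer covers. The grant and revoke helpers stand in for Terraform runs or cloud IAM calls.

```python
# Hypothetical sketch of access automation: a reconciliation loop that grants
# access per policy, expires temporary grants, and revokes stale access.
# `grant` and `revoke` stand in for real IAM / Terraform operations.
from datetime import datetime

policy = {
    "analyst@wepay.example": {"dataset": "payments_clean", "expires": None},
    "contractor@wepay.example": {
        "dataset": "payments_clean",
        "expires": datetime(2019, 12, 1),    # temporary access with a hard expiry
    },
}

current_grants = {"old_user@wepay.example": {"dataset": "payments_clean"}}


def grant(user: str, dataset: str) -> None:
    print(f"granting {user} access to {dataset}")


def revoke(user: str, dataset: str) -> None:
    print(f"revoking {user} access to {dataset}")


def reconcile(now: datetime) -> None:
    # Grant anything the policy allows and that hasn't expired yet.
    for user, rule in policy.items():
        if rule["expires"] is None or rule["expires"] > now:
            grant(user, rule["dataset"])
        else:
            revoke(user, rule["dataset"])    # expiry handled without calendar reminders
    # Revoke grants that no policy covers anymore (stale or unused access).
    for user, rule in current_grants.items():
        if user not in policy:
            revoke(user, rule["dataset"])


reconcile(datetime.utcnow())
```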

The same deal with unused access - you want to know when users aren't using all the permissions that they've been granted so that you can start to strip them away to limit the security vulnerability space.

Now we know where all our data is and we've got the policies set up; next we need to detect violations, and this area of my talk is a little thin. I mostly want to talk about data loss prevention, but there's also auditing, which is keeping track of logs and making sure that the activities and the systems are conforming to the required policies - so we need to monitor and verify that the policies aren't being violated.

I'm going to pick out - because I'm on Google Cloud and I have some experience with this piece of software - the data loss prevention solution from GCP. There's a corresponding one from AWS called Macie. There's also an open source project called Apache Ranger, which has a little bit of an enforcement and monitoring mechanism built into it; it's more focused on the Hadoop ecosystem. The theme with all these things, though, is that you can use them to detect sensitive data where it shouldn't be.

This is an example where we've got a piece of text that says, "My phone number is blah-blah," and you get a result back saying it detected an info type of phone number and the likelihood is very likely. You can use this stuff to start to monitor the policies that you set forth. For example, if you have, say, a data set that is supposed to be cleaned, i.e., not have any sensitive information in it, you can run DLP checks on that clean data set and if anything pops up like a phone number, a social security number, or credit card, you can be alerted immediately that there's a violation in place.
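
That phone-number example maps fairly directly onto the Cloud DLP client. A minimal sketch follows; the project ID is a placeholder, and the request shape may differ slightly across versions of the google-cloud-dlp library.

```python
# Minimal sketch of the DLP check described above: inspect a piece of text for
# sensitive info types and alert if anything turns up where it shouldn't be.
# The project ID is a placeholder; request shape varies a bit across versions
# of the google-cloud-dlp client.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/my-warehouse-project",
        "inspect_config": {
            "info_types": [{"name": "PHONE_NUMBER"}, {"name": "CREDIT_CARD_NUMBER"}],
        },
        "item": {"value": "My phone number is (415) 555-0123"},
    }
)

for finding in response.result.findings:
    # e.g. "PHONE_NUMBER VERY_LIKELY" - in a supposedly clean data set,
    # any finding like this should page somebody.
    print(finding.info_type.name, finding.likelihood.name)
```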

We have a little bit of progress here. Users can find the data that they need, they can use their data catalog, and we have some automation in place - maybe we're using Terraform to manage ACLs for Kafka, maybe we're using Terraform to manage RBAC controls in Airflow - but there's still a problem here. The problem is, data engineering is probably still the one managing that configuration and those deployments, and the reason for that is mostly the interface. We're still, at this point, in the land of Git pull requests, Terraform DSLs, YAML, JSON, Kubernetes - it's nitty-gritty.

Even going to some security teams and asking them to make changes to that might be a tall order, going to your compliance wing is an even taller order, and going beyond compliance is basically, "There is no way."

Step 5: Decentralization

This leads to the last stage that I want to talk about and that is decentralization. You're probably ready to decentralize your data pipeline and your data warehouses if you have a fully automated real-time data pipeline but people are still coming to you asking to get data loaded. That's a hint that you’re probably ready for this step.

The question is, if everything's automated, why do we need a single team to manage all this stuff? I, of course, don't think you do. I think the place where we're going to see this first, and we're already seeing this in some ways, is a decentralization of the data warehouse. I think we're moving towards a world where people are going to be empowered to spin up multiple data warehouses and administer and manage their own. The way I frame this line of thought is really around our migration from monolith to microservices that we've had going on in the past decade or two.

Part of the motivation there was really to break up large complex things, increase agility, increase efficiency, let people move at their own pace. A lot of that stuff sounds like your data warehouse, it's monolithic, it's not that agile, you're having to go to your data engineering team to ask to do things, maybe you're not able to do things at your own pace. I think we're going to want to do the same thing and we're going to want to get towards a more decentralized approach.

A quick shout out - I'm not alone in this, there's a really great blog post. When I read this post, I was like, "Yes, this is exactly what I've been thinking about and this is just such a great description of it." It turns out that the author of this blog post, Zhamak Dehghani, is the next speaker in this track right after me, so I'm very excited to hear what she has to say. In the post, she talks about the shift from this monolithic view to a more fragmented or decentralized view, and she even talks about policy automation and a lot of the same stuff that I'm thinking about.

I think this shift towards decentralization will take place in two phases. If you've got this set of raw tools - you've got your Git, your YAML, and your JSON - and you're this beaten-down engineering team that is just getting requests left and right and you're just running scripts all the time, if you're just trying to escape that, the first step is simply to expose that raw set of tools to your other engineers. They're comfortable with this stuff, they know Git, they know pull requests, they know YAML and JSON and all of that, so you can at least start exposing the automated tooling and pipelines to those teams so that they can begin to manage their own data warehouses.

An example of this would be, maybe you've got a team that does a lot of reporting or reconciliation and they need a data warehouse that they can manage; you might just give them the keys to the castle, and they can go about it. Maybe there's a business analytics team attached to your sales organization and they need to have a data warehouse; they can do it as well. This is not the end goal, though; the end goal is full decentralization, and for that, what we really need to see is a lot of development and evolution in the tooling that we're providing, beyond just the Git and the YAML and the RTFM attitude that we throw around sometimes.

I'm talking here more about UIs, something that's polished, something that you can give not just to an engineer with 10 years of writing code under their belt, but to people outside of that in your organization. If we can get to that point, I think we will have a fully decentralized warehouse and data pipeline where security and compliance can manage access controls, data engineering can manage the tooling and the infrastructure - which, if you think all the way back to the beginning of this talk, is really what Maxime was talking about - and everyone else can manage their own data pipelines and their own data warehouses, and we can help them do that, which is that keyword "help" that I wanted to call out at the beginning of this talk.

With that, this is where I landed on my view of a modern data architecture. We've got real-time data integration and a streaming platform, we've got automated data management, we have automated operations, decentralized data warehouses and pipelines, and happy engineers, SREs, and users.

Questions and Answers

Participant 1: I'm new to this space, so forgive me if this is an elementary question, but with all the loading of low-level data that's in these online transactional system databases, are you concerned about coupling to that data at all? And do you think any of our reporting and analytics should move to the APIs that these services expose to other consumers?

Riccomini: I'm going to call out another blog post that I have. There's a blog post that I wrote earlier in the year called "Change Data Capture Breaks Database Encapsulation" - or microservice encapsulation - and it is exactly what you're talking about. There is a problem. In the post, I enumerated the various strategies that you can employ, and they vary from the draconian, like banning backwards- and forwards-incompatible changes on the microservice database, to more flexible things where you might put a streaming shim in between that allows the team that owns the microservice to mulch their data before it gets exposed to the public.

I think Gunnar also has a really good post on an outbox pattern that he talks about with Debezium. Yes, it's absolutely a problem, something you've got to deal with. I think we're headed towards a world where we are going to provide that data API, much like the public-facing API that the microservices expose. These problems exist for the microservice APIs as well: you're talking about versioning, you're talking about migrating the users from one version to another, and so on. The tooling is a little more nascent, but it is there - the rudimentary parts, whether it's stream processing or schema management and detection, that stuff is there, but it's a little more painful.

Participant 2: Can you elaborate a little bit on how you handle historical data in the case, for example, where you find out that there is some defect that seeped into years' worth of data in the data warehouse and you have to recalculate from scratch?

Riccomini: There are a couple of ways to handle that. One is manual - we will just go into the data warehouse and mulch the thing to make it look like the upstream thing, and that's painful and not really rigorous, in the sense that you may make a mistake. The other way is lazier and usually a little bit slower, but it is more accurate, and that is what we call re-bootstrapping or re-snapshotting the data.

Specifically for us with Debezium, when you first start the connector up, obviously there's no data in Kafka yet, but that table might have existed for two years, so Debezium will do a consistent snapshot where it will select all the data out and then load it into Kafka. If you decide that something went wrong, you can essentially re-trigger and re-snapshot the thing again.

Participant 3: Excellent talk, thank you for providing this amazing framework to understand the challenges in data engineering. I think it's a good talk because it launches a lot of conversations and I'm sure you'll get a lot of questions from people here. I have many but I'll ask one or two. One has to do with data quality and data ownership, I think there's always this question of publishers of data, subscribers of data, who owns the data? How do you tackle that one?

Riccomini: I'm a firm believer that the publisher should own it, and that's coming from a data engineer. At WePay, we take the stance that we are responsible for taking the data from the upstream database and making it look identical in the data pipeline, and so if it looks identical, we've done our job. I think as we decentralize, the ownership is going to have to shift to the engineering teams or whomever, and so then it becomes about detection - when things aren't matching up, alerting the proper people.

We have some systems that we employ that do data quality checking at WePay. Right now, all those alerts come to us, and the problem we have - and I think this is true of a lot of data engineering - is that we don't necessarily have as much context around the business case, like what the data is. We're just moving the data around, so we usually end up having to chase people down. My opinion is, if we go with the microservice example, if the pipeline is accurately moving the data and reflecting what is in the source database, then data engineering is doing its job. Usually, the problem is not in that pipeline: there's confusion about the semantic meaning of a column, or there was a schema change upstream that affected the downstream users in a way that they're not happy about, or they're stuffing JSON into a string and they've stopped using the field that they were using. That stuff we just push back on; that's on the individual engineering teams to own.

 


 

Recorded at:

Dec 04, 2019
