Databases and Stream Processing: a Future of Consolidation

Summary

Ben Stopford digs into why both stream processors and databases are necessary from a technical standpoint but also by exploring industry trends that make consolidation in the future far more likely. He examines how these trends map onto common approaches from active databases like MongoDB to streaming solutions like Flink, Kafka Streams or ksqlDB.

Bio

Ben Stopford is a Senior Director at Confluent (a company that backs Apache Kafka) where he runs the Office of the CTO. He's worked on a wide range of projects from implementing the latest version of Kafka’s replication protocol through to assessing and shaping Confluent's strategy. He is the author of the book “Designing Event Driven Systems”.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Stopford: My name is Ben, I work at a company called Confluent. We're one of the companies that sits behind Apache Kafka. I used to work on the technology itself. I wrote the latest version of the replication protocol, throttling, and a few other things. These days, I work in the Office of the CTO, which is a slightly more general function. One of the things we did last year was we actually looked very closely at the differences between stream processors and databases. This led to a new product we have, it's called ksqlDB. It's actually pretty interesting, I think, to compare and contrast the different models, both in the way that we interact with data that's moving and the way we interact with data that is passive or resident. That's what we're going to talk about today.

Before we get to that, I wanted to start by taking a little bit of an aside. I think this helps phrase why it's important in future that we do get a degree of consolidation. There's a very good post by a guy called Marc Andreessen. He's actually an investor out in Silicon Valley. This post basically says that software is going to end up eating the world. If you're an investor, you probably want to invest in software because over time, we're going to consume more software. You guys will be buying more applications on your phone. Companies buy more software that makes their business more efficient. That's like a general trend.

Actually, if you think about it, there's a couple of different forms of this. This is like weak form, like basic idea that we as an industry just consume more software. There's actually this strong form. The strong form is interesting because in this particular way of thinking about it, our businesses become effectively more automated. It's less about just us buying more software, companies consuming more software, and more about actually taking our business processes and automating them using software. That's a bit abstract. What does that mean?

Think about something like a loan processing application. This could be a mortgage that you're taking out, or some other kind of collateralized loan. This process hasn't changed for 100 years, pretty much. There's a credit officer, there's a risk officer, there's a loan officer. Each of them has their particular stage in this process. We have some software that makes it more efficient: companies will write or purchase software which makes each part of this better. [inaudible 00:03:15] helping the credit officer do their job better, or helping the risk officer do their job better. That's kind of the weak form of this analogy: we buy more software to help these people do a better job.

These days, you can actually get a loan application through in a few seconds. A traditional mortgage will take maybe a couple of weeks because it follows that manual-driven process. Today, in these more contemporary applications, at more contemporary companies, you can automate the whole thing. This is still using software, but it's using software in a different way. It's using software which is effectively talking to other pieces of software, so that you can automate the process of getting a loan very quickly, in a few seconds.

If you think about what this actually means for the architecture of the systems that we build, if you want to help a credit officer do his job better, that's where the three-tier architecture comes in. We have our user, they talk to a user interface, there is some kind of back end server application that's running behind that and a database, and you're helping the user, let's say, do risk analysis more efficiently, or whatever it might be.

As companies become more automated, and those business processes become more automated, we end up with services talking to one another. This is like software talking to software. It's less focused on the idea of specifically helping a person, and more focused on achieving some end goal, some business end goal from different pieces of software. A good example of this is like a taxi, ride-sharing applications. In these, you actually have streams of events. Both phones on the customer side and on the taxi driver side, both have GPS information which is being streamed back to the server side applications. You have many different services running in the back end. They're doing geospatial matching to try and work out which taxis they should send alerts to, to ask them if they want the ride, buffering the responses to work out which is the most efficient driver for a particular user, etc.

You've got streams of events. You've also got business events, the fact that they've actually decided to accept that particular ride. All of these different pieces of software sitting at the back end allow you to automate this process. This is very different to phoning up a taxi company and asking them whether or not you can have a cab arrive at your house at some time. This is only possible by basically automating the whole process. Many pieces of software talking to many other pieces of software.

Evolution of Software Systems

You see the same thing in the evolution of software architecture. The monolith to the distributed monolith to microservices. These are all (certainly the second two) implementations that involve distributing our processing, but it's mostly focused on users. It's mostly focused on the user interface. Actually, the data that lives in a microservice architecture, it's actually often joined together on the front end. The user interface is actually talking to a bunch of different services. The front end's actually doing that join because it's displaying different parts of the page to the user.

What happens when you want more data-centric things? That's where you tend to get into event-driven microservices. This is a pattern where you still have the user interface, you still have server-side programs which respond to those, but you're journaling events, which then allow you to decouple the architecture and do a lot of the mechanics that make the business work asynchronously. You've got that kind of user-centric side, and then what I call this software-centric side. We're actually automating the business and we're growing an architecture.

Why am I telling you all this? This trend kind of leads to this idea that the user of the software in the future is likely to be more and more, not just you and me, not just users of an application, but actually software which is talking to other pieces of software.

What Does This Mean For Databases?

What does that mean for databases? Think for a moment, and think about what a database is. You've all used databases. Many of you, I'm sure, use them every day. Certainly, you will all have used them at some point. What does it feel like to use a database? You write a query, and there's some server off in the distance somewhere. It's got lots of data on it. You can write this query, which allows you to answer a question. There's no way you'd ever be able to answer that question yourself, because there's too much data and your mind can't parse through it. You can send this question off, and you can get an answer. It's wonderful. It's a really powerful idea. That's what I think of when I think of a database. I have spent many years building on and working with databases.

We actually have hundreds of these things today. They all kind of feel the same. We have lots of different variants. Something like Cassandra will let you store a very large amount of data for the amount of memory that it's using. Something like Elasticsearch provides you with a very nice, rich interactive query model. Neo4j will let you query the relationships between entities, not just the entities themselves. Things like Oracle or Postgres are workhorse databases that can morph to different types of use case. Hundreds of these different things, and they all have slightly different variants that make them more appropriate to a certain type of use case.

They all have a couple of fundamental assumptions. One of the main ones is that the data is passive. It's just sat there in the database, waiting for you to do something. The database as a piece of software is actually designed to help you. It's cast that way as a technology because it grew up in a time when we built applications that were designed to help users, whether it's you or me, or a credit officer or a risk officer, or whatever it might be. If there's not a user interface waiting, if there's not somebody sitting there clicking buttons and expecting things to happen, you've got to ask the question, does it have to be synchronous?

One alternative is to use event streams, just like a different way of doing it. Stream processors are a technology which allow us to manipulate event streams in a very similar way to the way that databases manipulate data that's held in files. They're built, importantly, for active data, and they're built for asynchronicity. Let's just classify what that means. If you think about a stream processor, it actually has a very different model of interaction. Anybody here use something like ksql? A few of you. Keep your hand up if it felt like using a database. We got one. I actually think it feels quite different. ksqlDB, the newer one, does feel like one, but ksql feels not so much like one.

It's because of this interaction model, this very familiar interaction model that we get with the database. We don't really quite get that with a stream processor, because a stream processor isn't really designed for us. For a traditional database, the query is active, and the data is passive. What does that mean? That means that this is about what initiates the action, what makes things happen. In a database, you make things happen by running a query, the data is just sitting there passively waiting for you to run that query. If you're off making a cup of tea, the database doesn't do anything, it's just doing nothing. You have to actually initiate that action.

In a stream processor, and in event stream processing, it's the other way around. The query is passive. The query sits there, it doesn't change because it runs forever, it just sits there on the server running. The thing that makes it do something isn't somebody running a query. It's data, it's some event happening in the world, something being emitted by some other system, or whatever it might be. This is a very different interaction model.

Streams or Tables?

With that in mind, I want to dig a little bit more deeply into some of those fundamental structures that we have in stream processors, and compare how those relate specifically to databases. Probably the most fundamental one is this concept of streams and tables. Because I think streams define in many ways what a stream processor does or an event streaming system does. Stream processors work on events. This sounds like a very simple idea. In fact, it is a pretty simple idea. We can record what happens in the world based on a number of events. This could be somebody placing an order, maybe they're buying a pair of trousers, this could be you paying for the pair of trousers, it could be changing a customer's details. It could actually be the position of your phone, so another kind of continuous event stream. It could actually be a whole bunch of different things, a record of a fact, something that happens.

Actually, events as a data model are slightly different, because they include this notion of intent: they tell you about a state change, not just state. To take the kind of canonical LinkedIn example, if Bob works at Google, that's the state of Bob at some point in time. The event would be that Bob has moved from Google to Amazon. The reason that that is actually different is that typically when we drive processing, we drive our applications, we don't just want the state, we also want to be triggered by the state changes, by this intent. Stream processors are always forwarding these events, this intent, this fact that something happened, so that you can react to them.

A stream is an exact recording of exactly what happened, published, normally on a topic. A table represents the current state. This is actually important, because we're going to come back to it. It's this idea that absolutely every single thing that happened is recorded in a stream. If any of you have ever used event sourcing, it's a very similar concept, this idea that we're keeping this journal of every single state change. If we want to derive our current state, we can do that by replaying the stream. The stream tells us where we've been versus where we are now, or payments that you've made versus your account balance.

A way that I really like to think about it is, let's use chess, it's a good analogy. I can think of a database table as being the position of each piece. So I can take the position of each of those pieces, and that tells me my current state. I can store that somewhere and recreate it if I want to. A stream works the other way around. It takes a sequence of events from the starting position at the start of chess, and then we can replay all of the moves, the events, the state changes, to get us into the same position. Streams don't just tell us about the position of the board at a point in time. They also tell us about the game, and tell us about how we got there.
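The chess analogy can be sketched in a few lines of Python. This is a minimal illustration only: the names `start`, `moves`, and `replay` are hypothetical, and the board is reduced to three pieces so that the table (piece to square) and the stream (list of moves) are easy to see side by side.

```python
# Hypothetical starting "table": piece -> square (only three pieces, for brevity).
start = {"white_pawn_e": "e2", "black_pawn_e": "e7", "white_knight_g": "g1"}

# Hypothetical "stream": an append-only list of moves (state changes).
moves = [
    ("white_pawn_e", "e4"),
    ("black_pawn_e", "e5"),
    ("white_knight_g", "f3"),
]

def replay(start, moves, upto=None):
    """Derive a board position by applying the first `upto` moves (all if None)."""
    board = dict(start)                 # never mutate the starting position
    for piece, square in moves[:upto]:
        board[piece] = square           # the latest move for a piece wins
    return board

current = replay(start, moves)            # the table: where every piece is now
after_one = replay(start, moves, upto=1)  # the stream also recovers earlier positions
```

The table alone gives `current`; only the stream can also give `after_one`, which is the "how we got there" part of the analogy.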

What does this have to do with databases? We can actually think of a stream as a type of table. Actually, no real reason to call it a stream. We'll get into that a bit later. A stream can definitely be thought of as a type of table. It's a particular type of table that doesn't exist in most databases. It's immutable. You can't change it. And it's append-only. In a table, I can have insert, update, and delete. That's mutable, and it has a notion of a primary key, which is also important.

A stream can be considered as an immutable append-only table. If I write a value, it is automatically going to be recorded forever. I can't go back and change that. The reason I can't change it is because it might have been consumed by somebody else. If I want to change it, I have to make another record, write another message, which is actually then going to propagate that change. That's why immutability is so important, it allows us to recreate the state through a sequence of facts. Stream processors communicate through streams. Input streams go to output streams.

Even though that model is stream to stream, internally a stream processor is using tables. This is actually a bit like the way a database works. A database, when it runs a query, actually uses these temporary tables to hold intermediary results as it's doing its computation. You can't see them, they're temporary. In a stream processor, they're not temporary because a streaming query runs forever. It doesn't run differently for each user, for each request. It's just like one query that's running continuously. It does contain this notion of tables.

In this case, I have payments coming in. On the left-hand side, there's a payment stream. Then I'm running some credit scoring function. That's going to basically collapse the stream of payments, which we can think of as a table of payments, and aggregate it by user so that each user has a credit score. My table is actually smaller than my input stream. The stream processor then listens to that table and provides the output as another stream. When we use the stream processor, we create tables, we just don't really show them to anybody. Because what we're trying to do is read streams of data and output streams of data.

This leads to this duality, this kind of stream-table duality, where the stream represents history. Every single state change that has happened. I can apply a projection, it could be something as simple as a group by key, which basically involves just creating a table that has the latest version for each key, which is what most people think of as a table when they use a database. Then likewise, you can listen to a table, listen to all of the things that change inside it and create another stream. That stream won't be the same as the one that you started with, if your table is using an aggregation, because you lost data, but you can maintain both of these. You have this kind of duality. Streams can go to tables, tables can go back to streams.
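Both directions of the duality can be sketched in Python. This is an illustration under stated assumptions, not how any particular stream processor implements it: the `payments` data is made up, and a running total per user stands in for the credit scoring function.

```python
from collections import defaultdict

# Hypothetical payment stream: (user, amount) events in arrival order.
payments = [("bob", 10), ("jill", 25), ("bob", 5)]

def to_table(stream):
    """Stream -> table: aggregate by key (a running total stands in for a credit score)."""
    table = defaultdict(int)
    for user, amount in stream:
        table[user] += amount
    return dict(table)

def to_changelog(stream):
    """Table -> stream: emit the updated row every time the table changes."""
    table = defaultdict(int)
    for user, amount in stream:
        table[user] += amount
        yield (user, table[user])

table = to_table(payments)                 # one row per user
changelog = list(to_changelog(payments))   # one event per table change
```

Note that `changelog` is not the same as `payments`: the aggregation has discarded the individual amounts, which is the data loss mentioned above.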

This is actually very similar to the notion of a materialized view. A materialized view is a construct which is available in many databases, notably relational ones, and it came from this idea of an active database which was popular in the 1990s. There's a couple of big differences. Stream processor, we have a stream, our input is a stream, we have our aggregating function or logic or whatever it might be, which is going to create my table. Then that outputs as a stream to another application, so another application can react to that. That's asynchronous. The query is basically a push query. It's running all the time, and it's creating new results.

An active database, the input is a table. We have a table. Remember, these two things could be similar. We can create an append-only table in a database. Then we have a function that runs, creates our credit score. Then if we want to interact with that, we have to send a query and get a result, because it's a database. In this case, there's two big differences. The whole process of creating a materialized view is synchronous. It's locked inside the database, often inside a transaction. The interaction model is a pull query. This represents a very similar concept. It's just that the way that we interact with it, and the way that data moves around, is different. Big similarity between materialized views and stream processors.

Joins

Let's look at joins. Joins are another way that these things are sort of similar. A stream processor can do joins. Databases mostly can do joins, or many of them can do joins. Are these things exactly the same or are they slightly different? For the most part, they're quite similar, but let's look at how a join works in a stream processor. There are a few different types of join. We're going to start with a join to a table.

This is a very simple idea. We have a stream of orders coming in here, and a stream of customers which is going to be materialized inside this table. We're going to enrich each order with some customer information. That's our stream processor. Basically, the orders come in, we do a lookup in the table based on this primary key. This table, there's actually an event stream behind it, and it's got to effectively do a group by key to materialize the table. Then what we get out is an order joined to its customer information. That's pretty simple. It's actually almost exactly what a database does. A database reads data from a file, reads the orders from the orders table. You'd have an index, maybe on customer information, or maybe a primary key on that table. It would just look up the values inside a query. These two things work in a very similar way. The big difference is that a database knows about the data that it's got. It can maintain statistics that tell it about the cardinality of that data. That helps it decide whether or not to use indexes and stuff like this. In a stream processor, you typically don't know what's going to arrive next because it's real-time data. You can't really make those optimizations.
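A stream-to-table join of this kind might look like the following Python sketch. The record shapes and the names `materialize` and `enrich` are assumptions made for illustration; the customer table is built by keeping the latest event per key, and each order is enriched by a primary-key lookup.

```python
# Hypothetical customer event stream; a later event for the same id replaces the earlier one.
customer_events = [
    {"id": 1, "name": "Bob"},
    {"id": 2, "name": "Jill"},
    {"id": 1, "name": "Robert"},
]

# Hypothetical order stream.
orders = [
    {"order_id": "o1", "customer_id": 1, "item": "trousers"},
    {"order_id": "o2", "customer_id": 3, "item": "boots"},  # no such customer
]

def materialize(events):
    """Group by key, keeping the latest version of each customer."""
    table = {}
    for event in events:
        table[event["id"]] = event
    return table

def enrich(order_stream, customers):
    """Look up each order's customer by primary key as the order arrives."""
    for order in order_stream:
        customer = customers.get(order["customer_id"])
        if customer is not None:  # inner join: drop orders with no matching customer
            yield {**order, "customer_name": customer["name"]}

enriched = list(enrich(orders, materialize(customer_events)))
```

In a real stream processor the table would keep changing while orders arrive; here it is materialized once up front to keep the sketch short.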

What about joining two streams? That's a bit different. The thing about a stream is you know less about it. That table we can kind of index it, we can give it a primary key. With the stream, that's not quite the case. The way this works is, in this case, we've got Bob's order, Bob and Jill have orders, and we have payments and we're joining orders to payments. Maybe Bob bought some trousers and then he made a payment, and we want to kind of collate them together. Because the system's asynchronous, these things could be reordered. They could basically be reordered and turn up late. Actually, Bob's order is behind Bob's payment with respect to Jill.

As these events move into the stream processor, what happens is Bob's payment comes in and it gets buffered inside a little index. Jill's order goes into the buffer also, and then Jill's payment comes in. We can do a match. Each time one of these is processed, it's looking for a corresponding event on the other side. It's also actually checking the event time, it's using event time to process that. We'll come back to that in a bit. Then Jill's order and payment come out as a joined unit. Only when both of these things are available do we actually get an output. It doesn't matter which one came first, or came last, more importantly. That's the point at which we'll actually get the output. Then likewise, Bob's order comes in, is matched, and is sent out.
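The buffering behaviour can be sketched as follows. This is a simplified Python illustration that assumes one order and one payment per key and ignores event time and buffer expiry; `stream_join` and the `arrivals` data are hypothetical names, with the arrival order taken from the example above.

```python
def stream_join(events):
    """Join orders to payments by key; events are (side, key, value) in arrival order."""
    buffers = {"order": {}, "payment": {}}
    other_side = {"order": "payment", "payment": "order"}
    for side, key, value in events:
        match = buffers[other_side[side]].pop(key, None)
        if match is None:
            buffers[side][key] = value  # nothing to join yet: wait in the buffer
        elif side == "order":
            yield (key, value, match)   # (key, order, payment)
        else:
            yield (key, match, value)   # (key, order, payment)

# Bob's payment arrives before Bob's order; Jill's events arrive in between.
arrivals = [
    ("payment", "bob", "$20"),
    ("order", "jill", "boots"),
    ("payment", "jill", "$35"),
    ("order", "bob", "trousers"),
]

joined = list(stream_join(arrivals))
```

Output appears only when the second half of a pair arrives, so Jill's joined record is emitted before Bob's even though Bob's payment arrived first.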

The reason this is a little bit different to a database is that this is actually using a couple of things. One is that you're only buffering so much data, which means you can handle these very high throughput use cases, where you don't want to maintain the whole dataset inside a table and index it and look it up on the fly. The other one is that often these streams (remember the streams are history, they're like a history table) hold all of the different versions of a particular event. If Bob bought a pair of trousers and then he decided to change his order, you'd actually have like a set of different orders or versions of an order for the same thing. In this case, maybe like a change from Boots to Boots2. We've actually got multiple different values, multiple different state changes from the same entity inside this stream. If we just treat it as a table, we get this Cartesian product. That'd be pretty inefficient, you get loads of data coming out all the time.

Stream processors handle this by allowing me to correlate things in time. This is where windows come from. We can use a time window to basically restrict how we're going to join a data set. If we want to correlate these two things in time, but also by their identifier, we can use windows to do that. It's very hard to do that inside the database.
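The effect of a time window on a join can be shown with a small Python sketch. It is written as a batch function over two lists purely to show the semantics; a real stream processor would do this incrementally with bounded buffers. The data, the 60-unit window, and the name `windowed_join` are all assumptions for illustration.

```python
def windowed_join(orders, payments, window):
    """Pair an order with a payment only if keys match and event times are within `window`."""
    out = []
    for key, order_time, order in orders:
        for pay_key, pay_time, payment in payments:
            if pay_key == key and abs(pay_time - order_time) <= window:
                out.append((key, order, payment))
    return out

# Two versions of Bob's order and two payments, as (key, event_time, value).
orders = [("bob", 100, "Boots"), ("bob", 5000, "Boots2")]
payments = [("bob", 130, "$40"), ("bob", 5020, "$45")]

unbounded = windowed_join(orders, payments, window=float("inf"))  # Cartesian product per key
bounded = windowed_join(orders, payments, window=60)              # only events close in time
```

Without the window, every version of the order pairs with every payment for the same key; with it, each order is paired only with the payment made around the same time.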

So we have this idea of including in the stream processor the ability to correlate recent events in time. Something you can't really do in a database. You can actually do a lot more advanced versions of these. A session window is another interesting thing, also quite hard to do in a database. A session is like, you're looking for trousers on a website, and then maybe you buy some, and then you go away. You want to just detect some period of activity that then ends. A session window allows you to do that. It has no defined length, it just goes on for some period of time and then when there's a period of inactivity, it's dynamically ended.
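Session windows can be sketched in Python as grouping events separated by less than an inactivity gap. The click timestamps, the 60-second gap, and the function name are hypothetical; the sketch also sorts its input, whereas a stream processor would merge sessions incrementally as events arrive.

```python
def session_windows(timestamps, gap):
    """Group event times into sessions that end after `gap` units of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current burst of activity
        else:
            sessions.append([ts])     # inactivity exceeded the gap: start a new session
    return [(session[0], session[-1]) for session in sessions]

# Hypothetical click times in seconds for one user browsing a shop.
clicks = [0, 10, 25, 300, 310, 900]
sessions = session_windows(clicks, gap=60)
```

Each session's length is determined by the data rather than fixed in advance, which is the property described above.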

The other thing is late and out-of-order data, because things obviously change. One of the things we talk about in stream processing is the idea that it's hard to work out where the end of something is: when does something end, when does a day end, when does a computation end. In stream processing, you can actually use window functions to create outputs which you send out. Then you're actually using time to work out, when something comes in, which window it should go into. You have this notion of history, you can update previous windows. This lets you handle late and out-of-order data.
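How event time lets a late record update a previous window can be sketched in Python. The fixed-size (tumbling) window, the window size of 10, and the sample events are assumptions for illustration; window retention limits, which real systems apply, are left out.

```python
from collections import defaultdict

def tumbling_counts(events, size):
    """Count events per event-time window, emitting a revised result on every update."""
    counts = defaultdict(int)
    for event_time, _value in events:
        window_start = (event_time // size) * size  # window chosen by event time, not arrival
        counts[window_start] += 1
        yield (window_start, counts[window_start])

# Arrival order differs from event-time order: the third event is late.
arrivals = [(5, "a"), (12, "b"), (7, "late")]
updates = list(tumbling_counts(arrivals, size=10))
```

The late event does not land in the newest window; it goes back to the window starting at 0 and produces a revised count for it.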

Stream processors provide a bunch of tools that let us handle asynchronicity. They allow us to leverage time. Time in any distributed system is kind of hard to work out, or hard to reason about. Time in this context, in stream processing, you can have system time, effectively, or stream time, or you can have event time, which is the time that the event was actually created by the client. You can configure the system to use either one. They let you leverage time, and then they also let you focus on the now. Stream processors are designed for this processing, which is about real-time events. They're kind of similar. There are similarities between the way that joins work, but there are also some differences.

Data Placement

What about data placement? Stream processors use a two-tier storage model, most of them. Different ones use actually slightly different approaches, but they're all quite similar. Storage is normally in a streaming system, something like Kafka. Flink actually uses something like Kafka as well, there's maybe HDFS for snapshotting. Effectively, we can think of it as just being some distributed storage layer. Then streams are processed as they arrive, because the storage layer is not just like a file system, although Kafka actually architecturally looks a lot like a distributed file system. What's actually happening is [inaudible 00:29:00] at least conceptually, is pushing data towards you.

You can react to streams in real time. If you want to create tables, you need to materialize those. This is actually done inside the stream processor. If you have a table of customers, that will actually be materialized inside the stream processor itself. This is actually very similar to the way that databases work. If you look at something like Snowflake, Snowflake has a back end, its analytics engine, and it materializes tables locally, so that it can do elastic processing efficiently.

Data is also partitioned. It's actually partitioned at two different levels. You're partitioning on the storage level. Kafka stores data in a bunch of different partitions on a bunch of different machines. If I have some data set of orders, it's actually going to be spread across all of these machines. Likewise in the processing layer, we're also going to spread the data across multiple different machines in different partitions or shards. That allows us to process things in parallel. Big problem with doing this, though, or one of the problems, is that if you want to join things, they have to be partitioned in the same way. If you want to join orders to payments, you have to make sure they both share a key, so that the payment that corresponds to an order ends up on the same machine as that order. You do this by applying a key. If you don't have the right key, or your previous bit of processing didn't use the right key, you have to do a shuffle, redistribute data, which is relatively costly. Again, this is the same process that distributed databases use.
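Why a shared key keeps matching records together can be sketched in Python. The CRC32 hash here is a stand-in chosen for the illustration and is not the partitioner any particular system uses; the partition count, record shapes, and function names are likewise hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, the same for both data sets

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Stand-in hash for illustration; real systems use their own partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def route(records, key_field):
    """Assign each record to a partition based on the chosen key field."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_for(record[key_field])].append(record)
    return partitions

orders = [{"order_id": "o1", "item": "trousers"}, {"order_id": "o2", "item": "boots"}]
payments = [{"order_id": "o1", "amount": 20}, {"order_id": "o2", "amount": 35}]

# Both data sets are keyed by order_id, so they are co-partitioned.
orders_by_partition = route(orders, "order_id")
payments_by_partition = route(payments, "order_id")
```

Because both sides hash the same key with the same function and partition count, each payment lands in the same partition number as its order, so the join can run locally. If payments were keyed by some other field, they would have to be re-keyed and redistributed first, which is the shuffle.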

Another thing distributed databases use is broadcast [inaudible 00:30:57] to broadcast datasets to all nodes. And that's actually to avoid this problem of shuffling. In a partitioned data model, you have to re-shuffle if you want to join on a key that wasn't the key that was used in the previous operation. You can broadcast a join. There are different ways of doing this. In Kafka Streams and ksql, there's a thing called a global table. This creates a copy of the data set. Let's say I want to join orders to customers. Orders is a big fact data set, customers is relatively small by comparison, and also typically would not change that frequently. We actually broadcast that to all of the nodes. I can have partitioned data for my orders, and replicated data, the same copy on every single processing node, so that I don't have to do the shuffling. I can do more joins efficiently.

Architecturally, there are actually many parallels with data warehousing. This idea of events and facts are quite similar. They're often pretty much the same thing. A record of every single thing that happened. Dimensions are these things that you want to look up. Customer information, account information, those kind of things, they are relatively small. Actually, a lot of the technology that backs it is the patterns [inaudible 00:32:22] process a query and distribute data to different nodes in different ways, are actually the same across both, really similar across both.

Interaction Model

Finally, we get to the idea of the interaction model. We talked about this before a little bit. A stream processor continuously processes input to output using event streams. That's the way that you interact with it. It's very different to the way that you interact with the database. The query runs continuously, and outputs results as soon as they happen. The traditional database, the query is active, the data is passive. In event streaming, the data is active, and the query is passive.

Databases have this notion of a pull query. This is what we call a pull query. What's Ben's credit score now? I ask a question, and I get the result back. In a stream processor, the process is running all the time and every single time that table changes internally, we're publishing out a result. We can create a hybrid of these two. Why do we need to just have one? This is where we think things are going.

In a hybrid stream processor, a couple of things are different. Firstly, the interaction model doesn't just handle queries, requests for responses, it also handles event streams. I can have a payment stream coming in, I can be, again, computing my table. I can summarize and materialize those credit scores. Because that's a table, there's actually nothing to stop me querying it. I can send a select statement to that, get a response. At the same time, I can also listen to changes on that table as they happen. I can have both interaction models.
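A table that supports both interaction models can be sketched as a small Python class. This is an in-memory illustration only; the class and method names are hypothetical, and a running total again stands in for the credit score.

```python
class MaterializedTable:
    """A table fed by a stream that can be queried (pull) or subscribed to (push)."""

    def __init__(self):
        self._rows = {}
        self._listeners = []

    def subscribe(self, callback):
        """Push model: call `callback(user, score)` whenever a row changes."""
        self._listeners.append(callback)

    def apply(self, user, amount):
        """Consume one payment event and update the materialized row."""
        self._rows[user] = self._rows.get(user, 0) + amount
        for callback in self._listeners:
            callback(user, self._rows[user])

    def query(self, user):
        """Pull model: return the current value for a key, or None if absent."""
        return self._rows.get(user)

scores = MaterializedTable()
pushed = []
scores.subscribe(lambda user, score: pushed.append((user, score)))

for user, amount in [("ben", 10), ("jill", 25), ("ben", 5)]:
    scores.apply(user, amount)

bens_score_now = scores.query("ben")  # pull: ask once, get one answer
```

The same materialized state serves both styles: `query` answers a point-in-time question, while subscribers receive every change as it happens.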

What I think we're going to end up with is a unified model. It really manages two big differences. There's a bit more to it than this. We talked about the different types of joins and stuff like that. Fundamentally, the two big differences are handling both the asynchronous, as well as the synchronous, and having an interaction model that is both active and passive.

What does that actually look like? Actually, it's pretty simple. Remember, we can actually think of a stream as being like a table with a different interaction model. Then we can run queries using SQL. Standard database query basically just runs from the earliest bit of data. It runs [inaudible 00:35:27] to now. Then it stops. It gives you the answer. That's the query model that we're all used to.

Standard stream processing query. That starts now. It doesn't consider previous data unless I ask it to. It can do, it can go back in time and reprocess everything, because we can store all our data in Kafka for as long as we want, but it doesn't have to. Typically, it will start now and just run forever. Just keeps running forever.

There's another interaction model, which is like this dashboard query. In this we're starting from the earliest record, we're querying through to now, but then actually going to keep going, keep going forever. I call it a dashboard query model, because it's a bit like loading data into a dashboard, and then keeping it up to date. This is actually slightly nuanced because when you do that initial query, what do you actually return? Do you return just a snapshot from now, which is the latest version for each key in the table? Or do you return all of the events, or some combination [inaudible 00:36:44]? Because the future data, that's going to have to come one event at a time, or at least when your windowing function is, if you're using suppression.

We have these different interaction models, but there are some subtleties about how we actually might want to frame that data. We might want a snapshot, or we might want to have every single state change that happened in the past. We put it all together and you get this unified model, where you have earliest to now, earliest to forever, and now to forever to represent different types of queries. We can represent these in SQL. Our push query would look like regular SQL, SELECT user, credit_score, orders, etc., WHERE ROWKEY = 'bob', but we have to have this EMIT CHANGES. We're basically going to execute the query and then keep giving you extra results. Or you can just do a standard pull query. In a standard pull query, we're literally just going to ask a question, and we're going to get an answer.
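The three query shapes can be sketched over a simple in-memory log. This is an illustrative Python model only, not ksqlDB's API: the `Log` and `Cursor` classes and the three function names are made up for the example, and "forever" is modeled as a cursor that can be polled repeatedly.

```python
from typing import List


class Log:
    """Hypothetical append-only log of events."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def append(self, event: str) -> None:
        self.events.append(event)


class Cursor:
    """A long-running query: remembers its position and returns new events on poll()."""

    def __init__(self, log: Log, start: int) -> None:
        self._log = log
        self._pos = start

    def poll(self) -> List[str]:
        new_events = self._log.events[self._pos:]
        self._pos = len(self._log.events)
        return new_events


def earliest_to_now(log: Log) -> List[str]:
    """Pull query: read everything so far, then stop."""
    return list(log.events)


def now_to_forever(log: Log) -> Cursor:
    """Classic streaming query: ignore history, see only what arrives from now on."""
    return Cursor(log, len(log.events))


def earliest_to_forever(log: Log) -> Cursor:
    """Dashboard query: replay history, then keep following new events."""
    return Cursor(log, 0)


log = Log()
log.append("a")
log.append("b")

pull_result = earliest_to_now(log)
stream = now_to_forever(log)
dashboard = earliest_to_forever(log)

log.append("c")  # arrives after all three queries were issued

stream_result = stream.poll()
dashboard_result = dashboard.poll()
print(pull_result, stream_result, dashboard_result)
```

The only thing that differs between the three is where the read starts and whether it ever ends, which is why a single query language can cover all of them.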

That's the interaction model. That's one part. The other part is this idea of embracing asynchronicity. That's another thing that we think needs to change, or that we think increasingly is going to change. In an asynchronous world, if you think back to the microservices side, the idea is that we're going to connect different parts of a system together. We often want to do that asynchronously when automating a business process. Different microservices want to have different life cycles, they want to be decoupled, etc. We actually want to be able to orchestrate across different processes. We might want to pull data out of a database, push it into event streams, perform some processing, push it into an application, maybe into a lambda function or something. This idea of pipelines is actually something that ideally we want to be able to embrace, along with this interaction model, and we want the sort of joins, aggregations, and time handling that are required to manage this kind of asynchronous world.

Another thing is transactions. Transactions in a database are actually fairly different from transactions in a stream processor. In a database, you have a two-phase commit protocol when you're actually changing or reading data. In an asynchronous system, what you actually do is pass barriers. The barriers flow through the various different operations inside the system. Those barriers are actually committed, so there is also a two-phase commit protocol, but it's based on these barriers that flow through the system. This is how you can basically manage transactions, manage updates to various different places in a sort of atomic way, using an asynchronous processing system without having to do the two-phase commit that we used to have back in the days of [inaudible 00:39:44], and that kind of thing. We have asynchronous pipelines.

There are a couple of other important variants I wanted to mention. Stream processors today, most of them are actually frameworks. Flink has a SQL API, Kafka Streams has KSQL. There are a few others that do SQL-based processing. We definitely think that's the right way to talk about data; you typically want to talk about data using a declarative language. That's one thing that the relational database world definitely got right. Many of them are actually programming frameworks, and most of these are based on the JVM. That actually gives you a lot of power. You can write anything that you want, you can interact with these tables directly, you can build a different type of application.

The other side of this is that active databases have also improved, through things like MongoDB, Couchbase, RethinkDB. These let you listen to change streams on tables, which is, again, quite similar. They don't have the sort of temporal functions, or the handling of asynchronicity and creation of materialized views, that you get inside a stream processor. There's definitely a path of consolidation going on here. I think we're going to see a lot more of it.

As Software Eats the World

At the start we talked about this idea that software is eating the world, and that the user of software today, as we automate our business processes, is less likely over time to be a person, less likely to just be you and me. It's more likely actually to be some other piece of software which we're sewing together in some automated architecture. To do this, we think you need both the asynchronous and the synchronous. You need to be able to handle both of these things. You also need these different interaction models. You need to be able to query passive datasets and get answers for users that are clicking buttons and expecting things to happen. You also want this proactive interaction model where data is pushed as an event stream to different subscribing services. We still need all of these different types of database, but we're likely to see this shift happening where different niches are forming around these active databases, ksqlDB being one example of this. You're actually seeing other movements in the database community going in the same direction, particularly around active databases.

My final thought for you is this. If any of you did close your eyes, even if you did it just figuratively, and you thought about what a database means to you and how you interact with the database, maybe those things change. So does that notion, that thing that we hold so dearly, that idea of what a database is. Maybe it needs to be something slightly different to what it is today.

Questions & Answers

Moderator: When do you see this standardization happening and the emergence of this new kind of database? Do you see that's something happening quite soon? Or what is your kind of timeframe on that?

Stopford: There is actually a lot of activity in this space. You're seeing a lot of different database vendors embracing streams. Maybe we see this more closely than others, because we watch it. Mongo has event streams. Couchbase has these things. I think there was a talk earlier about databases that can interact with the front end. That's another burgeoning field. A lot of front end development is event based, yet most databases don't really handle the event-based side of things. We're definitely seeing some change in that area. From the perspective of Confluent, we have ksqlDB. We are investing heavily in that. We think that's actually the way that stream processing is going to go, because it gives you this kind of unified model. Having the familiarity of a database with a lot of the processing capabilities of a stream processor is a really powerful combination. We definitely think it's going in that direction.

Participant: How about moving into more of a data lake architecture directly rather than just a database?

Stopford: I think a data lake is, in some ways, another type of database. Everything I've said here you can think of from an operational database perspective or from an analytical perspective. I think probably the big difference with stream processing is that these two things are not the same. Event stream processing is mostly about sharing data between different places. That could be different microservices, or it could be collecting data and pushing it into a data lake. Certainly the way that we've been going with this is to provide tooling that lets you automate those processes. You can suck data out of one place and materialize it somewhere else. You can do that using a single SQL layer effectively.

At the same time, for the actual query model, I think it's unlikely that we would use that. We're not going to be a replacement for big analytics databases. It comes back to the sweet spots. There's no way that I think you'd see this kind of technology necessarily replacing big analytics systems in the short term. The sweet spots are certainly around communicating information around a company or around a microservice architecture and, most importantly, probably just creating materialized views, a lot of the sort of serving layers you want to create. You can do that with this kind of technology very well.


Recorded at:

Jun 09, 2020

Ben Stopford
