
Streaming Databases: Embracing the Convergence of Stream Processing and Databases


Summary

Yingjun Wu discusses the evolution of streaming databases, and the features and design principles that set streaming databases apart from conventional database systems and stream processing engines.

Bio

Yingjun Wu is the founder of RisingWave Labs, a database company developing RisingWave, a distributed SQL database for stream processing. Before founding the company, Yingjun was a software engineer on the Redshift team at Amazon Web Services and a researcher in the Database group at IBM Almaden Research Center.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Wu: I'll be talking about streaming databases: embracing the convergence of stream processing and database systems. How many of you have already heard of stream processing, or have experience using stream processing systems? I was at Current, the Confluent event, and Confluent heavily promotes its commercialized Apache Kafka. We actually did a survey asking how many people have used any stream processing system, like Flink, ksqlDB, or whatever. Actually, not many, compared to the people who have used Kafka. So let's talk about streaming databases. My name is Yingjun Wu. I'm the founder of RisingWave Labs, a company working on a streaming database system. Prior to starting this company, I was at AWS on the Redshift data warehouse team. I also spent two years at IBM Research Almaden, doing research in event stream processing. I obtained my PhD in stream processing and database systems. I feel quite lucky to have had the chance to explore both of these domains.

What is RisingWave?

What we are building is called RisingWave. It's essentially a distributed SQL streaming database. It was open sourced under the Apache 2.0 license in 2022. Over the last year, we have gathered 5,000 GitHub stars and 130 GitHub contributors. We have a large Slack group, and the system has already been deployed over 100,000 times worldwide.

Stream Processing

Streaming databases are designed for stream processing, that's for sure, so let's talk about stream processing first. Every time we mention stream processing, people think about real-time applications like ads recommendation, network monitoring, delivery apps, and stock dashboards. All these real-time apps generate large amounts of streaming data. People usually use operational databases, that is, OLTP databases like MySQL, Postgres, CockroachDB, Yugabyte, and TiDB, together with streaming platforms like Kinesis, Pulsar, Kafka, and Redpanda, to capture the streaming data. You can just use MySQL or Postgres to support these applications, or you can use Kafka or Kinesis to store the original data. The problem is that while these platforms are pretty good at storing data, you basically cannot do analytics over them. You can hardly use MySQL or Postgres to do real-time analytics. How do we do that? That's why we need a stream processing system. A stream processing system continuously ingests data from upstream systems like operational databases or messaging systems, does the computation, and delivers real-time insights to the users. If you have experience working on stream processing, you may have already heard of systems like Apache Storm, Apache Samza, and Apache Flink. Here we'll be talking about streaming databases. Streaming databases can be considered a new breed of stream processing system. They emerged recently and have already attracted a lot of attention. Representative streaming databases include RisingWave, Materialize, ksqlDB, Timeplus, DeltaStream, and many others. The question is: stream processing systems like Flink, Samza, Storm, and Spark Streaming have already been deployed in so many companies, so why do we need a streaming database? Why do we need to build one?

What is a Streaming Database?

To answer this question, let's first discuss: what is a streaming database? A streaming database is just like a database system. You can use a streaming database in the same way you use MySQL or Postgres. The difference is that MySQL and Postgres are optimized for transaction processing. You can use them for your operational workloads, but you cannot use them for stream processing; MySQL and Postgres cannot help you generate real-time insights. Streaming databases like RisingWave, ksqlDB, and many others are tailored for stream processing. Inside a streaming database, a key concept is the materialized view. Basically, you use materialized views to express your stream processing logic. Once you define a materialized view using the CREATE MATERIALIZED VIEW AS statement, the database system starts continuously ingesting data from upstream systems like MySQL, Postgres, Kafka, or Kinesis, doing the computation, and filling the materialized view with consistent and fresh results. You can easily connect your favorite BI tools to a streaming database and visualize the materialized view results in Superset, Grafana, Tableau, or Metabase.
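
To make that concrete, here is a minimal sketch of what this looks like in SQL. The syntax loosely follows RisingWave's Postgres-style dialect; the schema, topic, and connector options are hypothetical, and the exact source syntax varies across streaming databases.

-- Declare a stream of orders coming from a (hypothetical) Kafka topic.
CREATE SOURCE orders (
    order_id BIGINT,
    amount DOUBLE PRECISION,
    order_time TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'kafka:9092'
) FORMAT PLAIN ENCODE JSON;

-- This view is computed incrementally and stays fresh as events arrive;
-- a BI tool can simply SELECT from hourly_sales.
CREATE MATERIALIZED VIEW hourly_sales AS
SELECT window_start, SUM(amount) AS total_sales
FROM TUMBLE(orders, order_time, INTERVAL '1 hour')
GROUP BY window_start;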

What's the difference between a stream processing engine and a streaming database? If you are not familiar with either of these two kinds of systems, or you have no idea how they differ, think about their counterparts in the batch regime. In the batch regime we have batch processing engines and OLAP databases. Storm, Samza, and Flink can be considered the streaming version of a batch processing engine like Spark, while RisingWave, Materialize, ksqlDB, Timeplus, and many other streaming databases can be considered the streaming version of OLAP databases. Like Spark, systems such as Samza, Storm, and Flink are computation engines that are primarily designed for computing over streaming data, and they do not have their own storage. You do not store data inside Spark; Spark only does computation. Similarly, Flink will not store your data; it only does computation. On the other hand, RisingWave, ksqlDB, and other streaming databases are comprehensive database systems that have both computation and storage functionality.

History of Stream Processing Systems

Coming back to this question: why a streaming database? Is it only because we need to store data in a database? To answer this, we can review the history of stream processing systems. The field has been investigated for two decades. In the early 2000s, researchers from academia built research prototypes like STREAM, NiagaraCQ, Aurora, and Borealis to use stream processing technology for real-time monitoring applications. It didn't take long for stream processing technology to prove its business value. That's why the top three database vendors, behind Db2, SQL Server, and Oracle, each introduced a streaming product. At IBM, I remember it was called Db2 Event System [inaudible 00:09:11]; SQL Server had SQL Server StreamInsight, while Oracle had Oracle CQL, the Continuous Query Language. After these three big vendors introduced their streaming products, the big tech companies started building their own versions of stream processing technology. That's why we saw the emergence of popular stream processing systems like Storm, Samza, Spark Streaming, Flink, and many others. More recently, probably because of the cloud, we saw a new trend: a new group of people started building new stream processing systems. That's the history.

Revisiting MapReduce

From this history, we can see that essentially we had streaming databases first. Then, probably because of MapReduce, people started building stream processing engines. More recently, people came back and asked whether we can build streaming databases again. So instead of discussing why streaming databases, we should first discuss: why stream processing engines? The answer is quite simple: it's because of MapReduce. The MapReduce paper was published in 2004. At that time, the concepts of stream processing and streaming databases had just emerged, and people were investigating how to do incremental computation over continuous data streams. Then, what is MapReduce, and why is it relevant to stream processing? MapReduce was built to solve the problem of scaling computation on commodity machines. Basically, Google said: instead of purchasing a database system from those three big vendors, let's build something new. Let's scale the computation on our own commodity machines; no need to buy a Db2 mainframe when we can build our own. How? MapReduce said: we can directly expose a low-level API to our users and give up control over the storage. Instead of building a fully-fledged database system, we store the data in a distributed file system and only care about the computation. People write a map function and a reduce function, pair them up on their own, and determine how to shuffle and scatter the data. That's the MapReduce idea. It turns out the idea was quite successful; after the MapReduce paper was published, as everyone knows, we got Hadoop. After Hadoop was introduced, the batch guys said: we can probably build a better Hadoop that does the computation in memory. That's why they developed Spark. The streaming guys said: we can probably leverage a similar idea and apply it to stream processing. That's why we have Storm, Samza, and Flink. If you look at these two types of systems, you will find that they all have very similar abstractions: they ask you to express the logic yourself in Java, in Scala, or in some other language.

Here are some classic cases. All the big tech companies have deployed big data systems. Twitter originally used its own stream processing system called Heron. LinkedIn had Samza. Uber adopted Flink. So stream processing systems were quite successful. Then comes the question: why do we need a streaming database? I think the answer is also quite simple: it's all because of cost. What kind of cost? Is it money, resource cost, or something else? There are all different kinds of costs. If we review the MapReduce architecture, we find that while it really allows you to scale your computation on commodity machines, it brings several tradeoffs: the learning curve, performance efficiency, development difficulty, and data stack complexity.

Limitations of Stream Processing Engines

Let's talk about these tradeoffs one by one, starting with the learning curve. If you use MySQL or Postgres, what do you write? You write SQL. You don't need to learn how to write Java or Scala. With all these big-data-style systems, you have to learn their Java API, or their Scala API. More than that, you essentially have to learn what fault-tolerance model each system has, how it scales up or down, and how it does checkpointing. You have to learn all these details. The reason is that these systems expose all these choices to the users and ask the users to decide how to manage the system. If you are an expert, that's definitely great for you, because you have more control over the system. If you're not familiar with stream processing systems, then it can feel quite painful that you have to learn all of this just to use the system. That's why I actually put a question mark on the title: it's a limitation of stream processing engines, but whether it's a limitation or a feature depends on how you think about the system.

The second tradeoff is performance efficiency. Stream processing engines adopt a rather brute-force approach to optimizing performance. What do I mean? Think about how a database system optimizes performance. A database system has an optimizer. All these databases, whether MySQL or Postgres, have optimizers. The optimizer looks at the data, figures out how to optimize the query, and then generates the execution plan. That's how a database system works. In a stream processing engine, things are different. Although many stream processing engines nowadays provide a SQL wrapper, so you can also write SQL, none of them have control over the data. These systems do not know what the data looks like and cannot extract any insights from it: they cannot build histograms or collect statistics over the data. That means the optimizer can never generate the optimal query plan. If you have experience working on optimizers, you will know that query optimizers typically need to do things like cardinality estimation, where they consume histograms to plan the query. If we have no control over the data, and know nothing about it, then obviously we cannot generate the optimal plan. A similar issue is that, because stream processing engines do not have control over the data, they do not know how to allocate resources. Say we have three streaming jobs, job1, job2, and job3, and 6 CPUs: how do we allocate resources? We don't know, because the system does not know how complex each query is or how to set the parallelism. All these systems require the users to provide that input: the users are responsible for setting the number of CPUs, setting the parallelism, and telling the system how to shuffle the data. That's potentially a limitation of stream processing engines. The good thing is that if you are an expert in stream processing engines, you can leverage these details to craft a better streaming program and get better results.
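
As a small illustration of what "having control over the data" buys the optimizer, here is a Postgres-style sketch (the tables are hypothetical); with a stream processing engine, the equivalent decisions become manual knobs you set yourself.

-- Collect statistics (including histograms) over the stored data.
ANALYZE orders;
ANALYZE customers;

-- The optimizer uses those statistics for cardinality estimation when it
-- picks a join order and an execution plan for this query.
EXPLAIN
SELECT c.region, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;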

The third tradeoff is development difficulty. Back to this chart: if we have three streaming jobs in a stream processing engine, the challenge is that every single streaming job is independent; they cannot talk to each other. That means you basically cannot build one job on top of another job. But whenever we write a streaming pipeline or a streaming analytics application, the logic can be quite complex. With a stream processing engine, we probably have to express that logic in one big chunk of SQL, maybe 200 or 500 lines of code. With such a big code chunk, it's really difficult for us to verify correctness or to do any debugging.

The fourth tradeoff is data stack complexity. As we just mentioned, a stream processing engine does not provide storage capability. What does that mean? It means you probably need to bring your own storage. Think about it: the stream processing engine does the computation, but at the end of the day you have to have some system to store the result, otherwise you will never know what the result looks like. Let's say we want to build a monitoring, alerting, or automation application, and we use a stream processing engine to consume data from the upstream systems. Then we want to visualize the results in a BI dashboard. How do we do that? We cannot just connect our BI dashboard to Flink, Samza, or Storm; we have to use some database system. That's why people connect their Flink, Samza, or Storm to Postgres, Cassandra, or Redis, so that Redis, Postgres, or Cassandra can talk to the BI dashboard or the client libraries. That's what the data stack looks like if you use a stream processing engine.

A streaming database tries to tackle these challenges. It tries to get the best of both worlds: the stream processing engine world and the database world. In terms of the learning curve, a streaming database is just like Postgres or MySQL: you do not need to learn how to write a Java or Scala program, you only need to learn SQL. And like modern database systems such as Postgres and MySQL, streaming databases provide UDFs: you can express complex logic in a Java UDF or a Python UDF. That's how streaming databases solve the learning curve problem. The second challenge is performance efficiency. A stream processing engine adopts a brute-force approach to optimizing performance; a streaming database has finer-grained control. How? Because it's a database system: it knows the data, and it has its own optimizer to determine how to optimize the query. The third aspect is development difficulty. As we just mentioned, in a stream processing engine it's quite difficult to verify the correctness of a program, because every program is independent. In a database system, we have the concepts of tables, views, and materialized views, and we can build a materialized view on top of another materialized view. You can decompose complicated code into several smaller components, which makes correctness verification and system debugging much easier; see the sketch below. The fourth aspect is data stack complexity. With a stream processing engine, you have to bring your own storage. A streaming database has its own storage, so you don't need to care about how or where to store the data. Instead of using a separate stream processing engine for the computation and a database system for storage and query serving, we can use one single system for both.
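
Here is a minimal sketch of that decomposition (hypothetical schema, Postgres-style streaming SQL): each layer is a separate materialized view that can be queried and debugged on its own, instead of one monolithic streaming job.

-- Layer 1: incrementally maintained per-customer totals.
CREATE MATERIALIZED VIEW order_totals AS
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id;

-- Layer 2: defined directly on top of layer 1.
CREATE MATERIALIZED VIEW big_spenders AS
SELECT customer_id, total
FROM order_totals
WHERE total > 1000;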

Stream Processing Engine + Database

People will wonder: why not just glue a stream processing engine together with a database system? Can we put these two systems together and call it a streaming database? That's a very good question. Conceptually, yes, you can do that. You can probably build a wrapper on top of Flink and on top of your Postgres, wrap them together, and call it a streaming database. The challenge is that, technically speaking, it's very hard. Why? First, system maintenance. If you have experience maintaining a database system, you know how challenging it can be: you have to figure out how to configure your Postgres instance. For Flink, or Samza, or Spark Streaming, you need to do the same thing, and also think about how to manage ZooKeeper. Maintaining one single system is already challenging; maintaining two different systems, integrating them, and making sure they work correctly together is much more challenging. Sure, if you don't really care about the health of your employees, then that's fine, but in most cases it's very hard to maintain two different systems.

The second thing is consistency across the different systems. In stream processing, we know systems promise to guarantee so-called exactly-once semantics, which means every single event is processed once and only once. A database system always tries to display a consistent result. If we maintain two systems, the challenge becomes how to guarantee consistency across them. Why is that difficult? Think about what happens if your stream processing engine crashes. Should we pause the database system, or should we let the database system keep ingesting data and displaying results? Guaranteeing a consistent result across two different systems is a very challenging problem. The third challenge is efficiency. Think about using a stream processing engine and a database system to mimic the key concept of a streaming database, the materialized view. How do we do that? With one single materialized view, it could be easy: we create a Flink job to do the stream processing and dump the result into Postgres, so that people can use their BI tools to connect to Postgres. That's simple. But the beauty of a database system is that data has connections, data has relations. People can build a materialized view on top of another materialized view: we can build MV2, materialized view 2, on top of materialized view 1. How do we do that with a stream processing engine and a database system? Unfortunately, in this case, we need to deliver the result from the upstream materialized view to the downstream one. We probably need Kafka to buffer the data, and then another Flink job to consume the data from Kafka and write the result into Postgres. You can imagine: that's just two materialized views. What if we maintain 10 different materialized views, or 100? The complexity becomes very hard to tackle.

Developing Streaming Databases

Let's talk about the idea behind a streaming database. The idea is that stream processing can essentially be expressed as materialized views, and a materialized view is just a table. In stream processing, the key challenge is managing the internal state, and essentially we can treat the state as storage; that's the key idea. Think about using Storm, Samza, or Flink for stream processing: instead of maintaining local state in systems like RocksDB, we can maintain the state in a table. We don't need local storage; we can just use a table. If we use this kind of model, a lot of things change. The first thing is that all the internal state is queryable. This is really important, because people can directly query the state, connect their BI tools, like Grafana or Metabase, to the internal state, and display the results there; the data can be super fresh. Moreover, you can build a materialized view on top of another materialized view, so things become much easier to deal with. The second thing is that all the internal state is persistent. Say you use a stream processing engine with RocksDB to maintain the state: the challenge is figuring out how to recover the RocksDB state if the system crashes, and most stream processing systems need to do recomputation. A database system never, or very seldom, does recomputation. The database model is that we always persist the data to persistent storage: in the cloud that's probably S3, and on a local machine it's the local disk. Since all the internal state is persistent, if the system crashes you don't need to worry about recomputing the result; you just reload the internal state from persistent storage and recover your program almost instantly, instead of waiting for several minutes or several hours. The third thing is that all the internal state is elastic. With a distributed stream processing engine, you have to determine how to pause the system and reshuffle the state to make the system scalable. In a streaming database, if we manage the state in the database, the database can scale on its own; there are all kinds of techniques for scaling a database system. Scaling a stream processing system can be much easier if you use a streaming database.
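
For example, because the state lives in tables, inspecting it is just a SQL query (reusing the hypothetical order_totals view from the earlier sketch), with no separate state-inspection tooling needed:

-- Point lookup against the live, incrementally maintained state.
SELECT customer_id, total
FROM order_totals
WHERE customer_id = 42;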

How to Compare: OLAP Databases and Streaming Databases?

We have already talked a lot about stream processing and database systems. People may wonder: what's the difference between OLAP databases and streaming databases? I just said that a streaming database can do real-time analytics over streaming data. People will say real-time OLAP databases can also do that. Yes and no. Yes, both kinds of systems can do real-time analytics, but they are optimized for different workloads. OLAP databases like ClickHouse, Druid, and Pinot are better optimized for interactive analytics. For example, if you want to calculate the sales volume of bottled water over the past day, you probably don't want a streaming database; you want an OLAP database, because OLAP database systems are really good at answering interactive queries. But if you want to do something like monitoring or alerting, such as flagging accounts that make over 100 transactions within the last 10 minutes, then OLAP databases are not a very good fit, and you probably want a streaming database, because streaming databases are very good at running predefined queries using an incremental computation model; see the sketch below. People may ask whether we can build a system that achieves the best of both worlds, the OLAP world and the streaming database world. Technically, yes, but there are still challenges, because you need to think about how much money you are willing to pay.
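
As a sketch of such a predefined query (hypothetical schema; the hopping-window syntax follows a RisingWave-style streaming SQL dialect and varies by system):

-- Accounts with more than 100 transactions in any 10-minute window,
-- re-evaluated every minute and maintained incrementally.
CREATE MATERIALIZED VIEW high_velocity_accounts AS
SELECT account_id, window_end, COUNT(*) AS txn_count
FROM HOP(transactions, txn_time, INTERVAL '1 minute', INTERVAL '10 minutes')
GROUP BY account_id, window_end
HAVING COUNT(*) > 100;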

The Tradeoff Triangle: Freshness, Cost, and Query Latency

There's a paper called Napa that was published several years ago by Google. It proposed a so-called tradeoff triangle: no database system can achieve all three of freshness, cost, and query latency; any database system can only achieve two of them, not three. What does freshness mean? Result freshness: if an event comes in, can I see the result computed from that event within, say, less than one second? Query latency means how fast the system can answer queries. Cost means resource cost. Basically, it's treated as an immutable physical law that no system can be the best in all three dimensions: you can pick two, but not three. Streaming databases are optimized for freshness, and potentially cost, but they are not optimized for query latency. This is because these databases adopt an event-driven architecture, are optimized for predefined queries, and use an incremental computation model to do the computation. OLAP databases like Druid, ClickHouse, and Pinot are highly optimized for query latency. If you want an interactive dashboard, these systems are very good for you: they are user driven, they can answer ad hoc queries very fast, and they use a full computation model, doing full table scans to generate the result. If you don't care that much about data freshness or query latency, then you can just use a data warehouse like Snowflake or Redshift; but Snowflake is quite expensive, so it's probably not optimized for your cost. That's the tradeoff among these three types of systems.

Because of this tradeoff, the designs of streaming databases and OLAP databases are also quite different. Streaming databases, by default, use a row store. If you are familiar with the internal design of database systems, a row store is very good for append-only ingestion and is optimized for point access. In a streaming database, the materialized views typically maintain the results, and you probably don't need to do much further analytics over those results, so streaming databases are optimized for point access. On the other hand, OLAP databases typically adopt a column store, because these systems are optimized for long-range scans: to answer user queries fast, you pretty much need a column store. In terms of computation, in addition to the column store, OLAP databases also use a vectorized execution engine, which can leverage SIMD instructions to further speed up queries. Streaming databases are optimized for predefined queries, so they don't strictly need a column store or a vectorized execution engine, although I know some streaming databases do implement vectorized execution. Streaming databases are highly optimized for maintaining materialized views using the incremental computation model instead of the full computation model.

Other Technical Challenges

There are several other challenges worth mentioning if we want to build a system that achieves the best of both the stream processing world and the OLAP world. One is resource isolation. Say we want to build a system that serves both predefined queries and ad hoc queries, and we have just five machines: how much resource should we provision for the predefined queries, and how much for the ad hoc queries? It's really hard to determine; you probably need to decide based on your workload. People might say: we can use one set of isolated machines for the predefined queries and another isolated set for the ad hoc queries. Then another problem comes: these two clusters are isolated, so how do we guarantee data consistency across them? To guarantee consistency, we typically need to think about commit protocols like two-phase commit, which can incur very high latency. We also see that OLAP databases and streaming databases have different fault tolerance models. In an OLAP database, if a node crashes, what do we do? We just reboot the machine and redo the computation. In a stream processing system, or in a streaming database, if one of the nodes crashes, we cannot just ask users to recompute from scratch, because stream processing is continuous data processing: recomputing from the very beginning could take hours or days, and the system would be unavailable for a very long time, which is not tolerable. Essentially, OLAP databases and streaming databases have different fault tolerance models, which makes their designs quite different.

Use Cases

Streaming databases and OLAP databases are actually pretty good friends. There are three typical patterns we have seen. The first pattern is that people use the streaming database to do streaming ETL, and the result is delivered to and consumed by the OLAP database; people then connect their BI tools to the OLAP database. We do this because many OLAP databases are not very good at joins: you can use streaming joins to generate flat tables, and let the OLAP database do full table scans on those flat tables. A sketch of this pattern follows below. The second pattern is that people use the streaming database for the predefined queries and the OLAP database for the ad hoc queries. Typically, people connect their BI tools to both the streaming database and the OLAP database for different purposes: for monitoring, they connect the BI tools to the streaming database to always see fresh results; for interactive analytics, they go to the OLAP database. The third pattern differs from pattern 2 only in that all the data is dumped into both the streaming database and the OLAP database. I think it is quite similar to the Lambda architecture: people want to do both predefined analytics and ad hoc analytics over all of the data. That's why people do this.
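
A minimal sketch of pattern 1 (hypothetical schema, Postgres-style streaming SQL): the streaming database maintains a joined, flattened table, which would then be delivered to the OLAP database through a sink connector whose exact syntax varies by system.

-- Streaming ETL: join two streams into a flat table, maintained incrementally.
CREATE MATERIALIZED VIEW flat_orders AS
SELECT o.order_id, o.amount, o.order_time, c.region
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- Downstream, an OLAP database (ClickHouse, Druid, Pinot, ...) scans
-- flat_orders for interactive queries; delivery goes through a sink connector.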

Summary

Streaming databases are definitely quite different from stream processing engines. Streaming databases are better optimized for intuitive query processing, so you can use them easily without dealing with all kinds of low-level decisions and details; if you use a stream processing engine, you probably need to figure out how to configure everything yourself. A streaming database is not just a stream processing engine plus a database system. Streaming databases are closely related to OLAP databases, but they are also quite different, and people actually use these systems side by side. Choose your stream processing system wisely: it all depends on what kind of applications you want to support and how confident you are in operating a complex solution versus an intuitive one.


Recorded at:

May 17, 2024
