
From Batch to Streams: Building Value from Data In-Motion


Summary

Ricardo Ferreira discusses the risks of designing silo-based systems and how streaming data architectures can become a solution that helps companies stay competitive.

Bio

Ricardo Ferreira is Principal Developer Advocate at Elastic—the company behind the Elastic Stack (Elasticsearch, Kibana, Beats, and Logstash), where he does community advocacy for North America. With more than 20 years of experience, he may have learned a thing or two about distributed systems, databases, streaming data, and big data.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Ferreira: If you're interested in how to transform batch-oriented systems into streaming systems, so you can start harnessing the power of whatever is happening right now, and by doing so start discovering new business opportunities that your company can leverage, and you also happen to be working in the financial services industry, you came to the right place. We are going to discuss exactly this: the journey that you and your teams can undergo to start harnessing the power of streaming-based systems. If you happen to be working with batch-oriented systems, what is the journey that you and your team need to undergo before you actually start collecting the benefits of this architectural style?

The World Is Not the Same As It Was

The most common reason why people these days are looking to implement streaming-based systems comes from the realization that the world has changed. If you think about it, most companies these days are becoming software companies, which means that all their internal and external business processes, policies, and systems need to communicate with each other in order to provide a superior customer experience that can differentiate their business from other companies. That requires them to rethink how they treat data. Traditionally, most companies have been using data in two main phases. The first phase is the capture and storage of data, where they collect huge amounts of data and then just stick it into some datastore. The second phase is where they start actually analyzing that data to see if there is something relevant to be done with it, and therefore try to create a superior business opportunity or identify a business threat. The problem with this approach is that by the time the data is analyzed, it might have lost its value, because it's past tense. It's already happened. The only way for businesses to actually leverage the power of what's happening right now is to transform their processes and systems in a more reactive way, meaning that they need to bring the processing of the data as close as possible to when the data is actually captured and stored. All three phases, capture, storage, and processing, almost have to occur concurrently. Then they can start treating the data as something that is going to fuel a better way to provide superior experiences or different business opportunities.

There is a very interesting report created by Forrester called perishable insights. This report talks exactly about this: the analysis of the value of data and how that value perishes over time. Traditionally, we have been designing systems and processes that make use of that data over a period of days, months, and perhaps even years. Whereas systems that are more time sensitive, those systems, processes, or situations that require the company to react as soon as the data happens, need to be designed in such a way that they respond in a matter of seconds, minutes, or sometimes even fractions of seconds. This is the rise of streaming-based systems.

Background

My name is Ricardo Ferreira. I'm a developer advocate with Elastic, where I'm part of the developer relations team, which is also known as the community team. Before working for Elastic, I used to work for other software vendors such as Confluent, Oracle, and Red Hat. I also spent a significant part of my career working for consulting firms, designing and implementing systems related to distributed systems in general, big data and streaming data technologies, and databases.

The World Today, As-Is

The best way for us to understand the benefits of streaming-based systems is to analyze how companies build systems today. In most of them, you're going to see the concept of silos that is rooted in how we have designed systems for the last decades. In a silo-based architecture, each system is treated as its own project. Most importantly, each system possesses its own datastore, regardless of what that datastore is. The problem is that the datastores won't talk to each other, and data won't be made available right away. That means that an inherent batch process has to take place in order for data from one system to be replicated to another. That is directly what causes the problem of perishable data, because we are now creating a delay between when the data is produced and when it becomes available for analysis, or for being transformed into something useful.

Also, with today's techniques, transferring data from one system to another, or replicating this data, is very hard. Most people spend precious time creating jobs, scripts, and programs, and using ETL tools or any other replication tool available in the market, to do exactly this: read data from one place and copy it to another. On top of that, they have to keep managing these systems, which creates a more fragile architecture that doesn't scale very well. This is another problem that comes from the silo-based architecture.

By contrast, if we want to leverage the power of what's happening right now, we have to shift to another way of handling the data. Instead of analyzing it later, we have to start analyzing data as it happens, or as it comes through, by using a mechanism that we're going to discuss later. The secret of adopting a streaming data platform is effectively making the data and the analysis happen concurrently. A better way to understand this is: in the past, in the traditional way of designing systems, we would store the data, and the queries would be applied to that data to perform the analysis. Now, in the streaming data world, we actually persist the queries, the analysis. The data is what keeps flowing in and coming in all the time. The data is applied to those persisted queries, which leverage the power of whatever is passing through the data structure that contains these streams of data.
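To make the idea of a persisted query concrete, here is a minimal sketch, assuming Kafka Streams as one possible stream processing framework and hypothetical topic names (`payments`, `wire-transfers`). The query is registered once and keeps running; the data flows through it continuously.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ContinuousQuery {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "continuous-query-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // The "query" is registered once: route wire transfers to their own topic.
        KStream<String, String> payments = builder.stream("payments");
        payments
            .filter((key, value) -> value.contains("\"type\":\"WIRE_TRANSFER\""))
            .to("wire-transfers");

        // The data keeps flowing through this standing query for as long as the app runs.
        new KafkaStreams(builder.build(), props).start();
    }
}
```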

Streaming Data = Better Experience

A good example of how a streaming data architecture can be turned into a business opportunity is the possibility of correlating different datasets, or different streams of data. This creates situations where a company can benefit from discovering a business opportunity, or perhaps a threat, that it wouldn't be able to analyze and evaluate as soon as possible if it kept treating its systems as silos, as it has been doing so far. Streaming data architectures are very good at providing tools for companies to start reacting to opportunities as they present themselves.

Streaming Data Journey

I've mentioned before that there is a data structure that you need to leverage so you can implement streaming data architectures. This data structure is called a commit log. A commit log is something that looks like a queue, but unlike a queue, which is transient in nature, a commit log is supposed to be persistent, meaning that you can store data in the commit log and analyze it multiple times. Also, you can go back in time in the commit log and read and process data that was stored seconds, minutes, hours, days, or months ago. A commit log is not necessarily something new. It has existed for a very long time. Developers of database technologies chose not to expose the commit log directly to users. Instead, they decided to encapsulate it into something that is easier to use and provides friendlier programming APIs.
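As a purely illustrative sketch (a real stream store such as Kafka or Pulsar persists and replicates this on disk, across many machines), here is what a commit log does in a few lines of Java; the record contents are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Toy commit log: an append-only sequence of records addressed by offset.
// Unlike a queue, reading does not remove anything, so any consumer can
// replay the log from any offset, at any time.
public class CommitLog<T> {
    private final List<T> records = new ArrayList<>();

    // Append a record and return the offset it was written at.
    public synchronized long append(T record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read everything stored from a given offset onward (replay).
    public synchronized List<T> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }

    public static void main(String[] args) {
        CommitLog<String> log = new CommitLog<>();
        log.append("account-123 debited 50.00");
        long checkpoint = log.append("account-123 credited 20.00");
        log.append("account-123 debited 5.00");

        // Two different readers, two different positions, same data.
        System.out.println(log.readFrom(0));          // full history
        System.out.println(log.readFrom(checkpoint)); // only the newer events
    }
}
```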

Take, for example, a relational database. A relational database table is essentially a subset of what is contained in the commit log. The commit log is the repository where all the mutations and events that happen to a specific dataset are stored. What relational databases do is periodically grab chunks of that set of mutations and materialize them into something called a table. Even if you are dealing with a different database technology, the same principle applies. Take Neo4j, which is a graph database, for example: chunks of the mutations are captured and applied to a set of data structures optimized for graph notation. Or Elasticsearch, for example: a set of algorithms and data structures is applied to those mutations to make the data available for searching purposes. Any database chooses to encapsulate the commit log somehow.
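As a toy illustration of this materialization idea (not how any particular database actually implements it), the current state of a "table" can always be rebuilt by folding the mutations stored in the log; the account data here is hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy materialization: a "table" of account balances derived from a log of
// mutations. Replaying the same log always rebuilds the same table.
public class MaterializedTable {
    record Mutation(String accountId, double delta) {}

    static Map<String, Double> materialize(List<Mutation> log) {
        Map<String, Double> table = new LinkedHashMap<>();
        for (Mutation m : log) {
            table.merge(m.accountId(), m.delta(), Double::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        List<Mutation> log = List.of(
            new Mutation("acct-123", 100.0),
            new Mutation("acct-123", -30.0),
            new Mutation("acct-456", 10.0));
        System.out.println(materialize(log)); // {acct-123=70.0, acct-456=10.0}
    }
}
```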

The reality is that whether it's tables or indexes, or whatever objects developers use to query or store data, they are actually manipulating something hidden underneath the database called the commit log. The commit log is where the data is actually stored. The commit log is the perfect data structure for you to start writing your streams of events. Most importantly, it is how you are going to share those streams of events from one application to another. That requires a fundamental change in how you are going to search those events and store that data in the stream store. The commit log is essentially a stream store, so from now on we're going to refer to the commit log as a stream store.

Lessons Learned - Transform Transactional Data into Streams

Now that you understand what a stream store is, let's discuss some of the lessons that you can take home to better implement your streaming-based systems. The first one is definitely to transform transactional data into streams of events as much as possible. In the organization and the company you work for, you probably already have a lot of systems producing transactional data. That data has enormous value in the context of streaming-based systems, because it is the data that you can later use to correlate with other streams of events and identify situations that can be useful for your organization, such as a business opportunity that you can leverage and take advantage of.

There are technologies in the market that can very easily access your transactional systems, capture the data as-is, and bring it in as a stream of events, such as the Debezium project from Red Hat. This is a technology that uses a technique called CDC, or Change Data Capture, and allows database systems to be read not at the table level, but at the log level. Remember, any database has a commit log underneath. The Debezium project goes to the log level of the database, reads all the mutations from the log, and brings this stream of events to a stream store, which is Apache Kafka or Apache Pulsar. There are some other implementations that I know of that are being included in Debezium, but primarily it's Kafka and Pulsar. You can, from that point on, replicate to other systems and start your implementation of streaming data architectures. Definitely use projects such as Debezium to transform transactional data into streams of events very effortlessly, without necessarily impacting your source databases.
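A rough sketch of what registering such a connector can look like, assuming a Kafka Connect worker with Debezium's Postgres connector reachable on localhost:8083; the database host, credentials, and table names are hypothetical, and exact property names vary between Debezium versions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Registers a Debezium Postgres connector with a Kafka Connect worker through
// its REST API, so changes to the transactions table flow into Kafka topics.
public class RegisterCdcConnector {
    public static void main(String[] args) throws Exception {
        String connector = """
            {
              "name": "payments-db-connector",
              "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "database.hostname": "payments-db.internal",
                "database.port": "5432",
                "database.user": "cdc_user",
                "database.password": "changeme",
                "database.dbname": "payments",
                "topic.prefix": "payments",
                "table.include.list": "public.transactions"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connector))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```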

Leverage Different Constructs

The second lesson is to use the different programming constructs that are available in most stream store technologies. This is important because, most of the time, what we see out there is developers who are afraid to go deeper into the constructs that are available to them. They restrict themselves to scenarios they are familiar with, for example aggregations, which is pretty common if you're coming from the database world. The problem with this approach is that you don't leverage the real power of streaming data architectures through constructs such as window analysis. In window analysis, you can create temporal windows, for example the last two hours or the last five minutes. Then you can filter only the events, no matter what source they're coming from, that fall within that window. You can identify situations, for example, that have to do with time. You can also mix this with aggregations. Aggregations are just like what you would do in any database query: you can aggregate and summarize data. You can create metrics that are used downstream, perhaps to feed some business process, or just for monitoring and observability purposes.
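A minimal sketch of a temporal window combined with an aggregation, assuming Kafka Streams 3.x, hypothetical topic names, and card numbers as the record keys:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedAggregation {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-aggregation-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> transactions = builder.stream("card-transactions"); // key: card number

        transactions
            .groupByKey()                                                      // group by card number
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))  // temporal window: last 5 minutes
            .count()                                                           // aggregation: attempts per card per window
            .toStream()
            .filter((windowedCard, count) -> count > 10)                       // unusually many attempts
            .map((windowedCard, count) -> KeyValue.pair(windowedCard.key(), count.toString()))
            .to("possible-card-fraud");

        new KafkaStreams(builder.build(), props).start();
    }
}
```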

The other programming construct that is also available, and probably the most important one, is the correlation engine, which analyzes the causality or the relations between events. If they are somehow correlated, you can identify patterns and causalities, such as the occurrence of one event followed by another, that represent a situation you are trying to capture and leverage. Mixing and matching all these programming constructs is going to be useful for you to leverage the power of streaming-based architectures.
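As a sketch of this kind of correlation, here is a stream-stream join over a time window, again assuming Kafka Streams with the same default string serdes as the previous sketch and hypothetical topics. A join window is symmetric, so a real detector would also compare event timestamps to enforce the "followed by" ordering.

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class LoginPasswordCorrelation {
    // Adds the correlation to a topology; reuses the Streams configuration
    // (default String serdes) from the windowed-aggregation sketch above.
    static void addCorrelation(StreamsBuilder builder) {
        KStream<String, String> logins = builder.stream("logins");                    // key: account id
        KStream<String, String> passwordChanges = builder.stream("password-changes"); // key: account id

        logins.join(
                passwordChanges,
                (login, change) -> "login " + login + " followed by " + change,       // combine both events
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(10)))      // correlate within 10 minutes
            .to("account-takeover-suspects");
    }
}
```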

Geo Replication and Multi-Tenancy

The same goes for the choice of technology, and this has more to do with the choice of stream store that you're going to use. Try to use a stream store that supports geo-replication and multi-tenancy. This is important because, in the early implementations that you do in your company, you are essentially proving the architecture, and other projects and people within your company are going to become interested in leveraging your stream store implementation, which probably has not been sized to accommodate multiple projects. The ability to break down different projects into tenants is a must, so try to focus on a technology that can do this as much as possible. The same goes for geo-replication. Very often we see situations where you have clusters spread across different locations, and because streaming-based technologies allow you to correlate data, there will be an inherent need to link those clusters so you can start correlating data coming from different locations. This is a very natural need, but the stream store technology has to support it.

Build Versus Buy Analysis

Lastly, there is always the discussion of build versus buy. Build versus buy is basically the decision you have to make about whether you are going to build and maintain your own stream store in-house, or outsource it and buy from a vendor. There are three main criteria that you need to consider. The first one is that you have to ask yourself: do I have a team that is capable and confident enough to maintain a stream store infrastructure in-house? This has to do with a lot of things. It has to do with building a competent team, a team that actually has the time to maintain this, and, more importantly, a team that is dedicated to leveraging and continuously updating the infrastructure. If you don't have this, and you are in the financial services industry, you're probably not going to be highly interested in keeping teams like that, because ultimately it is not your goal to produce technology. Your goal is to produce customer experiences. You want to leverage your expertise in the financial services industry rather than just produce technology. This is a question that you have to ask yourself as early as possible.

The same goes for data gravity. Some vendors that produce stream store technologies are cloud vendors, which is good in some ways and bad in others. It might be bad, for example, if the data that comes in from your sources, or the data that will be produced by the streaming datastore and sent somewhere else, does not necessarily live in the cloud. Data gravity is the decision around how much effort it is going to take to keep sending data into and out of your stream store. Evaluate these conditions before you make a decision about your stream store.

Lastly, you have to decide whether the vendor has a complete portfolio. A stream store is the basic data structure that you are going to need for your streaming-based architecture implementation, but you're also going to need some other tools, such as a stream processing framework, which is what developers are going to use to express their stream processing logic. All of this has to be taken into consideration when you are deciding which vendor provides a complete solution. Otherwise, you're going to end up with a very siloed set of products that will create more friction for you every time you need to implement or change anything in your implementation.

What Traps to Avoid - Using Streams Only to Aggregate Data

Let's discuss some of the traps that you definitely should avoid. The first one has to do with the programming constructs we've discussed. Usually, developers tend to reduce the usage of streaming-based architectures to aggregating data only, because somehow they are just used to this from their backgrounds with SQL databases, and then the streaming-based architecture is essentially used to aggregate data and ship it somewhere else. As you may know at this point, by doing so you are not leveraging the full power of the technology.

Outsource Data Streams Creation

The same goes for the continued use of CDC as the way to transform transactional data into streams of events. This is extremely powerful and important for existing systems. For new systems, however, you should avoid having to rely on CDC to produce streams of events. As much as possible, new applications should write streams of events into the stream store directly. Avoid the mindset of, "I'm used to writing data into my database, and then I'm going to outsource the generation of stream events to the CDC connector." This is something that you have to take care of.
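A minimal sketch of what writing an event directly to the stream store can look like, assuming Kafka as the stream store; the topic name and event payload are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// A new application writes the domain event straight to the stream store
// instead of writing only a database row and relying on a CDC connector
// to turn that row into an event later.
public class PaymentEventWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"type\":\"PaymentAccepted\",\"accountId\":\"123\",\"amount\":42.50}";
            // The event, not the table row, is what downstream consumers see.
            producer.send(new ProducerRecord<>("payment-events", "123", event));
            producer.flush();
        }
    }
}
```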

Stream Stores per Project

The same goes for replicating the mistake of creating silos in your organization. Now, instead of creating silos of datastores, you're creating silos of stream stores. A very common tendency of new teams is to fall back on the old habit of each project creating its own stream store. By doing so, you are not leveraging the full power of having a common communication backbone that replicates streams of events across multiple systems. The idea of a stream store is to work as the central nervous system that all the peripheral applications read from and write to.

Scaling Storage with Compute

Another point that you should pay attention to, which has to do with the choice of technology, is to pick a technology whose storage, where the stream events are going to be written, can grow economically. There are multiple implementations of stream stores out there. Try to pick one that does not necessarily require you to bring in compute in order to increase storage. That model is similar to the Hadoop and HDFS clusters that became pretty famous in the big data world in the past, and nowadays it does not scale the way we want. It's not very economical to rely on a technology that behaves like this, because you are going to need to store large amounts of data. You will want, for example, to leverage a technology that has hybrid storage, which can persist old data into something like an object store that is durable and cheap, while new data is kept on hot nodes, where you can leverage block storage or SSD disks to make the data available as fast as possible. It is important to pick a technology that can have this hybrid storage model.

Unidirectional Flow of Data

Also, it is important to make sure that you are not going to have one-way processing all the time. As your implementation matures, every time you need to change something that has been generated by the streaming system, you will feel the need to go back to where the data was collected, ingested, or processed, and make some adjustments so that the new data is generated properly. Most of the time, we develop architectures that only think forward, never backward. Try to design your implementations in such a way that data can be continuously written back to the stream store, so that it can be reused and refined for later purposes.
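Because the stream store keeps history, going "backward" can be as simple as rewinding a consumer and reprocessing. Here is a sketch of replaying the last 24 hours from a Kafka topic after a fix, with a hypothetical topic name and group id:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

// Rewinds a consumer to the offset that corresponds to 24 hours ago and
// reprocesses the replayed events with corrected logic.
public class ReplayFromYesterday {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payment-enricher-replay");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("payment-events", 0);
            consumer.assign(List.of(partition));

            // Find the offset written around 24 hours ago and seek to it.
            long yesterday = Instant.now().minus(Duration.ofHours(24)).toEpochMilli();
            Map<TopicPartition, Long> query = new HashMap<>();
            query.put(partition, yesterday);
            OffsetAndTimestamp offset = consumer.offsetsForTimes(query).get(partition);
            consumer.seek(partition, offset != null ? offset.offset() : 0L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println("reprocessing " + record.value());
            }
        }
    }
}
```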

Implementation Technologies

Regarding implementation technologies, there are some very interesting options available for you to use. Apache Kafka and Apache Pulsar, as well as the Elastic Stack, are probably going to be the main ones in the on-prem, open source world. Among the cloud providers, you have Kinesis from AWS, Event Hubs from Microsoft Azure, and Pub/Sub from Google Cloud. Those are the technology implementations whose tradeoffs we can start discussing.

Questions and Answers

Porcelli: I won't ask you whether you should go for Kafka or Pulsar, because the internet is full of opinions. I'd like you to help us understand where the Elastic Stack fits in this model.

Ferreira: You're right. I think everybody that thinks about the Elastic Stack, or Elasticsearch specifically, doesn't think of using it for streaming data architectures, and they're right. Elasticsearch is a datastore for implementing full-text search. What I mean by that is that I've seen some use cases where I can see clearly that people are trying to implement an analytics platform, and they're using the Elastic Stack. Ultimately, they want to see all the data flowing in and ending up there for analytics purposes. I've seen someone asking for Apache Flink, for example: "I have to use Apache Flink to perform some aggregation here before actually writing the data into Elasticsearch." You don't actually need Apache Flink for that, because you can run this aggregation on the Elastic Stack itself. That would help you remove some operational complexity from your infrastructure. It is not that one way is right or wrong. It's more like, how much extra trouble do you want to have in your infrastructure? That's the part where the Elastic Stack fits very well in the use case.

Porcelli: People are asking for clarification on one of your answers. You replied with the classic, it depends. Can you expand now?

Ferreira: The question was about using CDC. He said that CDC solutions should be avoided, but they can be very useful in a microservices architecture, where doing distributed transactions is hard. Let's discuss what CDC is by definition. You are using some implementation technology to transform raw, stored data into a stream of events, so it can become fluid again. Then you are going to see situations where, for example, you would like to propagate those events concurrently and with atomicity, so that they can be used by downstream systems for analysis or aggregation purposes, at the same time they are being written into the local datastore. That way you can achieve some consistency. We don't want things like: I've committed this financial transaction here in my datastore, and then I forgot to replicate it, or the replication of the events for downstream systems didn't go through. There are techniques like the outbox pattern that you can use to ensure that consistency, but that creates operational burden. It is all about that. Again, it is another situation where it's not wrong to do this; it's more that there will be some extra steps and some extra coding on your part to implement it. That's why I said it depends.
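As a rough sketch of the outbox pattern mentioned here, under the assumption of a Postgres database with hypothetical `payments` and `outbox` tables: the business row and the event row are written in the same local transaction, and a CDC connector such as Debezium later turns the outbox rows into a stream of events.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Outbox pattern: the payment and its corresponding event are committed
// together, so the event stream cannot drift from the transactional data.
public class OutboxWriter {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://payments-db.internal:5432/payments", "app", "changeme")) {
            conn.setAutoCommit(false);
            try (PreparedStatement payment = conn.prepareStatement(
                     "INSERT INTO payments (id, account_id, amount) VALUES (?, ?, ?)");
                 PreparedStatement outbox = conn.prepareStatement(
                     "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?::jsonb)")) {

                payment.setString(1, "pay-001");
                payment.setString(2, "acct-123");
                payment.setBigDecimal(3, new java.math.BigDecimal("42.50"));
                payment.executeUpdate();

                outbox.setString(1, "acct-123");
                outbox.setString(2, "PaymentAccepted");
                outbox.setString(3, "{\"paymentId\":\"pay-001\",\"amount\":42.50}");
                outbox.executeUpdate();

                conn.commit(); // both rows, or neither: consistency without a distributed transaction
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```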

Porcelli: When I think about batch, I remember the batch window. That was the classical batch. You approach it differently; it's not exactly what you're talking about here. It's about the dependency between applications in this microservices world. We see all the time these data pipelines, things that go from here to there, and that connects to the perishable data that you mentioned. This opened my mind on how you approach streaming. That is a different perspective on how you handle the operations, and now all the possibilities that you bring, removing this latency in the communication between different microservices, what you call siloed applications, but that we can connect as microservices.

Ferreira: Exactly. Regardless of whether you're implementing monolithic architectures or microservices architectures, you are doing everything right. There is nothing wrong with the approach. The problem is when companies start asking questions: yes, but I'm not able to answer the right questions. Technically speaking, you're doing everything right, yet the question remains, I'm not able to answer certain types of questions. That's where you can mix and match a streaming-based architecture with the transactional architecture that you are used to.

Porcelli: I've been working on rule engines, and they have a very strong connection with CEP, Complex Event Processing. I'd like to understand, is streaming an evolution of event-based architectures? How is it connected to that work?

Ferreira: I like to analyze technologies, or projects, or whatever I'm working with, in terms of what type of problem they solve. If we remember the event-based systems, or I think they used to be called CEP, Complex Event Processing, they were used to solve the same fundamental problems that streaming-based architectures also solve. Yes, we can think of them as similar. They're solving the same problem, but with some differences. I like to use the analogy of Docker versus virtualization: fundamentally, they're solving the same problem, but one of them is more efficient than the other.

Some event-based systems in the past, not all, were very memory bound, if you think about it. To be able to do what they're supposed to do, which is correlating a huge set of events and producing an outcome that is significant to the business and the enterprise, they have to keep all those events in memory. That creates the problem of vertical scaling. That worked perfectly in the past: fifteen years ago, 100 messages per second was the daily job. Now we're talking about hundreds of thousands of messages per second, and terabytes of data is the new normal. That's where streaming data architectures are becoming more popular. They solve the same problem, but they handle data and volume better, in my opinion. I think that's the point that separates them.

Porcelli: One thing that I think is very common in the financial sector is dealing with legacy applications, especially mainframes. What are your points on making this connection between mainframes, those transactional applications, and the streaming world?

Ferreira: I don't think there has to be a separation, per se. You can definitely live with both worlds. Here's the bummer: even if you are as crazy about streaming data architectures as I am, there are applications and architectures that are transactional in nature. That's a truth in life, just like the fact that my wife is 100% right all the time. You have to deal with it. You can mix and match all the time. Just like you would do with CDC on a local or relational database, you can transform the data from a mainframe into event streams. That is possible, and there are a lot of financial institutions actually doing this already. I don't think you have to change the way you build systems just because you want to leverage the analysis part. What I think you should do is this: if the whole purpose of a system is purely analysis, you should start designing it more toward streaming-based architectures, because then you would be able to leverage all the benefits.

 


 

Recorded at:

Nov 19, 2021

