Bridging the gap between BI & SOA
We already know that business intelligence (BI) can bring many benefits to an organization. Through consolidating, aggregating, and analyzing data, BI can provide many insights into what is currently happening, as well as what is going to happen within the organization. BI allows for identifying trends of where an organization is going or should be going. The road to BI usually starts with extract, transform, and load (ETL). ETL is, generally speaking, a process in data warehousing that involves:
- Extracting data from external and internal data sources.
- Transforming the data to fit business needs.
- Loading the transformed data into the data warehouse (or data mart).
Basically, for us to achieve BI Nirvana, all we need is "just" one input: data. BI needs the data that is hidden within the organization's systems.
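The three ETL steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the source rows, the fact_orders table, and the function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical source rows, standing in for an operational system.
SOURCE_ROWS = [
    {"order_id": 1, "amount": "120.50", "region": "emea"},
    {"order_id": 2, "amount": "80.00", "region": "EMEA"},
    {"order_id": 3, "amount": "42.25", "region": "apac"},
]

def extract():
    """Extract: read the raw records from the source."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transform: normalize types and values to fit the warehouse schema."""
    return [(r["order_id"], float(r["amount"]), r["region"].upper()) for r in rows]

def load(conn, rows):
    """Load: write the transformed rows into a (hypothetical) fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER, amount REAL, region TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()

def revenue_by_region(conn):
    """A simple BI-style aggregate over the loaded data."""
    query = "SELECT region, SUM(amount) FROM fact_orders GROUP BY region"
    return dict(conn.execute(query))

warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(warehouse, transform(extract()))
totals = revenue_by_region(warehouse)
```

Note that the extract step reads the source's internal representation directly, which is exactly the kind of coupling that becomes a problem once the data sits behind service contracts.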
In the last few years, we have seen the advance of service-oriented architecture (SOA) to the forefront of IT architectures. As the hype begins to clear and organizations make the transition to SOA, the data that BI requires is suddenly scattered between multiple services and hidden behind contracts.
Looking at the SOA components in Figure 1 (taken from my paper What Is SOA, Anyway?), we can see that apart from the obvious component—the service—SOA has several other components that are related to the interface of the service:
- Contract that the service implements
- Endpoints, where the service can be contacted
- Messages that are moved back and forth between the service and its consumers
- Policies to which the service adheres
- Consumers that interact with the service
This, along with SOA tenets like "share schema, not data" and "services should be autonomous," tells us SOA really cares about its interfaces. This emphasis on communication through rigorously defined interfaces is exactly what brings technical and business advantages of loose coupling, flexibility, and agility to SOA.
Figure 1. SOA components and their relations
There is a real impedance mismatch here, with BI pulling in one direction of intimate understanding of the data and SOA pulling in the other direction of isolating internal data behind interfaces. As Pat Helland explains in Data on the Outside vs. Data on the Inside, the service's internal data should never be exposed outside of the service, yet this is the very data that BI wants. Thus, when we think about this, it is not very surprising that a recent survey by Ventana Research (published by Dan Everett in Dr. Dobb's Journal) shows that only one-third of respondents reported that they believe their internal IT personnel to have the knowledge and skills to implement BI services.
It seems that there are two options: either to go directly at the data and invalidate some of our SOA principles (like "share schema, not data") or to try to make do with the contracts that we have in place and hope that we will have enough data for BI. (A third option, which I will discuss later and is equivalent to the first option, is to create contracts specifically for the BI needs.)
Get That Data or Else...
The first option is to get the data that BI needs by using the same ETL processes that have proven themselves in the past.
SOA presents a little challenge to ETL, as you have to integrate data from many dispersed locations (services). However, converging data from multiple resources is not a new problem for either BI or ETL. Large enterprises already have a lot of data sources: ERP, CRM, all of those departmental data silos, and whatnot. ETL with SOA might even be easier, considering that SOA promises that the enterprise data would be woven into a cohesive fabric and not some point-to-point integration spaghetti. ("Promises" being the operative word here; actually achieving a cohesive fabric of services is not an easy feat. But that is a topic for another article—or book, for that matter.)
As I mentioned earlier, ETL is mature and has a proven record as a basis for building successful BI solutions. However, using ETL basically negates most of the benefits that made us pursue SOA in the first place. One of the main problems in the pre-SOA era (which is still the reality in many organizations) is what is known as integration spaghetti. Consider the situation in Figure 2. Historically, each department built its own systems as new business requirements emerged, and the result was isolated, or stovepipe, systems. Then, as people used the systems, they found that they needed information from other systems, and new point-to-point interfaces were added to solve the integration needs. Figure 2 shows four types of point-to-point integrations: ETL (extract, transform, load), which is a DB-to-DB relationship; online and file-based, both of which are application-to-application relationships; and direct connection to a DB, which is an application-to-database relationship. Note that this is not an exhaustive list; there are additional relation types, such as replication, message-based, and others that are not expressed in Figure 2.
The end result is a spaghetti of systems. Making changes in one system has ripple effects, with results that are unpredictable. The SOA emphasis on general interfaces and autonomy aims to solve these problems.
Figure 2. Typical enterprise-systems integration spaghetti
Adding ETL as a direct pipeline into the services' data just adds a new point-to-point interface—cracking the SOA "interface armor" and introducing a dependency between the BI and the service. It also opens the door for other workarounds. (If we can do that for BI, why not do the same for other applications, services, or systems?)
A variation on doing ETL can be to replicate the SOA data into an external database, and then do ETL on that data. However, it is exactly the same as using ETL on the service's database, as we are still bypassing the contract and we are still coupled to the structure of the internal data.
Okay, so using ETL is probably not the best option. So, let's try to see if the second option of building on the SOA principles by using contracts will fare better at integrating BI and SOA.
Pulling SOA Data (Request/Reply)
The simplest solution for integrating SOA and BI is not to do anything specific for the BI processes. Instead, what if we use the existing contracts—those that were drafted as part of the SOA initiative? To be able to fulfill our BI needs, we would need to poll the services' interfaces on a regular basis, so that we can get trend and historic data.
There are basically two problems with this approach. One is the problem of network bandwidth. Polling each of the services that we need transfers a lot of data on the wire. To solve this problem, we might want to increase the interval in which we poll the services. However, in doing so, we hit the second problem: We run the risk of missing important events that occur during the interval. This is analogous to looking at the sky in the morning and the afternoon, and completely missing a solar eclipse that happened at noon. Thus, unfortunately, this does not look like a very promising direction, although it is probably better than nothing.
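The trade-off between polling frequency and missed events can be shown with a small simulation. This is a sketch under stated assumptions: InventoryService, its get_stock operation, and the poll helper are hypothetical names, and "time" is reduced to a sequence of state changes.

```python
class InventoryService:
    """Hypothetical service whose contract exposes only its current state."""

    def __init__(self):
        self.stock = 10

    def get_stock(self):
        return self.stock

def poll(service, changes, poll_every):
    """Apply each state change in order, but sample the service only once
    every `poll_every` changes; return the values the poller observed."""
    observed = []
    for step, new_value in enumerate(changes, start=1):
        service.stock = new_value          # the service's state changes
        if step % poll_every == 0:         # the consumer polls on its own schedule
            observed.append(service.get_stock())
    return observed

# Stock drops to zero and is replenished before the next sparse poll.
changes = [0, 12, 11, 9]
frequent = poll(InventoryService(), changes, poll_every=1)
sparse = poll(InventoryService(), changes, poll_every=2)
```

Polling after every change sees the out-of-stock state but makes twice as many calls; polling every second change halves the traffic and never sees the zero at all.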
Another option for using SOA contracts is to build a specific contract that would serve the BI needs; that is, the contract will enable retrieving data from the internal structures of the service so that the BI can use it. However, that is pretty much the same as using standard ETL; you still create a point-to-point integration and a precedent of a specific contract for a specific use.
The situation thus far does not look very promising. We find that we are between a rock and a hard place; if we are pulling SOA data, we must either invalidate some of SOA's principles and benefits or forget about a good BI solution.
But, hey, wait: Maybe there is a third way, after all...
Making an SOA Mind Shift: Moving to a Push Model
The third option is based on taking SOA forward, beyond the simple request/reply that we are used to thinking about, and combining SOA with another architectural style that is called event-driven architecture (EDA).
In a nutshell, EDA, like SOA, is an architectural style, but one that is built on the push model. EDA components publish events. In the logical sense, an event is any significant change in the component that publishes the event. The change can be a result of proper conduct, such as an order that has been processed; it can be a fault, such as a database that is down; a threshold that was crossed, such as the millionth customer making a purchase; or anything else that seems important. In the physical sense, events are messages with a header describing the metadata of the event and the body containing the content.
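The physical shape of an event, a header with metadata plus a body with content, might look like the following. This is a minimal sketch; the field names (event_type, source, event_id, occurred_at) and the JSON encoding are assumptions chosen for illustration, not a format defined by any EDA standard.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class EventHeader:
    """Metadata describing the event (all field names are hypothetical)."""
    event_type: str
    source: str
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

@dataclass(frozen=True)
class Event:
    """An event message: a header plus a body carrying the content."""
    header: EventHeader
    body: dict

    def to_json(self):
        # asdict() recurses into the nested header dataclass.
        return json.dumps(asdict(self))

event = Event(
    header=EventHeader(event_type="OrderProcessed", source="OrdersService"),
    body={"order_id": 42, "total": 1250.0},
)
decoded = json.loads(event.to_json())
```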
As soon as they are produced, events ripple through to subscribing components. After processing the event, these components can also produce new events, and so on. For example, in an airline scenario, an event can be a notice that a flight is delayed. This event can trigger another component that is responsible for connecting flights, to try to find alternate flights for the passengers arriving on the delayed flight. (Yeah, right; as if we will see that ever happening.) A unique characteristic of EDA versus other push technologies is its notions of event stream processing (ESP) and complex event processing (CEP). Instead of treating the events as isolated occurrences, we look at them as a chain of related events. Looking at an event chain—and, even more so, at a combination of several event chains (event cloud)—allows retrospective analysis over time, as well as other advanced analysis of event patterns.
EDA can be used independently of SOA; but fusing them together can be very beneficial.
SOA Meets EDA
What if we add publication messages into the contract? By "publication messages," I mean that the service will publish its state either in a periodic manner or per event to anyone who might be listening. I like to call this service-communication pattern "inversion of communications," because it reverses the request/reply communication style that is the common case for SOA. While it might look like we would get a network load similar to that of polling the services, the load is actually much lower: Using inversion of communications, each interested service consumer gets an event only once, at most, while a polling consumer would get the same state multiple times (or miss out on data).
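Inversion of communications can be sketched with a tiny in-process publish/subscribe mechanism. Everything here is hypothetical: EventBus stands in for whatever an ESB or notification broker would provide, and the topic name orders.processed is made up for the example.

```python
from collections import defaultdict

class EventBus:
    """Hypothetical in-process stand-in for an ESB topic or broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Each subscriber receives each event exactly once.
        for handler in self._subscribers[topic]:
            handler(event)

class OrdersService:
    """The service pushes its state changes; consumers never query its internals."""

    def __init__(self, bus):
        self._bus = bus

    def process_order(self, order_id, total):
        # ... internal processing stays private to the service ...
        self._bus.publish("orders.processed", {"order_id": order_id, "total": total})

bus = EventBus()
bi_inbox, warehouse_inbox = [], []
bus.subscribe("orders.processed", bi_inbox.append)         # the BI component
bus.subscribe("orders.processed", warehouse_inbox.append)  # another consumer

orders = OrdersService(bus)
orders.process_order(1, 250.0)
orders.process_order(2, 120000.0)
```

The service does not know that one of its listeners is a BI component, which is the point: the published events are part of the contract, not a BI-specific back door.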
To make the solution complete, you can add additional request/reply or request/reaction messages to allow service consumers to retrieve initial snapshots. Following this approach, you get an event stream of the changes within the service in a manner that is not specific for the BI. In fact, having other services react on the event stream can increase the overall loose coupling in the system; for instance, it can allow caching the state of other services and ease the temporal coupling between services. Additionally, adding EDA to SOA can serve as the basis for solving the reporting problem of SOA, by implementing the aggregated-reporting pattern (early draft).
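Combining an initial snapshot with the subsequent event stream lets a consumer keep a local cache of another service's state. The sketch below assumes hypothetical event types (CustomerChanged, CustomerRemoved) and a snapshot delivered as a plain dictionary; a real contract would define both.

```python
class CustomerCache:
    """Local, read-only copy of another service's state, kept fresh by events."""

    def __init__(self, snapshot):
        # The snapshot comes from a one-time request/reply call.
        self._customers = dict(snapshot)

    def on_event(self, event):
        # Afterwards, only published changes are applied.
        if event["type"] == "CustomerChanged":
            self._customers[event["id"]] = event["data"]
        elif event["type"] == "CustomerRemoved":
            self._customers.pop(event["id"], None)

    def get(self, customer_id):
        # Served locally, even if the owning service is unavailable.
        return self._customers.get(customer_id)

snapshot = {1: {"name": "Acme"}, 2: {"name": "Globex"}}
cache = CustomerCache(snapshot)

stream = [
    {"type": "CustomerChanged", "id": 2, "data": {"name": "Globex Corp"}},
    {"type": "CustomerChanged", "id": 3, "data": {"name": "Initech"}},
    {"type": "CustomerRemoved", "id": 1},
]
for event in stream:
    cache.on_event(event)
```

Because reads are answered from the cache, the consumer no longer depends on the owning service being reachable at the moment it needs the data.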
EDA on SOA solves the BI problem; as soon as you have event streams on the network, the BI components can grab that data, scrub it as much as they like, and push it to their data marts and data warehouses. However, event streams can also enhance the BI itself by enabling much more complex and interesting analysis of real-time events and real-time trend data, using complex event-processing (CEP) tools to get real-time business-activity monitoring (BAM). What would event processing look like? Imagine that you have an Orders service that publishes an event with an XML description of every order that it processes—something like Listing 1.
Listing 1. Excerpt from an Order summary XML
We can then use ESP or CEP tools to monitor this stream and continuously extract interesting events for further analysis or further actions. For example, Listing 2 shows a query on such an order stream to find orders that are larger than $100,000. Note that while the query looks suspiciously like SQL (from which it was derived), it is also quite different; the query continuously runs on a non-persistent stream of events.
INSERT INTO LargeOrders
orderid as orderid,
SUM(Ords.price * Ords.qty) AS TotalValue,
OrdersStream AS Ords XMLTable (val
Orderid as orderid,
TO_FLOAT (XMLExtractValue ('@Price')) AS price,
TO_FLOAT (XMLExtractValue ('@Quantity')) AS qty );
Listing 2. A Coral8's Continuous Computation Language query to find orders larger than $100,000 in an order stream and insert them into a LargeOrders table
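For readers without a CEP engine at hand, the intent of the query can be approximated in plain Python. This is an analogue, not CCL and not how Coral8 evaluates queries. The Price and Quantity attributes come from Listing 2; the Order and Item element names and the OrderId attribute are assumptions about the event format.

```python
import xml.etree.ElementTree as ET

THRESHOLD = 100_000.0

def order_total(order_xml):
    """Return (order id, total value) for one order event.
    Assumes <Order OrderId="..."> containing <Item Price=".." Quantity=".."/>."""
    root = ET.fromstring(order_xml)
    total = sum(
        float(item.get("Price")) * float(item.get("Quantity"))
        for item in root.iter("Item")
    )
    return root.get("OrderId"), total

def large_orders(stream, threshold=THRESHOLD):
    """Continuously filter a stream of order events, yielding the large ones.
    Nothing is persisted; each event is examined once as it arrives."""
    for order_xml in stream:
        order_id, total = order_total(order_xml)
        if total > threshold:
            yield order_id, total

stream = [
    '<Order OrderId="A-1"><Item Price="60000" Quantity="2"/></Order>',
    '<Order OrderId="A-2"><Item Price="19.99" Quantity="3"/></Order>',
    '<Order OrderId="A-3"><Item Price="50000" Quantity="1"/>'
    '<Item Price="25000.5" Quantity="2"/></Order>',
]
result = list(large_orders(stream))
```

Because large_orders is a generator, it can be fed from an unbounded source just as well as from the three-element list used here.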
The road to mainstream CEP tools is still long, but there are several vendors working on solutions. Even if we do not use CEP, we can still gain a lot of benefit from receiving these events. For example, a service that manages the stocks in the warehouse can listen in on the Order service's orders-processed stream and then take care of ordering new stocks, securing available items, and so on.
When we build our BI with EDA on SOA, we essentially create the BI as a mash-up of services. We can take that even further and have the BI component itself expose its trend data and other analysis results as a service. We can then consume that data and use it in other applications. For instance, if the CEP query in Listing 2 will generate an event every time that an order exceeds $100,000, we can present a nice dashboard on the CEO's portal that will show in real time how many large orders the organization processes per hour/per day, and so on, along with a few other meaningful gauges.
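A dashboard gauge such as "large orders in the last hour" boils down to a sliding-window count over the event stream. The sketch below is one simple way to do it; the class name is hypothetical, timestamps are plain seconds, and events are assumed to arrive in time order.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self._window = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        # Assumes timestamps are recorded in non-decreasing order.
        self._timestamps.append(timestamp)

    def count(self, now):
        # Drop everything that has aged out of the window, then count the rest.
        while self._timestamps and self._timestamps[0] <= now - self._window:
            self._timestamps.popleft()
        return len(self._timestamps)

gauge = SlidingWindowCounter(window_seconds=3600)
for ts in (0, 1800, 3500, 4000):   # arrival times of large-order events
    gauge.record(ts)

last_hour_at_4000 = gauge.count(now=4000)
last_hour_at_7200 = gauge.count(now=7200)
```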
Figure 3. Displaying the BI as a mash-up
We still have not answered one question, however: How can our services produce these events?
But What About Request/Reply?
Looking at the implementation side, we can see that the infrastructure to support this move is already emerging or even present. If you are implementing SOA over an ESB, that is rather easy to implement, as most ESBs support publishing events out-of-the-box. Using the WS-* stack of protocols, you have the WS-BaseNotification, WS-BrokeredNotification, and WS-Topics set of standards.
If you are in the Representational State Transfer (REST) camp or do not want to get into the complexities of the relatively immature WS-* protocols mentioned above, I guess you will need to implement publish/subscribe by yourself. But, then, we already have that solved, too: It is called RSS. When someone posts on a blog, your RSS reader uses synchronous request/reply to get to that blog and get the posts that were added since the last time that the RSS reader asked. Well, well, guess what: RSS gives us loosely coupled publish/subscribe, including topics (categories) built on top of synchronous request/reply, too.
Your services can publish their event streams as feeds, just like your blog, which as a bonus also gives us a few architectural benefits. For one, the service does not have to manage subscribers. Secondly, the consumer does not have to be there the moment that the event occurs to be able to consume it. Also, the management and setup are easier and simpler than using queuing engines or any other technology that I can think of.
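The feed idea can be sketched without any RSS library: the service appends entries, and each reader remembers the last entry it has seen. EventFeed and FeedReader are hypothetical names, and an in-memory list stands in for the feed document that a real service would serve over HTTP.

```python
class EventFeed:
    """Hypothetical feed: an append-only list of entries with increasing ids."""

    def __init__(self):
        self._entries = []

    def publish(self, category, payload):
        entry = {"id": len(self._entries) + 1, "category": category, "payload": payload}
        self._entries.append(entry)
        return entry["id"]

    def entries_since(self, last_id, category=None):
        # The request/reply operation a reader calls; the feed keeps no
        # record of who its subscribers are.
        return [
            e for e in self._entries
            if e["id"] > last_id and (category is None or e["category"] == category)
        ]

class FeedReader:
    """Consumer that tracks its own position in the feed."""

    def __init__(self, feed, category=None):
        self._feed = feed
        self._category = category
        self._last_seen = 0

    def fetch_new(self):
        new = self._feed.entries_since(self._last_seen, self._category)
        if new:
            self._last_seen = new[-1]["id"]
        return new

feed = EventFeed()
feed.publish("orders", {"order_id": 1})
feed.publish("faults", {"component": "db"})

reader = FeedReader(feed, category="orders")  # created after the events occurred
first = reader.fetch_new()
second = reader.fetch_new()
feed.publish("orders", {"order_id": 2})
third = reader.fetch_new()
```

The reader was created after the first two entries were published and still receives the one it cares about, which illustrates that the consumer need not be present when the event occurs.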
Using EDA and SOA together gives us a solution that does not break SOA and solves BI requirements. However, there are two challenges to the EDA and SOA approach. One is that there is not a lot of experience using EDA and SOA as a BI solution (compared to ETL, which is proven). The other is that it needs more work or even rework, as the first wave of SOA implementations builds on the more basic synchronous-messaging approach. Adding EDA to an existing SOA solution is not a small effort. However, neither is using ETL within SOA, because we need to go out and extract data from many sources, as each service holds its own internal data and we are likely to have quite a few of them for any reasonably sized SOA initiative.
My opinion is that, overall, combining EDA and SOA wins over using ETL from almost every perspective.
From the SOA perspective, adding EDA to SOA is good for the overall SOA initiative. EDA is a valuable tool for building services that are more autonomous. For example, services can now cache relevant data from other services and get notifications when that data changes. Thus, the consuming service can be decoupled in time from the services with which it interacts and not depend on their availability—which is the situation when synchronous request/reply is used.
From the BI perspective, things are even better. Utilizing EDA can give us something that was really hard to achieve by using traditional BI mechanisms—which is real-time insights. Using the EDA-generated event stream, we can now get data in real time and, using CEP tools, we can process it to act in real time and handle the emerging trends as they appear.
To summarize, implementing a BI solution by using EDA and SOA is superior to using traditional ETL. Not only do we get our basic BI, but we actually get better, real-time BI—not to mention improvement in the overall quality of our SOA.
About the author
Arnon Rotem-Gal-Oz is a manager and architect with extensive experience in building large, complex, distributed systems on varied platforms (ranging from HP-UX and Solaris to AS400 and Windows). Arnon blogs at www.rgoarchitects.com/blog and writes the Dr. Dobb's Portal blog on Software Architecture & Design at www.ddj.com/dept/architect. You can contact Arnon at firstname.lastname@example.org.
Yet another alternative
I am not sure that the service being aware of BI is necessarily the most flexible solution. Definitely, services should publish and subscribe to events, but the event dispatcher should follow a coordinator pattern. I doubt that services can be designed to produce/respond to all possible interesting events associated with their message flow, or that they could be directly wired with each other at the event level.
ETL vs SOA, real-time etc
When an event comes in it has to be added to a number of different aggregate tables. To have a consistent view of data you need transaction isolation for both the front-end queries and back-end queries on the intermediate data store. All of a sudden a new bunch of technical problems have been introduced that are not standard to BI implementations. There is also a training cost, labour cost etc.
The cost of hardware, though, is changing the options available: the cost of memory and CPUs is going down so fast that in-memory storage of pre-aggregated data is feasible for many implementations in the mid-market and even some in the high-end (despite the amount of data produced growing faster than that used for major companies). We don't care about the power going out most of the time because it's BI and not transactional. This pushes towards products that bundle custom hardware and the new complexities of the software into an appliance. After all, using the lower price of the new hardware to simplify the traditional technical problems with BI, while shielding the IT shop from the new ones, is a good move. Not surprisingly, there are already products doing this (Cognos bought Celequest, for example).
The downside is that what you get is real-time BI, but most BI doesn't need real-time yet; and you need to clean up the spaghetti, yet most companies still have spaghetti and they're still going to want a cheaper band-aid rather than surgery.
My two cents anyway.
Re: ETL vs SOA, real-time etc
The main point here is that events will let you pry the data out of the services without resorting to doing something specific to BI or any other SOA-violating technique.
This doesn't mean that you have to update your datamart (or datawarehouse) for each event - you can, within the component responsible for the BI, create batch files with all the updates and employ "common" ETL on that. You also have the added benefit that you can do real-time BI (BAM, KPIs etc.)
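Batching events inside the BI component before handing them to a conventional load step could look like this. It is a sketch only; BatchingSink and the load_batch callback are hypothetical, and a real component would probably also flush on a timer.

```python
class BatchingSink:
    """Buffers incoming events and hands them to a loader in batches."""

    def __init__(self, batch_size, load_batch):
        self._batch_size = batch_size
        self._load_batch = load_batch   # e.g. writes a batch file for "common" ETL
        self._buffer = []

    def on_event(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        # Hand over a copy so the loader is not affected by later buffering.
        if self._buffer:
            self._load_batch(list(self._buffer))
            self._buffer.clear()

batches = []
sink = BatchingSink(batch_size=2, load_batch=batches.append)
for order_id in range(1, 6):
    sink.on_event({"order_id": order_id})

sizes_before_flush = [len(b) for b in batches]
sink.flush()                            # e.g. at the end of the load window
sizes_after_flush = [len(b) for b in batches]
```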
Re: ETL vs SOA, real-time etc
I see your point. Technically the only problem is that you have an overhead per-transaction in the operational system. I see the problem being much more a business/budgetary problem than a technical problem.
You are pushing SOA and have built a ROLAP engine (yes I read your CV). I work in France and have done many BI projects in South Africa, France, Australia and the UK. SOA is almost non-existent for BI projects as a source of data. SOA in the BI market that I've worked in is almost purely hype.
The software companies who say to their customers "we use SOA" mean one of : a) they can consume services to read data (e.g. EII, EAI etc) b) they have exposed services from their platform so it can be integrated into larger projects or c) their internal components to their platforms work between each other via services.
The first two may have some business benefit for the customer (the third is just a maintainability issue for the vendor), but the main point of SOA as I understand it is to make the IT shop more agile so it can respond more quickly and with less budget to changing business needs. Almost no one has addressed this need in the BI market because they try to sell something that will work on top of existing architecture. SOA became for a while a buzzword that you had to be able to use to get past a certain stage in the sales process but not much more.
Now the main problem with the approach you've outlined is not technical, it's IT strategy - lots of IT shops want to have SOA but the business won't give them the budget to do it because you can't prove ROI. BI typically falls into a very different budget. So from a CIO perspective, someone who talks about SOA and BI together is either going to be visionary but only implementable on new systems at a departmental scale, or far too complex and costly to implement. No matter what the technical merits. Teradata is pushing SOA as the basis for its Active Data Warehouse/Right-Time Enterprise and they try to only target enterprise-wide data warehouses. They've got quite a few good examples of business benefits. But they are clearly struggling because of the huge cost barrier (time and money) to having the SOA fundamentals in place to be able to then build the BI.
We've got a Real-Time BI offer at my company but there is hardly anyone who actually needs Real-Time (or Right-Time) yet. This could be a problem of the customer's perspective and we need to demonstrate real ROI benefits. We've got one major project that does do that, but one is not enough. It's really hard without having some significant projects under our belt. So we get a vicious circle. Maybe the best place to start IS small scale - one business line at a time? Any thoughts welcome because at that point it becomes more a marketing issue.
Re: ETL vs SOA, real-time etc
SOA is not popular as a source of data for BI since it seems (to me at least) that just now the hype is starting to fade and real SOAs are starting to emerge.
I am sure that when companies move to an SOA they face the BI dilemma. I've seen it, and I've also seen that happen for reporting.
I just try to offer a way to handle that situation.
Note that I didn't suggest building the bulk of the BI solution as SOA (which may or may not be a viable option, but it is an orthogonal question), just how to make SOA and BI work together when you need to have a solution that integrates both.
Also note that web-service != SOA, so slapping some web-service interface on a product or a bunch of web applications does not mean you have an SOA; instead, you get Just a bunch of web services.
As for Real-Time BI: as you said, it is basically about building a value proposition. For defense systems it is pretty obviously a big win, but I've also seen it in other areas such as media companies (cable/satellite), where they care a lot about their KPAs and the freshness of their data. Airlines also come to mind, etc.
Data Services Using EII
Here's a more detailed entry I wrote on the topic blogs.ipedo.com/integration_insider/2007/05/fus...
Re: Data Services Using EII
EDA is a great way to deal with fine-grained data updates and get very low coupling through canonical event formats and event ontologies.
In the open-source world the Esper project (which I co-lead) is processing push-data in an EDA, please visit at esper.codehaus.org.
Very nice article...
I enjoyed your article and I believe it will be a common way to architect solutions in the future (how far in the future remains to be seen). My company, MetriWorks, has developed a product intended to make the approach you described a configurable bolt-on intermediary for web services. I can comment on how we address the performance overhead concerns, as our product has a very light footprint on the web service call. It only takes an in-memory snapshot of the raw web service data, pushes it onto a queue and then allows the web service to continue. The heavy processing of the raw data is handled in our server process asynchronously with the web service processing. The raw data can also be routed to a different physical server for processing in order to reduce overall system load on the web service server.
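The general pattern described here, snapshot the raw data, enqueue it, and process it asynchronously, can be sketched with the standard library. This is not the product's implementation; the function names, the payload fields, and the single worker thread are all assumptions made for the example.

```python
import queue
import threading

raw_events = queue.Queue()
processed = []

def handle_request(payload):
    """The service call: take a cheap in-memory snapshot, enqueue it, return."""
    raw_events.put(dict(payload))      # shallow copy as the snapshot
    return {"status": "ok"}            # the caller is not kept waiting

def worker():
    """Heavy processing happens here, off the service's request path."""
    while True:
        item = raw_events.get()
        if item is None:               # sentinel: stop the worker
            raw_events.task_done()
            break
        processed.append(
            {"order_id": item["order_id"], "total": item["price"] * item["qty"]}
        )
        raw_events.task_done()

thread = threading.Thread(target=worker, daemon=True)
thread.start()

responses = [
    handle_request({"order_id": 1, "price": 10.0, "qty": 3}),
    handle_request({"order_id": 2, "price": 2.5, "qty": 3}),
]
raw_events.put(None)
raw_events.join()                      # wait until everything is processed
thread.join()
```

Swapping the in-process queue for a message broker would allow the processing to run on a different physical server, as the comment describes.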
One common problem we do have with this approach is that, many times, the service does not contain all of the related data that might be needed. For example, in your Order Service, perhaps a "Customer ID" is passed in the service where you would really like to have the "Customer Name" and the "Customer Credit Rating" information to include in the alert. Of course you can write custom event handlers to do a lookup from some other databases. But, I would like to hear if there are any preferred patterns or suggestions for better ways to handle this cross-reference data lookup requirement?
Re: ETL vs SOA, real-time etc
I agree with your points;
the main point of SOA as I understand it is to make the IT shop more agile so it can respond more quickly and with less budget to changing business needs. Almost no one has addressed this need in the BI market.
I think much of the confusion comes from misunderstanding what SOAP/XML can do; SOA is not supposed to be necessarily bound to it. BI covers far more than simple data or tables. It involves a lot more object types than many people see. SOAP/XML SOA will never be prevalent!
Instead, other approaches that provide ROI (as you mentioned) will be successful in the field. SOA based on HTML URL Tags can deliver really robust platforms. For an example, see www.roselladb.com/bi-soa.htm. It is based on HTML pages. One can easily implement BI requirements by simply writing a bunch of HTML pages!
Re: ETL vs SOA, real-time etc
Looks like this post has been inactive for a few months. But inevitably I am in a similar situation, where it is difficult to convince BI/ETL technical folks about adopting SOA. My take on it is that it is to the BI/ETL folks' benefit to leverage data services (especially to get data from Apps DB and ODS); they can of course expose their ETL jobs as services as well. But for the first part, where ETL acts as a consumer, there are unnecessary worries and prejudice in the industry about slowness, performance, etc. Does somebody have an actual test result or benchmark to share showing that the overhead added by SOA is significant? I don't think many have done any testing before jumping to assumptions.
1 year and 3 month later... CDC and MDM
I also did appreciate Michael's constructive remarks.
One of my customers recently asked himself how he could take advantage of service orientation (SOA) in the process of setting up a new data warehouse from several scattered databases (BI).
That's how I found your article.
My point is that 15 months later, two things have appeared that may be relevant to this SOA/BI question:
- CDC for Fact Data (Sales, Production, ...): Change Data Capture, which allows the data warehouse to be loaded continuously (and which is actually a form of the EDA that you have described). New-generation ETL should be able to use this mechanism to detect changes on multiple data sources and to expose the associated information (pull mode) or to publish it.
- MDM for Referential Data (Customers, Products, ...): Master Data Management is really service oriented, as its purpose is to federate the referential data of the enterprise and to expose a unique and consistent 360-degree view of the reference data. At the least, we have to notice that one of the MDM architecture styles, "Consolidation," is a pure BI/DWH model: mono-directional flow from the data sources to the MDM, with MDM data used for analytics and reports.
If you still follow this thread, your thoughts are welcome, especially on my customer's original question and on the relevancy of MDM and CDC in the game.