DataStax, the company behind commercial support for the Cassandra NoSQL database, now offers graph database functionality. It recently announced a new product called DataStax Graph that includes an adaptive query optimizer, automatic graph data partitioning, a distributed query execution engine, and graph-specific index structures.
DataStax Graph is based on the open source Titan graph database and uses the Gremlin query language from the open source Apache TinkerPop framework. DataStax donated TinkerPop to the Apache Software Foundation and is now part of a community of vendors that use and contribute to Apache TinkerPop.
Aurelius, the team behind Titan, was acquired by DataStax last year and has built the new graph database functionality.
DataStax Enterprise (DSE) Graph is part of a multi-model platform that supports key-value, tabular, and document models in addition to graph. Rather than relying on multiple vendors for polyglot implementations that demand different data models, users can work with one vendor and get those data models in the same product.
DSE Graph includes additional capabilities such as security, built-in analytics, enterprise search, visual management and monitoring, and development tooling. DataStax Studio also now comes with a new web-based tool for visualizing graphs and for writing and executing graph queries.
InfoQ spoke with Martin Van Ryswyk, EVP of Engineering at DataStax, about the graph data model support in DataStax Enterprise.
Martin talked about the new graph database support and how it compares to specialized graph NoSQL databases. DSE Graph is a distributed graph database platform, designed specifically for Cassandra, that provides graph computing at scale for the DataStax platform.
DSE Graph comes integrated with DSE Search and DSE Analytics so the end users can leverage these technologies together in their applications.
He also talked about graph database use cases and the advantages of multi-model databases for enterprise data management needs.
InfoQ: What are some use cases where DataStax Enterprise Graph database can help with data management needs?
Martin:
- Customer 360: A hospital group is building a system that consolidates patient data and records as well as health care provider data in a 360 system to which all hospitals are connected. This combination of data results in a complex graph of patients, visits, facilities and medical professionals.
- Inventory Management: An online music and video store needs to implement its product catalogue based on supplier information, integrating relationships to authors, bands, actors, genres, etc., which results in a nested and complex structured graph. This graph needs to be traversed by shoppers in real time and be searchable based on user queries.
- IT Network and Device Management: A large bank needs to monitor its network of computers and servers together with their configuration. To better understand how machines are connected to each other, the bank builds a graph that is used to optimize deployment, track network health and identify security or compliance risks. To track the health of the systems in the network, it also collects health status information from each machine.
- Security and Fraud detection: A financial institution builds a graph of users, institutions, accounts, credit cards and financial transactions between those to identify if a transaction is indicative of money laundering by analyzing the path the money took and the individuals involved in the local transaction community.
- Recommendation engines: An eCommerce site recommends products to its customers based on the customer profile and on previous interactions and purchases but, most importantly, on the most recent actions taken on the site in the current browsing session, all of which are integrated into one graph. Graph analysis techniques are applied to identify other products that match previous interaction patterns.
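To make the fraud detection case above more concrete, here is a minimal sketch in plain Python of "analyzing the path the money took". It is not DSE Graph or Gremlin code: the account names, amounts, and the `money_paths` helper are hypothetical, and a real deployment would express this as a graph traversal against the database instead of an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical transfers: (source account, destination account, amount).
transfers = [
    ("acct:A", "acct:B", 9500),
    ("acct:B", "acct:C", 9400),
    ("acct:C", "acct:D", 9300),
    ("acct:A", "acct:E", 120),
]

def build_adjacency(transfers):
    """Index transfers as outgoing edges per account."""
    adjacency = defaultdict(list)
    for src, dst, amount in transfers:
        adjacency[src].append((dst, amount))
    return adjacency

def money_paths(adjacency, source, target, max_hops=4):
    """Return every cycle-free chain of transfers from source to target
    that uses at most max_hops transfers (depth-first search)."""
    paths = []
    stack = [(source, [source])]
    while stack:
        account, path = stack.pop()
        if account == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop budget exhausted on this branch
        for next_account, _amount in adjacency.get(account, []):
            if next_account not in path:  # never revisit an account
                stack.append((next_account, path + [next_account]))
    return paths

adjacency = build_adjacency(transfers)
paths_to_d = money_paths(adjacency, "acct:A", "acct:D")
```

The hop limit keeps the search bounded to the local transaction community; an analyst would then inspect the accounts and amounts along each returned path.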
InfoQ: Can you discuss a use case where the different data models like Key Value, Tabular, Document, and Graph can all be stored in a single database?
Martin: Imagine an IoT use case with data flowing in from sensors. That data is stored in a tabular time series model (as is common with DSE). Each record contains the sensor ID and readings over time. In the same instance, DSE Graph is used to represent the large and complex hierarchies of sensors, devices, factories, business lines, products, locations, suppliers, and more.
Example: The ID of the sensor is retrieved from the time series data and used as a lookup for a Vertex in the graph which represents the sensor. Now context can be determined. If a sensor reading represents something abnormal, what real things will be affected? This sensor is one of 500 sensors in a jet engine. The graph can be used to figure out that the failure is in the fuel pump component of engine #3 and indicates an immediate precautionary landing is required. The graph will let us know that the engine is in airframe 1234 which is currently operating as Flight #722. From there we can find subsequent flights that this airframe will service and predict those flights will be delayed today. We can also figure out which airports will be affected. And each piece of luggage that was loaded on the plane was scanned and added to the graph, so we know what freight and luggage will be delayed.
This is one use case for the customer. They are also storing information on food service, crew scheduling, baggage handling, ground equipment, and more in the same DSE Graph database. Each other domain may have its own time series data, documents, or other data that is stored outside of the graph. You can see how storing all this in one system is easier on the developers and operations staff.
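The lookup Martin describes, from a time series record to a sensor vertex and then outward through the graph, can be sketched in a few lines of Python. This is an in-memory illustration only, not DSE Graph code: the `Graph` class, the edge labels (`part_of`, `mounted_on`, `operates`, `scheduled_for`, `carries`), the sensor ID, the later flight, and the abnormality threshold are all assumptions made for the example. In DSE Graph the same walk would be written as a Gremlin traversal.

```python
from collections import defaultdict

class Graph:
    """Tiny directed, edge-labelled graph (illustrative stand-in for a graph database)."""
    def __init__(self):
        self.out_edges = defaultdict(list)  # vertex -> [(label, neighbour)]

    def add_edge(self, src, label, dst):
        self.out_edges[src].append((label, dst))

    def out(self, vertex, label):
        """Neighbours reached from vertex over outgoing edges with this label."""
        return [dst for lbl, dst in self.out_edges[vertex] if lbl == label]

g = Graph()
g.add_edge("sensor:S-17", "part_of", "component:fuel-pump")
g.add_edge("component:fuel-pump", "part_of", "engine:3")
g.add_edge("engine:3", "mounted_on", "airframe:1234")
g.add_edge("airframe:1234", "operates", "flight:722")
g.add_edge("airframe:1234", "scheduled_for", "flight:845")  # hypothetical later flight
g.add_edge("flight:722", "carries", "bag:B-001")

# Time series rows as they might come out of the tabular model (values are made up).
readings = [
    {"sensor_id": "S-17", "ts": 1, "value": 0.98},
    {"sensor_id": "S-17", "ts": 2, "value": 0.31},
]
ABNORMAL_BELOW = 0.5  # hypothetical threshold

def impact_of(sensor_id, graph):
    """Use the sensor ID from the time series as the entry vertex, then walk outward."""
    sensor = f"sensor:{sensor_id}"
    component = graph.out(sensor, "part_of")[0]
    engine = graph.out(component, "part_of")[0]
    airframe = graph.out(engine, "mounted_on")[0]
    current = graph.out(airframe, "operates")
    return {
        "component": component,
        "engine": engine,
        "airframe": airframe,
        "current_flights": current,
        "later_flights": graph.out(airframe, "scheduled_for"),
        "delayed_bags": [bag for flight in current for bag in graph.out(flight, "carries")],
    }

abnormal = [r for r in readings if r["value"] < ABNORMAL_BELOW]
report = impact_of(abnormal[0]["sensor_id"], g)
```

The point of the sketch is the hand-off: the tabular model answers "which sensor reported what, and when", while the graph answers "what does that sensor belong to, and what depends on it".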
InfoQ: Are multi-model databases a better choice to manage different types of unstructured data (KV, Tabular, Document and Graph data models) compared to using individual specialized NoSQL databases?
Martin: Customers have been very clear. A few years ago they would use specialized point solutions because their legacy systems could not provide the scalability and availability they needed. However, their concerns are now turning to operational simplicity, and they would much rather have a single multi-model system that provides the best of all point solutions in one system. It is easier to operate, and to find skilled employees for, one distributed system than five.
InfoQ: When should a team use a multi-model database rather than a polyglot persistence approach with a data access framework like Spring Data that abstracts the data access logic from application developers?
Martin: Abstracting data access logic means you don’t have a clue what the underlying system is doing. To meet the scalability and availability needs of cloud applications, developers cannot abdicate knowledge to a layer like Spring. Usually these layers can only create a good abstraction by simplifying to the least common denominator of all the systems they are abstracting. The different models are too “different” to abstract away. It is good to know the model so you can optimize.
InfoQ: How does DataStax Enterprise Graph store large graph data sets, in terms of partitioning and replication among multiple nodes in the cluster?
Martin: DSE Graph is tightly integrated with the Cassandra database inside DataStax Enterprise. It is able to use the partitioning and replication technology present in Cassandra. In addition, our team has developed efficient query routing and query optimization algorithms to find the data in the cluster quickly and efficiently.
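For readers unfamiliar with Cassandra-style partitioning and replication, the following Python sketch shows the general idea of a token ring: a partition key is hashed to a position on the ring, and the data is stored on the owning node plus the next nodes clockwise, up to the replication factor. It is a simplified illustration under stated assumptions, not how DSE Graph lays out vertices and edges: the `Ring` class, node names, and key format are hypothetical, MD5 stands in for Cassandra's default Murmur3 partitioner, and virtual nodes and rack or datacenter awareness are ignored.

```python
import bisect
import hashlib

def token(key: str) -> int:
    """Hash a key to a ring position (MD5 used purely for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

class Ring:
    """One token per node; replicas are the owner plus the next nodes clockwise."""
    def __init__(self, nodes, replication_factor=3):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds node count")
        self.replication_factor = replication_factor
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str):
        """Nodes that would hold a copy of the given partition."""
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]

nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas("vertex:sensor-42")  # hypothetical partition key
```

Because any client can compute the same ring positions, a query for a given key can be routed straight to a node that holds the data, which is the kind of locality that query routing in a distributed graph store depends on.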
InfoQ: How does the new graph data model support affect data analytics that use Cassandra's integration with Spark?
Martin: We leverage our Spark-based DSE Analytics functionality for Graph. When a user runs a query that will run against large sections of the graph (rather than starting a traversal at a single vertex), we use the Spark integration behind the scenes to optimize the query.
InfoQ: Are there any graph data visualization tools available with the DataStax Enterprise Graph product?
Martin: DataStax Studio is a developer tool which helps users run Gremlin queries and visualize the results. This tool is designed to help new developers learn the query language and to help expert developers test new queries for their applications. In addition, we have partnerships with several graph visualization companies who have added support for DSE Graph, including Cambridge Intelligence and Linkurious.
He also mentioned that the graph implementation includes the server, visual management and monitoring of graph database implementations with DataStax OpsCenter, visual graph development with DataStax Studio, and a suite of drivers that handle graph as well as all other components of DataStax Enterprise (e.g., SparkSQL and enterprise search).
Readers can also check out the DataStax whitepaper on multi-model databases and how cloud applications can benefit from a multi-model approach.
In related news, a new website called PlanetTinkerPop.org was launched recently by the TinkerPop community and DataStax to provide a place to discuss and share information about TinkerPop Gremlin, the query language for graph databases.
About the Interviewee
Martin Van Ryswyk, Executive Vice President of Engineering, is responsible for the worldwide software engineering, product development and continued advancement of DataStax's integrated enterprise big data platform. He has more than 22 years of experience managing software teams at both small startups and large corporations. During that time, he has brought products to market in a wide variety of areas such as cloud computing, application lifecycle management, database performance analysis, storage management and systems management. Before joining DataStax, he held numerous senior engineering roles, leading the development and go-to-market strategy for enterprise-level technology products at Tidal Software, Luminate, EMC and, most recently, Electric Cloud. Martin earned a Bachelor of Science degree in computer science from the University of California, Davis.