High Tech, High Sec.: Security Concerns in Graph Databases


This article first appeared in IEEE IT Professional magazine. IEEE IT Professional offers solid, peer-reviewed information about today's strategic technology issues. To meet the challenges of running reliable, flexible enterprises, IT managers and technical leads rely on IT Pro for state-of-the-art solutions.

 

Cybersecurity measures are best accommodated in system design, because retrofits can be costly. New technologies and applications, however, bring new security and privacy challenges. Furthermore, the consequences of new technology adoption are often difficult to anticipate. Such is the case with graph databases, a relatively new database technology that’s gaining popularity. This article explores the value of graph databases and probes some of their security and privacy implications.

The Emergence of NOSQL

The Relational Database Management System (RDBMS), initially designed to maximize highly expensive storage, has proven highly effective in transaction-rich and process-stable environments. For example, the RDBMS excels in large-scale credit-card transaction processing and cyclic billing operations. The RDBMS also offers superior performance in the realm of indexed spatial data, but it fares poorly in highly dynamic environments, such as a management information system that depends on volatile data or a systems architecture in which the churn of many-to-many relationships is high. In such environments, the RDBMS design imposes far too much mathematical and managerial overhead.

The emergence of the Not Only Structured Query Language (NOSQL) database represents an alternative to the decades-long reign of the RDBMS.[1] Various forms of NOSQL databases opened doors to a vastly improved dynamic data portrayal, with far less overhead and fewer performance penalties. For example, schemas need not be as rigorous in the NOSQL world. NOSQL database designs include wide-column stores, document stores, key-value (tuple) stores, multimodal databases, object databases, grid/cloud databases, and graph databases. The graph database, crossing many lines in the NOSQL world,[2] stands poised to become a successful technology.

The Graph Database

The graph database relies on the familiar “node-arc-node” relationship, or perhaps more simply, the “noun-verb-noun” relationship of a network (see Figure 1). A node can be any object. An arc represents the relationship between nodes. Both nodes and arcs can contain properties. This simple node-arc-node triad, often called a triple, is the fundamental building block for describing all manner of complex networks in great detail.
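To make the triad concrete, the following minimal sketch models a property graph in plain Python. It assumes no particular graph database product; the PropertyGraph class and its contents are purely illustrative.

```python
# A minimal sketch of the node-arc-node ("triple") model; the
# PropertyGraph class and its contents are illustrative, not any
# particular product's API.

class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node id -> properties
        self.arcs = []    # (source, relationship, target, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_arc(self, source, relationship, target, **properties):
        # Each arc is one noun-verb-noun triple; arcs carry properties too.
        self.arcs.append((source, relationship, target, properties))

g = PropertyGraph()
g.add_node("alice", kind="person")
g.add_node("acme", kind="company")
g.add_arc("alice", "WORKS_FOR", "acme", since=2019)

# Traversal reduces to scanning for matching triples.
for source, rel, target, props in g.arcs:
    print(source, rel, target, props)  # alice WORKS_FOR acme {'since': 2019}
```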

Networks such as an electrical grid, a corporate supply chain, or an entire ecosystem are often composed of numerous nodes that share huge numbers of relationships across arcs. Networks of all kinds lend themselves well to graph representation. The graph database harnesses this powerful capability to represent network composition and connectivity. Graph databases have matured to support discovery, knowledge management, and even prediction.

In an Internet-connected world, where networks of all types become increasingly preeminent, such a network capability is becoming essential to modern sense making. However, like the RDBMS, the graph database is just another tool in the box, and it can be harnessed for good or ill. Thus, it’s not premature to consider the large-scale security implications of this new and rather exciting technology, at least from the highest levels.

Graph Discovery

Because they deal with properties and connections, graph databases represent rich pools of information, often hidden until discovered. Discovery is a means by which a large collection of related data is mined for new insights, without a strong preconception of what those insights might be.

The graph database wasn’t initially considered a useful tool for discovery. It took a specially designed family of supercomputers to realize the full power of graph discovery. Although it’s straightforward to represent graphs, as the volume of triples increases into the billions, the ability to rapidly traverse multiple paths becomes compute-bound in all but the most powerful machines.

This is particularly true of dense graphs, such as tightly woven protein-protein networks, where detailed graph queries can overwhelm less capable computational systems. The graph supercomputer, built from the ground up to traverse graphs, overcomes these time and capacity limitations. Such devices, some complete with Hadoop analysis tools, recently became available on the high-end graph database marketplace from Cray.[3]
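The following sketch illustrates why deep traversals overwhelm conventional hardware. It counts, rather than enumerates, the k-hop paths in a small dense graph; the counts show how the work facing an enumerating traversal engine grows. The graph size and hop counts are illustrative assumptions, not a benchmark of any particular appliance.

```python
# Sketch: why multi-hop queries over dense graphs become compute-bound.
# We *count* k-hop paths with dynamic programming; an engine that had to
# enumerate them would do work proportional to these counts.

from collections import defaultdict

def count_k_hop_paths(adjacency, start, k):
    """Count paths of length k from start without enumerating them."""
    paths = {start: 1}
    for _ in range(k):
        next_paths = defaultdict(int)
        for node, n_paths in paths.items():
            for neighbor in adjacency[node]:
                next_paths[neighbor] += n_paths
        paths = next_paths
    return sum(paths.values())

# A small complete graph stands in for a dense protein-protein network.
n = 50
adjacency = {i: [j for j in range(n) if j != i] for i in range(n)}
for k in (2, 4, 6):
    print(f"{k}-hop paths from one node: {count_k_hop_paths(adjacency, 0, k):,}")
# Counts grow roughly as 49**k -- the combinatorial pressure that
# purpose-built graph hardware is designed to absorb.
```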

The high-end graph supercomputer, built for discovery, brings great promise. For example, it can support a detailed build-out of the complex relationships between the ocean and atmosphere that compose climatic conditions. In a time of great climate change, further discovery of indirect, nonlinear causes and effects becomes increasingly crucial. Likewise, a graph supercomputer could hasten a discovery concerning the spread of Ebola in Western Africa, which could serve to stem the spread of the disease. Figure 2 illustrates the notion of discovery using a graph database.

Figure 1. Fundamental graph reasoning. This simple node-arc-node triad, often called a triple, is the fundamental building block for describing all manner of complex networks in great detail.

Figure 2. A graph database harnessed for discovery. Such discovery could support a detailed build-out of the complex relationships between ocean and atmosphere that compose climatic conditions, or could hasten the discovery of how Ebola might spread in Western Africa.

Discovery: Privacy and Security

Graph discovery holds great promise for resolving complex, interrelated problems, but it also presents privacy and security concerns. For example, one’s identity can be further laid bare if the graph supercomputer becomes the device of choice for mining our social and financial transactions for purposes of surveillance, targeted advertising, and other overt exploits that tend to rob individuals of their privacy.

While perhaps an alien thought in a thriving free enterprise system, placing an ethical bar on the acceptable extent of intrusion into one’s personal life might well prove necessary for financial, if not constitutional, reasons. It’s quite acceptable to expect law enforcement to use all necessary means to remove real threats from our midst, but at what expense to the rest of society? Likewise, those anxious to move their products will take advantage of every opportunity to do so by whatever means possible, but at what personal price for those targeted? The reality is that such high-end exploitation amounts to nothing more than a projection of currently established trends.

In the design of such socioeconomic studies, especially those involving a wide range of social and business transaction relationships, the security bar must be set exceedingly high. Any intentionally perpetrated breach could be far more devastating than the recent massive hacks against corporations such as credit-card issuers or motion picture companies. This is further exacerbated by the notion that the Internet of Anything (IoA) consists of myriad sensors, actuators, and mobile devices, all of which seem to be optimized for privacy leakage.[4]

Graph Knowledge Management

The node-arc-node triple concept is highly conducive to the “subject-predicate-object” relationships expressed using the Resource Description Framework (RDF) descriptive language. RDF creates a level of formal expression that lets you describe and reason about the data housed in a graph database. Moreover, RDF nicely feeds a formal ontology, thus permitting a rigorous semantic definition of terms. The “how much is enough” question, however, might take years to resolve with regard to a tolerable degree of practical formalization. Together, RDF and a formal ontology speak to the World Wide Web Consortium (W3C) view of linked data, an endeavor to make reusable structured knowledge generally available in a common referential format via the Web.[5]

There’s a downside, though. Whereas it’s relatively straightforward to convert highly structured data, such as well-organized spreadsheets and databases, into RDF, the ability to reliably convert unstructured data into RDF exists only in high-end tools, which carry some restrictive caveats.

Not all graph databases, however, require RDF triple representation. A number of thriving commercial graph databases employ triples in their own unique ways without engaging RDF. Many offer attractive features, such as graph visualization, backup, and recovery. As the graph database industry grows from 2 percent to an estimated 25 percent of the database market by 2017,[6] a number of these tools will catch the corporate nod and the consumer base will continue to grow. Of course, many employ their own languages and techniques for data management. A real need exists for standards that, at a minimum, support data transportability.
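As a concrete illustration of the subject-predicate-object form, the following sketch builds and serializes a few triples with the open-source Python rdflib library (assumed installed via pip install rdflib). The example.org namespace and terms are invented for illustration and don’t reflect any real ontology.

```python
# Subject-predicate-object triples with rdflib (pip install rdflib).
# The example.org namespace and terms are invented for illustration.

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)  # register a prefix for readable Turtle output

# Each add() asserts one subject-predicate-object triple.
g.add((EX.alice, EX.worksFor, EX.acme))
g.add((EX.acme, EX.locatedIn, Literal("San Luis Obispo")))

# Serializing to Turtle yields the shareable, linked-data form.
print(g.serialize(format="turtle"))
```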

Knowledge Management: Privacy and Security

Once again, however, security, particularly for proprietary architectural designs, must be taken into consideration. If Web sharing is envisioned as a reasonable means to generate a large number of system-representative triples from resident experts, the design of a secure portal to the RDF data store becomes exceedingly important. User authentication and verification likewise become important.

Although knowledge management is perhaps less extensive than discovery, related databases still might possess specific identity attributes that must be well protected. Front-end provisions must be made to assure both security against intrusion and the privacy of any personal data contained in the graph database. Failure to offer adequate protection could disqualify otherwise promising graph database offerings whose interfaces remain vulnerable to attack.
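A minimal sketch of such a front-end provision follows: every query passes an authorization check before it touches the store, so sensitive identity attributes are withheld at the source rather than filtered by the client. All names here (the triples, roles, and predicates) are hypothetical illustrations.

```python
# Front-end protection sketch: queries pass an authorization check before
# touching the graph store. The triples, roles, and predicates here are
# hypothetical illustrations.

TRIPLES = [
    ("alice", "WORKS_FOR", "acme"),
    ("alice", "HAS_SSN", "xxx-xx-1234"),  # identity attribute to protect
]

SENSITIVE_PREDICATES = {"HAS_SSN"}

def query(user_roles, predicate):
    """Return matching triples, withholding sensitive ones from callers
    who lack the required role, rather than trusting clients to filter."""
    if predicate in SENSITIVE_PREDICATES and "auditor" not in user_roles:
        raise PermissionError("not authorized for sensitive attributes")
    return [t for t in TRIPLES if t[1] == predicate]

print(query({"analyst"}, "WORKS_FOR"))   # permitted for any caller
print(query({"auditor"}, "HAS_SSN"))     # permitted only for auditors
```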

Graph Prediction

In dynamic circumstances involving an unfolding process such as weather or economic trends, the ability to predict future behavior becomes highly desirable.

Graph representations facilitate predictions, because they let us both qualify and quantify a system represented as a network. The ability to assign properties to nodes and arcs, such as location, time, weights, or quantities, lets us qualitatively evaluate the graph on the basis of similar properties. More importantly, quantitative techniques let us evaluate metrics inherent in almost all graphs. This applies to many fields, including neuroscience.[7]

The ability to apply proven metrics to graphs means that their characteristics can be quantified in such a manner as to objectively evaluate the graph. In cases where graph data is dynamic, such as in an ongoing process, a powerful predictive capability becomes possible, assuming the datastream is accessible. This approach presumes that combinations of graph theory and combinatorial mathematics can be applied against a real-time datastream. Moreover, various graph configurations could be classified based on their metrics. Such classification templates, each with a graph signature based on its metrics, could then permit identification of, and a predictive baseline for, similar graphs as they arise.
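The sketch below illustrates one way such a metric-based “graph signature” might look, using the open-source networkx library (assumed installed via pip install networkx). The choice of metrics and the synthetic graphs are assumptions for illustration, not an established classification scheme.

```python
# A "graph signature" built from standard metrics, using networkx
# (pip install networkx). The metric choice is an illustrative assumption.

import networkx as nx

def graph_signature(G):
    """Summarize a graph as a dictionary of quantitative metrics."""
    degrees = [d for _, d in G.degree()]
    return {
        "nodes": G.number_of_nodes(),
        "density": round(nx.density(G), 4),
        "avg_degree": round(sum(degrees) / len(degrees), 2),
        "clustering": round(nx.average_clustering(G), 4),
    }

# Two synthetic graphs stand in for observed network snapshots.
sparse = nx.erdos_renyi_graph(200, 0.02, seed=1)
dense = nx.erdos_renyi_graph(200, 0.20, seed=1)

print(graph_signature(sparse))
print(graph_signature(dense))
# Signatures like these could serve as classification templates and as a
# predictive baseline for similar graphs as they arise.
```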

Prediction: Security and Privacy

Current cybersecurity best practices suggest taking a snapshot of a system under study to determine its security and privacy vulnerabilities, leading to the accreditation of systems proven to be “secure.” The fallacy of such practice is that most systems are influenced by ever-changing environments, which serve to change systemic behaviors over time. Thus, the accreditation is valid only for the moment in which the snapshot was taken.

Given their growing sophistication, graph databases offer the potential to let us monitor dynamic change in near real time. By monitoring datastreams for anomalous node or relationship pattern changes using quantitative methods, we could detect and investigate intrusions and other security breaches early on, quickly prosecuting any identified perpetrators.
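A minimal sketch of such monitoring follows, comparing metric signatures of successive graph snapshots and flagging metrics whose relative change exceeds a threshold. The threshold, metrics, and simulated burst of new relationships are illustrative assumptions; a production monitor would be tuned to its own domain and datastream.

```python
# Monitoring sketch: flag anomalous change between graph snapshots by
# comparing metric signatures. The threshold, metrics, and simulated
# burst of relationships are illustrative assumptions.

import networkx as nx

def signature(G):
    return {
        "edges": G.number_of_edges(),
        "density": nx.density(G),
        "clustering": nx.average_clustering(G),
    }

def drift(before, after, threshold=0.25):
    """Report metrics whose relative change exceeds the threshold."""
    return {
        key: (before[key], after[key])
        for key in before
        if before[key]
        and abs(after[key] - before[key]) / abs(before[key]) > threshold
    }

baseline = nx.erdos_renyi_graph(100, 0.03, seed=7)
current = baseline.copy()
# Simulate a suspicious burst of new relationships around one node.
current.add_edges_from((0, i) for i in range(1, 60))

print(drift(signature(baseline), signature(current)))
```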

From the predictive perspective, data integrity must take a front seat. Thus, the data provenance issue becomes crucial, because the stakes of prediction are high. The results of a prediction are only as accurate as the data underlying the predictive tools. False data could gravely affect outcomes where security is literally endangered. Consider the consequence of a faulty predictive model for disaster relief, which calls for distributing resources in an unaffected region as opposed to the affected region. In this regard, good security practice results in the highest ethical standards of applied science.

Although graph databases hold great promise in a world being consumed by networks of all kinds, they also represent some inherent security risks that have yet to be fully understood, much less appreciated. Rather than piling onto the bandwagon, the prudent IT professional must carefully evaluate potential risks in the context of the intended operating environment and perform the necessary tradeoffs to achieve acceptable levels of security and data protection. If security and privacy issues surrounding relatively new technologies, such as increasingly popular graph databases, aren’t considered up front, they become far more costly to address downstream.

References

  1. A.B.M. Moniruzzaman and S.A. Hossain, “NoSQL Database: New Era of Databases for Big Data Analytics—Classification, Characteristics and Comparison,” Int’l J. Database Theory and Application, vol. 6, no. 4, 2013.
  2. M. Buerli, “The Current State of Graph Databases,” Dept. of Computer Science, Cal Poly San Luis Obispo, Dec. 2012.
  3. Real-Time Discovery in Big Data Using the Urika-GD Appliance, white paper, Cray, Oct. 2014.
  4. A. Ukil, S. Bandyopadhyay, and A. Pal, “IoT-Privacy: To Be Private or Not to Be Private,” IEEE Conf. Computer Communications Workshops (INFOCOM), 2014, pp. 123–124.
  5. D. Wood et al., Linked Data: Structured Data on the Web, Manning Publications, 2014.
  6. E. Eifrem, “Graphs Are Eating the World,” keynote, GraphConnect, Nov. 2014.
  7. O. Sporns, “The Nonrandom Brain: Efficiency, Economy, and Complex Dynamics,” Frontiers in Computational Neuroscience, vol. 5, 2011.

About the Author

George Hurlburt is chief scientist at STEMCorp, a nonprofit that works to further economic development via the adoption of network science and to advance autonomous technologies as useful tools for human use. Contact him at ghurlburt@change-index.com.

 
