
NoSQL in the Enterprise

Apr 21, 2010 24 min read

Introduction

As an Enterprise Architect, I'm always in search of promising new concepts and ideas that could help my enterprise customers across different industry verticals. With that quest in mind, I had been following the NoSQL space for a while, even before the term was coined (or mis-coined?). Google laid the first brick by disputing the popular belief in the RDBMS as a silver bullet with the publication of its Bigtable architecture, which was followed by Amazon's paper on Dynamo. Over the last year or so we have seen huge NoSQL momentum, with an explosion of more than 25 products and solutions in this space and growing mindshare across different corners of the industry. Against that backdrop, I recently decided to take a deep dive to evaluate how exactly my clients can benefit from the NoSQL movement. More than that, I wanted to find out whether this is the right time for enterprises to give serious thought to adopting it.

A quick recap on what is NoSQL

Like many others who follow this space, I do not like the sense of opposition to SQL inherent in the term NoSQL. Nor do I like the current improvisation of the name, 'Not Only SQL'. To me, the discussion is not about whether to use SQL or not. (On the contrary, one may still decide to use a SQL-like query interface, without support for joins and the like, to interact with these databases, simply to keep development scalable and maintainable with existing skills.) The movement is rather about figuring out what other efficient options exist for storing and retrieving data, instead of blindly taking the RDBMS approach as the de facto choice for anything and everything. Hence, to me, 'Non Relational Databases' is a better name to summarize the idea.

Whatever the name, the scope of 'Non Relational Databases' is rather open (and negation oriented), with an implicit 'catch all' connotation. That in turn leaves people, especially enterprise decision makers, confused about what is included and what is not and, more importantly, why it makes sense for them.

Keeping that in mind, I try here to capture the spirit of 'Non Relational Databases' through the characteristics below.

'Non Relational Databases' are the ones which

Logically model data using a loosely typed, extensible data schema (Map, Column Family, Document, Graph etc.) instead of modeling data in tuples following a fixed relational schema.
Are designed for horizontal scaling through a data distribution model across multiple nodes, abiding by the principles of the CAP theorem (which states that at most two of Consistency, Availability and Partition tolerance can be guaranteed at the same time). This comes along with the necessary support for multiple data centers and dynamic provisioning (transparently adding or removing a node from a production cluster), a la elasticity.
Can persist data on disk, in memory, or both; sometimes in pluggable custom stores.
Support various 'Non-SQL' interfaces (typically more than one) for data access.

The variations around these four characteristics (Logical Data Model, Data Distribution Model, Data Persistence and Interfaces) are very well covered in recent articles widely available on the Internet. So instead of detailing them again, I summarize the key aspects with some examples for quick reference –

Interfaces – REST (HBase, CouchDB, Riak, etc.), MapReduce (HBase, CouchDB, MongoDB, Hypertable, etc.), Get/Put (Voldemort, Scalaris, etc.), Thrift (HBase, Hypertable, Cassandra, etc.), Language Specific APIs (MongoDB).

Logical Data Models – Key-Value oriented (Voldemort, Dynomite etc.), Column Family oriented (BigTable, HBase, Hypertable etc.), Document oriented (CouchDB, MongoDB etc.), Graph oriented (Neo4j, InfoGrid etc.).

Data Distribution Model – Consistency and Availability (HBase, Hypertable, MongoDB etc.), Availability and Partition tolerance (Cassandra etc.). Consistency and Partition tolerance is a combination where the availability of some of the non-quorum nodes is compromised. Interestingly, none of the 'Non Relational Databases' today supports this combination.

Data Persistence – Memory based (e.g., Redis, Scalaris, Terrastore), Disk based (e.g., MongoDB, Riak), a combination of both memory and disk (e.g., HBase, Hypertable, Cassandra). The type of storage gives a good idea of which use cases the solution can cater to. However, in most cases people find that the combination-based solution is the best one: it delivers high performance through the in-memory data store and also ensures durability by storing the data to disk after enough writes have happened.
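
The combined memory-and-disk approach described above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not how any of the named products is implemented: the class BufferedStore, its flush_threshold parameter and the JSON segment files are all hypothetical. Writes land in an in-memory buffer and are written to disk as a sorted segment once enough of them have accumulated; reads check memory first, then the segments from newest to oldest.

```python
import json
import os
import tempfile


class BufferedStore:
    """Toy store (hypothetical): writes go to memory, then to disk in batches."""

    def __init__(self, directory, flush_threshold=3):
        self.directory = directory
        self.flush_threshold = flush_threshold  # hypothetical tuning knob
        self.memtable = {}                      # recent writes, held in memory
        self.segments = []                      # flushed files, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Persist the in-memory buffer as one sorted segment file."""
        if not self.memtable:
            return
        path = os.path.join(self.directory, f"segment-{len(self.segments)}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.segments.append(path)
        self.memtable = {}

    def get(self, key):
        # Newest data wins: look in memory, then in segments newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.segments):
            with open(path) as f:
                segment = json.load(f)
            if key in segment:
                return segment[key]
        return None


with tempfile.TemporaryDirectory() as tmp:
    store = BufferedStore(tmp, flush_threshold=3)
    for i in range(4):
        store.put(f"user:{i}", {"visits": i})
    flushed_segments = len(store.segments)  # the third write triggered a flush
    still_in_memory = list(store.memtable)  # the fourth write is still buffered
    from_disk = store.get("user:1")
    from_memory = store.get("user:3")
```

In this sketch anything still in the buffer is lost if the process dies before the next flush, so a real design would need an additional durability mechanism that the sketch leaves out.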

How does it fit in Enterprise IT

In today's enterprises, not all use cases lend themselves intuitively to an RDBMS, nor do they all need the strictness of the ACID properties (especially Consistency and Isolation). Gone are the days of the 80s and 90s, when most of the data stored in an organization's databases was structured, had to be generated and accessed in a controlled manner, and consisted of 'records' of business transactions. Unarguably, that type of data is still there, will continue to be there, and should always be modeled, stored and accessed using an RDBMS. But what about the large volume of uncontrolled, unstructured, information-oriented data that has exploded in enterprises over the last 15 years with the advent of the web, digital commerce, social computing and so on? Enterprises really don't need an RDBMS to store and retrieve it, as the core characteristics of an RDBMS do not fit the nature and usage of this data.

The above figure summarizes emerging patterns in Information Management in today's web-centric enterprises. 'Non Relational Databases' are a better choice than RDBMS solutions for handling these trends, given their support for unstructured data, horizontal scalability through partitioning, high availability and so on.

Here are some examples of use cases supporting the point –

Log Mining – Server logs, application logs and user activity logs get generated on multiple nodes of a cluster. For production problem solving, log mining tools that can access logs across servers, relate them and analyze them are handy. A custom solution for this can be built easily using 'Non Relational Databases'.

Social Computing Insight – Many enterprises today give their users (internal users, customers, partners) the ability to do social computing through message forums, blogs etc. They are finding that mining this unstructured data is of utmost importance for gauging user mindshare and further improving their services. A 'Non Relational Database' is a very good fit for this need.

External Data Feed Integration – In many cases enterprises need to consume data coming from their partners. Even after a number of discussions and negotiations, enterprises have little control over the format of the data coming to them. Also, in many situations those formats change very frequently as the partners' business changes. A 'Non Relational Database' can be used very successfully to solve this issue while developing or customizing an ETL solution.

High Volume EAI – Most enterprises have heavy traffic flowing through their EAI systems (either product based or custom developed). The messages flowing through the EAI typically need to be persisted for reliability and audit purposes. Again, 'Non Relational Databases' can be a good fit as the underlying data store for this scenario, given the variation in data structure between source and target systems as well as the volume in question.

Front End Order Processing Systems – Given the explosion of digital commerce, the volume of orders, applications and service requests flowing through different channels to the systems of retailers, bankers, insurance providers, entertainment service providers, logistics providers etc. is enormous. Also, owing to the restrictions and behavior patterns associated with different channels, the structures in which the information is captured are typically a little different in each case and need different rules imposed. On top of that, most of this request data doesn't need immediate processing and reconciliation at the back end. What is needed instead is that the requests are captured without interruption whenever an end user wants to submit them, from anywhere in the world. Later on, a reconciliation system typically applies them to the source-of-truth back end systems and updates the end user on the order status. This scenario lends itself perfectly to the use of 'Non Relational Databases' for initially storing the inputs from end users, given the high volume, the differences in input data structure and the acceptability of 'Eventual Consistency' during reconciliation.

Enterprise Content Management Service – Content management is now used enterprise wide across different functional groups (Sales, Marketing, Retail, HR) for various purposes. Most of the time, the challenge enterprises face is bringing the requirements of different groups together in a common content management service platform, because of differences in metadata structure. 'Non Relational Databases' are a good fit for this problem as well.

Mergers and Acquisitions – Enterprises face huge challenges during M&A, as they need to consolidate systems catering to the same functions. 'Non Relational Databases' can be used to solve this problem, either by quickly putting together a temporary common data store or even by architecting the future data store, which can accommodate the structures of the existing common applications of the merging companies.
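
To make the front end order processing use case from the list above a little more concrete, here is a minimal Python sketch of the capture-now, reconcile-later flow. Everything in it is an assumption for illustration: OrderCaptureStore, reconcile, the channel names and the payload fields are hypothetical, and a plain dictionary stands in for both the non-relational capture store and the source-of-truth back end.

```python
import uuid


class OrderCaptureStore:
    """Hypothetical schema-free capture store: accepts whatever each channel sends."""

    def __init__(self):
        self.documents = {}

    def capture(self, channel, payload):
        # No fixed schema: each channel may send a differently shaped payload.
        order_id = str(uuid.uuid4())
        self.documents[order_id] = {"channel": channel, "status": "captured", **payload}
        return order_id

    def pending(self):
        return [oid for oid, doc in self.documents.items()
                if doc["status"] == "captured"]


def reconcile(store, system_of_record):
    """Batch step: copy captured orders to the back end, then mark them reconciled."""
    for order_id in store.pending():
        doc = store.documents[order_id]
        system_of_record[order_id] = {k: v for k, v in doc.items() if k != "status"}
        doc["status"] = "reconciled"


store = OrderCaptureStore()
web_id = store.capture("web", {"sku": "A-100", "qty": 2, "coupon": "SPRING"})
sms_id = store.capture("sms", {"text": "ORDER A-100 x1"})

back_end = {}
pending_before = len(store.pending())  # both orders are waiting
reconcile(store, back_end)
pending_after = len(store.pending())   # nothing left to reconcile
```

Between capture and reconciliation the back end does not yet know about the orders, which is the window of eventual consistency the use case accepts.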

But how exactly can we articulate the business benefits of using 'Non Relational Databases' over traditional RDBMS solutions? Following are some key benefits, which can be drawn from the core characteristics of 'Non Relational Databases' (as discussed in the previous section), along the lines of the core parameters of any enterprise IT decision – Cost Reduction, Better Turnaround Time and Superior Quality.

Business Agility – Less Turnaround Time

'Non Relational Databases' can help in creating business agility in two basic ways.

The schema-free logical data model helps accommodate business change with a faster turnaround time and the least impact on existing applications and functionality. In most cases the migration effort for a change would be almost zero.
Horizontal scalability brings the inherent promise of supporting more and more user load, whether from seasonal load variation or a sudden change in usage pattern. A horizontally scalable architecture is also the first step towards an SLA-based setup like the cloud, which essentially ensures business continuity in varying usage situations.
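
One common way to let a cluster grow without reshuffling all of its data is consistent hashing. The article does not name this technique, so the Python sketch below is an illustration of the general idea only; the HashRing class, the node names and the replicas parameter are hypothetical. Each node is placed at many points on a hash ring and a key belongs to the next point clockwise, so adding a node takes over only the keys that now fall on its points.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per physical node
        self._ring = []           # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def node_for(self, key):
        if not self._ring:
            raise ValueError("ring has no nodes")
        positions = [pos for pos, _ in self._ring]
        # First point clockwise from the key, wrapping around at the end.
        idx = bisect.bisect(positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"order:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by one node
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in moved now lives on node-d, and the keys that stayed put need no migration at all, which is what makes adding capacity to a running cluster comparatively cheap.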

Better End User Satisfaction – Superior Quality

In today's enterprise IT the quality of applications is primarily decided by end user satisfaction. 'Non Relational Databases' can help achieve it by addressing the following concerns of end users, which are the most frequent and the most difficult to handle.

'Non Relational Databases' bring opportunities to improve application performance drastically. Distributing data spreads disk I/O across nodes, so the 'seek' rate is far less likely to be the bottleneck in application performance; performance is governed more by the 'transfer' rate. On top of that, most of the solutions support new generation techniques for faster computation such as MapReduce, sorted columns, Bloom filters, append-only B-trees, memtables etc.
The other important aspect of user satisfaction today is availability. End users want to access applications as and when they want, and at the very least want to be able to do their job whenever they get the time. So an application being unavailable is something to be avoided at any cost. Most of today's 'Non Relational Databases' are geared to support this type of availability requirement through a choice between strict and eventual consistency.
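
Of the techniques named above, the Bloom filter is easy to show in a few lines. The Python sketch below is a generic textbook version, not the implementation used by any particular product; the class name, sizes and key format are hypothetical. It answers "is this key definitely absent?" from memory, so a store can skip reading a disk file that cannot contain the key. It may occasionally report a key as possibly present when it is not, but it never misses a key that was added.

```python
import hashlib


class BloomFilter:
    """Hypothetical minimal Bloom filter over string keys."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))


row_keys = ["row:alpha", "row:beta", "row:gamma"]
segment_filter = BloomFilter()
for key in row_keys:
    segment_filter.add(key)

hits = [segment_filter.might_contain(key) for key in row_keys]
```

The trade-off is tunable: more bits and a suitable number of hash functions lower the false positive rate at the cost of memory.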

Lesser Total Cost of Ownership

In today's competitive marketplace, where enterprise IT expenditure is scrutinized every now and then, achieving the right quality at the right cost is the mantra. 'Non Relational Databases' outperform conventional databases in that area to a considerable extent, especially when the data volume to be stored and handled is high.

The basic premise of horizontal scalability ensures that they can run even on commodity machines. This reduces not only the hardware capital cost but also operating costs like electricity, maintenance etc. It further ensures readiness to utilize low-cost next generation infrastructure like cloud, virtualized data centers etc.
In the long run one gets further operating cost benefits from a lower maintenance burden. This is especially the case when an RDBMS needs to store a high volume of data. Tuning an RDBMS to be fast at high data volumes is an art and often needs specialized skill, which comes at a cost. In comparison, 'Non Relational Databases' are designed to provide a fast and even response characteristic as the data grows by leaps and bounds. Indexing and caching behave in the same way. Developers need to worry less about hardware, disks, re-indexing, file layout etc. and can instead spend more time on application programming.

But there are Challenges in Enterprise Adoption

Despite all these long-term benefits, there are surely challenges facing enterprises before they can embrace 'Non Relational Databases'.

Apart from the high-level resistance due to existing mindset and lack of confidence, the top tactical challenges I see today are –

Identification of right Applications/Usage Scenarios for 'Non Relational Databases'

Though it is easy to prove theoretically that not all enterprise data needs a relational, ACID-based system, the years of bonding between RDBMS and enterprises make it difficult to decide which data can loosen up and move towards non-relational solutions. Most of the time IT managers (and other ground-level people with core bottom-line responsibility for the applications) don't have a clear idea of what they are going to lose, and that apprehension makes them averse to moving away from the RDBMS. Data is the most valuable asset of enterprise IT. So the ability to decide to manage it with a solution that is not that well understood or widely used needs a different type of mindset as well as big support (and push) from senior management.

How do we select the right product/solution which will suit us the most

The next biggest challenge is to identify the right product or tool to use as the provider of 'Non Relational Databases'. As mentioned before, there are more than 25 different products and solutions available in the industry today, with different characteristics across the four dimensions. Since every product differs along these four dimensions, it is typically very difficult to select one product that addresses all needs. Sometimes this has even led to the use of multiple types of non-relational database across different groups of an enterprise, and eventually people have turned back to the RDBMS for the sheer need of standardization.

How do we get economy of scale

This thought essentially stems from the previous one. If an organization needs to use multiple non-relational database solutions (because no single one fits), ensuring economy of scale in terms of skills (developers, administrators, support personnel), infrastructure (hardware cost, software licensing cost, support cost, consulting cost) and artifacts (common components and services) is a big question. Compared with a traditional RDBMS solution the issue looks really significant, as most of the time organizations run their data stores in a shared service mode.

How do we ensure portability of solution

Given the formative state of the 'Non Relational Databases' world, it is natural to anticipate that the coming years will bring many changes in this space in terms of vendor consolidation, feature advancement and standardization. So the better strategy for an enterprise would be not to bet on a particular product available today, so that it can move easily to a better and proven product in the future. Given the current landscape of non-relational products, which mostly work in a proprietary way, portability becomes an important issue to consider before IT decision makers venture into the 'Non Relational Databases' space. This is for the sheer need of protecting their current investment.

How do we get right type of production support

Not many of the 'Non Relational Databases' today have a support offering in place through external organizations. Even those that have one cannot be compared with big names like Oracle, IBM or Microsoft. Support around data recovery, backup and ad hoc data fixing in particular is always a big question in the minds of enterprise decision makers, as many of the 'Non Relational Databases' don't provide a robust and easy-to-use mechanism for these problems.

How do we budget for the overall cost

In comparison to the Big Iron RDBMS solutions, 'Non Relational Databases' typically provide very little data on their performance and scalability characteristics. I'm yet to see any benchmarking figure from the TPC or equivalent bodies. This leaves enterprise decision makers in a 'no clue' situation, where they don't know how much money they need to spend on hardware, software licenses, infrastructure management and support. This is a big hindrance to deriving a budgetary estimate. Hence, most of the time the decision goes in favor of the known RDBMS-based solutions at the initial stage itself.

Sometimes, even if the numbers are available, they may not be sufficient to feed a TCO model that compares a typical RDBMS-based data store with a non-relational data store in an overall (Capex + Opex) cost analysis. Often the high number of hardware boxes (along with software license and support costs) required in a horizontal scalability situation makes people jittery at first glance compared to a vertical scaling based solution, unless the benefit is substantiated with an overall comparison based on a TCO model.
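
The shape of such a Capex + Opex comparison can be sketched in a few lines of Python. The function and its cost categories are a simplification assumed for illustration, and every figure below is an invented placeholder chosen only to show the arithmetic; none of it reflects real prices or any vendor's numbers, and the outcome depends entirely on the inputs an enterprise plugs in.

```python
def total_cost_of_ownership(nodes, hardware_per_node, license_per_node_year,
                            ops_per_node_year, support_per_year, years):
    """Simplified TCO: one-off hardware spend plus recurring yearly costs."""
    capex = nodes * hardware_per_node
    yearly_opex = nodes * (license_per_node_year + ops_per_node_year) + support_per_year
    return capex + years * yearly_opex


# Placeholder inputs (hypothetical): a few large servers, scaled vertically.
scale_up_tco = total_cost_of_ownership(
    nodes=2, hardware_per_node=250_000, license_per_node_year=40_000,
    ops_per_node_year=20_000, support_per_year=50_000, years=3)

# Placeholder inputs (hypothetical): many commodity boxes, scaled horizontally.
scale_out_tco = total_cost_of_ownership(
    nodes=12, hardware_per_node=6_000, license_per_node_year=0,
    ops_per_node_year=5_000, support_per_year=60_000, years=3)
```

With these placeholder inputs the first call works out to 500,000 + 3 × 170,000 = 1,010,000 and the second to 72,000 + 3 × 120,000 = 432,000. The point of the sketch is that the box count alone (2 versus 12) says little until it is put through the whole model.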

My 2 Cents on how to go about the adoption

So does that mean that enterprises should just wait and watch the NoSQL movement at this point in time? Not really. It is true that 'Non Relational Databases' today are at a nascent stage for large-scale adoption by enterprises. But their sheer potential to shape the enterprise of the future should not be missed. This is especially true given that enterprises in the near future are going to deal more with high volumes of semi-structured or unstructured, eventually consistent data than with significantly lower volumes of tightly structured, ACID-abiding data. So what is important today is at least to start developing mindshare among the key stakeholders of the enterprise on the need to use 'Non Relational Databases' for enterprise data handling. In that journey, taking some incremental steps towards 'Non Relational Databases' around the key aspects of enterprise IT (Technology, People and Process) is going to make sense. That can help address the challenges identified above holistically, in a slow but steady way.

Adopt one product/solution

There are plenty of choices available in the market today, which deal with the different dimensions of 'Non Relational Database' solutions in different ways. At the same time, the use case scenarios of an enterprise may demand different types of characteristics. But going for different solutions for different applications or usage scenarios will not work for an enterprise from the perspective of economy of scale. So it is better to settle on one, depending on the target applications. Remember that most solutions offer some workaround for features that are available in other products, and have a placeholder for them in the roadmap. Also, most of the products will attain some maturity in the near future, at which point they can provide different solutions through configuration. So as long as a solution can cater to the majority of the needs, it can be an option to start with.

The rules of thumb for selecting a product/solution are

Give more weight to support for the required logical data model. This will essentially decide how easily the solution can fit different business needs, today or in the future.
Investigate the suitability of the physical data model supported by the product to get a sense of the possibilities for horizontal scaling along with availability, consistency and partition tolerance, according to the needs of the solution. This also dictates the possible backup and recovery mechanisms.
Interface support has to be aligned with the enterprise Standard Operating Environment. Given the variety of interfaces supported by these products, this can be easily addressed.
The choice of persistence model does not matter much as long as the product supports horizontal scalability.

Here is a comparison of a set of short-listed 'Non Relational Databases'. This can be a good starting point for enterprises that are thinking of serious adoption right now. To make sense in an enterprise context, the filter criteria primarily used while short-listing this subset from the huge superset of 25+ choices are –

For most enterprise applications, support for reasonably complex data structures is a must. Otherwise the responsibility placed on application programs to manage the complexity becomes huge. The way I see it, the model has to be something in between a plain key/value store and a relational schema. In that respect, products like Voldemort, Tokyo Cabinet etc. drop out of my list.
Secondly, it has to support large data volumes horizontally through shards/partitions at low cost. Absence of this support makes the solution the same as any RDBMS. On that basis, products like Neo4j (though it has a very rich graph-based model), Redis and CouchDB get filtered out of my list at this point in time.
As a last criterion, I would look for commercial support of some form before using a product at the enterprise level. Otherwise, whom am I going to call in case of a production problem? This takes out of my list today's hot products like Cassandra (though there is a good chance that either Rackspace or Cloudera will soon provide some support for it, given that it is already being used in production-grade environments at Twitter, Digg and Facebook).

With these filter criteria, the ones I could short-list for an enterprise to use right now are MongoDB (shard support is coming shortly in the next version), Riak, Hypertable and HBase. The following table summarizes the key characteristics of these four options. An enterprise can, based on its own detailed requirements, consider whichever of the four has the characteristics that best fit its needs.

Features	MongoDB	Riak	HyperTable	HBase
Logical Data Model	Rich Document with support for Nested Document	Rich Document	Column Family	Column Family
Support for CAP	CA	AP	CA	CA
Dynamic Addition/Removal of Node	Supported (Coming shortly in next release)	Supported	Supported	Supported
Multi DC support	Supported	Not Supported	Supported	Supported
Interface	Variety of Language specific APIs (Java, Python, Perl, C# etc.)	JSON over HTTP	C++, Thrift	REST, Thrift, Java
Persistence Model	Disk	Disk	Memory + Disk (Tunable)	Memory + Disk (Tunable)
Comparative Performance	Better (Written in C++)	Best (Written in Erlang)	Better (Written in C++)	Good (Written in Java)
Commercial Support	10gen.com	Basho Technologies	Hypertable Inc	Cloudera

Build abstraction for data access

Building a separate abstraction layer for accessing data from 'Non Relational Databases' is a must. It provides benefits in a number of ways. Firstly, application developers can be completely insulated from the underlying details of the solution, which helps in scaling in terms of skills. It also makes it easier to change the underlying solution in the future if needed. And it can be used to cater to the requirements of multiple applications in a standard way (a la SQL without complex features like Join, Group By etc.).
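
One possible shape for such an abstraction layer is sketched below in Python. The interface DocumentStore, its three methods and the in-memory implementation are assumptions made for illustration, not an API of any of the products discussed. Application code depends only on the abstract interface, so swapping the underlying product would mean writing another implementation of the same three methods.

```python
from abc import ABC, abstractmethod


class DocumentStore(ABC):
    """Hypothetical product-neutral interface the applications code against."""

    @abstractmethod
    def put(self, collection, key, document):
        """Store a document under a key."""

    @abstractmethod
    def get(self, collection, key):
        """Return the document for a key, or None if it is absent."""

    @abstractmethod
    def find(self, collection, **criteria):
        """Return documents whose fields equal all the given criteria."""


class InMemoryDocumentStore(DocumentStore):
    """Stand-in back end; a product-specific adapter would replace this."""

    def __init__(self):
        self._data = {}

    def put(self, collection, key, document):
        self._data.setdefault(collection, {})[key] = dict(document)

    def get(self, collection, key):
        return self._data.get(collection, {}).get(key)

    def find(self, collection, **criteria):
        return [doc for doc in self._data.get(collection, {}).values()
                if all(doc.get(field) == value for field, value in criteria.items())]


def open_orders_for(store: DocumentStore, customer_id):
    """Application code: knows the interface, not the product behind it."""
    return store.find("orders", customer=customer_id, status="open")


store = InMemoryDocumentStore()
store.put("orders", "o-1", {"customer": "c-42", "status": "open", "total": 120})
store.put("orders", "o-2", {"customer": "c-42", "status": "shipped"})
store.put("orders", "o-3", {"customer": "c-7", "status": "open", "gift_note": "Happy birthday"})
open_orders = open_orders_for(store, "c-42")
```

The find method deliberately supports only equality filters, in the spirit of a restricted query interface without joins or grouping.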

Create Model for Performance & Scalability

Irrespective of the solution chosen, modeling its scalability and performance characteristics using standard techniques (like Queuing Network Models, Layered Queuing Networks etc.) is highly recommended. This provides the data needed for basic server sizing and topology, and also for the overall cost of software support licenses, administration etc. It essentially becomes the primary data for all budgetary purposes, which helps in decision making.
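
As a taste of what such modeling looks like, the Python sketch below uses the simplest queuing building block, a single M/M/1 queue, whose mean response time is 1 / (service rate − arrival rate). It is a deliberately crude stand-in for a full queuing network model: it assumes each node behaves as an independent M/M/1 queue and that load is split evenly across nodes. The function names and the workload figures are hypothetical placeholders, not measurements of any product.

```python
import math


def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue; infinite if the node is saturated."""
    if arrival_rate >= service_rate:
        return math.inf
    return 1.0 / (service_rate - arrival_rate)


def nodes_needed(total_arrival_rate, service_rate, target_response_time,
                 max_nodes=1000):
    """Smallest node count meeting the target, assuming an even load split."""
    for n in range(1, max_nodes + 1):
        per_node_rate = total_arrival_rate / n
        if mm1_response_time(per_node_rate, service_rate) <= target_response_time:
            return n
    raise ValueError("target not reachable within max_nodes")


# Placeholder workload (hypothetical): 5000 requests/s in total, each node can
# serve 800 requests/s, and the mean response time should stay within 5 ms.
required_nodes = nodes_needed(5000, 800, 0.005)
utilization = (5000 / required_nodes) / 800
```

With these placeholder inputs the answer is 9 nodes: at 8 nodes each node sees 625 requests/s and the mean response time is 1/175 s (about 5.7 ms), while at 9 nodes it drops to about 4.1 ms. A node count like this can then feed the hardware, license and support lines of a budget.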

Build Explicit Redundancy

There is no way to protect against data loss other than replicating data to a backup server. Though many of the 'Non Relational Databases' provide automatic replication, some also have the master node as a possible single point of failure. So it is better to protect your data with a secondary backup and also to have a set of scripts ready for data recovery and automatic data fixes. It is therefore important to understand the physical data model of the target solution, identify the options for possible recovery mechanisms, and evaluate whether those options fit the overall enterprise requirements and practices.

Build Common Data Services Platform

As with common shared-service RDBMS databases, a common data service for 'Non Relational Databases' can be built to achieve economy of scale in terms of infrastructure and support needs. This will also help in evolving and changing it for the better in the future. It should be the final goal on the wish list, a maturity level to achieve in the mid or long term. However, having this vision from the initial days will help in taking the right decisions along the overall journey.

Foster Enterprise Task Force

Every organization has a set of people who have a zeal for learning new and unconventional things. Forming a group of such hand-picked people (full time and part time) to keep tabs on what is going on in this space, the known issues and challenges, and next generation thinking will help provide direction to the projects that use this technology. This group can also help decision makers by demystifying the hype and providing them with actual data points.

Develop relationship with product community

After adopting a product, it makes sense to develop a relationship with the product community so that each can help the other be successful. Most of today's 'Non Relational Databases' have a vibrant community that is more than eager to help others. A thriving relationship between the enterprise and the community benefits both parties. Knowing the problems and solutions beforehand can help the enterprise in taking decisions on features or versions. The enterprise can also influence the product roadmap with features that make sense for it as well as for the general community. On the other hand, the community gets to know the actual ground-level issues, which helps make the product robust and feature rich. Success stories with big enterprises also help the community stay ahead of the curve.

Go Iterative

Given the relative immaturity of 'Non Relational Databases', the only way to adopt them with minimal risk is to follow an iterative development methodology. The vision of building a common data service platform for 'Non Relational Databases', along with a standardized data access abstraction, is not going to happen in a big bang. Working in an iterative, refactoring-oriented mode will serve better in achieving it. In this type of technology journey with less mature solutions, changing the solution midway is not uncommon. The agile way of seeing things also helps create the mindset for absorbing rework, for management as well as implementers.

However, to go iterative with this problem it is very important to define a decision criteria matrix: for example, guidelines (with examples) on whether the object model of an application fits better in the RDBMS or the non-RDBMS space, guidelines for infrastructure sizing, a list of mandatory test cases etc.

The end note

In enterprise adoption of 'Non Relational Databases', the most challenging task is changing the mindset of enterprise decision makers: making them believe that not all data and objects are suitable for an RDBMS. The best way to prove that is by trying out 'Non Relational Databases' on the right type of use cases, demonstrating how they can be a more effective solution than an RDBMS when used in the right context. Identify a few 'not so business critical' (but high visibility) projects where 'Non Relational Databases' can be a good fit. The success (or even failure) of these projects will help change the mindset. It will also help in learning more about what needs to be done differently to adopt 'Non Relational Databases' in a better way. These baby steps of trying things out are indeed the need of the hour if enterprises want to reshape their information management world in the near future using 'Non Relational Database' technologies.

About the Author

Sourav Mazumder currently works as Principal Technology Architect for Infosys Technologies Limited and has more than 14 years of experience in the Information Technology domain. As a key member of the Technology Consultancy group of Infosys, Sourav has worked for key clients of Infosys in the USA, Europe, Australia and Japan in various domains such as Insurance, Telecom, Banking, Retail, Security, Transport and the Architecture/Engineering/Construction industry. He has been involved in technical architecture and roadmap definition for web-based applications, SOA strategy implementation, internationalization strategy definition, UI componentization, performance modeling and scalability analysis, and unstructured data management. Sourav's association with Infosys' own core banking product, Finacle, also gave him extensive product development experience. Sourav was also involved in developing a reusable framework for J2EE applications at Infosys and in defining Infosys' software engineering methodology for architecting and designing custom-built applications. Sourav's experience also includes ensuring architecture compliance and governance for development projects.

Sourav is an iCMG certified Software Architect as well as a TOGAF 8 certified practitioner. Sourav recently presented at the Berkeley Globalization Conference of LISA. Sourav's latest white paper on SOA has become immensely popular among various reading communities.

Sourav's current areas of interest include NoSQL, Web 2.0 Governance, Performance Modeling and Globalization.
