Key Takeaways
- Learn how data modeling for NoSQL databases differs from data modeling in the relational database space
- Understand the steps in a typical NoSQL data modeling process
- Deliverables and artifacts in NoSQL data modeling efforts
- Multi-model NoSQL database modeling best practices
- How to model data for transactional and analytics use cases in NoSQL databases
NoSQL databases are specialized to store different types of data, such as key-value, document, column-family, time-series, graph, and IoT data. Pascal Desmarets, CEO of Hackolade, talks about how data modeling for NoSQL databases compares to modeling for relational databases.
InfoQ: How is data modeling for NoSQL databases different from modeling for Relational databases?
Pascal Desmarets: For years, the terms "schemaless," "schema-on-read," and "non-relational" have given the impression that no data modeling is necessary for NoSQL databases. As soon as a little complexity is involved, it quickly becomes apparent that data modeling is not only useful, it is actually more important than with relational databases. Indeed, the flexibility and power of JSON (used to store data in document NoSQL databases) are such that one can quickly get into trouble without organization and care. Too much flexibility, and data becomes inconsistent and hard to query. No rigor in the approach, and data can even become inaccurate.
The JSON-based, dynamic-schema nature of NoSQL is a fantastic opportunity for application developers: the ability to start storing and accessing data with minimal effort and setup, flexibility, and fast, easy evolution. But while flexibility brings power, it also brings dangers for designers and developers who are new to NoSQL or less experienced. This is why the NoSQL database vendors counter their marketing departments' simplicity message by devoting countless pages, blogs, and videos to the subject of schema design (e.g., MongoDB, DynamoDB, Couchbase, Cassandra, etc.). Data modeling for NoSQL databases is different from modeling for relational databases in two ways: first, because data modeling is not yet a habit among NoSQL developers, and second, because modeling nested objects in JSON is not a trivial exercise.
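As a minimal illustration of the danger Desmarets describes, consider two documents written to the same hypothetical products collection by teams that never agreed on a model (all field names and values below are invented for the example):

```python
# A sketch of schema drift in a document store: both documents live in the
# same hypothetical "products" collection, but were written by different
# services with no shared model.

doc_v1 = {
    "_id": 1,
    "name": "Widget",
    "price": 19.99,           # number
    "tags": ["hardware"],
}

doc_v2 = {
    "_id": 2,
    "productName": "Gadget",  # different field name for the same concept
    "price": "24.99",         # same field, but now a string
    "tag": "hardware",        # scalar instead of array
}

# A query like {"price": {"$lt": 25}} silently skips doc_v2, because string
# values don't match numeric range operators -- the data has become
# inconsistent and hard to query.
```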
InfoQ: What are the steps in a typical NoSQL data modeling process?
Desmarets: When it comes to modeling with NoSQL, the biggest difference is that the rules of normalization don't apply. On the contrary, given the low cost of storage, it is now encouraged to denormalize information and repeat data as much as needed to satisfy performance and context needs. Understanding that NoSQL lets data be joined "on write," data modeling needs to be approached in terms of how the data will be accessed and queried. As for the data modeling process, it all depends on the degree of rigor adopted by the team, while leveraging an agile approach so the schema can quickly evolve as needed. Some people will feel comfortable enough, with a clear vision in their head, to go straight to the physical design of the database and iterate quickly in a trial-and-error mode.
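To make the query-first mindset concrete, here is a hedged sketch of a single denormalized document serving a hypothetical "order details" screen in one read; the customer and product names are deliberately copied in at write time, the "join on write":

```python
# A sketch of query-first design: everything the hypothetical "order
# details" screen needs lives in one document, so the page is served with
# a single read and no joins. Names are duplicated on purpose.

order = {
    "_id": "order-1001",
    "customer": {  # copied from the customers collection at write time
        "id": "cust-42",
        "name": "Ada Lovelace",
    },
    "lines": [
        {"productId": "sku-7", "name": "Widget", "qty": 2, "unitPrice": 19.99},
        {"productId": "sku-9", "name": "Gadget", "qty": 1, "unitPrice": 24.99},
    ],
    "total": 64.97,
}
```

The duplication is the point: the document is shaped by the screen that reads it, not by normalization rules.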
The more formal process when starting a new application, based on user requirements, is similar to the one for relational databases, with only the physical modeling being specific to the data store technology. The steps usually start with some sort of conceptual (or domain-driven) design, evolve into logical modeling, and end with the physical storage model, which is specific to the technology used. This last step is where NoSQL vastly differs from traditional relational databases.
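A small sketch can show where the physical step diverges. Starting from the same logical model, a hypothetical one-to-many relationship between Author and Book, the relational design splits the data across joined tables while the document design folds it into one nested structure:

```python
# Logical model (technology-agnostic): Author 1--N Book.

# Physical design for a relational database: two tables, a foreign key,
# and a join at query time.
relational = {
    "authors": [{"author_id": 1, "name": "Ursula K. Le Guin"}],
    "books":   [{"book_id": 10, "author_id": 1, "title": "The Dispossessed"}],
}

# Physical design for a document database: the relationship is folded
# into one nested document, shaped by the expected queries.
document = {
    "_id": 1,
    "name": "Ursula K. Le Guin",
    "books": [{"title": "The Dispossessed", "year": 1974}],
}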
InfoQ: NoSQL databases are specialized to store different types of data, such as key-value, document, column-family, time-series, graph, and IoT data. How should data modeling be done for these diverse databases?
Desmarets: The different types of NoSQL databases are quickly converging into multi-model databases. Key-value stores now store JSON in the value part, with some able to index deeply nested attributes. Document databases are adding graph capabilities. And graph databases take the view that every attribute can be modeled as a relationship. IoT is definitely well handled, depending on the complexity of the data, by either key-value or document databases. Despite this convergence, each vendor still has its specific strengths and weaknesses, and it is important to choose the right tool for the job. In terms of data modeling, the only type that calls for a different approach is graph databases.
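As one illustration of that convergence, a document store can index an attribute buried several levels deep. The sketch below assumes MongoDB accessed through pymongo; the connection URI, collection, and field names are hypothetical:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
devices = client["iot"]["devices"]

# Dot notation reaches into nested sub-documents, so an attribute several
# levels deep can still be indexed and queried efficiently.
devices.create_index([("specs.battery.capacity_mah", ASCENDING)])

# The index now serves range queries on the nested attribute directly.
low_power = devices.find({"specs.battery.capacity_mah": {"$lt": 1000}})
```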
That's because the number of relationships can quickly make a visual, graphical model hard to read. But some data modeling vendors are now starting to tackle the issue and will soon bring useful tools to market to make developers' lives easier.
InfoQ: NoSQL databases are used for managing data for transactional as well as analytics use cases. What's the best way to model the data for these diverse and sometimes conflicting use cases?
Desmarets: The type of data being stored, and the storage size, should be the main criteria for choosing a solution. Completely unstructured, human-produced text to be tagged and indexed leads to completely different requirements than machine-produced data such as web logs, IoT readings, or operational data. Analytics databases for big data still favor a more relational storage approach, given that write performance needs to be very high and it is hard to predict how the data will be queried. They can therefore use the traditional data modeling tools that have been around for a long time. Transactional big data stores, on the other hand, can truly leverage the polymorphic nature of JSON and therefore use the new data modeling tools designed just for that.
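The contrast can be sketched with two sample records: a flat, uniform machine-produced log line that fits a fixed analytics schema, next to polymorphic transactional documents that share a collection but not a shape (all values below are invented):

```python
# Machine-produced analytics records tend to be flat and uniform, which
# suits a fixed, relational-style schema:
web_log = {
    "ts": "2024-01-15T12:00:00Z",
    "ip": "203.0.113.7",
    "path": "/checkout",
    "status": 200,
    "latency_ms": 42,
}

# Transactional documents can be polymorphic: these two records share a
# collection and a few common fields, but each variant carries its own shape.
card_payment = {
    "_id": "pay-1", "amount": 64.97, "method": "card",
    "card": {"network": "visa", "last4": "4242"},
}
wire_payment = {
    "_id": "pay-2", "amount": 500.00, "method": "wire",
    "wire": {"iban": "DE89370400440532013000", "reference": "INV-77"},
}
```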
InfoQ: Can you talk about multi-model NoSQL databases and data modeling best practices?
Desmarets: The scope of the "multi-" part of "multi-model" differs from vendor to vendor. When multi-model is a defensive move by a database vendor to tick a box on a marketing checklist, the impact on data modeling is fairly minimal, because the way data is stored does not vary much. But some enterprise vendors have a real capability to support different storage models, for example, the ability to store XML with indexing of human-produced text, along with RDF graphs and JSON documents. Obviously, each storage type has its own requirements, and there is no single way to approach all of them. A data modeling tool for such multi-model databases therefore needs to be quite flexible.
InfoQ: What are the typical deliverables and artifacts of NoSQL data modeling process?
Desmarets: At a minimum, a good data modeling tool for NoSQL databases needs to generate vendor-specific forward-engineering scripts and documentation of all the entities, attributes, constraints, and relationships. It also needs to be able to perform reverse-engineering natively and to model the polymorphic nature of JSON. The documentation is critical to facilitate the dialog between the different stakeholders involved in the development and maintenance of an application, most of whom don't like to read code to understand structure: business analysts, designers, architects, DBAs, and developers. The forward-engineering scripts help developers generate code compliant with the elaborated models, according to the syntax of each vendor. All of this should be done in a way that contributes to proper data governance for the enterprise.
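As a hedged example of what such forward-engineering output might look like, the snippet below creates a MongoDB collection with a $jsonSchema validator of the kind a modeling tool could derive from an agreed model; it assumes pymongo, and the collection and field names are illustrative:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

# A vendor-specific artifact: the model's entities and constraints become
# a validation rule enforced by the database on every write.
db.create_collection("orders", validator={
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["customer", "lines", "total"],
        "properties": {
            "customer": {"bsonType": "object",
                         "required": ["id", "name"]},
            "lines": {"bsonType": "array", "minItems": 1},
            "total": {"bsonType": "double"},
        },
    }
})
```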
InfoQ: What are some best practices of data modeling in NoSQL databases?
Desmarets: Particularly after years, even decades, of relational database design experience, it is quite hard to forget the habits of normalization. It is not only OK, it is even encouraged, to denormalize and repeat data if it helps performance "on read." Some of the options to consider are embedding (which can be viewed as a join performed at time of storage), referencing, and two-way referencing. A data modeler should think in terms of queries instead of in terms of storage: what are all the pieces of information needed to serve an application screen in just one disk access, without any joins?
That structure will dictate how to store the data. And if a piece of data needs to be propagated to the other places where it is denormalized, a background process can take care of it so the data becomes eventually consistent across the database. By the way, the term "eventually" is often misinterpreted and should be put in perspective: a few milliseconds, or even a few seconds, are probably more than acceptable for most applications. Think about the alternative: do you prefer to present a 404 error on a web page, risking a missed e-commerce sale or a customer lost forever because your database choked on a complex join, or to be sure to present data, even if it is not 100% consistent? A data modeling tool will help you design more resilient NoSQL databases.
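The options Desmarets names, and the background propagation, can be sketched in a few lines, assuming MongoDB via pymongo (all collection and field names are illustrative):

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

# 1. Embedding: the "join" happens at write time; reads need one access.
db.orders.insert_one({"_id": "o1",
                      "customer": {"id": "c1", "name": "Ada Lovelace"}})

# 2. Referencing: store only the key; the application resolves it on read.
db.orders.insert_one({"_id": "o2", "customerId": "c1"})

# 3. Two-way referencing: each side points at the other.
db.customers.insert_one({"_id": "c1", "name": "Ada Lovelace",
                         "orderIds": ["o1", "o2"]})

# Background propagation: when the customer is renamed, a deferred job
# rewrites every embedded copy -- consistent within seconds, not instantly.
db.orders.update_many({"customer.id": "c1"},
                      {"$set": {"customer.name": "Ada King"}})
```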
InfoQ: Are there any gotchas developers should remember when working on data modeling tasks?
Desmarets: The goal of a data modeling tool for NoSQL is not to counter, but rather to leverage, the flexibility and power of JSON, with its polymorphism and evolving nature. Therefore, a data modeler should not try to control developers, but rather should become a facilitator, running "what if" scenarios and walking through the various options to avoid rework down the road. The data modeling documentation should become a communication tool for better teamwork between the different stakeholders, to achieve a better fit to business needs, higher quality, and a lower total cost of ownership for any NoSQL application.
About the Interviewee
Pascal Desmarets is the Founder and CEO of Hackolade. He leads the company and all efforts involving business strategy, product innovation, and customer relations, as it focuses on producing user-friendly, powerful visual tools to smooth the onboarding of NoSQL technology in corporate IT landscapes. Hackolade's software combines the comfort and simplicity of graphic data modeling with the power of NoSQL document databases, resulting in reduced development time, increased application quality, and lower execution risks. Prior to Hackolade, he founded and was the CEO of IntegrIT SA/NV, an IT strategy and consulting firm that provided information architecture, innovation solutions, and scalable systems integration for large organizations. Pascal was also the CIO of ARTISTdirect, Inc., a publicly held startup that was one of the web's original music destinations and an early innovator in digital music and online commerce.