InfoQ

News

Data normalization, is it really that good?

Posted by Arnon Rotem-Gal-Oz on Aug 13, 2007 06:10 PM

Community
Architecture
Topics
Data Access,
Design
Tags
Database
Normalization is one of the corner-stones of database design. Recently some discussion emerged on the need for normalization suggesting denormalization as a more scalable solution.

A couple of weeks ago Pat Helland blogged about a short presentation he gave named "Normalization is for Sissies". Pat started by stating (the obvious) that we normalize to eliminate update anomalies. but then proceeded to make a couple interesting observations:
  • Many types of computing problems are append only in nature or in his words "Accountant don't use erasers". Hence there is no need to normalize immutable data (unless you really need the space)
  • Developers work with "complete" business objects. These Business Objects are denormalized complete views anyway
It's easy to dismiss Pat as just another blogger - until you remember that Pat has a long history with database systems including for example servicing as the chief architect for SQL Server's Service Broker and as one of the co-founders of the COM+ team. Pat also spent a couple of years or so working for Amazon - and indeed Werner Vogels (Amazon's CTO) mentioned that you can consider denormalizing entities to get better scalability and performance in his QCon presentation on Availability and Consistency (recently published here).

Andres Aguiar commented that in addition to the "no need to normalize immutable data", you don't even need to delete:
Another [idea] is that you actually don't need to delete/update the database. 'Deleting' a row means setting a InvalidationTimestamp = now(), and updating a row means setting InvalidationTimestamp = now() and inserting a new row with SinceTimestamp = now() and InvalidationTimestamp = null (you actually need two sets of dates, but that's for another post). Now, if you put the two ideas together, all the data is immutable, so you don't need to normalize anything.
Analyzing this, one barrier Andres mentioned was that databases are not ready for the large number of columns per table such a scenario will generate, which is solvable if you apply archiving to older data.

Dare Obasanjo thinks that
"Database denormalization is the kind of performance optimization that should be carried out as a last resort after trying things like creating database indexes, using SQL views and implementing application specific in-memory caching. "
However Dare agrees that when you need massive scale you probably need to denormalize and give Flickr as an example.

As Cal Henderson (Flickr's web development lead) mentioned in a presentation he gave in 2004 on Flickr architecture that Joins are slow (slide 27) and as a result:
* Normalised data is for sissies
* Keep multiple copies of data around
* Makes searching faster
* Have to ensure consistency in the application logic
A 2004 blog post by Jason Kottke, which bares the same title like Pat's post and Cal's first point discusses Cal's presentations. Jason explains that in Flickr's case there is a ration of 13 Selects to each Insert/Delete or Update which is why they chose to denormalize.

To sum up as Pat finished his presentation:
    "people normalize 'cuz their professor said to"
Is it time yet to rethink the mantra of database normalization? Some of the larger web-sites seem to think so, what's your take?

16 comments

Reply

Spelling, Grammar by dddq dddq Posted Aug 13, 2007 8:25 PM
YMMV by Porter Woodward Posted Aug 13, 2007 9:04 PM
Have it both ways by Tero Vaananen Posted Aug 13, 2007 10:10 PM
Re: Have it both ways by Vikas Hazrati Posted Aug 13, 2007 11:35 PM
Re: Have it both ways by Tero Vaananen Posted Aug 14, 2007 12:42 PM
correct terminology makes a difference... by Ted Thibodeau Jr Posted Aug 15, 2007 12:02 PM
Re: correct terminology makes a difference... by Arnon Rotem-Gal-Oz Posted Aug 15, 2007 4:03 PM
yes, its really that good by fregas baratis Posted Aug 21, 2007 8:23 AM
Data normalization, is it really that good? by Roy Morien Posted Aug 21, 2007 8:30 AM
Re: Data normalization, is it really that good? by Geir Hedemark Posted Aug 21, 2007 12:34 PM
Re: Data normalization, is it really that good? by tahir akhtar Posted Aug 22, 2007 5:40 AM
Un-normalized by Humphrey Bogart Posted Aug 21, 2007 1:20 PM
Re: Un-normalized by Dan Hardison Posted Aug 21, 2007 6:25 PM
Update vs reporting by gilbert prine Posted Aug 22, 2007 9:02 PM
Normalization is conceptual, not implementation by Jack van Hoof Posted Aug 23, 2007 1:53 PM
qwqwq by berkay NiQuiL Posted Jun 30, 2008 12:12 PM
  1. Back to top

    Spelling, Grammar

    Aug 13, 2007 8:25 PM by dddq dddq

    Might want to double check some of the spelling and grammar (I don't want to be pedantic but it makes an otherwise well-written article difficult to understand). "Many kind of computing are append only or in his words "Accountant don't use erasers". Hence there is no need to normalize immutable data (unless you really need the space)" and "It's easy to dismiss Pat as just another blogger - until you remember that Pat has a long history with database system including for example servicing as the chief architect for SQL Server's Service Broker, one of the co-founders of the COM+ team etc" In particular.

  2. Back to top

    YMMV

    Aug 13, 2007 9:04 PM by Porter Woodward

    Normalization versus Denormalization is a "Your Mileage May Vary" argument. It all depends heavily on the nature of your application. In some cases normalization is critical. OLTP (online transaction processing) systems with large numbers of updates seem to benefit most from normalization. Having one copy of the data - and just referencing it via foreign key can be a huge help for maintaining consistency; each copy of that data that needs to be updated increases the chance that the values will diverge. Joins may be slower - but most modern database servers provide "views" - allowing you to create flattened tables, yet still maintain relational integrity. On the other hand you have a class of applications that tend to have many more reads than writes. Often those are what you would term OLAP (Online Analysis and Processing) applications. Major database vendors have been pushing "Data Warehouse" solutions for some time - flattening out your data to improve complex query processing: tabulating sales figures by region and quarter for instance. Smart money seems to be on hybrid systems and building it smart. Your "online transaction" system being fully relational for performance reasons... and replicating that data to a warehouse flattening out the data for your online analysis needs. The OLTP system can purge out old inactive records based on timed retirement - with "backup" of the data now being stored in the OLAP system. Some of the larger web sites that have rethought normalization have done so because their problem space typically does not require it. Data accuracy at flickr isn't exactly of paramount importance (compared to financial systems). Not to mention flickr still seems to have abysmal performance in spite of claims to the contrary. I wouldn't say normalization is for sissies. I would say you should understand why or why not to normalize. It's also possible that the process of breaking down your data set - normalizing it - will help you understand the shape of your data better; even if you don't ultimately end up storing it in a normalized schema.

  3. Back to top

    Have it both ways

    Aug 13, 2007 10:10 PM by Tero Vaananen

    Why not have the best of the both worlds and use them together. I need to run batches of statistical queries over a normalized database, but it is horribly inefficient. I can only spare minutes, but the queries take better part of a day to run over a normalized database, if not days in some extreme cases. On the other hand, I can't really denormalize the existing database and it would be a horrible choice for many other things we do. So, we are keeping a mirrored copy that squeezes all the tables into a handful denormalized ones. It is literally thousands of times faster for the queries we need. And that's what we use it for - just for queries that need to be extremely fast. And sometimes that is too slow as well, so we are looking at partitioned views, to limit the queries to smaller chunks of data. Beyond that we are really looking at physical partitioning over several servers.

  4. Back to top

    Re: Have it both ways

    Aug 13, 2007 11:35 PM by Vikas Hazrati

    Tero, Wouldn't creating views (snapshot or dynamic as required) be a better option in your situation? Meanwhile I tend to agree with Porter that depending on whether your application is OLTP heavy or OLAP heavy you should decide on some of the strategies, or you might have to decide on a combination depending on the various parts of your application.

  5. Back to top

    Re: Have it both ways

    Aug 14, 2007 12:42 PM by Tero Vaananen

    Tero, Wouldn't creating views (snapshot or dynamic as required) be a better option in your situation?
    Views work if your data sets are not very big. We could not create proper indexes for the views so they were just too slow. It might work better for some other databases; we used SQLServer 2000. The problem is that in some cases even regular indexed tables are too slow so we have to partition the data. We use views to hide the partitioning and it seems to work well.

  6. Back to top

    correct terminology makes a difference...

    Aug 15, 2007 12:02 PM by Ted Thibodeau Jr

    Unfortunately, none of this is really about "data normalization". This is about *database* normalization. "Data normalization" is much more about having standard formats for your data — little things like always using "Street" or always using "St" (or "St.") in an address field, or storing all phone numbers as pure numerics without spaces or other punctuation vs. storing phone numbers as strings with whatever formatting the person inputting it chooses (this becomes important when you start dealing with international phone records — which is also where you have to start thinking about country codes, in addition to area codes and local exchanges). Data normalization is not only necessary if you’re going to be doing joins of any kind, but also when you’re doing selects based on the content of a given field — because you cannot match "(617) 555-1212" to "6175551212" or "+1-617-555-1212" without doing some major manipulation — and that takes significant time when you start considering all the possible format variants, and whether the leading "+1-" about breaks a match or not, and whether "617-KL5-1212" also matches…. Both OLAP and OLTP benefit from data normalization — in different ways, but the benefits are inarguable. Database normalization (basically, relational schema design) is different and distinct — and this is where the tradeoffs for OLAP vs OLTP optimzation come in.

  7. Back to top

    Re: correct terminology makes a difference...

    Aug 15, 2007 4:03 PM by Arnon Rotem-Gal-Oz

    This is a bit of nitpicking in my opinion - if you google for "data normalization" you'd find that a fair number of the links (including the top links) talk about what you call database normalization. Nevertheless, you are correct that a more accurate term is database normalization Arnon

  8. Back to top

    yes, its really that good

    Aug 21, 2007 8:23 AM by fregas baratis

    Every rule has exceptions. Certainly there are instances where due to performance, it may make sense to denormalize a table for reporting (although indexed views can also help in this regard if they are available.) But the idea that we don't need to normalize at all is just, well...STUPID. Normalization is part of a much larger concept that some people call "DRY" or "Once and Only Once." Normalization actually makes my application code SIMPLER in many cases. Sure joins can be a pain, but so is trying display a list of a user's phone numbers when your table schema looks like this: ContactID HomeArea HomePrefix HomeSuffix FaxArea FaxPrefix FaxSuffix MobileArea MobilePrefix MobileSuffix etc... Its much easier to have a seperate table where i can add as many types of phone numbers as I like without changing the database. PhoneID ContactID AreaCode Prefix Suffix PhoneTpeID This is just one simple example. I realize not everyone needs to normalize their phone numbers like this. Every application is different. But breaking things out makes application code simpler and more flexible. I have seen denormalization cause absolute havoc with application code, much less issues with updates.

  9. Back to top

    Data normalization, is it really that good?

    Aug 21, 2007 8:30 AM by Roy Morien

    The primary reason for having 'normalised' database tables is the problem of redundancy ... the perfect rule is that every fact in the database should be represented once, and once only. Redundancy introduces the problem of having to update the same fact everywhere it is recorded in the database. There are, however, many good reasons to denormalise. But what makes this an allowable and useful practice is the fact that by denormalising, you clearly are fully aware of the redundancy that you are creating, and your assumedly will maintain update control over that. The real problem of denormalised data is where the designer is really unaware of it, and / or the developers in the future are unaware of it, with the obvious potential for the redundantly recorded facts to become incorrect and invalid. I am sure that certain facts in any database are in reality immutable, thus creating no problems when redundantly recorded. But always control must be maintained over this redundancy, as in ensuring that future generations of users and developers are aware of this. Having stated that caveat, it is obvious that denormalisation can serve very useful purposes, primarily enabling ease of access, efficiency of access etc. But there is also some misunderstanding about denormalisation. For example , having three telephone numbers. Is that really three values for the same attribute; Telephone Number, or is that really three different attributes? Is the Customer's Address on an Invoice really redundant if stored in the Invoice Table? Perhaps it is drawn from the domain of Customer Addresses, but it might very well be defined as being different; ie: The Address of the Customer THAT THE INVOICE WAS SENT TO, which is a role. Personally, I think the best approach is to create a stringently normalised data model, and then carefully and with great forethought undertake a denormalisation activity.

  10. Back to top

    Re: Data normalization, is it really that good?

    Aug 21, 2007 12:34 PM by Geir Hedemark

    Is the Customer's Address on an Invoice really redundant if stored in the Invoice Table? Perhaps it is drawn from the domain of Customer Addresses, but it might very well be defined as being different; ie: The Address of the Customer THAT THE INVOICE WAS SENT TO, which is a role.
    I think you were a bit unlucky with your choice of example. The address in a customer object may change. You do not want to update the address you sent an invoice to - you may have to have a look at where the invoices were sent to when the customer complains about lost invoices. There is a whole host of these kinds of cases where something looks like a denormalized copy of something when it is a snapshot of that something at some point in the past. Geir

  11. Back to top

    Un-normalized

    Aug 21, 2007 1:20 PM by Humphrey Bogart

    As a practising architect when I am presented with de-normalized data I usually find that it is a poor excuse for un-normalized rather than de-normalized. De-normalization is a process that requires a normalized structure as the start point. Un-normalized is a structure that has never been normalized. The performance justification for de-normalized data is easy to verify - just test the model or do the calculations at the relative create, read, update, delete frequencies you expect. The complexity justification for de-normalized data is weak unless you can demonstrate more than one team of which one needs a 'simplified' interface.

  12. Back to top

    Re: Un-normalized

    Aug 21, 2007 6:25 PM by Dan Hardison

    Hear Hear Matthew!, I think the motive for a lot of this argument is laziness and lack of understanding. Whilst the OLTP vs OLAP arguments presented are valid, I dont think that is what is driving the Un-normalized camp. I have come across the "joins are slow", "joins are messy" argument many times and 90% of the time its just laziness. Modern relational database systems are designed for joining. Thats where the "relational" comes in. I fear the uninformed and lazy will skim this article and use it as continued justification for poor design.

  13. Back to top

    Re: Data normalization, is it really that good?

    Aug 22, 2007 5:40 AM by tahir akhtar

    Is the Customer's Address on an Invoice really redundant if stored in the Invoice Table? Perhaps it is drawn from the domain of Customer Addresses, but it might very well be defined as being different; ie: The Address of the Customer THAT THE INVOICE WAS SENT TO, which is a role.
    I think you were a bit unlucky with your choice of example. The address in a customer object may change. You do not want to update the address you sent an invoice to - you may have to have a look at where the invoices were sent to when the customer complains about lost invoices. There is a whole host of these kinds of cases where something looks like a denormalized copy of something when it is a snapshot of that something at some point in the past. Geir
    Geir, He said, "If stored in invoice table" :). I think his point is if store customer's invoice-address in an invoice table rather than separate address table, it is justifiable due to one-to-one relationship between invoice and invoice-address but most designers will move it out to a separate table without giving it a second thought. Tahir

  14. Back to top

    Update vs reporting

    Aug 22, 2007 9:02 PM by gilbert prine

    As an old data warehouse architect, this sounds like revisiting the OLAP arguments during much of the 90s. When you are updating a large complicated database you certainly don't want to use un-normalized data, the OLTP updates and data quality problems will kill you. When you are reporting data you certainly don't want to use a normalized database, the joins will kill you. The only way to survive both of these needs is an efficient Extract Transform and Load (ETL) process that transforms the normalized data to the un-normalized structure you need to analyze your data. It is guts of data warehousing and made Bill Inmon and Ralph Kimball famous. Read Kimball's excellent Data Warehouse Toolkit book if you want to really understand probably the best view of this problem. My current project is downloading equity data and doing time series analysis on this information. If I were the Data Administrator of the NYSE or the NASDAQ, I would be out of my mind to consider updating this data with an unnormalized database. However as a customer I only want unnormalized data, because I actually use my own ETL process to a quasi-object database that I use for my stock market time series analysis. These Data Administrators better be providing me with unnormalized data or they won't have a customer.

  15. Back to top

    Normalization is conceptual, not implementation

    Aug 23, 2007 1:53 PM by Jack van Hoof

    I always thought that data normalization was done to structure data at a conceptual level. Now I really get the impression that it is all about implementation. Where did I go wrong the past 30 years of my IT-carrier? Fair enough, most of the concepts are implemented as is, but smart DBMS systems are tricking you... Fair enough again: messages in an EDA, where data represent immutable business events, is all about denormalization, but based on structured normalized (conceptual) that is hit into flat structures. Or am I wrong? Jack

  16. Back to top

    qwqwq

    Jun 30, 2008 12:12 PM by berkay NiQuiL

Exclusive Content

Rationalizing the Presentation Tier

Thin client paradigm characterized by web applications is a kludge that needs to be repudiated. Old compromises are no longer needed and it's time to move the presentation tier to where it belongs.

Agile Project Management: Lessons Learned at Google

In this presentation filmed during QCon 2007, Jeff Sutherland, the creator of Scrum, talks about his visit at Google to do an analysis of Google's first implementation of Scrum.

AtomServer – The Power of Publishing for Data Distribution

In this article, Bryon Jacob and Chris Berry introduce AtomServer, their implementation of a full-fledged Atom Store based on Apache Abdera, which is now available as open source.

An Introduction to Virtualization

It is easy to think that virtualization applies only to servers. In reality the recent resurgence of the concept is also being applied to networking, storage, and application infrastructure.

REST Anti-Patterns

In this article, Stefan Tilkov explains some of the most common anti-patterns found in applications that claim to follow a "RESTful" design and suggests ways to avoid them.

Choosing between Routing and Orchestration in an ESB

In this article, Adrien Louis and Marc Dutoo discuss the differences and relative merits of using orchestration vs. routing in a typical ESB setup, and discuss various implementation options.

Enterprise Batch Processing with Spring

Wayne Lund discusses batch processing, Spring Batch objectives and features, scenarios for usage, Spring Batch architecture, scaling, example code, failures and retrying, and the future roadmap.

User Story Estimation Techniques

Developer Jay Fields draws on his experiences as a ThoughtWorks consultant to describe effective user story estimation techniques.