BT

Data normalization, is it really that good?

by Arnon Rotem-Gal-Oz on Aug 13, 2007 |
Normalization is one of the corner-stones of database design. Recently some discussion emerged on the need for normalization suggesting denormalization as a more scalable solution.

A couple of weeks ago Pat Helland blogged about a short presentation he gave named "Normalization is for Sissies". Pat started by stating (the obvious) that we normalize to eliminate update anomalies. but then proceeded to make a couple interesting observations:
  • Many types of computing problems are append only in nature or in his words "Accountant don't use erasers". Hence there is no need to normalize immutable data (unless you really need the space)
  • Developers work with "complete" business objects. These Business Objects are denormalized complete views anyway
It's easy to dismiss Pat as just another blogger - until you remember that Pat has a long history with database systems including for example servicing as the chief architect for SQL Server's Service Broker and as one of the co-founders of the COM+ team. Pat also spent a couple of years or so working for Amazon - and indeed Werner Vogels (Amazon's CTO) mentioned that you can consider denormalizing entities to get better scalability and performance in his QCon presentation on Availability and Consistency (recently published here).

Andres Aguiar commented that in addition to the "no need to normalize immutable data", you don't even need to delete:
Another [idea] is that you actually don't need to delete/update the database. 'Deleting' a row means setting a InvalidationTimestamp = now(), and updating a row means setting InvalidationTimestamp = now() and inserting a new row with SinceTimestamp = now() and InvalidationTimestamp = null (you actually need two sets of dates, but that's for another post). Now, if you put the two ideas together, all the data is immutable, so you don't need to normalize anything.
Analyzing this, one barrier Andres mentioned was that databases are not ready for the large number of columns per table such a scenario will generate, which is solvable if you apply archiving to older data.

Dare Obasanjo thinks that
"Database denormalization is the kind of performance optimization that should be carried out as a last resort after trying things like creating database indexes, using SQL views and implementing application specific in-memory caching. "
However Dare agrees that when you need massive scale you probably need to denormalize and give Flickr as an example.

As Cal Henderson (Flickr's web development lead) mentioned in a presentation he gave in 2004 on Flickr architecture that Joins are slow (slide 27) and as a result:
* Normalised data is for sissies
* Keep multiple copies of data around
* Makes searching faster
* Have to ensure consistency in the application logic
A 2004 blog post by Jason Kottke, which bares the same title like Pat's post and Cal's first point discusses Cal's presentations. Jason explains that in Flickr's case there is a ration of 13 Selects to each Insert/Delete or Update which is why they chose to denormalize.

To sum up as Pat finished his presentation:
    "people normalize 'cuz their professor said to"
Is it time yet to rethink the mantra of database normalization? Some of the larger web-sites seem to think so, what's your take?

Hello stranger!

You need to Register an InfoQ account or to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

YMMV by Porter Woodward

Normalization versus Denormalization is a "Your Mileage May Vary" argument.

It all depends heavily on the nature of your application. In some cases normalization is critical. OLTP (online transaction processing) systems with large numbers of updates seem to benefit most from normalization. Having one copy of the data - and just referencing it via foreign key can be a huge help for maintaining consistency; each copy of that data that needs to be updated increases the chance that the values will diverge. Joins may be slower - but most modern database servers provide "views" - allowing you to create flattened tables, yet still maintain relational integrity.

On the other hand you have a class of applications that tend to have many more reads than writes. Often those are what you would term OLAP (Online Analysis and Processing) applications. Major database vendors have been pushing "Data Warehouse" solutions for some time - flattening out your data to improve complex query processing: tabulating sales figures by region and quarter for instance.

Smart money seems to be on hybrid systems and building it smart. Your "online transaction" system being fully relational for performance reasons... and replicating that data to a warehouse flattening out the data for your online analysis needs. The OLTP system can purge out old inactive records based on timed retirement - with "backup" of the data now being stored in the OLAP system.

Some of the larger web sites that have rethought normalization have done so because their problem space typically does not require it. Data accuracy at flickr isn't exactly of paramount importance (compared to financial systems). Not to mention flickr still seems to have abysmal performance in spite of claims to the contrary.

I wouldn't say normalization is for sissies. I would say you should understand why or why not to normalize. It's also possible that the process of breaking down your data set - normalizing it - will help you understand the shape of your data better; even if you don't ultimately end up storing it in a normalized schema.

Have it both ways by Tero Vaananen

Why not have the best of the both worlds and use them together.

I need to run batches of statistical queries over a normalized database, but it is horribly inefficient. I can only spare minutes, but the queries take better part of a day to run over a normalized database, if not days in some extreme cases. On the other hand, I can't really denormalize the existing database and it would be a horrible choice for many other things we do.

So, we are keeping a mirrored copy that squeezes all the tables into a handful denormalized ones. It is literally thousands of times faster for the queries we need. And that's what we use it for - just for queries that need to be extremely fast. And sometimes that is too slow as well, so we are looking at partitioned views, to limit the queries to smaller chunks of data. Beyond that we are really looking at physical partitioning over several servers.

Re: Have it both ways by Vikas Hazrati

Tero, Wouldn't creating views (snapshot or dynamic as required) be a better option in your situation?

Meanwhile I tend to agree with Porter that depending on whether your application is OLTP heavy or OLAP heavy you should decide on some of the strategies, or you might have to decide on a combination depending on the various parts of your application.

Re: Have it both ways by Tero Vaananen

Tero, Wouldn't creating views (snapshot or dynamic as required) be a better option in your situation?


Views work if your data sets are not very big. We could not create proper indexes for the views so they were just too slow. It might work better for some other databases; we used SQLServer 2000. The problem is that in some cases even regular indexed tables are too slow so we have to partition the data. We use views to hide the partitioning and it seems to work well.

correct terminology makes a difference... by Ted Thibodeau Jr

Unfortunately, none of this is really about "data normalization". This is about *database* normalization.

"Data normalization" is much more about having standard formats for your data — little things like always using "Street" or always using "St" (or "St.") in an address field, or storing all phone numbers as pure numerics without spaces or other punctuation vs. storing phone numbers as strings with whatever formatting the person inputting it chooses (this becomes important when you start dealing with international phone records — which is also where you have to start thinking about country codes, in addition to area codes and local exchanges).

Data normalization is not only necessary if you’re going to be doing joins of any kind, but also when you’re doing selects based on the content of a given field — because you cannot match "(617) 555-1212" to "6175551212" or "+1-617-555-1212" without doing some major manipulation — and that takes significant time when you start considering all the possible format variants, and whether the leading "+1-" about breaks a match or not, and whether "617-KL5-1212" also matches….

Both OLAP and OLTP benefit from data normalization — in different ways, but the benefits are inarguable. Database normalization (basically, relational schema design) is different and distinct — and this is where the tradeoffs for OLAP vs OLTP optimzation come in.

Re: correct terminology makes a difference... by Arnon Rotem-Gal-Oz

This is a bit of nitpicking in my opinion - if you google for "data normalization" you'd find that a fair number of the links (including the top links) talk about what you call database normalization.

Nevertheless, you are correct that a more accurate term is database normalization

Arnon

yes, its really that good by fregas baratis

Every rule has exceptions. Certainly there are instances where due to performance, it may make sense to denormalize a table for reporting (although indexed views can also help in this regard if they are available.) But the idea that we don't need to normalize at all is just, well...STUPID.

Normalization is part of a much larger concept that some people call "DRY" or "Once and Only Once." Normalization actually makes my application code SIMPLER in many cases. Sure joins can be a pain, but so is trying display a list of a user's phone numbers when your table schema looks like this:

ContactID
HomeArea
HomePrefix
HomeSuffix
FaxArea
FaxPrefix
FaxSuffix
MobileArea
MobilePrefix
MobileSuffix
etc...

Its much easier to have a seperate table where i can add as many types of phone numbers as I like without changing the database.

PhoneID
ContactID
AreaCode
Prefix
Suffix
PhoneTpeID

This is just one simple example. I realize not everyone needs to normalize their phone numbers like this. Every application is different. But breaking things out makes application code simpler and more flexible. I have seen denormalization cause absolute havoc with application code, much less issues with updates.

Data normalization, is it really that good? by Roy Morien

The primary reason for having 'normalised' database tables is the problem of redundancy ... the perfect rule is that every fact in the database should be represented once, and once only. Redundancy introduces the problem of having to update the same fact everywhere it is recorded in the database. There are, however, many good reasons to denormalise. But what makes this an allowable and useful practice is the fact that by denormalising, you clearly are fully aware of the redundancy that you are creating, and your assumedly will maintain update control over that. The real problem of denormalised data is where the designer is really unaware of it, and / or the developers in the future are unaware of it, with the obvious potential for the redundantly recorded facts to become incorrect and invalid.

I am sure that certain facts in any database are in reality immutable, thus creating no problems when redundantly recorded. But always control must be maintained over this redundancy, as in ensuring that future generations of users and developers are aware of this.

Having stated that caveat, it is obvious that denormalisation can serve very useful purposes, primarily enabling ease of access, efficiency of access etc.

But there is also some misunderstanding about denormalisation. For example , having three telephone numbers. Is that really three values for the same attribute; Telephone Number, or is that really three different attributes? Is the Customer's Address on an Invoice really redundant if stored in the Invoice Table? Perhaps it is drawn from the domain of Customer Addresses, but it might very well be defined as being different; ie: The Address of the Customer THAT THE INVOICE WAS SENT TO, which is a role.

Personally, I think the best approach is to create a stringently normalised data model, and then carefully and with great forethought undertake a denormalisation activity.

Re: Data normalization, is it really that good? by Geir Hedemark

Is the Customer's Address on an Invoice really redundant if stored in the Invoice Table? Perhaps it is drawn from the domain of Customer Addresses, but it might very well be defined as being different; ie: The Address of the Customer THAT THE INVOICE WAS SENT TO, which is a role.


I think you were a bit unlucky with your choice of example. The address in a customer object may change. You do not want to update the address you sent an invoice to - you may have to have a look at where the invoices were sent to when the customer complains about lost invoices. There is a whole host of these kinds of cases where something looks like a denormalized copy of something when it is a snapshot of that something at some point in the past.

Geir

Un-normalized by Humphrey Bogart

As a practising architect when I am presented with de-normalized data I usually find that it is a poor excuse for un-normalized rather than de-normalized. De-normalization is a process that requires a normalized structure as the start point. Un-normalized is a structure that has never been normalized.

The performance justification for de-normalized data is easy to verify - just test the model or do the calculations at the relative create, read, update, delete frequencies you expect.

The complexity justification for de-normalized data is weak unless you can demonstrate more than one team of which one needs a 'simplified' interface.

Re: Un-normalized by Dan Hardison

Hear Hear Matthew!,

I think the motive for a lot of this argument is laziness and lack of understanding.

Whilst the OLTP vs OLAP arguments presented are valid, I dont think that is what is driving the Un-normalized camp.

I have come across the "joins are slow", "joins are messy" argument many times and 90% of the time its just laziness. Modern relational database systems are designed for joining. Thats where the "relational" comes in.

I fear the uninformed and lazy will skim this article and use it as continued justification for poor design.

Re: Data normalization, is it really that good? by tahir akhtar

Is the Customer's Address on an Invoice really redundant if stored in the Invoice Table? Perhaps it is drawn from the domain of Customer Addresses, but it might very well be defined as being different; ie: The Address of the Customer THAT THE INVOICE WAS SENT TO, which is a role.


I think you were a bit unlucky with your choice of example. The address in a customer object may change. You do not want to update the address you sent an invoice to - you may have to have a look at where the invoices were sent to when the customer complains about lost invoices. There is a whole host of these kinds of cases where something looks like a denormalized copy of something when it is a snapshot of that something at some point in the past.

Geir


Geir,
He said, "If stored in invoice table" :). I think his point is if store customer's invoice-address in an invoice table rather than separate address table, it is justifiable due to one-to-one relationship between invoice and invoice-address but most designers will move it out to a separate table without giving it a second thought.


Tahir

Update vs reporting by gilbert prine

As an old data warehouse architect, this sounds like revisiting the OLAP arguments during much of the 90s. When you are updating a large complicated database you certainly don't want to use un-normalized data, the OLTP updates and data quality problems will kill you. When you are reporting data you certainly don't want to use a normalized database, the joins will kill you. The only way to survive both of these needs is an efficient Extract Transform and Load (ETL) process that transforms the normalized data to the un-normalized structure you need to analyze your data. It is guts of data warehousing and made Bill Inmon and Ralph Kimball famous. Read Kimball's excellent Data Warehouse Toolkit book if you want to really understand probably the best view of this problem.

My current project is downloading equity data and doing time series analysis on this information. If I were the Data Administrator of the NYSE or the NASDAQ, I would be out of my mind to consider updating this data with an unnormalized database. However as a customer I only want unnormalized data, because I actually use my own ETL process to a quasi-object database that I use for my stock market time series analysis. These Data Administrators better be providing me with unnormalized data or they won't have a customer.

Normalization is conceptual, not implementation by Jack van Hoof

I always thought that data normalization was done to structure data at a conceptual level. Now I really get the impression that it is all about implementation. Where did I go wrong the past 30 years of my IT-carrier?

Fair enough, most of the concepts are implemented as is, but smart DBMS systems are tricking you...

Fair enough again: messages in an EDA, where data represent immutable business events, is all about denormalization, but based on structured normalized (conceptual) that is hit into flat structures. Or am I wrong?

Jack

Re: Data normalization, is it really that good? by Roy Morien

You are quite correct, Geir.I actually meant precisely what you said, but my meaning was not obvious. The address onthe invoice is indeed a snapshot, an historical fact that should not change

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

15 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT