Data normalization, is it really that good?
A couple of weeks ago Pat Helland blogged about a short presentation he gave named "Normalization is for Sissies". Pat started by stating (the obvious) that we normalize to eliminate update anomalies. but then proceeded to make a couple interesting observations:
- Many types of computing problems are append only in nature or in his words "Accountant don't use erasers". Hence there is no need to normalize immutable data (unless you really need the space)
- Developers work with "complete" business objects. These Business Objects are denormalized complete views anyway
Andres Aguiar commented that in addition to the "no need to normalize immutable data", you don't even need to delete:
Another [idea] is that you actually don't need to delete/update the database. 'Deleting' a row means setting a InvalidationTimestamp = now(), and updating a row means setting InvalidationTimestamp = now() and inserting a new row with SinceTimestamp = now() and InvalidationTimestamp = null (you actually need two sets of dates, but that's for another post). Now, if you put the two ideas together, all the data is immutable, so you don't need to normalize anything.Analyzing this, one barrier Andres mentioned was that databases are not ready for the large number of columns per table such a scenario will generate, which is solvable if you apply archiving to older data.
Dare Obasanjo thinks that
"Database denormalization is the kind of performance optimization that should be carried out as a last resort after trying things like creating database indexes, using SQL views and implementing application specific in-memory caching. "However Dare agrees that when you need massive scale you probably need to denormalize and give Flickr as an example.
* Normalised data is for sissiesA 2004 blog post by Jason Kottke, which bares the same title like Pat's post and Cal's first point discusses Cal's presentations. Jason explains that in Flickr's case there is a ration of 13 Selects to each Insert/Delete or Update which is why they chose to denormalize.
* Keep multiple copies of data around
* Makes searching faster
* Have to ensure consistency in the application logic
To sum up as Pat finished his presentation:
"people normalize 'cuz their professor said to"Is it time yet to rethink the mantra of database normalization? Some of the larger web-sites seem to think so, what's your take?
YMMV
by
Porter Woodward
It all depends heavily on the nature of your application. In some cases normalization is critical. OLTP (online transaction processing) systems with large numbers of updates seem to benefit most from normalization. Having one copy of the data - and just referencing it via foreign key can be a huge help for maintaining consistency; each copy of that data that needs to be updated increases the chance that the values will diverge. Joins may be slower - but most modern database servers provide "views" - allowing you to create flattened tables, yet still maintain relational integrity.
On the other hand you have a class of applications that tend to have many more reads than writes. Often those are what you would term OLAP (Online Analysis and Processing) applications. Major database vendors have been pushing "Data Warehouse" solutions for some time - flattening out your data to improve complex query processing: tabulating sales figures by region and quarter for instance.
Smart money seems to be on hybrid systems and building it smart. Your "online transaction" system being fully relational for performance reasons... and replicating that data to a warehouse flattening out the data for your online analysis needs. The OLTP system can purge out old inactive records based on timed retirement - with "backup" of the data now being stored in the OLAP system.
Some of the larger web sites that have rethought normalization have done so because their problem space typically does not require it. Data accuracy at flickr isn't exactly of paramount importance (compared to financial systems). Not to mention flickr still seems to have abysmal performance in spite of claims to the contrary.
I wouldn't say normalization is for sissies. I would say you should understand why or why not to normalize. It's also possible that the process of breaking down your data set - normalizing it - will help you understand the shape of your data better; even if you don't ultimately end up storing it in a normalized schema.
Have it both ways
by
Tero Vaananen
I need to run batches of statistical queries over a normalized database, but it is horribly inefficient. I can only spare minutes, but the queries take better part of a day to run over a normalized database, if not days in some extreme cases. On the other hand, I can't really denormalize the existing database and it would be a horrible choice for many other things we do.
So, we are keeping a mirrored copy that squeezes all the tables into a handful denormalized ones. It is literally thousands of times faster for the queries we need. And that's what we use it for - just for queries that need to be extremely fast. And sometimes that is too slow as well, so we are looking at partitioned views, to limit the queries to smaller chunks of data. Beyond that we are really looking at physical partitioning over several servers.
Re: Have it both ways
by
Vikas Hazrati
Meanwhile I tend to agree with Porter that depending on whether your application is OLTP heavy or OLAP heavy you should decide on some of the strategies, or you might have to decide on a combination depending on the various parts of your application.
Re: Have it both ways
by
Tero Vaananen
Tero, Wouldn't creating views (snapshot or dynamic as required) be a better option in your situation?
Views work if your data sets are not very big. We could not create proper indexes for the views so they were just too slow. It might work better for some other databases; we used SQLServer 2000. The problem is that in some cases even regular indexed tables are too slow so we have to partition the data. We use views to hide the partitioning and it seems to work well.
correct terminology makes a difference...
by
Ted Thibodeau Jr
"Data normalization" is much more about having standard formats for your data — little things like always using "Street" or always using "St" (or "St.") in an address field, or storing all phone numbers as pure numerics without spaces or other punctuation vs. storing phone numbers as strings with whatever formatting the person inputting it chooses (this becomes important when you start dealing with international phone records — which is also where you have to start thinking about country codes, in addition to area codes and local exchanges).
Data normalization is not only necessary if you’re going to be doing joins of any kind, but also when you’re doing selects based on the content of a given field — because you cannot match "(617) 555-1212" to "6175551212" or "+1-617-555-1212" without doing some major manipulation — and that takes significant time when you start considering all the possible format variants, and whether the leading "+1-" about breaks a match or not, and whether "617-KL5-1212" also matches….
Both OLAP and OLTP benefit from data normalization — in different ways, but the benefits are inarguable. Database normalization (basically, relational schema design) is different and distinct — and this is where the tradeoffs for OLAP vs OLTP optimzation come in.
Re: correct terminology makes a difference...
by
Arnon Rotem-Gal-Oz
Nevertheless, you are correct that a more accurate term is database normalization
Arnon
yes, its really that good
by
fregas baratis
Normalization is part of a much larger concept that some people call "DRY" or "Once and Only Once." Normalization actually makes my application code SIMPLER in many cases. Sure joins can be a pain, but so is trying display a list of a user's phone numbers when your table schema looks like this:
ContactID
HomeArea
HomePrefix
HomeSuffix
FaxArea
FaxPrefix
FaxSuffix
MobileArea
MobilePrefix
MobileSuffix
etc...
Its much easier to have a seperate table where i can add as many types of phone numbers as I like without changing the database.
PhoneID
ContactID
AreaCode
Prefix
Suffix
PhoneTpeID
This is just one simple example. I realize not everyone needs to normalize their phone numbers like this. Every application is different. But breaking things out makes application code simpler and more flexible. I have seen denormalization cause absolute havoc with application code, much less issues with updates.
Data normalization, is it really that good?
by
Roy Morien
I am sure that certain facts in any database are in reality immutable, thus creating no problems when redundantly recorded. But always control must be maintained over this redundancy, as in ensuring that future generations of users and developers are aware of this.
Having stated that caveat, it is obvious that denormalisation can serve very useful purposes, primarily enabling ease of access, efficiency of access etc.
But there is also some misunderstanding about denormalisation. For example , having three telephone numbers. Is that really three values for the same attribute; Telephone Number, or is that really three different attributes? Is the Customer's Address on an Invoice really redundant if stored in the Invoice Table? Perhaps it is drawn from the domain of Customer Addresses, but it might very well be defined as being different; ie: The Address of the Customer THAT THE INVOICE WAS SENT TO, which is a role.
Personally, I think the best approach is to create a stringently normalised data model, and then carefully and with great forethought undertake a denormalisation activity.
Re: Data normalization, is it really that good?
by
Geir Hedemark
Is the Customer's Address on an Invoice really redundant if stored in the Invoice Table? Perhaps it is drawn from the domain of Customer Addresses, but it might very well be defined as being different; ie: The Address of the Customer THAT THE INVOICE WAS SENT TO, which is a role.
I think you were a bit unlucky with your choice of example. The address in a customer object may change. You do not want to update the address you sent an invoice to - you may have to have a look at where the invoices were sent to when the customer complains about lost invoices. There is a whole host of these kinds of cases where something looks like a denormalized copy of something when it is a snapshot of that something at some point in the past.
Geir
Un-normalized
by
Humphrey Bogart
The performance justification for de-normalized data is easy to verify - just test the model or do the calculations at the relative create, read, update, delete frequencies you expect.
The complexity justification for de-normalized data is weak unless you can demonstrate more than one team of which one needs a 'simplified' interface.
Re: Un-normalized
by
Dan Hardison
I think the motive for a lot of this argument is laziness and lack of understanding.
Whilst the OLTP vs OLAP arguments presented are valid, I dont think that is what is driving the Un-normalized camp.
I have come across the "joins are slow", "joins are messy" argument many times and 90% of the time its just laziness. Modern relational database systems are designed for joining. Thats where the "relational" comes in.
I fear the uninformed and lazy will skim this article and use it as continued justification for poor design.
Re: Data normalization, is it really that good?
by
tahir akhtar
Is the Customer's Address on an Invoice really redundant if stored in the Invoice Table? Perhaps it is drawn from the domain of Customer Addresses, but it might very well be defined as being different; ie: The Address of the Customer THAT THE INVOICE WAS SENT TO, which is a role.
I think you were a bit unlucky with your choice of example. The address in a customer object may change. You do not want to update the address you sent an invoice to - you may have to have a look at where the invoices were sent to when the customer complains about lost invoices. There is a whole host of these kinds of cases where something looks like a denormalized copy of something when it is a snapshot of that something at some point in the past.
Geir
Geir,
He said, "If stored in invoice table" :). I think his point is if store customer's invoice-address in an invoice table rather than separate address table, it is justifiable due to one-to-one relationship between invoice and invoice-address but most designers will move it out to a separate table without giving it a second thought.
Tahir
Update vs reporting
by
gilbert prine
My current project is downloading equity data and doing time series analysis on this information. If I were the Data Administrator of the NYSE or the NASDAQ, I would be out of my mind to consider updating this data with an unnormalized database. However as a customer I only want unnormalized data, because I actually use my own ETL process to a quasi-object database that I use for my stock market time series analysis. These Data Administrators better be providing me with unnormalized data or they won't have a customer.
Normalization is conceptual, not implementation
by
Jack van Hoof
Fair enough, most of the concepts are implemented as is, but smart DBMS systems are tricking you...
Fair enough again: messages in an EDA, where data represent immutable business events, is all about denormalization, but based on structured normalized (conceptual) that is hit into flat structures. Or am I wrong?
Jack
Educational Content
Writing Usable APIs in Practice
Giovanni Asproni May 19, 2013
Concurrency in Clojure
Stuart Halloway May 17, 2013





Hello stranger!
You need to Register an InfoQ account or Login to post comments. But there's so much more behind being registered.Get the most out of the InfoQ experience.
Tell us what you think