Data normalization, is it really that good?

Normalization is one of the corner-stones of database design. Recently some discussion emerged on the need for normalization suggesting denormalization as a more scalable solution.

A couple of weeks ago Pat Helland blogged about a short presentation he gave named "Normalization is for Sissies". Pat started by stating (the obvious) that we normalize to eliminate update anomalies. but then proceeded to make a couple interesting observations:

Many types of computing problems are append only in nature or in his words "Accountant don't use erasers". Hence there is no need to normalize immutable data (unless you really need the space)
Developers work with "complete" business objects. These Business Objects are denormalized complete views anyway

It's easy to dismiss Pat as just another blogger - until you remember that Pat has a long history with database systems including for example servicing as the chief architect for SQL Server's Service Broker and as one of the co-founders of the COM+ team. Pat also spent a couple of years or so working for Amazon - and indeed Werner Vogels (Amazon's CTO) mentioned that you can consider denormalizing entities to get better scalability and performance in his QCon presentation on Availability and Consistency (recently published here).

Andres Aguiar commented that in addition to the "no need to normalize immutable data", you don't even need to delete:

Another [idea] is that you actually don't need to delete/update the database. 'Deleting' a row means setting a InvalidationTimestamp = now(), and updating a row means setting InvalidationTimestamp = now() and inserting a new row with SinceTimestamp = now() and InvalidationTimestamp = null (you actually need two sets of dates, but that's for another post). Now, if you put the two ideas together, all the data is immutable, so you don't need to normalize anything.

Analyzing this, one barrier Andres mentioned was that databases are not ready for the large number of columns per table such a scenario will generate, which is solvable if you apply archiving to older data.

Dare Obasanjo thinks that

"Database denormalization is the kind of performance optimization that should be carried out as a last resort after trying things like creating database indexes, using SQL views and implementing application specific in-memory caching. "

However Dare agrees that when you need massive scale you probably need to denormalize and give Flickr as an example.

As Cal Henderson (Flickr's web development lead) mentioned in a presentation he gave in 2004 on Flickr architecture that Joins are slow (slide 27) and as a result:

* Normalised data is for sissies
* Keep multiple copies of data around
* Makes searching faster
* Have to ensure consistency in the application logic

A 2004 blog post by Jason Kottke, which bares the same title like Pat's post and Cal's first point discusses Cal's presentations. Jason explains that in Flickr's case there is a ration of 13 Selects to each Insert/Delete or Update which is why they chose to denormalize.

To sum up as Pat finished his presentation:

"people normalize 'cuz their professor said to"

Is it time yet to rethink the mantra of database normalization? Some of the larger web-sites seem to think so, what's your take?

Topics

Pitfalls of Unified Memory Models in GPUs

Evolving Trainline Architecture for Scale, Reliability and Productivity

Generally AI - Season 2 - Episode 3: Surviving the AI Winter

Mastering Observability: Unlocking Customer Insights with Gojko Adzic

Proactive Approaches to Securing Linux Systems and Engineering Applications

Helpful links

Choose your language

Write for InfoQ

Rate this Article

This content is in the DevOps topic

Related Topics:

Related Editorial

Related Sponsored Content

Popular across InfoQ

Microsoft Introduces Drasi: Open-Source System for Real-Time Event Processing and Automation

How Cell-Based Architecture Enhances Modern Distributed Systems

Article Series: Cell-Based Architectures: How to Build Scalable and Resilient Systems

Orchestrating a Path to Success - a Conversation with Bernd Ruecker

OpenAI Releases Swarm, an Experimental Open-Source Framework for Multi-Agent Orchestration

Generally AI - Season 2 - Episode 3: Surviving the AI Winter

Challenges and Lessons Porting Code from C to Rust

Copilot Now Available in OneDrive: AI-Powered Features for Streamlined Document Management

Ephemeral IDs: Cloudflare's Latest Tool for Fraud Detection

Evolving Trainline Architecture for Scale, Reliability and Productivity

Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems

No EC2 or Kubernetes Allowed: Insights from Building Serverless-Only Architecture at PostNL

Mastering Observability: Unlocking Customer Insights with Gojko Adzic

How a Sustainable Mindset in Software Engineering Can Increase Team Performance and Prevent Burnout

The Ongoing Challenges of DevSecOps Transformation and Improving Developer Experience

University Researchers Publish Analysis of Chain-of-Thought Reasoning in LLMs

Microsoft and Tsinghua University Present DIFF Transformer for LLMs

OpenAI Releases Swarm, an Experimental Open-Source Framework for Multi-Agent Orchestration

Google Cloud Adds Scalable Vector Search to Memorystore for Valkey & Redis Cluster

Podman Desktop 1.13 Launches with Hyper-V Support and Additional Enhancements

Uber Completes Major MySQL Fleet Upgrade, Boosting Performance and Security

QCon San Francisco

QCon London

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?