Dave McCrory Unveils Initial Formula for Principle of Data Gravity
Does data, like a celestial body, have its own gravitational pull that attracts applications and services into its orbit? VMware’s Dave McCrory proposed as much in 2010, and he has recently put some mathematical rigor behind his principle. On his new website, DataGravity.org, McCrory outlines a formula for data gravity and asks the technical community for help in vetting and applying it.
In a 2011 post about data gravity, McCrory describes the basics of his principle.
Data Gravity is a theory around which data has mass. As data (mass) accumulates, it begins to have gravity. This Data Gravity pulls services and applications closer to the data. This attraction (gravitational force) is caused by the need for services and applications to have higher bandwidth and/or lower latency access to the data.
This principle has resonated deeply with the technical community. It is frequently discussed at conferences, including the recent GigaOM Structure, and written about in articles such as the ReadWriteCloud piece entitled What "Data Gravity" Means to Your Data. In that article, author Joe Brockmeier warns against casual investment in data storage that may generate significant gravity.
Whether it's a single-user application like iTunes, or a company wide project: You need to consider the implications of data gravity - once your data is in, how hard will it be to break the gravitational field?
The stronger the data gravity involved, the more cautious you should be when you choose your data storage solution. It's likely that once you have a sufficient amount of data wrapped up in a solution, it's going to be very difficult (if not impossible) to justify the costs of moving it away.
On his DataGravity.org site, McCrory describes his approach to quantifying this principle. First, he tackled the calculation of Data Mass.
The first thing that I learned was that in order to have Gravity, you must calculate Mass. While this is trivial Physics, applying this to an abstract concept is a bit more difficult. After a great deal of time and many versions, I have a current Mass formula for Data and a Mass formula for Applications (either or both of these could change at some point)
He decided to calculate Data Mass by multiplying data volume (equal to the size of the data measured in megabytes) by the data density (which is the compression ratio of the data).
After that, McCrory calculated Application Mass by multiplying application volume (the sum of the amount of memory used and the amount of disk space used) by the application density (the sum of the compression ratio of the memory, the compression ratio of the disk space, and the total CPU utilization).
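The two mass calculations described above can be sketched in a few lines of Python. This is an illustration rather than McCrory's own code: the function and parameter names are hypothetical, and treating CPU utilization as a fraction between 0 and 1 is an assumption, since the description does not state its units.

```python
def data_mass(volume_mb: float, compression_ratio: float) -> float:
    """Data Mass = data volume (MB) * data density (compression ratio)."""
    return volume_mb * compression_ratio


def application_mass(
    memory_mb: float,
    disk_mb: float,
    memory_compression_ratio: float,
    disk_compression_ratio: float,
    cpu_utilization: float,  # assumed to be a fraction in [0, 1]
) -> float:
    """Application Mass = application volume * application density."""
    # Volume: memory used plus disk space used.
    volume = memory_mb + disk_mb
    # Density: both compression ratios plus CPU utilization.
    density = memory_compression_ratio + disk_compression_ratio + cpu_utilization
    return volume * density


if __name__ == "__main__":
    # Made-up inputs, chosen only to show the arithmetic.
    print(data_mass(1000.0, 2.0))                          # 2000.0
    print(application_mass(512.0, 1536.0, 1.5, 2.0, 0.5))  # 8192.0
```

Because the density terms are simply added, a change in CPU utilization moves Application Mass in proportion to the whole application volume, which is worth keeping in mind when comparing applications of very different sizes.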
To account for the significant impact of the network on data gravity, McCrory adds variables for network latency, network bandwidth, the number of requests per second, and the average size of requests. He combines all of these factors to arrive at a calculation for data gravity. In an interview with InfoQ, McCrory shared that he considered and discarded many additional variables. He attempted to factor in the impact of create/read/update/delete operations for a given data mass and even the type of storage that the data rested upon, but ultimately decided that his current formula captured the key aspects of data gravity.
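The formula itself is not reproduced in this text. Purely as an illustration of how the listed factors could combine, the sketch below assumes a Newtonian-style form: the product of the two masses and the request rate, divided by the square of an effective "distance" made up of latency plus the time needed to transfer an average request. That form, the function name, and the units are all assumptions, not a confirmed statement of McCrory's formula.

```python
def data_gravity(
    data_mass: float,
    app_mass: float,
    latency_s: float,
    bandwidth_mb_per_s: float,
    requests_per_s: float,
    avg_request_mb: float,
) -> float:
    """Hypothetical data gravity between a data mass and an application mass.

    Assumed form (not taken from the source):
        (data_mass * app_mass * requests_per_s)
        / (latency_s + avg_request_mb / bandwidth_mb_per_s) ** 2
    """
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    # Effective "distance": network latency plus average transfer time.
    effective_distance = latency_s + avg_request_mb / bandwidth_mb_per_s
    if effective_distance <= 0:
        raise ValueError("latency plus transfer time must be positive")
    return data_mass * app_mass * requests_per_s / effective_distance ** 2


if __name__ == "__main__":
    # Same masses and traffic, two different network latencies.
    far = data_gravity(2000.0, 8192.0, 0.25, 100.0, 10.0, 25.0)
    near = data_gravity(2000.0, 8192.0, 0.05, 100.0, 10.0, 25.0)
    print(near > far)  # True: lower latency gives a stronger pull
```

Under this assumed form, lower latency or higher bandwidth shrinks the denominator and raises the result, which matches the description of applications being pulled toward data for faster access. As with McCrory's own numbers, the output would only be meaningful for comparing objects within the same network.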
McCrory considers the number produced by this calculation to be relative to the network that data exists in. According to McCrory, each network is a universe, and a given data mass exists in that universe. While one could fruitfully compare data gravity numbers between two objects within the same network, McCrory does not yet have enough information to confidently compare the data gravity for one network versus the data gravity from another network.
On the DataGravity.org site, McCrory lists a few of the possible uses of this principle.
Reasons to move toward a source of Data Gravity (Increase Data Gravity)
- You need Lower Latency for your Application or Service
- You need Higher Bandwidth for your Application or Service
- You want to generate more Data Mass more quickly
- You are doing HPC
- You are doing Hadoop or Realtime Processing
Reasons to resist or move away from a source of Data Gravity (Decrease Data Gravity)
- You want to avoid lock-in / keep escape velocity low
- Application Portability
- Resiliency to Increases in Latency or Decreases in Bandwidth (aka Availability)
Data Gravity and Data Mass may have other uses as well:
- Making decisions of movement or location between two Data Masses
- Projecting Growth of Data Mass
- Projecting Increases of Data Gravity (Which could signal all sorts of things)
This formula is a work in progress, according to McCrory, and he is actively seeking real-world use cases and tests of this principle.
So, your data mass = du -sh `zip your_backup.export` ?
And your Application Mass is basically what gets billed to you by EC2 (or any other cloud provider)?
While it can be true, I guess there could be other factors as well - most of our applications are data-intensive applications, that's true - but what about, let's say, a routing service? Or some kind of AI, like speech recognition?
Also, compression can depend on the data in question. I guess FLAC files are better compressed than wav files gzipped.
It looks a bit like Project Estimation theories... while I have to agree, some of them did serve me well, what does data gravity actually mean in practice?
'Cause project estimation can be easily translated into dollars, manhours, deadlines.
But what does this mean? Where does it help?
Re: So, your data mass = du -sh `zip your_backup.export` ?
Data Gravity represents the dependence between your Application and that Data Mass (Source). You could choose to embrace or resist this dependence depending on your motives. You could also calculate weight based on mass, gravity, and the network to determine your costs, for example.
Ultimately, I hope that it will allow people to derive accurate calculations for understanding causes for behaviors of applications, data growth and movement, costing, and a host of other things. This is all key to optimization in distributed architectures such as those used in Enterprises and Cloud Computing....
Gravity and distributed objects
I love the idea, but may have issue with the derivation of the units :-)
However, whilst the data gravity force is being linked to acceleration (aka the force due to data gravity), the way the units have come about is not actually correct.
For those of us that have done some physics in our time (in the UK, A-level should suffice), if you carry out the dimensional analysis on the equation itself (which I don't have much issue with, though I could argue that the use of averages distorts the true dynamics somewhat), you don't actually get the units that data gravity is measured in (i.e. you don't get MB/s/s).
So either the units are wrong, or worse the equation is wrong (but I can't see anything immediately wrong with it) or I have been careless (which is definitely within the realms of possibility) :-D
Different environments often have different data flows and information needs. Overall, it may have been easier to keep the more general equation of data gravity at a higher level and have the specific equations for each case make an appearance as part of that. This would allow a calculus to be developed for non-linear systems and indeed, can theoretically include stochastic elements such as Markovian/probabilistic arrival rates.
However, I think this potentially has a wider application in classes within systems as well. The gravity associated with a class can be indicated by the amount of coupling to the conceptual class. So I definitely think this sort of idea has traction.