Dave McCrory Unveils Initial Formula for Principle of Data Gravity


Does data, like a celestial body, have its own gravitational pull that attracts applications and services into its orbit? That was the proposal in 2010 by VMware’s Dave McCrory who has recently put some mathematical prowess beneath his principle. In his new website, DataGravity.org, McCrory outlines the formula for data gravity and asks the technical community for help in vetting and applying his formula.

In a 2011 post about data gravity, McCrory describes the basics of his principle.

Data Gravity is a theory around which data has mass.  As data (mass) accumulates, it begins to have gravity.  This Data Gravity pulls services and applications closer to the data.  This attraction (gravitational force) is caused by the need for services and applications to have higher bandwidth and/or lower latency access to the data.

This principle has resonated deeply with the technical community. It is frequently discussed at conferences, including the recent GigaOM Structure, and written about in articles such as the ReadWriteCloud piece entitled What "Data Gravity" Means to Your Data. In that article, author Joe Brockmeier warns against casual investment in data storage that may generate significant gravity.

Whether it's a single-user application like iTunes, or a company wide project: You need to consider the implications of data gravity - once your data is in, how hard will it be to break the gravitational field?

The stronger the data gravity involved, the more cautious you should be when you choose your data storage solution. It's likely that once you have a sufficient amount of data wrapped up in a solution, it's going to be very difficult (if not impossible) to justify the costs of moving it away.

On his DataGravity.org site, McCrory describes his approach to quantifying this principle. First, he tackles the calculation of Data Mass.

The first thing that I learned was that in order to have Gravity, you must calculate Mass. While this is trivial Physics, applying this to an abstract concept is a bit more difficult. After a great deal of time and many versions, I have a current Mass formula for Data and a Mass formula for Applications (either or both of these could change at some point)

He decided to calculate Data Mass by multiplying data volume (equal to the size of the data measured in megabytes) by the data density (which is the compression ratio of the data).
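That verbal definition translates directly into code. The sketch below is an illustration of the description above; the function and variable names are mine, not McCrory's:

```python
def data_mass(size_mb: float, compression_ratio: float) -> float:
    """Data Mass = data volume (size in MB) x data density (compression ratio)."""
    return size_mb * compression_ratio

# A 10,000 MB data set that compresses 3:1 (density = 3.0):
print(data_mass(10_000, 3.0))  # 30000.0
```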

After that, McCrory calculated Application Mass by multiplying application volume (generated by adding amount of memory used plus the amount of disk space used) by the application density (produced by adding the compression ratio of the memory, compression ratio of the disk space, and the total amount of CPU utilization).
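In the same illustrative style (again, names are mine), the Application Mass described above can be sketched as:

```python
def application_mass(memory_mb: float, disk_mb: float,
                     mem_compression: float, disk_compression: float,
                     cpu_utilization: float) -> float:
    """Application Mass = application volume x application density.

    Volume  = memory used + disk space used (MB)
    Density = memory compression ratio + disk compression ratio
              + total CPU utilization
    """
    volume = memory_mb + disk_mb
    density = mem_compression + disk_compression + cpu_utilization
    return volume * density

# An app using 512 MB of RAM and 2,048 MB of disk, with 1.5:1 memory
# compression, 2:1 disk compression, and 75% CPU utilization:
print(application_mass(512, 2048, 1.5, 2.0, 0.75))  # 10880.0
```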

To account for the significant impact of the network on data gravity, McCrory injects variables for network latency, network bandwidth, the number of requests per second, and the average size of requests. He combines all of these factors to arrive at a calculation for data gravity. In an interview with InfoQ, McCrory shared that he considered and discarded many additional variables. He attempted to factor in the impact of create/read/update/delete operations for a given data mass and even the type of storage that the data rested upon, but ultimately decided that the formula published on DataGravity.org captured the key aspects of data gravity.
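The article does not reproduce McCrory's published formula, but the shape of such a combination can be sketched by analogy with Newtonian gravity: the two masses and the link's usage attract, while latency acts as distance. This is purely an illustration of the kind of calculation involved, not McCrory's actual formula, and every name and weighting below is my assumption:

```python
def data_gravity(data_mass: float, app_mass: float,
                 latency_s: float, bandwidth_mbps: float,
                 requests_per_s: float, avg_request_mb: float) -> float:
    """Illustrative Newtonian-style combination (NOT McCrory's published
    formula): the masses attract in proportion to how heavily the network
    link is used, and in inverse proportion to the square of latency."""
    # Fraction of the link consumed by request traffic.
    network_usage = (requests_per_s * avg_request_mb) / bandwidth_mbps
    return (data_mass * app_mass * network_usage) / (latency_s ** 2)

# Using the Data Mass and Application Mass values from the earlier examples,
# over a 100 Mbps link with 10 ms latency, 500 req/s, 0.1 MB per request:
print(data_gravity(30_000, 10_880, 0.01, 100, 500, 0.1))
```

Note how the inverse-square latency term dominates: halving latency quadruples this illustrative gravity figure, which matches the intuition that low-latency access is what pulls applications toward their data.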

McCrory considers the number produced by this calculation to be relative to the network that data exists in. According to McCrory, each network is a universe, and a given data mass exists in that universe. While one could fruitfully compare data gravity numbers between two objects within the same network, McCrory does not yet have enough information to confidently compare the data gravity for one network versus the data gravity from another network.

On the DataGravity.org site, McCrory lists a few of the possible uses of this principle.

Reasons to move toward a source Data Gravity (Increase Data Gravity)

• You need Lower Latency for your Application or Service
• You need Higher Bandwidth for your Application or Service
• You want to generate more Data Mass more quickly
• You are doing HPC
• You are doing Hadoop or Realtime Processing

Reasons to resist or move away from a source of Data Gravity (Decrease Data Gravity)

• You want to avoid lock-in / keep escape velocity low
• Application Portability
• Resiliency to Increases in Latency or Decreases in Bandwidth (aka Availability)

Data Gravity and Data Mass may have other uses as well:

• Making decisions of movement or location between two Data Masses
• Projecting Growth of Data Mass
• Projecting Increases of Data Gravity (Which could signal all sorts of things)

This formula is a work in progress, according to McCrory, and he is actively seeking real-world use cases and tests of this principle.

• So, your data mass = du -sh zip your_backup.export ?


So, does this mean that by compressing your data close to its Kolmogorov complexity (basically, depends on serialization of course, but most lossless compression algorithms, like *zip, will do that for you; they're mostly based on LZ77/LZW), you'll get your data's mass?

And your Application Mass is basically what gets billed to you by EC2 (or any other cloud provider)?

While it can be true, I guess there could be other factors as well - most of our applications are data-intensive applications, that's true - but what about, let's say, a routing service? Or some kind of AI, like speech recognition?

Also, compression can depend on the data in question. I guess FLAC files are better compressed than wav files gzipped.

It looks a bit like Project Estimation theories... while I have to agree, some of them did serve me well, what does actually data gravity mean in practice?

'Cause project estimation can be easily translated into dollars, manhours, deadlines.

But what does this mean? Where does it help?

• Re: So, your data mass = du -sh zip your_backup.export ?

by Dave McCrory,


Mass is meant to be a measure of the amount of Data. In Data Mass the Data Set Size and how Compressed (aka Dense) the Data is is meant to determine the size, however size doesn't mean much until it is being used by something (say an Application). Application Mass is meant to measure the Transformational steps to the Data along with the working Data Size. The usefulness is being measured by the Requests and the Request Size. Less Useful = Fewer Requests.

Data Gravity represents your dependence between your Application and that Data Mass (Source). You could choose to embrace or resist this dependence depending on your motives. You could also choose to calculate weight based on mass, gravity, and the network and determine your costs for example.

Ultimately, I hope that it will allow people to derive accurate calculations for understanding causes for behaviors of applications, data growth and movement, costing, and a host of other things. This is all key to optimization in distributed architectures such as those used in Enterprises and Cloud Computing....

• Gravity and distributed objects

by Mark Little,


Interesting. It reminds me of some work that was going on back in the late 1980's and early 1990's on distributed objects and object migration. I can't find the reference(s) at the moment, but from memory I recall the idea was to associate a "mass" with an object based on a number of factors - machines, sensors etc. were represented as objects too and they attracted objects to them based on this notion of gravity. An object could migrate away if it could either break free of its own accord, or split with a number of other related objects so they migrated as a group. There was also the concept of a black hole once gravity reached a certain limit ;-)

• I love the idea, but may have issue with the derivation of the units :-)

by Ethar Alali,


I love the idea of this. Personally, I am all for anything that brings rigor to this field. Data is used by applications and it doesn't have any importance if it isn't used. So the idea of data gravity really appeals to me, as the greater the amount of use of the data by apps, just like the greater the number of bodies around a sun say, the greater the overall gravity of the system.

However, whilst the data gravity force is being linked to acceleration (aka the force due to data gravity), the way the units come about is not actually correct.

For those of us who have done some physics in our time (in the UK, A-level should suffice), if you carry out the dimensional analysis on the equation itself (which I don't have much issue with, though I could argue that the use of averages distorts the true dynamics somewhat), you don't actually get the units that data gravity is measured in (i.e. you don't get MB/s/s).

So either the units are wrong, or worse the equation is wrong (but I can't see anything immediately wrong with it) or I have been careless (which is definitely within the realms of possibility) :-D

Different environments often have different data flows and information needs. Overall, it may have been easier to keep the more general equation of data gravity at a higher level and have the specific equations for each case make an appearance as part of that. This would allow a calculus to be developed for non-linear systems and indeed, can theoretically include stochastic elements such as Markovian/probabilistic arrival rates.

However, I think this potentially has a wider application in classes within systems as well. The gravity associated with a class can be indicated by the amount of coupling to the conceptual class. So I definitely think this sort of idea has traction.
