Posted by Steven Robbins on Jun 19, 2008 04:31 AM
Jim Gray, a man who has contributed greatly to technology over the past 40 years, is credited with saying that memory is the new disk and disk is the new tape. With the proliferation of "real-time" web applications and systems that require massive scalability, how are hardware and software relating to this meme?
[M]emory is several orders of magnitude faster than disk for random access to data (even the highest-end disk storage subsystems struggle to reach 1,000 seeks/second). Second, with data-center networks getting faster, it’s not only cheaper to access memory than disk, it’s cheaper to access another computer’s memory through the network. As I write, Sun’s Infiniband product line includes a switch with 9 fully-interconnected non-blocking ports each running at 30Gbit/sec; yow! The Voltaire product pictured above has even more ports; the mind boggles. (If you want the absolute last word on this kind of ultra-high-performance networking, check out Andreas Bechtolsheim’s Stanford lecture.)
Tim also pointed out the truth of the second part of Gray's statement: "For random access, disks are irritatingly slow; but if you pretend that a disk is a tape drive, it can soak up sequential data at an astounding rate; it’s a natural for logging and journaling a primarily-in-RAM application."
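The "logging and journaling a primarily-in-RAM application" idea can be illustrated with a minimal Python sketch. It is not taken from any of the systems discussed here: the `JournaledStore` class, its JSON-lines journal format, and the `demo` helper are all hypothetical names chosen for illustration. Reads and writes hit an in-memory dictionary, while the disk is only ever appended to and is read back sequentially on restart.

```python
import json
import os
import tempfile


class JournaledStore:
    """Hypothetical store: a dict in RAM for random access,
    an append-only file on disk for durability."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}
        self._replay()
        # Append mode: every write lands at the end of the file,
        # so the disk is treated like a tape.
        self._journal = open(journal_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild the in-memory state with one sequential pass over the journal."""
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, encoding="utf-8") as journal:
            for line in journal:
                entry = json.loads(line)
                if entry["op"] == "put":
                    self.data[entry["key"]] = entry["value"]
                else:
                    self.data.pop(entry["key"], None)

    def _log(self, entry):
        self._journal.write(json.dumps(entry) + "\n")
        self._journal.flush()
        os.fsync(self._journal.fileno())  # force the append to disk before acknowledging

    def put(self, key, value):
        self._log({"op": "put", "key": key, "value": value})
        self.data[key] = value

    def delete(self, key):
        self._log({"op": "del", "key": key})
        self.data.pop(key, None)

    def get(self, key, default=None):
        # Random access never touches the disk.
        return self.data.get(key, default)

    def close(self):
        self._journal.close()


def demo():
    """Write a few entries, 'restart', and return the recovered state."""
    path = os.path.join(tempfile.mkdtemp(), "journal.log")
    store = JournaledStore(path)
    store.put("alice", 1)
    store.put("bob", 2)
    store.delete("alice")
    store.close()

    recovered = JournaledStore(path)  # simulates a process restart
    state = dict(recovered.data)
    recovered.close()
    return state
```

Calling `demo()` rebuilds the dictionary purely by replaying the journal from start to end. A production design would also need periodic snapshots so the journal does not grow without bound, which this sketch leaves out.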
Memory is the new disk! With disk speeds growing very slowly and memory chip capacities growing exponentially, in-memory software architectures offer the prospect of orders-of-magnitude improvements in the performance of all kinds of data-intensive applications. Small (1U, 2U) rack-mounted servers with a terabyte or more of memory will be available soon, and will change how we think about the balance between memory and disk in server architectures. Disk will become the new tape, and will be used in the same way, as a sequential storage medium (streaming from disk is reasonably fast) rather than as a random-access medium (very slow). Tons of opportunities there to develop new products that can offer 10x-100x performance improvements over the existing ones.
Dare Obasanjo pointed out how not paying attention to the mantra can have detrimental effects, a la Twitter's issues. Commenting on Twitter's content management-like implementation, Obasanjo said "The problem is that if you naively implement a design that simply reflects the problem statement then you will be in disk I/O hell. It won't matter if you are using Ruby on Rails, Cobol on Cogs, C++ or hand coded assembly, the read and write load will kill you." In other words, push the random-access operations into RAM and only use disk for sequential operations.
In essence MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the transfer rate of the disk. Contrast this to accessing data from a relational database that operates at the seek rate of the disk (seeking is the process of moving the disk's head to a particular place on the disk to read or write data). So why is this interesting? Well, look at the trends in seek time and transfer rate. Seek time has improved by about 5% a year, whereas transfer rate has improved by about 20%. Seek time is improving more slowly than transfer rate, so it pays to use a model that operates at the transfer rate. Which is what MapReduce does.
While it remains to be seen if Solid State Drives (SSD) will change the seek/transfer ratios, many commenters to White's discussion thought that they may be a leveling factor in the RAM/hard drive debate.
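The seek-versus-transfer argument can be checked with a little arithmetic. The Python sketch below uses the 5% and 20% yearly rates quoted above. Everything else is an assumption made up for illustration and is not from the article: a 10 ms seek, a 100 MB/s transfer rate, and a 1 TB dataset of 100-byte records. The function names are hypothetical too.

```python
def relative_gap(years, seek_rate=0.05, transfer_rate=0.20):
    """Factor by which transfer rate pulls ahead of seek performance
    after `years` of compounding at the quoted yearly rates."""
    return ((1 + transfer_rate) / (1 + seek_rate)) ** years


def seek_update_seconds(records_to_update, seek_seconds):
    """Update records in place: assume one seek per record touched."""
    return records_to_update * seek_seconds


def stream_rewrite_seconds(total_bytes, bytes_per_second):
    """Rewrite the whole dataset: read it once and write it once, sequentially."""
    return 2 * total_bytes / bytes_per_second


# Assumed, illustrative drive and dataset figures (not from the article).
SEEK_SECONDS = 0.010            # 10 ms per seek
TRANSFER_BYTES_PER_S = 100e6    # 100 MB/s sequential transfer
TOTAL_BYTES = 1e12              # 1 TB dataset
RECORD_BYTES = 100
TOTAL_RECORDS = TOTAL_BYTES / RECORD_BYTES

# Updating 1% of the records by seeking vs. streaming the whole dataset.
seek_cost = seek_update_seconds(0.01 * TOTAL_RECORDS, SEEK_SECONDS)
stream_cost = stream_rewrite_seconds(TOTAL_BYTES, TRANSFER_BYTES_PER_S)

# Fraction of records above which streaming the whole dataset wins.
breakeven_fraction = stream_cost / (TOTAL_RECORDS * SEEK_SECONDS)

if __name__ == "__main__":
    print(f"gap after 10 years: {relative_gap(10):.2f}x")
    print(f"seek-based update of 1%: {seek_cost:,.0f} s")
    print(f"streaming rewrite of everything: {stream_cost:,.0f} s")
    print(f"break-even fraction of records: {breakeven_fraction:.4%}")
```

Under these assumed figures, a full sequential rewrite beats seek-based updates once more than a small fraction of a percent of the records change. At the quoted rates the gap between transfer and seek performance widens by a factor of about 3.8 per decade, so the break-even point keeps moving in favor of streaming.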
In-memory data grids (IMDGs) provide object-based database capabilities in memory, and support core database functionality, such as advanced indexing and querying, transactional semantics and locking. IMDGs also abstract data topology from application code. With this approach, the database is not completely eliminated, but put in the *right* place.
The primary benefits of an IMDG over direct RDBMS interaction listed were:
Wicked!
Time to use Prevayler :)
Prevayler is back (maybe it never left?)! It shall rule the world!
You can also use open source object database db4o (developer.db4o.com) configured as an in memory database. And you'd get all the benefits described in the article (in-local cache, no ORM, etc.)
While it remains to be seen if Solid State Drives (SSD) will change the seek/transfer ratios, many commenters to White's discussion thought that they may be a leveling factor in the RAM/hard drive debate.
While RAM is faster than a hard drive, it's not the performance that makes the difference. The hard drive concept is "slow" because it's a shared storage model, and the RAM is "fast" because there's some of it co-located with every CPU. If the hard drives were local then the scalability would be roughly identical, and the scalability is orders of magnitude more important than the raw single-threaded latency in a large system.
Peace,
Cameron Purdy
Oracle Coherence: Data Grid for Java, .NET and C++
Does that mean that all we need to do is replace the current disks with RAM technology to gain speed? The title of the article leads people to think along those lines.
IMO it's not just the speed of memory compared to disks that makes a difference. It's not even the extra benefit of the collocation of CPU and memory. What's really important is the fact that disk is a sequential storage medium that was designed primarily to store a stream of bytes, not tables of data.
See my recent post on that matter for more details.
Nati S.
GigaSpaces
This is the approach that we've been taking with Web caching for HTTP-based services; serving a response out of memory is infinitely faster than getting it off a disk or from the origin server, and cache peering allows you to reach across the network and get it from a peer. The cyclic COSS filesystem in Squid is a good choice when you *must* go to disk.
For a multithreaded, indexed, clustered and simple in-memory Java collection persistence system. ;-) http://www.space4j.org/
Very good article.
However, you say in the advantages of IMDG over RDBMS that:
"Data can be accessed by reference"
but I've never worked on a project where this is possible. In n-tier applications, (e.g. Java server based http) you always have to look up objects by some kind of ID because of the request/response mechanism.
Recently I've been working with Flex and Java using BlazeDS in which case the object I'm manipulating in the client is serialised over the wire (Java to ActionScript to Java). Thus the object that gets passed to my invoked methods does not have the same reference and I have to do a lookup by ID anyway. (Some Adobe fan might point out that LiveCycle DataServices can actually handle this, but what is it doing under the covers? I don't know for sure, but I imagine it's passing IDs around)
In both cases this has to happen regardless of whether the storage is an OODBMS, RDBMS, some kind of fancy caching or just stored in collections.
So I guess my question actually is, can you give an example of a scenario in which data being accessed by reference is an advantage?
Cheers,
Chris