Posted by Stefan Tilkov on Mar 21, 2007 01:47 PM
In his talk at QCon, Mark Nottingham, a “Principal Technical Yahoo!”, provided some insight into how the Yahoo! Media Group uses the Web, and not Web services, to build its SOA variant.
As can be expected from a Web company of Yahoo!’s size, the numbers are impressive: there are about 4 billion daily page views, up from about 65 million in 1996. Although Yahoo! presents a unified appearance to the outside, it has a diverse environment internally. Integrating the different “properties” (as Yahoo! calls its offerings) can quickly become a nightmare: acquisitions, integration with partners, and integration with older Yahoo! infrastructure pieces all pose significant challenges. This problem is intensified by Yahoo!’s own Slashdot effect: a link on the Yahoo! home page will trigger an enormous load on any of the existing applications.
The initial architecture at Y! Media Group for most of the properties consisted of independent front end boxes, including databases, with a master database in the backend. When one of the properties needed to be expanded due to higher demand, this led to a number of problems because “large datasets don’t push well”: If there was a need to extend the News property with 50 machines, all of them had to be initialized with the appropriate content from the backend master database. Adding to these problems are issues of synchronization after a failure, the fact that more and more content is generated by the users, and the need for “cross property integration”. Mark gave Yahoo! Tech as an example, which integrates products and answers from Yahoo! Answers. Another new property burdened with this problem is Yahoo! Pipes. In the old architecture, the need for cross-property access was solved by one “frontend box” requesting data from another frontend box, intensifying the problems.
The requirements for an improved architecture were thus pretty obvious: massive scalability, flexible deployment, highly dynamic, and separation of concerns. As a result, Y! Media Group decided to move towards a service-oriented architecture. Because “Webby” solutions such as PHP have always been more prevalent at Yahoo! than “enterprise” technologies such as Java, and because it was felt that scalability, simplicity, reuse, and interoperability are better addressed this way, the decision was made to use a REST/HTTP-based solution instead of one relying on Web services and the WS-* stack.
Instead of replicating data between a backend master database and the frontend databases, the frontend boxes now issue requests through a cache to backend API servers, all via HTTP. Because of this, there is now a single source of truth. The cache stores a copy of the data once it has been requested: a pull model rather than a push model. Asked whether this is a RESTful API, Mark stressed that he views issues around REST as a philosophical discussion, but conceded that the backend APIs are, in fact, RESTful. (He has expressed this view before in a blog entry called “REST issues: Real and Imagined”.) User-generated content is pushed through to the backend, and adding capacity becomes easy.
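The pull model described above can be illustrated with a minimal sketch. This is not Yahoo!'s implementation: the class name `PullThroughCache`, the `backend_api` function, and the example path are all hypothetical, and a real deployment would place an HTTP cache such as Squid between the boxes rather than an in-process dictionary.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Hypothetical read-through cache between frontends and a backend API."""

    def __init__(self, fetch_from_backend: Callable[[str], str]) -> None:
        self._fetch = fetch_from_backend
        self._store: Dict[str, str] = {}
        self.backend_calls = 0  # counts how often the backend is actually hit

    def get(self, path: str) -> str:
        # Data is copied into the cache only when someone asks for it (pull),
        # instead of being pushed to every frontend box in advance.
        if path not in self._store:
            self.backend_calls += 1
            self._store[path] = self._fetch(path)
        return self._store[path]


def backend_api(path: str) -> str:
    """Stand-in for a backend API server, the single source of truth."""
    return f"content for {path}"


cache = PullThroughCache(backend_api)
first = cache.get("/news/story/1")   # miss: pulled from the backend
second = cache.get("/news/story/1")  # hit: served from the cache
```

With this shape, a newly added frontend box needs no initial bulk load; it simply starts issuing requests and the cache fills on demand.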
As one example of just using HTTP correctly instead of getting into a philosophical REST discussion, Mark cited caching intermediaries. The caching features built into HTTP are quite advanced, and they become immediately usable for well-designed HTTP applications. Examples of advantages are freshness (because the data is pulled from the backend whenever it needs to be) and validation (asking “has this changed?” is a quick HTTP-based question to the backend). It is also possible to provide “recalculated” results, which are validated against the ETag of the calculation input. Having a standards-based cache also enables the collection of metrics and load balancing. (For a great introduction to HTTP caching, see Mark’s own Caching Tutorial for Web Authors and Webmasters.)
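Freshness and validation can be sketched as follows. This is a simplified model under stated assumptions, not a real HTTP client: the `origin` function stands in for a backend API server that honors `If-None-Match`, and the 60-second lifetime, the ETag value, and the body are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CachedResponse:
    body: str
    etag: str
    stored_at: float  # time the entry was stored or last validated
    max_age: float    # freshness lifetime in seconds (Cache-Control: max-age)


def is_fresh(entry: CachedResponse, now: float) -> bool:
    # Freshness: the entry can be served without contacting the backend.
    return (now - entry.stored_at) < entry.max_age


def origin(if_none_match: Optional[str]) -> Tuple[int, Optional[str], str]:
    """Hypothetical backend: returns (status, body, etag)."""
    current_body, current_etag = "headline v1", '"v1"'
    if if_none_match == current_etag:
        return 304, None, current_etag  # Not Modified: no body is sent
    return 200, current_body, current_etag


def fetch(entry: Optional[CachedResponse], now: float) -> Tuple[CachedResponse, str]:
    if entry is not None and is_fresh(entry, now):
        return entry, "fresh hit"
    # Validation: ask "has this changed?" by sending the stored ETag.
    status, body, etag = origin(entry.etag if entry else None)
    if status == 304 and entry is not None:
        return CachedResponse(entry.body, etag, now, entry.max_age), "revalidated"
    return CachedResponse(body or "", etag, now, 60.0), "fetched"


entry, how1 = fetch(None, now=0.0)    # nothing cached yet: full fetch
entry, how2 = fetch(entry, now=30.0)  # still within max_age: no backend call
entry, how3 = fetch(entry, now=90.0)  # stale: cheap conditional request
```

The third call is the interesting one: the backend answers with a status and no body, so the stale copy is refreshed without transferring the content again.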
Mark also commented on some more advanced techniques used at Yahoo! Media Group. Multi-GB memory caches are not at all uncommon, and sometimes they are put into groups that are kept in sync via cache peering, i.e. the synchronization of more than one cache in a group. (There are numerous common cache peering protocols, such as ICP.) Another advanced concept is negative caching: if there’s an error out of the API server, the cache will cache the error, reducing the load on the backend. Collapsed forwarding means that multiple requests from the frontend for the same resource can be collapsed into a single backend request, which according to Mark is another great way to mitigate traffic overload from the frontend. While the cache is refreshing something from the backend, it can return a stale copy, a concept called stale-while-revalidate. Similarly, stale-if-error means that if there’s a problem on the backend box, the cache can serve a stale copy, too. Another concept is an invalidation channel, which is an out-of-band mechanism to tell the cache something has become stale.
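The interplay of stale-while-revalidate and stale-if-error can be made concrete with a small decision function. This is a hedged sketch of the idea only, not Squid's logic: the function name `choose_action`, its parameters, and the example windows (60, 30, and 300 seconds) are assumptions chosen for illustration.

```python
def choose_action(
    age: float,
    max_age: float,
    stale_while_revalidate: float,
    stale_if_error: float,
    backend_ok: bool,
) -> str:
    """Decide what a cache does with an entry that is `age` seconds old."""
    if age < max_age:
        return "serve fresh"
    staleness = age - max_age
    # Within the stale-while-revalidate window the client gets the stale copy
    # immediately, and the refresh happens in the background.
    if staleness <= stale_while_revalidate:
        return "serve stale, revalidate in background"
    # Outside that window the cache has to wait for the backend.
    if backend_ok:
        return "revalidate, then serve"
    # The backend failed: fall back to the stale copy if still allowed.
    if staleness <= stale_if_error:
        return "serve stale because of error"
    return "return error"


# Example policy: fresh for 60s, then 30s of stale-while-revalidate
# and 300s of stale-if-error (all values are illustrative).
policy = dict(max_age=60, stale_while_revalidate=30, stale_if_error=300)
young = choose_action(age=10, backend_ok=True, **policy)
just_stale = choose_action(age=75, backend_ok=True, **policy)
backend_down = choose_action(age=200, backend_ok=False, **policy)
too_old = choose_action(age=1000, backend_ok=False, **policy)
```

Both mechanisms trade a bounded amount of staleness for lower latency and for shielding the frontend from backend failures.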
Currently, Yahoo! uses Squid, but Mark expressed his belief that one of the strengths of his approach is that caching is a commodity: Squid could easily be replaced by something else.
Mark also warned about some pitfalls. He questioned the merit of “REST vs. WS-* wars” and mentioned that he prefers to focus on applying Web technologies in practice instead of talking about them in theory. Also interesting was his assertion that REST and HTTP are human-intuitive, but not programmer-intuitive: he finds it much harder to explain REST to programmers than to “normal” human beings. He also noted that there are different deployment and operational concerns, since people know how to handle single applications, but that knowledge is not directly transferable to such a large-scale deployment. According to Mark, formats are hard even when applying REST and HTTP, just like in the WS-* world. He also highlighted the risk of format/interface proliferation (choice quote: “if you give developers a new protocol construction toolkit, they’ll build protocols”) and the problems with authentication (“HTTP authentication mechanisms are unbelievably primitive”), and mentioned that in his opinion, tools such as intermediaries have a way to go since they are optimized for the browsing case, not the service case.
He finished his talk by describing what he believes is needed: tools, a web-friendly description language (such as WADL), a data-oriented schema language (instead of something that describes markup), a significant investment in the Atom stack (according to him, Atom/RSS can be used in 80% of the cases to mitigate interface and format proliferation), and a standardized HTTP test suite.
As an application developer, the interesting point to me is that Yahoo (and Amazon too!) are two massive websites that are decomposed into web services back ends, internal SOAs so to speak, instead of following the typical tightly coupled single-platform approach. At least that's far more interesting to me from an architectural perspective than whether they use WSDL or REST to define their services. :)