InfoQ

QCon: REST for SOA at Yahoo!

Posted by Stefan Tilkov on Mar 21, 2007

Sections: Architecture & Design, Development, Enterprise Architecture
Topics: Java, Performance & Scalability, SOA, .NET
Tags: Caching, Yahoo!, Casestudy

In his talk at QCon, Mark Nottingham, a “Principal Technical Yahoo!”, provided some insight into how the Yahoo! Media Group uses the Web, and not Web services, to build its SOA variant.

As can be expected from a Web company of Yahoo!’s size, the numbers are impressive: There are about 4 billion daily page views, up from about 65 million in 1996. Although Yahoo! presents a unified appearance to the outside, it has a diverse environment internally. Integrating the different “properties” (as Yahoo! calls its offerings) can quickly become a nightmare: acquisitions, integration with partners, and integration with older Yahoo! infrastructure pieces all pose a significant challenge. This problem is intensified by Yahoo!’s own Slashdot effect: A link on the Yahoo! home page will trigger an enormous load on any of the existing applications.

The initial architecture at Y! Media Group for most of the properties consisted of independent front end boxes, including databases, with a master database in the backend. When one of the properties needed to be expanded due to higher demand, this led to a number of problems because “large datasets don’t push well”: If there was a need to extend the News property with 50 machines, all of them had to be initialized with the appropriate content from the backend master database. Adding to these problems are issues of synchronization after a failure, the fact that more and more content is generated by the users, and the need for “cross property integration”. Mark gave Yahoo! Tech as an example, which integrates products and answers from Yahoo! Answers. Another new property burdened with this problem is Yahoo! Pipes. In the old architecture, the need for cross-property access was solved by one “frontend box” requesting data from another frontend box, intensifying the problems.

The requirements for an improved architecture were thus pretty obvious: massive scalability, flexible deployment, highly dynamic, separation of concerns. As a result, Y! Media Group decided to move towards a Service-oriented architecture. Because “Webby” solutions such as PHP have always been more prevalent than “enterprise” technologies such as Java, and because it was felt that scalability, simplicity, reuse, and interoperability are better addressed this way, the decision was made to use a REST/HTTP-based solution instead of one relying on Web services and the WS-* stack.

Instead of replicating data between a backend master database and the frontend databases, the frontend boxes now issue requests through a cache to backend API servers, all via HTTP. Because of this, there is now a single source of truth. The cache replicates the data once it has been requested: a pull model rather than a push model. Asked whether this is a RESTful API, Mark stressed that he views issues around REST as a philosophical discussion, but conceded that the backend APIs are, in fact, RESTful. (He has expressed this view before in a blog entry called “REST issues: Real and Imagined”.) User-generated content is pushed through to the backend, and adding capacity becomes easy.
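The pull model described above can be illustrated with a minimal sketch. This is not Yahoo!’s implementation: the class name `PullThroughCache`, the `backend_api` function, the path `/news/42`, and the 60-second lifetime are all assumptions made for illustration, and a real deployment would use an HTTP cache such as Squid rather than application code.

```python
import time
from typing import Callable, Dict, Tuple


class PullThroughCache:
    """Hypothetical cache that copies data from the backend only when a
    frontend first asks for it (pull), instead of the backend pushing a
    full copy to every frontend box."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # call to the backend API server
        self._ttl = ttl_seconds      # how long a stored copy counts as fresh
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # path -> (stored_at, body)

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh copy: the backend is not contacted
        body = self._fetch(path)     # miss or expired: pull from the backend
        self._entries[path] = (now, body)
        return body


backend_calls = []                   # records every request that reaches the backend


def backend_api(path: str) -> str:
    """Stand-in for a backend API server (assumed for this sketch)."""
    backend_calls.append(path)
    return f"content for {path}"


cache = PullThroughCache(backend_api, ttl_seconds=60.0)
first = cache.get("/news/42")        # pulled from the backend
second = cache.get("/news/42")       # served from the cache
```

In this sketch a newly added frontend box starts with an empty cache and fills it on demand, which is why adding capacity no longer requires initializing each machine from the master database.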

As one example of just using HTTP correctly instead of getting into a philosophical REST discussion, Mark gave caching intermediaries. The caching features built into HTTP are quite advanced, and they become immediately usable for well-designed HTTP applications. Examples of advantages are freshness (because the data is pulled from the backend whenever it needs to be) and validation (asking “has this changed?” is a quick HTTP-based question to the backend). It is also possible to provide “recalculated” results, which are validated against the ETag of the calculation input. Having a standards-based cache also enables the collection of metrics and load balancing. (For a great introduction to HTTP caching, see Mark’s own Caching Tutorial for Web authors and Webmasters.)
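The validation step (“has this changed?”) corresponds to an HTTP conditional request: the cache sends the ETag it holds in an `If-None-Match` header, and the server answers `304 Not Modified` without a body if the resource is unchanged. The following sketch simulates that exchange in memory. The `Origin` and `ValidatingCache` classes, the hash-based ETag, and the path `/answers/7` are assumptions for illustration, not anything described in the talk.

```python
import hashlib
from typing import Dict, Optional, Tuple


class Origin:
    """Simulated backend API server that supports ETag validation."""

    def __init__(self) -> None:
        self.resources: Dict[str, str] = {}
        self.full_responses = 0      # counts responses that carried a body

    def put(self, path: str, body: str) -> None:
        self.resources[path] = body

    def get(self, path: str,
            if_none_match: Optional[str] = None) -> Tuple[int, Optional[str], str]:
        body = self.resources[path]
        # Assumption: the ETag is derived from a hash of the body.
        etag = '"' + hashlib.sha256(body.encode()).hexdigest()[:16] + '"'
        if if_none_match == etag:
            return 304, None, etag   # unchanged: no body is sent
        self.full_responses += 1
        return 200, body, etag


class ValidatingCache:
    """Revalidates its stored copy on every request (no freshness lifetime,
    to keep the sketch short)."""

    def __init__(self, origin: Origin) -> None:
        self._origin = origin
        self._stored: Dict[str, Tuple[str, str]] = {}  # path -> (etag, body)

    def get(self, path: str) -> str:
        stored = self._stored.get(path)
        etag = stored[0] if stored else None
        status, body, new_etag = self._origin.get(path, if_none_match=etag)
        if status == 304 and stored is not None:
            return stored[1]         # the backend confirmed our copy is current
        assert body is not None      # a 200 response always carries a body here
        self._stored[path] = (new_etag, body)
        return body


origin = Origin()
origin.put("/answers/7", "version 1")
cache = ValidatingCache(origin)

a = cache.get("/answers/7")          # 200: full body transferred
b = cache.get("/answers/7")          # 304: validated, nothing transferred
origin.put("/answers/7", "version 2")
c = cache.get("/answers/7")          # ETag mismatch: new body transferred
```

The second request costs the backend only a comparison, which is the sense in which validation is a “quick” question.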

Mark also commented on some more advanced techniques used at Yahoo! Media Group. Multi-GB memory caches are not at all uncommon, and sometimes they are put into groups that are kept in sync via cache peering, i.e. the synchronization of more than one cache in a group. (There are numerous common cache peering protocols, such as ICP.) Another advanced concept is negative caching: if there’s an error out of the API server, the cache will cache the error, reducing the load on the backend. Collapsed forwarding means that multiple requests from the frontend for the same resource can be collapsed into a single one, which according to Mark is another great way to mitigate traffic overload from the frontend. While the cache is refreshing something from the backend, it can return a stale copy, a concept called stale-while-revalidate. Similarly, stale-if-error means that if there’s a problem on the backend box, the cache can serve a stale copy, too. Another concept is an invalidation channel, which is an out-of-band mechanism to tell the cache something has become stale.
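Two of these techniques, negative caching and stale-if-error, can be sketched together. The code below is a simplified illustration under stated assumptions, not how Squid implements them: the `ResilientCache` class, the `BackendError` exception, and the `ttl` and `negative_ttl` values are hypothetical, and negative caching is reduced to remembering that the backend failed so it is not asked again for a short while. Collapsed forwarding, stale-while-revalidate, and invalidation channels are not covered.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class BackendError(Exception):
    """Raised when the (simulated) backend API server fails."""


class ResilientCache:
    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 negative_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl                      # lifetime of a good response
        self._negative_ttl = negative_ttl    # lifetime of a remembered failure
        self._clock = clock
        self._good: Dict[str, Tuple[float, str]] = {}   # path -> (stored_at, body)
        self._failed_until: Dict[str, float] = {}       # path -> retry-after time

    def get(self, path: str) -> str:
        now = self._clock()
        good = self._good.get(path)
        if good is not None and now - good[0] < self._ttl:
            return good[1]                   # fresh copy
        if now < self._failed_until.get(path, float("-inf")):
            # Negative caching: a recent failure is remembered, so the
            # backend is not contacted again yet.
            return self._stale_or_raise(path, good)
        try:
            body = self._fetch(path)
        except BackendError:
            self._failed_until[path] = now + self._negative_ttl
            return self._stale_or_raise(path, good)
        self._good[path] = (now, body)
        self._failed_until.pop(path, None)
        return body

    @staticmethod
    def _stale_or_raise(path: str, good: Optional[Tuple[float, str]]) -> str:
        if good is not None:
            return good[1]                   # stale-if-error: serve the old copy
        raise BackendError(f"cached failure for {path}")


now = [0.0]                                  # fake clock so the run is deterministic
calls = []
healthy = [True]


def backend_api(path: str) -> str:
    calls.append(path)
    if not healthy[0]:
        raise BackendError("backend down")
    return f"content for {path}"


cache = ResilientCache(backend_api, ttl=10.0, negative_ttl=5.0,
                       clock=lambda: now[0])
fresh = cache.get("/tech/1")                 # backend call 1
healthy[0] = False
now[0] = 11.0                                # the stored copy has expired
stale = cache.get("/tech/1")                 # backend call 2 fails, stale copy served
now[0] = 12.0
stale_again = cache.get("/tech/1")           # failure still remembered, no backend call
```

During the outage the backend sees one failed request instead of one per frontend request, while the frontend keeps receiving the last known good content.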

Currently, Yahoo! uses Squid, but Mark expressed his belief that one of the strengths of this approach is that caching is a commodity: Squid could easily be replaced by something else.

Mark also warned about some pitfalls. He questioned the merit of “REST vs. WS-* wars” and mentioned that he prefers to focus on applying Web technologies in practice instead of talking about them in theory. Also interesting was his assertion that REST and HTTP are human-intuitive, but not programmer-intuitive: he finds it much harder to explain REST to programmers than to “normal” human beings. He also noted that there are different deployment and operational concerns, since people know how to handle single applications, but that knowledge is not directly transferable to such a large-scale deployment. According to Mark, formats are hard even when applying REST and HTTP, just like in the WS-* world. He also highlighted the risk of format/interface proliferation (choice quote: “if you give developers a new protocol construction toolkit, they’ll build protocols”) and the problems with authentication (“HTTP authentication mechanisms are unbelievably primitive”), and mentioned that in his opinion, tools such as intermediaries have a way to go since they are optimized for the browsing case, not the service case.

He finished his talk by describing what he believes is needed: tools, a web-friendly description language (such as WADL), a data-oriented schema language (instead of something that describes markup), a significant investment in the Atom stack (according to him, Atom/RSS can be used in 80% of the cases to mitigate interface and format proliferation), and a standardized HTTP test suite.

  • This article is part of a featured topic series on SOA

    webservices as a back-end services layer...definitely different!

    by Floyd Marinescu

    As an application developer, the interesting point to me is that Yahoo (and Amazon too!) are two massive websites that are decomposed into web services back ends, internal SOAs so to speak, instead of following the typical tightly coupled single-platform approach. At least that's far more interesting to me from an architectural perspective than whether they use WSDL or REST to define their services. :)
