Presentation: Jinesh Varia About Amazon Alexa Web Service's Architecture

In this presentation, Jinesh Varia, a Web Services Evangelist at Amazon, talks about the architecture of Alexa, one of Amazon's web services. Jinesh explains how Amazon achieved scalability and high performance while reducing costs for the Alexa service.

Watch:  Jinesh Varia About Amazon's Alexa Web Service (43 min)

The Alexa Web Service, backed by an application known internally as GrepTheWeb, gathers information about web sites, including traffic data, contact information, and more. The collected data is then made available to clients, which can run specialized queries against it to find specific information.

Jinesh explains that GrepTheWeb uses Hadoop, a free Java software platform for running applications that process vast amounts of data. In this case, the data is stored on Amazon's Simple Storage Service (S3) and is retrieved by Hadoop clusters when a client request is processed; finally, a result is returned to the customer. Hadoop runs inside Amazon's Elastic Compute Cloud (EC2).
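The "grep" in GrepTheWeb hints at the processing model: a pattern is matched against a large corpus in a map phase, and the matches are collected in a reduce phase. The sketch below illustrates that MapReduce pattern in plain Python; it is not Hadoop's actual API, and the sample pages standing in for documents fetched from S3 are invented for illustration.

```python
import re
from collections import defaultdict

def map_phase(doc_id, text, pattern):
    """Map step: emit a (doc_id, line) pair for every line matching the pattern."""
    regex = re.compile(pattern)
    for line in text.splitlines():
        if regex.search(line):
            yield doc_id, line

def reduce_phase(mapped):
    """Reduce step: group the matching lines by document."""
    results = defaultdict(list)
    for doc_id, line in mapped:
        results[doc_id].append(line)
    return dict(results)

def grep_the_web(documents, pattern):
    """Run both phases over a corpus of doc_id -> text pages."""
    mapped = (pair
              for doc_id, text in documents.items()
              for pair in map_phase(doc_id, text, pattern))
    return reduce_phase(mapped)

# Sample corpus standing in for pages retrieved from S3 (invented data)
pages = {
    "example.com": "hello world\ncloud computing rocks",
    "infoq.com":   "software development news\ncloud architecture talk",
}
print(grep_the_web(pages, "cloud"))
# → {'example.com': ['cloud computing rocks'], 'infoq.com': ['cloud architecture talk']}
```

In Hadoop the map tasks would run in parallel across the EC2 cluster, each reading its slice of the corpus, with the framework shuffling intermediate pairs to the reducers.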

The whole architecture lives in a cloud whose internals are completely hidden from the service customer. When a request is issued, an entire framework is assembled on as many machines as necessary to process it and generate a result; then the whole framework disappears. This cloud architecture makes the service highly scalable: because it can expand to a theoretically unlimited number of nodes, the service performs well, and since the supporting infrastructure is created on the fly and exists only while a request is being processed, costs stay low.
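The provision-process-teardown lifecycle described above can be sketched as follows. This is a toy model under stated assumptions: the class, the scaling rule, and the uppercasing stand-in for real processing are all invented for illustration, not Amazon's actual provisioning API.

```python
class EphemeralCluster:
    """Toy model of an on-demand cluster: machines are acquired for a single
    request and released once the result is ready (hypothetical, not EC2's API)."""

    def __init__(self):
        self.active_nodes = 0

    def provision(self, n):
        # stand-in for launching n compute instances
        self.active_nodes = n

    def process(self, work_items):
        # in the real system the provisioned nodes would share this work in parallel
        return [item.upper() for item in work_items]

    def teardown(self):
        # instances are released; billing for them stops here
        self.active_nodes = 0

def handle_request(work_items):
    cluster = EphemeralCluster()
    cluster.provision(max(1, len(work_items) // 10))  # size the cluster to the request
    try:
        return cluster.process(work_items)
    finally:
        cluster.teardown()  # the framework "disappears" once the result is produced

print(handle_request(["traffic data", "contact info"]))
# → ['TRAFFIC DATA', 'CONTACT INFO']
```

The key cost property is in `handle_request`: nodes exist only between `provision` and `teardown`, so an idle service consumes no compute resources.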

One of the main features of Alexa's architecture is fault tolerance. The data is duplicated and stored in physically separate locations to avoid data loss, and Hadoop takes care of spawning and controlling as many processes as needed to handle the large amounts of data involved.
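The replication idea can be illustrated with a minimal sketch: every object is written to several independent locations, and a read succeeds as long as at least one replica survives. The class and the location names are assumptions made for illustration; they do not reflect S3's actual replication mechanism.

```python
class ReplicatedStore:
    """Toy sketch of replication for fault tolerance (hypothetical API)."""

    def __init__(self, locations=("loc-a", "loc-b", "loc-c")):
        # one independent key-value store per physical location (names invented)
        self.replicas = {loc: {} for loc in locations}

    def put(self, key, value):
        # write every object to all locations
        for store in self.replicas.values():
            store[key] = value

    def get(self, key):
        # read from the first location that still holds the object
        for store in self.replicas.values():
            if key in store:
                return store[key]
        raise KeyError(key)

    def fail(self, location):
        # simulate losing an entire location
        self.replicas[location] = {}

store = ReplicatedStore()
store.put("site:example.com", {"rank": 42})
store.fail("loc-a")                    # one location is lost...
print(store.get("site:example.com"))   # ...but the data is still readable
# → {'rank': 42}
```

Losing one location costs nothing but a failed lookup in `get`, which falls through to a surviving replica; data is lost only if every location fails at once.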
