Twitter's Answers is an analytics service for mobile apps that has come to see five billion sessions per day. Ed Solovey, software engineer at Twitter, has described how their system works to provide "reliable, real-time, and actionable" data based on hundreds of millions of mobile devices sending millions of events every second.
As Solovey explains, Answers main duties are the following:
- receiving events;
- archiving them;
- performing offline and real-time computations;
- merging the results of those computations into coherent information.
The service in charge of receiving events from the devices is written in Go and uses Amazon Elastic Load Balancer. It enqueues every payload into a durable Kafka queue. Given the sheer number of events to store, Kafka is only used as a temporary cache containing a few hours worth of data, while Storm is used to transfer the data to Amazon S3.
Once the data is in S3, Amazon Elastic MapReduce is used to process it in batch. The result is stored in a Cassandra cluster to be available for querying through an API.
This is only half of the story, though. Answers indeed also processes data in real-time. At this aim, the content of the Kafka store is piped into Storm and processed through probabilistic algorithms like Bloom Filters and HyperLogLog to provide timely results at the cost of a "negligible loss of accuracy." The results are stored again to Cassandra.
At the end of the process, Cassandra contains batch computation results as well as real-time results. The query API is in charge to combine those two streams to provide a coherent view based on the query parameters. Results from the batch computation are preferred when available, since they are more precise, otherwise real-time results are provided.
The architecture described here is also effective when it comes to failure handling, says Solovey, thanks to the use of durable queues to connect Answers components. This will ensure that any outage at one component will not affect the others. Furtermore, it also allows for recovery and guarantees no data loss if the system goes back to normal operation within a given timeframe.