Streaming Big Data With Amazon Kinesis
Amazon recently announced Kinesis, a service that allows developers to stream large amounts of data from different sources and process it. The service is currently in limited preview.
Kinesis seems to wrap up SQS queues and auto-scaling compute instances into a new product. Kinesis gives you the ability to accept millions of POST requests per second and process them all in real time as a stream. You can send the streamed data directly to S3, to your apps for processing, to relational storage, and so on, all in real time.
SQS messages are limited to 256 KB of text (generally JSON, but use whatever you like). Kinesis streams are provisioned in megabytes per second and, as far as I can tell, can accept any kind of data through HTTP PUT.
Additionally, Kinesis stream data stays available to your apps for 24 hours and is replicated across multiple availability zones. SQS also stores messages redundantly across availability zones, but a message is gone once a consumer deletes it, so other readers cannot replay it the way they can replay stream data. And I don't think SQS has the scalability and throughput that Kinesis has. I've never seen a published throughput guarantee for SQS, but Kinesis can accept 1,000 PUT requests per second, per shard.
Applications use Kinesis streams to capture, store and transport data, and each stream can have multiple readers and writers. The capacity of a stream is specified in terms of shards; each shard can accept up to 1,000 write transactions per second, up to a total of 1 MB per second. Users can scale the capacity of individual streams by adding or removing shards without downtime.
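To make the shard arithmetic concrete, here is a rough sizing sketch. It only uses the two per-shard write limits quoted above; the function name `shards_needed` is made up for illustration, and treating "1 MB" as 10**6 bytes is an assumption.

```python
import math

# Per-shard write limits as quoted in the text above.
WRITES_PER_SHARD_PER_SEC = 1000
BYTES_PER_SHARD_PER_SEC = 1_000_000  # "1 MB", taken here as 10**6 bytes (assumption)


def shards_needed(records_per_sec: float, avg_record_bytes: float) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_count = math.ceil(records_per_sec / WRITES_PER_SHARD_PER_SEC)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / BYTES_PER_SHARD_PER_SEC)
    # A stream needs at least one shard, even when it is idle.
    return max(1, by_count, by_bytes)


if __name__ == "__main__":
    # 5,000 records/s of 2,000 bytes each: the byte limit dominates.
    print(shards_needed(5000, 2000))
    # 5,000 records/s of 100 bytes each: the request-count limit dominates.
    print(shards_needed(5000, 100))
```

Whichever limit is hit first decides the shard count, so small, frequent records are bounded by the request limit and large records by the byte limit.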
Developers can use the Kinesis client library to build applications that leverage Kinesis. The producer side uses the PutRecord API to push data. On the consumer side, you provide an implementation of IRecordProcessor and the client “pushes” new records to it as they get created. There are also lower-level interfaces such as GetShardIterator and GetNextRecords. After processing a record, the consumer code can store it in one of the AWS storage services (S3, Redshift, DynamoDB) or pass it along to another Kinesis stream.
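The "push" model is easier to see in code. The sketch below is a toy, in-memory imitation of the pattern, not the Kinesis client library: `RecordProcessor`, `InMemoryShard`, `Worker` and `UppercaseCollector` are all hypothetical names, and the list-backed shard merely stands in for a real stream.

```python
from abc import ABC, abstractmethod
from typing import List


class RecordProcessor(ABC):
    """Hypothetical stand-in for the IRecordProcessor idea: the worker calls us."""

    @abstractmethod
    def process_records(self, records: List[bytes]) -> None:
        ...


class InMemoryShard:
    """Toy append-only log standing in for one shard (not a Kinesis API)."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def put_record(self, data: bytes) -> int:
        """Producer side: append a record and return its position in the log."""
        self._records.append(data)
        return len(self._records) - 1

    def read_from(self, position: int, limit: int) -> List[bytes]:
        """Consumer side: return up to `limit` records starting at `position`."""
        return self._records[position:position + limit]


class Worker:
    """Polls a shard and pushes each new batch to the processor."""

    def __init__(self, shard: InMemoryShard, processor: RecordProcessor,
                 batch_size: int = 2) -> None:
        self._shard = shard
        self._processor = processor
        self._batch_size = batch_size
        self._position = 0  # plays the role of a shard iterator

    def poll_once(self) -> int:
        """Deliver one batch, advance the position, return the batch length."""
        batch = self._shard.read_from(self._position, self._batch_size)
        if batch:
            self._processor.process_records(batch)
            self._position += len(batch)
        return len(batch)


class UppercaseCollector(RecordProcessor):
    """Example processor: decode, uppercase, keep the results in a list."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def process_records(self, records: List[bytes]) -> None:
        self.seen.extend(r.decode("utf-8").upper() for r in records)


if __name__ == "__main__":
    shard = InMemoryShard()
    for line in (b"login", b"click", b"logout"):
        shard.put_record(line)

    collector = UppercaseCollector()
    worker = Worker(shard, collector)
    while worker.poll_once():  # stop once a poll returns no records
        pass
    print(collector.seen)
```

The application code only implements `process_records`; tracking the read position and fetching batches is the worker's job, which is roughly the division of labour the client library promises.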
Real-time processing as enabled by Kinesis differs from batch processing (such as enabled by Hadoop) in that data can be processed as soon as it is available, rather than in batches. Amazon lists log processing, processing of social media data, real-time processing of financial transactions and online machine learning as some of the possible use cases for this. Another product that enables large amounts of real-time complex processing is Storm.
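The batch versus stream distinction can be shown with a tiny example that has nothing Kinesis-specific in it. Both function names and the sample latency values are invented for illustration.

```python
from typing import Iterable, Iterator, List


def batch_average(values: List[float]) -> float:
    """Batch style: needs the full data set before it can answer."""
    return sum(values) / len(values)


def streaming_average(values: Iterable[float]) -> Iterator[float]:
    """Stream style: yields an updated average after every record."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    latencies_ms = [120.0, 80.0, 100.0]
    # One answer, available only after all records are collected.
    print(batch_average(latencies_ms))
    # One answer per record, available as each record arrives.
    print(list(streaming_average(latencies_ms)))
```

The streaming version keeps only a running count and total, so it can report a result the moment a record shows up instead of waiting for the batch to close.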