Julien Nioche, director of DigitalPebble, PMC member and committer of the Apache Nutch web crawler project, talks about StormCrawler, a collection of reusable components to build distributed web crawlers based on the streaming framework Apache Storm. InfoQ interviewed Nioche, main contributor of the project, to find out more about StormCrawler and how it compares to other similar technologies.
Microsoft recently announced an addition to its Platform as a Service (PaaS) offering called Azure Functions. Initially launched as a preview service in March 2016, Azure Functions provides developers with an event-driven serverless compute platform that allows organizations to pay only for what they consume.
Javier Lopez and Mihail Vieru spoke at the Reactive Summit 2016 conference about a cloud-based data integration and distribution platform used for stream processing in business intelligence use cases. Their solution is based on technologies such as Flink, Kafka and Elasticsearch.
Lambda architecture has been a popular solution that combines batch and stream processing. Kartik Paramasivam at LinkedIn wrote about how his team addressed stream processing and Lambda architecture challenges using Apache Samza for data processing. The challenges described are the late arrival of events and the processing of duplicated messages.
The latest version of Confluent Enterprise supports multi-datacenter replication, automatic data balancing, and cloud migration. Confluent, provider of the Apache Kafka-based streaming platform, announced the new features last week; they are intended to help users build streaming data pipelines and develop stream processing applications.
In her presentation "Large-Scale Stream Processing with Apache Kafka" at QCon New York 2016, Neha Narkhede introduces Kafka Streams, a new feature of Kafka for processing streaming data. According to Narkhede, stream processing has become popular because unbounded datasets can be found in many places. It is no longer a niche problem like, for example, machine learning.
Event sourcing and CQRS are two patterns that have emerged in the Domain-Driven Design (DDD) community. Stream processing builds on similar ideas but has emerged in a different community, Martin Kleppmann noted in his presentation at the Domain-Driven Design Europe conference earlier this year, in which he compared event sourcing with stream processing.
On Thursday, April 21, Microsoft announced that the integration between Azure Stream Analytics and Power BI has reached General Availability (GA). Using this capability, customers can gain real-time insight into their business performance by analyzing in-flight data streams.
Version 1.0 is "a major milestone in the evolution of Apache Storm", writes Apache Software Foundation VP for Apache Storm P. Taylor Goetz, and it includes many new features and improvements. In particular, Goetz claims a 3x–16x boost in performance.
Embrace decentralization, build service-based systems and attack the problems that come with distributed state using stream processing tools, Ben Stopford urged in his presentation at the recent QCon London conference.
When a system has many databases, they are rarely independent of each other; instead, pieces of the same data are stored in many of them. Using transactions to keep everything in sync is a fragile solution. Working with a stream of changes in the order they are created is a much simpler and more resilient solution, Martin Kleppmann stated in his presentation at the recent QCon London conference.
Netflix has shed light on how the company uses the latest version of its Keystone Data Pipeline, a petabyte-scale real-time event stream processing system for business and product analytics. This news item summarizes the three major versions of the pipeline, which is now used by almost every application at Netflix.
Peter Morgan, head of engineering for the sports betting company William Hill, explains how to architect a scalable and dynamic system without caching. The values of the bets on sporting events change constantly. No data can be cached; all system values must be current. Distributed Erlang processes model domain objects that instantly recalculate system values based on data streams from Kafka.
A key problem with the whole Reactive space, and the reason it is so hard to understand, is the vocabulary: there are many terms, each with several different interpretations, Peter Ledbrook claims. That is also why he decided to work out what it is all about and share his knowledge in a presentation.
Yahoo! has benchmarked three of the main stream processing frameworks: Apache Flink, Spark and Storm.