InfoQ Homepage Apache Spark Content on InfoQ
-
Productionizing H2O Models with Apache Spark
Jakub Hava demonstrates the creation of pipelines integrating H2O machine learning models and their deployments using Scala or Python.
-
Accelerated Spark on Azure: Seamless and Scalable Hardware Offloads in the Cloud
Yuval Degani shows how hardware accelerations in Azure can be utilized to speed-up Spark jobs, with the aid of RDMA (Remote Direct Memory Access) support in the VM.
-
Streaming SQL Foundations: Why I ❤Streams+Tables
Tyler Akidau explores the relationship between the Beam Model and stream & table theory, stream processing in SQL with Apache Beam, Calcite, Flink, Kafka KSQL and Apache Spark’s Structured streaming.
-
Scaling with Apache Spark
Holden Karau looks at Apache Spark from a performance/scaling point of view and what’s needed to handle large datasets.
-
Real-Time Recommendations Using Spark Streaming
Elliot Chow discusses the data pipeline that they built with Kafka, Spark Streaming, and Cassandra to process Netflix user activities in real time for the Trending Now row.
-
Exploring Wikipedia with Apache Spark: A Live Coding Demo
Sameer Farooqui demos connecting to the live stream of Wikipedia edits, building a dashboard showing what’s happening with Wikipedia datasets and how people are using them in real time.
-
Apache Beam: The Case for Unifying Streaming APIs
Andrew Psaltis talks about Apache Beam, which aims to provide a unified stream processing model for defining and executing complex data processing, data ingestion and integration workflows.
-
The Mechanics of Testing Large Data Pipelines
Mathieu Bastian explores the mechanics of unit, integration, data and performance testing for large, complex data workflows, along with the tools for Hadoop, Pig and Spark.
-
Rethinking Streaming Analytics for Scale
Helena Edelson addresses new architectures emerging for large scale streaming analytics based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK) or Apache Flink or GearPump.
-
The Lego Model for Machine Learning Pipelines
Leah McGuire describes the machine learning platform Salesforce wrote on top of Spark to modularize data cleaning and feature engineering.
-
Lightning Fast Cluster Computing with Spark and Cassandra
Piotr Kołaczkowski discusses how they integrated Spark with Cassandra, how it was done, how it works in practice and why it is better than using a Hadoop intermediate layer.
-
Translating Imperative Code to MapReduce
The authors present an approach for automatic translation of sequential, imperative code into a parallel MapReduce framework using Mold, translating Java code to run on Apache Spark.