Apache Spark Content on InfoQ

Articles

Cloud

Two Misconfigurations That Caused Spark OOM Failures on Kubernetes

After migrating Spark pipelines to Azure Kubernetes Service, two infrastructure settings interacted destructively: spark.kubernetes.local.dirs.tmpfs=true backed shuffle spill with RAM instead of disk, and a hard podAffinity rule forced all executors onto one node. Together, they caused repeated OOM kills invisible to standard diagnostics.

Pranav Bhasker
on Jun 03, 2026
AI, ML & Data Engineering

From Batch to Micro-Batch Streaming: Lessons Learned the Hard Way in a Delta Index Pipeline

This article describes how a production delta-index pipeline migrated from scheduled batch to micro-batch Spark Structured Streaming. It covers why record-level streaming was rejected, how partition-based watermarks replaced fragile S3 completion markers, overlap-window correctness, and restart-as-design strategies for better predictability in object-store–based ingestion systems.

Parveen Saini
on May 04, 2026
AI, ML & Data Engineering

Autonomous Big Data Optimization: Multi-Agent Reinforcement Learning to Achieve Self-Tuning Apache Spark

This article introduces a reinforcement learning (RL) approach grounded in Apache Spark that enables distributed computing systems to learn optimal configurations autonomously, much like an apprentice engineer who learns by doing. The author also implements a lightweight agent as a driver-side component that uses RL to choose configuration settings before a job runs.

Hina Gandhi
on Jan 30, 2026
AI, ML & Data Engineering

Accelerating Deep Learning on the JVM with Apache Spark and NVIDIA GPUs

In this article, the authors discuss how to combine Deep Java Library (DJL), Apache Spark v3, and NVIDIA GPU computing to simplify deep learning pipelines while improving performance and reducing costs. They also compare the performance of this solution on GPU versus CPU hardware, using Amazon EMR and the NVIDIA RAPIDS Accelerator.

Haoxuan Wang, Qing Lan, Carol McDonald
on Jun 11, 2021
Cloud

Evolution of Azure Synapse: Apache Spark 3.0, GPU Acceleration, Delta Lake, Dataverse Support

At Microsoft Build 2021, Azure Synapse announced significant improvements to its Apache Spark pool, including its performance and its data querying and integration capabilities. This article outlines the improvements and provides context.

Lena Hall
on May 29, 2021
AI, ML & Data Engineering

Stream Processing Anomaly Detection Using Yurita Framework

In this article, author Guy Gerson discusses Yurita, the stream processing anomaly detection framework developed at PayPal. The framework is based on Spark Structured Streaming.

Guy Gerson
on Jul 10, 2019
AI, ML & Data Engineering

Real-Time Data Processing Using Redis Streams and Apache Spark Structured Streaming

Structured Streaming, introduced with Apache Spark 2.0, delivers a SQL-like interface for streaming data. Redis Streams enables Redis to consume, hold, and distribute streaming data between multiple producers and consumers. In this article, author Roshan Kumar walks us through how to process streaming data in real time using Redis Streams and Apache Spark Structured Streaming.

Roshan Kumar
on May 13, 2019
AI, ML & Data Engineering

Analytics Zoo: Unified Analytics + AI Platform for Distributed Tensorflow, and BigDL on Apache Spark

This article describes how Analytics Zoo can help real-world users build end-to-end deep learning pipelines for big data, including unified pipelines for distributed TensorFlow and Keras on Apache Spark, easy-to-use abstractions such as transfer learning and Spark ML pipeline support, and built-in deep learning models and reference use cases.

Jason Dai
on Dec 11, 2018
AI, ML & Data Engineering

Spark Application Performance Monitoring Using Uber JVM Profiler, InfluxDB and Grafana

In this article, author Amit Baghel discusses how to monitor the performance of Apache Spark-based applications using Uber JVM Profiler, the InfluxDB database, and the Grafana data visualization tool.

Amit Baghel
on Nov 18, 2018
AI, ML & Data Engineering

Apache Beam Interview with Frances Perry

InfoQ interviews Apache Beam's Frances Perry about the impetus for using Beam and the future of the top-level open source project. The interview also covers the thinking behind the programming model and some of the touch-points for integration with other data engineering tools such as Apache Spark and Flink.

Dylan Raithel
on Jun 20, 2017
AI, ML & Data Engineering

Big Data Processing Using Apache Spark - Part 6: Graph Data Analytics with Spark GraphX

In this article, author Srini Penchikala discusses the Apache Spark GraphX library, which is used for graph data processing and analytics. The article includes sample code for graph algorithms such as PageRank, Connected Components, and Triangle Counting.

Srini Penchikala
on Mar 14, 2017
AI, ML & Data Engineering

Traffic Data Monitoring Using IoT, Kafka and Spark Streaming

The Internet of Things (IoT) is an emerging disruptive technology and a topic of growing interest. One area of IoT application is connected vehicles. In this article we'll use Apache Spark and Kafka to analyse and process data from IoT-connected vehicles and send the processed data to a real-time traffic monitoring dashboard.

Amit Baghel
on Sep 28, 2016
