Apache Spark Content on InfoQ

News

DevOps

Pinterest's Moka: How Kubernetes Is Rewriting the Rules of Big Data Processing

Digital pinboard provider Pinterest has published an article explaining its blueprint for the future of large-scale data processing with its new platform Moka. The company is moving core workloads from ageing Hadoop infrastructure to a Kubernetes-based system on Amazon EKS, with Apache Spark as the main engine and support for other frameworks on the way.

Matt Saunders
on Jan 19, 2026
Architecture & Design

How Agoda Unified Multiple Data Pipelines into a Single Source of Truth

Agoda recently described how it consolidated multiple independent data pipelines into a centralized Apache Spark-based platform to eliminate inconsistencies in financial data. The company implemented a multi-layered quality framework that combines automated validations, machine-learning-based anomaly detection, and data contracts, while processing millions of daily booking transactions.

Eran Stiller
on Jan 14, 2026
AI, ML & Data Engineering

Decathlon Switches to Polars to Optimize Data Pipelines and Infrastructure Costs

Decathlon, one of the world's leading sports retailers, recently shared why it adopted the open source library Polars to optimize its data pipelines. The Decathlon Digital team found that migrating from Apache Spark to Polars for small input datasets delivers significant speed improvements and cost savings.

Renato Losio
on Dec 20, 2025
Architecture & Design

Lyft Rearchitects ML Platform with Hybrid AWS SageMaker-Kubernetes Approach

Lyft has rearchitected its machine learning platform LyftLearn into a hybrid system, moving offline workloads to AWS SageMaker while retaining Kubernetes for online model serving. Its decision to choose managed services where operational complexity was highest, while maintaining custom infrastructure where control mattered most, offers a pragmatic alternative to unified platform strategies.

Eran Stiller
on Dec 16, 2025
Architecture & Design

From Hadoop to Kubernetes: Pinterest’s Scalable Spark Architecture on AWS EKS

Pinterest revamped its data infrastructure by transitioning from a legacy Hadoop system to the Moka platform, leveraging Kubernetes and Spark on AWS EKS. This strategic shift enhances job isolation, simplifies deployment, and optimizes resource management, leading to reduced costs and improved efficiency.

Eran Stiller
on Jul 28, 2025
Architecture & Design

Databricks Contributes Spark Declarative Pipelines to Apache Spark

At the Databricks Data+AI Summit, held in San Francisco, USA, from June 10 to 12, Databricks announced that it is contributing the technology behind Delta Live Tables (DLT) to the Apache Spark project, where it will be called Spark Declarative Pipelines. This move will make it easier for Spark users to develop and maintain streaming pipelines, and furthers Databricks’ commitment to open source.

Patrick Farry
on Jul 03, 2025
Cloud

AWS Glue 5.0 Introduces Spark 3.5.2 and Enhanced ETL Performance

At the latest re:Invent conference in Las Vegas, Amazon announced the general availability of AWS Glue 5.0, designed to accelerate ETL jobs powered by Apache Spark. The latest release of the serverless data integration service introduces upgraded runtimes, including Spark 3.5.2, Python 3.11, and Java 17, along with enhancements in performance and security.

Renato Losio
on Jan 31, 2025
Architecture & Design

How Allegro Reduced the Cost of Running a GCP Dataflow Pipeline by 60%

Allegro reduced the cost of one of its Dataflow pipelines running on GCP Big Data by 60%. The company continues working on improving the cost-effectiveness of its data workflows by evaluating resource utilization, enhancing pipeline configurations, optimizing input and output datasets, and improving storage strategies.

Rafal Gancarz
on Nov 13, 2024
DevOps

Scaling Uber’s Batch Data Platform: a Journey to the Cloud with Data Mesh Principles

Some months ago, Uber started migrating its batch data analytics and machine learning platform to Google Cloud Platform (GCP). In a recent post on its engineering blog, Uber provided additional details about the migration, which incorporated key data mesh principles.

Claudio Masolo
on Oct 12, 2024
Architecture & Design

Netflix Uses Metaflow to Manage Hundreds of AI/ML Applications at Scale

Netflix recently described how its Machine Learning Platform (MLP) team provides an ecosystem around Metaflow, an open-source machine learning infrastructure framework. With the various integrations the team has built for Metaflow, Netflix already has hundreds of Metaflow projects maintained by multiple engineering teams.

Eran Stiller
on Mar 27, 2024
Architecture & Design

Distributed Materialized Views: How Airbnb’s Riverbed Processes 2.4 Billion Daily Events

Airbnb created Riverbed, a Lambda-like data framework for producing and managing distributed materialized views. The framework supports over 50 read-heavy use cases where data is sourced from multiple data sources within the company’s service-oriented architecture (SOA) platform. It uses Apache Kafka and Apache Spark for online and offline components, respectively.

Rafal Gancarz
on Oct 04, 2023
Architecture & Design

Managing 238 Million Memberships of Netflix: Surabhi Diwan at QCon San Francisco

On the first day of QCon San Francisco 2023, Surabhi Diwan, a senior software engineer at Netflix, presented on managing Netflix's 238 million memberships. The talk was part of the “Architectures You’ve Always Wondered About” track. Diwan works on the backend for membership engineering, which is critical for both signups and streaming at Netflix.

Steef-Jan Wiggers
on Oct 03, 2023
AI, ML & Data Engineering

Grammarly Replaces its in-House Data Lake with Databricks Platform Using Medallion Architecture

Grammarly adopted the medallion architecture while migrating from its in-house data lake, which stored Parquet files in AWS S3, to the Delta Lake lakehouse. The company created a new event store for over 6,000 event types from 40 internal and external clients and, in the process, improved data quality and reduced the data-delivery time by 94%.

Rafal Gancarz
on Jul 24, 2023
Cloud

AWS Introduces Athena Provisioned Capacity

AWS recently announced Provisioned Capacity, a new feature for Athena that allows users to run SQL queries on fully managed compute capacity for a fixed price with no long-term commitments.

Steef-Jan Wiggers
on May 04, 2023
DevOps

AWS Data on EKS Provides Opinionated Data Workload Blueprints

AWS has released Data on EKS (DoEKS), an open-source project providing templates, guidance, and best practices for deploying data workloads on Amazon Elastic Kubernetes Service (EKS). While the main focus is on running Apache Spark on Amazon EKS, blueprints also exist for other data workloads such as Ray, Apache Airflow, Argo Workflows, and Kubeflow.

Matt Campbell
on Apr 02, 2023
