Big Data Content on InfoQ

News

Architecture & Design

Cloudflare Details Unified Data Platform Where Billing Workloads Account for 53% of Queries

Cloudflare details Town Lake, an internal unified data platform, and Skipper, an AI analytics agent unifying access to operational, billing, security, and business data. The platform processed ~91K billing queries, with billing workloads accounting for 53% of all queries. Built on a lakehouse architecture using Trino, Iceberg, R2, and DataHub, it enables governed cross-system analytics and natural language access.

Leela Kumili
on Jul 03, 2026
Java

Hardwood Promises High-Speed JVM Apache Parquet Processing with Zero Mandatory Dependencies

Hardwood, the project Gunnar Morling kick-started to improve the handling of Parquet files in Java, has reached version 1. Its multi-threaded approach and zero mandatory external dependencies promise a simpler, faster alternative to the Apache Parquet Java implementation. For now, the library provides a reading API and a CLI for visualisation; writing support is expected in upcoming versions.

Olimpiu Pop
on Jul 03, 2026
DevOps

Pinecone Brings AI Agents Directly to Enterprise Data with Microsoft OneLake Integration

Pinecone has announced a new integration between its Nexus knowledge engine and Microsoft OneLake, aiming to fundamentally change how enterprise AI agents access and reason over corporate data.

Craig Risi
on Jun 12, 2026
DevOps

Discord Rebuilds Database Operations around Automation to Manage ScyllaDB at Massive Scale

Discord has detailed how it rebuilt its database operations around a new internal orchestration framework called the Scylla Control Plane (SCP), enabling its small infrastructure team to automate large-scale ScyllaDB cluster management tasks that previously took days of manual work.

Craig Risi
on May 22, 2026
Architecture & Design

Pinterest Reduces Spark OOM Failures by 96% through Auto Memory Retries

Pinterest Engineering cut Apache Spark out-of-memory failures by 96% using improved observability, configuration tuning, and automatic memory retries. Staged rollout, dashboards, and proactive memory adjustments stabilized data pipelines, reduced manual intervention, and lowered operational overhead across tens of thousands of daily jobs.

Leela Kumili
on Apr 06, 2026
Architecture & Design

Uber Launches IngestionNext: Streaming-First Data Lake Cuts Latency and Compute by 25%

Uber launches IngestionNext, a streaming-first data lake ingestion platform that reduces data latency from hours to minutes and cuts compute usage by 25%. Built on Kafka, Flink, and Apache Hudi, it supports thousands of datasets, enabling faster analytics, experimentation, and machine learning workloads globally.

Leela Kumili
on Mar 25, 2026
Architecture & Design

Hybrid Cloud Data at Uber: How Engineers Solved Extreme-Scale Replication Challenges

Uber’s HiveSync team optimized Hadoop DistCp to handle multi-petabyte replication across hybrid cloud and on-premise data lakes. Enhancements include task parallelization, Uber jobs for small transfers, and improved observability, enabling 5x replication capacity and seamless on-premise-to-cloud migration.

Leela Kumili
on Mar 02, 2026
DevOps

Datadog Integrates Google Agent Development Kit into LLM Observability Tools

Datadog recently announced that its LLM Observability platform now provides automatic instrumentation for applications built with Google's Agent Development Kit (ADK), offering deeper visibility into the behavior, performance, cost, and safety of AI-driven agentic systems.

Craig Risi
on Feb 06, 2026
DevOps

Etleap Launches Iceberg Pipeline Platform to Simplify Enterprise Adoption of Apache Iceberg

Etleap has recently launched the Iceberg pipeline platform, a new managed data pipeline layer designed to let enterprises adopt Apache Iceberg without building or maintaining a complex custom stack.

Craig Risi
on Feb 03, 2026
DevOps

Pinterest's Moka: How Kubernetes Is Rewriting the Rules of Big Data Processing

Digital pinboard provider Pinterest has published an article explaining its blueprint for the future of large-scale data processing with its new platform Moka. The company is moving core workloads from ageing Hadoop infrastructure to a Kubernetes-based system on Amazon EKS, with Apache Spark as the main engine and support for other frameworks on the way.

Matt Saunders
on Jan 19, 2026
Culture & Methods

How Data Contracts Support Collaboration between Data Teams

Data contracts define the interface between data providers and consumers, specifying things like data models, quality guarantees, and ownership. They are essential for distributed data ownership in data mesh, ensuring data is discoverable, interoperable, and governed. Data contracts improve communication between teams and enhance the reliability and quality of data products.

Ben Linders
on Feb 06, 2025
AI, ML & Data Engineering

QCon SF 2024 - Incremental Data Processing at Netflix

Jun He gave a talk at QCon SF 2024 titled Efficient Incremental Processing with Netflix Maestro and Apache Iceberg. He showed how Netflix used the system to reduce processing time and cost while improving data freshness.

Anthony Alford
on Nov 25, 2024
Culture & Methods

Setting up a Data Mesh Organization

A data mesh organization consists of producers, consumers, and the platform. According to Matthias Patzak, the mission of the platform team is to make the lives of producers and consumers simple, efficient, and stress-free. Data must be discoverable and understandable, trustworthy, and shared securely and easily across the organization.

Ben Linders
on Oct 10, 2024
Culture & Methods

Measuring and Reducing the Environmental Impact of Software

Software applications often manage large amounts of data, and most of them are internet-based and incorporate artificial intelligence. According to Coral Calero, these three aspects improve the capabilities and functionalities provided by software, but they have also increased the amount of energy needed. We need to measure the energy consumption of software to control its environmental impact.

Ben Linders
on Sep 26, 2024
DevOps

Uber’s Journey to Modernizing Big Data Infrastructure with Google Cloud Platform

In a recent post on its official engineering blog, Uber disclosed its strategy to migrate its batch data analytics and machine learning (ML) training stack to Google Cloud Platform (GCP). Uber runs one of the largest Hadoop installations in the world, managing over an exabyte of data across tens of thousands of servers in each of its two regions.

Claudio Masolo
on Jun 29, 2024
