Data Pipelines Content on InfoQ

News

Architecture & Design

Canva Opts for Amazon KDS over SNS+SQS to Save 85% with 25 Billion Events per Day

Canva evaluated several data messaging solutions for its Product Analytics Platform, including the combination of AWS SNS and SQS, MSK, and Amazon KDS, and eventually chose KDS, primarily because of its much lower cost. The company compared these solutions on aspects such as performance, maintenance effort, and cost.

Rafal Gancarz
on Aug 07, 2024
Cloud

Local Emulator for Azure Event Hubs in Preview: Offering Developers a Local Development Experience

Microsoft recently released a preview of the local emulator for Azure Event Hubs. The emulator is designed to give developers a local development experience for Azure Event Hubs, allowing them to develop and test code against the service in isolation.

Steef-Jan Wiggers
on Jun 04, 2024
Architecture & Design

Yelp Overhauls Its Streaming Architecture with Apache Beam and Apache Flink

Yelp reworked its data streaming architecture by employing Apache Beam and Apache Flink. The company replaced a fragmented set of data pipelines that streamed transactional data into its analytical systems, such as Amazon Redshift and its in-house data lake, with a unified and flexible solution built on Apache data streaming projects.

Rafal Gancarz
on Apr 22, 2024
Architecture & Design

Netflix Creates Incremental Processing Solution Using Maestro and Apache Iceberg

Netflix created a new solution for incremental processing in its data platform. The incremental approach reduces the cost of computing resources and execution time significantly as it avoids processing complete datasets. The company used its Maestro workflow engine and Apache Iceberg to improve data freshness and accuracy and plans to provide managed backfill capabilities.

Rafal Gancarz
on Jan 15, 2024
Architecture & Design

Goldsky’s Streaming-First Architecture for Blockchain Data with Flink, Redpanda and Kubernetes

Goldsky created a platform for the real-time processing of blockchain data. The platform allows clients to extract data from blockchains into their own databases to support product features, but without running the data pipeline infrastructure. The event-driven architecture (EDA) of Goldsky leverages Apache Flink, Redpanda, Kubernetes, and cloud provider services.

Rafal Gancarz
on Oct 30, 2023
AI, ML & Data Engineering

A Modern Compute Stack for Scaling Large AI, ML, & LLM Workloads at QCon SF

Jules Damji, a lead developer advocate at Anyscale Inc., discussed the difficulties data scientists encounter when managing infrastructure for machine learning models. He emphasized the necessity for a framework that supports the latest machine learning libraries, is easily manageable, and can scale to accommodate large datasets and models. Damji introduced Ray as a potential solution.

Andrew Hoblitzell
on Oct 06, 2023
Cloud

Confluent Announces Apache Flink on Confluent Cloud in Open Preview

Confluent recently announced the open preview of Apache Flink on Confluent Cloud as a fully-managed service for stream processing. The company claims that the managed service will make it easier for companies to filter, join, and enrich data streams with Flink.

Steef-Jan Wiggers
on Sep 29, 2023
DevOps

Running Apache Flink Applications on AWS KDA: Lessons Learnt at Deliveroo

Deliveroo introduced Apache Flink into its technology stack for enriching and merging events consumed from Apache Kafka or Kinesis Streams. The company opted to use the AWS Kinesis Data Analytics (KDA) service to manage Apache Flink clusters on AWS and shared its experiences from running Flink applications on KDA.

Rafal Gancarz
on Aug 16, 2023
Architecture & Design

Pfizer Uses Serverless Architecture on AWS to Scale Processing of Digital Biomarkers

Pfizer upgraded the serverless architecture for processing digital biomarker data at scale to make it more flexible and configurable. They created a framework that uses a file processing pipeline built with AWS Step Functions and other serverless services, as well as a custom Python package for data ingestion and processing.

Rafal Gancarz
on Jul 26, 2023
Architecture & Design

Yelp Rebuilds Corrupted Cassandra Cluster Using Its Data Streaming Architecture

Yelp created a solution to sanitize data from a corrupted Apache Cassandra cluster using its data streaming architecture. The team explored many potential options to address the data corruption issue but ultimately had to move the data into a new cluster, removing corrupted records in the process.

Rafal Gancarz
on Jul 17, 2023
Architecture & Design

Instacart Creates a Self-Serve Apache Flink Platform on Kubernetes

Instacart moved their Apache Flink workloads from AWS EMR to Kubernetes to meet the high demand for data processing use cases using Flink within the organization, as using EMR became problematic for many teams with different requirements. As a result, they made the platform easier to use and reduced their operational and infrastructure costs.

Rafal Gancarz
on Jul 12, 2023
AI, ML & Data Engineering

Strategies and Principles to Scale and Evolve MLOps - at QCon London

At the QCon London conference, Hien Luu, senior engineering manager for the Machine Learning Platform at DoorDash, discussed strategies and principles for scaling and evolving MLOps. With 85% of ML projects failing, understanding MLOps at an engineering level is crucial. Luu shared three core principles: "Dream Big, Start Small," "1% Better Every Day," and "Customer Obsession."

Roland Meertens
on Apr 02, 2023
Cloud

AWS Publishes Reference Architecture and Implementations for Deployment Pipelines

AWS recently released a reference architecture and a set of reference implementations for deployment pipelines. The recommended architectural patterns are based on best practices and lessons collected at Amazon and customer projects.

Renato Losio
on Feb 18, 2023
Cloud

AWS Glue Now Supports Crawler History

AWS recently launched support for histories of AWS Glue Crawlers, which allows the interrogation of Crawler executions and associated schema changes for the last 12 months.

Nsikan Essien
on Sep 19, 2022
AI, ML & Data Engineering

Shopify’s Practical Guidelines from Running Airflow for ML and Data Workflows at Scale

Shopify engineering shared, in a company blog post, its experience scaling and optimizing Apache Airflow for running ML and data workflows. The team described practical solutions to the challenges it faced, such as slow file access, insufficient control over DAGs, irregular traffic levels, and resource contention among workloads.

Reza Rahimi
on Jul 22, 2022
