Data Pipelines Content on InfoQ
Articles
Apache DolphinScheduler in MLOps: Create Machine Learning Workflows Quickly
In this article, the author discusses the data pipeline and workflow scheduler Apache DolphinScheduler and how machine learning tasks can be run in DolphinScheduler using its Jupyter and MLflow task components.
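As a rough illustration of the MLflow side, here is a minimal, self-contained tracking sketch of the kind of training script a scheduler's MLflow task would typically wrap. The experiment name and model are hypothetical, and this is generic MLflow usage, not the DolphinScheduler component API itself.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Hypothetical experiment name; a scheduler task would typically wrap
# a script like this and point it at a shared MLflow tracking server.
mlflow.set_experiment("dolphinscheduler-demo")

with mlflow.start_run():
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=100).fit(X, y)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # stored as a run artifact
```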
Building End-to-End Field Level Lineage for Modern Data Systems
In this article, the authors discuss data lineage as a critical component of the data pipeline root cause and impact analysis workflow, and how automating lineage creation and abstracting metadata down to the field level helps with root cause analysis efforts.
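To make the idea concrete, here is a minimal sketch (all names are hypothetical, and this is not the authors' system) of how field-level lineage can be modeled as a graph mapping each downstream field to the upstream fields it derives from, so that root cause analysis becomes an upstream traversal:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FieldRef:
    table: str
    column: str

@dataclass
class LineageGraph:
    # downstream field -> set of upstream fields it is derived from
    edges: dict[FieldRef, set[FieldRef]] = field(default_factory=dict)

    def add_edge(self, downstream: FieldRef, upstream: FieldRef) -> None:
        self.edges.setdefault(downstream, set()).add(upstream)

    def upstream_of(self, start: FieldRef) -> set[FieldRef]:
        """Transitively collect upstream fields for root cause analysis."""
        seen: set[FieldRef] = set()
        stack = [start]
        while stack:
            for up in self.edges.get(stack.pop(), ()):
                if up not in seen:
                    seen.add(up)
                    stack.append(up)
        return seen

g = LineageGraph()
g.add_edge(FieldRef("reports", "revenue"), FieldRef("orders", "amount"))
g.add_edge(FieldRef("orders", "amount"), FieldRef("raw_events", "price"))
print(g.upstream_of(FieldRef("reports", "revenue")))  # both upstream fields
```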
Implementing Pipeline Microservicilities with Tekton
“Microservicilities” is a list of cross-cutting concerns that a service must implement apart from the business logic. These concerns include invocation, elasticity, and resiliency, among others. This article describes how Tekton may be used to implement these concerns in delivery pipelines.
Accelerating Deep Learning on the JVM with Apache Spark and NVIDIA GPUs
In this article, the authors discuss how to use the combination of the Deep Java Library (DJL), Apache Spark v3, and NVIDIA GPU computing to simplify deep learning pipelines while improving performance and reducing costs. They also compare the performance of this solution on GPU versus CPU hardware, using Amazon EMR and the NVIDIA RAPIDS Accelerator.
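The article's stack is JVM-based; as a rough sketch of the Spark-plus-RAPIDS half (shown in PySpark for consistency with the other examples here), enabling the accelerator is largely a matter of session configuration. The jar deployment, resource amounts, and cluster setup are assumptions that depend on the environment, e.g., Amazon EMR.

```python
from pyspark.sql import SparkSession

# Assumes the RAPIDS Accelerator jar is on the Spark classpath and
# GPUs are available to executors; values here are illustrative.
spark = (
    SparkSession.builder
    .appName("gpu-etl-sketch")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.executor.resource.gpu.amount", "1")
    .getOrCreate()
)

# DataFrame/SQL operations below run on GPUs where the plugin supports them.
df = spark.range(0, 10_000_000).selectExpr("id", "id % 100 AS bucket")
df.groupBy("bucket").count().show(5)
```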
The Future of Data Engineering
Chris Riccomini examines the current and future states of the art in data pipelines, data streaming, and data warehousing. He presents a six-stage evolution that data ecosystems follow, from a simple monolith to a complex data-microwarehouse architecture, as the data engineers who manage them solve problems and clarify their role as infrastructure engineers rather than data stewards.
Scalable Cloud Environment for Distributed Data Pipelines with Apache Airflow
In this article, author Lena Hall discusses how to use Apache Airflow to define and execute distributed data pipelines, with an example of the workflow framework running on Kubernetes on the Azure cloud platform.
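For readers new to Airflow, a pipeline is defined as a DAG of tasks in ordinary Python. This minimal sketch (the DAG id, schedule, and task bodies are hypothetical) shows the shape such a definition takes, independent of where the workers run:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("clean and aggregate the extracted data")

# Hypothetical pipeline; on Kubernetes, each task can run in its own pod.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # named `schedule` in newer Airflow releases
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # transform runs after extract succeeds
```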
Rethinking Flink’s APIs for a Unified Data Processing Framework
Since its very early days, Apache Flink has followed the philosophy of taking a unified approach to batch and streaming. The core building block is the “continuous processing of unbounded data streams, with batch as a special, bounded set of those streams.” Recent updates to the Flink APIs include architectural designs by the community to support this batch and streaming unification.
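A small sketch of what that unification looks like in practice, using PyFlink's Table API (the inline data and query are hypothetical): the same program runs as a streaming or a batch job depending only on the environment settings.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Switch to EnvironmentSettings.in_batch_mode() and the same program
# runs as a bounded batch job instead of a streaming one.
settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(settings)

events = t_env.from_elements(
    [(1, "click"), (2, "view"), (3, "click")],
    ["id", "event"],
)
t_env.create_temporary_view("events", events)

# In streaming mode this prints a changelog of updates;
# in batch mode it prints only the final aggregates.
t_env.execute_sql(
    "SELECT event, COUNT(id) AS cnt FROM events GROUP BY event"
).print()
```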