Data Pipelines Content on InfoQ
-
Strategies and Principles to Scale and Evolve MLOps - at QCon London
At the QCon London conference, Hien Luu, senior engineering manager for the Machine Learning Platform at DoorDash, discussed strategies and principles for scaling and evolving MLOps. With 85% of ML projects failing, understanding MLOps at an engineering level is crucial. Luu shared three core principles: "Dream Big, Start Small," "1% Better Every Day," and "Customer Obsession."
-
AWS Publishes Reference Architecture and Implementations for Deployment Pipelines
AWS recently released a reference architecture and a set of reference implementations for deployment pipelines. The recommended architectural patterns are based on best practices and lessons collected at Amazon and customer projects.
-
AWS Glue Now Supports Crawler History
AWS recently launched support for histories of AWS Glue Crawlers, which allows the interrogation of Crawler executions and associated schema changes for the last 12 months.
-
Data Collection, Standardization and Usage at Scale in the Uber Rider App
Uber Engineering recently published how it collects, standardises and uses data from the Uber Rider app. Rider data comprises all of a rider's interactions with the Uber app and accounts for billions of events from Uber's online systems every day. Uber uses this data to address top problem areas such as increasing funnel conversion and user engagement.
-
QCon Plus November 2021 is Now Hybrid. Attend Online and In-Person (NY & SF)
The QCon Plus software development conference will be back November 1-5, 2021 - online and in-person. Get the chance to engage and network with professionals driving change and innovation inside the world’s most innovative software organizations.
-
Designing for Failure in the BBC's Analytics Platform
Last week at InfoQ Live, Blanca Garcia-Gil, principal systems engineer at the BBC, gave a session on Evolving Analytics in the Data Platform. During this session, Garcia-Gil focused on how her team prepared and designed for two types of failure: "known unknowns" and "unknown unknowns."
-
PayPal Standardizes on Apache Airflow and Apache Gobblin for Its Next-Gen Data Movement Platform
PayPal recently described how it standardized on Apache Airflow and Apache Gobblin to implement its next-gen data movement platform. In a recent blog post, PayPal engineers detail how the existing data movement platform had grown into a complex, unmanageable ecosystem of many tools and platforms, and how they shifted to a new implementation.
-
Data Mesh Principles and Logical Architecture Defined
The concept of a data mesh provides new ways to address common problems around managing data at scale. Zhamak Dehghani has provided additional clarity around the four principles of a data mesh, with a corresponding logical architecture and organizational structure.
-
Accelerating Machine Learning Lifecycle with a Feature Store
A feature store is a core part of next-generation ML platforms that helps data scientists accelerate the delivery of ML applications. Mike Del Balso and Geoff Sims recently spoke at the Spark AI Summit 2020 conference about feature-store-driven ML development.
-
Amazon Introduces the New Streaming ETL Feature on AWS Glue
Amazon recently announced that AWS Glue now supports streaming ETL. With this new feature, customers can set up continuous ingestion pipelines that prepare streaming data on the fly and make it available for analysis in seconds.
-
KSQL Now Available on Confluent Cloud
KSQL is the streaming SQL engine for Apache Kafka, and it is now available as a fully managed service on the Confluent Cloud platform for all customers on usage-based billing plans. Confluent announced the availability of Confluent Cloud KSQL in a recent blog post.
-
Michael Berthold on End-to-End Data Science Using KNIME Software
Michael Berthold, CEO and co-founder of the open source data analytics platform KNIME, gave the keynote presentation at the KNIME Fall Summit 2019 conference. He spoke about the end-to-end data science cycle, a lifecycle that falls mainly into two categories: create and productionize.
-
High-Performance Data Processing with Spring Cloud Data Flow and Geode
Cahlen Humphreys and Tiffany Chang spoke recently at the SpringOne Platform 2019 conference about data processing with the Spring Cloud Data Flow and Apache Geode frameworks.
-
Data Lakes and Modern Data Architecture in Clinical Research and Healthcare
Dr. Prakriteswar Santikary, chief data officer at ERT, spoke at the Data Architecture Summit 2018 conference last month about the data lake architecture his team developed at their clinical research organization. He discussed the data platform they deployed in the cloud to streamline data collection, aggregation, clinical reporting and analytics, using concepts like serverless computing and data services.
-
Confluent Cloud, Apache Kafka as a Service in AWS
Apache Kafka is a distributed, fault-tolerant publish-subscribe messaging solution, originally developed by LinkedIn and later open sourced. Confluent, formed by former LinkedIn engineers from the Kafka development group, today announced Confluent Cloud, a fully hosted and managed Apache Kafka as a Service in AWS. We also take a look at Confluent's second annual Streaming Data report and its findings.