Data Warehouse Content on InfoQ

News

Architecture & Design

350PB, Millions of Events, One System: inside Uber’s Cross-Region Data Lake and Disaster Recovery

Uber’s HiveSync is a sharded, cross-region batch replication system keeping Hive/HDFS data consistent across multiple regions. Handling 5M daily Hive events and 8PB of data replication, it uses event-driven jobs, hybrid RPC and DistCp strategies, DAG-based orchestration, and dynamic sharding, enabling disaster recovery, horizontal scaling, and 99.99% cross-region data accuracy.

Leela Kumili
on Jan 16, 2026
Culture & Methods

How Data Mesh Platforms Connect Data Producers and Consumers

A challenge that companies often face when exploiting their data in data warehouses or data lakes is that ownership of analytical data is weak or non-existent, and quality can suffer as a result. A data mesh is an organizational paradigm shift in how companies create value from data where responsibilities go back into the hands of producers and consumers.

Ben Linders
on Jun 27, 2024
AI, ML & Data Engineering

Grammarly Replaces its in-House Data Lake with Databricks Platform Using Medallion Architecture

Grammarly adopted the medallion architecture while migrating from their in-house data lake, storing Parquet files in AWS S3, to the Delta Lake lakehouse. The company created a new event store for over 6000 event types from 40 internal and external clients and, in the process, improved data quality and reduced the data-delivery time by 94%.

Rafal Gancarz
on Jul 24, 2023
AI, ML & Data Engineering

Databricks Unveils Lakehouse AI and MosaicML Acquisition at Data + AI Summit

The Data and AI company Databricks recently unveiled Lakehouse AI, a suite of tools for building and governing generative AI models, including large language models (LLMs), within the Databricks platform. Among the tools were LakehouseIQ, a "knowledge engine" that uses AI to understand a company's unique data, culture, and language in order to improve natural language interfaces like chatbots.

Andrew Hoblitzell
on Jul 18, 2023
Cloud

Amazon Redshift Serverless Generally Available to Automatically Scale Data Warehouse

Amazon recently announced the general availability of Redshift Serverless, an elastic option to scale data warehouse capacity. The new service allows data analysts, developers and data scientists to run and scale analytics without provisioning and managing data warehouse clusters.

Renato Losio
on Jul 23, 2022
Cloud

AWS Introduces Amazon Redshift Serverless

As part of a trend towards serverless analytics options, AWS announced the public preview of Amazon Redshift Serverless. The latest version of the managed data warehouse service targets deployments where it is difficult to manage capacity due to variable workloads or unpredictable spikes.

Renato Losio
on Dec 01, 2021
Cloud

AWS Announces the Public Preview of AWS Data Exchange for Amazon Redshift

Recently AWS announced the public preview of AWS Data Exchange for Amazon Redshift. This new feature enables customers to find and subscribe to third-party data in AWS Data Exchange to query in an Amazon Redshift data warehouse.

Steef-Jan Wiggers
on Oct 27, 2021
Architecture & Design

Data Collection, Standardization and Usage at Scale in the Uber Rider App

Uber Engineering recently published how it collects, standardises and uses data from the Uber Rider app. Rider data comprises all of a rider's interactions with the Uber app and accounts for billions of events from Uber's online systems every day. Uber uses this data to address key problem areas such as increasing funnel conversion and user engagement.

Eran Stiller
on Sep 22, 2021
Cloud

Amazon Redshift Data Sharing Now Generally Available

Amazon has recently announced the general availability of the Amazon Redshift Data Sharing functionality to share live data across Amazon Redshift clusters. This allows the use of a single data warehouse cluster for multi-cluster deployments and sharing data instantly without the need to copy or move it.

Renato Losio
on Mar 20, 2021
Architecture & Design

The Distributed Data Mesh as a Solution to Centralized Data Monoliths

Instead of building large, centralized data platforms, corporations and data architects should create distributed data meshes.

Thomas Betts
on Jan 31, 2020
Cloud

Microsoft Announces Azure Synapse for Data Warehousing and Analytics

During Microsoft's annual Ignite conference the company announced a new analytics service called Azure Synapse. The service, which is a continuation of Azure SQL Data Warehouse, focuses on bringing enterprise data warehousing and big data analytics into a single service.

Eldert Grootenboer
on Nov 20, 2019
AI, ML & Data Engineering

The Future of Data Engineering: Chris Riccomini at QCon San Francisco

At QCon San Francisco 2019, Chris Riccomini presented “The Future of Data Engineering”. The key takeaway of his talk is the end goal of data engineering: a fully automated, decentralized data warehouse.

Steef-Jan Wiggers
on Nov 18, 2019
AI, ML & Data Engineering

Databricks Open Sources Delta Lake to Make Data Lakes More Reliable

Databricks recently announced that it is open sourcing Delta Lake, its proprietary storage layer, to bring ACID transactions to Apache Spark and big data workloads. Databricks is the company founded by the creators of Apache Spark, and Delta Lake is already in use at several companies, such as McAfee and Upwork. Delta Lake is addressing the heterogeneous data problem that data lakes often have...

Alex Giamas
on May 20, 2019
AI, ML & Data Engineering

William McKnight on Data Platforms and Creating a Modern Data Architecture

William McKnight gave a keynote presentation last week at the Data Architecture Summit 2018 conference on creating a modern data architecture using different data platforms.

Srini Penchikala
on Oct 15, 2018
AI, ML & Data Engineering

Data Workflow Management Using Airbnb's Airflow

Airbnb recently open-sourced Airflow, its own data workflow management framework. Airflow is used internally at Airbnb to build, monitor and adjust data pipelines. Airflow’s creator, Maxime Beauchemin, and Siddharth Anand, Agari’s Data Architect and one of the framework’s early adopters, discuss Airflow, where it can be of use, and future plans.

Alex Giamas
on Sep 08, 2015
