Uber’s Journey to Modernizing Big Data Infrastructure with Google Cloud Platform

In a recent post on its official engineering blog, Uber, disclosed its strategy to migrate the batch data analytics and machine learning (ML) training stack to Google Cloud Platform (GCP). Uber, runs one of the largest Hadoop installations in the world, managing over an exabyte of data across tens of thousands of servers in each of its two regions. The open-source data ecosystem, particularly Hadoop, has been the cornerstone of the data platform.

The strategic migration plan consists of two steps: Initial migration and leveraging Cloud-Native Services. Uber's initial strategy involves leveraging GCP’s object store for data lake storage while migrating the rest of their data stack to GCP’s Infrastructure as a Service (IaaS). This approach allows for a swift migration with minimal disruption to the existing jobs and pipelines, as they can replicate the exact versions of their on-premises software stack, engines, and security model on IaaS. Following this phase, the Uber engineering team plans to gradually adopt GCP’s Platform as a Service (PaaS) offerings, such as Dataproc and BigQuery, to harness the elasticity and performance benefits of cloud-native services fully.

Scope of migration (source: Uber's blog)

Once the initial migration is complete, the team will focus on integrating cloud-native services to maximize the data infrastructure’s performance and scalability. This phased approach ensures that Uber users, from dashboard owners to ML practitioners, experience a seamless transition without altering their existing workflows or services.

To ensure a smooth and efficient migration, the Uber team have established several guiding principles:

Minimize use disruption by moving the majority of the batch data stack onto cloud IaaS as-is; they aim to shield their users from any changes to their artifacts or services. Using well-known abstractions and open standards, they strive to make the migration as transparent as possible.
They will rely on a cloud storage connector that implements the Hadoop FileSystem interface to Google Cloud Storage, ensuring HDFS compatibility. By standardizing their Apache Hadoop HDFS clients, we will abstract the specifics of the on-premise HDFS implementation, allowing seamless integration with GCP’s storage layer.
The Uber team has developed data access proxies for Presto, Spark, and Hive that abstract the underlying physical compute clusters. These proxies will support the selective routing of test traffic to cloud-based clusters during the testing phase and fully route queries and jobs to the cloud stack during the full migration.
Utilizing Uber’s cloud-agnostic infrastructure. Uber existing container environment, computing platform, and deployment tools are built to be agnostic between cloud and on-premises. These platforms will enable to easily expand their batch data ecosystem microservices onto the cloud IaaS.
The team will build and enhance existing data management services to support selected and approved cloud services, ensuring robust data governance. The company aims to maintain the same levels of authorized access and security as on-premises, while supporting seamless user authentication against the object store data lake and other cloud services.

Pre and post-migration Uber's batch data stack (source: Uber's blog)

The Uber team focuses on bucket mapping and cloud resource layout for migration. Mapping HDFS files and directories to cloud objects in one or more buckets is critical. They need to apply IAM policies at varying levels of granularity, considering constraints on buckets and objects such as read/write throughput and IOPS throttling. The team aims to develop a mapping algorithm that satisfies these constraints and organizes data resources in an organization-centric hierarchical manner, improving data administration and management.

Security Integration is another workstream; adapting our existing Kerberos-based tokens and Hadoop Delegation tokens for cloud PaaS, particularly Google Cloud Storage (GCS), is essential. This workstream aims to support seamless user, group, and service account authentication and authorization, maintaining consistent access levels as on-premises.

The team also focuses on data replication. HiveSync, the permissions-aware bidirectional data replication service, allows Uber to operate in active-active mode. It extends HiveSync’s capabilities to replicate the on-premise data lake’s data to the cloud-based data lake and corresponding Hive Metastore. This includes an initial bulk migration and ongoing incremental updates until the cloud-based stack becomes the primary.

The last workstream is providing new YARN and Presto clusters on GCP Iaas. Uber data access proxies will route query and job traffic to these cloud-based clusters during the migration, ensuring a smooth transition.

Uber's big data migration to Google Cloud anticipates challenges like performance differences in storage and unforeseen issues due to its legacy system. The team plans to address these by leveraging open-source tools, utilizing cloud elasticity for cost management, migrating non-core uses to dedicated storage, and proactively testing integrations and deprecating outdated practices.

About the Author

Claudio Masolo

Show moreShow less

InfoQ Software Architects' Newsletter

Write for InfoQ

About the Author

Claudio Masolo

Rate this Article

This content is in the DevOps topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter