Reddit Migrates Media Metadata from S3 and Other Systems into AWS Aurora Postgres

Reddit consolidated its media metadata storage into a new architecture using AWS Aurora Postgres. Previously, the company sourced media metadata from various systems, including directly from AWS S3. The new solution simplifies media metadata retrieval and handles 100k+ requests per second with latency below 5ms (p90).

The company hosts billions of posts containing different types of media content, such as images, videos, and embedded third-party media. Based on the trends it observes in user behavior, Reddit expects more media content to be uploaded in the coming years.

Jianyi Yi, senior software engineer at Reddit, explains what media metadata means for Reddit:

Media metadata provides additional context, organization, and searchability for the media content. There are two main types of media metadata on Reddit. The first type is media data on the post model. For example, when rendering a video post, we need the video thumbnails, playback URLs, bitrates, and various resolutions. The second type consists of metadata directly associated with the lifecycle of the media asset itself.

Because Reddit's platform evolved organically, media metadata ended up being stored in many systems, with inconsistent storage formats, varying query patterns for different media types, and a lack of auditing or content categorization. Worse yet, in some cases, S3 bucket objects had to be queried or even downloaded to fetch the corresponding metadata.

Example of Media Metadata (Source: Reddit Engineering Blog)

The company decided to create a unified system for managing media metadata and opted to use AWS Aurora Postgres for data storage over Apache Cassandra. Although both databases met Reddit’s requirements, Postgres was chosen because it offers better ad-hoc debugging and more flexible query patterns.

Anticipating future growth (50TB of media metadata by 2030), the engineers employed table partitioning to support scalability in a Postgres-based solution. They used the pg_partman and pg_cron extensions for partition management, with post ID as the partitioning key. Partitioning on the monotonically increasing post ID offers performance gains because it allows Postgres to cache the indexes of the most recent partitions, minimizing disk I/O. Additionally, batch queries for multiple posts from the same time period read all their data from a single partition, further improving query execution time.
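To make the partitioning argument concrete, here is a minimal Python sketch of how range partitioning on an increasing post ID groups lookups. It only models the routing arithmetic and does not use pg_partman or Postgres. The partition width `PARTITION_SIZE` and the function names are assumptions for illustration; the article does not state the actual range size.

```python
from collections import defaultdict

# Hypothetical partition width; the article does not give the real value.
PARTITION_SIZE = 1_000_000


def partition_index(post_id: int, partition_size: int = PARTITION_SIZE) -> int:
    """Return the index of the range partition that would hold post_id."""
    if post_id < 0:
        raise ValueError("post_id must be non-negative")
    return post_id // partition_size


def partition_bounds(index: int, partition_size: int = PARTITION_SIZE) -> tuple[int, int]:
    """Return the half-open [lower, upper) ID range covered by a partition."""
    lower = index * partition_size
    return lower, lower + partition_size


def group_by_partition(post_ids: list[int]) -> dict[int, list[int]]:
    """Group a batch of post IDs by the partition each one falls into."""
    groups: dict[int, list[int]] = defaultdict(list)
    for post_id in post_ids:
        groups[partition_index(post_id)].append(post_id)
    return dict(groups)


# Posts created around the same time have nearby IDs, so they share a partition.
recent_batch = [5_000_123, 5_000_456, 5_999_999]
# A batch mixing an old post with a recent one touches two partitions.
mixed_batch = [42, 5_000_123]

recent_groups = group_by_partition(recent_batch)
mixed_groups = group_by_partition(mixed_batch)
```

In this sketch the recent batch maps to a single partition while the mixed batch spans two, which is the property the article credits for faster batch reads.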

The team also decided to store all media metadata fields in a serialized JSONB format, effectively turning the table into a key-value store. This simplifies the query logic, avoids joins, and further improves read performance. With these scalability and performance optimizations in place, the metadata store delivers read latencies of 2.6ms (p50), 4.7ms (p90), and 17ms (p99) at 100k requests per second (RPS).
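The key-value access pattern can be sketched as follows. This example uses Python's built-in `sqlite3` with a JSON text column as a stand-in for Aurora Postgres and JSONB, so that it runs without a database server. The table name, column names, and metadata fields are hypothetical and are not taken from Reddit's schema.

```python
import json
import sqlite3

# In-memory SQLite stands in for Postgres; a TEXT column stands in for JSONB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media_metadata (post_id INTEGER PRIMARY KEY, metadata TEXT NOT NULL)"
)


def put_metadata(post_id: int, metadata: dict) -> None:
    """Store all metadata fields for a post as one serialized JSON value."""
    conn.execute(
        "INSERT OR REPLACE INTO media_metadata (post_id, metadata) VALUES (?, ?)",
        (post_id, json.dumps(metadata)),
    )


def get_metadata(post_ids: list[int]) -> dict[int, dict]:
    """Fetch metadata for a batch of posts with a single primary-key lookup, no joins."""
    if not post_ids:
        return {}
    placeholders = ",".join("?" for _ in post_ids)
    rows = conn.execute(
        f"SELECT post_id, metadata FROM media_metadata WHERE post_id IN ({placeholders})",
        list(post_ids),
    )
    return {post_id: json.loads(blob) for post_id, blob in rows}


# Hypothetical video metadata, loosely following the fields the article lists.
put_metadata(
    101,
    {
        "thumbnail_url": "https://example.com/thumb/101.jpg",
        "playback_url": "https://example.com/video/101.m3u8",
        "bitrates_kbps": [800, 2400],
        "resolutions": ["480p", "1080p"],
    },
)

fetched = get_metadata([101, 999])
```

Because every field lives in one value keyed by post ID, a reader needs only the primary key to retrieve everything required to render the post.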

Media Metadata Store Architecture, Including Migration (Source: Reddit Engineering Blog)

The biggest challenge of the project was the data migration, for which the engineers adopted a multi-stage approach. First, they enabled dual writes and backfilled the data from the old data sources. Then, they enabled dual reads and compared the outputs to detect and address any issues. Lastly, they gradually switched over to using the new data store exclusively while monitoring for performance and scalability issues. During the migration, the team used an Apache Kafka consumer to stream data change events from the source database and report data inconsistencies, which helped engineers analyze data issues.
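The migration stages described above can be sketched in a few lines of Python. This is an illustrative model only: plain dictionaries stand in for the old and new data stores, the `MigratingStore` class and its method names are hypothetical, and the Kafka-based change-event stream is not modeled.

```python
class MigratingStore:
    """Wraps an old and a new store during a staged migration (illustrative only)."""

    def __init__(self, old_store: dict, new_store: dict, read_from_new: bool = False):
        self.old = old_store
        self.new = new_store
        self.read_from_new = read_from_new  # flipped during the final switchover
        self.mismatches: list[int] = []     # post IDs whose two copies disagreed

    def write(self, post_id: int, metadata: dict) -> None:
        """Dual write: every new write goes to both stores."""
        self.old[post_id] = metadata
        self.new[post_id] = metadata

    def backfill(self) -> int:
        """Copy records that exist only in the old store; return how many were copied."""
        copied = 0
        for post_id, metadata in self.old.items():
            if post_id not in self.new:
                self.new[post_id] = metadata
                copied += 1
        return copied

    def read(self, post_id: int):
        """Dual read: fetch from both stores, record any mismatch, serve one copy."""
        old_value = self.old.get(post_id)
        new_value = self.new.get(post_id)
        if old_value != new_value:
            self.mismatches.append(post_id)
        return new_value if self.read_from_new else old_value


old = {1: {"type": "image"}}
store = MigratingStore(old_store=old, new_store={})

before_backfill = store.read(1)    # new store is still empty, so a mismatch is recorded
copied = store.backfill()          # brings the historical record across
store.write(2, {"type": "video"})  # dual write keeps both stores in step
store.read(1)                      # copies now agree, so no new mismatch
store.read_from_new = True         # final stage: serve reads from the new store
after_switch = store.read(2)
```

Recording mismatches while still serving reads from the old store lets discrepancies be investigated before any traffic depends on the new store.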

About the Author

Rafal Gancarz
