Pinterest Engineering Reduces Android CI Build Times by 36% with Runtime-Aware Sharding

Pinterest published a technical case study detailing how its engineering team cut Android end-to-end (E2E) continuous integration (CI) build times by more than 36 percent by adopting a runtime-aware test-sharding strategy and building an internal testing platform. Before the change, Pinterest's Android CI pipeline suffered from slow, unpredictable build times because its test suite was split by package names on a third-party platform, causing the slowest shard to gate the entire build. By creating a custom in-house platform and reorganizing how tests were distributed, Pinterest achieved more balanced execution, significantly reducing feedback latency for developers.

In their previous setup, Pinterest relied on Firebase Test Lab (FTL) and sharded tests based on package groupings, resulting in imbalanced workloads when some tests ran much longer than others. This not only prolonged total build times but also introduced flakiness and setup overhead that added minutes to each run. After evaluating third-party alternatives, the team concluded that none met its requirements for native emulator support, reliability, and fine-grained control without a custom orchestration layer. This analysis led to the development of PinTestLab, an in-house testing infrastructure running Android emulators on EC2 instances, giving engineers full control over scheduling, environment setup, and runtime orchestration.

The core innovation involved a runtime-aware sharding algorithm that uses historical test execution data stored in Pinterest’s Metro test management system to group tests by expected duration rather than count. Instead of simple round-robin or package-based splits, the algorithm sorts tests by expected runtime and assigns them to shards in a way that minimizes variation in wall-clock execution time across shards. Using this method, the team compressed the gap between the fastest and slowest shards from hundreds of seconds to just a few dozen seconds, leading to consistent, faster completion of builds. The shift reduced the slowest shard runtime by about 55 percent and cut total CI feedback time by roughly nine minutes per build.

A critical enabler of this improvement was Pinterest's ability to collect and use historical runtime and stability data for each test. Instead of evenly distributing tests by number, the runtime-aware approach assigns tests to the shard expected to finish first based on past performance, keeping emulator resources busy and tail latency low. Implementation involved lightweight sorting and greedy assignment using a min-heap data structure, a practical compromise between scheduling efficiency and computational simplicity.
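The greedy assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not Pinterest's actual code: the function name `shard_by_runtime`, the input shape (a mapping from test name to expected seconds), and the longest-first sort order are all assumptions made for the example.

```python
import heapq


def shard_by_runtime(expected_seconds, num_shards):
    """Assign tests to shards so that expected shard totals stay close.

    expected_seconds: dict mapping test name -> expected runtime in seconds
                      (hypothetical shape; the source only says historical
                      runtimes are available per test).
    num_shards:       number of parallel shards to fill.
    """
    # Min-heap of (accumulated_seconds, shard_index); the shard expected
    # to finish first is always at the top.
    heap = [(0.0, index) for index in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]

    # Assumption: place the longest tests first, breaking ties by name so
    # the plan is reproducible from run to run.
    ordered = sorted(expected_seconds.items(), key=lambda item: (-item[1], item[0]))

    for name, seconds in ordered:
        total, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (total + seconds, index))

    return shards
```

For example, five tests with expected runtimes of 100, 60, 50, 30, and 20 seconds split across two shards come out as one shard with the 100- and 30-second tests and another with the remaining three, for 130 expected seconds each. A split by test count alone could not guarantee that balance.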

Pinterest's platform also prioritized operational stability: the runtime-aware sharding logic runs within Buildkite, using historical data to generate per-shard plans, but falls back to round-robin distribution if Metro is unavailable, ensuring CI reliability even during infrastructure hiccups. Looking forward, the team is exploring on-demand sharding using message queues for greater elasticity and finer-grained test execution, but has found that the current approach delivers strong performance with minimal complexity.
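The fallback behavior can be illustrated with a short Python sketch. Everything named here is hypothetical: `fetch_runtimes` stands in for whatever call retrieves historical durations from Metro, and the 60-second default for tests with no history is an assumption, not a detail from the case study.

```python
import heapq


def round_robin(tests, num_shards):
    """Distribute tests by position only, ignoring runtime."""
    shards = [[] for _ in range(num_shards)]
    for position, name in enumerate(tests):
        shards[position % num_shards].append(name)
    return shards


def runtime_aware(tests, runtimes, num_shards, default_seconds=60.0):
    """Greedy longest-first assignment onto the least-loaded shard."""
    heap = [(0.0, index) for index in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]
    # Tests without history get an assumed default duration.
    ordered = sorted(tests, key=lambda t: (-runtimes.get(t, default_seconds), t))
    for name in ordered:
        total, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (total + runtimes.get(name, default_seconds), index))
    return shards


def plan_shards(tests, num_shards, fetch_runtimes):
    """Build a shard plan, degrading to round-robin if history is unavailable.

    fetch_runtimes: hypothetical zero-argument callable returning a dict of
                    test name -> expected seconds; it may raise on failure.
    """
    try:
        runtimes = fetch_runtimes()
    except Exception:
        # The data source is unreachable: keep CI running with a simple split.
        return round_robin(tests, num_shards)
    return runtime_aware(tests, runtimes, num_shards)
```

The design point is that the planner never blocks a build: a failed lookup costs some balance for that run, not the run itself.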

This work highlights two broader trends in CI optimization. First, leveraging historical empirical data for workload balancing can yield more predictable performance gains in parallel testing environments. Second, moving away from vendor-managed sharding toward in-house orchestration allows teams to tailor test execution strategies to their unique workload characteristics. Pinterest's efforts demonstrate a practical model for other organizations facing long CI wait times as test suites grow in size and complexity, a common challenge in modern software development.

Other companies and projects have taken similar approaches to improving CI performance through test sharding and build-time reduction, often using historical data, parallelization, or custom algorithms tailored to their workloads. Some examples with key details:

Dropbox engineers revamped their Android testing pipeline by selectively identifying and running only the affected tests for a given change, using an Affected Module Detector (inspired by AndroidX tooling) and test sharding via tools such as Flank/Fladle. They also standardized coverage reporting and moved on-device tests to Firebase Test Lab, reducing overall Android CI runtime from around 75 minutes to around 25 minutes. Their approach is similar in spirit to Pinterest's work in that it avoids running unneeded tests and applies strategic parallelism to improve CI efficiency.
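The general idea behind affected-test selection can be sketched as a walk over a module dependency graph. This is an illustration of the concept only, not the Affected Module Detector's implementation or API; the function name, the graph shape, and the module names in the usage note are hypothetical.

```python
from collections import deque


def affected_modules(changed, depends_on):
    """Return the changed modules plus everything that depends on them.

    changed:    iterable of module names touched by a change.
    depends_on: dict mapping module -> list of modules it depends on
                (hypothetical input; a real build would derive this
                from the build system's project graph).
    """
    # Invert the graph: for each module, who depends on it?
    dependents = {}
    for module, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(module)

    affected = set(changed)
    queue = deque(affected)
    # Breadth-first walk up the dependent edges.
    while queue:
        current = queue.popleft()
        for module in dependents.get(current, ()):
            if module not in affected:
                affected.add(module)
                queue.append(module)
    return affected
```

In a graph where `app` depends on `feature-a` and `feature-b`, and `feature-a` depends on `core`, a change to `core` marks `core`, `feature-a`, and `app` as affected, so the tests for `feature-b` can be skipped.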

Shopify's engineering team built a custom test splitter that uses historical timing data to balance test shards. By sorting tests by execution time, then assigning them in a way that minimizes shard variance, they cut the imbalance between slow and fast shards dramatically, reducing total CI build time from ~45 minutes to ~11 minutes, and tightening shard duration variance to within ~5 percent. This pattern closely mirrors Pinterest's runtime-aware sharding concept and demonstrates how historical timing metrics can yield large CI speedups when thoughtfully applied.

Square's internal CI tooling, Kochiku, supports automated sharding, intelligent scheduling, and distributed caching across hundreds of shards for Android builds. While not published in as much detail as Pinterest's case study, Square's approach underlines the importance of scaling CI sharding and scheduling logic to distributed environments where hundreds or thousands of parallel test workers can execute independently, reducing backlogs and improving resource use.

Platforms like Bitrise provide built-in test sharding and parallelization features that many app teams adopt to reduce CI times. Users of Bitrise report up to around 50 percent reduction in testing time with its sharding and parallel execution capabilities, though this tends to be at a higher level of abstraction compared to custom systems like Pinterest's.

Slack's engineering team optimized its end-to-end CI pipeline by conditionally skipping unnecessary frontend builds and reusing prebuilt assets, which led to ~50 percent faster builds and reduced flakiness. Although not strictly test sharding, this example illustrates how strategic reuse and selective execution based on build context can deliver major improvements in CI feedback loops.

These examples illustrate that while Pinterest’s runtime-aware sharding technique is notable for its simplicity and effectiveness, it fits within a broader set of best practices that many engineering organizations are adopting: using historical execution data to balance workloads, maximizing parallelism, and avoiding unnecessary test execution to drive faster, more reliable CI feedback.

About the Author

Craig Risi
