Pinterest Describes an Architecture for Efficient Retrieval of Hierarchical Documents

In a recent blog post, Pinterest engineers Haibin Xie and Michael Mi describe how their team implemented an efficient architecture to retrieve hierarchical documents. They achieved it by combining principles of index flattening, index normalization, and index denormalization.

Manas is Pinterest's home-grown search engine. Like other search engines, it was designed and optimized to support flattened documents. However, Pinterest required it to serve documents with a hierarchical structure. The figure below shows such a structure, where the Campaign entity contains multiple Ad groups, consisting of a collection of Promoted Pins.

Source: https://medium.com/pinterest-engineering/manas-two-stage-retrieval-the-efficient-architecture-for-hierarchical-documents-dcaba6e78b0d

To avoid flattening all the Campaign and Ad group properties to the Pin level, Pinterest engineers initially employed a two-stage query execution plan that required multiple round-trips between clients and clusters. However, this strategy complicated the client code "since clients need to deal with messy aggregation and query construction for multiple stages." As a result, they built the two-stage retrieval framework in Manas to further optimize this scenario.

The framework employs several steps to enable efficient querying. First, Manas flattens an index for a hierarchical document and builds an inverted index with skip lists. The figure below illustrates this process for a two-level hierarch with "Documents" ("field1" and "field2") and "Groups" ("field3" and "field4").

Source: https://medium.com/pinterest-engineering/manas-two-stage-retrieval-the-efficient-architecture-for-hierarchical-documents-dcaba6e78b0d

Then, Manas normalizes the index to reduce duplication of top top-level properties. The figure below illustrates this process, which is similar to normalization in relational databases.

Source: https://medium.com/pinterest-engineering/manas-two-stage-retrieval-the-efficient-architecture-for-hierarchical-documents-dcaba6e78b0d

While Manas normalizes the index to reduce waste in duplicating higher-level group properties, it denormalizes the index somewhat to support scenarios where a lower level document belongs to several groups. It stores redundant copies for those groups the document belongs to, paying additional fanout cost in indexing time for better serving time performance.

Source: https://medium.com/pinterest-engineering/manas-two-stage-retrieval-the-efficient-architecture-for-hierarchical-documents-dcaba6e78b0d

InfoQ spoke with Xie and Mi about Manas and the two-stage retrieval implementation.

InfoQ: What were the main reasons for developing Manas in-house instead of using an existing solution such as Elasticsearch or Apache Solr?

The two main reasons we decided to build an in-house search engine were efficiency and flexibility.

Efficiency is a critical metric that significantly impacts performance, serving cost, and maintenance cost. Previously, our search engine was built on top of Lucene and written in Java. Pinterest is a visual discovery engine, growing bigger each day with billions of Pins saved by hundreds of millions of people that power recommendations. As Pinterest grew, the legacy system faced challenges and could not support us at scale. The in-house search engine written in C++ has dramatically reduced latency and improved efficiency.

Also, Search is a foundational feature for most of Pinterest's products, and a system built from scratch gives us more flexibility. It also allows us to make tradeoffs optimized for our use cases and tailored for business needs, such as integrating with in-house ML platforms. The results have been positive - we've added several innovations to Manas that are not available in open-source projects yet, like real-time serving, embedding based retrieval, and two-stage retrieval.

With that said, we do leverage multiple open-source projects in the implementation, such as RocksDB for forward index serving and Solr for indexing pipeline.

InfoQ: What makes Manas unique compared to other search engines?

Manas was designed and tailored for Pinterest use cases. This design means we don't need to support the full set of features that Elasticsearch provides. Meanwhile, we can employ more advanced solutions to tackle Pinterest's unique challenges.

InfoQ: Do other current search engines implement two-stage queries?

We're not currently aware of other examples. Two-stage retrieval is just one of the solutions Pinterest took to address scalability and efficiency challenges. For Pinterest and our scale, we were hitting blockers for our business that needed to be addressed immediately.

InfoQ: What was your main challenge in the implementation?

When it comes to implementation, the most challenging part was modelizing the problem with a useful abstraction that could be generalized for other use cases. With an Infra mindset, we strive to make the framework generic enough to support more scenarios.

InfoQ: Is it planned to open-source Manas?

We don't have any plans to share yet, but we strive to open source where we can and contribute to the community that has been crucial to us.

There are various other open-source components implementing search engines. The most notable ones are Elasticsearch (whose license was recently updated) and Apache Solr, both based on the Apache Lucene project.

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Search Engine topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter