BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Spotify Open-Sources Voyager Nearest-Neighbor Search Library

Spotify Open-Sources Voyager Nearest-Neighbor Search Library

This item in japanese

Spotify Engineering recently open-sourced Voyager, an approximate nearest-neighbor (ANN) search library. Voyager is based on the hierarchical navigable small worlds (HNSW) algorithm and is 10 times faster than Spotify's previous ANN library, Annoy.

Spotify uses ANN to power their music recommendation features, such as Discover Weekly. They developed Annoy in 2013 to perform ANN search, but found that 10 years on it no longer performs well at their current scale. Partly this is because of the underlying algorithm: Annoy uses a tree-partitioning algorithm, while Voyager uses the newer HNSW method. Voyager uses 4 times less memory than Annoy and 16x less than hnswlib, a popular implementation of HNSW. According to Spotify staff machine learning engineer and Voyager contributor Peter Sobot

Voyager combines the increased accuracy and speed from HNSW with well-tested, well-documented, and production-ready bindings for both Java and Python. Voyager’s philosophy is to offer a rock-solid, stable, production-ready library that allows anybody to add nearest-neighbor index lookup to their application, in Python or Java.

Spotify's recommendation features start by computing embeddings for songs or tracks using machine learning algorithms. The more frequently two tracks appear together in playlists, the closer they will be mapped together in the embedding space. Users are also mapped into this space based on the tracks that they listen to. To recommend a song that a user may like, Spotify uses an ANN search to find tracks that are close to the user's location in the embedding space. Spotify can also use ANN for deduplicating tracks: by calculating an embedding of the audio data, ANN can identify tracks with essentially identical audio.

To quickly perform that search becomes difficult at Spotify's scale: their embedding vectors have thousands of dimensions, and their catalog contains millions of tracks. When it became clear that Annoy no longer provided adequate performance, Spotify engineers experimented with the existing open-source hnswlib, but "ran into snags." There were bugs in its core code that the Spotify engineers could not fix. They also wanted a simpler API. This led to the development of Voyager, which was designed with Spotify's production infrastructure needs in mind.

In particular, Voyager is designed to be fast and lightweight. Its API has bindings for both Java and Python, but the library itself has no Java dependencies and requires only NumPy for Python. The library runs on both Intel and ARM machines under Linux and MacOS as well as Windows on Intel machines.

Users of Hacker News discussed Voyager, comparing it to its predecessor Annoy:

The biggest immediate useful difference that I see is that Annoy uses read-only index files (from the docs: "you can not add more items once the tree has been created"), while Voyager allows you to call `.add_item` at any time.

Peter Sobot replied to a thread on X about Voyager, saying:

I think of Voyager as an embeddable, super lightweight, fast vector search library. We don’t have all of the bells and whistles of proper vector databases, but Voyager also has zero dependencies and weighs about 300kB.

The Voyager source code is available on GitHub. Voyager has been submitted to the ANN-Benchmarks page, but the benchmarks have not been run for it.

About the Author

Rate this Article

Adoption
Style

BT