Google has described an approach that applies transformer models, the architecture behind the current generative AI boom, to music recommendation. The approach, currently being tested experimentally on YouTube, aims to build a recommender that understands sequences of user actions taken while listening to music, in order to better predict user preferences based on their context.
A recommender leverages the information conveyed by different user actions, such as listening to, skipping, or liking a track, to suggest items the user is likely to be interested in.
A typical scenario where current music recommenders fall short, say Google researchers, is when a user's context changes, e.g., from listening at home to listening at the gym. This context change can shift their music preferences towards a different genre or rhythm, e.g., from relaxing to upbeat music. Taking such contextual changes into account makes the recommendation task much harder, since the system needs to interpret user actions in light of the user's current context.
This is where the transformer architecture may help, they believe, since it is especially suited to making sense of sequences of input data, as demonstrated in natural language processing and, more generally, in large language models (LLMs). Google researchers expect that transformers can make sense of sequences of user actions, interpreted in the user's context, just as they make sense of sequences of words.
Self-attention layers capture the relationships between the words of a sentence, which suggests they might be able to resolve the relationships between user actions as well. The attention layers in a transformer learn attention weights between the pieces of input (tokens), analogous to the relationships between words in an input sentence.
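For readers unfamiliar with the mechanism, the snippet below is a minimal, self-contained sketch of scaled dot-product self-attention over a short sequence of user-action tokens. The dimensions, random tensors, and projections are purely illustrative and do not reflect Google's model.

```python
import torch
import torch.nn.functional as F

# Toy sketch of self-attention over a sequence of action tokens
# (sizes and data are made up for illustration).
d_model = 8                      # embedding size
seq = torch.randn(5, d_model)    # 5 user-action tokens, e.g. listen/skip events

# Query, key, and value projections (randomly initialized here)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = seq @ W_q, seq @ W_k, seq @ W_v

# Attention weights: how strongly each action attends to every other action
scores = Q @ K.T / d_model ** 0.5
weights = F.softmax(scores, dim=-1)   # shape (5, 5), each row sums to 1

# Each output token is a weighted mixture of all action tokens
output = weights @ V
print(weights)
```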
Google researchers aim to adapt the transformer architecture from generative models to understanding sequential user actions in the current user context. This understanding is then blended with personalized ranking models to produce a recommendation. To illustrate how the same action can mean different things in different contexts, the researchers describe a user listening to music at the gym, where they might prefer more upbeat tracks. At home, the user would normally skip that kind of music, so those skips should get a lower attention weight when the user is at the gym. In other words, the recommender weighs actions differently depending on the user's current context rather than applying uniform weights across the global listening history.
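One way to picture how context could change the weights, purely as a hypothetical illustration and not a mechanism described by Google, is to condition the attention queries on a context embedding, so the same action history is weighted differently at home and at the gym:

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration: the same action history can receive different
# attention weights depending on a context embedding (e.g. "home" vs "gym").
# This is not Google's published mechanism, just one possible way context
# could modulate attention.
d_model = 8
history = torch.randn(5, d_model)     # past user-action tokens
ctx_home = torch.randn(d_model)       # made-up context embeddings
ctx_gym = torch.randn(d_model)

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)

def attention_weights(context):
    # Condition the queries on the current context so the same history
    # is weighted differently in different listening situations.
    q = (history + context) @ W_q
    k = history @ W_k
    return F.softmax(q @ k.T / d_model ** 0.5, dim=-1)

print(attention_weights(ctx_home))
print(attention_weights(ctx_gym))
```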
As the researchers explain: "We still utilize their previous music listening, while recommending upbeat music that is close to their usual music listening. In effect, we are learning which previous actions are relevant in the current task of ranking music, and which actions are irrelevant."
In short, Google's transformer-based recommender follows the typical structure of a recommendation system and comprises three phases: retrieving items from a corpus or library, ranking them based on user actions, and filtering them down to a reduced selection shown to the user. At the ranking stage, the system combines a transformer with an existing ranking model.
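The following sketch shows that three-phase flow in schematic form. The function names, candidate counts, and scoring are placeholders chosen for illustration, not details of Google's system.

```python
import torch

# Schematic of the retrieval -> ranking -> filtering pipeline described above.

def retrieve(corpus_embeddings, user_embedding, k=500):
    # Retrieval: pull a set of candidates from the full track corpus,
    # here simply by embedding similarity.
    scores = corpus_embeddings @ user_embedding
    return scores.topk(k).indices

def rank(candidate_ids, transformer_scores, ranking_model_scores):
    # Ranking: combine the transformer's context-aware signal with the
    # existing ranking model (a simple placeholder sum here).
    combined = transformer_scores + ranking_model_scores
    return candidate_ids[combined.argsort(descending=True)]

def filter_top(ranked_ids, n=20):
    # Filtering: keep only a small selection to show to the user.
    return ranked_ids[:n]

corpus = torch.randn(10_000, 32)   # toy track embeddings
user = torch.randn(32)             # toy user embedding
candidates = retrieve(corpus, user)
ranked = rank(candidates,
              torch.randn(len(candidates)),
              torch.randn(len(candidates)))
print(filter_top(ranked))
```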
Each track is associated with a vector called a track embedding, which is used both by the transformer and by the ranking model. Signals associated with user actions, as well as track metadata, are projected onto a vector of the same length, so they can be manipulated just like track embeddings. For example, when providing inputs to the transformer, the user-action embedding and the music-track embedding are simply added together to form a token. Finally, the output of the transformer is combined with that of the ranking model using a multi-layer neural network.
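A rough sketch of that token construction and output fusion might look as follows. The dimensions and the shape of the fusion network are assumptions, since the description only states that the embeddings are added and that the outputs are combined by a multi-layer network.

```python
import torch
import torch.nn as nn

# Minimal sketch of the token construction and output fusion described above.
# Sizes, layer widths, and the exact fusion network are assumptions.
d = 64

track_embedding = torch.randn(d)       # embedding of a music track
action_embedding = torch.randn(d)      # user-action signal (e.g. a skip),
                                       # projected to the same length

# Input token for the transformer: action and track embeddings added together
token = track_embedding + action_embedding

transformer_output = torch.randn(d)    # stand-in for the transformer's output
ranking_model_output = torch.randn(d)  # stand-in for the existing ranking model

# Combine both outputs with a small multi-layer network to score the track
fusion_mlp = nn.Sequential(
    nn.Linear(2 * d, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)
score = fusion_mlp(torch.cat([transformer_output, ranking_model_output]))
print(score)
```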
According to Google's researchers, initial experiments show an improvement in recommendation quality, measured as a reduction in skip rate and an increase in the time users spend listening to music.