Twitter Open-Sources Recommendation Algorithm

Twitter recently open-sourced several components of their system for recommending tweets for a user's Twitter timeline. The release includes the code for several of the services and jobs that run the algorithm, as well as code for training machine learning models for embedding and ranking tweets.

Details of the system were described in a Twitter Engineering blog post. The process consists of three main steps: candidate tweet sourcing, tweet ranking, and heuristics/filtering. The release contains code for component services of the system, such as a machine-learning model server and a streaming event processor, as well as code for extracting features and models from raw data about tweets, users, and engagements (e.g., likes and retweets). Twitter framed the release as part of its effort to improve transparency but noted that some parts of the algorithm were excluded from the release, specifically code for ad recommendations and code that could compromise user privacy. According to Twitter:

The goal of our open source endeavor is to provide full transparency to you, our users, about how our systems work. We’ve released the code powering our recommendations that you can view...to understand our algorithm in greater detail, and we are also working on several features to provide you greater transparency within our app.

The goal of the recommendation pipeline is to create a user's "For You" timeline page. The process starts with selecting a set of 1,500 candidate tweets from both in-network sources (i.e., people the user follows) and out-of-network sources. On average, the timeline will contain a roughly equal number of tweets from each source.

Twitter Recommendation System Diagram. Image Source: https://github.com/twitter/the-algorithm

In-network tweets are ranked with a logistic regression model, trained using Twitter's RealGraph algorithm, that tries to predict the "likelihood of engagement between two users." The higher the predicted likelihood of engagement with a followed user, the more of that user's tweets are included.
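The idea of scoring engagement likelihood with logistic regression can be illustrated with a minimal sketch. This is not Twitter's code: the feature names, weights, and bias below are hypothetical values chosen for illustration, and the article does not describe the actual RealGraph features.

```python
import math

# Hypothetical weights and bias; the real model's features are not
# described in the article.
WEIGHTS = {"likes_given": 0.8, "replies_given": 1.2, "profile_visits": 0.3}
BIAS = -2.0


def engagement_probability(features):
    """Logistic regression: sigmoid of a weighted sum of features."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_authors(features_by_author):
    """Return author ids sorted by predicted engagement, highest first."""
    scores = {
        author: engagement_probability(features)
        for author, features in features_by_author.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Interaction counts between one viewer and two followed accounts (made up).
EXAMPLE = {
    "alice": {"likes_given": 3, "replies_given": 1, "profile_visits": 2},
    "bob": {"likes_given": 0, "replies_given": 0, "profile_visits": 1},
}

if __name__ == "__main__":
    print(rank_authors(EXAMPLE))
```

In this toy setup, authors with a higher predicted probability would contribute more tweets to the candidate pool; how that probability maps to a tweet count is not specified in the article.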

Out-of-network tweets are chosen from two sources. First, Twitter follows a social graph to find tweets engaged with by the people a user follows, which are also ranked by a logistic regression model. Other out-of-network tweets are chosen by using an embedding space called SimClusters, which uses a matrix factorization algorithm to identify 145,000 virtual communities of users. Tweets are associated with communities based on how many users in a community liked the tweet.
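The association between tweets and communities can be sketched as a simple count of likes per community. The sketch below assumes community memberships have already been computed (the matrix factorization step is not shown), and all user ids, community names, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical, precomputed user -> community memberships.
USER_COMMUNITIES = {
    "u1": ["ml"],
    "u2": ["ml", "news"],
    "u3": ["sports"],
    "u4": ["ml"],
}

# Hypothetical tweet -> users who liked it.
LIKES = {
    "t1": ["u1", "u2"],
    "t2": ["u3"],
    "t3": ["u2"],
}


def tweet_community_counts(likers):
    """Count, per community, how many of a tweet's likers belong to it."""
    counts = Counter()
    for user in likers:
        for community in USER_COMMUNITIES.get(user, []):
            counts[community] += 1
    return counts


def tweets_for_user(user, likes_by_tweet, top_n=2):
    """Rank tweets by likes coming from the user's own communities."""
    mine = set(USER_COMMUNITIES.get(user, []))
    scores = {}
    for tweet_id, likers in likes_by_tweet.items():
        counts = tweet_community_counts(likers)
        score = sum(counts[c] for c in mine)
        if score > 0:
            scores[tweet_id] = score
    # Highest score first; ties broken by tweet id for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


if __name__ == "__main__":
    print(tweets_for_user("u4", LIKES))
```

Here the per-community counts act as a crude tweet embedding; a production system would presumably use weighted, normalized vectors rather than raw counts.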

After the candidate tweets are chosen, they are ranked using a 48M-parameter neural network model based on MaskNet. Finally, Twitter applies heuristics and filters to "create a balanced and diverse feed." These include balancing the in-network and out-of-network tweets in the results, threading reply tweets with the tweets they reply to, and removing NSFW content.
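The final stage can be pictured as a small post-processing step over scored candidates. The following is a rough sketch under simplifying assumptions: the score field stands in for the neural ranker's output, balancing is approximated by alternating between the two sources, and reply threading is omitted. The Candidate type and build_timeline function are hypothetical names, not taken from the released code.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass
class Candidate:
    tweet_id: str
    score: float       # stand-in for the ranking model's output
    in_network: bool   # True if the author is followed by the viewer
    nsfw: bool = False


def build_timeline(candidates, limit=10):
    """Filter NSFW tweets, rank by score, and alternate between sources."""
    safe = [c for c in candidates if not c.nsfw]
    ranked = sorted(safe, key=lambda c: c.score, reverse=True)
    inside = [c for c in ranked if c.in_network]
    outside = [c for c in ranked if not c.in_network]

    mixed = []
    for a, b in zip_longest(inside, outside):
        if a is not None:
            mixed.append(a)
        if b is not None:
            mixed.append(b)
    return [c.tweet_id for c in mixed[:limit]]


EXAMPLE = [
    Candidate("a", 0.90, in_network=True),
    Candidate("b", 0.80, in_network=True),
    Candidate("c", 0.70, in_network=False),
    Candidate("d", 0.95, in_network=False, nsfw=True),
    Candidate("e", 0.40, in_network=False),
]

if __name__ == "__main__":
    print(build_timeline(EXAMPLE))
```

Strict alternation is only one way to approximate a balanced feed; the article says only that the results are balanced between in-network and out-of-network tweets, not how.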

Twitter's release of their algorithm has sparked a vigorous online discussion. In a Hacker News thread, several users pointed to source code that extracts features from tweets, which included flags for author_is_elon, author_is_power_user, author_is_democrat, and author_is_republican, along with comments in the code which claim these are used for A/B testing algorithm changes.

On Twitter, machine learning engineer Vicki Boykis said of the release:

Going to be going through the Twitter codebase for a long time, this is truly a gift for recsys [recommendation systems] nerds. I've only browsed it so far but seems very standard recsys-y components, rankers, filters, generators, Kafka log collection, and timeline construction.

In a Reddit discussion about the release, one user wrote:

Putting aside the political undertones behind many peoples' desire to publish "the algorithm", this is a phenomenal piece of educational content for ML professionals. Here we have a world-class complex recommendation & ranking system laid bare for all to read into, and develop upon. This is a veritable gold mine of an educational resource.

The Twitter recommendation system code is available on GitHub, as is the code for training two of the ML models.

About the Author

Anthony Alford
