
LinkedIn and Twitter Contribute Machine Learning Libraries to Open Source

| by Alex Giamas on Oct 24, 2014. Estimated reading time: 1 minute |

Twitter’s engineering group, known for open source contributions ranging from streaming MapReduce to the front-end framework Bootstrap, recently announced that it is open sourcing an algorithm that can efficiently recommend content. This is an important problem for Twitter, as it helps promote the right ads to the right users and recommend which users to follow. The algorithm, named DIMSUM, pre-processes similarity data and feeds the actual recommendation algorithm with only the subset of users calculated to be above a similarity threshold.

As former Twitter engineer Reza Zadeh explains, DIMSUM samples the problem space to weed out the pairs of items that are not similar enough to matter. The algorithm may not matter much for small datasets, but its strength comes into play with big datasets, where one can’t brute-force the problem by calculating every possible similarity pair. The algorithm has been integrated into Scalding and Spark.
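
To make the sampling idea concrete, here is a minimal single-machine sketch in Python. It is not Twitter's code and not the Scalding or Spark API; the function name `dimsum_similarities`, the oversampling parameter `gamma`, and the `rng` argument are assumptions made for illustration. It follows the commonly described scheme: a co-occurring pair of columns is kept with a probability that shrinks as the two column norms grow, and the kept products are rescaled so the result estimates cosine similarity.

```python
import math
import random
from collections import defaultdict


def dimsum_similarities(rows, gamma, rng=None):
    """Estimate cosine similarity between columns of a sparse matrix.

    rows  : list of dicts mapping column index -> nonzero value.
    gamma : oversampling parameter (hypothetical name). Larger values keep
            more pairs and give a more accurate estimate.
    rng   : optional random.Random instance, for reproducible sampling.
    """
    rng = rng or random.Random(0)

    # Pass 1: Euclidean norm of every column.
    squares = defaultdict(float)
    for row in rows:
        for j, value in row.items():
            squares[j] += value * value
    norm = {j: math.sqrt(s) for j, s in squares.items()}

    # Pass 2 ("map"): keep each co-occurring pair with probability
    # min(1, gamma / (|c_j| * |c_k|)), so heavy columns are sampled less.
    sums = defaultdict(float)
    for row in rows:
        cols = sorted(row)
        for a, j in enumerate(cols):
            for k in cols[a + 1:]:
                keep_probability = min(1.0, gamma / (norm[j] * norm[k]))
                if rng.random() < keep_probability:
                    sums[(j, k)] += row[j] * row[k]

    # "Reduce": rescale so the expected value equals the cosine similarity.
    return {
        (j, k): total / min(gamma, norm[j] * norm[k])
        for (j, k), total in sums.items()
    }
```

With a very large `gamma` every pair is kept and the output is the exact cosine similarity; lowering `gamma` trades accuracy for fewer emitted pairs. A downstream recommender could then keep only the entries above its own similarity threshold.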

LinkedIn also open sourced a machine learning library of its own, ml-ease, which focuses on model fitting and training. Currently supporting ADMM (Alternating Direction Method of Multipliers), ml-ease can apply logistic regression in a highly parallelized fashion and converge to a solution that is theoretically close to what a single-machine execution of the algorithm would have produced.
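
The general shape of ADMM-based logistic regression can be sketched in a few lines of Python. This is a toy illustration of consensus ADMM under stated assumptions, not the ml-ease implementation or its API: the names `local_update` and `admm_logistic`, the parameters `rho` and `lam`, and the use of plain gradient descent for the local step are all choices made for this example. Each data partition fits its own weights while being pulled toward a shared consensus vector `z`, and the consensus is then refreshed from the partition results.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def local_update(X, y, z, u, rho, steps=50, lr=0.1):
    """Approximately minimise logloss(w) + (rho/2) * ||w - z + u||^2
    on one partition, using a fixed number of gradient descent steps."""
    w = list(z)
    dim = len(w)
    for _ in range(steps):
        grad = [rho * (w[j] - z[j] + u[j]) for j in range(dim)]
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(dim):
                grad[j] += err * xi[j]
        w = [w[j] - lr * grad[j] for j in range(dim)]
    return w


def admm_logistic(partitions, dim, rho=1.0, lam=0.1, iters=30):
    """Consensus ADMM for L2-regularised logistic regression.

    partitions : list of (X, y) pairs, one per simulated worker.
    rho, lam   : ADMM penalty and L2 strength (illustrative defaults).
    """
    n = len(partitions)
    z = [0.0] * dim
    us = [[0.0] * dim for _ in range(n)]
    for _ in range(iters):
        # These local fits are independent and could run in parallel.
        ws = [local_update(X, y, z, u, rho)
              for (X, y), u in zip(partitions, us)]
        # Consensus step: closed-form minimiser of
        # (lam/2)*||z||^2 + sum_i (rho/2)*||w_i - z + u_i||^2.
        z = [rho * sum(w[j] + u[j] for w, u in zip(ws, us)) / (lam + n * rho)
             for j in range(dim)]
        # Dual step: accumulate each partition's disagreement with z.
        us = [[u[j] + w[j] - z[j] for j in range(dim)]
              for w, u in zip(ws, us)]
    return z
```

Only the local fits touch the data, which is why the approach parallelizes: the consensus and dual steps exchange just one weight vector per partition per iteration.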

Logistic regression is one of the most popular machine learning algorithms, and it is not an easy one to parallelize. Mahout’s implementation of logistic regression using Stochastic Gradient Descent is one example of an inherently sequential algorithm applied to a parallel problem. An evaluation of parallel logistic regression models has shown that, given enough computing resources, the problem can be parallelized for massive datasets by trading some precision for speed. LinkedIn’s implementation focuses on scalability, speed and ease of use, accepting a small margin of error in exchange for speed. This could be a good proposition for many commercial problems. The code is available on GitHub.
