LinkedIn Signal: A Case Study for Scala, JRuby and Voldemort
On September 29th LinkedIn Signal was announced, providing a social search application for both LinkedIn shares and tweets from Twitter accounts bound to LinkedIn profiles. This article aims to provide more insight into the motivation and technical challenges of combining Scala, JRuby and Voldemort at this scale.
John Wang, LinkedIn Search Architect, has published an architectural overview of the whole system:
The backend is a RESTful service written in Scala with Scalatra, a Sinatra-inspired web framework. The REST/JSON RPC model was chosen to allow quick, ad hoc data manipulation and fast iteration.
A JRuby frontend was likewise chosen for fast iteration.
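As an illustration of the REST/JSON service shape described above (a hypothetical stand-in built on the JDK's own HttpServer rather than Scalatra, with a made-up /search path and canned payload), a minimal JSON endpoint might look like:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch only: a tiny REST/JSON endpoint in the spirit of the
// Scalatra backend described in the article, using the JDK's built-in HttpServer.
public class SignalApi {
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/search", exchange -> {
            // A real service would query the search backend here;
            // this stand-in returns a canned JSON payload.
            byte[] body = "{\"results\":[],\"total\":0}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

The appeal of this model is that each resource is a plain HTTP handler returning JSON, so endpoints can be added or reshaped quickly during iteration.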
Data used for facet decoration is fairly static and small, so we just put a BDB instance behind a service abstraction.
Saved searches and follows were features we thought of later in development. We wanted to have something running right away that was certainly scalable and elastic. Voldemort became an obvious choice, as our query access pattern fits a key-value store perfectly. Furthermore, Voldemort's data-rebalancing and elasticity features were well matched to the expected data growth.
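To make that access pattern concrete, here is an illustrative sketch with a ConcurrentHashMap standing in for a real Voldemort store client (Voldemort's client exposes essentially the same get/put contract); saved searches are stored and fetched under a single member-id key:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: the saved-search access pattern from the article,
// with an in-memory map standing in for a real Voldemort store.
public class SavedSearchStore {
    private final Map<String, List<String>> store = new ConcurrentHashMap<>();

    // Append a saved search query under the member's key.
    public void save(String memberId, String query) {
        store.computeIfAbsent(memberId, k -> new ArrayList<>()).add(query);
    }

    // Fetch all saved searches for a member in a single key lookup,
    // the access pattern that makes a key-value store a natural fit.
    public List<String> lookup(String memberId) {
        return store.getOrDefault(memberId, Collections.emptyList());
    }
}
```

Because every read and write is a single-key operation, the data can be partitioned and rebalanced across nodes without any cross-node queries, which is what makes the elastic-growth story straightforward.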
The data stream is an aggregate of LinkedIn shares, tweets from bound Twitter accounts, LinkedIn profiles and member-derived information. This was built on our distributed messaging queue.
The implementation of the actual search system seems to have been the most challenging part of the new service. There, the team had to leverage several products built by the Search, Network, and Analytics team at LinkedIn, such as:
- Apache Lucene, for text searching,
- Zoie, for realtime indexing,
- Bobo, for faceted searching and
- Sensei, a distributed realtime searchable database with dynamic clustering
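As a toy illustration of what the faceting piece does (this is not Zoie/Bobo code, just the underlying idea): each hit from a text query contributes a count to its facet value, for example a hypothetical "source" facet distinguishing shares from tweets:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of faceted search (not Bobo code): run a text match,
// then tally how many hits fall under each facet value.
public class FacetDemo {
    public static class Doc {
        final String text;
        final String sourceFacet;
        public Doc(String text, String sourceFacet) {
            this.text = text;
            this.sourceFacet = sourceFacet;
        }
    }

    public static Map<String, Integer> facetCounts(List<Doc> docs, String term) {
        Map<String, Integer> counts = new TreeMap<>();
        String t = term.toLowerCase();
        for (Doc d : docs) {
            if (d.text.toLowerCase().contains(t)) {
                counts.merge(d.sourceFacet, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

In the real stack, Lucene performs the matching over an inverted index, Zoie keeps that index updated in real time, Bobo computes the facet counts efficiently, and Sensei distributes the whole thing across a cluster.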
Besides the details of how these products were configured, integrated and extended with custom software, John describes how the article-ranking algorithm works:
After a few iterations, while the system seemed to be holding up pretty well under load, we noticed that we were affected by people spamming the Twitter stream. A simple example is individuals doing some self-promotion by 'sharing' a given URL upwards of thousands of times. This affected us because we were using a scoring scheme that put too much emphasis on share counts. Simply put, a popularity measure based purely on sharing was not good enough for ranking.
An intuitive fix to share spamming is to attempt to determine the number of unique individuals behind the overall sharing pool for a given article. In a sense, the algorithm should promote articles based also on the number of unique individuals that share them, and demote articles that are shared a large number of times by a single individual. Now results are fantastic!
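As an illustration only (not LinkedIn's actual formula), the fix described in the quote can be approximated by scoring on distinct sharers and giving repeat shares from the same account only heavily dampened credit:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only, not LinkedIn's ranking code: score an article by its
// distinct sharers, so one account sharing a URL thousands of times adds
// almost nothing beyond its first share.
public class ShareScore {
    // shares: one member id per share event for a given article
    public static double score(List<String> shares) {
        Set<String> unique = new HashSet<>(shares);
        int repeats = shares.size() - unique.size();
        // full credit per unique sharer, log-dampened credit for repeats
        return unique.size() + 0.1 * Math.log1p(repeats);
    }
}
```

Under this kind of scheme, an article shared once each by five people outranks one shared a thousand times by a single spammer, which matches the behavior the quote describes.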
InfoQ contacted the people responsible for the LinkedIn Signal deployment to get more information on their project:
InfoQ: How are you using JRuby on the front-end? Through some specific framework?
Alejandro Crosa (JRuby/Scala expert @ LinkedIn): Currently the front-end layer is very thin, and gives us things like session state, templating and higher-level abstractions over the backend services and REST APIs. The choice of JRuby comes primarily because it gave us three important things: flexible, expressive code with fast develop-and-test cycles, easy Java integration, and the ability to leverage libraries from both worlds. The web framework we are using is Sinatra, since all we wanted was a very simple, minimalistic REST interface for the app.
InfoQ: How much of your front-end is Scalatra and how much JRuby? Why do you find their combination more valuable than only using one of them?
Alejandro Crosa: JRuby is only used for the public API and front-end features; Scalatra, on the other hand, is used to expose the backend APIs in a clean RESTful way. That's the pattern we repeated throughout the app: build services using the best tools for the particular job, and wrap them in a clean RESTful API.
Also, since the backend was all about performance, we didn't want to pay any penalty with interpreted code; the iteration frequency there is not as high as on the front-end, and Scala's type system is a bliss to use when building backend services. In general, instead of using the same tools regardless of the problem domain (an industry stigma), we chose what we thought was best for each particular requirement, and then put some REST makeup on it.
InfoQ: Having accumulated a significant experience with Scala at LinkedIn, what would be your advice for teams that are considering adopting it? What are the common pitfalls and how can they blend it in their existing Java infrastructure?
Alejandro Crosa: Coming from a Ruby world, I see lots of similar features in the language and some that are really attractive. Scala has recently been criticized as too complex, but in reality, at its core the language is pretty simple; you can choose which features you want to use and which you don't. You can write very simple, minimalistic code, or extremely hard-to-read stuff; it's the developer's decision. It really scales at the developer level.
As for common pitfalls, I think embracing it wholesale is a mistake, especially if there's a lot of legacy Java code. Instead, find compelling reasons to migrate, and apply it to a specific domain problem where Scala excels; if you use Scala just as you would use Java, you are getting a more succinct syntax and nothing else. Embrace immutability, functional data structures, pattern matching, Actors, etc., and you'll never go back.
Chris Conrad (Scala expert @ LinkedIn, Creator of Norbert): For the most part I totally agree with Alejandro. Scala is a wonderful programming language to work in and provides some very powerful constructs. But if you are simply going to write Java with Scala syntax then you are not getting your money's worth.
Norbert was a great project for me to implement in Scala. Since it is a fairly small, self-contained project, I was able to spend some time experimenting with Scala's different features to figure out the best way to express the API I was looking for. And because Scala integrates so tightly and cleanly with Java, I was able to provide a clean, Java-specific API that allows Java developers to use Norbert.
My advice for others looking to start using Scala would be to find a small, self-contained piece of functionality to experiment with in their project. Learning how to effectively utilize Scala can take some time, and you don't want to destabilize your project by trying to implement key pieces in Scala until you have gotten over that learning curve. By leveraging the Scala/Java integration, you can implement that piece in Scala without the rest of the organization needing to learn Scala until they are ready.
Regarding pitfalls, I think the most common has to do with the mixed object-oriented/functional nature of Scala. Programmers need to spend the time to learn when to model their program using objects, when to model it using functions, and how to mix the two together. Personally, while writing Norbert, I had to go back and revisit some pieces of code multiple times as I became more comfortable with Scala.
InfoQ: Similarly, for teams that are considering a key-value store, what would be your word of advice? Since Voldemort is a fairly new system that might have some rough edges, what should they consider before adopting it?
Jay Kreps (author of Voldemort): It is true that it is fairly new by the standards of databases, which I think take around 5-10 years to mature. I think the reliability of any piece of infrastructure is really a combination of the software, the engineers programming to it, and the operational expertise running it. Some of the problems with new infrastructure are software problems, but many (I would say most) are a lack of knowledge by the users and operators. The advantage of MySQL isn’t that it doesn’t have any problems, but rather that those problems are very well understood (and avoided) by people who use it.
In this respect Voldemort is not that new at LinkedIn, we have been using it in production since late 2008.
Nonetheless I think that the QA cycle that any change to Oracle goes through is so complete and rigorous that it is essentially impossible for any open source technology to compete on software reliability. The newcomers can only compete on performance, scalability, and price. However, the common solution to scaling is to introduce a custom sharding layer on top of the database (whether MySQL or Oracle or whatever); once you do this, you have enough of your custom software in the mix that you are definitely in the same risk zone. So in comparison to a custom sharding implementation I do think Voldemort is better designed, better tested, and overall more stable.
Alejandro Crosa: Jay answered this one pretty well.
I might add, from a user's perspective, that of all the options out there (most of which are fairly new), I think Voldemort is the one that's most robust in terms of stability and feature set; the only problem I've seen is adoption, because its feature list stacks up really high. For example, the fact that the storage layer is pluggable means you can keep your MySQL storage and investment and still gain all the features Voldemort gives you.
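The pluggability Crosa mentions can be pictured with a toy interface (a sketch of the idea only, not Voldemort's actual StorageEngine API): callers program against an abstraction, and the backing engine can be swapped without touching them:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration of a pluggable storage layer (not Voldemort's real
// interface): the store delegates to whatever Engine it is given, so the
// backing implementation (in-memory, BDB, MySQL) is interchangeable.
public class PluggableStore {
    public interface Engine {
        void put(String key, String value);
        String get(String key); // null if absent
    }

    // One concrete engine; a MySQL- or BDB-backed engine would implement
    // the same two methods against its own backend.
    public static class InMemoryEngine implements Engine {
        private final Map<String, String> map = new ConcurrentHashMap<>();
        public void put(String key, String value) { map.put(key, value); }
        public String get(String key) { return map.get(key); }
    }

    private final Engine engine;
    public PluggableStore(Engine engine) { this.engine = engine; }
    public void put(String key, String value) { engine.put(key, value); }
    public String get(String key) { return engine.get(key); }
}
```

This is the design property behind Crosa's point: an existing MySQL deployment can sit behind the engine interface while the layer above adds the distribution features.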