Guardian.co.uk Switching from Java to Scala

The team behind guardian.co.uk which, according to its editor, has the second highest readership of any on-line news site after the New York Times, is gradually switching from Java to Scala, starting with the Content API, which provides a mechanism for selecting and collecting Guardian content.

The guardian.co.uk website comprises about 100,000 lines of code. It uses a fairly typical open-source Java stack of Spring, Apache Velocity and Hibernate, with Oracle providing the database. Like the website, the Content API was initially developed in Java, but the team decided to switch to another JVM-based language, Scala, in its place. Web Platform Development Team Lead Graham Tackley told us:

We've been a primarily Java development shop for a number of years now, and this has largely served us well. However, as a news website we want to be able to respond to events very quickly. The core Java platform that delivers www.guardian.co.uk has a full release every two weeks. Compared with many enterprise Java applications, this is excellent. Compared with other websites, it's very poor.

So we've been looking for a while at tools, approaches and languages that enable us to deliver functionality faster. This includes using lighter weight Java frameworks like Google Guice, radically different approaches to Java development like the Play framework, and using other platforms such as Python with Django. As part of this exercise we'd been playing with Scala for a while, but unlike the others we hadn't yet used it for any production code.

We were very keen that the first non-beta release of the Content API (API, Open Platform) should be the first iteration of an ongoing API, which could quickly evolve as we discovered all the interesting use cases that we hadn't initially thought of. To do this safely without breaking API clients, we needed a comprehensive set of integration tests. After some experimentation with writing these in Java, we decided instead to write just the integration tests in Scala, for three main reasons:

  1. The flexibility of the testing DSL provided by ScalaTest.
  2. We wanted to be excited about writing the integration tests, rather than them being a chore.
  3. Using Scala just for the tests meant we got to use it in anger without impacting production code directly.

After about four weeks of writing just the tests in Scala, we got fed up of having to write the main code in Java, and decided to convert the whole lot to Scala.

InfoQ: In general terms, how did you go about the migration? Did you re-write all the Java code in Scala for instance, or did you combine the two for a while?

The beta version of the Content API was based on a proprietary search engine. The current API uses the excellent Apache Solr (a talk on guardian.co.uk's use of Solr can be found here), and is also quite different in style to the beta one - the beta did a great job of showing us what we didn't want the API to look like. Therefore, before Scala came into the picture, we'd decided to re-implement the API rather than reuse the beta codebase.

We'd spent around six weeks with three people implementing in Java before we introduced Scala, so there wasn't a massive codebase to migrate. However, we weren't prepared to stop the project for a couple of weeks while we converted to Scala, so we migrated the existing integration tests gradually. As we'd used Maven as a build tool, introducing Scala was a matter of following the instructions to use the maven-scala-plugin to build mixed Java/Scala projects. This allows Java and Scala code to co-exist in the same project, and bi-directionally depend on each other. So we could convert on a class-by-class basis from Java to Scala, which worked far better than we ever imagined: it really did just work.

We took the same approach when converting the main code: over a number of weeks, as we touched a bit of code, we converted it. We then had a couple of days mop up at the end.

InfoQ: What are the libraries/frameworks that you have used for development?

Since we were using a language new to us all, we decided to limit the amount of new stuff that we needed to learn. We chose to stick with plain servlets wired with Google Guice, which is how we build our Java apps now. We use SolrJ, the Java Solr library, to talk to Apache Solr, Joda-Time for date time manipulation and Mockito for unit test mocking (this worked fine with Scala code too).

Sometimes we consciously chose to stick with what we knew to ensure timely delivery: the XML formatted endpoints are generated not using Scala's excellent XML support, but using javax.xml.stream.XMLStreamWriter just as we would in Java code. We'd already written this before moving to Scala; it worked, it was readable, so we left it. However, we did switch to use the excellent JSON library from Lift - lift-json - to generate the JSON formatted endpoints as the code was far clearer than with the Java JSON library we were using.

InfoQ: What IDEs do you use for development? What is Scala IDE support like?

We use JetBrains IntelliJ IDEA 10; some of us use the community edition and some use the ultimate edition. The Scala plugin is pretty good but not perfect. Code completion, find usages, and similar navigation nearly always work just fine. It's not as good as the Java support at red-highlighting code that isn't valid, and we had some problems with it finding ScalaTest test methods, but other than that we were in our familiar environment working as we always had, just in a much more powerful language.

InfoQ: I'm assuming the majority of the developers on the project were Java programmers? How easy did the developers on the project find learning Scala?

Yes, all of us were quite experienced Java programmers. The initial team of four had huge fun learning Scala: often one of us would come in raving about this new Scala feature we'd discovered and sharing it with the rest of the team. We had a buzz that had long been missing in our Java development. Because we were all learning together, this worked really well. In the first couple of weeks, though, there were occasions when we'd be battling to implement something in a good Scala way and couldn't figure it out. Knowing you could just churn out the Java code made this particularly frustrating. There were a few days where we went home in frustration saying, "We're going back to Java tomorrow". Each time, a fresh look in the morning was all it needed to move on.

Since then, we've had around ten other Java devs move to pick up Scala. As always, people learn at different speeds and in different ways, but all have come through that and nearly all now get frustrated when they have to write Java code.

One of the things we compare learning Scala against is moving to a different platform like Python/Django or Ruby on Rails. With Scala, at least 75% of what you're working with is the same as in Java. You can use the same libraries and IDE, the way you package jars and wars is the same, your runtime environment and runtime characteristics are the same. A good Java developer can learn to write Java-style code in Scala in a day, then they learn the power of closures and implicit conversions and very soon they're more productive than they were in Java.

InfoQ: One of the common criticisms of Scala as a language boils down to it being too complex. A lot of the time I think this is really about readability - the idea being that it is easier to pick up someone else's code if it is written in a more rigid language like Java. Do you think the criticism is fair? How do you counter it?

I agree, readability is by far the most important characteristic of a codebase. I don't care whether code is imperative or functional, or is idiomatic Scala or Java-without-semicolons; I only care whether it's readable. When we were learning new Scala features, we chose whether to use them based on whether the intent of the resulting code was more obvious. In one example, we tried using the Scala Either class to eliminate a few if statements: the team collectively concluded that the if statements were more readable, so we dropped the use of Either in that case.

It's true that due to the rigidity of Java individual lines of code are always easily understood. But that's rarely the problem in understanding any non-trivial codebase: I don't want to understand the detail, I want to understand the intent. Good class design and OO techniques help address this in Java, but I still often find when reading Java code that I cannot see the wood for the trees. In Scala I have the power to express the intent in a way I rarely can in Java.

For example, the Content API needs to decide whether to return results in XML, JSON or redirect to the HTML explorer. We support a format= query string parameter, a .xml or .json extension on the URL, and specification in an HTTP Accept header. Here's the code that does that, which I think is a good example of how Scala's power aids expression of intent (it's just chaining calls to Scala's Option class):

def negotiateFormatParameter = getParam("format").
    orElse(getExtension).
    orElse(getExtensionFromAcceptHeader).
    getOrElse("html")
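To make the priority order of that chain concrete, here is a minimal self-contained sketch. The `Request` case class and the bodies of the three helpers are assumptions invented for illustration; the source shows only the chain itself, not how the helpers are implemented.

```scala
object FormatNegotiation {
  // Hypothetical stand-in for the servlet request; not the Guardian's real type.
  case class Request(
      params: Map[String, String],
      path: String,
      acceptHeader: Option[String]
  )

  // Explicit query string parameter, if present.
  def getParam(req: Request, name: String): Option[String] =
    req.params.get(name)

  // ".xml" or ".json" extension on the path, if present.
  def getExtension(req: Request): Option[String] =
    List("xml", "json").find(ext => req.path.endsWith("." + ext))

  // Deliberately simplistic Accept handling (assumed): only a single,
  // exact media type is recognised.
  def getExtensionFromAcceptHeader(req: Request): Option[String] =
    req.acceptHeader.flatMap {
      case "application/json"             => Some("json")
      case "application/xml" | "text/xml" => Some("xml")
      case _                              => None
    }

  // Each orElse alternative is only evaluated when everything before it
  // produced None; getOrElse supplies the final default.
  def negotiateFormat(req: Request): String =
    getParam(req, "format").
      orElse(getExtension(req)).
      orElse(getExtensionFromAcceptHeader(req)).
      getOrElse("html")
}
```

Because `orElse` takes its argument by name, later sources are not consulted once an earlier one has produced a value, so the order of the chain is the order of precedence.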

There's also a good case that readability is at least partially a function of how much code you have to read. My Java code tends to end up with lots of lines of code unrelated to the problem I am trying to solve, whether this be null checks, getters & setters, constructors for dependency injection or manipulating collections. All of these problems have much more concise expressions in Scala. Of course, much of this can be autogenerated by your IDE, but when reading your codebase I still have to read your constructor and your getters and setters to see if you've customised them.

A classic example is to compare a simple class in Java and Scala:

Java:

public class WelcomeClass {
    private String name;
    
    public WelcomeClass(String name) {
        this.name = name;
    }
    
    public String sayHello() {
        return "Hello " + name;
    }
}

and in Scala:

class WelcomeClass(name: String) {
    def sayHello = "Hello " + name
}

The Java version has to tell me that name is a String three times and it mentions "name" five times. The Scala version mentions "name" only twice, and only says it's a string once. Of course this is a trivial example, but it's symptomatic of what we've found in Scala: less boilerplate code and less needless repetition means fewer trees and more wood, i.e. it's easier to see the intent, not just the detail.

I tend to find that when reading an individual line of code in Scala it sometimes takes a little longer to understand how it's working, but this is more than made up for by the drastic reduction in the number of lines of code.

InfoQ: Sticking with complexity for a moment, do certain aspects of the language - I'm thinking here mainly about things like symbolic names and implicits - cause problems in real-world usage?

Actually implicit conversions really helped us. As I mentioned, we use the SolrJ Java library to talk to Solr, which is an excellent library but, as a Java library, it loves returning nulls. To avoid null checks littering our codebase, we implicitly convert key classes to ones which have more Scala-like methods. So far from causing problems in real-world usage, it actively solves them. In addition, the IntelliJ Scala plugin now understands implicits in nearly all cases, so if you're not sure what's happening control+click will take you to what's actually being called.
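The pattern described here can be sketched roughly as follows. This is an assumption-laden illustration, not the team's code: `JavaStyleDocument` is a hypothetical stand-in for a Java class whose getter returns null and is not SolrJ's real API, and `RichDocument`, `enrich` and `headline` are names made up for the example. The `scala.language.implicitConversions` import is needed on recent compilers and did not exist in Scala 2.8.

```scala
import scala.language.implicitConversions

object NullSafeFields {
  // Hypothetical Java-style class: the getter returns null for a missing field.
  class JavaStyleDocument(fields: java.util.Map[String, String]) {
    def getFieldValue(name: String): String = fields.get(name) // null if absent
  }

  // Wrapper that exposes the same lookup as an Option instead of a nullable value.
  class RichDocument(doc: JavaStyleDocument) {
    def field(name: String): Option[String] = Option(doc.getFieldValue(name))
  }

  // The implicit conversion: when code calls `field` on a JavaStyleDocument,
  // the compiler wraps the document in a RichDocument automatically.
  implicit def enrich(doc: JavaStyleDocument): RichDocument = new RichDocument(doc)

  // Calling code needs no null check; the default is stated in one place.
  def headline(doc: JavaStyleDocument): String =
    doc.field("headline").getOrElse("(untitled)")
}
```

`Option(x)` yields `None` when `x` is null, which is what lets the null check disappear from the calling code.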

We've tended to steer clear of heavily symbolic libraries and using symbols for method names, but I think this is an important feature of the language, which like any feature is possible to over-use. Sometimes it makes sense: our method to extract query strings from the HTTP request is called "?", which reads really well in the code. Much more than in Java, the great power of Scala brings great responsibility to focus on whether you're making the intent of your code easier to read. Just because that power can be misused doesn't mean I don't want the power.
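As a rough idea of how a method named "?" might read at the call site, consider the sketch below. Only the method name comes from the source; the `Request` wrapper, its constructor and the `format` helper are hypothetical.

```scala
object QueryStrings {
  // Hypothetical request wrapper holding already-parsed query parameters.
  class Request(params: Map[String, String]) {
    // Symbolic method name: looks up a single query string parameter.
    def ?(name: String): Option[String] = params.get(name)
  }

  // Infix use of the symbolic method, with a default when the parameter is absent.
  def format(request: Request): String =
    (request ? "format").getOrElse("html")
}
```

The call `request ? "format"` is ordinary method invocation written infix; it is equivalent to `request.?("format")`.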

InfoQ: Another concern for using Scala in enterprise applications is that each new version seems to break backwards compatibility - so programs compiled against Scala 2.8 are not compatible with binaries compiled earlier and so on. How significant do you think this is? How do you manage incompatibilities within Scala projects?

We started off writing to Scala 2.7.7, and migrated to both 2.8.0 and 2.8.1 soon after they were released. It was pretty painless; the 2.8.0 migration took less than a day (and that was only because I wanted to eliminate the deprecation warnings) and 2.8.1 was drop in. All of the libraries we were using already published versions for multiple Scala versions so that all just worked.

The only time it was painful was when, on my personal projects, I was using 2.8 pre-releases. But then, it was my choice to use clearly labeled pre-release code.

We tend now to use simple-build-tool for Scala projects instead of Maven, which makes it easy to release internal libraries for multiple Scala versions.

I'd much rather face some incompatibility when I choose to upgrade than face the situation in Java where some things are never changed and can never be changed. The commonly used HttpServletRequest.getHeaders still returns a java.util.Enumeration, which was effectively deprecated in Java 1.2.
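For readers who have not met this API, the sketch below shows one way Scala code can cope with a `java.util.Enumeration` by adapting it to a Scala `Iterator`. It is an illustration only: `asIterator` and `sampleHeaders` are hypothetical names, and the sample data stands in for what a servlet request would return. The standard library also ships converters for this purpose, though their package name differs between Scala versions.

```scala
object EnumerationAdapter {
  // Wraps a java.util.Enumeration as a Scala Iterator so that map, filter,
  // toList and the rest of the collections API become available.
  def asIterator[A](e: java.util.Enumeration[A]): Iterator[A] =
    new Iterator[A] {
      def hasNext: Boolean = e.hasMoreElements
      def next(): A = e.nextElement()
    }

  // Assumed sample data standing in for the values of a repeated header.
  def sampleHeaders: java.util.Enumeration[String] =
    java.util.Collections.enumeration(
      java.util.Arrays.asList("text/html", "application/json"))
}
```

With the adapter in place, a call such as `EnumerationAdapter.asIterator(headers).toList` gives an ordinary Scala list to work with.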

InfoQ: What is the situation with regard to finding programmers to continue supporting the Scala code? Are good Scala programmers as easy to find as good Java programmers, for instance?

We primarily recruit good software developers, rather than looking specifically for Scala developers. We tend to find that good web software developers are polyglot in outlook and have often at least played with Groovy, Scala, Clojure, Ruby or Python. Such people have usually relished the opportunity to work with Scala.

InfoQ: About how much Scala production code is guardian.co.uk running today?

The core codebase behind guardian.co.uk, the one on a two-week release cycle, had as of two weeks ago one single Scala class within it. We've maintained a self-imposed no-Scala rule on that codebase until recently, just to be sure we're fully ready as a team to embrace Scala (and to stop me rewriting it).

However, many different parts of the site are now driven by Scala microapps including Search (which is written in Scala with Lift), Most Viewed, Punctuated Equilibrium Mystery Bird and the related content component on every article page.

Furthermore, our new identity platform, under development but with its first iteration in production, is written in Scala.

InfoQ: Will you be using more Scala code in the future?

We've found that Scala has enabled us to deliver things faster with less code. It's reinvigorated the team. We'll continue to use the right tool for the job whether that be Scala, Python, .NET, PHP or Bash.

In the last six months, all of the new JVM-based projects have used Scala and none have selected Java. I can't see us starting any new project in Java now, especially given the disappointing feature set and timeframe of Java 7.

For developers interested in learning the language, Tackley recommended "Programming in Scala" by Martin Odersky et al. The 2nd edition covers Scala 2.8. He also told us:

We found using the Scala REPL (command line) was a good way to experiment with writing Scala code. And, despite what others may say, don't fear just using Scala as a better Java-without-semicolons in the first few days, weeks or months. You'll be missing out if you end your journey here, but it's not a bad phase to go through. Keep learning and embrace features of the language incrementally. The ability to do this is what I think makes Scala unique as a next step for Java developers.

Besides the technical aspects, guardian.co.uk's Content API, and the broader Open Platform suite of services of which it forms a part, is interesting from a business point of view, since it represents a very different approach from that favoured by a growing number of quality newspapers in the UK and elsewhere, i.e. to place their content behind paywalls. To date The Financial Times, and News International's Times and The Sunday Times have all decided to go down this path, and more recently The New York Times has started to roll out a paid content model. Veteran BBC journalist John Humphrys argued in The Sun, a tabloid News International paper, that "Good journalism has to be paid for, just as we have to pay for the plumber who fixes a leak, or it will not survive". Tackley, however, takes a different view:

We firmly believe that the future of digital publishing is to engage and integrate with the rest of the web, not to retreat from it.

The Content API is able to extend the reach and brand of the Guardian into significant areas that we would otherwise not be able to reach. This is helped by third parties and partners who use the API, as they can invest in an area and use the relevant Guardian content.

We have a number of different tiers of access to the Content API: without registration, you can access content metadata but not the actual content, with a limited QPS (queries per second) rate. After registration, you can access the article bodies, which include embedded adverts, again with a limited QPS. At the top tier, you become a partner of the Guardian and we agree an appropriate commercial agreement.

A good example partner is WhatCouldICook.com, built by a small independent developer. The site integrates tightly with Guardian content: it uses the API to extract and parse all the recipes we publish and presents them in a great way for people wanting to cook. Furthermore, our readers benefit because we include functionality from whatcouldicook.com on our site; see, for example, the recipe search on the right hand side of the Guardian website.

Our WordPress plugin, built on top of the API, enables any WordPress user to include related Guardian content on their blog.

This level of innovation in specific areas is something that simply would not happen without the Content API. Furthermore, we extensively use the Content API for Guardian-driven projects to aid our rate of innovation: site features like search and zeitgeist, our mobile site and our iPhone app are all driven by the Content API.

 
