Clojure and Rails - the Secret Sauce Behind FlightCaster
Clojure is a Lisp for the JVM created by Rich Hickey. Over the past year it has gained a lot of attention, mostly due to its concurrency features such as support for Software Transactional Memory (STM), as well as its powerful data structures. The recent rise of interest in functional languages also didn't hurt. A few months after the release of Clojure 1.0, real-world projects implemented in Clojure are now appearing.
FlightCaster is a new site for flight delay prediction. Its web frontend is built with Rails and deployed on Heroku. The data processing backend is written in Clojure and runs on Hadoop and Cascading, using Cloudera and other tools.
We talked to Bradford Cross of the project about the architecture powering FlightCaster, how Clojure was used to implement it, as well as tips for budding Clojure and Lisp programmers coming from OOP languages.
InfoQ: Could you explain what FlightCaster does, i.e. the analysis part?
FlightCaster predicts flight delays in real time. The analytical work involves applying statistical inference and machine learning techniques. The precise techniques in question concerning matters of flight delay prediction do not exist, and even if they did exist, I certainly wouldn't know anything about said matters. :-)
Cloudera is a wonderful provider of services, Hadoop distributions, deployment scripts, and AMIs (Amazon Machine Images) for folks doing large-scale distributed processing with Hadoop. We use Cloudera to deploy Hadoop clusters of between 10 and 100 nodes on EC2 for our data processing and analytical work. We have found a tremendous amount of value in using the Cloudera distributions to alleviate some of the complexities of deploying Hadoop on EC2. There are only two of us working on the research side of things, so leaning on Cloudera has been a great help.
InfoQ: Which parts are actually written in Clojure?
Another critical piece of infrastructure is Cascading, an excellent layer on top of Hadoop that adds further abstraction and functionality. We definitely recommend Cascading to anyone doing serious data processing and mining with Hadoop.
All our Clojure runs on top of Cascading in production.
Two main parts of the system are written in Clojure.
One is all the preprocessing and transforming of the data into the proper view for analysis. This involves filtering, as well as multi-stage distributed joins and such. Getting the right view into unstructured data from heterogeneous sources can be quite tricky. For instance, we have to build time series views into the data because a lot of our analysis takes temporal factors into account. Anyone who has really built these sorts of systems knows how much work the data processing can be - and Clojure + Cascading are a massive help for that.
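As a rough illustration of what a time series view can mean, the sketch below groups raw events by flight and orders each group by timestamp. It runs in memory on plain Clojure collections, whereas the production jobs described in the interview run on Cascading and Hadoop. The function name and the keys `:flight`, `:t` and `:delay` are hypothetical, chosen only for this example.

```clojure
;; Hypothetical event shape (not from the interview):
;;   {:flight "AA100" :t 2 :delay 15}
(defn time-series-view
  "Groups events by flight and returns, per flight, the delays ordered by time."
  [events]
  (into {}
        (map (fn [[flight evs]]
               ;; sort each flight's events by timestamp, keep only the delays
               [flight (mapv :delay (sort-by :t evs))]))
        (group-by :flight events)))

(time-series-view [{:flight "AA100" :t 2 :delay 15}
                   {:flight "UA7"   :t 1 :delay 0}
                   {:flight "AA100" :t 1 :delay 5}])
;; => {"AA100" [5 15], "UA7" [0]}
```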
The other part that is built in Clojure is all the statistical inference and machine learning code. We would go into more detail explaining this part of the system, if such a part of the system were to exist. Supposing that such a part of the system were to exist, it certainly seems like it would benefit greatly from Clojure's lovely functional abstractions, macro system, rich immutable data structures and sequence processing libraries, destructuring, and monadic abstractions for composing complex multi-phase computations that might have intermediate failures.
InfoQ: Do you make use of Clojure's concurrency and STM features?
We don't use Clojure's built-in concurrency features despite their coolness. Rather, we take advantage of another of Clojure's attributes, the pragmatic choice to build it on the JVM. We just delegate parallelism and distributed computation to Cascading + Hadoop, upon which we have built a very pleasant little layer that we might open source if we get the time.
InfoQ: Could you give a short overview of how your Clojure code is structured? E.g. how you use namespaces, multi-methods, macros, etc.
Having previous experience with functional programming has (unsurprisingly) led us to structure the code in a very functional style. We use namespaces as one would use them in any other language. We tend to use very few macros and multi-methods, although where we do use them they are just the right abstraction for the job. People get the impression that Lisp is all about lots of macros and meta-sauce. While that is true in some ways, you can and should go very far with the basic building blocks of FP: lambdas, HOFs, currying, partial application, and so on.
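To make the "basic building blocks" concrete, here is a minimal sketch that uses only a lambda, higher-order functions, `partial` and `comp`. The flight maps and all function names are hypothetical and are not taken from FlightCaster's code.

```clojure
;; Hypothetical flight shape: {:carrier "AA" :delay-minutes 75}
(defn delayed-by?
  "True when the flight is delayed by at least `threshold` minutes."
  [threshold flight]
  (>= (:delay-minutes flight) threshold))

;; Partial application fixes the threshold and yields a one-argument predicate.
(def badly-delayed? (partial delayed-by? 60))

;; Function composition: extract carrier codes, then drop duplicates.
(def carrier-codes (comp distinct (partial map :carrier)))

(defn badly-delayed-carriers
  "Carriers that have at least one badly delayed flight."
  [flights]
  (carrier-codes (filter badly-delayed? flights)))

(badly-delayed-carriers [{:carrier "AA" :delay-minutes 75}
                         {:carrier "UA" :delay-minutes 10}
                         {:carrier "AA" :delay-minutes 120}])
;; => ("AA")
```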
I am not a monad master by any means, but I feel similarly about monads as I do about multi-methods and macros. These abstractions are great, powerful, cool, and they get a lot of attention. I try to use these tools when they seem like the natural abstraction to make things easier, but a lot of the time all it takes is a little rigor in your thinking and you can do fine with plain old functional programming and data structures.
A last word on monads in Clojure. Some people seem to have the impression that monads are for state, and Clojure already has lots of effective ways of dealing with state, so why use monads? We haven't used the state monad, but there are lots of other useful monads. I like the way Brian Beckman describes monads; they are just function composition in disguise. Our biggest win has come from using the maybe concept for safe-composing multi-stage computations that can fail or bump into a nil along the way.
Destructuring bind, which seems to garner less attention than abstractions like macros and monads, is really a massively powerful abstraction in practice. The way that Rich elected to de-couple destructuring bind from pattern matching was brilliant. I am sure that sometime soon we will have ML style pattern matching in Clojure; people have already been working on implementations.
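The "maybe" idea can be sketched without a monad library: compose stages left to right and stop as soon as one of them returns nil. This is an illustrative stand-in, not the code the team used; `maybe-pipeline`, `delay-hours` and the map keys are hypothetical names. Clojure 1.5 and later also ship the `some->` threading macro, which short-circuits on nil in the same spirit.

```clojure
(defn maybe-pipeline
  "Returns a function that applies `stages` left to right and
   short-circuits to nil as soon as any intermediate value is nil."
  [& stages]
  (fn [x]
    (reduce (fn [acc stage]
              (if (nil? acc)
                (reduced nil)      ; stop early, later stages never run
                (stage acc)))
            x
            stages)))

;; Hypothetical three-stage computation over a nested record.
(def delay-hours
  (maybe-pipeline :flight
                  :delay-minutes
                  #(/ % 60.0)))

(delay-hours {:flight {:delay-minutes 90}}) ;; => 1.5
(delay-hours {:flight {}})                  ;; => nil, division is never attempted
(delay-hours nil)                           ;; => nil
```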
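A small sketch of destructuring bind in a function signature, assuming a hypothetical flight record: one parameter declaration pulls apart a map, a nested vector, and supplies a default.

```clojure
(defn describe-flight
  "Destructures a flight map directly in the parameter list."
  [{:keys [carrier number]            ; bind :carrier and :number
    [origin destination] :route       ; nested positional destructuring
    :or {carrier "??"}}]              ; default when :carrier is missing
  (str carrier number " " origin "->" destination))

(describe-flight {:carrier "AA" :number 100 :route ["SFO" "JFK"]})
;; => "AA100 SFO->JFK"
(describe-flight {:number 7 :route ["SFO" "LAX"]})
;; => "??7 SFO->LAX"
```

The parameter list doubles as documentation of the shape of data the function expects, without any pattern-matching machinery involved.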
InfoQ: You mentioned Clojure readers for importing data formats to Clojure data structures; do you use Reader macros?
Clojure doesn't have reader macros and we haven't had a pressing need for them as of yet. In order to read and write Clojure data structures we just use the idiomatic print-dup, which allows you to define multimethods to dispatch your printers. Printer implementations are only really necessary where you need to read and write a type that has no built-in literal representation. For instance, we have a special printer for joda-time dates and we also read them back in a special way.
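A minimal sketch of a custom `print-dup` method, under two assumptions that are not from the interview: `java.time.LocalDate` stands in for a joda-time date so the example needs no external dependency, and the value is read back through the reader's `#=` eval form. How FlightCaster actually reads its dates back is not described. The helper name `dup-str` is hypothetical.

```clojure
(import 'java.time.LocalDate)

;; Assumption: emit a reader-eval form that calls LocalDate/parse on read.
(defmethod print-dup LocalDate [^LocalDate d ^java.io.Writer w]
  (.write w (str "#=(java.time.LocalDate/parse \"" d "\")")))

(defn dup-str
  "Prints x to a string with *print-dup* enabled, so print-dup methods are used."
  [x]
  (binding [*print-dup* true]
    (pr-str x)))

(def departure (LocalDate/of 2009 9 1))

(dup-str departure)
;; => "#=(java.time.LocalDate/parse \"2009-09-01\")"

;; Reading the string back yields an equal LocalDate.
(= departure (read-string (dup-str departure)))
;; => true
```

The `#=` form only works while `*read-eval*` is enabled, so this approach is suitable only for data from a trusted source.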
InfoQ: You mention writing out Clojure data structures - do you use that to serialize data for transport, for storage, or other uses?
We use Clojure data structure literals both as an intermediate representation for communication and for storage. For example, the output of all our data transformation jobs is Clojure data structure literals, and the intermediate representation in all our Hadoop jobs is as well. This is a critical part of our layer of goodness on top of Cascading and Hadoop that lets us weasel out of dealing with Hadoop input formats.
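The idea of literals as an interchange format can be sketched as one printed record per line of text, which any line-oriented job can pass along. This is an assumption about the general shape only; the function names and record keys are hypothetical, and the sketch uses `clojure.edn` for reading, which suffices for plain maps, vectors, strings and numbers.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.string :as str])

(defn records->lines
  "Serializes each record as a Clojure literal on its own line."
  [records]
  (str/join "\n" (map pr-str records)))

(defn lines->records
  "Parses text written by records->lines back into data structures."
  [text]
  (map edn/read-string (str/split-lines text)))

(def sample [{:flight "AA100" :delays [5 12]}
             {:flight "UA7"   :delays []}])

;; Writing and then reading gives back equal data.
(= sample (lines->records (records->lines sample)))
;; => true
```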
InfoQ: Is there something you'd like to get added to Clojure or the Clojure ecosystem (libraries, tools, ...)?
Someone will come up with a high quality destructuring pattern matching facility and that will be quite useful. I'm a big fan of that style of "guard, guard, base-case" composition for programming in the small. We still do it without pattern matching, and Clojure has a lot of nice little abstractions that make it so you don't miss pattern matching as much, but it would still result in cleaner code in a lot of places in my small opinion.
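Without pattern matching, the "guard, guard, base-case" shape is commonly written with `cond`. The function below is a hypothetical example of that shape, not code from the project.

```clojure
(defn mean-delay
  "Mean of a collection of delays, with guards for the degenerate inputs."
  [delays]
  (cond
    (nil? delays)   nil                                   ; guard: no data at all
    (empty? delays) 0                                     ; guard: nothing recorded
    :else           (/ (reduce + delays) (count delays)))) ; base case

(mean-delay nil)        ;; => nil
(mean-delay [])         ;; => 0
(mean-delay [10 20 30]) ;; => 20
```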
I suppose it would be cool if Rich opened up the reader; it might help out for creating nice syntax for things like the pattern matching and monad implementations. Then again, Clojure is my first Lisp experience so I have no experience with reader macros - so I am just speculating and I don't really know what I am talking about on this topic.
It would also be nice if Rich doesn't reserve the vertical bar for anything so we can keep it as part of our core DSL for conditional probability notation. :-)
InfoQ: Do you have any tips or recommendations for Clojure libs that you use(d)?
The number 1 tip for working with Clojure libs is that Clojure-core and Clojure-contrib are small, so go read all the code. You will find cool stuff in there. Keep an eye out for all the amazing data structures and data structure processing functions. For example, Clojure has a delightful implementation of sets and operations on sets, as well as some quasi-relational algebra thrown in for good measure.
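The quasi-relational algebra lives in the `clojure.set` namespace, which treats sets of maps as relations. A short sketch with made-up data (the relations and keys are hypothetical):

```clojure
(require '[clojure.set :as set])

;; Two relations, each a set of maps.
(def flights  #{{:flight "AA100" :carrier "AA"}
                {:flight "UA7"   :carrier "UA"}})
(def carriers #{{:carrier "AA" :name "American"}
                {:carrier "UA" :name "United"}})

;; Natural join on the shared :carrier key, then select rows, then project columns.
(def result
  (-> (set/join flights carriers)
      (->> (set/select #(= "AA" (:carrier %))))
      (set/project [:flight :name])))

result
;; => #{{:flight "AA100", :name "American"}}

;; Ordinary set operations work as well.
(set/union #{"SFO" "JFK"} #{"JFK" "LAX"})
;; => a set containing "SFO", "JFK" and "LAX"
```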
One parting thought is that I have come to think of Clojure as a sort of data-structure-oriented functional programming.
Clojure has an amazing core set of data structures. Moreover, all these data structures have literal representations so they work naturally with the reader and destructuring. The combination of all this is downright pleasurable.
In the ML-descended functional languages, functions have type signatures. In Clojure, functions have data structure topology signatures.
InfoQ: FlightCaster's web frontend is written using Rails, and deployed on Heroku. What were the reasons for choosing Rails and Heroku?
When I came on board to build the intelligence, Heroku and Rails were already the direction for production. It is a direction that makes sense; the Ruby and Rails ecosystems are productive and well-trodden paths for building webapps. Heroku and Rails were a natural choice for the team since two of the other founders had Rails experience, and one of the Heroku founders is close friends with our CEO. :-) FlightCaster shares office space with Heroku and they have been great! Having an inside connection surely doesn't hurt.
InfoQ: How do you integrate your Web UI with the Clojure backend?
Rails is not just the web frontend but also the webserver. We use Clojure for the data processing and machine learning research. We integrate using a very simple strategy: we have some Clojure code that produces a JSON intermediate representation of our predictive model; we then push that to the Ruby side of the world and read in the JSON.
Stuart Halloway, the author of the first book on Clojure, recently published an article showing different techniques of using Clojure. The article provides several examples of encapsulation, polymorphism, etc. in Clojure - which should be interesting for developers coming from an OOP background (turns out, there's a life outside classes and inheritance).
For more general information on Clojure, see InfoQ's interview with Rich Hickey, which touches on topics such as STM, concurrency and also multimethods. A talk by Rich on Clojure offers a more detailed look at Clojure features and its design principles.