Bio: Craig Motlin is the technical lead for GS Collections, a full-featured open-source Collections library for Java, and is the author of the framework’s parallel, lazy API. He has worked at Goldman Sachs for 9 years on several teams focusing on application development before moving to the JVM Architecture team to focus on framework development.
1. Hi, I’m here with Craig Motlin at QCon New York 2014. Craig is the Technical Lead of GS Collections and a Software Engineer at Goldman Sachs. I wonder if you can start off Craig, by telling us a little bit about where GS Collections came from and how long you’ve been involved with it?
Sure, thanks Alex. GS Collections started many years ago, I think it was in 2006; it started as an internal project that just had a few optimized collections, as most frameworks start out to solve a specific smaller problem. The problem was that back then Java heaps were a lot smaller, we didn’t have 64 bits yet and we were wasting a lot of memory on very small lists. If you have an ArrayList as a field in another object and you have a lot of those, then the default size of ten is a lot, and you can waste a lot of space if they all hold just one or two elements. So we started with optimized, memory-efficient collections, and over the years it has grown from there. I’ve been involved with the project since about 2008 and we open sourced it in 2012.
Sure, well we have a number of frameworks that are internal, and this one was obviously an internal framework for a number of years. With this one we realized that, besides all of the regular benefits of open sourcing anything, we had the opportunity to potentially affect future versions of Java, specifically Java 8. We have several Tech Fellows who participate in Java Expert Groups on the Java Executive Committee, and open sourcing GS Collections was a more direct way to participate: by proving our credibility and our expertise in the area, but also by giving a concrete implementation of the sort of thing that we would like to see in the Collections library when lambdas come out.
I believe it did have an indirect influence. One project that comes to mind is actually JodaTime and the path that it went through to eventually become part of Java. I would like to see the same thing happen for GS Collections in the future, but right now we have the Streams API and, in an indirect way, it took some small inspiration from GS Collections as well as other collections libraries. In addition, one specific thing that went into the Streams API is the decision to put some of the API on a separate wrapper interface like Stream, as opposed to putting all the API on Collection itself, and I think that was partially due to our involvement. One nice thing about separating it is that Streams have a totally read-only API. The mutating methods, including the new ones like removeIf(), are all directly on Collection, and one observation was that too much new API going on to Collection could break other libraries out there. You have Hibernate, who have their own lists, and we have our own lists, so if you add too many new methods right on Collection you risk breaking third-party libraries.
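As a small illustration of that split, here is a sketch that uses only standard JDK 8 classes. The class name StreamVersusRemoveIf and the sample data are made up for the example. The stream pipeline builds a new list and leaves its source alone, while removeIf() changes the collection it is called on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative class name and data; only JDK 8 APIs are used.
final class StreamVersusRemoveIf {
    // Stream pipeline: reads from the source and collects into a new list.
    static List<Integer> evens(List<Integer> source) {
        return source.stream()
                .filter(n -> n % 2 == 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5, 6));

        List<Integer> evens = evens(numbers);
        System.out.println(evens);   // [2, 4, 6]
        System.out.println(numbers); // [1, 2, 3, 4, 5, 6], the source is unchanged

        // removeIf() lives on Collection and mutates the list in place.
        numbers.removeIf(n -> n % 2 == 0);
        System.out.println(numbers); // [1, 3, 5]
    }
}
```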
So if I could draw a Venn diagram of features, there are many collections frameworks out there and they overlap a lot, so a lot of the features that are there in Streams we have and vice versa. I think that one of the reasons the landscape is so fragmented is that the collections that have been in Java up to version 7 are very bare-bones collections, and so there are many frameworks out there that build on top of them. We saw a lot of people internally using libraries like Trove in order to get memory-efficient primitive collections, and we have that as well in GS Collections now. We also have a lot of the functional API that you find in patterns like filter() and map(), although we use the Smalltalk-style names like select() and collect(). One of the biggest differences between our philosophy and Guava’s is that Guava is deliberately a supplement to what’s built-in: there are no such things as MultiMaps or Bags (called MultiSets as well) in the Java Collections Framework, so Guava adds them, but they don’t attempt to re-implement anything that’s already built-in. GS Collections is a supplement as well but it can be a complete replacement; we have our own Lists, Sets and Maps, and when we implement higher-level containers like MultiMaps we base them on our own. We’ve done that for full control and also for the ability to optimize every collection, so our Sets and Maps are optimized for memory and therefore the MultiMaps and the Bags that we have are also going to benefit from that.
On the mutable side of our collections hierarchy, our MutableList, our MutableSet and our MutableMap inherit, or implement I should say, the java.util interfaces that you'd expect. With the immutable collections you have an extra step if you want to interoperate, because on our immutable collections we deliberately do not want methods like add() and remove(). It’s not just runtime immutability; we want it contractually as well. So the best interoperability is on the mutable side.
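To make the difference between runtime and contractual immutability concrete, here is a minimal sketch. ReadOnlyList and ContractualImmutability are hypothetical names invented for this example and are not the GS Collections interfaces. The JDK wrapper still exposes add(), which compiles and only fails when it runs, whereas an interface that never declares add() makes the call a compile error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical interface for illustration; not the GS Collections API.
// It simply does not declare add() or remove().
interface ReadOnlyList<T> {
    int size();
    T get(int index);
}

final class ContractualImmutability {
    // Copies the elements so later changes to the source are not visible.
    static <T> ReadOnlyList<T> copyOf(List<T> source) {
        final List<T> copy = new ArrayList<>(source);
        return new ReadOnlyList<T>() {
            @Override public int size() { return copy.size(); }
            @Override public T get(int index) { return copy.get(index); }
        };
    }

    // Returns true if the JDK unmodifiable wrapper rejects add() at runtime.
    static boolean addFailsAtRuntime(List<String> list) {
        try {
            Collections.unmodifiableList(list).add("x");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("a", "b"));

        // Runtime immutability: the call compiles and fails only when executed.
        System.out.println(addFailsAtRuntime(names));

        // Contractual immutability: there is no add() to call.
        ReadOnlyList<String> readOnly = copyOf(names);
        // readOnly.add("c"); // would not compile
        System.out.println(readOnly.size());
    }
}
```

The cost of the contractual approach is the extra interop step mentioned above: a type like this is not a java.util.List, so it has to be converted before it can be passed to code that expects one.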
Partially for the mental model and the readability benefit: you know contractually that this thing is really not going to mutate. You also get all of the benefits of thread-safety for immutable collections, as well as the memory savings. So an ImmutableList, if it’s array-backed, is always going to be trimmed; if it’s small enough we are going to give you a specific container for the exact size, like a List of size three that has three fields and no backing array.
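A fixed-size container of that kind might look roughly like the following sketch. TripleList is a hypothetical class written for this example, not the GS Collections implementation. It extends the JDK's AbstractList for brevity, so unlike the contractual approach described earlier it still exposes add() and set(), which AbstractList rejects at runtime with UnsupportedOperationException.

```java
import java.util.AbstractList;

// Hypothetical class for illustration; not the GS Collections implementation.
// Three elements are held in three fields, so there is no backing array
// and no unused capacity.
final class TripleList<T> extends AbstractList<T> {
    private final T first;
    private final T second;
    private final T third;

    TripleList(T first, T second, T third) {
        this.first = first;
        this.second = second;
        this.third = third;
    }

    @Override
    public int size() {
        return 3;
    }

    @Override
    public T get(int index) {
        switch (index) {
            case 0: return first;
            case 1: return second;
            case 2: return third;
            default:
                throw new IndexOutOfBoundsException("Index: " + index + ", Size: 3");
        }
    }

    public static void main(String[] args) {
        TripleList<String> list = new TripleList<>("a", "b", "c");
        System.out.println(list); // [a, b, c]
    }
}
```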
It varies based on the types, so List I think we go up to eleven if you count the zero indexed one, for Sets and Maps I think we have three or four. There comes a point where there is a drop-off in benefit and there's just too many classes to write or generate.
The small memory-efficient ones, we actually did write by hand. When we get to the area of primitive collections though, it would be too much to write by hand; they are very repetitive and that is where we leverage code generation. So we have things like an IntList, a LongList, we do it for all of the primitives and for most of them we code generate them. There are some special cases like Boolean, where you don’t necessarily want to code generate that.
That’s right, and so in your project you just import a JAR, it’s pretty simple.
I believe there is, because of the fact that we have so many features. Streams are great, they bring in all these functional concepts and allow you to use lambdas, so people are going to start becoming much more familiar with patterns like filter() and map(). But Java 8 Streams are just wrappers on the mutable collections that are already there, so by adopting a framework like GS Collections you get the extra types like MultiMaps and Bags and BiMaps, and you get the immutable containers. We also have more patterns, I think, than you’ll find elsewhere. Everything in our hierarchy starts at the top with RichIterable and we have a lot of iteration patterns; I think it’s almost a hundred methods on there at this point, in various flavors.
Sure, well we have the standard ones like select() and collect(), which are like filter() and map(), and we have multiple-argument forms of them, so we have selectWith(), which takes a two-argument block instead of a one-argument predicate. Besides that we build up and we have more and more complicated and special-purpose algorithms, so we have groupBy() for example, which will give you a MultiMap, and then one of the most complicated or advanced is aggregateBy(), which will do something like grouping into a MultiMap but at the same time it will aggregate the values. You can achieve a lot of the same thing with the Streams API, but you wind up having to create complicated, advanced collectors by chaining together several collectors. It’s not impossible, but I like the approach of having more methods on the API because in your IDE you can just type a dot, do your keyboard shortcut for auto-complete and see all your choices right there.
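For readers who have not seen chained collectors, here is a sketch of the Streams side of that comparison, using only JDK 8 classes. The class and method names are invented for the example, and the GS Collections calls are deliberately not shown. The first method groups values under a key; the second groups and aggregates in one pass by nesting a downstream collector inside groupingBy().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative names; only JDK 8 Streams APIs are used.
final class GroupAndAggregate {
    // Grouping only: each key maps to the list of words with that length.
    static Map<Integer, List<String>> groupByLength(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(String::length));
    }

    // Grouping plus aggregation: a downstream collector sums the lengths
    // of the words that share a first letter.
    static Map<Character, Integer> totalLengthByFirstLetter(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(
                        word -> word.charAt(0),
                        Collectors.summingInt(String::length)));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana", "cherry");
        System.out.println(groupByLength(words));
        System.out.println(totalLengthByFirstLetter(words));
    }
}
```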
Right, so we have a Parallel-Lazy API which I presented on at QCon this week. It’s still beta; we are still filling it out and adding it to all our collections. It competes favorably, I think, with the parallel collections in Scala as well as the Parallel Streams in Java 8. They all have a single inspiration, which is that parallel computation is a little bit hard if you are using thread pools and locks. A lot of the time when you are doing parallel computation, data-level parallelism, you are starting with a collection anyway, which you are going to batch up and do an iteration pattern on, so you might as well provide this API right on the collections. So all three libraries provide that, and the talk that I gave this week was really diving into the performance of all three. It wasn’t a talk on learning these APIs; it was more about how well they perform and what the differences are between implementations, like the Fork-Join approach versus batching and things of that nature.
We use JMH, the Java Microbenchmark Harness, which was written at Oracle and is a very nice framework for doing performance tests. For many of our tests GS Collections performed the best. You have to take that with a grain of salt because they are my tests, and there is a natural inclination to have a bias in the tests, but I really wrote these not knowing exactly what I would find. For a lot of the tests GS Collections scaled very well; for some of them it scaled linearly with the number of cores. I’d like to see how these tests run on machines with even more cores, so I hope people will download our tests and run them on their own machines. There were some interesting tests where we don’t perform so well. In the aggregation tests it matters a lot how many unique keys you wind up with in the Map at the end: if it’s a very small number of keys, a very high collapse factor, then the contention in our ConcurrentMap winds up dominating and we don’t perform so well. On the other hand, if there are a lot of unique keys then Java 8 Streams don’t perform so well, because they have to do a Fork-Join merge step at the end and the merging of the Maps becomes very expensive.
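The two aggregation strategies can be seen side by side within the JDK itself. The sketch below is not the benchmark code from the talk and does not use GS Collections; the class name, method names and the keyCount parameter are assumptions made for the example. groupingBy() on a parallel stream builds partial maps and merges them, while groupingByConcurrent() has all threads update one shared ConcurrentMap, which is where contention shows up when there are few distinct keys.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrative names; not the benchmark code discussed in the interview.
final class ParallelAggregation {
    // Partial maps are built per task and merged as the fork-join tasks join.
    // The merge cost grows with the number of distinct keys.
    static Map<Integer, Long> countByMerging(List<Integer> values, int keyCount) {
        return values.parallelStream()
                .collect(Collectors.groupingBy(
                        value -> value % keyCount,
                        Collectors.counting()));
    }

    // All threads accumulate into one shared ConcurrentMap, so there is no
    // merge step, but threads contend when there are only a few keys.
    static ConcurrentMap<Integer, Long> countConcurrently(List<Integer> values, int keyCount) {
        return values.parallelStream()
                .collect(Collectors.groupingByConcurrent(
                        value -> value % keyCount,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Integer> values = IntStream.range(0, 100_000)
                .boxed()
                .collect(Collectors.toList());

        // Both strategies produce the same counts; they differ in how they scale.
        System.out.println(countByMerging(values, 4).equals(countConcurrently(values, 4)));
    }
}
```

This only shows that the two approaches agree on the result. Measuring which one is faster for a given number of keys needs a proper harness such as JMH rather than a main() method.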
The whole framework along with the tests is open sourced on GitHub, so if you go to https://github.com/GoldmanSachs/ you’ll find the library; alongside the library there are other modules with unit tests, performance tests and memory tests. Then we have a second open-source project, the GS Collections Kata, which is the training material that we use internally for learning GS Collections. It is organized as a series of unit tests that fail when you check them out, and there are some materials to read, and as you go through it you change the tests to pass, so it’s a kind of fun way to learn GS Collections.
Sure, right now the most recent version is 5.1 and it’s in Maven Central along with the last few versions. Recently there were some interesting articles released by Spring about the performance of GS Collections and the fact that they’ve chosen to adopt it in the Spring Reactor project. Spring Reactor is an implementation of the LMAX Disruptor, and there were a few places where they found nice use cases for some very specific collections in our framework that are for higher performance; very specific things like our SynchronizedPutMultiMap. So it’s interesting that they are one of the first popular users, and we’ve seen our usage in Maven increase a lot now that we are a transitive dependency of a Spring library.
16. So you said that there is the Goldman Sachs Kata on GitHub as well as the actual code itself, but what happens if people want to ask more questions about it or find out more information about Goldman Sachs Collections?
If you want to find out more info we have a lot of materials on our GitHub Wiki including a lot of presentations we’ve given and feel free to ask questions on StackOverflow, we watch for the questions tagged with gs-collections and we answer them.
Alex: Craig Motlin, thank you very much.
*DISCLAIMER: Alex Blewitt works at Goldman Sachs along with Craig Motlin.*