Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Interviews Craig Motlin Talks to InfoQ about the Origins and Benefits of GS Collections

Craig Motlin Talks to InfoQ about the Origins and Benefits of GS Collections


1. Hi, I’m here with Craig Motlin at QCon New York 2014. Craig is the Technical Lead of GS Collections and a Software Engineer at Goldman Sachs. I wonder if you can start off Craig, by telling us a little bit about where GS Collections came from and how long you’ve been involved with it?

Sure, thanks Alex. GS Collections started many years ago, I think it was in 2006; it started as an internal project that just had a few optimized collections, as most frameworks start out to solve a specific smaller problem. The problem was back then Java heaps were a lot smaller, we didn’t have 64 bits yet and we were wasting a lot of memory on very small lists. If you have an ArrayList as a field in another object and you have a lot of those, then the default sizes (ten's a lot) and you can waste a lot of space if they are all just one or two elements. So we started with optimized memory efficient collections and over the years has grown since there. I’ve been involved with the project since about 2008 and we open sourced it in 2012.


2. That’s an interesting for an investment bank to open source some of its software, can you give us a bit of background as to how that happened?

Sure, well we have a number of frameworks that are internal. This one started out as an internal framework obviously for a number of years and with this one we realized that besides all of the regular benefits of open sourcing anything that people open source, that we had the opportunity to potentially affect future versions of Java, specifically Java 8, and we have several Tech Fellows who participate in Java Expert Groups on the Java Executive Committee and open sourcing GS Collections was a more direct way to participate by proving our credibility and our expertise in the area, but also by giving a concrete implementation of the sort of thing that we would like to see when lambda comes out in the Collections Library.


3. So did GS Collections have any influence on Java 8 Collections?

I believe it did have an indirect influence. One project that comes to mind is actually JodaTime and the path that it went through to eventually become part of Java. I would like to see the same thing happen for GS Collections in the future, but right now we have the Streams API and in an indirect way, it took some small inspiration from GS Collections as well as other collections libraries. And in addition one specific thing that went into the Streams API is the decision to separate some of the API on to a separate wrapper interface like Stream as opposed to putting all the API on Collection itself, and I think that was partially due to our involvement. One nice thing about separating it, is that the Streams have a totally read-only API. The mutating methods, including the new ones like removeIf() are all directly on Collection, and one observation was that too much new API going on to Collection could break other libraries out there, so you have Hibernate who have their own lists, and if we had our own lists, if you add to many new methods right on the Collection you risk breaking third party libraries.


4. So how does GS Collections compare both to the collections Libraries in Java 8, but also to other existing open source ones like Guava?

So if I could draw a venn diagram of features, there are many collections frameworks out there and they overlap a lot, so a lot of the features that are there in Streams we have and vice versa. I think that one of the reasons the landscape is so fragmented, is because the collections that have been in Java up to version 7 are very bare-bones collections and so there are many frameworks out there that build on top of them. So we saw internally a lot of people using libraries like Trove in order to get memory-efficient primitive collections and we have that as well in GS Collections now, and a lot of the functional API that you find in patterns like filter() and map(), we have although to use the Smalltalk meta-names like select() and collect(). One of the biggest differences between our philosophy and Guava’s, is that Guava is deliberately a supplement to what’s built-in, there are no such things as MultiMaps or Bags (called MultiSets as well) in the Java Collections Framework, so Guava adds them; but they don’t attempt to re-implement anything that’s already built-in. GS Collections is a supplement as well but it can be a complete replacement; we have our own Lists, Sets and Maps, and when we implement higher level containers like MultiMaps we are basing them on our own and we’ve done that for full control and also for the ability to optimize every collection, so our Sets and Maps are optimized for memory and therefore the MultiMaps that we have and the Bags, they are also are going to benefit from that.


5. So how does GS Collections interoperate with the standard Java Libraries?

On the mutable side of our collections hierarchy, our MutableList, our MutableSet, and our MutableMap, inherit or implement I should say, the java.util interfaces that you'd expect. The immutable collections you have an extra step if you want to interoperate because on our immutable collections we deliberately do not want methods like add() and remove(), so it’s not just runtime immutability we want it contractually as well, so the best interop[ability] is on the mutable side.


6. So why do you have Immutable and mutable collections then?

Partially for the mental model, the readability benefit, you know contractually that this thing is really not going to mutate, all of the benefits of thread-safety for immutable collections as well as the memory savings. So an ImmutableList if it’s Array-backed it’s always going to be trimmed; if it’s small enough we are going to give you a specific container for the exact size, like a List of size three, it has three fields, no backing array.


7. So how many different types of specialized Lists and Maps are there for small values?

It varies based on the types, so List I think we go up to eleven if you count the zero indexed one, for Sets and Maps I think we have three or four. There comes a point where there is a drop-off in benefit and there's just too many classes to write or generate.


8. So are all of these hand written classes or do you use code generation to be able to create these?

The small memory efficient ones, we actually did write by hand. When we get to the area or primitive collections though it would be too much to write by hand, they are very repetitive and that it’s where we leverage code generation. So we have things like an IntList, a LongList, we do it for all of the primitives and for most of them we code generate them. There's some special cases like Boolean, you don’t necessarily want to code generate that.


9. And this code generation happens in compile time rather at runtime?

That’s right, and so in your project you just import a JAR, it’s pretty simple.


10. So now that we’ve got Streams with Java 8, do you think that there is still benefit in using GS Collections?

I believe there is because of the fact that we have so many features. Streams are great, they bring in all these functional concepts and allow you to use lambdas, so people are going to start becoming much more familiar with patterns like filter() and map(). But Java 8 Streams are just wrappers on the mutable collections that are already there, so by adopting a framework like GS Collections you get the extra types like MultiMaps and Bags and BiMaps, and you get the Immutable containers. We also have more patterns I think than you’ll find. Everything in our hierarchy starts at the top with RichIterable and we have a lot of iteration patterns, I think it’s almost a hundred at this point methods on there in various flavors.


11. So give us an example of how you could iterate through a List in different ways?

Sure, well we have the standard ones like select() and collect() which are like filter() and map(), and we'll have multiple arguments forms of them, so we'll have selectWith(), it will take a two-argument block instead of one-argument predicate. Besides that we build up and we have more and more complicated and special purpose algorithms, so we have groupBy() for example which will give you MultiMap and then one of the most complicated or advanced is aggregateBy() which will do something like grouping into a MultiMap but at the same time it will aggregate the values, and you can achieve a lot of the same thing with the Streams API but you wind up having to create complicated advanced collectors by chaining together several collectors. It’s not impossible but I like the approach of having more methods on the API because you can just – in your IDE do dot, do your keyboard shortcut for the auto-complete and see all your choices right there.


12. It sounds like this might be quite a lot of expensive computation if it’s single threaded?

Right, so we have a Parallel-Lazy API which I presented on at QCon this week, it’s beta still, we are still filling it out and adding it to all our collections. It competes, I think favorably with the collections in Scala, the parallel collections in Scala as well as the Parallel Streams in Java 8. They all have a singular inspiration which is parallel computation is a little bit hard if you are using Thread pools and locks. A lot of your time when you are doing parallel computation, data-level parallelism, you are starting with a collection anyway, which you are going to batch up and do an iteration pattern on, so you might as well provide this API right on the collections. So all three libraries provide that and my talk that I gave this week was really diving into the performance of all three, so it wasn’t a talk on learning these APIs, it was more how well do they perform and what’s the difference in the implementations of like the Fork-Join approach versus batching and things of that nature.


13. So how did you do the performance testing and where does GS Collections fit in with the results?

We use JMH, Java Microbenchmark Harness Framework which was written at Oracle and it’s a very nice framework for doing performance tests. For many of our tests GS Collections performed the best. You have to take that with a grain of salt because they are my tests, and there is a natural inclination to have a bias in the test, but I really wrote these not knowing exactly what I would find, and for a lot of the tests GS Collections scaled very well, for some of them it was linearly with the number of cores. I’d like to see how these tests run on machines with even more cores so I hope people will download our tests and run them on their own machines. There were some interesting tests where we don’t perform so well, so in the aggregation tests it matters a lot how many unique keys you wind up in the Map at the end, and if it’s a very small number of keys, a very high collapse factor, then the contention in our ConcurrentMap winds up dominating and we don’t perform so well. But on the other hand if there are a lot of unique keys then Java 8 Streams don’t perform so well because they have to do a Fork-Join merge step at the end and the merging of the Maps becomes very expensive.


14. And where can people get the test code from if they want to re-do this themselves?

The whole framework along with the tests is open sourced on GitHub, so if you go to you’ll find the library; other modules along with the library are unit tests and performance tests and memory tests; and then we have a second project open sourced which is the GS Collections Kata which is the training materials that we use internally for learning GS Collections. It is organized as a series of unit tests that fail when you check them out and there's some materials to read and as you go through it you can change the test to pass, so it’s a kind of fun way to learn GS Collections.


15. So if people want to get hold of the binaries to be able to use them, where can they get them from and is anyone else using them outside of Goldman Sachs at the moment?

Sure, right now the most recent version is 5.1 and it’s in Maven Central along with the last few versions, and recently there were some interesting articles released by Spring about the performance of GS Collections and the fact that they’ve chosen to adopt it in the Spring reactor project. So the Spring reactor is an implementation of the LMAX Disruptor and there were few places where they found nice use-cases for some very specific collections in our framework that are for higher performance; very specific things like our SynchronizedPutMultiMap. So it’s interesting that they are one of the first popular users and we’ve see our usage in Maven increase a lot, now that we are a transitive dependency of a Spring library.


16. So you said that there is the Goldman Sachs Kata on GitHub on the actual code itself, but what’s happens if people wants to ask more questions about it or find out more information about Goldman Sachs Collections?

If you want to find out more info we have a lot of materials on our GitHub Wiki including a lot of presentations we’ve given and feel free to ask questions on StackOverflow, we watch for the questions tagged with gs-collections and we answer them.

Alex: Craig Motlin, thank you very much.

Thanks Alex.

*DISCLAIMER: Alex Blewitt works at Goldman Sachs along with Craig Motlin.*

Sep 01, 2014