
Trisha Gee on MongoDB, Java 8, and What Excites Her About Writing Software
Interview with Trisha Gee by Charles Humble on Apr 05, 2014

Bio: Trisha is a developer at MongoDB. She has expertise in high-performance Java systems, is passionate about enabling developer productivity, and has a wide breadth of industry experience from the 12 years she's been a professional developer.


   

1. I am Charles Humble. I am here at QCon London with Trisha Gee from Mongo. Trisha, can you introduce yourself to the InfoQ community?

Hello, I am Trisha Gee from Mongo. I work on the Java driver for MongoDB and I have been involved in Java development for the last 15 years or so, but doing the sort of 'evangelizing Java' stuff for the last couple of years; so probably since we last spoke actually, two years ago.

   

2. Yes; so when we last spoke, you were in London and you were very heavily involved with the London Java Community and I know now you have moved to Spain. So what has that been like? Is there a big community there?

Well, it is much more embryonic. In London obviously there are loads and loads of developers. When I moved back to London about five years ago, the London Java Community was just starting to ramp up; we were very early and young in those days, and we have seen the power of what that can do over time: having influence on Date and Time for Java 8, speaking at conferences, writing books, doing these things. I was really keen to take what we learned - not just me, but the other person I moved over there with; we were both really keen to take what we have learned in the London Java Community and start that in Spain. In particular, Spain has a slight problem with its unemployment rates at the moment and the economy is in a place where it could be better, and what we really wanted to do is inject some energy into the tech scene there, because it seems like a great place to start in order to get jobs running and to try to get this spirit of entrepreneurship going.

So having a community where people can talk to each other, find out what they are doing, learn about new stuff and learn from some of the lessons in places like London seemed like a really good thing to do. So we started a Java community in Seville and a Mongo community in Seville, both of those only in the last month or two, and we have been quite pleased with the uptake. The first Mongo user group had about 25 people turn up, which doesn't sound like much next to the 3,000 London Java people, but it is pretty big for a community on its first day, and similarly with the Java community. What we've found is that loads of people there are really passionate about learning technology and they want to come and hear about this stuff because there is so much stuff out there. You cannot necessarily go and find things on your own. You need to have a place where they are going to tell you: "This is valid" and "This is important right now" and "Don't worry too much about that". And people want to meet each other and share stories. So, that is quite exciting.

   

3. I want to get into some of the details around the work you have done on the Java driver, but to set that in context I think it is probably worth having a quick rundown on MongoDB for anyone who is not that familiar with the technology. So, can you briefly tell us what MongoDB is?

MongoDB is one of these newfangled NoSQL databases, and what I certainly did not realize before I started working with Mongo is that NoSQL is this massive broad term which just basically means not relational. And it can be any number of things. It can be a document database - which is what MongoDB is - it can be a graph database like Neo4j, or it can be a key value store or a column-family store. So there are all these different ways which not only store the data differently, but there is also a different trade-off. According to the CAP theorem you cannot have all three of the 'CAP' - the C, A and P [consistency, availability and partition tolerance] - you can't have all three. So each of the NoSQL databases tends to focus a little bit more or less on one or two combinations of those things.

MongoDB focuses on strong consistency; it's a document database that focuses on strong consistency. It has high availability via replica sets - so you can have multiple servers running, with a primary and multiple secondaries, and you get this kind of automatic failover really, really easily. Also, it can support horizontal scaling for 'big data' stuff with sharding. So, again, you have multiple machines. A lot of NoSQL databases are designed for these multiple machines and commodity hardware: just spreading your data across all these machines and having the database and the drivers figure out where to put your data and how to get it back.

   

5. In terms of sharding, what methods can I use to shard data?

So you will pick a shard key - say you often access your customer records by customer name, so you might want to shard based on surname. Then it will put A through D over here, E through M over here and N through Z over here, and MongoDB will automatically work out the balance of those names. So you do not need to say, "I want A through D over here"; it will just say, "OK, I've got three shards, you are going to shard on surname, so I am going to work out a more or less even split between those different names and where they should belong".
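The range splitting she describes can be sketched in a few lines of plain Java. This is a conceptual model only - the shard names and range boundaries below are invented, and in a real deployment the config servers and balancer maintain and rebalance these chunk ranges automatically:

```java
import java.util.TreeMap;

// Conceptual sketch of range-based shard routing on a "surname" shard key.
public class ShardRouter {
    // Maps the lowest surname each shard covers to that shard's name.
    private final TreeMap<String, String> ranges = new TreeMap<>();

    public ShardRouter() {
        ranges.put("A", "shard1"); // A through D
        ranges.put("E", "shard2"); // E through M
        ranges.put("N", "shard3"); // N through Z
    }

    // Route a document to the shard whose range contains its surname.
    public String shardFor(String surname) {
        return ranges.floorEntry(surname.toUpperCase()).getValue();
    }
}
```

The `TreeMap.floorEntry` lookup finds the highest range boundary at or below the surname, which is exactly the "which chunk does this key fall into" question a router answers.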

   

6. What are the documents stored as?

Well, we call it Binary JSON, but it looks like JSON. The documents are much more structured than rows and columns in a relational database. With rows and columns you think of something like Excel, so you have all this square data, right? When you are talking about a document database, it is not Microsoft Word versus Microsoft Excel; it is more like a JSON document. So, your customer might have "name is this" and "age is this", but then you can have embedded sub-documents. So, I can have an embedded address object, and it can also support things like arrays. And arrays are really interesting because we couldn't do that so easily with relational databases. So, for an array I might be able to push order IDs onto an array for a customer, or I might be able to use it to do some interesting stack-type stuff. You can treat it as a list or treat it like a stack. So, you can do some really interesting stuff with something which looks much closer to your domain object. So I am storing my customer object, which is more complex than a simple set of key value pairs - it's got maybe embedded documents and maybe simple collections - and I can store that whole customer object in a MongoDB collection without worrying about it being spread all over the place in different tables. So it maps much closer to the way we, as developers, tend to think about our domain objects.
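As a sketch of the shape she is describing, a customer document with an embedded sub-document and an array might look like this. The field names and values are invented for the example; in the mongo shell you could insert such a document with something like `db.customers.insert(customer)`:

```javascript
// A customer document: top-level fields, an embedded sub-document, and an array.
var customer = {
  name: "Jane Smith",
  age: 42,
  address: {                 // embedded sub-document, not a join to another table
    street: "1 High Street",
    city: "London"
  },
  orderIds: [1001, 1002]     // array: push order IDs on, or treat it like a stack
};

// Stack-style manipulation of the array, as mentioned above
customer.orderIds.push(1003);
```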

   

8. So you have been working on the Java driver for a while. Is it just you doing that?

No, that would be unfortunate, since I travel so much that I do not get as much time to work on it as I want. It is me and a guy called Jeff Yemin; he is the lead on the Java driver. We also have Justin Lee, who joined us fairly recently - summer last year. He is working a bit on the Java driver with us and also on a project called Morphia, which is our object-document mapper that will turn your Java objects magically into Mongo objects.

   

9. What state was the code in when you took it on?

It is an open source project; it is a four- or five-year-old open source project, so you can imagine. It is a code base which is four or five years old, it has had a number of owners - and I use the word "owner" loosely; it has had a number of contributors. So you end up with a mix of different styles. The kernel is written in C++, so the driver is quite influenced by a C++ style. It does not support things that we have come to expect from modern Java, like generics for example. So, it is a very low level driver - it is meant to be a very low level driver - but there is definitely room for improvement in terms of making it look more modern and making it more intuitive.

The thing I've discovered, having tried to work on it and to create the new driver - what I didn't realize - is that if you're working on a library and you have an API, that API is your user interface. Your users are developers; they are not customers clicking on things, but you still need to think about usability. You need to think about, "When I do Ctrl-Space I want to see the expected list of methods that I can call. If I do a find and then do Ctrl-Space I want to be able to see limit or count". I want my IDE to prompt me with what I can do, and so we wanted to start thinking about the driver from a user's point of view and what the API looks like from a user's point of view, instead of simply reflecting what MongoDB does out to the outside world and saying, "This is MongoDB. Use it!"

   

10. That is interesting, because I would have naively assumed that what the driver does is basically just take Java objects and serialize them into Mongo objects and vice-versa, right? So obviously there is a bit more to it than that. Can you describe maybe the architecture for us a bit?

Sure. I think I was in exactly the same place as you, and I think everyone who started on the Java driver has had exactly the same thought: it is a serializer - it takes your Java objects and serializes them into BSON so that MongoDB can understand the BSON. So you sort of think of it as a fairly small serialization layer, but it does a little bit more than that. I mentioned that Mongo has multiple servers; it might be talking to a sharded system or it might be talking to replica sets. So the driver needs to do more complex things like, obviously things like connection pooling - you would sort of expect that, but it also needs to do things like figure out which server to talk to. If I am reading, I could be allowed to read from one of these secondaries.

So, the driver needs to work out: for reads, you've set a secondary preference, so I can go to this server over here; and for writes we have this thing called "write concerns". So I can say, "I want to write the data to the database". Of course I want to write the data to the database, but how much do I need that data to replicate before it returns to say, "Yes, that was definitely a successful write"? Am I just going to fire and forget? Am I going to say it has to replicate to both of my secondaries? Am I going to say, "Actually, my secondaries are spread across multiple data centers; it needs to have made it to another data center before you return"? So the driver needs to be able to take all of your preferences on how much you love your data and what you want to do with it, and then figure out which physical servers to connect to and how to codify that into the protocol. So I was kind of surprised, because there is a lot more to the domain of the driver than I really expected.
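The acknowledgement rule behind write concerns can be modelled in a few lines of plain Java. This is a conceptual sketch of the rule only, not the driver's real `WriteConcern` class; the class and method names are invented:

```java
// Models when a write is considered acknowledged under a given write concern.
// w <= 0 is fire-and-forget; w = 1 means the primary alone; w = n waits for n servers.
public class WriteConcernCheck {

    public static boolean acknowledged(int w, int serversWithWrite) {
        if (w <= 0) {
            return true;                  // fire and forget: never wait
        }
        return serversWithWrite >= w;     // wait until w servers report the write
    }

    // A "majority" write concern: more than half the replica set must have the write.
    public static boolean majorityAcknowledged(int replicaSetSize, int serversWithWrite) {
        return serversWithWrite > replicaSetSize / 2;
    }
}
```

The interesting design point is that the choice of `w` is the application's statement of how much it "loves its data": higher values trade write latency for durability.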

So it turns out that we have this serialization layer, obviously, which sort of serializes effectively just primitives to BSON primitives. So things like longs, and ints, and strings and so forth and this includes things like arrays. Also, it can speak the protocol so it knows what an insert message looks like, it knows what a query response looks like. So this is the kind of BSON layer, if you like.

Charles: BSON being Binary JSON.

Binary JSON, yes. That is both the name of the format that we store it in and of the protocol itself. So the lower layer is called BSON because it deals with the BSON-level stuff; it serializes primitives, and it can speak the protocol. BSON is a spec in its own right; it does not have to be tied to MongoDB. On top of that we have the, rather unimaginatively named, core layer, and this is the key stuff that the Java driver does: the connection pooling, figuring out which servers to talk to, pinging servers to see if they are still alive. This is the MongoDB-specific stuff: knowing about replica sets, knowing about sharding, knowing how your servers are going to respond to certain things. This core layer does operations. You can call an insert operation and it knows where to send that insert operation, and then it tells the BSON layer to do the serialization bit.

Really, the stuff that we thought was the exciting bit - the API, the user interface - is actually a very thin layer on top of this. The current user interface has names like DB for the database, DBCollection for the collection, and documents are represented by a class called DBObject - I never quite understood why it is not called a document - but this is the API level. In the new world, this very thin layer will have names which map more closely to the domain, so a document will be represented by a class called Document, and so on and so forth. But because we need to support backwards compatibility - because we have loads of people out there using the existing driver, including big organizations that do not want to have to download a new driver and rewrite all their code, and why should they? - we want you to be able to plug either API onto the front: the old API, so that you can use the new driver with the new architecture and the new functionality (which will also support future new server features), or this new API which we are trying to make more usable and more user friendly for our Java users.

This has a really interesting side-effect as well. We also have a guy, Ross, who is working on a Scala driver, and what he is working on at the moment is not a full end-to-end Scala driver - and why should it be, because the Java driver is the JVM driver and it runs on the JVM. We can have the Scala driver represented as a Scala API, but under the covers we only have to do the heavy-lifting work of the serialization and all the server discovery stuff once, in the Java driver, and we know we've got it right. You can pop the Scala API on the front, or any other different APIs onto the front, if you want to.

Charles: So presumably things like Morphia, which you mentioned before, or Spring Data, say, do the same kind of thing; sit on top of the JVM driver.

Exactly. So Spring Data does not have to necessarily turn all of its stuff into something which forces through the API. It can just plug straight into these operations that it wants to use. So we do not have to transform stuff from Spring-shaped stuff to Mongo Java driver-shaped stuff to BSON-shaped stuff; that is kind of a bit ridiculous. So we just have them plug in at a lower level. That's the theory.

   

11. Tell me about the design process you went through. I am guessing you did not sit down and draw lots of UML diagrams and write a long detailed design document. But how did you approach it?

We did a little bit of that, surprisingly. We did a lot of brainstorming as to what was wrong with the current driver. We had a lot of things that we wanted to move it towards: we definitely want it to be very thread-safe, very easy for people to use in a concurrent fashion; we want it to move towards immutability; we want it to be more usable. So we brainstormed a bunch of stuff that we thought was more useful, and at the end of that we had a set of design goals. So every time we are working on something - even if it is just a single feature - for example, when we were working on what the new collection class should look like and how we should do inserts, and we were trying to decide, "Should the insert thing look like this? Should you override the methods or should you provide some sort of stream-type API?", we could look at the design goals we had. I can't remember exactly what the goals were now, but "it has to be simpler", "the exception handling has to be obvious", all these different things; we had about six key design goals. When we were trying to make decisions, we'd just measure against those design goals and go, "This is simpler for the users" or "It is more intuitive for them when they use their IDE to do auto-completion", so therefore this seems like the right way to go.

But we were iteratively working through the code and seeing what it was going to look like, and we also sometimes generated some UML off the back of that code to see, "Does this seem sane?". The Java driver is probably the oldest of all the drivers, but there are loads of other language drivers that we support, and some of them have already solved some of the problems that we have not fixed yet. So we want to be able to include those guys in the conversations and say, "How do you do connection pooling? What problems do you see? What things have you solved?" And then we could talk to them with our diagrams, or with our documents, or with our goals, and say to the Ruby guys, "What do you do with connection pooling?" or "How do you handle errors under these circumstances?"

The other really interesting thing was async, because we do not support async kind of operations in the Java driver and we want to, but Java is, well it's not really late to the async game, but it is not really designed ... We do not use Java normally in a way which is async friendly, but by talking to other languages like the Node.js guys and the Python guys who have been doing async type of stuff for ages, we get some really good ideas on the direction to take the driver in.

So, we did not do it alone. We are one of the first drivers to try and do this rewrite to a shinier, better, cleaner world. So we are kind of forging ahead, but we are really using a lot of the lessons learned from the other guys to try and take it in the right direction.

   

12. Tell me about your development environment. What IDE are you using? What are you using to test?

I am using IntelliJ because it's awesome. And I learned a lot of tricks and tips when I worked at LMAX about really productive ways to use IntelliJ. I was really pleased when I joined Mongo and Jeff was using IntelliJ, and then when Justin joined us, he was using IntelliJ, and yesterday when I was coding with Ross, who is fairly new to Scala, and I was showing him some tips and tricks on IntelliJ, he was like, "This is amazing". So, we are totally sold on IntelliJ.

For testing - we weren't using JUnit before; we were using the other one, which I can't remember.

Charles: TestNG?

TestNG; we were using TestNG before. I shifted us towards JUnit, but that might have been a naïve choice. It was more because I felt more comfortable using Hamcrest matchers with it; I thought we could do more with the way that JUnit worked.

But last year I discovered Spock, and it is this amazing behaviour-driven development framework written in Groovy. I was a little bit resistant to begin with because I did not want to learn a new language; I did not want to write tests in a different language to the language I actually write code in; and it is an open source project, so I want other people to find it easy to read the tests and to write new tests. But Spock has this really great way of forcing you to think in a BDD way. In order to define the test you have to write "given", "when" and "then" in your test. So you have to think about which bits are set-up, which bit is really under test, and what my genuine assertions are, and so it forces you to think in that way. That is really nice.

It also supports this really great thing for when you might be testing one method, but you want to test multiple inputs and see what all the outputs are. In JUnit you would have to write like 20 tests, each one with a new input and a new expectation - that is the right way to do it, but it tends to get a bit verbose and people do not want to do that. And of course, with something like the Java driver, or a driver for anything, what you should be testing is lots of different types of inputs against lots of different types of outputs. So you want to have lots of these types of tests, and in Spock they have data-driven testing. You give it a table of expected inputs and expected outputs and then you can run that - it is very succinct, in one method, but it will run that one method with 20 different inputs and you can see which one fails, but it will run all of them. It is really, really awesome. I am just massively in love with Spock.
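The given/when/then structure and the data tables she describes look roughly like this in Spock. This is a sketch with invented specifications against JDK classes; real driver specs would of course exercise driver code:

```groovy
import spock.lang.Specification

class ExampleSpec extends Specification {

    def "popping an empty stack throws"() {
        given: 'an empty stack'
        def stack = new Stack()

        when: 'we pop it'
        stack.pop()

        then: 'Spock lets us assert on the exception'
        thrown(EmptyStackException)
    }

    def "Math.max picks the larger of two ints"() {
        expect: 'the larger input wins'
        Math.max(a, b) == expected

        where: 'each row below runs as its own test case'
        a  | b  | expected
        1  | 7  | 7
        42 | 3  | 42
        -5 | -9 | -5
    }
}
```

Each row in the `where:` table is run and reported individually, so one failing input does not stop the others from running - the data-driven property she highlights.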

Charles: What are you using for build?

Gradle. Again, that was sort of my fault. We didn't really have a dependency management thing before. We try not to have dependencies in the Java driver, so that when you download the Java driver you don't have to download the whole of the internet with it; you just download the Java driver. But we want some dependencies for testing - we've got Spock and we've got Groovy and we've got JUnit - so having something which can manage dependencies and the build, and just do it for us, seemed sensible.

I spent about a day trying to get Maven to work, and I have used Maven before so I am not coming to it completely fresh, and I could not get it to work with Java 7, or something. Some combination of things just would not do what I wanted. So someone said, "You should try Gradle". And I said, “OK; I will try Gradle” and I got Gradle doing exactly what I wanted in about an hour and a half and I had never used it before, never programmed in Groovy before and it just worked and it is short; it's not XML, it's little bits of Groovy scripting. So that was my Groovy moment of “Ahh I think Groovy is amazing!”
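A minimal build.gradle along those lines might look like the following. The versions here are placeholders from roughly that era, not necessarily what the driver actually uses:

```groovy
apply plugin: 'java'

repositories {
    mavenCentral()
}

dependencies {
    // Test-only dependencies: the published driver jar itself stays dependency-free
    testCompile 'junit:junit:4.11'
    testCompile 'org.codehaus.groovy:groovy-all:2.2.1'
    testCompile 'org.spockframework:spock-core:0.7-groovy-2.0'
}
```

Because the Spock and Groovy dependencies live only in the `testCompile` configuration, they never leak into the compile-time dependencies of anyone who depends on the driver.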

   

13. So you like the language as well; you like Groovy the language?

I do. The other thing we do in the Java driver is this document storage thing: these JSON documents are effectively maps - sort of key value pairs - and they can be nested as far down as you want, really, which means that you are doing lots of map manipulation.

In Java it is not terribly pretty: you do new HashMap() and then you do .put() and then you do .get() and it is a bit verbose. I have never really worried too much about the verbosity of Java, because the IDE will do a lot of that for you, and sometimes if it looks a bit of a mess then maybe your model is not quite correct or maybe your design is not spot on.

But if you are talking about big JSON object manipulation, you are just going to have a lot of nested maps; and in Groovy it is just little square brackets and the syntax is much more succinct, and you can read it, instead of having all these keywords getting in the way. So it really aids readability, which is why ultimately I was sold on it for testing: although it is a new language to write the tests in, it is more succinct and arguably more readable even for people who are not used to Groovy. So I am totally sold on Groovy. I think it is amazing.
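For example, the nested-map shape of a customer document in Groovy literals (the field names are invented); this is a sketch of the succinctness she means, versus the equivalent chain of new HashMap(), .put() and .get() calls in Java:

```groovy
// Groovy map and list literals for the nested-map shape of a JSON document
def customer = [
    name   : 'Jane Smith',
    address: [street: '1 High Street', city: 'London'],  // nested map
    orders : [1001, 1002, 1003]                          // list
]

assert customer.address.city == 'London'   // property-style navigation, no .get() chains
assert customer.orders[0] == 1001
```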

   

14. Are there things coming in Java 8 that you are excited by?

Yes, definitely. I mentioned a bit earlier on that the LJC was quite involved in the new Date and Time spec. What was really needed for that was some people to really get on and do the testing, because you cannot release something that is core to the library without having the tests. A couple of guys from the LJC said, "Look, we will help with the testing side of stuff", and we ended up with Stephen Colebourne, who is the guy behind Joda-Time, and the guys from the LJC all working together and talking to Oracle, and we actually managed to get the new shiny Date and Time thing into Java 8. It looks quite a lot like Joda-Time but with even more stuff. It is suitable for more internationalization and for everything; it turns out that - who knew? - date and time is kind of tricky.

Charles: Given how long it has taken Java to get a workable calendar, I guess that is kind of obvious!

Exactly. It sounds so simple. It is like, "Finally, after - I don't know - 20 years, Java gets date and time". And it is making our life a little bit difficult because now we, as library developers, need to figure out how to support java.util.Date, Joda-Time - because people have been using that for ages, since java.util.Date was always a little bit sub-optimal - and the new date and time. But I think it is going to make life a lot easier for a lot of developers. I think that is very exciting from a London point of view.
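A small taste of the resulting java.time API; the wrapper class below is mine, but LocalDate, ZonedDateTime and ZoneId are the real JSR-310 types that shipped in Java 8:

```java
import java.time.LocalDate;
import java.time.Month;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// The java.time types are immutable values with fluent factory methods,
// in contrast to the mutable java.util.Date / Calendar pair.
public class DateTimeExample {

    public static LocalDate conferenceStart() {
        return LocalDate.of(2014, Month.MARCH, 3);
    }

    // Time-zone conversion is first-class: same instant, different wall-clock time.
    public static ZonedDateTime inNewYork(ZonedDateTime london) {
        return london.withZoneSameInstant(ZoneId.of("America/New_York"));
    }
}
```

Because the types are immutable, methods like `plusDays` return new values rather than mutating in place, which sidesteps a whole class of Calendar bugs.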

From a developer point of view I think lambdas are just going to be really interesting. Every other language in the world has lambdas and we are just about to get them, and I think that it will force us to think a little bit differently. So, for example, most of what we do when we are dealing with, let's say, lists or collections of any kind is: we have a list and then we iterate over it and we do stuff to it. So you cannot always do "tell, don't ask" type programming. You get a list and you cannot say to the list, "Go ahead and filter yourself", at the moment. We get the list and then we have to iterate through it and say, "If you are one of these, then I want you. If you are one of those, I do not want you", and it makes our code a little bit clumsy - all this external iteration.

With lambdas and the new Collections API you will be able to do a bit more of "Here is a list. Now go and filter all this stuff out" or "Go and do map-reduce on yourself". It means that you can do a lot more "tell, don't ask": tell the list to do something with this lambda and then, when it is done, I get the results. I think that this style of programming is going to feel a little bit backwards compared to what we are used to, but I think it is going to simplify the code, so that when you are looking at your domain object and your domain logic, you can say, "OK, I got this object back from somewhere - say MongoDB - and then I just told it to filter out the stuff I do not care about and I got the right object back", instead of all of that logic being in your domain, which just makes no sense at all. So I think that is going to be really interesting.
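The internal-iteration style she describes looks like this with Java 8 streams. The helper class and data are invented for the example; `stream()`, `filter()` and `collect()` are the real java.util.stream API:

```java
import java.util.List;
import java.util.stream.Collectors;

// "Tell, don't ask": instead of writing the for-loop and the if by hand,
// tell the list's stream to filter itself with a lambda.
public class FilterExample {

    public static List<String> startingWith(List<String> names, String prefix) {
        return names.stream()
                    .filter(name -> name.startsWith(prefix))  // internal iteration
                    .collect(Collectors.toList());
    }
}
```

The loop and the conditional disappear into the library; the caller states only the predicate, which is the "tell the list to filter itself" shift discussed above.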

   

15. Something that has been kicked around in the Java enterprise space for quite a while is whether we should have some kind of standardized support for NoSQL, and you have already said that NoSQL is a very broad term. But do you think the market is mature enough for some sort of standardization?

I think at the moment - no. When I was over here, and I was in the LJC, the LJC is quite involved with the JCP – the Java Community Process – and so we are quite involved in helping to shape the standards for Java, specifically going forwards; and before I was involved with that I would have said, “I do not care that much about standards. Just tell me what I need to know and I will just deal with it”. But being involved in trying to come up with standards, it becomes a bit clearer as to what some of the troubles are.

One of the things that we liked to see, the LJC liked to see, when someone proposed a standard, was that there was at least one working implementation of that standard, preferably two or three. So for example, you might have a standard of web services for Java – What does it look like? How do we implement this? or annotation based this, that and the other; JPA is a good example. So, if you are going to do JPA, Hibernate is an implementation of that stuff but there is more than one, so you could sort of say, “OK; This works and this does not work”. So, you need to have a couple of implementations to figure out what you mean by a standard and how it is going to look and can developers really work with it.

I do not think we are there yet with the NoSQL space. We have lots of implementations, not just of NoSQL but - for example, if you look just at the document database space - the way that we interact with document databases is not really standard across them. I think we should move in that direction, but I do not think we are there yet, and I do not think we have found - not tiny edge cases - the high level things that are consistently going to have to be done in a standard way, yet. That is my gut feeling, but maybe I am wrong. Maybe it is just because I think that it would make my life hard as a library developer; I am not sure. But I know that our drivers work very differently to the way that other document database drivers, for example, work. And then where do you slice it? Do you say, "This is the standard for document NoSQL databases, versus this is for graphs and this is for key value stores"? The more the NoSQL space expands, the more you see these overlapping things that do not necessarily fall into buckets. We have these very big gross buckets, but we have document databases which are also key value stores, and I do not think it has settled down enough to be able to say, "Right. We have this shape of stuff and we are going to standardize it that way". In my opinion!

   

16. What excites you about working in software?

I love the fact that it is creative and logical. We get told these stereotypes about programmers which I think are completely untrue. OK, we are a little bit OCD; we do like things the way we like them; but you have to be really creative to come up with a new solution to a problem, even if it is a new mathematical solution to a problem. You are not doing things that have been done before - otherwise we would automate it; otherwise we would have a tool to do that. I really like the fact that the computer only understands logical stuff, so you have to put stuff in there so that it gets it. My OCD likes slicing things down like that. But I like the fact that someone from the business will come up to you and say, "I need an app to order coffee". And you're like, "OK. Tell me what that would look like", and they're like, "I don't know. You are going to order a coffee and it is going to go to the coffee shop and then everyone lives happily ever after". You have to fill in those gaps; you have to fill in the gaps from the fluffy thing, ask the right questions, sit with them and try to figure out what it is they really want, and turn that into something really rigid that the computer understands. That is quite challenging! And you have to understand how people work to write good programs. Otherwise the users are not going to use it, the business is not going to like it, and other developers that come along and try to maintain it are not going to understand it. It is really important to get people. And we get told all the time that developers do not understand people, and it is nonsense.

Charles: Brilliant! I think that’s a lovely place to end it. Trisha, thank you very much indeed.

Thank you very much.
