Bio Jevgeni Kabanov is the founder and CEO of ZeroTurnaround, a development tools company that focuses on productivity. He wrote the first version of the ZeroTurnaround flagship product, JRebel. Jevgeni also started the first Java conference in Estonia, Geekout. Jevgeni is on the Expert Group for the JSR 342 (Java EE 7). He has started two open-source projects - Aranea and Squill.
QCon is a conference that is organized by the community, for the community. The result is a high quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community. QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.
Well, for starters, I am a carbon-based life form. I am still a PhD student and I still hope to finish my PhD, which is cool. I used to work at a consulting company, where I did R&D, and basically my job there was to solve the hardest problems of the company. One of the problems I solved, which was not my direct task but which I solved anyway, was the incredible inefficiency that Java development can entail. So I focused on one particular issue, and it led to the JRebel product: that is, obviously, the time it takes to bounce the application after you edit it, so the building and redeploying. That is kind of where it started. That was back in 2007; we spun the company out of that consulting company, and it’s been a very cool ride since then.
Today we like to talk more about the experience rather than the features. Feature-wise, if you talk about class reloading in particular, then you can also add methods, you can add fields, you can add classes, you can add annotations, things like that. But in addition to that we also do things with builds: we directly map the workspace where you edit your classes and resources to the application, and that allows us to skip builds, because you can just compile things in the IDE and that is pretty much it. On top of that we also integrate with a ton of Java frameworks, containers and IDEs. In the end what matters is the experience, and the experience is that whatever changes you make in your IDE, you immediately see them in the application, in the web browser. This experience is something that HotSwap just does not deliver. Even if we stop comparing the features, HotSwap will work some of the time for some of the things, and that’s a very big difference from having it work all of the time for all of the things.
It’s total magic, "Fairy Dust". It’s way easier to describe it like that, but the way it works kind of takes root in my background: I was doing research on virtual machines, on programming languages, on type systems, and it basically is a compiler. A compiler is a great deal more than people think, because a compiler is anything that translates from something to something. JRebel translates from bytecode to bytecode, but it does it in a way that makes the classes reloadable. The main thing we do is use the fact that you can load new classes: we use that to version the classes, and then we make sure that the translated classes actually use the latest version of the classes, so it’s all very hairy. But it’s hard to describe it naturally without having all these diagrams in front of me. I gave talks on that and I wrote a paper that is actually called "A Thousand Years of Productivity", so I encourage you to find that paper and read up: http://crocodoc.com/cxaKv3p
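As a rough illustration of the versioning idea described above, here is a minimal Java sketch. It is not how JRebel is implemented (the interview does not go into that detail); it only shows the general pattern of keeping call sites stable while each call is routed to the latest "version" of the behaviour. The names ReloadableGreeter, greet and reload are hypothetical, and a lambda stands in for a freshly loaded class version.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Hypothetical illustration: callers use a stable entry point, and every call
// is dispatched to whichever version of the behaviour was loaded most recently.
final class ReloadableGreeter {
    // Holds the latest version; swapped atomically when a new version is "loaded".
    private static final AtomicReference<Function<String, String>> LATEST =
            new AtomicReference<>(name -> "Hello, " + name);

    // Stable entry point: existing call sites never change.
    static String greet(String name) {
        return LATEST.get().apply(name);
    }

    // Stand-in for loading a new class version and redirecting calls to it.
    static void reload(Function<String, String> newVersion) {
        LATEST.set(newVersion);
    }

    public static void main(String[] args) {
        System.out.println(greet("QCon"));       // served by the first version
        reload(name -> "Tere, " + name + "!");   // "load" a second version
        System.out.println(greet("QCon"));       // same call site, new behaviour
    }
}
```

The design point is the extra level of indirection: objects and call sites created before the reload keep working because they only ever refer to the stable entry point.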
Yes, actually we have a lot of fun at the company; one of our goals is to have fun. One of the things we really wanted to do is to get JRebel to as many users as we can, because not everybody can afford to pay for it and not everybody is actually employed: there are students, there are hobby projects and open-source projects, a lot of things like that. We already give a free license to open-source developers and a free license to all Scala developers, and we wanted to do more. We figured out that what we’ll do is give JRebel away for free in exchange for people posting statistics once a month to either their Twitter stream or Facebook. If you go to http://social.JRebel.com you can register and get JRebel for free. It’s kind of an experiment: we are still figuring out what the best way to do that is, because we want the guys who are using it for non-commercial purposes to be able to use it for free. I think it is kind of a cool model, but we definitely want to improve on it as we go along.
There are different aspects to that. One aspect is that with JRebel as a tool we want to deliver an experience, so JRebel is all about the experience. Once you have it, wherever you throw it, it should just work: it should be easy to install, easy to configure, invisible most of the time, and it should just improve the way you work. But to deliver an experience you have to keep improving, because one of the powers of JRebel is that it integrates with the whole Java ecosystem, and the ecosystem continuously evolves. There are new versions of frameworks, new versions of containers, new versions of everything. The typical model is the upgrade model, where you have to pay for the new version, and we decided that, no, we want the users to always have access to the latest version, because doing anything else would compromise that experience.
Because even if we don’t have big features, as time goes by JRebel basically becomes useless to them, so they will have to upgrade anyway. I just didn’t like the whole upgrade model, so we decided that the right way to do it is a subscription model, which actually correlates with the value they are getting, both because in a year’s time they saved that much time and because they can always use the latest version of the software and things always just work. So for me the subscription was kind of the obvious choice, and I think it worked out very well for us and for our users.
LiveRebel is a very cool thing. From day one, when we were talking to our JRebel users, they were telling us that it’s a really cool thing and that they want the same thing in production: they want to be able to just change things in production and have them be there. LiveRebel 1.0 was exactly that: we took the hot-patching technology from JRebel, with all the integrations with everything, and ported it to production. And it was great: we got deployments, users got excited about it. But one thing we realized is that when they were talking about JRebel, they were not talking about a particular functionality, they were talking about the experience. They wanted to be able to do the same thing in production, that is, just throw whatever changes they want wherever they want and have them just apply. The problem is that production is a much more complicated environment than development: in development you can always restart; if something doesn’t work out, you restart.
JRebel has some limitations of its own: it cannot change the superclass, so it cannot change from extends A to extends B, and it cannot add implemented interfaces, which already is kind of a limitation in production, so not every single change can be applied. On top of that you have state changes. For example, you added a field, and in development, if it is not initialized, you get a NullPointerException at some point, you restart, big deal, that happens. We can do a lot, but we cannot do magic: we can’t determine what value that field would have had if the object had been constructed with it, because we just attach the field to existing objects. If it’s initialized, that is great. In most development scenarios you can go back a page, basically reinitialize the component or whatever, and that usually works well. In production obviously you can’t do that, so you can’t just have an uninitialized field in a production environment, because it can cause all kinds of trouble.
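The added-field problem can be made concrete with a small Java sketch. This is an assumption-laden model, not JRebel’s mechanism: an object that already existed before the reload is modelled by the hypothetical class LiveObject, and fields "attached" after the fact live in a side map, so a field nobody initialized reads as null.

```java
import java.util.HashMap;
import java.util.Map;

final class AddedFieldDemo {
    // Stand-in for an object created before the reload. Fields added later are
    // modelled as entries in a side map (a simplification for illustration).
    static final class LiveObject {
        private final Map<String, Object> addedFields = new HashMap<>();

        Object get(String field) {
            return addedFields.get(field);   // null if the field was never initialized
        }

        void set(String field, Object value) {
            addedFields.put(field, value);
        }
    }

    // New code that assumes the added field "name" was set by a constructor.
    static int nameLength(LiveObject o) {
        String name = (String) o.get("name");
        return name.length();                // NullPointerException if name is null
    }

    public static void main(String[] args) {
        LiveObject existing = new LiveObject();
        try {
            nameLength(existing);
        } catch (NullPointerException e) {
            System.out.println("added field was never initialized on the existing object");
        }
        existing.set("name", "order-42");    // supplying a value avoids the failure
        System.out.println(nameLength(existing));
    }
}
```

In development the failure is cheap because a restart re-runs the constructors; in production the existing objects stay alive, which is why the uninitialized field matters there.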
We devised ways around that, where you can provide us with the full values, but there are some limitations. Another obvious limitation in production is concurrency: in development you have one user, in production you have a thousand users, so we did protect against that too. So for hot-patching in production we devised a structural comparison which takes two versions, compares them and makes sure that we can actually apply the difference. We do that ahead of time, which guarantees that only those versions which are actually compatible for hot-patching purposes get applied. We also solved the concurrency issue: for the duration of the update, which is not long, something like seconds, we pause incoming requests and let the current requests finish, then we do a "Stop the world" on this particular machine for the duration of the update, and then we resume all the requests. During that moment there is no concurrency on the machine.
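Both ideas above, the ahead-of-time structural comparison and the pause during the update, can be sketched in a few lines of Java. This is only an illustration under assumptions: the check covers just the two limitations mentioned earlier (no superclass change, no added interfaces), while a real comparison would look at far more, and the request pause is modelled with a standard read/write lock rather than anything LiveRebel-specific. All class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class HotPatchSketch {
    // Sample "versions" of a class, used only for the demonstration.
    static class Base {}
    static class OtherBase {}
    static class ServiceV1 extends Base {}
    static class ServiceV2 extends Base {}                      // same shape as V1
    static class ServiceV3 extends OtherBase {}                 // superclass changed
    static class ServiceV4 extends Base implements Runnable {   // interface added
        public void run() {}
    }

    // True if the new version keeps the superclass and adds no interfaces.
    static boolean isHotPatchable(Class<?> oldVersion, Class<?> newVersion) {
        if (oldVersion.getSuperclass() != newVersion.getSuperclass()) {
            return false;
        }
        Set<Class<?>> oldIfaces = new HashSet<>(Arrays.asList(oldVersion.getInterfaces()));
        return oldIfaces.containsAll(Arrays.asList(newVersion.getInterfaces()));
    }

    // Fair lock: once an update is waiting, newly arriving requests queue behind it.
    private static final ReentrantReadWriteLock GATE = new ReentrantReadWriteLock(true);
    private static String activeVersion = "v1";

    // Requests share the read lock, so they run concurrently with each other.
    static String handleRequest() {
        GATE.readLock().lock();
        try {
            return "served by " + activeVersion;
        } finally {
            GATE.readLock().unlock();
        }
    }

    // The update takes the write lock: it waits for in-flight requests to finish
    // and no request runs while the version is being switched.
    static void applyUpdate(String newVersion) {
        GATE.writeLock().lock();
        try {
            activeVersion = newVersion;
        } finally {
            GATE.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(isHotPatchable(ServiceV1.class, ServiceV2.class));
        System.out.println(isHotPatchable(ServiceV1.class, ServiceV3.class));
        System.out.println(isHotPatchable(ServiceV1.class, ServiceV4.class));
        applyUpdate("v2");
        System.out.println(handleRequest());
    }
}
```

Running the check before touching production is what turns "it might apply" into a guarantee: an incompatible version is rejected up front instead of failing halfway through.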
So we did solve that, but what we found out then is that hot-patching would only support some of the updates: it would support the minor updates, but it wouldn’t support the major updates, which have a lot of changes, different structures of the code and so on. Customers were happy about the support for minor updates, because it allowed them to deploy a lot of small bug fixes and so on rapidly. That is great, but what they really wanted was this JRebel experience, where they can just throw the update, throw the change at it, and it just applies. So for 2.0 we focused not on the features but, again, on the experience, and what we built is a tool which can deploy any and every update that you throw at it in a way that is transactional, so it either succeeds in full or doesn’t succeed at all; that is online, so it doesn’t just drop the users, the users keep using the application during the update; and that is reversible. That is actually also one of the things they care about: if something goes wrong, we have a panic button and you can just roll back. And we did it. I guess I’ll also tell you a little bit about how we did it, because I think it’s wicked cool.
The way we did it is that we have different strategies for different sizes of updates. The smaller updates still get the hot-patching strategy, but the bigger updates now get a rolling restart strategy. The LiveRebel agent is installed on each server (and it is really, really easy to install; that is another thing I am very proud of, how easy it is to install both JRebel and LiveRebel). The agent actually survives the restart of the server, and it is a tiny little load-balancer that redirects the requests and the sessions to data sources. So what happens is that we completely automate this process: we do the session upgrades, we bring all the new users to other servers, and once the old server doesn’t get any requests anymore, we restart it with the new version, put it back in the rotation, and repeat the process for every other server.
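The rolling restart loop described above can be modelled in a few lines of Java. This is a simplified simulation, not LiveRebel code: the Server class, its fields and rollingUpgrade are hypothetical, draining and restarting are reduced to field assignments, and session migration is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

final class RollingRestartSketch {
    // Minimal model of a server behind a load balancer.
    static final class Server {
        final String name;
        String version;
        boolean inRotation = true;

        Server(String name, String version) {
            this.name = name;
            this.version = version;
        }
    }

    // Upgrades servers one at a time and returns the smallest number of servers
    // that were serving traffic at any point during the upgrade.
    static int rollingUpgrade(List<Server> servers, String newVersion) {
        int minServing = servers.size();
        for (Server s : servers) {
            s.inRotation = false;        // new requests now go to the other servers
            minServing = Math.min(minServing, countServing(servers));
            s.version = newVersion;      // stand-in for "drain, then restart with the new version"
            s.inRotation = true;         // put the server back in the rotation
        }
        return minServing;
    }

    static int countServing(List<Server> servers) {
        int serving = 0;
        for (Server s : servers) {
            if (s.inRotation) {
                serving++;
            }
        }
        return serving;
    }

    public static void main(String[] args) {
        List<Server> servers = new ArrayList<>();
        servers.add(new Server("app-1", "1.0"));
        servers.add(new Server("app-2", "1.0"));
        servers.add(new Server("app-3", "1.0"));
        int minServing = rollingUpgrade(servers, "2.0");
        System.out.println("servers serving at the lowest point: " + minServing);
    }
}
```

Because only one server is out of the rotation at a time, users always have somewhere to go, which is what keeps the update online.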
Yes, the cool part about it today is that everybody is involved; that is what the whole DevOps thing is about, bringing development, operations and business together. I was actually at a DevOps conference and there was a whole discussion that we should call it "DevBizOps", because business is a key and integral part of it. It’s exciting because developers have a stake in this: what developers want is for their code to be in production. By nature, every developer whose code is not in production is kind of pissed off, like "I wrote this thing, it is done, why does it take three months to deploy it?" The Ops, on the other hand, are often measured by uptime, so they want to keep things stable and keep things simple, and the business guys just want whatever drives the revenue.
The cool part about LiveRebel and about dealing with operations is that (again, this is the whole point of DevOps, the whole point of continuous delivery) you can explain that you actually decrease the risk, not increase it, by deploying updates often, because: a) smaller updates also have a smaller impact, and if something breaks you know that this particular update broke it, versus deploying some huge monstrosity where something breaks and you have no idea what broke it, why it broke and what’s going on; b) you can do A-B testing, and with LiveRebel you can actually roll out to just one or two servers at a time, see how it works, and then roll out to the rest.
It’s so remarkably easy that you can do it with the click of a button, and Ops guys are very happy about that, because right now it’s usually a script, and the problem with a script is that if it fails in the middle you suddenly have a system which is half-way there, and usually you don’t even know where it failed. They are very happy about being able to press a button, A-B test it, take it back if needed, and have this predictable process. Truth is, it’s really all about predictability. The typical manual update process, even a scripted update process, is not really predictable. If it’s manual, you introduce human error: a human is doing it and he forgets something. Humans are notoriously bad at repetitive tasks, which means that somewhere down the line they do one thing differently, and then you are stuck with a system that is half-way there and half-way not there. And the script is bad because it doesn’t leave an audit trail, and it executes steps but it isn’t transactional, so it can get to some point and fail there. That is normal, because deploying things is hard and you are always dealing with failure, and if something fails you want to be sure that you can recover from it.
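The contrast with a plain script can be illustrated with a minimal Java sketch of an all-or-nothing deployment: each step knows how to undo itself, every action is written to an audit trail, and a failure rolls back whatever had already been applied. The Step interface, the run method and the step names are hypothetical and only illustrate the pattern; they do not describe how LiveRebel is built.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class TransactionalDeploySketch {
    // A deployment step that can be applied and, if needed, undone.
    interface Step {
        String name();
        void apply() throws Exception;
        void undo();
    }

    // Runs all steps; on the first failure, undoes the completed steps in
    // reverse order. Returns true only if every step succeeded.
    static boolean run(List<Step> steps, List<String> audit) {
        Deque<Step> done = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.apply();
                audit.add("applied " + step.name());
                done.push(step);
            } catch (Exception e) {
                audit.add("failed " + step.name());
                while (!done.isEmpty()) {
                    Step completed = done.pop();
                    completed.undo();
                    audit.add("undid " + completed.name());
                }
                return false;
            }
        }
        return true;
    }

    // Helper for the demo: a step that records itself in `state`, or fails.
    static Step step(String name, boolean fails, List<String> state) {
        return new Step() {
            public String name() { return name; }
            public void apply() throws Exception {
                if (fails) {
                    throw new Exception(name + " failed");
                }
                state.add(name);
            }
            public void undo() { state.remove(name); }
        };
    }

    public static void main(String[] args) {
        List<String> state = new ArrayList<>();
        List<String> audit = new ArrayList<>();
        boolean ok = run(Arrays.asList(
                step("copy files", false, state),
                step("switch traffic", true, state)), audit);
        System.out.println("succeeded: " + ok);
        System.out.println("state after run: " + state);
        System.out.println("audit trail: " + audit);
    }
}
```

The audit list answers the "where did it fail?" question that a half-finished script usually cannot, and the undo stack is what keeps the system from being left half-way there.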
When we talk to Ops, we always talk about this failure and recovery, and they always get that. It’s beautiful, and I have to thank the guys who did the work for that, like the guys who wrote the continuous delivery book and the guys who started the DevOps movement, because today the Ops teams actually get that just resisting all change is not the way; the way is to embrace change but make it predictable and recoverable.
These are two different things. As a company I leave the choice to our customers; I do not impose my own views on anybody, and if you prefer one or the other, that is fine, but most people don’t understand the difference. By the way, for the viewers, the difference between continuous delivery and continuous deployment is this: they are the same up to the point of the actual deployment to production. Continuous deployment says that everything that passed up to here should go to production, every release candidate that was actually tested should go to production, and the continuous delivery guys say "Yes, but there should be a button": it should not go to production automatically, somebody has to press a button and take the responsibility.
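The distinction drawn above fits into one small decision function. The following Java sketch is purely illustrative; the enum and method names are made up for this example.

```java
final class ReleaseGateSketch {
    enum Mode { CONTINUOUS_DELIVERY, CONTINUOUS_DEPLOYMENT }

    // Decides whether a release candidate goes to production.
    // Both modes require the candidate to have passed the pipeline;
    // only continuous delivery additionally waits for someone to press the button.
    static boolean goesToProduction(Mode mode, boolean passedPipeline, boolean buttonPressed) {
        if (!passedPipeline) {
            return false;
        }
        return mode == Mode.CONTINUOUS_DEPLOYMENT || buttonPressed;
    }

    public static void main(String[] args) {
        System.out.println(goesToProduction(Mode.CONTINUOUS_DEPLOYMENT, true, false));
        System.out.println(goesToProduction(Mode.CONTINUOUS_DELIVERY, true, false));
        System.out.println(goesToProduction(Mode.CONTINUOUS_DELIVERY, true, true));
    }
}
```

Everything before the final condition is identical in both modes, which matches the point that the two approaches only diverge at the last step.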
So personally I think the continuous deployment guys have it right. There is no silver bullet, there is no perfect solution for everyone, but I think it’s a perfect solution for the majority of deployments out there, which are not "NASA style missions": the rocket doesn’t crash on the moon if there is a failure. If the result of a failure is just a little bit of downtime, or some users getting an error message, then I would say go with continuous deployment, because what you want is very good monitoring and very good recovery; you want to be able to recover very quickly. That is what A-B testing is for, that is what all kinds of testing and monitoring are for, and by the way, eventually there is no difference between testing and monitoring: your monitoring is just tests running on the production server.
If you have the environment set up correctly, if you have the testing, the monitoring, and a panic button somewhere, or you can fix things quickly, then you really want to go with continuous deployment, because those bugs will come out anyway. If you have a bug in the code it will come out sooner or later, and you want it to come out sooner, because the sooner it happens the less impact it will have down the line. It is just that kind of thing, but I do understand that it is a scary thought: we have a stream going from commits to production and it just goes there. By the way, the reason why you want to deploy automatically, why you don’t want to press the button, is exactly that you want to do updates in small batches; it’s the same thing.
Small batches, maybe with A-B testing: first a few servers, then we roll it out to all. If you do this in small batches, then when there is a failure you always know what’s at fault: it’s the latest batch, the latest small thing. Whereas continuous delivery and typical deployment tend to mean big batches, which means that you have a bunch of stuff, something fails, and then you have to do a post-mortem or start evaluating what exactly failed. Versus if you have this one change set coming in, you know: "Ok, this is the problem, we’re rolling it back, and let’s just start fixing that bug, because we know that it is in there."
The database still remains a complication. We do accommodate database updates in LiveRebel: we allow pausing the update so that the database update can be applied, and we also allow all kinds of scripting. But the database remains a problem, mainly because if you have a lot of data you want to migrate, it’s really hard today to do it in a seamless way. The problem is that data always keeps coming in and it’s persistent, so you cannot just take a snapshot of the data, migrate it and then put it back; the rolling restart thing doesn’t work for databases. So that is one of the big issues still unsolved, and another one is that in large organizations the sheer number of dependencies is kind of tough.
There are some good orchestration tools out there, and we help a little bit in that LiveRebel provides an overview of your servers, what applications are deployed and what versions are deployed, but at the moment we don’t know explicitly what depends on what. The order of deployment is also straightforward; there are some tools that help with that, like asset management tools, which are basically change management tools, that help with deploying in the right order and so on.
But it’s a very tough problem. With the whole deployment you are talking about a live system, and it would all be very simple if not for those "pesky" users. In fact, if you ask me what the most complicated part of updates is, it’s the "damn" users: they keep using the application. Stop doing that! If they would just stop for an hour it would be way easier, so the whole thing is how to do it without showing it, without any disruption to the users. That is the toughest part; all the other complications wouldn’t be quite as hard if you could just take all the systems down and do it, and that is why today the majority of updates are done offline, at 3 AM on a Sunday.
Funnily enough, the biggest challenge to developer productivity that came up in our latest survey was multitasking. It is a sign of our times that everybody is so overloaded, everybody has so many small things to do, that it’s just crazy. There is plenty of advice on how to fight that: turn off email, turn off your phone, bar the door, put furniture against the door so the manager can’t walk in. In the end, it is very much about the environment that you build in the company. It’s funny, because there are two schools of thought, and the Agile school of thought has maybe contributed a lot to this whole multitasking theme, because the Agile idea was that we should have a lot of communication, that we should have smarter communication, that everybody should keep "jabbering" and then you overhear it and that is a good thing.
This is a great idea, but nothing comes cheaply, nothing comes for free in this world, and the other side of Agile is that yes, there is a lot of communication, yes, there are a lot of things going on, and there is also a lot of interruption, there are a lot of context switches. Also, just because of the way Agile organizes stories and things like that, I think there is a bit more multitasking: it’s not this long cycle of analysis, requirements, implementation, delivery, it’s a much shorter cycle, which means there are a lot more switches. For example, I personally haven’t tried to code for a product for a while now, because I cannot do any meaningful amount of work in less than two days.
The first day I am getting into it, the second day I am productive, and then I am very annoyed that I have to switch to something else. Back when I really used to do that, I would take two weeks at a time and just do one thing after another, and for me that is the meaningful amount of time. But now I cannot code because I cannot take the time, and I understand that a lot of other guys also feel this. You really have to fight that; you have to say "Multitasking sucks, guys!" Go to the manager and tell him: it’s your job to make sure that I get few interruptions, it’s your job to make sure that I can focus on one thing at a time, because human beings don’t actually multitask; that’s a misconception. Human beings focus on one task at a time, then they context switch, then they focus on another task, then they context switch, and a context switch is like 15 minutes.
Actually there was an amazing test, I think it was done by HP, showing basically that if your workday is constantly interrupted, like every hour or so you are interrupted, then your IQ decreases by ten points, which is about equal to smoking marijuana or a night without sleep. So they are making people stupid, and that is bad!