00:21:02 video length
Bio Adrian Cole is the founder of the open source jclouds project and CEO of Cloud Conscious, LLC. Adrian also runs the Cloudhacker’s group in San Francisco, a regular gathering of cloud developer enthusiasts. His 17-year career in IT includes design and implementation of mass automation and deployment products for financial, hosting, and education contexts.
Strange Loop is a developer-run software conference. Innovation, creativity, and the future happen in the magical nexus "between" established areas. Strange Loop eagerly promotes a mix of languages and technologies in this nexus, bringing together the worlds of bleeding edge technology, enterprise systems, and academic research. Of particular interest are new directions in data storage, alternative languages, concurrent and distributed systems, front-end web, semantic web, and mobile apps.
Basically, if your goal is to have your machines do your bidding, there are a few steps to get there. One is provisioning: you use an existing machine, rack a new machine, or use something like VMware to create a virtual machine. All of that is in the provisioning space, where you are just making a blank machine exist. Then you have configuration, where you are installing and configuring software on it. Then integration, where that running software might be connecting to a data store, Cassandra or maybe a traditional database, or maybe it's a load balancer. Then you get to lights on, where you are operating machines.
Technologies like Puppet and Chef really live in this land between configuration and integration, and jclouds lives on the provisioning side; its goal is essentially to allow you to launch whatever configuration and integration process you want.
Jclouds is a library, not a framework. The jclouds tools allow you to act on cloud resources in a provider agnostic way: you don't have to know anything about vCloud or EC2 to use either. It also allows you to access provider specific features, so it's not an "either/or" decision, it's a "when you need it" decision. Beyond compute provisioning, we also have a provider agnostic utility for storage services, like Amazon S3. These are generally the two use cases jclouds helps out with.
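As a rough sketch of what provider agnostic means in practice, the snippet below creates a jclouds compute context, assuming a jclouds 1.x dependency on the classpath. The "stub" provider id is jclouds' built-in in-memory provider, used here so no credentials are needed; swapping in an id like "aws-ec2" plus real credentials would, in principle, be the only change needed to target a real cloud.

```java
import org.jclouds.ContextBuilder;
import org.jclouds.compute.ComputeService;
import org.jclouds.compute.ComputeServiceContext;

public class ProviderAgnostic {

    // Counts the images a provider exposes; the provider id is the only
    // cloud-specific piece of this code.
    static int countImages(String providerId) {
        ComputeServiceContext context = ContextBuilder.newBuilder(providerId)
                .buildView(ComputeServiceContext.class);
        try {
            ComputeService compute = context.getComputeService();
            return compute.listImages().size();
        } finally {
            context.close();
        }
    }

    public static void main(String[] args) {
        // "stub" is an in-memory provider, so this runs without credentials.
        System.out.println("images: " + countImages("stub"));
    }
}
```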
The concept we use in jclouds is called a Template, and it associates a specific operating system image with a specific set of hardware and a specific location, which is generally at least a logical network location. We have a builder system to say "I want Ubuntu 10.04 and at least 2048MB of RAM; build me one." If that fails, it will say that no such configuration exists on this cloud. If you choose no options, which you can, it will find you something that can run Java. Note that jclouds doesn't build images; it selects from them. If you wanted to build images, you could couple jclouds with other utilities that run on top of it and then save an image based on the result, but currently it doesn't build images.
The biggest ability it offers application developers is not to be caught up in how to achieve what you are asking for. I think that's what most libraries intend to do: the amount of code you write should be proportional to the time it takes to describe what you want. For example, if you look at how to get a machine in vCloud, there is a multi-step process: find an image related to what you want, tune it to the hardware configuration, instantiate it, deploy it, and so on, and those steps all exist for good reasons. But a developer may want to say "I want a machine running Linux with this much RAM; build me one." That is almost exactly the code in jclouds for achieving that command.
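That "build me one" description maps almost line for line onto the jclouds TemplateBuilder API. A minimal sketch, again using the credential-free "stub" provider so it runs without a cloud account; the group name "demo" is arbitrary:

```java
import java.util.Set;

import org.jclouds.ContextBuilder;
import org.jclouds.compute.ComputeService;
import org.jclouds.compute.ComputeServiceContext;
import org.jclouds.compute.RunNodesException;
import org.jclouds.compute.domain.NodeMetadata;
import org.jclouds.compute.domain.OsFamily;
import org.jclouds.compute.domain.Template;

public class LaunchNode {

    // "A machine running Ubuntu with at least 2048MB of RAM; build me one."
    static int launch(String providerId) throws RunNodesException {
        ComputeServiceContext context = ContextBuilder.newBuilder(providerId)
                .buildView(ComputeServiceContext.class);
        try {
            ComputeService compute = context.getComputeService();
            Template template = compute.templateBuilder()
                    .osFamily(OsFamily.UBUNTU)
                    .minRam(2048)
                    .build();  // fails here if no image/hardware pair matches
            Set<? extends NodeMetadata> nodes =
                    compute.createNodesInGroup("demo", 1, template);
            return nodes.size();
        } finally {
            context.close();
        }
    }

    public static void main(String[] args) throws RunNodesException {
        System.out.println("launched: " + launch("stub"));
    }
}
```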
One of the benefits is a simplified interface that isn't limited, and I think that's very hard to do. Sometimes it's easy to make a simple interface, but then it doesn't do anything. The other thing we do is a lot of testing. For example, our beta releases aren't cut until they actually pass for every cloud provider. That means we must be able to run multiple nodes at the same time, configure them, control them, run scripts as root on all of them at the same time, and do all sorts of orchestration commands. Tracking provider changes is a challenge for us, and we take on that challenge so that developers don't have to worry about what changes in the cloud.
Finally, we provide unit test utilities. For example, we have an in-memory version of Amazon S3, so you can do unit testing without connecting to or getting access to the whole network. The same goes for the compute side: we have a stub that allows you to write code against compute providers without actually launching anything. That's pretty useful to people too.
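The in-memory version of S3 mentioned here is the "transient" blob store provider. The sketch below, assuming a jclouds 1.x dependency, writes and reads a blob entirely in memory; the same code is meant to run unchanged against a real provider such as "aws-s3" once you supply credentials. The container and blob names are just for illustration.

```java
import org.jclouds.ContextBuilder;
import org.jclouds.blobstore.BlobStore;
import org.jclouds.blobstore.BlobStoreContext;
import org.jclouds.blobstore.domain.Blob;
import org.jclouds.util.Strings2;

public class InMemoryBlobStore {

    // Round-trips a string through the blob store and returns what was read.
    static String roundTrip(String providerId) throws Exception {
        BlobStoreContext context = ContextBuilder.newBuilder(providerId)
                .buildView(BlobStoreContext.class);
        try {
            BlobStore store = context.getBlobStore();
            store.createContainerInLocation(null, "tweets");
            Blob blob = store.blobBuilder("hello.txt")
                    .payload("hello, cloud")
                    .build();
            store.putBlob("tweets", blob);
            return Strings2.toStringAndClose(
                    store.getBlob("tweets", "hello.txt").getPayload().openStream());
        } finally {
            context.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // "transient" keeps everything in memory: no network, no credentials.
        System.out.println(roundTrip("transient"));
    }
}
```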
It’s a good question, because one of the things we have to ask ourselves when we create an open source project is where we start and stop. If you don’t know where you stop, it’s hard for people in your ecosystem to know when they should start. As I described earlier, in the phases between turning your process on and actually having your software configured and ready for use, there are provisioning, configuration, and integration. Any project could do all of those phases, but if you focus on one and make the link to other projects clean, then they can come up with multiple solutions to the other phases. A lot of our integrations happen above provisioning, at the configuration and integration layer, sometimes even extending to control.
For example, there is a project called Apache Whirr, written mostly by some of the guys who work at Cloudera and founded by Tom White, who wrote the Hadoop book. It can do things like spin up and control Hadoop, Cassandra, and ZooKeeper clusters, which is pretty useful. There are a whole lot of Clojure folks working in the jclouds ecosystem, and one of the more notable tools in that space is called Pallet, which allows you to specify stacks in a Clojure DSL, essentially to say "I want to configure Nginx and Ganglia and all these other technologies that somebody might want." It makes a bespoke configuration, as opposed to whatever happens to already be on an image.
From the storage perspective there are lots of interesting things. For example, there is a very popular Twitter library called Twitter4J, and they’ve had a lot of requests from people who want to store their tweets longer than Twitter holds them. The founder of that project is currently working on integrating it with jclouds so that the user doesn’t have to know that Twitter stopped storing their tweets. Then there are frameworks that are using jclouds, too. For example, a few JBoss projects are using jclouds to provision resources for testing or for building web frameworks and things like that.
Cloud Foundry is a service that people use to deploy applications and it handles the creation of resources such as MySQL and Tomcat servers and such. There are a number of commercial offerings that are using jclouds to do similar things to that. Jclouds is a library and not a service, so you could build a utility like Cloud Foundry with jclouds and other tools that would layer on top to do things like MySQL, but jclouds itself isn’t a service. You can build a Cloud Foundry with it but it isn’t the same thing.
I really like that analogy. It’s not true, but I like it. In a way, you want to say "Yes" because there is virtualization underneath most clouds. But there are different hypervisors, and some even offer bare metal clouds now, which is very interesting. When it comes down to implementation, there are various differences. Some of them are developed in Visual Basic, some in Python, and a lot of them are developed in Java themselves. The software development, and how that represents itself in code, is sometimes opinionated based on the libraries that are used to generate XML or JSON.
As a party working against multiple cloud APIs, you’re really seeing them as a black box: you see things that hint at the implementation, but you never really care unless it gets in your way. How they are written is certainly different, how they represent data is very different, and sometimes the control flow is different. Those are the easier challenges, because that’s just portability of data and strategy. What’s more challenging is the chaos factor, the reliability of the APIs. For example, a new cloud will have unreliable service, so they might choose a REST API but use the wrong error codes, and it might stay that way for a month.
In jclouds we ended up having to write a lot of code for switching out error handlers, so that we can work around an issue now without leaving a permanent trace in the code that will always have to do that hack. When you work with cloud providers, I think one of the things is that you are tempted to use the services even if the software quality is poor. Part of the challenge in working in that environment is that you have to encapsulate that quality concern somehow without compromising your own quality.
That’s very interesting. A lot of people who work with clouds, especially early adopters, equate the word "cloud" with Amazon’s implementation of EC2. That actually exerts a lot of gravity on implementation decisions and even the way they approach problems. For example, in EC2 storage is an independent service called EBS, and people get used to the ability to detach volumes and move them around. As it is, I think there are about a dozen different cloud APIs, if not more, and there are many services running those APIs, so it’s like a one-to-many relationship.
There are a lot of differences. The model of EC2 is pretty good; it hasn’t changed substantially since it was created, and because it’s been very stable in that regard, it may not be the best API in the world, but it’s something a lot of people have figured out how to use. Some cloud director projects are exposing EC2 alongside their own APIs. OpenStack, for example, has an EC2 API even though they are developing their own API as well. The thing about EC2 is that a lot of people will emulate it, and that actually causes confusion too, because if you’ve noticed the news in the cloud world, Amazon releases news every month or two - there is some new feature that’s very exciting for people.
There is this tracking gap between the clones and EC2, and sometimes it gets pretty wide. You want to say that you are EC2 compatible, but when were you compatible? "I was compatible in March, maybe." It’s an elusive goal, but I think that EC2 is the closest thing to a de facto cloud API. I think that vCloud is probably number two from an adoption, or likelihood to be adopted, perspective. That API is completely different from EC2 and much more datacenter oriented. For example, EC2 really makes you want to think that a machine is a blob with an IP address stuck to it. In vCloud it’s really more about a machine that’s connected to a network and things like that.
Jclouds has quite a few folks who are using what we call the "blob store", which is our portability API across things like Microsoft Azure storage and Amazon S3 - that’s very popular and used as an API. Essentially you give it a key-value pair, maybe with some metadata on it, and people integrate directly with that. Then, from the provisioning side, it’s a combination. For example, folks like CloudSwitch have their own abstraction: they write some of their libraries directly and use some jclouds components to make their platform deployment easier. The same model is used by enStratus; they have some components they write themselves and some they choose from jclouds.
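That "key-value pair with maybe some metadata" shape looks roughly like this in the blob store API. A sketch using jclouds' in-memory "transient" provider so it needs no credentials; the container name, blob name, and metadata key are made up for illustration:

```java
import java.util.Collections;

import org.jclouds.ContextBuilder;
import org.jclouds.blobstore.BlobStore;
import org.jclouds.blobstore.BlobStoreContext;
import org.jclouds.blobstore.domain.Blob;

public class BlobMetadataExample {

    // Stores a blob with user metadata and reads the metadata back.
    static String storedAuthor() {
        BlobStoreContext context = ContextBuilder.newBuilder("transient")
                .buildView(BlobStoreContext.class);
        try {
            BlobStore store = context.getBlobStore();
            store.createContainerInLocation(null, "archive");
            Blob blob = store.blobBuilder("tweet-1")
                    .payload("some tweet text")
                    .userMetadata(Collections.singletonMap("author", "adrian"))
                    .build();
            store.putBlob("archive", blob);
            // blobMetadata fetches the metadata without pulling the payload.
            return store.blobMetadata("archive", "tweet-1")
                    .getUserMetadata().get("author");
        } finally {
            context.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(storedAuthor());
    }
}
```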
In some production scenarios folks are deciding which components they want to use for specific clouds. Some of them use our portability API directly, and then there are some that use pieces of the API. For example, the CloudBees guys use the EC2 API from jclouds and wrap it in Scala for a workflow. We don’t have Scala bindings for jclouds at the moment; we just have Clojure and Java, but they are also using the blob store as is. It actually runs the gamut. The jclouds feature of accessing the provider specific implementation sounds niche when you talk about it, because most people look at it like SQL: "Why would I want to use the native Oracle driver?"
But in the cloud world it’s as likely, or more likely, that you want to access the underlying API. I definitely see that usage pattern very often.
Over the last two months we’ve had a lot of requests for local virtualization control. What that means is: I’m a developer, I’m on my laptop, and I want jclouds to launch VMs on my laptop. It happens that a lot of people who are writing frameworks on top want to prove out the end result; they want to actually use a cluster, not a simulation of a cluster. We resisted for a while, but the demand is too high, so we’re definitely going to do virtualization control. We’re going to use a very useful tool called libvirt, which is basically portability for hypervisors. We’re going to hook into that so people can use whatever hypervisor they want under jclouds.
So that’s going to be pretty significant, and then we’re probably going to add "control an existing machine" functionality: essentially you put dummy information into jclouds so it thinks you just provisioned a whole bunch of machines you already have. That way you can use the same provisioning tools that interface with the jclouds API to control existing machines, locally virtualized machines, or cloud machines. So it should be really useful for folks. The other thing we’re working on is the blob store side; jclouds is split-brained in that we have as many folks working with the blob store related stuff as with the compute stuff. That channel of effort is directed towards ACL support. We’ve gotten a lot of requests for at least the ability to produce public URLs for data, and then some refinements to some of the provider specifics.
For example, EC2 recently released tagging support in their API and "bring your own SSH key" support. We’re going to do a little more work on our EC2 implementation to incorporate those features, and also support asynchronous cloud APIs such as Amazon spot instances, so that people who want to launch a bunch of clusters without waiting for them to provision can work with them just as well as the ones they create synchronously.
If you think about it, the more tools you have in the ecosystem interfacing with the jclouds API, the more of that functionality you can bring to existing machines, because they are all interacting with the same API. It’s not perfect, because one of the assumptions you can make on a cloud is that no one touched the machine before you got to it. It’s not going to be the perfect match for all scenarios, because you’re going to have to be careful what functionality you use so as not to step on other people. I think that a libvirt type of thing, where you are launching VMs on existing machines, is going to be safer for all the tools to interact with, the same way as they would a cloud, whereas otherwise you would have to be a little more careful.
But one example of this is the GigaSpaces guys, who are using jclouds now and have a lot of environments with existing machines because they deal with the financial industry. They use jclouds to provision their agents to spin up the GigaSpaces service. That doesn’t actually compromise the machine: it really runs in a single user’s space, and if there happen to be other processes, it’s not likely to conflict. If you know your domain well, it makes your life a lot easier to use the same provisioning tools regardless of whether they’re targeting an existing machine or a remote one. But there is no magic bullet that makes it work the same way on a shared machine, because part of what cloud gets you is control over your space.