00:42:39 video length
Bio Tom Preston-Werner works for Powerset Inc. and is one of the founders of GitHub. He's created various Ruby tools such as the monitoring tool god, and many more. Kenneth Lundin joined the Erlang/OTP project in its early stage in 1996 and has been working both with application components and the runtime system since then. He has been managing the team for about 10 years now.
The Erlang Factory is an event that focuses on Erlang - the computer language that was designed to support distributed, fault-tolerant, soft-realtime applications with requirements for high availability and high concurrency. The main part of the Factory is the conference - a two-day collection of focused subject tracks with an enormous opportunity to meet the best minds in Erlang and network with experts in all its uses and applications.
I’m co-founder and CTO of GitHub. It’s a company that does Git hosting and collaboration tools. Around Git we do training, we have a locally installable version for enterprises and we do a lot of outreach at conferences. We’re just trying to get the word about Git out and tell people how it can help their businesses and help them be more productive at developing code and collaborating with each other.
I’m Kenneth Lundin and I’m the manager of the Erlang/OTP team at Ericsson, and we are releasing the Erlang/OTP open source software.
2. People are mostly hearing about Git and a lot of people take it for another SVN or distributed SVN which is missing the point. Can you give us an introduction into what Git is and the core concepts of Git for the people who think about learning about Git?
Git is along the same lines as Subversion and CVS and RCS and all these, but it’s a bit of an evolution of that concept. Things like Subversion and CVS use a centralized repository. So all of your code sits in one place on a server and then people use clients to communicate with that central server. Everyone communicates with that central server. If there is a problem at all with that central server then you really are kind of stuck. You can’t go forward, you can’t do anything. The evolution of that is an idea that is called Distributed Version Control.
Instead of having one central point you have each repository as its own full repository with all history. Every developer that has a clone of the repository can work offline so they can commit offline because it’s local, so they are only committing to their local repository. Then, being distributed it’s easy to push commits, to push those changes that you make to other people or you can arrange a bunch of developers around a central model. You can still use a central kind of model, but you get the assurances of a distributed mechanism, in case there is a problem with the central point.
People can still work and commit locally and do all that kind of stuff. It’s just taking that concept of source control and exploding it out, giving the developers more power over what they are doing locally. Another thing that Git does extremely well is branching. In Subversion and CVS, doing things like branching off code, making some commits over here while maybe the mainline gets some commits on it, and then having to merge later on is a really big pain, as it is in a lot of systems. Git makes that a central concept.
Let’s say you are working on an open source project, like Erlang for instance. You want to pull down the code and you want to make some changes and then get those back to the mainline so that they can merge them back in. In the older way you are dealing with patch files mostly in Subversion and CVS land. You make some changes locally, you roll one big patch file and you email that to a list or you send that to a person somehow. With Git, since branching is built in, it’s a core concept, you branch off, you create what’s called a feature branch and you make your commits there and you might make 5 commits (logical little pieces of commits).
Then, you can push this code to a central location, like GitHub, and now you can point other people at that change set right on the site, or just point them to your repository, wherever it happens to live, and say "Here are my changes in this topic branch and I would like you to review them." If they want to pull them back in, they can just pull down those commits locally. It just grabs those commits from a remote repository, pulls them in locally and now they can deal with those commits locally.
You are not applying patch files any more, you are actually getting full commits with author and commit data, the date it was committed, the commit message, all the stuff that you want in your commits. You get all of that because it’s really just pulling an extension of your own repository back into yours. It streamlines the whole process of contributing to projects and collaborating between developers.
There is always a potential for conflicts. If 2 people change the same line of code, then Git is going to say "Hey, I don’t know what to do!" You can say "Always prefer mine" or "Always prefer yours", but usually if you don’t tell it what to do, then it will just say "I don’t know what to do" and now it’s up to you. That’s no different from any other system. A user’s going to have to intervene at some point, because the computer doesn’t know how to handle that conflict. That requires a human brain. Git does a few things to make that easier though. It will actually separate the changes that merged cleanly from the changes that didn’t.
Git has this concept of an index, an intermediary place where your changes go and then you make commits from there. It will put in this index the changes that merge cleanly and keep in your working directory the changes that did not, so you can very easily run a single command and see what didn’t merge. So it’s giving you this intermediary step that allows you to handle merge conflicts a little bit easier. You can use merge tools and what not, same as with a lot of systems, but really it’s just the ease with which you can get those changes. Another nice thing with Git is that because it keeps the commits as a graph and it knows where things branched off, it can do a 3-way merge.
A 3-way merge takes the 2 endpoints of the branches that you are going to merge and the common point in the history and it can use that as context. A 3-way merge is going to succeed more often than just a 2-way merge where you are taking 2 different files and attempting to match them together. That context makes the algorithm a lot cleaner. Git does give you that because it knows the history and knows where the common ancestor is.
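The rule behind a 3-way merge can be illustrated with a minimal Python sketch. It is a deliberate simplification: it compares whole files rather than regions within a file, as a real merge does, and the function name `three_way_merge` and the dictionary-based snapshots are assumptions made for illustration, not Git's actual implementation.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots (path -> content) using their common ancestor.

    Returns (merged, conflicts). A path is a conflict only when both sides
    changed it, and changed it differently.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (or neither touched it)
            result = o
        elif o == b:      # only their side changed it
            result = t
        elif t == b:      # only our side changed it
            result = o
        else:             # both changed it differently: a human must decide
            conflicts.append(path)
            continue
        if result is not None:   # None means the file was deleted
            merged[path] = result
    return merged, conflicts
```

Without the `base` snapshot, a file that differs between the two sides looks like a conflict every time; with it, the merge can tell which side actually made the change.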
Right now my responsibilities are on the backend. That’s what I do day-to-day, but from the company perspective it’s about making it easier for people to collaborate on code. That is kind of our main goal. Let’s say you and I or Kenneth and I are working on a project. How can we reduce the friction between writing code and getting that code into a final system? Because that’s what writing code is all about. I mean you write code so that code ends up in production and it’s useful. A lot of systems put roadblocks in your way.
They make it challenging to branch or a central server can go down or there are all kinds of policies that people make. With Git you can say "Here’s code, you can fork that." When I say ‘fork’ I mean make a copy of that and then pull it down locally and work on it. Then, eventually the whole piece will be merged back in. But you don’t need access; you don’t need permission generally from a person to do that in the open source world. You can just go to the site, you can get a copy down, it’s the full history, so you can do whatever kind of inspection and stuff you want.
With Subversion usually it’s a closed system where you’re going to have read-only access, so you can look at that code, but you can’t commit to that code base, because there are access permissions. With Git, since you are making your own full copy of the repository, you can do whatever you want. It’s up to the person that you’re sending the code back to, to say "Is this code good?" - Maybe. Maybe not.
But they are doing it in their own little sandbox and that’s just giving people permission to do that, saying "Go ahead, try! And then show me what you did." That means that people will actually fix tons of bugs that they wouldn’t otherwise, because you’re reducing that barrier. You’re saying "Do you see a little problem in the documentation? That’s cool. Fork it down, make a little fix, push it up". They can do that within a few minutes, they don’t have to go through this whole patch thing and e-mail list and whatever. It would be just like "Yes, it’s right here. We use the commits over here." It’s like a 2-character change. Send that to the list and they just merge it in with one command and then the whole process is done. It just smooths the whole process out. That’s what we are focusing on - tools to make that process of collaboration and getting code back into the mainline as simple and easy as possible.
I can first give a little background. We have had Erlang/OTP as open source since 1998. Back then we started by releasing a tar file with the source. The users could fetch it and build the system. When they had some problems, they could email us on a mailing list to say what was wrong, or maybe they could provide a fragment of code suggesting a change and things like that, and it was quite hard for us. We had to work a lot ourselves to take those contributions into the mainstream.
We have been thinking of how to make Erlang more open and easier for the community to contribute to. At first we were thinking maybe we should have some open or public SVN repository or so, but when we heard about Git and GitHub, we thought that seemed to be a better match. You don’t have to open the central repository for contributions. With GitHub you can put your stuff there, everyone can fork their own copy, make changes and notify us that they have some suggested new code, and we take it into a separate branch for upcoming features.
We do regular nightly tests on that branch and it’s easy for us to see if this new contribution will cause any problems for the existing code and so on. It makes life much easier for us and at the same time it’s easier to get more contributions from the users.
Yes, much more, maybe 90% more than before. I can mention we went out on GitHub in November last year with that release and then we had a release on February 25 this year and between those 2 points we had over 50 contributions that we took in to the main release from more than 30 contributors. I think maybe before that we could count maybe 10 contributions between each release or so. It’s a huge improvement.
8. You told us that when you branch you are copying the whole repository. Isn’t it a challenge for GitHub to store all this data and to make it available all the time and also write the graphs that every user needs?
Not really. That’s what we spend our time thinking about - ways to make that possible and efficient. Git itself is quite efficient at storage. It can use a delta-compressed pack file. The way Git works is really based around snapshots. At every given point in time, when you make a commit, you are really saying "What did the file system look like at that point in time?" The next commit says "Take a picture of that." You would think there is all this duplicated storage because you are making a copy every single time that you make a commit, but Git is much more intelligent than that.
It can say "All of these snapshots are discrete points in time", but the way it stores the files is by content. If a file didn’t change from one commit to the next, it doesn’t need to store anything. Each of those commits points to the same actual file in storage. That’s one way that it gets good storage compression. Another way is that you can take a bunch of these files that represent content and the directory structures (they are called "loose objects") and pack them up into one big file called a pack file, using some really advanced compression and delta offset technology.
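Storing files by content can be sketched in a few lines of Python. The hash below follows the way Git names a blob (SHA-1 over a `blob <size>` header, a NUL byte, then the content); the `ObjectStore` class and its method names are hypothetical, and the sketch ignores compression, trees and pack files entirely.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the object id Git would give a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = blob_id(content)
        self.objects.setdefault(oid, content)  # no-op if already stored
        return oid

    def snapshot(self, files: dict) -> dict:
        """Record a commit-like snapshot as a mapping of path -> object id."""
        return {path: self.put(data) for path, data in files.items()}
```

Two snapshots that share an unchanged file end up pointing at the same object id, so the second snapshot adds nothing to storage for that file.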
Git was created by a bunch of Linux kernel hackers so this is what these guys do - they love file systems and Git is basically a file system. They know how to do this stuff and make it really fast and efficient. Ironically, a checkout of a repository and all the history in Git will quite often be smaller than just a single version of the Subversion repository. Because Subversion is scattering metadata throughout the thing with these .svn directories, it puts a lot of stuff in there. You have that stuff in among the files, whereas with Git you have all the files for the current checkout, whatever commit you are at, and then a single .git directory at the very root that has all this compressed storage.
When they were building Git, they were thinking very much about how to be able to store this stuff efficiently, because it was built for the Linux kernel, which is a very large project with a very long history. If it can work for the Linux kernel, it can work for most projects. That helps us out too, because it reduces the amount of file storage that we have to have. From a technical perspective, Git takes care of a lot of the headaches that we would have if it wasn’t quite so intelligent. That’s how we handle the storage problem. We build it out and make it as stable as we can and as fast as we can and just try to improve on that every month.
No, not in our case, because we are not typical for Ericsson in that we are releasing open source software. That’s not typical for Ericsson. So of course we have more freedom to collaborate with external parties. Internally we have been using ClearCase, a commercial version handling system, for maybe 10 years or more. When we introduced Git, we had a guy write some tricky scripts for importing from ClearCase to Git and vice versa. We are working with that now. Now we can build from both ClearCase and Git, but we are striving towards abandoning ClearCase in some future release.
We are still using Git because the internal work is also easier with Git. In ClearCase, even though it supports changesets nowadays, we haven’t really used them. We are versioning separate files, and we see a big benefit from having the changesets you can have in Git. When we are taking in contributions and running the tests, if there is some problem with one contribution it’s really easy to back it out.
It’s challenging. There are some edge cases where things will happen, where someone does something that we hadn’t really prepared for and it can cause some slowdowns on the servers or other kinds of stability problems. Technically it’s a very hard problem. It’s kind of outside the scope of what a lot of internet companies need to tackle. A lot of companies aren’t based around having to host this huge number of repositories that require a vast amount of interaction with the disk. We’ve created some technologies that help us to handle that.
As we grow, we have to tackle new kinds of things, but we’re inventing new methods to do this cheaply. We’re a bootstrap company; we haven’t taken any outside funding. We like that; we like the freedom that it gives us and we like the growth curve that that gives us right now. But it means we have to be a little bit cleverer about how we handle some of these problems. Some companies might say "We need to store a bunch of stuff on disk and make those accessible to some web interface. Let’s spend 5 million dollars on file servers." We can’t do that!
We have to be a little smarter and build some technology and actually use Erlang to do the spec and stuff. I do a lot of Erlang. We use Erlang to let the front end communicate to the backend and be able to scale each of those layers out separately. So, there is a whole Erlang infrastructure in between the web aspect and the file aspect that allows us to grow the way that we need to.
Of course, we are at the beginning, so we have to learn more about how to be really efficient in taking on new contributions from the users, so we have to have better descriptions of how to make contributions and some rules about that. We will also provide more test suites that make it easier for us to test when we take a new contribution, and the contributor also has to add tests for the new stuff he is providing. We also have plans in connection with our new website that we are working on to handle suggestions and new features and Erlang enhancement proposals in a separate Git repository, which would make it easy for users to contribute.
In this case it’s text based descriptions, but they get version handled for free and you can also format them so they are easily readable in HTML, for example if you are using the Markdown format, which is supported by GitHub. These are some of the plans. As I mentioned before, we are also striving to work with Git throughout, and only Git, for all development in our projects, in house and externally.
They use it for all kinds of things, anything text based. Just recently someone tweeted about the recipes that they were putting on GitHub. I think that’s really cool. You have some family recipes and they’re scrawled on some old pieces of paper, backs of envelopes or whatever and you end up losing them and that’s really sad. If people take the time to just type those into a text file and put them up on GitHub, they can share them, they know they are not going to lose them and as they make them and they figure out "I should have used 3 teaspoons of butter instead of 2" then they just go in there and make that little change.
It’s cool to be able to see revision by revision, like the history of a recipe. That works for a lot of different things. We’d like to support all kinds of different documents in the future. If we could do something around being able to intelligently show the changes in PDF files or Word documents or anything. Just recently we added a feature where it will show you side-by-side images that have changed. When you are looking at a commit, you’ll say "Here is the old version of the image and the new version."
You can look at them side by side and say "Oh, that’s what changed." Normally, you’re only looking at one at a time, and if you try and flip back and forth locally it’s kind of a pain. We have a bunch of features like these in the works or in our heads that we want to go for, to provide support for all kinds of documents.
Absolutely. If you go to http://develop.github.com, then you can see the documentation for the whole API. People have done all kinds of stuff with this, like getting access to the issue tracker that we have and being able to show that locally. Someone made a command line issue tracker that just uses the API and opens and closes tickets via the command line. Another guy today just put something out that’s really cool. It’s like a visualization of the relationships between people on GitHub. GitHub users can follow other GitHub users.
That means they get a feed of what that other user is doing. But you can take this information like a giant graph out of it and say "These users are connected to these other users" and then try and locate them geographically, so you can then say "Here are all the US GitHub users and here are the Japanese ones and here are the ones in Germany, the ones in Brazil". You can graph this whole thing out and get this really interesting visualization of all the users on GitHub and how these communities are and how they are geographically dispersed, how they are interacting, which is really neat.
I don’t know what the URL for that is, but if you look up "GitHub user visualization graph", you’ll find it. It’s on hacker news today, but that doesn’t do people in the future much good. But you can search for it and you’ll find it. He made a really nice visualization. You do all of that with the API. Just pull down a user, get a list of followers, follow all those and then just map this big graph out.
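The crawl described here (pull down a user, list the followers, repeat for each of them) is a breadth-first traversal. The Python sketch below shows only that traversal; `fetch_followers` is a hypothetical callable that you would back with whatever API client you use, and no real GitHub endpoint or response format is assumed.

```python
from collections import deque


def crawl_followers(start, fetch_followers, max_users=100):
    """Breadth-first crawl of the follower graph starting from one user.

    fetch_followers(user) must return an iterable of usernames following
    `user`. Returns a list of (follower, followed) edges.
    """
    seen = {start}
    queue = deque([start])
    edges = []
    while queue:
        user = queue.popleft()
        for follower in fetch_followers(user):
            edges.append((follower, user))
            # Cap the number of users we expand so the crawl terminates
            # on a large graph.
            if follower not in seen and len(seen) < max_users:
                seen.add(follower)
                queue.append(follower)
    return edges
```

For a quick trial, an in-memory dictionary can stand in for the API, for example `crawl_followers("alice", lambda u: {"alice": ["bob"]}.get(u, []))`, where the usernames are made up.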
That would be trickier now. You wouldn’t use the APIs so much for that. You would just say "Here is a repository that I’ve already created and I’m going to use it as some kind of storage mechanism." At that point, it’s not really a good fit, though we do have some plans to have some kind of application platform on top of GitHub, so that you could build whole features that users could then sign up for and that would add widgets to the interface. It would enable you to do all kinds of stuff, like continuous integration, showing you when you deployed, all this kind of stuff. Just insert that right into the user interface, right where your code is.
You are on GitHub already looking at your code and stuff; it would be nice to be able to see different kinds of events or have different kinds of controls in there. If you wanted to build some layer, say a really nice recipe interface that gives you a custom view of a certain type of file, maybe you could implement that as well. At that point, people are really just using our infrastructure, but they are putting a layer on top. That’s something that we’re going to do in the future. Right now, you’d have to be clever about how you did it.
To explain what we needed to do: the site is written mostly in Ruby and Ruby on Rails. We were using a shared file system, Red Hat GFS, for this and it wasn’t really a good match. Red Hat GFS in our installation had some problems when we were trying to build it up, just with the number of nodes that were connecting to the shared file store. We wanted to move to a federated strategy, so that you’d have a number of discrete file servers and users would be federated on these file servers by their username.
We had a separate web layer that was doing the Rails work and also running background jobs for long running stuff. The question was how the Rails app gets the information about the Git repositories without having too much knowledge about where that user is. That should be automatic. We decided in our architecture that it goes through a proxy. There is a library that we have called Grit, which is the Ruby Git binding that Scott and I have written. We wanted to take this and basically separate it into 2 parts: one of them runs on the Rails side and one of them runs on the file servers. It was a simple proxy in between, like a stub paradigm.
Just take a certain class and any method that it has is going to turn into a remote procedure call to the remote location. That’s step 1. Step 2 is once you know what user this is, how do you target the correct file server that the user is on? We decided to put a proxy in the middle that intercepts this request, decodes it, looks at the user name and then looks it up in a routing table (we use Redis for that; Redis is a really nice persistent key-value store - very fast), gets the file server for that username out of there and now you know which file server you’re going to go to.
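The two steps (a stub that turns method calls into remote calls, and a proxy that routes by username) can be sketched in Python. Everything named here is hypothetical: `RemoteStub`, `RoutingProxy`, the server names and the request tuple are illustrative only, a plain dictionary stands in for the Redis routing table, and in-process functions stand in for the network hop to a file server.

```python
class RoutingProxy:
    """Looks up which file server holds a user and forwards the request."""

    def __init__(self, routes, servers):
        self.routes = routes      # username -> server name (Redis stand-in)
        self.servers = servers    # server name -> callable handling requests

    def __call__(self, username, request):
        server_name = self.routes[username]
        return self.servers[server_name](request)


class RemoteStub:
    """Client-side half: any method call becomes a request to the proxy."""

    def __init__(self, module, username, transport):
        self._module = module
        self._username = username
        self._transport = transport

    def __getattr__(self, function):
        # Only reached for names not defined on the stub itself, so every
        # unknown method is packaged up as (module, function, args).
        def remote_call(*args):
            request = (self._module, function, args)
            return self._transport(self._username, request)
        return remote_call
```

The calling code never mentions a file server: it builds a stub for a user and calls methods on it, and the proxy decides where the request goes.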
At that point it becomes a transparent proxy. The Erlang comes in on the server side. On the server is a project called Ernie that I’ve written, that is an RPC server for a specific protocol called BERT (Binary ERlang Term), which I also created. It is a very simple RPC packet structure and it has an RPC protocol spec around that. Ernie sits there and accepts these incoming BERT RPC requests, which happen to be just "Run this Git command and send the data back". What happens is that the request comes in and is then sent off to a Ruby handler. So there is an Erlang server that has spawned a number of Ruby processes, because it used to be Ruby and we want to reuse the Ruby that we’ve written to do the backend information gathering from the repositories.
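To give a feel for how small such a packet is, here is a minimal Python sketch of framing a call. It rests on assumptions not stated in the interview: that a request is the tuple {call, Module, Function, Arguments}, that it is encoded in Erlang's external term format, and that each packet is prefixed with a 4-byte big-endian length. Only atoms, integers, binaries, tuples and lists are handled, and the module and function names (`nat`, `add`) are placeholders.

```python
import struct


class Atom(str):
    """Marks a string that should be encoded as an Erlang atom."""


def encode_term(term):
    """Encode a small subset of Erlang's external term format."""
    if isinstance(term, Atom):
        name = term.encode("latin-1")
        return b"d" + struct.pack(">H", len(name)) + name       # ATOM_EXT
    if isinstance(term, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(term, int):
        if 0 <= term <= 255:
            return b"a" + bytes([term])                          # SMALL_INTEGER_EXT
        return b"b" + struct.pack(">i", term)                    # INTEGER_EXT
    if isinstance(term, bytes):
        return b"m" + struct.pack(">I", len(term)) + term        # BINARY_EXT
    if isinstance(term, tuple):
        body = b"".join(encode_term(item) for item in term)
        return b"h" + bytes([len(term)]) + body                  # SMALL_TUPLE_EXT
    if isinstance(term, list):
        if not term:
            return b"j"                                          # NIL_EXT
        body = b"".join(encode_term(item) for item in term)
        return b"l" + struct.pack(">I", len(term)) + body + b"j" # LIST_EXT
    raise TypeError("unsupported term: %r" % (term,))


def frame_call(module, function, args):
    """Build a length-prefixed {call, Module, Function, Args} packet."""
    term = (Atom("call"), Atom(module), Atom(function), list(args))
    payload = b"\x83" + encode_term(term)   # 131 is the format version byte
    return struct.pack(">I", len(payload)) + payload
```

A receiver reads 4 bytes, learns how many more bytes to read, and then decodes a single term, which is what keeps a server for this kind of protocol simple.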
The Erlang server (Erlang is really great for writing servers, so it’s written in Erlang) spawns and maintains a group of Ruby processes that do the actual work of talking to the Git repositories. Then, every request gets load balanced to one of those Ruby handlers, which does the work and sends the result back. That gets turned back into a BERT RPC response, which goes all the way back to the front end, where the other half of Grit is waiting and receives it (usually it’s binary information), then just parses it and does what it needs to at the Rails layer. That’s the whole flow through. Erlang is instrumental in handling the distribution of these calls.
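The idea of spreading requests over a fixed group of handlers can be sketched as follows. This is not how Ernie is implemented: the real server is written in Erlang and manages Ruby processes, whereas this Python sketch uses in-process callables, and the round-robin policy and the `HandlerPool` name are assumptions chosen for illustration.

```python
import itertools


class HandlerPool:
    """Hands each incoming request to the next handler in turn."""

    def __init__(self, handlers):
        self._handlers = list(handlers)
        if not self._handlers:
            raise ValueError("at least one handler is required")
        # Endless 0, 1, ..., n-1, 0, 1, ... sequence of handler indexes.
        self._turn = itertools.cycle(range(len(self._handlers)))

    def dispatch(self, request):
        """Return (handler index, result) for one request."""
        index = next(self._turn)
        return index, self._handlers[index](request)
```

A production pool would also have to restart handlers that crash and skip ones that are busy, which is the kind of supervision the interview credits Erlang with making easy.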
We do tens of millions of RPC calls every day. We need something that’s really solid, very reliable and is going to handle this kind of traffic and be easy to write. Erlang makes writing those kinds of servers just ridiculously simple. I would never write a server like that in any other language; Erlang is so perfect for that.
Yes. I think we got in contact because we heard of you using Erlang and got interested. Then they also visited us in Stockholm.
Yes. We came to Stockholm for the Erlang Users Conference there and met with these guys and showed them a little bit more.
I think that was the day when we took the final decision to go with GitHub. It could be any other Git repository as well. Another positive effect of having the Erlang source distribution on GitHub, I think, is that it means you will also have many other Erlang projects on GitHub. Maybe GitHub will be the major repository for open source Erlang based projects.
It’s quite common that when a framework chooses a version control system, a lot of the projects written in that framework or using that framework will follow suit, because they’re creating an ecosystem based around a selection of technology. It’s pretty common and we’d love to see that. The more projects rally around frameworks that are on GitHub, the richer the community becomes, and everyone can just work together more easily overall when that happens.
I think it will be easier to communicate to other users when you have many products hosted in the same form. You know how to handle them or how to contribute to them and so on.
It’s making a standard around the whole process. Choose a system, build a community around it and as long as everyone is following the same procedures then the whole thing just gets way easier. We like to think that we’re helping that whole process out.
You take each one of those in a series and you commit those to a Git repository, recreating a history - just the releases, but the release history in Git repository for each one of those projects and put them up on GitHub. Now there is something like 20,000 Perl repositories out there that come from CPAN that people can now look at and see the evolution of them through their history, which you couldn’t do before. You’d have to pull down every single tarball and look at them. This is a really nice thing that someone’s done for the Perl community.
I think that helps a lot of people feel comfortable putting their Perl code up on GitHub. Python community is coming on board, Erlang community is coming on board, there is Java, there is Scala, just anyone who’s working on code and wants an easy way to deal with that and collaborate on that. They’re going to be looking for those solutions and we hope they end up on GitHub because we think it’s the best way for them to ease that.
Because you can search for languages in GitHub, Erlang pops up as number 17 in popularity. How do you calculate that figure and how is it updated? How can I get a better ranking?
I know that we have for example .py files, but it’s not Python.
That’s problematic right now because it’s based on file extensions, since that’s really easy and fast. In the future we might want to refine this. We could have some sort of heuristic for looking at the actual content of a file and saying "This isn’t Python. Maybe it’s Erlang, because it also has a file type that ends in that." Right now we don’t have the sophistication to do that, but in the future, as one of us takes the time to actually do it, if it becomes really problematic, then we’ll do it. We are going to do it eventually.
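An extension-based count of the kind described is easy to sketch, which also makes its weakness visible. The mapping table and the function name below are made up for illustration and are not GitHub's actual list or ranking formula.

```python
import os
from collections import Counter

# Hypothetical extension table; a real one would be much longer.
EXTENSION_TO_LANGUAGE = {
    ".erl": "Erlang",
    ".hrl": "Erlang",
    ".rb": "Ruby",
    ".py": "Python",
    ".pl": "Perl",
}


def count_languages(paths):
    """Count files per language, judging only by file extension."""
    counts = Counter()
    for path in paths:
        extension = os.path.splitext(path)[1].lower()
        language = EXTENSION_TO_LANGUAGE.get(extension)
        if language is not None:
            counts[language] += 1
    return counts
```

Because the content is never inspected, a file whose extension happens to match another language's is counted for the wrong language, which is exactly the misclassification discussed here.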
We don’t have a lot, so it’s not problematic.
19. You are talking about looking inside the files. What about letting users define a merge that is specific to a language? It has more of the context of the language built in and it makes the merges easier. I know that in Erlang this is a function and that is a function, so I’m not reasoning in lines of code but rather in the language itself, and maybe it will help me merge more easily. Are there any planned features to do this kind of thing?
Not currently. That level of sophistication like content aware or method aware, like having some semantic knowledge of what is going on in the files, not really. That would be a Git feature. If someone is really interested in doing that for a language, they might be able to make something and propose that to the Git mailing list and say "Hey, there is this really cool way for us to know more about the structure of a file and do merges more intelligently", then people could install that and have that kind of feature.
Stuff like that is possible, it just takes initiative. Git is an open source project, so anyone can contribute to it. If they produce something of quality that a lot of people will find useful, then it could end up in the core as well.
There are already a bunch of visualizations. There is something called the network graph which will show you all of the commits visually. You can fork a project on GitHub and you get a copy. At that point it’s the same and it won’t show up in the network graph. Once you make a commit on the repository that is unique to that repository, like a change that you want to end up back in the mainline some day, you push that change back up to GitHub to your fork. That will show up in the network graph as well. I believe it will show you which commit it was branched off of.
It will stack a bunch of repositories up this way and say "OK, here is the main one and then here is another guy that has this branch and another guy that has this branch." We have a bunch of stuff like that; there are a bunch of graphs. I actually just converted the network graph from Flash to Canvas and now it will work better for people on Linux, because the Flash plug-in is not very well supported there. As time goes on, we want to improve the kinds of visualizations that we have. We recently released a compare feature, which is a way to have an aggregate comparison of a set of commits.
You say "From commit A to commit B, show me all the files that were changed and a combined diff of what was changed" and allow people to make comments on that. Eventually this is going to get worked into a brand new pull request system. A pull request is what forks issue to the owner to say "Here is some stuff that I changed. You should look at it and maybe merge it in. It’s ready to be merged in." We’re redoing that whole procedure right now to say "Here is a compare view. Transform that into a pull request."
Now you’ve got a first class thing on the site that lets you have a discussion around that change set as a whole and be able to very strictly define the beginning and ending of that change set. We’re always thinking about ways to make that process smoother. That integrates with the issues feature, the issue tracker so that you can make comments on either one and they’re reflected in both. You can have discussions around them and you can copy out snippets of code and have discussions.
Then they’ll show back up on the compare view embedded as inline comments in the file. There are all kinds of just nice techniques to make coding easier. We’re also working on a code review feature that will take that another step further and allow people to have workflows around actually accepting and having better discussions on pull requests.
Do you think those features can be used by us, for example, if we receive a contributed commit and we want to make some comments on it and ask the contributor to change things? Can we use those features?
Yes, exactly. It remains its own thing - a pull request is a full page on the site and you can keep track of them. They are listed and you can always know the status of a specific change set made of that series of commits. That’s really going to help people take this contribution mechanism to the next level. Right now it is a little hazy what specifically you want someone to merge in, but this is going to make it first class and allow really rich discussions around very specific chunks of code, and make that whole process more explicit and easier for people to manage. Absolutely you are going to be able to use that.