BT

New Early adopter or innovator? InfoQ has been working on some new features for you. Learn more

Natalia Chechina on Scaling Distributed Erlang with SD Erlang
Recorded at:

| Interview with Natalia Chechina by Greg Young on Feb 13, 2015 |
10:44

Bio Natalia Chechina received a PhD degree from Heriot-Watt University, UK in 2011. She is now a Research Associate at the University of Glasgow and leads Scalable Distributed Erlang work package (WP3) in the RELEASE project. Natalia’s main research interests are distributed and parallel computing, Erlang, mathematical and theoretical analysis, stochastic modelling, and overlay networks.

Sponsored Content

Begun in 2012 this now annual conference hosted in Vilnius, Lithuania brings the best of the developer world to the Baltic's. The overall theme is building stuff, we have a heavy focus on lessons from trenches from the people that were there.

   

1. I’m sitting here at Buildstuff with Natalia who is going to talk to us a little bit about SD Erlang. So talk a bit about yourself first Natalia.

Well, I’m Natalia Chechina and I work at the University of Glasgow, I’m just an research associate there, I work on the Release project that aims to scale Erlang. Generally it aims to scale the distributed actor model to work on a scale on something like 10^5 cores and we tackle that from different angles, so first one is VM, then another is the language level. With the VM it's about on how many cores a single VM can work, then it’s about language level, so how many VMs can actually work together and the last one is about tools, so what kind of tools we use to profile, to debug our programs on that scale. So at Glasgow we work with the language levels, so how many VMs can work together and for that we use SD Erlang, so that’s our work.

   

2. But I thought Erlang was some special magic sauce that you can just put in and it automatically becomes scalable and wonderful. Are you suggesting it's not?

Well it is a beautiful language I think and scales really well, a single VM can easily handle millions of processes and it scales well, it depends of the application very much; it depends how many nodes you have there, first of all, and how many global operations you have, so there are applications like, for example, games, and they scale well up to 16 nodes and then they hit a problem. There are other applications that scale up to 60 nodes but if the benchmark is really easy and you don’t have global operations and things like that, you can very easily scale to 200-300 of nodes and the throughput will grow and the scalability works great, but when you go beyond that and then we have problems and the first problem is about connectivity all to all connections, it does take resources for nodes to hold all those connections because there are heartbeats.

Then another problem is the name space, if you use global operations something like module global and you need to register those names, then all operation inolved and the throughput it actually falls quite dramatically as soon as you start introducing those operations. So from this point we tackle it, I mean our community didn’t experience those problems something like 2-3 years ago, it’s now that we are starting hitting and people start introducing different approaches, so there are for example workarounds for it, so you can just switch off transitivity and then you can have just nodes directly connected to each other and controlling what sort of connections they have and then you will maintain fewer connections, of course your system will be more scalable or you can, for example there is spapi_router, it was developed by Spill Games, they also reduced all sorts of connectivity through the use of hidden nodes and direct connections.

So there are workarounds around that, but those transitive connections and common name space they are really, really good things, because they are about fault tolerance, about scalability, about elasticity, so it’s really easy to spot that the process died and the node died and it’s also very easy to add a node, so you don’t need any special program, any special code writing that would add a new node and all other nodes know about that, just add a node and all the remaining nodes know about it, so our target, we call it SD Erlang, this is actually just a small extension of existing distributed Erlang, is just to enable, to keep all those good things, so keep transitivity, keep common namespace, keep the Erlang philosophy which is really great I think, which is “let it crash” and share-nothing, and at the same time allow to scale, so that’s why SD Erlang we introduced to these groups when within a group you have all-to-all connections and common name space but these namespace and common connections are not shared between groups, so that’s our approach.

   

3. One thing I’ve noticed working with Erlang in the past, Erlang really feels like it was, in particular OTP, feels like it was developed really with the focus on operating on a LAN, and what have you guys done with SD Erlang in terms of looking at WAN and environments, and multiple data centers as an example?

Right, the project itself was written in such a way that we assume that we still work behind a firewall, so for example when we work with the scalability and we right now experiment we think about Blue Gene which has 65.000 cores or we work with multiple servers but they are still in the same location. The thing that we did was developed by Erlang Solutions and this tool is called Wombat and it’s about deploying multiple nodes. It’s not only about using SD Erlang, it’s about using just normal distributed Erlang, or whichever Erlang you like, and it just, this is the tool that will help you to deploy your node in different clusters, in different clouds providers. Other than that we're focusing on scaling but again behind a firewall.

   

4. If I’m today using OTP, what steps do I need to go through if I want to move to SD Erlang?

Well I suppose if you have an application, it depends if you have an application already running, if you have an application running and you want to refactor into SD Erlang that you would need to use tools and for that there are tools, Devo, developed at the Kent University and there is Percept 2 also developed at Kent and ICCS in Greece. So that will help you when you want to move from distributed Erlang to SD Erlang, it means that you will reduce the connectivity, you will group your nodes and the first thing you will need to do is how you want actually to group those nodes, so what’s the motivation for that, do you want to group nodes because of the locality, so nodes that are on the same location, for example on the same cluster, they group to the same group, or is it because they communicate quite often with each other, so first thing is to decide and to do that Devo is the tool that will help you to identify that because it will show you how much nodes communicate with each other and things like that.

Then if you partition or think how you want to group to make those groups, how you want to set them up, and so the next thing you will need to think about gateways, so because I suppose the next step will be to think about the structure, how you want your groups to be grouped, to be connected to each other. So you can think about a hierarchical structure, something like a tree structure, so on the bottom we have the nodes that actually do the work and then you go to the top, it’s about the communication, how they communicate like gateway nodes or you can think for example something like a circle, like a ring, so we can think about a different natural topologies, how you want to structure your groups. The next thing will be introducing those gateways and how you want them to move, so there is some work to do with the refactoring, it’s not pressing one button and everything will be done magically, but if we work with a large scale, I don’t think anything happens magically pressing one button and I think if you really need scalability and network traffic and global operations, some namespace, something that stops you from that, I think it’s a really, really good idea to move to SD Erlang, for example to enable scalability.

   

5. If I want to find out more about SD Erlang, where would I go to find more information?

Right, so we have the source I would say, it’s the Release website [Editor's note: http://www.release-project.eu/ ] , so Release is the project that we are working on, we have 8 partners that work on this thing, so there is 3 industrial partners, EDF, Erlang Solutions and Ericsson, five Universities Heriot-Watt, Glasgow University and Kent in the UK, ICCS in Greece and Uppsala in Sweden, so we all work together on one project that’s called Release and there is a website and we keep up to date information, you can find all the deliverables, most of the deliverables are open source because the project was funded by the European Union. The code, most of the code is also open source, so all our benchmarks, SD Erlang that is a branch for it, so it’s open source, you can just download, have a look at it. Yes and a lot of benchmark that we use, all information is there, publications, we do talks, just go to the website and just find information about it there.

Greg: Ok great, thank you Natalia!

Thank you!

Login to InfoQ to interact with what matters most to you.


Recover your password...

Follow

Follow your favorite topics and editors

Quick overview of most important highlights in the industry and on the site.

Like

More signal, less noise

Build your own feed by choosing topics you want to read about and editors you want to hear from.

Notifications

Stay up-to-date

Set up your notifications and dont miss out on content that matters to you

BT