Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage News Building Distributed Systems - Technology Considerations

Building Distributed Systems - Technology Considerations

The success of the RICON distributed systems conference in Las Vegas this week, is a testimony to the importance of big applications in industry today. The presentations were well attended and the audiences were clearly engaged. The conference is hosted by Basho Technologies, maker of Riak, the well-known distributed NoSQL database. But RICON is not a Basho conference; it is a distributed computing conference, composed of presentations from industry and academia. We are talking with Tyler Hannan, Basho's director of technical marketing, and Dorothy Pults, senior director of technical and product marketing, about considerations in building distributed systems and technical lessons learned at the conference.

Tyler: I think whenever you consider the topic of distributed systems, you have to consider the definition. The definition I like the most is from Leslie Lamport who said “a distributed system is one in which the failure of a computer you didn't even know existed, can render your own computer unusable.”

It’s important to understand, that RICON is a distributed systems conference for developers; it's not a Basho show.

As we look through the conference topics and participants, there are two primary themes. One is the key collaboration between industry and academia, and this lends itself to a variety of topics.

The other theme is ease of operations. As industry continues to recognize the power available in distributed systems deployments, it must also ensure the system  is easy to manage and easy to maintain, essentially allowing operations to sleep well.

So those are the themes here, ease of operations, and collaboration between industry and academia.

RICON is focused on both systems and technologies. Basho benefits greatly from academic research, and we give back by sponsoring this conference as well as contributing to several research projects, for example the SyncFree CRDT project dealing with large-scale computation without synchronization. So when you bring those two audiences together, the entire community learns.
Dorothy: A lot of large companies, like Google and Facebook, have huge research teams trying to solve new and evolving problems of distributed systems. One of the benefits of RICON is that companies that don’t have these internal research teams get together with academia to learn and discuss the most recent issues and research. Like anything that’s new, understanding the breakthroughs happening in academia, and how to implement those for applications, can actually help solve real problems, which is an extremely important step. While RICON is a single event, the collaboration between industry and academia continues long after the event itself is over.

InfoQ: You mentioned CRDTs. Can you explain what that is?

Tyler: CRDTs are used for presenting data types that are resilient in an eventually consistent environment. The definition of “C” varies and you hear it described as “conflict-free” or “commutative” or “convergent”, but the RDT is “Replicated Data Types”.

In a distributed environment modeling for conflict resolution is more complex than in a traditional relational database. The Riak CRDT implementation is exposed as Riak Data Types (in 2.0). So directly from Riak APIs I can model data in my app without worrying about building conflict resolution into my client. So for the developer it is easier to develop and deploy than before, and for us, at Basho it is important that the academic research is provably correct and drives our underlying implementation.

Distributed systems are designed -- to a degree -- just as any system, with an emphasis on monitoring, measurement (which is not the same thing), and implementation.

So the tools, while important, vary wildly based on the organizations preference. And so the underlying theme of organization investment in the distributed system is what becomes most compelling.

InfoQ: Let’s talk about process. I want to build a wide-ranging distributed application, what goes first? Do I start with coding the business logic, or do I start with the clustering from the get-go?

Tyler: It’s a very good question.

We have seen companies adopt both approaches, and to be clear either approach will work. But the ones that tend to be most successful in the quickest amount of time are the ones who consider the requirements of fault tolerance, availability, operational simplicity, in their tool set choices, and then begin to implement. Once a distributed environment is up and running and it has become business critical, storing what I would call “critical data”, it becomes a greater investment to migrate to a new system. So we see the people who take the attributes of the system that they’re interested in, such as fault tolerance, scalability, operational simplicity, when they consider those choices as part of the design phase, it tends to be a smoother path to deployment.

InfoQ: In standard development I am using tools like build tools, source control, continuous integration tools, are these the same tools we would be using in building distributed systems?

Tyler: Yes they are, and I think that’s one of the interesting things about distributed systems generally, is that while there’s different behavior for the engineer and different things to understand, much of the toolset remains the same. It’s just that what they’re connecting to, such as Riak as a database, is designed to function in a distributed environment. And that comes with some interesting things to consider, but I don’t have to abandon the development language that I prefer and know and trust to interact with Riak, I am able to use the tools I am using today to interact with Riak, as an example of a distributed system.

InfoQ: Let’s talk about testing. For a smaller application I might use JUnit and mock objects, and faking tools to fake up say an email server. But in a distributed system there is a lot more opportunity for non-deterministic behavior. So how do you test for that in a large scale distributed system where you have a heavier load and a lot more opportunity for things to go wrong?

Tyler: There’s a variety of approaches ranging from tools like QuickCheck , which can do deterministic testing of code based on test specifications instead of “test cases”.

I think what is interesting is that you can begin to separate the testing of what I would call the application tier from the persistence tier. I can still test my application UI/UX much as I would in any traditional sort of unit test/continuous integration environment. And then I can begin to test the fault tolerance of my system through a variety of approaches. Chaos Monkey is one that's extraordinarily well known in industry. But also we have customers who do their own version of that by simply turning off network routes in a contained test environment, ensuring that their expectations around failure conditions is met by the products that have been chosen.

Other than that, testing is difficult to provide a standard formula because it depends on the toolset that the organization has adopted. My testing approach if I am an Akka, Scala, Java JVM organization may be different if I am a Python shop and very different for Erlang. And so what’s been interesting about RICON, as people talk about testing systems at scale, and there’s an excellent talk by Ines Sombra from Fastly on “Testing in a Distributed World”. Rather than providing toolsets she discussed the importance of approaches. Ultimately, the approach and mindset is what matters whereas the implementation of the toolset level differs depending on organizational preference.

InfoQ: What about monitoring tools?

Tyler: There’s a variety of monitoring tools, and all of the favorites, from Boundary to Splunk to internal implementations all work very well. The key thing in a distributed environment, and Matt Davis from OpenX shared about this in his session, the key thing is that you have more to monitor, particularly as the dataset grows. OpenX for example has several hundred Riak nodes in their production clusters strewn throughout the world. And you not only monitor the typical health of a machine, you monitor the level of concurrency that's happening, from my client talking to my database cluster and the sizes of objects that are being created. So there's more to monitor, but much of the tool chain remains the same as before.

InfoQ: But in a clustered environment, it’s more important for me to understand the path of a query. Are there tools that allow me to monitor the full topology of my request/response?

Tyler: Usually what we see people in industry do in that situation is to monitor the application, monitor the communication between the application and the cluster, monitor the cluster’s communication with itself, and monitor the health of the machines in the cluster. From that set of data you’re able to make inferences about the health of the entirety of the environment. However, monitoring at scale becomes more difficult, and the characteristics of tolerance and availability in the initial decision process become so important. Because, at some point, I am going to have to keep the system that I built running.

InfoQ: In traditional applications we frequently plan around one thing breaking. But in a large distributed system, there is a strong likelihood that several things will break, so how do you plan for that?

Tyler: In Riak, the underlying Erlang implementation actually allows you to hot-swap code, so that if you need to patch an environment, you can bring that patch into a running cluster without having to take anything down. It’s also important that the way the cluster is designed allows machines to fall out, transparently to the application, and then you can bring those up when it makes sense. So a machine in a cluster failing is not an issue immediately.

InfoQ: In a traditional enterprise application we might have a distributed cache backed by a relational database. Is that now replaced by a NoSQL database, taking care of your persistence as well as your caching, in one tool?

Tyler: It depends. There can still be scenarios, depending on the NoSQL tool chosen and depending on the read/write profiles, where a caching layer in front of a NoSQL solution may make sense. I may choose to put Redis in place because I have some very beefy machines with a lot of memory, and I want to store my entire key-set in memory. And then on cache-misses go and get it from the NoSQL database. In many cases we see customers just implement Riak to replace that standard n-tier-ish style of cache persistence as separate layers.

InfoQ: As you pointed out, monitoring is different from measuring.

Tyler: The key difference between measuring and monitoring is that monitoring, to me, implies action, whereas measurement implies a trend over time. This came out in the RICON session with Matt Davis from OpenX where he talked about the importance of looking at the health of your distributed system from the cluster to the application historically over time, whereas if I am monitoring I need to alert about specific conditions at a certain time. But the tools for measuring are generally the same as for monitoring.
Putting on my Basho hat, we are always looking at ways we can improve and provide greater telemetry out of a cluster than are available. That is keenly important to us as a company that prides itself on operational simplicity, and continues to invest in it. 

InfoQ: Is there any guidance for implementation?

Tyler: When you consider the existing development environments and tooling in the enterprise, there’s a reason that Basho provides .Net and Java client libraries… we recognize the importance of providing familiar ways of interacting with a distributed database. Similarly, there is a reason we have Ruby, Python, and many other libraries.  The richness of choice is important.
The maturation of the distributed systems market, provides companies the opportunity to look at the patterns of success that are happening. For instance, here at RICON, David Pick of Braintree (a PayPal company) spoke about Apache Kafka, the high-volume messaging broker, and how they used that with Clojure for fault-tolerance at high volumes. Alex Heneveld, CTO of Cloudsoft, spoke about using Apache Brooklyn “blueprints” for managing cloud deployments. In a talk I personally loved, Michal Ptaszek from Riot Games described — in great detail — how Riak powers chat for League of Legends scaling to 27 million players daily.
Distributed Systems definitely can require more thought and extra care in technology selections.  Events like RICON afford those of us building and deploying these systems to learn from each others experiences.

Rate this Article