Lessons Learned Working with Distributed Systems

When working with distributed systems, problems such as partial failure are going to happen. Preparing for these problems and other challenges, rather than hoping they will not occur, is the best thing you can do, Vaughn Vernon explains in a conversation with InfoQ. He refers to a blog post by Jeff Hodges, noting its down-to-earth approach and practical advice aimed at developers less experienced with distributed systems.

For Vernon, author of Implementing Domain-Driven Design and the new Reactive Messaging Patterns with the Actor Model, two of the best recommendations by Hodges are to design for partial availability and to use capped exponential backoff to restore full operation when dependencies become unavailable. It's the best you can hope to do when failure strikes, and it will strike, Vernon notes.
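
Capped exponential backoff can be illustrated with a minimal Python sketch. It is not taken from Hodges's post: the function names, the default values, the use of ConnectionError as the retryable failure, and the random jitter on each wait are all assumptions made for illustration.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, attempts=6):
    """Return the upper bound for each wait: base * 2**n, never above cap."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def call_with_backoff(operation, base=0.1, cap=5.0, attempts=6, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with capped backoff.

    Hypothetical helper: the retryable exception type and the jitter
    (a random wait between 0 and the capped delay) are assumptions.
    """
    delays = backoff_delays(base, cap, attempts)
    for attempt, delay in enumerate(delays):
        try:
            return operation()
        except ConnectionError:
            if attempt == len(delays) - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(random.uniform(0, delay))
```

The cap keeps a long outage from pushing the wait between retries to unbounded lengths, so the caller resumes full operation soon after the dependency comes back.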

Hodges has found that new developers often think latency is what makes distributed computing hard, but for him the key differentiating factor is the higher probability of failure, especially partial failure. He therefore recommends finding ways to be partially available. He uses a well-designed search system as an example: when a search times out, the results gathered up to that point should be returned, thus increasing the system's resilience.
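
The search example might look roughly like the following Python sketch. It assumes a search fanned out over several backends, each modelled as a plain callable; the function name, the "shards" parameter, and the returned completeness flag are hypothetical and do not come from the blog post.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def search_with_deadline(shards, query, timeout):
    """Query every shard in parallel and return what arrived before the deadline.

    Hypothetical sketch: each shard is a callable taking the query and
    returning a list of hits. Returns (results, complete), where complete
    is False if any shard timed out or failed.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(shards)))
    futures = [pool.submit(shard, query) for shard in shards]
    done, _ = wait(futures, timeout=timeout)

    results = []
    complete = True
    for future in futures:
        if future in done and future.exception() is None:
            results.extend(future.result())
        else:
            complete = False  # late or failed shard: answer with partial results

    pool.shutdown(wait=False)  # do not block the response on slow shards
    return results, complete
```

Returning the flag alongside the results lets the caller tell the user that the answer may be incomplete instead of failing the whole request.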

For Hodges, one of the basic building blocks for creating robust systems is a backpressure mechanism, in which a serving system signals failure back to the requesting system to prevent overload. Common ways of implementing this include dropping messages or returning errors before handling a request that is likely to fail.
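
One simple form of backpressure, returning an error before a request is handled, can be sketched in Python with a bounded queue. The class and exception names are invented for this illustration, and a real server would hook this into its own request handling.

```python
import queue


class Overloaded(Exception):
    """Signals the caller that the server is refusing work for now."""


class BoundedServer:
    """Hypothetical server front end that rejects work once its queue is full."""

    def __init__(self, capacity):
        self._pending = queue.Queue(maxsize=capacity)

    def submit(self, request):
        """Accept a request, or fail fast instead of queueing without bound."""
        try:
            self._pending.put_nowait(request)
        except queue.Full:
            raise Overloaded("queue full, retry later") from None

    def process_one(self, handler):
        """Handle the oldest pending request, freeing one slot in the queue."""
        return handler(self._pending.get_nowait())
```

Because the rejection happens before any work is done, an overloaded server spends almost nothing on requests it could not have completed anyway, and the requesting system gets a clear signal to slow down.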

Hodges advises against coordination between servers wherever possible. Instead, he prefers independent servers that keep communication to a minimum. Whenever two servers have to agree on something, the service becomes harder to implement.

For Hodges, finding higher-level business logic that can be extracted into services has several benefits. An extracted service provides increased encapsulation and allows for simpler and faster deployment of code changes. He also thinks that, with multiple clients, the coordination cost of using a service is lower than that of a shared library, which requires a coordinated deploy to all clients.

Hodges also describes several other lessons he has learned during his career, including using feature flags for rolling out infrastructure and factors to consider when choosing an identity space for a system.
