Microservices and Site Reliability Engineering

by Mark Little on Apr 29, 2018

Over recent years we've discussed the role of Site Reliability Engineering (SRE), and particularly how it has grown from being the domain of companies such as Google to being an expectation within companies in other sectors, such as finance and healthcare. Recently, technology journalist Alex Handy wrote about how SREs and microservices architectures fit together:

[...] while SREs and microservices evolved in parallel inside the world’s software companies, the former actually makes life far more difficult for the latter.

For Handy the reason for this is fairly clear:

[...] SREs live and die by their full stack view of the entire system they are maintaining and optimizing. The role combines the skills of a developer with those of an admin, producing an employee capable of debugging applications in production environments when things go completely sideways.

Handy goes on to cover some of the background of SRE, using Google as an example of how the function works at scale. He quotes Todd Underwood, one of Google's SRE directors, on the practices and systems Google has put in place to help development groups consider reliability and availability, as well as on technology approaches such as using Paxos for consensus in their distributed systems.

Underwood highlights another aspect of the SRE job that is essential, here, however: visibility. When microservices are throwing billions of packets across constantly changing ecosystems of cloud-based servers, containers, and databases, finding out what went wrong where is essential to troubleshooting any type of problem. This is where the full stack aspects of an SRE’s job come into place.

According to Morgan McLean, one of the product managers at Google, the key here is monitoring and traceability of microservices, something others have stated in the past and we've covered elsewhere. In his article, Handy mentions a few new tools Google has released to help tackle the problem:

[...] Google recently released Stackdriver Trace, Stackdriver Debugger, and Stackdriver Profiler. There’s a reason these tools sound like old-school testing and operations tools from traditional enterprise vendors: they perform the more traditional troubleshooting tasks developers and operations people are used to, but with a focus on microservices and performing these duties in the cloud.
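To make the idea of traceability across microservices more concrete, here is a minimal sketch of how a trace identifier might be passed from one service to the next so that the work done for a single request can be stitched back together. It is an illustration only and is not how Stackdriver Trace or any other product is implemented: the header names, the `Tracer` and `Span` classes, and the two example services are all hypothetical, and the "network call" is just a function call.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical header names chosen for this sketch; real systems define their own.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"


@dataclass
class Span:
    trace_id: str             # shared by every span in one end-to-end request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # span that caused this one, None at the entry point
    service: str


@dataclass
class Tracer:
    """Collects spans in memory; a real tracer would export them to a backend."""
    spans: List[Span] = field(default_factory=list)

    def start_span(self, service: str, headers: Dict[str, str]) -> Span:
        # Reuse the caller's trace ID if present, otherwise start a new trace.
        trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
        span = Span(trace_id, uuid.uuid4().hex[:16], headers.get(PARENT_HEADER), service)
        self.spans.append(span)
        return span


def outgoing_headers(span: Span) -> Dict[str, str]:
    # Headers a service attaches to its downstream calls.
    return {TRACE_HEADER: span.trace_id, PARENT_HEADER: span.span_id}


def inventory_service(tracer: Tracer, headers: Dict[str, str]) -> str:
    tracer.start_span("inventory", headers)
    return "in stock"


def checkout_service(tracer: Tracer, headers: Dict[str, str]) -> str:
    span = tracer.start_span("checkout", headers)
    # Stand-in for an HTTP or RPC call to another microservice.
    return inventory_service(tracer, outgoing_headers(span))


if __name__ == "__main__":
    tracer = Tracer()
    checkout_service(tracer, {})
    for s in tracer.spans:
        print(s.service, s.trace_id, s.span_id, s.parent_id)
```

Because both spans carry the same trace ID and the second records the first as its parent, whoever is investigating a failure can reconstruct the path a request took, even when the services run on different machines.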

Handy quotes McLean summarising what these tools do to help SRE groups better manage new microservice-based architectures. McLean states that, although tracing is important, Google believes the profiling and debugging aspects of its tools are unique at this stage and bring key benefits to developers and SREs. Handy finishes his article by covering monitoring, metrics and observability in more depth, with further references from Google and others in the industry; these are worth considering because they are likely to be relevant to a growing number of companies.

As more developers and companies adopt microservices, and many of them also use or begin to use SRE teams, it will be interesting to see how architectures and tooling evolve to ensure that reliability, availability, consistency and so on are maintained, so that developers and SRE teams can work in harmony. If you have any experiences to share in that regard, positive or negative, it would be useful for the wider community to hear about them.

Good one by Alhuck Abdulkaffar

Good article on SRE
P.S: There seems to be a spelling mistake on this line,
"something others have stated in the past asnd(instead of and) we've covered elsewhere,"
Thanks

Re: Good one by Mark Little

Good catch! Corrected. Thanks.
