
Bloomberg’s Standardization and Scaling of Its Monitoring Systems

by Hrishikesh Barua on Jul 21, 2018. Estimated reading time: 3 minutes

One of the outcomes of Bloomberg's adoption of Site Reliability Engineering (SRE) practices across its development teams was the creation of a new monitoring system, backed by the Metrictank time-series database. The system provides functions for derived metric calculations, configurable retention periods, metadata queries, and improved scalability.

Bloomberg’s infrastructure is spread across 200 node sites in two self-operated datacenters, serving around 325,000 customers, with a development team of 5,000 engineers. For a long time, developers were responsible for production monitoring of the products they built and deployed, but monitoring was often added as an afterthought. The result was a lack of standardization: multiple data collectors duplicated work measuring the same things.

There was also no global view of the systems. According to Stig Sorensen, head of telemetry at Bloomberg, the scope of operations ranges across "everything from our commercial website to market data feeds, to our main product, the Bloomberg Professional Terminal which hundreds of thousands of the key influencers around the world rely on." A variety of tech stacks compounded the complexity.

Sorensen started leading the SRE initiative at Bloomberg in 2016. Along with pushing SRE principles and practices, his team aimed to build monitoring and alerting as a company-wide service. The first iteration was a homegrown StatsD agent with support for tags, focused on getting metrics out to the central systems as fast as possible. Once the metrics were collected, most of the validation, aggregation, rule evaluation, and persistence happened on machines behind a Kafka cluster. This system soon ran into scaling issues, as Sean Hanson, a software developer at Bloomberg, noted in his talk:
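Bloomberg's agent is internal, but the StatsD wire format it extends is simple: a metric line sent fire-and-forget over UDP. A minimal sketch of emitting a counter with DogStatsD-style tags is below; the metric name, tag keys, and port are illustrative assumptions, not details from Bloomberg's system:

```python
import socket

def format_metric(name, value, metric_type="c", tags=None):
    """Render a StatsD line with DogStatsD-style tags, e.g. 'name:1|c|#k:v'."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line

def emit(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; StatsD agents conventionally listen on 8125."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))

# Hypothetical metric: one served request, tagged by datacenter and app.
line = format_metric("requests.served", 1, tags={"dc": "ny", "app": "feed"})
emit(line)  # UDP is connectionless, so this succeeds even with no listener
```

The tags are what distinguish this from plain StatsD: instead of encoding dimensions into dotted metric names, they travel as key-value pairs, which is what later enables metadata queries and cross-cutting aggregation.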

After these two years, we’re at two and a half million data points per second, 100 million time series. Some metrics have high cardinality, like 500,000. So our initial solution did scale fairly well for us. We were able to push that to 20 million data points a second sustained. But we couldn’t actually query anything out of it while it was doing that, and it still was really poor at handling high cardinality metrics, which was a pretty common use case.


The new system the team built came with a new set of requirements: functions to derive metric calculations, configurable retention periods, metadata queries, and scalability. Metrictank, a multi-tenant time-series database backed by Cassandra that can serve as a Graphite backend, met most of these requirements. Its compression is based on Facebook's Gorilla paper, and it was orders of magnitude faster than the previous system for high-cardinality data. It also paved the way for queries spanning metrics from across the organization.
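Because Metrictank speaks Graphite's render API, cross-organization queries can compose Graphite functions server-side. A sketch of building such a request follows; the endpoint host and metric paths are hypothetical, and only the `/render` URL shape comes from Graphite's documented API:

```python
from urllib.parse import urlencode

def render_url(base, target, start="-1h", fmt="json"):
    """Build a Graphite /render URL; 'target' may nest Graphite functions."""
    query = urlencode({"target": target, "from": start, "format": fmt})
    return f"{base}/render?{query}"

# Sum request rates across all datacenters, then smooth over 5-minute windows.
# 'dc.*.requests.rate' is an assumed naming scheme, not Bloomberg's.
url = render_url(
    "http://graphite.example.com",
    "movingAverage(sumSeries(dc.*.requests.rate), '5min')",
)
```

The `sumSeries` and `movingAverage` composition illustrates the "functions to derive metric calculations" requirement: the aggregation happens in the query layer rather than in each team's dashboard code.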

The Bloomberg team optimized a few resource-intensive areas and contributed the code back to Metrictank. Other organizations have also used Cassandra as a backend to scale Graphite.

Along with the monitoring system, the adoption of SRE has been focused on standardizing the way that things are done. Sorensen elaborates:

We don’t actually have a centralized SRE team today. We rolled it out in a way where we aligned the SRE teams with the application teams. SRE teams are pulled from both app and core infra teams. It’s either people within operational or system admin background that’s sort of picked up programming and moving that way, or we have application engineers with a more active view towards systems and towards availability – because we see SREs as software engineers just doing something – building a different type of software.

With the adoption of a standardized monitoring system, there is a parallel need to track progress. This is something that the team is working on, Sorensen says, because "measuring availability is not black and white. It’s not how many failures you had on a website, because if you are a certain market player and the real-time market data is delayed by a few – by one millisecond or hundreds of milliseconds, it could make a big difference for you."
