
What Resiliency Means at Sportradar

by Manuel Pais on Apr 06, 2018

At this year's QCon London conference, Pablo Jensen, CTO at Sportradar, a sports data service provider, talked about the practices and procedures in place at Sportradar to ensure their systems meet expected resiliency levels (PDF of slides). Jensen explained that reliability is shaped not only by technical concerns but also by organizational structure, governance, and client support, and that it requires ongoing effort to continuously improve.

One of the technical practices Sportradar employs is regular failover testing in production (a kind of Chaos Engineering). Their fail-fast strategy is tested at the individual service level, at the cluster level, and even for an entire datacenter. The latter is possible because, as Jensen stressed, production environments are created and run exactly the same way across all datacenters. From a client's point of view there is a single point of contact, while internally workloads can be allocated (or moved) to any live datacenter. Applications know as little as needed about the infrastructure they run on and can be deployed the same way on-premises and in the cloud (AWS), although the bulk of the work runs in Sportradar's own datacenters for cost reasons.
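The talk did not include code, but a failover exercise of this kind can be approximated by terminating one replica and checking that the remaining ones keep answering. The sketch below is an illustration of that idea rather than Sportradar's tooling; the replica health endpoints are hypothetical and the termination step is left as a placeholder for whatever orchestration layer actually stops the instance.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.List;

    // Minimal failover-test sketch (not Sportradar's actual tooling).
    // Replica URLs are hypothetical; the "kill" step is a placeholder.
    public class FailoverCheck {

        private static final List<String> REPLICAS = List.of(
                "https://dc1.example.internal/health",   // hypothetical
                "https://dc2.example.internal/health",   // hypothetical
                "https://dc3.example.internal/health");  // hypothetical

        public static void main(String[] args) {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2))  // fail fast on unreachable nodes
                    .build();

            // Step 1: in a real exercise, terminate one replica here via the
            // orchestration layer (e.g. stop the container or drain the node).

            // Step 2: assert the service as a whole is still healthy.
            long healthy = REPLICAS.stream().filter(url -> isHealthy(client, url)).count();
            if (healthy < REPLICAS.size() - 1) {
                throw new IllegalStateException("failover test failed: only " + healthy + " healthy replicas");
            }
            System.out.println("failover test passed with " + healthy + " healthy replicas");
        }

        private static boolean isHealthy(HttpClient client, String url) {
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                        .timeout(Duration.ofSeconds(2))
                        .GET()
                        .build();
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode() == 200;
            } catch (Exception e) {
                return false;  // treat timeouts and connection errors as unhealthy
            }
        }
    }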

Other common resiliency strategies employed at Sportradar include circuit breakers and request throttling, handled with Netflix's Hystrix library. Jensen also mentioned decoupling live databases from data warehousing to avoid reporting and data analysis impacting live customers.
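For readers unfamiliar with Hystrix, a command wraps a remote call so that it times out, trips a circuit breaker after repeated failures, and falls back to a degraded response. The example below is a minimal sketch of that pattern; the odds-feed command name and the stubbed upstream call are illustrative, not taken from the talk.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Minimal Hystrix command with a fallback (hypothetical service name).
    public class OddsFeedCommand extends HystrixCommand<String> {

        private final String matchId;

        public OddsFeedCommand(String matchId) {
            super(HystrixCommandGroupKey.Factory.asKey("OddsFeed"));
            this.matchId = matchId;
        }

        @Override
        protected String run() throws Exception {
            // Call the downstream service here; if it is slow or failing,
            // Hystrix times the call out and eventually opens the circuit.
            return fetchOddsFromUpstream(matchId);
        }

        @Override
        protected String getFallback() {
            // Served when the call fails or the circuit is open.
            return "{\"matchId\":\"" + matchId + "\",\"odds\":\"unavailable\"}";
        }

        // Stand-in for a real HTTP call; always fails in this sketch.
        private String fetchOddsFromUpstream(String matchId) throws Exception {
            throw new Exception("upstream unavailable");
        }

        public static void main(String[] args) {
            System.out.println(new OddsFeedCommand("1234").execute()); // prints the fallback
        }
    }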

In terms of governance, Sportradar puts strong emphasis on managing dependencies and their impact, because issues with 3rd-party providers are still their #1 source of incidents. For example, external service providers are classified into three categories of accepted risk (sketched in code after the list):

  • "single-served" for non-critical services provided by a single vendor;
  • "multi-regional" for a single vendor that offers some levels of redundancy (such as AWS availability zones);
  • "multi-vendor" for critical services that require strong redundancy, and single vendor dependency is not acceptable.

According to Jensen, expanding the infrastructure to Google Cloud Platform has been on the cards in order to further reduce risk (moving the cloud infrastructure service from "multi-regional" to "multi-vendor"). Further, accepting that single-vendor services might fail means that dependent internal services must be designed and tested to cope with those failures. This focus on risk management also manifests internally, as each business area is served by an independent technical stack hosted on redundant services.
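Designing a dependent service to cope with a single-vendor failure typically means putting an abstraction in front of the vendor and falling back to an alternative (or a degraded mode) when it fails. Below is a minimal sketch of that idea, with hypothetical provider and interface names; it is not Sportradar's implementation.

    import java.util.List;

    // Sketch of a multi-vendor fallback (provider and type names are hypothetical).
    // Each provider is tried in order; the first one that answers wins.
    public class ObjectStorageFacade {

        interface ObjectStore {
            byte[] fetch(String key) throws Exception;
        }

        private final List<ObjectStore> providers;

        public ObjectStorageFacade(List<ObjectStore> providers) {
            this.providers = providers;
        }

        public byte[] fetch(String key) {
            for (ObjectStore provider : providers) {
                try {
                    return provider.fetch(key);   // first healthy vendor answers
                } catch (Exception e) {
                    // log here and fall through to the next vendor
                }
            }
            throw new IllegalStateException("all storage vendors failed for key " + key);
        }

        public static void main(String[] args) {
            ObjectStore failing = key -> { throw new Exception("primary vendor outage"); };
            ObjectStore healthy = key -> ("value-for-" + key).getBytes();
            ObjectStorageFacade store = new ObjectStorageFacade(List.of(failing, healthy));
            System.out.println(new String(store.fetch("match-1234"))); // served by the second vendor
        }
    }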

With 40+ IT teams allocated to specific business areas, Sportradar also faced the need to set up some governance around software architecture and lifecycle. Before new development starts, it must pass a "fit for development" gate, with agreed architecture, security and hosting guidelines in place. Perhaps more importantly, deployment to production must pass a "fit for launch" gate to ensure marketing and client support teams are aware of the changes and ready for them.

Services continue to be improved after launch, as IT teams must follow a "30% rule" whereby 30% of their time is allocated to improving the stability and operability of current services, as well as improving existing procedures (such as on-call or incident procedures). Jensen highlighted the importance of iterating over established procedures, improving them and regularly communicating and clarifying them (failure to correctly follow procedures is still their #4 contributor to incidents).

In terms of organizational structure, aligning IT (product) teams with business areas has worked well, with centralized IT and security teams providing guidance and oversight rather than executing the work themselves. For example, security development guidelines were defined by the centralized team and iterated upon over a period of three months, as the first product teams to follow them provided feedback on what worked and what did not. Only then were the guidelines rolled out to all the product teams.

Finally, each service must have an on-duty team assigned before being launched. On-duty teams provide 2nd-level technical support - roughly 0.5% of the 110,000+ client requests per year (around 550) escalate to this level - throughout the service's entire lifetime. As Jensen stressed, only the best (and highest-paid) engineers in the organization work in these teams, promoting a culture of client focus and service ownership. Clients are kept in the loop on any open non-trivial incident, which is then followed by a postmortem once it is closed. Jensen added that clients appreciate this level of transparency.

Additional information on the talk can be found on the QCon London website, and the video of the talk will be made available on InfoQ over the coming weeks.
