
DevOps in Telecoms – Is It Possible?


A recent article from CA looks at how DevOps is making strides in many industries, improving the speed at which we deploy our solutions in Agile development environments.

What surprised me was that DevOps adoption in telecoms is much higher than generally believed, and that the industry seems the most willing to extend its existing Agile methodologies. CA says:

The clear leader here is telecoms. They are almost two-times more likely to be using DevOps (68%) than the aggregate (39%). In fact, a total of 88% of telecoms either are using it or plan to, versus an aggregate of 66%. It is clear that the telecoms industry is one of the most competitive, and the pressures to continuously deliver new products and services are enormous. The benefits of DevOps in terms of accelerating the pace of delivery provides a pretty compelling reason for telecoms to move forward aggressively with DevOps.

Different challenges for vendors and operators

This seems accurate for ISPs and mobile operators, who deployed Agile methods years ago and are now moving to WebRTC, web services and HTTP/REST in their battle to maintain profits from Value-Added Services (VAS). Their playing field is becoming more like IT, with services running in a browser or easily virtualized in an OpenStack container.

DevOps comes from IT (as did Agile), so it is logical for industries that converge with Internet platforms and web portals to adopt both the DevOps tools and the mindset.

But this pure IT stack, found at nearly every operator and ISP, is only a subset of “telecoms”, and doesn't reflect the largest share of what the telecoms industry actually is: the network equipment providers building the equipment and infrastructure.

Apart from the four biggest vendors – NSN, Ericsson, Alcatel-Lucent and Huawei – there are thousands of medium-sized firms writing and deploying software using Agile principles. They moved from iterative waterfall to Agile years ago, and despite their huge size (compared to IT teams) they are no strangers to Continuous Integration and, up to a point, even Continuous Deployment.

But unlike IT and Internet platforms, they don't create a virtual service to be deployed somewhere in the cloud, nor can their product be “continuously” patched in an Agile manner. They deliver hardware that may cost millions to commission and is maintained over years under strict SLAs. So on a technical level, using OpenStack, Puppet, Chef, Salt or other such tooling, DevOps isn't going to do much for the telco guys.

When, back in 2012, I first asked my former colleagues from my time working in SaaS what DevOps actually was, the confusing answer from its advocates was:

To understand DevOps you need to remember that it isn't a framework of tools, but a mindset.

This made me curious, because nobody, even in IT, could provide a proper definition beyond:

It will make you faster by improving the communication between development and integration/deployment departments.

Be it telecoms, automotive or any other hardware-intensive industry, anyone forced to take a conservative route is unlikely to embrace DevOps by the book, even if Agile processes are already in place.

Those of you familiar with some of the funny images over at devopsreactions might get a good chuckle out of the idea of "DevOps in a Nuclear Power Plant", or deploying "live upgrades to Medical Surgery Equipment".

Amusing as this is, it's the same in telecoms: a mobile base station, eNodeB or any other node delivering a service to thousands of subscribers has to work without faults. If your shiny new LTE/4G phone constantly drops voice calls – a service that has been working since 1946 (0G/MTS) – or emergency services like 911 are disrupted, then the penalties to the vendor are severe (millions per hour).

Sending an engineer to the site to collect logs takes a day. It takes a few more days to analyze them, then deliver and roll out a correction, which may or may not work. By the time this is solved, penalties are often in excess of the revenue earned from the deal. So finding faults as early as possible, using techniques such as Design Failure Mode and Effects Analysis (DFMEA), is the most important factor in whether you remain profitable.

But don't dismiss DevOps outside the Cloud and in conservative industries just yet. The DevOps "mindset" can still make you quicker than your competition! Before we go all philosophical, let's look at how these bigger firms outside IT usually develop/deploy their products.

Use Case: ACME Corp

The example below uses ACME, a telecom equipment vendor, to illustrate a typical scenario, but it can easily be applied to any other company that builds complex and very expensive systems, where the product isn't just SaaS but is delivered with thousands of moving parts. ACME could equally be an automotive OEM, an aerospace company, a power company delivering smart-grid solutions or a firm offering industrial automation.

ACME is a multi-billion-dollar business that develops network components for mobile operators. One of its R&D projects currently keeps more than 1000 engineers busy developing away to make this world a better place. Over the last 10 years ACME successfully moved from "iterative" (waterfall) development methods to Agile. Its engineers now organize themselves into cross-functional micro-teams which are assembled and disassembled in an ad-hoc manner to work on various chunks of the product. Not everyone touches everything in the code repository, but they look at code and ownership in the context of features and their impact on the system.

Competences are spread across different geographic sites with a good mix and balance of skills. Some teams are closer to the hardware, while others work higher up the technology stack. Strict quality guidelines ensure that whenever a change is submitted to the software repository, the continuous integration (CI) system gives immediate automated feedback, informing developers when they have broken existing functionality. In this first instance the CI executes mainly unit tests, but also more elaborate system-component tests which verify the final binary against more complex scenarios, even simulating the messages the binary would later handle on the real hardware. When the CI raises faults, the team either delivers a correction immediately or rolls back the change. This way there is always a very recent, working and testable version of the product and its sub-components.

Once changes or new features have been delivered, and provided nothing breaks, the code is automatically "promoted" and released to downstream integration and test stages.
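
To make this concrete, here is a minimal sketch of what such a promotion gate might look like. The build artifact, the make-based test targets and the shared directory from which downstream picks baselines are all hypothetical placeholders, not ACME's actual tooling.

```python
#!/usr/bin/env python3
"""Minimal sketch of a CI promotion gate: run the test suites against a
freshly built binary and, only if everything passes, copy ("promote") the
artifact to the area downstream integration picks baselines from.
All paths and commands are hypothetical placeholders."""

import shutil
import subprocess
import sys
from pathlib import Path

ARTIFACT = Path("build/acme_node.bin")      # output of the upstream build (hypothetical)
PROMOTED_DIR = Path("/shared/promoted")     # where downstream picks baselines (hypothetical)

TEST_SUITES = [
    ["make", "unit-test"],                  # fast unit tests
    ["make", "system-component-test"],      # binary-level tests with simulated traffic
]

def run_suite(cmd: list[str]) -> bool:
    """Run one test suite; return True if it passed."""
    return subprocess.run(cmd).returncode == 0

def main() -> int:
    for cmd in TEST_SUITES:
        if not run_suite(cmd):
            # A red run blocks promotion: the change is corrected or rolled back.
            print(f"FAILED: {' '.join(cmd)} - not promoting", file=sys.stderr)
            return 1
    PROMOTED_DIR.mkdir(parents=True, exist_ok=True)
    shutil.copy2(ARTIFACT, PROMOTED_DIR / ARTIFACT.name)
    print(f"Promoted {ARTIFACT.name} to {PROMOTED_DIR}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```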

The people in the downstream departments then (cherry-)pick a recent version from the upstream CI, but also align and coordinate deliveries from other departments which have their own CI and contribute to the final product. They then integrate these versions further onto the real hardware platform, which also differs depending on the application (MCU, DSP, etc.). For example, some departments deliver the kernel and OS abstraction layer, others deliver Layer 1 (MAC and physical layer) or Layer 2 (forwarding plane) of the OSI model, and then there are the teams providing the actual functionality to handle messages on the radio and core network interfaces: user plane, control plane, OAM. Each downstream team has its own CI system and repository to version and store its test cases, and in turn "promotes" and releases whatever came from upstream, once its own tests have passed, to the next teams. This too is done in a mostly automated manner, and human intervention is only required to analyze why the pipeline has blocked up.

You might find four or more such test/integration stages, each organized under its own management layer. Eventually the initial code, now a final binary package, reaches the real hardware, where end-to-end system tests and inter-operability (IOT) tests can be conducted before moving on to field tests and eventually customer deployments.
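
Viewed as code, the whole chain could be modelled roughly as in the sketch below, where each stage picks up the baseline promoted by the stage above it, runs its own tests and promotes further on success. The stage names and test commands are invented purely for illustration.

```python
"""Sketch of the cascading promotion chain: each integration stage takes the
baseline promoted by the stage above it, runs its own tests and, on success,
promotes it further. Stage names and test commands are illustrative only."""

import shutil
import subprocess
from pathlib import Path

# Ordered downstream stages and the (hypothetical) test command each runs.
STAGES = [
    ("platform-integration", ["make", "platform-test"]),
    ("protocol-integration", ["make", "protocol-test"]),
    ("system-verification",  ["make", "e2e-test"]),
]

def promote_through_stages(baseline: Path, promoted_root: Path) -> None:
    current = baseline
    for name, test_cmd in STAGES:
        if subprocess.run(test_cmd).returncode != 0:
            # The cascade stops here; faults flow back upstream for analysis.
            print(f"{name}: tests failed for {current.name}, not promoted further")
            return
        stage_dir = promoted_root / name
        stage_dir.mkdir(parents=True, exist_ok=True)
        current = Path(shutil.copy2(current, stage_dir / current.name))
        print(f"{name}: promoted {current.name}")

if __name__ == "__main__":
    promote_through_stages(Path("/shared/promoted/acme_node.bin"),
                           Path("/shared/stages"))
```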

Everyone is happy! That is, until faults come cascading back up from downstream, and their analysis eats up a lot of the overloaded developers' resources.

  • Which software package version has the fault?
  • How many branches have we already created in the meantime which now have the same faults?
  • Where do we have to merge the corrections to?
  • Who else will come back from another department reporting the same error?
  • But there is already a correction in version control so why wasn't this one used downstream?
  • How can we ask developers to deal with all this overhead when they are meant to work on a tight deadline to churn out the next features in the sprint?

Who will deal with these questions if we're unable to route every fault directly to the developer?

Fault Manager to the rescue!

But wait! Now we have an important customer trial (actually we always have one of those), which means we have many of these must-fix-immediately-super-critical-do-not-wait faults, and we need developers to correct them right now. Yes, NOW... even if we don't yet know who the culprit is, who committed this hideous crime. And since we fault managers are also overloaded, we don't have time to analyze any of this.

If only we had some dedicated people who could look at the logs and identify which developers must deliver the correction!

Pre-Analysis to the rescue!

And just in the nick of time, our hero managed to analyze the 10 GB of logs, identified the usual suspects, confirmed it wasn't a bug in the actual tests, and forwarded the fault to the right team, who then delivered the required correction tout de suite.

But this is clearly not the end of the story, since by now the fault has propagated into products which have already been frozen and therefore MUST NOT be corrected unless there is an official request from the customer.

Given that the official delivery of the faulty code was made three weeks ago, this issue will rear its ugly head a few more times in the months to come, each time on a different branch, popping up as a new bug in the bug-reporting tool with a different set of trace files.

So clearly our heroes will be busier than Batman and Robin for years to come!

What happened? Didn't Agile make ACME faster?

Having abolished individual code ownership and implemented Agile within their teams, ACME Corp gained tremendous speed over their competitors, churning out well-tested changes and new features every few hours and delivering several times a day instead of once a week. "Fail quickly" is the new mantra.

But in doing so another bottleneck surfaced: the company had committed internally to Agile methods, but every department still worked on its own terms, maintained its status quo and continued to treat other departments as external collaborators.

Back to the workflow: if you look closely at how the product delivery chain "cascades", often over five or more integration and test departments, you still see a waterfall! And that waterfall is one costly, resource-hungry administrative nightmare, quickly eating away the developers' time, as they constantly have to provide clarification to dedicated coordinators whose main job is to steer communication about the faults into the correct channels. Frustrating.

And this is where Agile stops and DevOps can help you pierce those remaining silos.

What DevOps can bring to the table in industries that are "close to the metal" isn't a new tool to solve all your problems. And neither was Agile!

Thinking in "DevOps terms" means integrating your downstream test environments / operations and bringing them closer to developers.

In fact, from the downstream departments' perspective not much has changed, mainly because integration for them is technically not as easy: there is little virtualization, and they operate closer to the hardware (using slow and unreliable JTAG/USB interfaces, debugging communications on SRIO or ePCI, or dealing with highly specialized telco interfaces such as OBSAI/CPRI), often requiring manual reboots of test PCs which use purpose-built drivers. So they still get a baseline at regular intervals, which is manually loaded into the test environment, and then they provide feedback upstream. To make it clearer: the teams conducting integration have a totally different skillset, far removed from what a developer does. Here the focus is on understanding the message flow of the 3GPP specs end-to-end, rather than churning out individual parts in C/C++. When integration speaks about a "feature", it usually touches several (or all) of the involved teams.

The further you move downstream, the more you'll find a lack of the scripting know-how that comes naturally in the upstream layers. Integration might select only a few of these baselines at random, or "cherry-pick" whatever looks most promising.

So even if our developers in the different lines churn out features at maximum capacity, "downstream" continues to live in isolation from the rest of the upstream departments.

And it gets worse the further you move downstream away from developers:

  • Each department invents its own test systems because of its unique needs (some justified, some not).
  • Many write overlapping test cases and re-test what was already verified. Not that double-checking is a bad thing, but writing and maintaining the same test code several times is a waste. There is massive overlap, with engineers creating and solving similar problems in every department.

What practical steps are there to move towards DevOps?

A DevOps "mindset" will help you cut down the technical barriers between these cascading departments, so that automated deployments become possible. Add APIs that make the whole delivery chain transparent from a technical point of view, all the way to the target hardware.
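
As one illustration of what such an API could look like, the sketch below exposes a read-only view of which baseline each stage is currently running and its latest test verdict. The stage names, the JSON status file and the use of Flask are assumptions made for the example, not a prescription.

```python
"""Minimal sketch of a read-only "delivery chain" API: every stage publishes
which baseline it is currently running and its latest verdict, so anyone
upstream or downstream can see the state of the chain without e-mails or
coordinators. Stage names, the JSON file and the use of Flask are assumptions."""

import json
from pathlib import Path

from flask import Flask, jsonify

app = Flask(__name__)

# In a real setup each stage's CI would update this store; here it is a JSON
# file, e.g. {"platform-integration": {"baseline": "acme_node_4711.bin",
#             "verdict": "pass"}, ...}
STATUS_FILE = Path("/shared/chain_status.json")

def load_status() -> dict:
    if STATUS_FILE.exists():
        return json.loads(STATUS_FILE.read_text())
    return {}

@app.route("/chain")
def whole_chain():
    """Return the state of every stage in the delivery chain."""
    return jsonify(load_status())

@app.route("/chain/<stage>")
def one_stage(stage: str):
    """Return the state of a single stage, or 404 if it is unknown."""
    status = load_status()
    if stage not in status:
        return jsonify({"error": f"unknown stage {stage}"}), 404
    return jsonify(status[stage])

if __name__ == "__main__":
    app.run(port=8080)
```

The point is not the particular framework but that the chain's state becomes queryable by machines and people alike, instead of living in coordinators' inboxes.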

The good news is that you don't have to do it all at once, as you did when initially migrating to Agile 10 years ago! Instead, start with a gap analysis conducted by an internal architect (let's call this person a consolidation engineer), and check every one of your test/integration departments to identify overlap and ways to automate your deployments' APIs and interfaces.

Once you have a clear picture, figure out how you can break your internal silos and improve communication between departments. How would you do that? Simple: you already did this once, when you introduced Agile in your development teams! (And yes, back then some people probably resisted – some even left – and it won't be easier this time either.)

  1. Now take the next step and have some of your developers spend one or two days a week in what was until now the next downstream department, and move some of those downstream engineers into development for a few days a week.
  2. You might want to think about incentives for doing so and reward people willing to work across these borders. Consider them ambassadors in your company who pierce through your silos. Before you know it, boundaries will have become blurry and that silo is cracking! Natural cross-pollination of ideas will start happening, from the bottom up!
  3. Developers now see the effects of their code changes within just a few hours and on the actual target hardware. This makes them a more relevant part of your big "organization-machine", and also helps them identify more deeply with the product you sell.
  4. Trust your test coverage! If something fails downstream, the departments involved must communicate to figure out why your test coverage couldn't detect the problem earlier. Any error which made it downstream must be covered by future tests on all branches (see the sketch after this list). Limiting the number of active branches reduces "context switches" for developers and the complexity of your system. If your test coverage is so poor that your only way of ensuring quality is to isolate and freeze a branch, then improve here first. The decision to create a new branch usually comes top-down (from QA managers), but it ignores the fact that this is the most costly option and it never scales.
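
As a minimal sketch of the "cover escaped faults on all branches" idea, the script below takes the regression test written for an escaped fault and runs it on every active branch, flagging the ones where the fix has not yet landed. The branch names, repository path and test command are placeholders.

```python
"""Sketch: once a fault has escaped downstream, make sure the regression test
written for it runs (and passes) on every active branch, so the same bug does
not pop up again branch by branch. Branch names, repo path and the test
command are illustrative placeholders."""

import subprocess

REPO = "/work/acme_repo"                                       # hypothetical
ACTIVE_BRANCHES = ["main", "release_7.1", "customer_trial"]    # hypothetical
REGRESSION_TEST = ["make", "test", "TEST=escaped_fault_1234"]  # hypothetical

def run(cmd: list[str]) -> bool:
    return subprocess.run(cmd, cwd=REPO).returncode == 0

def check_all_branches() -> None:
    for branch in ACTIVE_BRANCHES:
        if not run(["git", "checkout", branch]):
            print(f"{branch}: could not check out, skipping")
            continue
        verdict = "pass" if run(REGRESSION_TEST) else "STILL FAILING - fix not merged"
        print(f"{branch}: {verdict}")

if __name__ == "__main__":
    check_all_branches()
```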

Will this make the roles of "Fault Manager" and "Pre-Analysis" obsolete?

I'd like to picture these positions like an "interim manager" who comes in during a time of change and uses special skills to bridge the worlds of R&D and operations. Big firms doing Agile without DevOps won't be able to deliver services without them. But once your silos crack, these tasks need to be redefined.

How long will it take to get there?

You need support from both the bottom up and the top down, because breaking silos and doing DevOps affects all links in the delivery chain. From my personal experience of the migration from waterfall to Agile, and from many technical interviews I conducted as a telco recruiter, the time it takes to move from waterfall to Agile is one to three years. I'd predict another one to three years for DevOps. The biggest challenge in large organizations isn't technology but politics and people, or as Gerald M. Weinberg famously put it:

No matter how it looks at first, it's always a people problem.

Conclusion:

  • Doing DevOps in industries outside XaaS and the Web is not all that different; we just deal with more stakeholders and more complex, conservative structures.
  • It will take longer to get everyone on board. But if you zoom out and focus on the interfaces (i.e. how these silos communicate), then "a silo is a silo" and there is no difference from IT; it's just a larger scale.
  • These firms already took the ideas of Agile from IT and scaled them to their needs, mostly successfully. Individual silos are already operating at maximum speed and efficiency, and little can be improved internally (most of them operate like pressure cookers). DevOps is the missing link and the logical next step, as it will reduce the increasing friction between these silos.
  • Moving to DevOps will increase cross-functional skills and lead to a better understanding of the problems other stakeholders are facing.
  • Finally, it enables solutions such as putting more trust in your test suite instead of maintaining expensive QA branches, or regularly rotating people between upstream and downstream: solutions which have always felt right but which, until then, were politically impossible to implement.

About the Author

Joachim Bauernberger is an Agile Engineer focusing on Research, Consulting and Recruitment for Future-Network Technologies and R&D Process Optimization in these domains.
