
Self-service Delivery Platform at Tuenti

Posted by Óscar San José on Jul 31, 2015.

Empowering developers by providing them with self-service solutions for developing, deploying and releasing code has been one of the main goals of the System Reliability Engineering and Internal Tools teams at Tuenti. We are a very lean company, focused on providing software with a short time to market and on adapting to a constantly changing scenario in telco services. This means that the software being developed changes very frequently, new services are added, and new environments and servers are provisioned.

The infrastructure team needs to provide services to a lot of developers working concurrently. A transversal team providing a unified framework and environment-related tools, infrastructure and processes seemed like a good idea. The goal was to unify the heterogeneous practices and toolsets that individual development teams were using, while preparing the road for future teams and projects.

When a single team must provide these kinds of services to ever-growing development teams, it rapidly becomes a bottleneck. Tickets start to pile up in the backlog and developers feel blocked. Since they do not know the infrastructure, they do not know why it takes so long to provide them with the domains, hosts, release tools, testing pipelines, etc. We felt that applying a DevOps approach was mandatory if we wanted to keep up with the pace the company demanded.

Providing self-service

We tried to design a system where developers had full control of the process from development to production, along with the knowledge to use it right, and the correct tools to make it easy, safe, and traceable.

The core of this software is an event-based control system called Flow. Flow ensures that the appropriate processes are triggered in response to developer actions or system events, keeps track of all actions, and delivers that information to the stakeholders.

The other main piece that empowers this system is the puppetization of the entire platform, from development environments to production server deployments. This allows us to apply configuration changes in different environments, store those changes, test them, and promote them to production using CI techniques. Using virtualization and automatic provisioning of environments, developers themselves can test their software in environments very similar to production, make their own configuration changes, and send them to ops via a pull-request model.

Flow

Flow is the tool coordinating and reacting to events in the system, enabling developers to trigger events that produce deployments and other operations.

Flow is an event-action system - events from any component in the system are registered in Flow’s database, and sequences of actions (workflows) can be assigned to that particular event.

Events

Events are HTTP requests tagged with a “source” (usually the software triggering the event, such as jenkins, jira, git or mercurial hooks, in-house scripts, etc.) and a “type” (such as “build-enqueued” or “build-finished” in the case of jenkins). An event can also carry arbitrary associated data, attached as a JSON structure. The HTTP request is received by the Flow hosts and enqueued to RabbitMQ as a message of type “event”.
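As a rough illustration of the message described above, the following Python sketch wraps a source, a type and optional data into a JSON envelope ready to be enqueued. The function name and the field names (message_type, source, type, data) are assumptions made for this example, not Flow's actual wire format, and the RabbitMQ publishing step is left out.

```python
import json


def build_event_message(source, event_type, data=None):
    """Wrap an incoming event as a JSON message ready to be enqueued.

    Field names are hypothetical; the article only says that an event has
    a source, a type and optional JSON data.
    """
    if not source or not event_type:
        raise ValueError("an event needs both a source and a type")
    envelope = {
        "message_type": "event",  # assumed name for the message kind
        "source": source,
        "type": event_type,
        "data": data or {},
    }
    return json.dumps(envelope, sort_keys=True)


message = build_event_message(
    "jenkins", "build-finished", {"job": "preproduction", "result": "SUCCESS"}
)
print(message)
```

Keeping the envelope small and self-describing means any tool able to issue an HTTP request could act as an event source.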

The event is later dequeued and processed by the event processor. The only task of this processor is to search its event registry for events with the specified source and type, and return the list of actions to perform. This event registry is stored in Flow's database, so adding new actions or modifying existing ones is simple and requires no changes in code.
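The lookup performed by the event processor can be sketched as follows. This is a minimal in-memory stand-in for the registry kept in Flow's database; the class, its method names and the explicit position field used for ordering are assumptions made for illustration.

```python
from collections import defaultdict


class EventRegistry:
    """In-memory stand-in for the event-action registry table."""

    def __init__(self):
        self._actions = defaultdict(list)

    def register(self, source, event_type, action_name, position):
        # Associate an action with a (source, type) pair at a given position.
        self._actions[(source, event_type)].append((position, action_name))

    def actions_for(self, source, event_type):
        # Return the action names for this event, ordered by position.
        entries = self._actions.get((source, event_type), [])
        return [name for _, name in sorted(entries)]


registry = EventRegistry()
registry.register("jenkins", "build-finished", "increase_version", 2)
registry.register("jenkins", "build-finished", "tag_version", 1)
print(registry.actions_for("jenkins", "build-finished"))
```

Because the mapping is plain data, adding a step to a process amounts to adding a row, which matches the point that no code changes are needed.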


View of Jira-originated events currently registered in Flow

Actions

Actions are small functions written in python that perform a specific task. Their purpose is to be reused as much as possible, so they perform simple tasks and are very configurable via parameterization. These parameters are passed to the action function:

  • From the event itself;
  • As the result of other tasks performed earlier in the action chain;
  • As part of the action configuration in that particular workflow (defined in Flow Database), this is called “fixed param”.

Actions can also fetch any other information they need via data providers: from Flow’s database, from the source repository, from Jenkins, from the production platform, from running services, etc.

Actions do one of two kinds of work:

  • Produce information to be passed down to other actions, such as parsing commit messages between two changesets (to create a changelog);
  • Perform operations in other systems: tagging or committing changesets in the version control system, downloading builds from Jenkins and uploading them to the artifacts repository, deploying them to development or production environments, sending emails or notifying through HipChat, or in general interacting with any other tool.

Actions make use of multiple tools available to implement the most common operations:

  • A source repository management and access API was created (available in github: python-repoman), so actions can perform operations over repositories (commits, tags, pull/push, etc) regardless of the underlying tool (git, mercurial, etc). This tool can also manage a workspace, creating new repository copies on demand from a cache (so tasks running in parallel can have their own, up-to-date working copy);
  • Utilities for email and chat notifications (xmpp, HipChat, etc);
  • Support for calling scripting tools such as maven, ant, gradle, make, etc;
  • Deploying to popular systems: in-house deployments based on rsync or webdav, uploads to nexus repos using maven, deployment to the hockeyapps platform, even integration with google play and itunes store;
  • An interface to interact with jenkins (configure, poll, download artifacts, etc);
  • Integration with other tools such as Atlassian’s Crucible (peer code reviews) or Jira.

Given the diversity of utilities and frameworks, pretty much any operation a classic operator/release manager would need to perform can easily be automated in an action, and then reused across multiple processes.

Regarding the operations they perform, we have two kinds of actions:

  • Meta-actions: actions that modify the configuration of the workflow, usually by setting information that will be consumed by actions further down the chain;
  • Work-actions: actions that perform operations in other systems, creating some output such as changesets in a repo, tags, triggered jenkins builds, or mail and hipchat notifications.

An example can help illustrate what an action is and how it works.

Action: tag_version: This action gets as input a repository name and a changeset hash identifier, a version, and as fixed param a tag_prefix, and tags the changeset in the repository with the string <tag_prefix> + <version>, pushing to the upstream repo the resulting changeset (mercurial) or tag (git). All the information needed for this task can be taken from the event that triggered the action, or by any other previous action.
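A minimal sketch of what such an action might look like in Python is shown below. The FakeRepo class is a hypothetical in-memory stand-in written for this example; it does not reproduce the python-repoman API, and the function signature is an assumption based only on the inputs listed above.

```python
class FakeRepo:
    """In-memory stand-in for a repository wrapper (not the python-repoman API)."""

    def __init__(self):
        self.tags = {}
        self.pushed = False

    def tag(self, name, changeset):
        self.tags[name] = changeset

    def push(self):
        self.pushed = True


def tag_version(repo, changeset, version, tag_prefix):
    """Tag a changeset as <tag_prefix> + <version> and push the result upstream."""
    tag_name = f"{tag_prefix}{version}"
    repo.tag(tag_name, changeset)
    repo.push()
    # Return the tag so that actions later in the chain can use it.
    return {"tag": tag_name}


repo = FakeRepo()
result = tag_version(repo, "a1b2c3d", "1.0", tag_prefix="release_")
print(result)
```

The action itself knows nothing about where the version or the changeset came from, which is what keeps it reusable across workflows.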

The advantages of configuring actions this way are:

  • Makes actions very configurable, as they can receive inputs from any number of other actions; where the information comes from is decoupled from the way it is used. We can get a version to tag by reading it from a jira ticket or from a zookeeper db equally easily;
  • Makes actions' behavior easy to change using Flow's GUI, by changing the fixed param (for example, we can change the resulting tag from “release_1.0” to “version_1.0” just by changing the tag_prefix fixed param of the action tag_version);
  • Removes the need for complex actions and encourages writing actions that perform a single, generic task, with a self-explanatory name, whose purpose can be understood by any developer.

Workflows

Workflows are ordered sequences of actions, including any configuration they might have. They are stored in Flow's database and can be edited or created through Flow GUI. They are associated to an event and are triggered every time Flow receives that event.
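To make the idea of an ordered chain concrete, here is a small Python sketch of a workflow runner. It assumes, purely for illustration, that each action receives a dictionary of parameters and returns a dictionary of new values, and that fixed params take precedence over values coming from the event or from earlier actions; the article does not specify Flow's actual precedence rules.

```python
def run_workflow(steps, event_data):
    """Run actions in order; each action may add values for later actions."""
    context = dict(event_data)  # parameters coming from the event itself
    for action, fixed_params in steps:
        # Assumed precedence: fixed params override event and upstream values.
        params = {**context, **fixed_params}
        output = action(params) or {}
        context.update(output)  # results become inputs for downstream actions
    return context


def read_version(params):
    # Meta-action: only produces information for later actions.
    return {"version": params["ticket_summary"].split()[-1]}


def build_tag(params):
    # Stand-in for a work-action: a real one would touch the repository.
    return {"tag": params["tag_prefix"] + params["version"]}


steps = [(read_version, {}), (build_tag, {"tag_prefix": "release_"})]
final = run_workflow(steps, {"ticket_summary": "Release 2.3"})
print(final)
```

Storing the steps list as data rather than code is what would allow a workflow to be edited from a GUI or exported to a text file.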

Workflows can be named and shown in a dashboard, so any developer can check what actions will be triggered by a particular event, such as the start of a release. This makes documenting company processes in the traditional way obsolete: no more documents or wikis that need to be maintained and quickly become outdated. Workflows can also be exported to text files and saved in a repository, as a backup, for sharing with other instances of Flow (such as development ones), or to be included in other tools.

No code changes are needed to modify or define new processes. Processes can be modified easily in Flow GUI if needed.


Workflow associated with “Start Release” event

A use case

All these single-objective, tiny functions can be combined to implement high-level, company-wide workflows that can be triggered from any part of the system. Jira ticket transitions are favored as triggers, because the project management tool is typically used by multiple teams, not just development. For example, a product manager who is satisfied with the content of the development branch can start a release just by pressing a button.

Here's how a new software release would be started. The event triggering a new release is, as mentioned, a jira ticket transition. It can be issued by any of the stakeholders: product managers or developers, as long as the release pipeline is empty (this means that there is no other release ongoing). Having the full release pipeline automated means no additional human resources will be used when a release is started, and so the team does not need to plan releases in advance. This is a train-like model: the feature branches of developers arrive at the “release station”. If the railroad track (release pipeline) is empty (meaning the previous release is finished), a new “release-train” is spawned and all features in state “ready” will be included.
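The train-like gate can be sketched in a few lines of Python. The function name, the pipeline_busy flag and the shape of the feature records are assumptions made for this example; only the rule itself (start only when the pipeline is empty, board every feature in state “ready”) comes from the description above.

```python
def start_release(pipeline_busy, features):
    """Return the features boarding a new release, or None if the track is occupied."""
    if pipeline_busy:
        # A previous release is still ongoing, so no new train can be spawned.
        return None
    return [feature["name"] for feature in features if feature["state"] == "ready"]


features = [
    {"name": "new-login", "state": "ready"},
    {"name": "voip-codec", "state": "in-progress"},
    {"name": "billing-fix", "state": "ready"},
]
print(start_release(False, features))
```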

The actions launched when the release is started are:

  • creating a new release stabilization branch;
  • calculating the changelog for this release, by scanning the commit messages of the changesets included in the release code;
  • linking tickets in jira with a central release ticket;
  • notifying stakeholders: all the committers and product owners in this release are included as watchers in jira ticket, and release status is posted as a comment in that jira ticket, so feedback is ensured to reach interested people but also to be reachable ad-hoc through a well-known ticket if needed;
  • triggering a preproduction build in jenkins.
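The changelog and ticket-linking steps in the list above could look roughly like the following Python sketch. It assumes that commit messages mention tickets using Jira-style keys such as WEB-12; that convention, the regular expression and the function name are illustrative assumptions rather than details taken from Flow.

```python
import re

# Assumed Jira-style ticket key, e.g. "WEB-12" or "VOIP-7".
TICKET_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def build_changelog(commit_messages):
    """Return (changelog lines, sorted unique ticket keys) for a release."""
    lines = []
    tickets = set()
    for message in commit_messages:
        stripped = message.strip()
        if not stripped:
            continue  # skip empty commit messages
        # Use the first line of each commit message as the changelog entry.
        lines.append(f"- {stripped.splitlines()[0]}")
        tickets.update(TICKET_KEY.findall(stripped))
    return lines, sorted(tickets)


commits = [
    "WEB-12 Fix login redirect\n\nLonger explanation here.",
    "Bump dependencies",
    "VOIP-7 Add codec (see WEB-12)",
]
changelog, linked_tickets = build_changelog(commits)
print("\n".join(changelog))
print(linked_tickets)
```

The list of ticket keys is the kind of output a later action could use to link each ticket to the central release ticket.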

The preproduction build job in jenkins itself triggers an event when it finishes: “Preproduction build ready”. This event is associated with the following actions:

  • tag version built;
  • increase version number;
  • retrieve build and deploy it to test environment (note that the actual meaning of a test environment depends on the particular software being deployed).

The deployment action is a high-level one that can in turn call more specialized actions depending on the configuration: for example, mobile apps are deployed to the hockeyapps distribution environment, whereas web apps are deployed to staging frontends using rsync.


Full view of the events originated from Jenkins, with the workflow associated with one of them displayed

Advantages

As we've seen, an event can trigger actions that in turn can trigger events, allowing us to implement powerful and complex workflows that can be understood by the whole team (including product managers), not only release management or operational teams. Workflows are always kept up to date as the code itself serves as process documentation. An easy to use configuration system also means that developers themselves can “listen” to any event of the system, deciding if they want to receive mail or chat alerts when a particular event or process is triggered (for example team leads usually add a self-notification through HipChat when a release of their software is started).

This availability, transparency and ease of modification improve the knowledge and engagement of development teams in the company processes, leading to improvements driven by the development teams themselves.

The advantages of this approach are:

  • Modularity: actions are small, and have only one, very specific task to perform. This makes it easy to reuse code and compose these atomic actions into more complex workflows;
  • Processes are easy to change: changes in the event-action registry database are immediately reflected, so it's quite easy to add steps to processes such as release or deployment (for example, any developer can add a step in the release process to receive a notification via chat when a particular service is deployed to some environment);
  • Documentation is always up-to-date: simply having a description set for every action mapped to a particular event, and displaying this information to developers, means we have accurate and auto-updated documentation for each process.

Puppet-operated platform

In a big and heterogeneous platform such as ours, puppet organization can be challenging. We have many different servers with multiple purposes, and most of them were snowflakes inherited from the time when we had no configuration management system in place. As an example, the host naming scheme has been changed a couple of times, as we discovered that a lot of information can be conveyed by the hostname alone. Choosing a smart scheme can make operations, diagnosis, and stats gathering much easier. Puppet organization must be effective in order to be maintainable by the ops team, and easy to understand for development teams so they are not afraid to make changes.

The starting point was writing puppet modules from scratch, from an operations-team point of view. Basically they were install scripts (we had tons of them, written in bash) translated into puppet. This produced poorly written modules that were difficult to maintain and reuse. We made mistakes such as including “ifs” with the names of some of our servers when we wanted some specific configuration for them, or including very particular configurations of our own in very standard modules, such as nginx.

We soon discovered this was not maintainable. Changing the name of a host became impossible without updating a lot of puppet modules. Including a new host with a mixed role, or separating two pieces of software from a host into two different hosts meant a lot of changes in modules. It was often difficult to unify modules or templates, and the consequence was often having to rewrite the module for every host it would be installed to.

Most of the problems were due to our inexperience writing puppet modules, and the lack of a “framework” that would prevent us from making mistakes. To fix this situation, we investigated the puppet ecosystem, and finally decided on a set of best practices and guidelines for writing modules: the Example42 module layout. Since then, the team has been reusing most modules, instead of having to rewrite them every time a new server role is needed.

Example42 puppet templates are heavily used now, in a puppet infrastructure of over 200 modules. The standardized layout made it easier to write new modules and to understand modules written by others. It also prevented us from repeating some of the mistakes made in the past: for instance, all the configuration that is particular to our organization lives in custom_class modules, not in the standard modules, which allows us to reuse standard modules and to use other tools such as puppet Librarian.

In order to reuse not only modules, but also groups of modules that fulfill a given high-level functionality, and apply them as a whole to new nodes in the infrastructure, we defined a system of “roles” and “profiles”. A “role” is a group of puppet modules that provide a specific functionality and are usually installed together, for example: the nginx module, along with logrotate and diamond (to send metrics to graphite). We always want those three modules installed together (and with a particular configuration) in any host that needs nginx, so we created our own “nginx_server” role as a new .pp manifest in a different location from the standard modules. In this case tuenti_resources/role/nginx_server.pp:

class tuenti_resources::role::nginx_server () {
  include nginx
  include logrotate
  include diamond
}

With all these roles defined (around 50 of them), we started grouping them by “profiles” where each “profile” defines the group of roles that a machine will play. Often, a server will only provide one role, so the profile will include a single line:

class tuenti_resources::profile::tuenti_php_app_server () {
  include tuenti_resources::role::nginx_server
}

But in some cases, we want to have a single host serving multiple roles, for efficiency reasons:

class tuenti_resources::profile::telco_server () {
  include tuenti_resources::role::bss_gw_app
  include tuenti_resources::role::hlr_gw_app
  include tuenti_resources::role::voip_app
  include tuenti_resources::role::transcoder
}

This example shows how the functionality of a BSS gateway, an HLR gateway, our voip processing application server and the transcoding software, while totally fine to be served by different hosts, have been grouped together to reduce the hardware requirements. You can later split or regroup these roles very easily.

Finally, you need to assign particular machines to a profile, so when they run puppet they get the appropriate configuration automatically.

Applying modules to servers

We created a host-naming scheme where the profile of the machine can be inferred from the hostname, and we have puppet apply it automatically upon host startup:

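# Base node carrying the profile (node inheritance works in Puppet 3; it was removed in Puppet 4).
node 'telco_server-00' {
  include tuenti_resources::profile::telco_server
}
# Regular-expression node names must be enclosed in slashes.
node /^telco_server-(prod|preprod|staging|test)-(\d+)$/ inherits 'telco_server-00' {
}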

We then create our nodes from a base installation (using PXE) and configure them with names such as telco_server-prod-01, telco_server-prod-02 (production servers) or telco_server-preprod-01 (for preproduction), etc. Simply changing the name of a host will then force puppet to apply the correct configuration. The scheme also makes it easier for human operators to remember machine names according to their function. Finally, gathering stats becomes much more straightforward: besides stats for a particular machine, we can sum up stats for all machines fulfilling a role by simply using wildcard patterns such as telco_server-prod-*.
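As a small illustration of how the naming scheme helps with stats, the following Python sketch sums a metric over all hosts whose names match a wildcard pattern. The metric values and the function name are invented for the example; in practice the aggregation would more likely happen in the metrics system itself.

```python
from fnmatch import fnmatchcase


def sum_stat(stats_by_host, pattern):
    """Sum a per-host metric over every host whose name matches the pattern."""
    return sum(
        value for host, value in stats_by_host.items() if fnmatchcase(host, pattern)
    )


# Hypothetical requests-per-second figures, keyed by hostname.
requests_per_second = {
    "telco_server-prod-01": 120,
    "telco_server-prod-02": 95,
    "telco_server-preprod-01": 4,
    "nginx_server-prod-01": 300,
}
print(sum_stat(requests_per_second, "telco_server-prod-*"))
```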

Configuring modules

We are strict about writing our manifests this way, using only bare “include” statements and no parameterized “class” declarations. This ensures all manifests can be reused. We then define any specific configuration we may need using custom_class, puppet facts and hiera yaml files. Hiera is a hierarchical key-value lookup system that allows us to define a hierarchy with different levels of resolution: common values at the bottom, roles and profiles above them, and finally hostnames as the top-most, most specific level. Puppet facts allow us to use the host name in this hierarchical resolution. When looking up a key for a node, the result is the top-most value defined for that key in the hierarchy. So, for example, we can define a base template for the nginx module in a common.yaml file:

common.yaml:
nginx::sites_available: tuenti_resources/nginx/files/common_sites_available.erb

This key defines the puppet template that will be used by the puppet nginx module to create the available sites configuration.

We can then override this definition, using a specific one for preproduction stage:

stages/preprod.yaml:
nginx::sites_available: tuenti_resources/nginx/files/preprod_sites_available.erb

The corresponding hierarchy in hiera looks like this (from bottom to top):

  • common.yaml
  • roles/{$role}.yaml
  • stages/{$stage}.yaml

We then write custom facts for Facter that extract the $role and $stage information from the name of the host they run on, for example:

Facter.add("role") do
  setcode do
    # The role is the first dash-separated part of the hostname,
    # e.g. "telco_server-preprod-01" gives "telco_server".
    Facter.value('hostname').split("-")[0]
  end
end

For example, for “telco_server-preprod-01”, Facter will set role to “telco_server” and stage to “preprod”. Then, when you look up the “nginx::sites_available” key, hiera will look for it in common.yaml, then in roles/telco_server.yaml, and finally in stages/preprod.yaml. If the key is defined in common.yaml as X and in stages/preprod.yaml as Y, then the key value will be set to Y.
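The resolution just described can be mimicked with a short Python sketch. It is not Hiera itself: the YAML files are replaced by plain dictionaries, the function names are made up for the example, and hostnames are assumed to follow the role-stage-index scheme shown earlier.

```python
def facts_from_hostname(hostname):
    """Derive role and stage from a name such as 'telco_server-preprod-01'."""
    role, stage = hostname.split("-")[:2]
    return {"role": role, "stage": stage}


def lookup(key, hostname, data_files):
    """Return the most specific value defined for key, Hiera-style."""
    facts = facts_from_hostname(hostname)
    # Search from the most specific level down to the common defaults.
    hierarchy = [
        f"stages/{facts['stage']}.yaml",
        f"roles/{facts['role']}.yaml",
        "common.yaml",
    ]
    for name in hierarchy:
        values = data_files.get(name, {})
        if key in values:
            return values[key]
    raise KeyError(key)


# Dictionaries standing in for the parsed YAML data files.
data_files = {
    "common.yaml": {"nginx::sites_available": "X"},
    "stages/preprod.yaml": {"nginx::sites_available": "Y"},
}
print(lookup("nginx::sites_available", "telco_server-preprod-01", data_files))
```

A production host has no stage-level override in this data, so the same lookup for telco_server-prod-01 falls through to the common value.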

With all this in place you can change the configuration of the entire platform just by changing yaml configuration data files. This also eases configuration testing, as well as exposing and compiling your configuration information. Other tasks such as host inventory become totally straightforward.

Handing over puppet repo to development teams

With this puppet setup in place it is easy to define overrides for development environments, so developers can provision a smaller production-like virtual machine, powered by Vagrant. They can then test changes in their software configuration (paths, deployments, dependencies, config files), apply them on the puppet repo, and issue a pull request for the ops team to review, adapt those changes to production if needed (memory size, request rates, etc), and approve it.

We chose a pull-request system to prevent misconfigurations from being committed by developers straight to the production branch, since our automated puppet testing currently covers little more than syntax checks and verifying that all manifests can be applied. We still need an experienced engineer to check that the configuration values chosen by developers can be applied safely in production.

The ops team also makes sure the programming guidelines presented above are being followed.

One benefit of this approach is that the workload of the ops team is reduced: changes in platform configuration are prepared and tested by developers, so the ops team only needs to review them. But most importantly, developers know the infrastructure their software runs on in much more detail.

Ready-to-use development environments

Depending on the project and application being developed, a number of environment definitions have been created and made available to developers. Two of the most widely used at Tuenti are Tuenti-In-A-Box (TBox) and Voice-In-A-Box (VBox). Despite the similar names, they are totally different environments.

TBox provides a complete running environment for web app developers to write and test features of our social network, Tuenti’s traditional business. This required the creation of a full-stack web app environment in a single, virtualized box: from the database to the frontend app in PHP, including the backend application (also written in PHP), memcached, the nginx web server, our own configuration service (written in C), etc. This approach allows developers to spawn a box of their own, isolated from other developers’ configuration changes, and also to write and test features in the absence of an Internet connection.

This box is launched using vagrant, and then provisioned using puppet. Regular provisioning uses the puppetmaster provided by the SRE team, running a canonical copy of the master branch of the puppet repository. If they need to, developers can provision this box against their own working copy of puppet, with any modifications they need, by running a single command. This spawns, also using vagrant, another box called puppetmaster-test. This new box uses the modified puppet working copy on the developer’s laptop (via Vagrant's shared folders mechanism). Furthermore, by overriding DNS via the TBox's /etc/hosts, puppetmaster-test is used as the actual puppetmaster. So it is very easy for developers to make changes in their working copy of the puppet repo and have them tested immediately in their own application environment.

The other popular box in Tuenti is the so-called VBox (Tuenti’s Voice-In-A-Box). This development environment differs strongly from the TBox because its core is a multiple-host, networked environment configuration. This box is meant to give VoIP software development engineers an environment as similar to production as possible, where networking and the interconnections between boxes make a big difference in application runtime. Merging all those in a single box would obscure a lot of the potential problems during development.

So in this environment we create a multi-box setup that spawns the full stack for VoIP development. Because this means around 5 Unix boxes running some resource-consuming software, this environment cannot be spawned on a developer’s laptop. Instead, an internal cloud service using OpenStack is provided, so developers can self-service their environment from it, and then run it against the standard puppetmaster or against any puppet copy of their own. Even though the machines are not in their local environment, they can still test changes in a complex but isolated setup.

The problem with this approach is that even though developers do not share environments with other developers, the underlying resources are actually shared, so negligent behaviors (not freeing resources when finished, running CPU-heavy scripts, or requiring intensive I/O) can affect other developers. Since private cloud resources are not infinite, a booking and scheduling system has been created, so the team can foresee developers’ needs and scale the underlying hardware platform if needed.

Conclusion

Complex processes hard-coded in difficult-to-change workflows, along with a lack of up-to-date documentation, used to leave developers unaware of the processes. This translated into mistakes, lots of questions to the SRE team, an inability to change those processes quickly enough to accommodate business needs, etc.

Flow solved these problems by providing a transparent, extensible way of automating the most common tasks related to Continuous Deployment for multiple software applications:

  • knowledge of the processes is now spread through the development team;
  • developers are able to follow up the status of the release process of their own software;
  • it's easy for developers to suggest changes and optimizations, spot flaws or potential problems;
  • it's much easier to explain the processes to new hires.

Before the puppet platform was set up, control over the different development and production environments, and the knowledge of how to provision them, was concentrated in a single team (the SRE team). This team became a bottleneck for the development teams when it came to making changes in those environments. It kept developers from acquiring a deeper knowledge of the environments their software was running on, thus causing the classical divide between development and operations.

Setting up the puppet platform and having developers using virtualized environments helped to close that gap. It improved their knowledge about the environments. It allowed them to test changes in their software (and changes in how the software is run too) in production-like environments. This reduced the number of problems that were typically found only when promoting the software to production (the “works-on-dev-machine” problem).

It also allowed developers to send change requests to the infrastructure team via pull requests of puppet code. This made changes much more traceable, easier to review, and faster to commit. It reduced the workload of the infrastructure team and made developers happier by increasing their knowledge of the platform, reducing the number of bugs, and getting their change requests implemented faster.

About the Author

Óscar San José was born and raised in Madrid. He became passionate about computers when he first got an Amstrad CPC from his parents at the age of 9. After earning a CS degree from Universidad Complutense de Madrid, he started working as a programmer for different companies. In 2006, while working for Telefonica R+D, he discovered the field of Release Management. He was hired by Tuenti to organize their integration and release processes, and there he applied a DevOps approach to solving (some of) the problems of a rapidly growing development team. He has lately moved to Tuenti's SRE (System Reliability Engineers) team, where he continues writing tools and frameworks to help fellow developers in their daily work.
