Why Observability Is the Key to Unlocking GitOps

Oct 31, 2022 13 min read

Key Takeaways

You cannot do GitOps without observability
Git is the single source of truth for the system’s intended state, while observability provides a single source of truth for the system’s actual state
Internal observability allows the GitOps controller to identify configuration drift; external observability allows operations staff and other systems to identify changes in the cluster
Cloud native observability is a new skill and task you must add to your DevOps team

GitOps is a new software development paradigm that promises to streamline and fully automate the software deployment process. Instead of relying on IT staff or unwieldy scripts to provision environments, GitOps defines all environments as code, and deploys the environment together with all applications in a consistent and predictable manner. Everything is managed in source control, using tools that are familiar to most developers.

GitOps promises, and delivers, massive productivity benefits for developers. But just like any new technical approach, the challenge is in the details. One of the complex aspects of GitOps is ensuring sufficient visibility into live environments, so that they can be synchronized with a desired configuration. In this article, I’ll explain why observability is so critical to GitOps, and how Argo CD, a popular GitOps platform, addresses the observability challenge.

What Are Continuous Delivery and Continuous Deployment?

Continuous delivery prepares a software product for production deployment, making it possible to deploy changes at the push of a button. In traditional setups, this was typically done by merging a change to the master branch (this is known as push deployment).

In newer GitOps environments, it is done by committing a change to a central environment repository, triggering a deployment (this is known as pull deployment).

Continuous delivery creates artifacts that can be deployed to production. This is the next step after continuous integration (CI). It prepares a software release that is ready for deployment, and is only waiting for teams to evaluate the change and decide whether to release it.

Continuous deployment takes this one step further, removing the need for a human to evaluate the new version and push the button to release the software. In continuous deployment, every change is automatically tested and if it meets certain predetermined quality criteria, it is automatically deployed to production.
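
The difference between the two models can be sketched in a few lines of Python. This is only an illustration: the `BuildResult` fields, the 80% coverage default, and the function names are assumptions made for the example, not part of any particular CI/CD tool.

```python
from dataclasses import dataclass


@dataclass
class BuildResult:
    """Outcome of the CI stage for one change (illustrative fields only)."""
    tests_passed: bool
    coverage: float                 # fraction of lines covered, 0.0 to 1.0
    critical_vulnerabilities: int


def meets_quality_criteria(build: BuildResult, min_coverage: float = 0.8) -> bool:
    """Return True when the build satisfies every predetermined criterion."""
    return (
        build.tests_passed
        and build.coverage >= min_coverage
        and build.critical_vulnerabilities == 0
    )


def release_decision(build: BuildResult, continuous_deployment: bool) -> str:
    """Decide what happens to a build under each model."""
    if not meets_quality_criteria(build):
        return "rejected"
    # Continuous delivery stops at a release-ready artifact and waits for a person;
    # continuous deployment ships the same artifact without manual approval.
    return "deployed" if continuous_deployment else "awaiting approval"
```

The quality gate is identical in both cases; the only difference is whether a human decision sits between a passing build and production.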

What Is GitOps?

The GitOps model prescribes the use of source control systems, typically based on Git, for application and infrastructure configuration management. Git version control systems act as the single source of truth for GitOps. Based on this single source of truth, GitOps uses declarative configuration to adjust a production environment to match a desired state.

GitOps automatically manages the provisioning of infrastructure and deployment via Git pull requests. It relies on a Git repository containing the system’s complete state to ensure a full audit trail of system state changes.

The GitOps approach emphasizes developer experience, allowing Dev teams to manage infrastructure with familiar processes and tools used for other software development tasks. GitOps offers almost complete flexibility in the choice of tooling.

What Are the Benefits of GitOps?

There are many reasons to start using GitOps. Most of them are related to the ability to deliver software faster, more reliably, and with higher quality. Here are commonly cited GitOps benefits:

Increased productivity - GitOps enables fully automated continuous deployment with an integrated feedback loop, which reduces deployment time compared to traditional CI/CD pipelines. According to the State of DevOps report by the DORA research group, acquired by Google, the four characteristics of the highest performing DevOps teams are high Deployment Frequency (DF), low Mean Lead Time for Changes (MLT), low Mean Time to Restore Service (MTTR), and low Change Failure Rate (CFR). GitOps can directly improve all four of these metrics.
Improved developer experience - when operating in a Kubernetes cluster, GitOps removes the need to execute kubectl commands. Instead of having to learn and maintain Kubernetes internals, developers can use familiar tools like Git to manage Kubernetes updates and features declaratively, and any operations on the Kubernetes cluster are carried out automatically by the GitOps controller. New developers can ramp up more quickly and become productive in days instead of months, and experienced developers can rely on their knowledge of existing tools.
Improved stability - in a GitOps workflow, audit logs are automatically created for all changes. This auditability promotes stability, because it is easy to see what changes resulted in production issues. It can also be used for compliance with any necessary standards such as SOC 2.
Improved reliability and rollback - Git provides rollback and fork features that allow teams to achieve reliable and repeatable rollbacks. Because Git is the source of truth of the cluster’s configuration, the team has a single source to recover from any production issue. This reduces recovery time from hours to minutes.
Consistency and standardization - GitOps provides a model for changing infrastructure, applications, and Kubernetes add-ons in a consistent way, providing visibility across the enterprise and ensuring all teams have a consistent end-to-end workflow.
Security guarantees - Git can sign changes and prove author and origin, and provides strong encryption for tracking and managing changes. This provides a high level of trust over the integrity and security of a Kubernetes cluster.

What Is Observability and How Does it Support GitOps?

Traditional monitoring methods have reached their limits in the context of cloud native application architectures. The focus is shifting from monitoring to observability:

System monitoring involves detecting a set of known problems by determining the health of the system against predefined metrics. For example, container monitoring aims to answer two questions: what went wrong, and why. Over time, this enables profiling a container to anticipate, predict, and prevent problems before they happen.
Observability aims to provide an understanding of the state of a system based on three key elements - logs, metrics, and traces. Observability is a characteristic of a system - just like a system can be scalable, reliable, or secure, it can also be observable. In a cloud native environment, observability should be built into applications from day one.

Monitoring and observability are strongly connected. An observable system can be more easily monitored. Monitoring is part of observability, and effective monitoring is a result of an effectively observable system.

Observability provides insights using three main concepts:

Logs - provide a record of discrete system events.
Metrics - measure and process numerical and statistical data at set time intervals.
Traces - provide an event sequence to map the logic path taken.

These three types of insights provide answers to most critical questions, including the current state of the deployment compared to the intended state. They are important for all aspects of the system ranging from intended architecture and configurations to the UI, resources, and behavior.
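
To make the three signals concrete, here is a minimal sketch that records all of them for a single request. It keeps everything in memory; the `Telemetry` class and the `handle_request` function are hypothetical stand-ins for a real logging, metrics, and tracing backend.

```python
import time
from contextlib import contextmanager


class Telemetry:
    """In-memory stand-in for a logging, metrics, and tracing backend."""

    def __init__(self):
        self.logs = []      # discrete events, in the order they happened
        self.metrics = {}   # counter name -> numeric value
        self.spans = []     # (span name, duration in seconds), in completion order

    def log(self, event, **fields):
        self.logs.append({"event": event, **fields})

    def increment(self, counter, amount=1):
        self.metrics[counter] = self.metrics.get(counter, 0) + amount

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span when its block exits, even if it raised.
            self.spans.append((name, time.perf_counter() - start))


def handle_request(telemetry, path):
    """Hypothetical request handler instrumented with all three signals."""
    with telemetry.span("handle_request"):
        telemetry.increment("requests_total")
        with telemetry.span("load_data"):
            telemetry.log("data_loaded", path=path)
        with telemetry.span("render"):
            telemetry.log("response_rendered", path=path)
```

After one call, the logs hold two discrete events, the counter holds a number that can be sampled at intervals, and the spans show which steps the request passed through and how long each took.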

The Need for Observability in a GitOps Process

The GitOps model emphasizes the ability to simplify complex Kubernetes management tasks. The core concept is the deployment to production via changes to a central Git repository, with changes made to a Kubernetes cluster fully automatically.

To enable a true GitOps process, there is a need for two types of observability:

Internal observability - the GitOps controller needs to know what is happening in the Kubernetes cluster, for example, in order to compare it with a desired configuration and make adjustments.
External observability - other systems operating within and outside the cluster need to be aware of workflows automated by the GitOps system. To this end, the GitOps system should publish metrics that cloud native monitoring systems can consume.

How Does Internal Observability Work?

In a GitOps work process, Git is the single source of truth for the system’s intended state, while observability provides a single source of truth for the system’s actual state. Thus, it allows GitOps developers to understand the system’s state in the real world.

Suppose, for example, that you intend to have three NGINX pods running in the cluster, based on a deployment manifest in your Git repository. The GitOps system will use Kubernetes controllers to determine how many pods are actually running and their current configuration. If it detects the wrong number of instances or any change to pod configuration (this is known as configuration drift), it creates a "diff alert".

Once the system is aware of a divergence (i.e., a mismatch between the desired and actual number of instances), the diff alerts can trigger the relevant Kubernetes controller. The controller will attempt to synchronize the actual and desired states. Once there are no diff alerts, the system concludes that the actual state matches the desired state, meaning the application is "synchronized".

The key concept throughout this process is awareness of divergence. You cannot sync or fix the state if you don’t know it is out of sync. Thus, internal observability is critical for enabling GitOps and ensuring the actual state remains up to date.
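
The loop described above can be sketched as follows. This is a simplified model rather than the code of any real GitOps controller: the `DeploymentState` type, the two tracked fields, and the `nginx:1.25` image tag are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentState:
    """The parts of a deployment this sketch compares (illustrative only)."""
    replicas: int
    image: str


def diff(desired: DeploymentState, actual: DeploymentState) -> dict:
    """Return {field: (desired value, actual value)} for every field that drifted."""
    drift = {}
    for field in ("replicas", "image"):
        want, have = getattr(desired, field), getattr(actual, field)
        if want != have:
            drift[field] = (want, have)
    return drift


def reconcile(desired: DeploymentState, actual: DeploymentState):
    """Return the state after one sync attempt and the diff that triggered it."""
    drift = diff(desired, actual)
    if not drift:
        return actual, drift      # already synchronized, nothing to do
    # A real controller would call the Kubernetes API here; the sketch simply
    # assumes the sync succeeds and the cluster now matches Git.
    return desired, drift


desired = DeploymentState(replicas=3, image="nginx:1.25")   # from Git
actual = DeploymentState(replicas=2, image="nginx:1.25")    # observed in the cluster
actual, drift = reconcile(desired, actual)
```

Here `drift` reports that three replicas were wanted and two were found. A second call to `diff` returns an empty dictionary, which is the condition the article calls "synchronized".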

How Does External Observability Work?

External observability has three elements:

A monitoring system must be running in the Kubernetes cluster. There are several mature tools that support cloud native environments - a common choice is Prometheus for Kubernetes.
A GitOps controller making changes to the cluster in accordance with a Git configuration.
Published metrics generated by the GitOps controller or related systems.

Once these three elements are in place, the monitoring system scrapes metrics from GitOps automation systems in the cluster. This can proactively inform the rest of your ecosystem what changes are taking place. In other words, other systems get a "heads up" that an application is being synchronized, instead of discovering it in retrospect and generating unnecessary alerts.
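
As a rough sketch of what "published metrics" means in practice, the code below renders a gauge in the Prometheus text exposition format and shows how a consumer could read it back. The metric name `gitops_app_sync_in_progress` and both function names are invented for this example; a real GitOps controller defines its own metric names.

```python
from typing import Dict, Set

# Hypothetical metric name, not taken from any real controller.
METRIC = "gitops_app_sync_in_progress"


def render_metrics(sync_in_progress: Dict[str, bool]) -> str:
    """Render one gauge sample per application in Prometheus text format."""
    lines = [
        f"# HELP {METRIC} 1 while the controller is synchronizing the application.",
        f"# TYPE {METRIC} gauge",
    ]
    for app in sorted(sync_in_progress):
        value = 1 if sync_in_progress[app] else 0
        lines.append(f'{METRIC}{{app="{app}"}} {value}')
    return "\n".join(lines) + "\n"


def syncing_apps(metrics_text: str) -> Set[str]:
    """Return the applications whose gauge is currently 1."""
    apps = set()
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        if series.startswith(METRIC + "{") and float(value) == 1:
            apps.add(series.split('app="', 1)[1].split('"', 1)[0])
    return apps
```

A downstream alerting system could use something like `syncing_apps` to hold back alerts for an application while it is being synchronized, which is the "heads up" described above.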

Let’s see how this works with a popular GitOps project: Argo.

What Is Argo?

Argo is a collection of open source projects that help developers deliver software faster and more securely. Argo is Kubernetes native, making it easy for developers to deploy and publish their own applications.

Argo tools enable continuous deployment with advanced, progressive deployment strategies, allowing developers to define the set of actions required to release a service:

Argo CD is a GitOps-based continuous deployment tool for Kubernetes. The configuration logic is in Git, and developers can work on their code using the same development, review, and approval workflows they already use in Git-based repositories. Argo CD does not have continuous integration, but integrates with CI systems.
Argo Rollouts is an incremental delivery controller built for Kubernetes. It enables progressive deployment strategies out of the box, including canary deployments, blue/green deployments, and A/B testing.
Argo Workflows is a container-native workflow engine for orchestrating parallel tasks on Kubernetes.
Argo Events is an event-driven workflow automation framework and dependency manager that can manage events from a variety of sources, including Kubernetes resources, Argo workflows, and serverless workloads.

In the context of GitOps, Argo facilitates application deployment and lifecycle management. It makes it possible for developers to operate environments and infrastructure seamlessly, automating deployments, facilitating rollbacks, and enabling easy troubleshooting.

Argo as an Enabling Technology for GitOps

Argo uses Kubernetes manifests to continuously monitor Git repositories, verify commits, proactively fetch changes from repositories, and synchronize them with cluster resources. This continuous reconciliation process ensures the state of the cluster configuration always matches the state described in Git.

This is the exact definition of GitOps - meaning that Argo allows teams to implement a full GitOps process easily, in their existing Kubernetes clusters, and without changing their existing work processes.

In addition, Argo eliminates the common problem of configuration drift, which occurs when elements in a cluster diverge over time from a desired configuration. These unexpected configuration differences are one of the most common reasons for deployment failures. Argo can automatically revert any configuration drift, or at least show the deployment history of the cluster so that you can identify the drift and the change that led to it.

Lastly, the Argo project aims to provide a better experience for Kubernetes developers, maintaining a familiar user experience while easily applying advanced deployment strategies. It is implemented as Kubernetes Custom Resource Definitions (CRDs), meaning that it works just like existing Kubernetes objects with extensions that developers can easily learn and use.

To summarize, Argo makes it easier to implement GitOps for the following reasons:

A more efficient workflow - developers can deploy code using familiar processes and tools.
Improved reliability and consistency - using an automated agent to ensure that the desired state defined in Git is the same as the state of the cluster.
Improved productivity - with fully automated CD and no complex setup.
Reduced deployment complexity - deployment becomes a transparent process that occurs behind the scenes.
Progressive delivery - in a traditional setup it was very difficult to set up strategies like blue/green or canary deployment, and these are available out of the box in Argo.

Argo CD: GitOps with Observability Built In

Internal Observability in Argo CD

Argo CD receives information about resource status and health via the Kubernetes API Server. When it detects a change between current cluster state and the configuration in Git, it goes through three phases:

Pre-sync - checking if the change is valid and requires a change to the cluster
Sync - making a change to the cluster
Post-sync - verifying that the change was made correctly

This process occurs in one or more waves that sweep the entire cluster, looking for changes, and reacting to diffs. The order of resources within a wave is determined by type (namespaces, then Kubernetes resources, then custom resources) and by name.

Within each wave, if any resource is out-of-sync, Argo CD adjusts it and then continues sweeping the cluster. Note that if resources are unhealthy in the first wave, the application may not be able to synchronize successfully.

Argo CD has a delay between each sync wave in order to give other controllers a chance to react to the change. This also prevents Argo CD from assessing resource health too quickly, before it updates to reflect the current object state.
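
The ordering described in this section can be modeled as a sort key. The sketch below is a simplification based only on the description above, not Argo CD's actual implementation; the `Resource` type, its `is_custom` flag, and the sample resources are assumptions made for the example.

```python
from dataclasses import dataclass

# Phases run in this order; lower rank syncs first.
PHASE_ORDER = {"PreSync": 0, "Sync": 1, "PostSync": 2}


@dataclass(frozen=True)
class Resource:
    kind: str
    name: str
    phase: str = "Sync"
    wave: int = 0
    is_custom: bool = False   # True for resources defined by a CRD


def kind_rank(resource: Resource) -> int:
    """Namespaces first, then built-in Kubernetes resources, then custom resources."""
    if resource.kind == "Namespace":
        return 0
    return 2 if resource.is_custom else 1


def sync_order(resources):
    """Sort by phase, then wave (lower first), then resource type, then name."""
    return sorted(
        resources,
        key=lambda r: (PHASE_ORDER[r.phase], r.wave, kind_rank(r), r.name),
    )


resources = [
    Resource("Deployment", "web"),
    Resource("Job", "smoke-test", phase="PostSync"),
    Resource("Namespace", "shop"),
    Resource("Rollout", "web", is_custom=True),
    Resource("Job", "migrate", phase="PreSync"),
    Resource("ConfigMap", "settings", wave=-1),
]
ordered = sync_order(resources)
```

With this input, the pre-sync job comes first, then the lower-wave ConfigMap, then the namespace, the Deployment, and the custom Rollout, with the post-sync job last.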

External Observability in Argo CD and Argo Workflows

Argo CD provides a notifications feature, which lets you continuously monitor Argo CD apps and receive alerts about significant changes in the state of an application. It offers a flexible way to set up notifications with templates and triggers - you can define the content of notifications and when Argo CD should send them.
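
Conceptually, a trigger decides when to notify and a template decides what to say. The sketch below models that split in plain Python. It does not use Argo CD's configuration format, and the `Trigger` type, the trigger names, and the application fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trigger:
    """Pairs a condition on application state with a message template."""
    name: str
    when: Callable[[Dict[str, str]], bool]   # the "when to send" part
    template: str                            # the "what to send" part


# Illustrative triggers; the names and conditions are made up for this sketch.
DEFAULT_TRIGGERS = [
    Trigger(
        name="on-out-of-sync",
        when=lambda app: app["sync"] == "OutOfSync",
        template="Application {name} is out of sync.",
    ),
    Trigger(
        name="on-degraded",
        when=lambda app: app["health"] == "Degraded",
        template="Application {name} is degraded.",
    ),
]


def render_notifications(app: Dict[str, str], triggers: List[Trigger]) -> List[str]:
    """Return one rendered message for each trigger whose condition holds."""
    return [t.template.format(**app) for t in triggers if t.when(app)]
```

For an application that is healthy but out of sync, only the first trigger fires, so a single message is produced.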

Another part of the Argo project is Argo Workflows, which lets you automate tasks related to CI/CD pipelines in a Kubernetes cluster. Argo Workflows generates several default controller metrics, and lets you define custom metrics to provide information about the state of Workflows.

Argo Workflows generates two types of metrics:

Controller metrics - provide information about the state of the controller.
Custom metrics - provide information about the state of your Workflow. You define the custom metrics using the Workflow specifications. The owner of the metric generator is responsible for generating custom metrics.

For example, you can define custom Prometheus metrics and apply them at the Workflow or Template level. These metrics are useful for various cases, including:

Enforcing thresholds - keep track of your Template or Workflow’s duration and receive alerts when it exceeds your threshold.
Tracking failures - see how often your Template or Workflow fails across a certain timeframe.
Metric reporting - set up reports for internal metrics like the model training score and error rate.
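
The first two use cases come down to simple calculations over workflow run records. The sketch below shows that arithmetic on in-memory data. In practice these values would come from the metrics the workflow engine emits, and the `WorkflowRun` type and function names here are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkflowRun:
    """One finished run of a Workflow or Template (illustrative fields)."""
    name: str
    duration_seconds: float
    succeeded: bool


def runs_over_threshold(runs: List[WorkflowRun], max_seconds: float) -> List[str]:
    """Enforcing thresholds: names of runs that took longer than allowed."""
    return [run.name for run in runs if run.duration_seconds > max_seconds]


def failure_rate(runs: List[WorkflowRun]) -> float:
    """Tracking failures: fraction of runs in the window that failed."""
    if not runs:
        return 0.0   # no runs in the window means nothing has failed
    return sum(1 for run in runs if not run.succeeded) / len(runs)


runs = [
    WorkflowRun("train-1", 120.0, True),
    WorkflowRun("train-2", 480.0, True),
    WorkflowRun("train-3", 95.0, False),
    WorkflowRun("train-4", 130.0, True),
]
slow = runs_over_threshold(runs, max_seconds=300)
rate = failure_rate(runs)
```

With a 300-second limit only `train-2` is flagged, and one failure in four runs gives a failure rate of 0.25.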

Conclusion

GitOps is gaining traction as a mainstream development practice. I showed why observability is an inseparable part of GitOps systems, and described two types of observability:

Internal observability - required for the GitOps controller to identify configuration drift in the cluster and correct it.
External observability - required to notify operations staff and other systems of changes made by the GitOps controller.

I briefly showed how both of these are implemented in a popular open source GitOps platform - the Argo project.

GitOps is based on several complex mechanisms, and the best way to wrap your mind around them is to take Argo CD for a test drive. Check out the official getting started tutorial, which shows how to install Argo CD and deploy a minimal application to a Kubernetes cluster. Try to "mess up" your test cluster and see how Argo CD picks up the changes and reverts the cluster to your desired configuration.

To go more in depth and understand the Argo CD sync processes, see the discussion on Sync Phases and Sync Waves in the official documentation.

About the Author

Gilad David Maayan
