Key Takeaways
- A major key to success in using GitHub Actions is to embrace GitHub's repository-centric mindset
- Shifting the focus from the pipeline to the pull request as the main aggregator of quality gates is essential to an effective developer experience
- Use Actions' modular approach to your advantage - internal platform teams can provide precisely versioned actions and workflows that simplify auditing and help to balance teams' ownership of their pipelines while keeping them within established guardrails
- Teams considering moving to Actions should begin planning for a way to manage the usage and security of third-party actions, as industry solutions in this space are fairly immature
- Using a declarative, pull-based model of delivery, as exemplified by GitOps, provides strong alignment to the strengths of the GitHub Actions platform
Catching Up with the Evolution of Actions
When GitHub Actions was announced in the fall of 2019, it drew immediate attention for ushering in "the third wave of CI/CD platforms." The unique approach of building composable pipelines from reusable open-source building blocks presented clear potential for improving the efficiency and maintenance of CI/CD pipeline operations. However, despite the inherent appeal in the design of the Actions platform, many organizations, especially large enterprises with established CI/CD systems, were understandably hesitant to make a move. This was due in part to limitations present in the platform during its infancy, including restrictions around the execution and sharing of Actions internally within enterprises.
Fortunately, in the time since the initial unveiling, Actions has taken steps to break down some of the barriers toward maintainable enterprise use. Shortly after launch came the introduction of self-hosted runners, an essential capability for allowing jobs to execute within existing internal corporate networks. Then, in November 2021, came the announcement of reusable workflows, greatly expanding the ability to provide reusable pipelines that reduced the number of duplicate steps and boilerplate code, coming closer to parity with many more established CI/CD products. What truly made this feature ready for enterprise adoption, though, was the announcement in January 2022 of the ability to share actions internally, giving platform engineering teams the power to package and publish common reusable steps without exposing them to the world. Finally, Actions felt capable of providing a foundation for an enterprise CI/CD ecosystem.
Why We Are Moving to Actions
At Thrivent, we had invested considerable effort over recent years into improving the developer experience of our CI/CD tooling, including building out automation to quickly generate application pipelines from templates along established golden paths. Despite this, we encountered limitations in our ability to provide a platform that balanced consistency with flexibility.
One of the shortcomings of our current system was the effort involved in sharing common tasks in the CI/CD pipeline. Our repository of shared scripts offered only limited encapsulation, and would muddle the build history of an application, intermingling commits to the actual application repo with commits to our shared task repo.
We also had limited ability to preserve user customizations across upgrades to the pipeline templates. If we introduced a new step or patched a bug in the template, developers would have to regenerate the entire pipeline to pick up the change, which would erase any customizations they had made.
Lastly, a less tangible factor was minimizing the overall developer friction when using our CI/CD system. As we entered a period of growth within our technology organization, we were faced with a large influx of new developers. Being able to quickly onboard these developers and provide a platform that met their expectations was a top priority. Given GitHub’s standing in the open-source community, it is no surprise that most developers reported using it in both the most recent Stack Overflow and State of Frontend surveys. By expanding the scope of our usage into the CI/CD capabilities native to the platform, we saw opportunities for:
- consolidating tooling into a single system that optimized existing developer familiarity, thereby reducing extraneous cognitive load
- minimizing context switching by using repository events to work more closely within the coding lifecycle
- providing the natively modular pipeline architecture we had been looking for.
Overall, the design of the GitHub Actions platform mirrored the same approach used within modern development itself. Developers had the flexibility to build a pipeline of reusable components, transferring the open-source mindset used by teams in their development activities into the CI/CD pipeline. However, as we delved deeper into the Actions ecosystem, noticeable differences in how Actions approached CI/CD caused us to rethink some of our assumptions and develop a new strategy that used the constraints of the platform as an advantage.
A Quick Note on Terminology
Several terms used in this article ("workflow," "job," "action," "event," and "runner") carry a variety of connotations across the wider CI/CD ecosystem, but here they refer to specific resources in the GitHub Actions framework. If you are unfamiliar with the GitHub Actions definitions, I recommend taking a quick look at GitHub’s documentation.
A GitHub Actions workflow visualized (Source: GitHub)
A Mindset Shift
When transitioning to Actions, one of the first distinctions to understand is the relationship between the repository and the pipeline tooling. The CI/CD product we had been using was integrated with, though essentially separate from, the application’s repository. Even other tools in the CI/CD landscape, which may provide a more comprehensive solution for the full application lifecycle (often bundling a repository, pipeline(s), and other delivery tooling into a single product), tend to have a higher-level construct such as a "project" or "application" that clearly separates the repository and pipeline, placing them on relatively equal footing. When the CI/CD process starts, all attention shifts to the pipeline tool view.
In contrast, with GitHub Actions, the repository remains the center of the universe. Workflows triggered within the Actions platform exist primarily within the context of a repository. There are advantages and disadvantages to this approach:
Advantages
- The repository now provides a single landing point for the status of the application, aggregating application health and deployment information closely with the code itself.
- The ability to initiate workflows on a diverse set of repository events (e.g., tags, pull requests) outside of the traditional commit/push trigger allows for opportunities to integrate CI/CD more closely into the development process.
- The system by which actions and workflows are shared provides the ability to easily manage the lifecycle of shared pipeline components independently from changes to the application repository.
Disadvantages
- Because everything is orchestrated based on events that occur in the repository, initiating workflows from other (external) event sources is difficult.
- Artifacts produced through workflows do not have the same level of visibility in the repository that the code benefits from. This means that there is limited ability to share and promote resultant artifacts natively as that information is not persisted in the repo and is not shared between workflow executions.
- Different approaches to repository organization may introduce additional complexity into the process. For example, a monorepo may contain multiple projects in the same repository, and these independent applications may differ in how they are built and deployed. As a result, additional care and tooling may be required to manage the triggering events so that unmodified applications are not rebuilt, and to properly package and deploy the downstream artifacts.
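As a minimal sketch of how the event-related points above show up in workflow configuration, the trigger block below reacts to pull requests and tags, limits pull request runs to one project's directory in a monorepo, and accepts an externally raised event through repository_dispatch. The services/billing path, the billing-v* tag pattern, the billing-deploy-requested event type, and the make target are hypothetical names used only for illustration.

```yaml
# .github/workflows/billing.yml (hypothetical monorepo service)
name: billing

on:
  # Run on pull requests, but only when this service's files change.
  pull_request:
    paths:
      - 'services/billing/**'
  # Run when a release tag for this service is pushed.
  push:
    tags:
      - 'billing-v*'
  # Entry point for external systems, raised through the GitHub REST API.
  repository_dispatch:
    types: [billing-deploy-requested]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Placeholder build command; substitute the project's real build.
      - run: make -C services/billing build
```

Note that a repository_dispatch event only starts a run when the workflow file exists on the default branch, which is one reason external triggers remain less flexible than native repository events.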
Building a Strategy for Actions Usage
Armed now with an understanding of the GitHub Actions viewpoint and the strengths and weaknesses thereof, we began to develop a strategy for the effective use of the Actions platform as our CI/CD solution. The following are some of the main principles that have guided our decisions:
Favor Approval in the Pull Request Over Approval in the Workflow
One of the major concerns of a CI/CD platform is effectively collecting and displaying output, and enforcing any quality gates from those results. Previously we had been using a process that will look familiar to many organizations:
Previous approach: repository and pipeline checks as separate panes of glass
In this approach, a push to the target branch triggered the pipeline. The pipeline progressed through a series of stages, each of which would produce some output of the results and would typically assess the production readiness of the artifact based on those results.
With our new goal of moving this information closer to the natural activities of development, as well as remaining mindful of some inherent limitations in how Actions displays results and artifacts, we sought a better way to provide these controls, and the pull request feature was the logical choice. Pull requests maintain a more durable audit history (in contrast to the retention limits of workflow executions) and focus the discussion on the change itself, with the code readily available for reference. Embracing the pull request as the main approval dashboard required a shift to a model wherein the pull request collected the same results and quality checks, but without the additional effort needed to navigate back to the repository to view associated changes.
New strategy: Using the pull request to guide and collect the necessary information
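One way to picture this model is a workflow that runs each quality gate as its own job on the pull_request event, so that every result is reported back to the pull request as a separate status check. The sketch below assumes a main target branch and hypothetical make targets; it is an illustration of the pattern rather than our actual pipeline.

```yaml
# .github/workflows/pr-checks.yml (illustrative)
name: pr-checks

on:
  pull_request:
    branches: [main]

# Read-only token: these jobs only report their results as status checks.
permissions:
  contents: read

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test        # hypothetical test command

  static-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint        # hypothetical lint command
```

Whether a check actually blocks merging is decided in the repository's branch protection settings, not in the workflow file, so the workflow and the protection rule have to be kept in step.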
Develop Guidelines for Reusable Components
Building a self-service library of reusable pipeline pieces also came with many questions. How would we handle updating shared workflows and actions? What is the right level of abstraction when trying to balance multiple teams with slightly different ways of performing a seemingly common task?
Faced with these challenges, our objective was to achieve a set of "paved roads" where each new repository should be able to deploy using the templates, actions, and reusable workflows supported by our team, but with the flexibility to customize as needed. This sets the groundwork for the continuous delivery directive of keeping the repository always in a deployable state. There are several guidelines we established in keeping with this goal:
Push Semantic Version Tags for all Releases
Maintaining precise versions provided our users with the ability to stay current with changes in accordance with the level of risk they were comfortable with. In our previous CI/CD tooling, shared tasks were pulled from the main branch of a repo, which meant we needed to be very mindful of introducing unexpected changes. While Actions supports this type of usage, it is preferable to reference semantic version tags, which balance the need to stay up-to-date against the risk of picking up breaking changes. One key learning we had here was to make sure we pushed/updated multiple tags (major, minor, and patch tags) for each release. Because an Actions version reference is an ordinary Git ref rather than a resolved version range, this ensures that users following a major tag ref (e.g., "@v2") also pick up a new patch version.
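The tag maintenance described above can be automated. The following is a sketch, not our production workflow: it assumes releases are cut by pushing a full vMAJOR.MINOR.PATCH tag, and the file name is hypothetical.

```yaml
# .github/workflows/move-tags.yml in a shared action's repository (illustrative)
name: move-version-tags

on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'   # full releases only, e.g. v2.3.1

permissions:
  contents: write                 # needed to push the moved tags

jobs:
  move-tags:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Point vMAJOR and vMAJOR.MINOR at this release
        run: |
          version="${GITHUB_REF_NAME#v}"      # v2.3.1 -> 2.3.1
          major="${version%%.*}"              # 2
          rest="${version#*.}"                # 3.1
          minor="${rest%%.*}"                 # 3
          git tag -f "v${major}"
          git tag -f "v${major}.${minor}"
          git push --force origin "v${major}" "v${major}.${minor}"
```

This sketch always moves the floating tags to the release that was just pushed, so back-porting a patch to an older minor line (for example, pushing v2.2.5 after v2.3.1) would move "v2" backwards. A real implementation would need to compare versions before moving the major tag.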
Manage Your Interfaces Like Any Other API
When providing a resource to an internal team, it is easy to fall into the trap of assuming they are comfortable with a deeper level of understanding of its inner workings than you would expect from, say, an open-source project or external vendor. This can quickly lead to leaky abstractions, which diminish the value of the shared component you are providing, as the user must take on that additional cognitive load rather than just understanding how to interact with its API.
To maintain a more consistent level of encapsulation, we aimed to be explicit with our Inputs and Outputs, minimizing assumptions and dependence on side effects in the environment. Building on this, we always attempt to provide a sensible default for values, so that users can focus on providing only the required values for most use cases. Lastly, in situations where environment variables are used, we recommend limiting their scope to be as narrow as possible.
For example, if an environment variable is only used in one step, declare it in the step; if it is used in multiple steps, declare it in the job; and if it is used in multiple jobs, declare it in the workflow. This common recommendation, which applies regardless of platform, decreases the risk of unintended variable modification or naming collisions and improves readability by keeping information close to where it is used.
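A small illustration of the three scopes follows. The variable names and values are invented for the example; only the placement of each env block matters.

```yaml
name: env-scoping-example

on: [push]

# Workflow level: used by more than one job.
env:
  APP_NAME: my-app

jobs:
  build:
    runs-on: ubuntu-latest
    # Job level: used by several steps in this job only.
    env:
      BUILD_MODE: release
    steps:
      - run: echo "Building $APP_NAME in $BUILD_MODE mode"
      - run: echo "Packaging $APP_NAME ($BUILD_MODE)"
      - name: Publish
        # Step level: used by this step only.
        env:
          PUBLISH_TARGET: internal-registry
        run: echo "Publishing $APP_NAME to $PUBLISH_TARGET"

  notify:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - run: echo "Finished pipeline for $APP_NAME"
```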
Focus on Your Workflows
Actions often get most of the attention in the GitHub Actions ecosystem; after all, the product is named after them. However, on a paved road, Actions are but the bricks. They’re useful for providing the ability to decompose a process into functional steps, but to provide the most value to internal delivery teams, platform engineers should devote just as much time to developing a catalog of reusable workflows that tie these steps together. A reusable workflow can ensure that any sequencing needed between steps takes place correctly and can simplify any setup and teardown activities.
For example, our team created an action for deploying a container to our Kubernetes environments. As part of the inputs to this, some environmental metadata needed to be provided. To better manage the effort around invoking this action, we used workflows to perform the additional steps around gathering the needed metadata, performing the deployment, and then adding the result to the output. What we found, though, was that this workflow’s code was being repeated in many different circumstances – for different triggering events, as well as for different deployment environments (development, staging, production, etc.). To further maximize reuse, we refactored these workflows into a single reusable workflow that determined the correct destination and image context based on the triggering event.
Furthering reuse: consolidating three nearly identical jobs ...
... into one reusable workflow
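To make the shape of such a consolidated workflow concrete, here is a hedged sketch of a reusable workflow with explicit inputs, a default value, and a declared output. The mapping from triggering event to environment, the input names, and the repository and image names are all assumptions made for illustration; the deployment step itself is a placeholder for a real deployment action.

```yaml
# .github/workflows/deploy.yml in a hypothetical shared repository
name: deploy

on:
  workflow_call:
    inputs:
      image:
        description: Fully qualified container image to deploy
        type: string
        required: true
      environment:
        description: Target environment; derived from the event when empty
        type: string
        required: false
        default: ''
    outputs:
      environment:
        description: Environment that was actually deployed to
        value: ${{ jobs.deploy.outputs.environment }}

jobs:
  deploy:
    runs-on: ubuntu-latest
    outputs:
      environment: ${{ steps.target.outputs.environment }}
    steps:
      - id: target
        name: Choose the destination from the triggering event
        env:
          REQUESTED: ${{ inputs.environment }}
        run: |
          if [ -n "$REQUESTED" ]; then
            target="$REQUESTED"
          elif [ "$GITHUB_EVENT_NAME" = "pull_request" ]; then
            target="development"
          elif [ "$GITHUB_REF_TYPE" = "tag" ]; then
            target="production"
          else
            target="staging"
          fi
          echo "environment=$target" >> "$GITHUB_OUTPUT"
      - name: Deploy (placeholder for the real deployment action)
        env:
          IMAGE: ${{ inputs.image }}
          TARGET: ${{ steps.target.outputs.environment }}
        run: echo "Deploying $IMAGE to $TARGET"

# A caller then needs a single job (repository name and tag are hypothetical):
#
#   jobs:
#     deploy:
#       uses: example-org/platform-workflows/.github/workflows/deploy.yml@v2
#       with:
#         image: ghcr.io/example-org/my-app:1.4.0
```

Inside a called workflow the event context is that of the caller, which is what lets a single definition serve pull request, branch, and tag triggers alike.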
Reduce the Risk From Third-Party Actions
GitHub’s documentation contains some good general guidance for security hardening of GitHub Actions. Included in this is a warning that "there is significant risk in sourcing actions from third-party repositories on GitHub," to which they offer some direction, namely pinning third-party action versions to a commit SHA, auditing the source code of the action, and only using actions from creators you trust.
This took a bit of wind out of our sails, as we had become accustomed to having appropriate risk (vulnerability/licensing) information for external components provided automatically by our software composition analysis (SCA) tools, which had enabled broader access to open-source components within our application development activities.
Moving back to manually inspecting any external action threatened to severely limit the utility of the open-source Actions marketplace that held such promise at the outset of our journey. With supply chain attacks on the rise, securing our pipeline was of paramount importance, yet manually auditing the source code and creator of an action was not an approach that we could scale, as our team might not have in-depth knowledge of the languages used in a given action, and the trustworthiness of the creator can be hard to assess when it’s an individual user’s contribution.
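For reference, the pinning guidance mentioned above looks roughly like this in a workflow. The second action and its SHA are placeholders rather than a real project or commit.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Mutable tag: convenient, but the publisher can repoint it at any time.
      - uses: actions/checkout@v4

      # Full commit SHA: immutable. The SHA below is a placeholder, not a real
      # commit; the trailing comment records which release it is meant to be.
      - uses: example-org/example-action@0123456789abcdef0123456789abcdef01234567 # v1.2.3
```

Pinning fixes what code runs, but it does not say whether that code is safe, which is why the audit and trust questions remain.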
GitHub recently announced support for Dependabot alerts for Actions, which is a good first step in the right direction. However, this still comes a bit short of the level of security we were hoping for, for several reasons:
- The alerts generated by Dependabot are largely dependent on creators' self-reporting vulnerabilities. In the case of smaller actions provided by individuals, it’s unclear what level of reporting they may provide.
- There is still no baseline assessment of risk – if a user is browsing the Actions marketplace, there are no readily available data around existing vulnerabilities in the action or the vulnerability management processes of the creator.
- There’s also the risk of someone deliberately embedding malicious code into an action (either at the time of creation if they are the original author, or more unknowingly through a supply-chain attack), which would not be caught by this approach.
- Lastly, a general best practice in dependency management that many companies practice today is to host an internal repository that proxies to the other major public repositories (NPM, Maven Central, PyPI, etc.) and keeps an internal cache of dependencies used by that company’s products to help ensure business continuity if the source of that dependency disappears. Actions would now have the same criticality, as they are essential to a working pipeline, and any disruption could result in significant lost time to developers.
To address the above limitations, we developed an approach that used the tooling we have, along with controls in the GitHub settings, to try to balance risk with development velocity. First, we allowed any internally written actions and any GitHub-authored actions. The next common control was allowing a list of verified creators, permitting all actions from organizations with whom we had existing trust established. Finally, we had an allowed list of individual actions from smaller, unverified third parties.
To expedite the approval process for these, we leveraged our existing scanning tools for SCA, SAST, and container security to scan the repositories and/or container images for these actions and establish some visibility into any potential vulnerabilities.
One final technique we have used in a few situations is to fork or import a copy of an action’s repository into our own enterprise. This provides three assurances:
- We will always have the code available if needed
- We can modify the code to better suit our use cases
- We can ensure that any attempt to change version references (for example, changing a tag) will not affect us.
This of course requires continuous effort to ensure any useful changes or fixes from the source repository are incorporated into our own version, so that trade-off should be considered when using this method.
GitHub also has plans in their roadmap for continuing to improve the security of actions. It’s clear that this is an evolving area, and we look forward to advancements in the platform and solutions ecosystem that can hopefully reduce the dependence on our homegrown assessment process.
Scale Infrastructure Practices to Keep Pace with Demand
As we expanded our use of the Actions platform, increasing the performance and reliability took a large share of the spotlight. We expanded our strategy to incorporate key learnings around operational use:
Emphasize Elasticity with Self-Hosted Runners
Like many enterprises, we make extensive use of self-hosted runners in order to allow jobs to communicate with resources on our internal networks, better manage the tooling and settings of the runner image to meet our needs, and help manage cost.
With our reliance on self-hosted runners, we wanted to make sure that we were matching the experience of using a GitHub-hosted runner, namely that runners were ready when a job started (minimizing queue time where a job is waiting for a runner to be available), and those runners were treated as single-use, ephemeral instances, minimizing the chances that changes left behind from a previous execution would affect the idempotency of the next job.
To do this we kept a warm pool of disposable runners – when a new job started up and grabbed a runner, we used a webhook to spin up the next runner so that the pool always remained able to address the current workload. We also pre-populated the runner pool with a default allocation of warm runners at the start of each workday (and then tore it down at night), and grew the size of this pool as usage increased.
Use Caching to Reduce Build Times
As mentioned above, our fleet of runners was designed for single use, with each job being provided with a clean working environment. While this provided clear benefits around build consistency and isolation, it also meant that any files generated would be unavailable to subsequent jobs unless they were directly shared using a feature like Artifacts. One of the main challenges we had to overcome here was spending excessive amounts of time regenerating existing resources.
Fortunately, in situations where large amounts of infrequently-changing data (such as application dependencies) need to be shared, GitHub provides several caching capabilities. Three of the most common situations which we attempted to address were shared application artifacts, container images, and action images.
The first situation involved application artifacts and dependencies being used between jobs. In this case, using the GitHub cache action brought immediate improvements to workflow performance.
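A typical use of the cache action looks like the sketch below. It assumes a Node.js project with a package-lock.json; the key prefix is an arbitrary choice for the example.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Restore the npm download cache
        uses: actions/cache@v4
        with:
          path: ~/.npm
          # A new key is produced whenever the lockfile changes.
          key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
          # Otherwise fall back to the most recent cache for this OS.
          restore-keys: |
            npm-${{ runner.os }}-
      - run: npm ci
```

When the exact key is not found, the action saves a new cache under that key at the end of the job, so the next run with the same lockfile starts warm.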
Container images, including commonly used base image layers, were the next most common situation. To alleviate the constant re-pulling of commonly used images, we investigated the different caching options supported by the Docker buildx action we were using for most of our image build tasks. After measuring the inline cache, GHA (GitHub Actions Cache), and registry cache options, we have so far had the most success with a registry cache hosted in the GitHub Container Registry (GHCR).
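A registry-backed build cache can be configured along the following lines. This is a sketch under assumptions: the image name and the "buildcache" tag are hypothetical, and the action versions shown are simply recent major versions rather than the ones we run.

```yaml
jobs:
  image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write             # push the image and its cache to GHCR
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/example-org/my-app:${{ github.sha }}
          # Read layers from, and write them back to, a dedicated cache tag.
          cache-from: type=registry,ref=ghcr.io/example-org/my-app:buildcache
          cache-to: type=registry,ref=ghcr.io/example-org/my-app:buildcache,mode=max
```

With mode=max the cache includes intermediate layers as well as those in the final image, which tends to help multi-stage builds at the cost of a larger cache.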
Finally, the trickiest of all have been action images. An important concept to understand is that an action packaged as a container image is built from scratch from its Dockerfile each time the action is used. (There is an alternative option to reference a public image in Docker Hub, but this was not feasible for internal actions we did not wish to publish publicly.)
We’ve made some effort to reduce this pain by following general image best practices around minimizing image size, but we are also investigating moving some jobs away from executing as containerized actions and instead packaging the same logic as a pre-built standalone image, then configuring the job to run inside that image. The job would then execute the script that would traditionally serve as the ENTRYPOINT of the action, behaving much as it would if run natively as an action. The advantage here is that we’d be pulling an image rather than building each layer every time, which would allow us to pre-pull these commonly used images as part of the startup process for our runners, caching them before the workflow starts.
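The alternative under investigation could look something like this sketch, in which the job itself runs inside a pre-built image. The image name, its tag, the runner labels, and the /entrypoint.sh path are hypothetical.

```yaml
jobs:
  scan:
    runs-on: [self-hosted, linux]
    permissions:
      contents: read
      packages: read              # pull the private image from GHCR
    # Every step in this job runs inside the pre-built image.
    container:
      image: ghcr.io/example-org/scan-tools:1.4.0
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      # The script that would otherwise be the action's ENTRYPOINT.
      - run: /entrypoint.sh
```

Because the image is referenced by name rather than built from a Dockerfile, a runner that has already pulled it can start the job without any build step.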
GitOps on the Horizon
One of the main struggles we’ve continued to have despite expanding the quality and efficiency of our portfolio of actions and workflows has been maintaining agility and visibility through the deployment and operational stages of an application. Our initial approach to deployments was built around substituting variables into parameterized templates that, using our custom actions, ultimately became a full-fledged Kubernetes manifest that we would apply to our environments.
This placed a large burden on our team to ensure that we had complex logic in place to handle various deployment styles, retry and recovery from deployment failures, and cleanup of stale deployments. Troubleshooting deployments was also made more difficult by the fact that the final manifest with all variables applied was not persisted anywhere after the deployment. This also meant that rollbacks or promotion of artifacts involved invoking the entire process anew.
Reflecting on these challenges, we have once again found ourselves examining how we can reshape our deployment strategy to play to the strengths of how GitHub has structured its CI/CD platform. Reiterating some of the learnings above, we looked for solutions that:
- Focus on repositories
- Control changes through pull requests
- Minimize custom deployment logic
By this point, readers familiar with the GitOps philosophy may begin to see where our train of thought was headed. GitOps advocates for a deployment strategy where the runtime state is expressed declaratively in a git repo, and automatically pulled and reconciled by agents running in the environment.
Moving towards a GitOps deployment model will require the adoption of new technologies and behaviors, but would allow us to leverage the deployment and reconciliation logic that has been fine-tuned by experts in the open-source community, while also maintaining a clear, persistent, version-controlled history of changes to the environment, with a clear path (pull requests) to easily change versions and keep a record of approvals where needed. This all comes with building on that same developer familiarity that drew us to GitHub in the first place.
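To illustrate what "runtime state expressed declaratively" can mean in practice, consider a plain Kubernetes manifest kept in an environment repository. The sketch below assumes no particular GitOps agent, and the path, names, and image tag are hypothetical.

```yaml
# environments/production/my-app/deployment.yaml (hypothetical path)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          # Promotion or rollback is a pull request that edits this one line.
          image: ghcr.io/example-org/my-app:1.4.0
          ports:
            - containerPort: 8080
```

In this arrangement the fully rendered manifest is itself the versioned artifact, so the deployed state can be read from the repository, and a rollback is a revert of the pull request that introduced the change.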
We’re in the early stages of this transition but are excited to take the next steps toward providing better, more reliable ways to deliver value to our customers.