Piranha: Reducing Feature Flag Debt @Uber


Summary

Murali Krishna Ramanathan describes the experiences building and deploying Piranha, an automated code refactoring tool to delete code corresponding to stale feature flags.

Bio

Murali Krishna Ramanathan is the architect of Piranha and a Staff Software Engineer at Uber. He currently leads multiple code quality initiatives across Uber engineering. In the past, he has led the research and development of novel static and dynamic analyses for concurrency bug detection and automated test generation.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Ramanathan: I'm going to talk about my experiences with building Piranha and deploying it at Uber to reduce stale feature flag debt. I will provide an introduction to feature flags, explain how they can introduce technical debt, and describe the challenges associated with handling that debt. Then, I'll discuss our efforts in building Piranha to address this problem, and the results that we have observed over 3-plus years. I will conclude with the learnings from this process and provide an overview of potential future directions for this effort.

What Are Feature Flags?

Let us understand what feature flags are using a simple code example. Here we have a code fragment which contains a feature implementation, a flag API named isEnabled, and a feature flag named SHOW_CUSTOM_IMAGE. If this code is part of an app, the value of the flag is obtained at app startup from the server. Depending upon the value of the flag, the app behavior can change to exhibit the feature or not. Observe that the same version of the application can exhibit distinct behaviors by simply toggling the flag value. This powerful aspect of feature flags explains their widespread use in software development.
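The fragment below is a minimal Java sketch of that example. Only the API name isEnabled and the flag SHOW_CUSTOM_IMAGE come from the talk; the surrounding scaffolding is hypothetical.

```java
import java.util.Map;

// Hypothetical reconstruction of the fragment described above; only
// isEnabled and SHOW_CUSTOM_IMAGE are taken from the talk.
class CustomImageExample {

    enum Flag { SHOW_CUSTOM_IMAGE }

    // Stand-in for values fetched from the flag server at app startup.
    private static final Map<Flag, Boolean> serverValues =
            Map.of(Flag.SHOW_CUSTOM_IMAGE, true);

    static boolean isEnabled(Flag flag) {
        return serverValues.getOrDefault(flag, false);
    }

    static String renderImage() {
        // Toggling the flag changes behavior without shipping a new binary.
        return isEnabled(Flag.SHOW_CUSTOM_IMAGE)
                ? "custom_image.png"    // feature implementation
                : "default_image.png";  // existing behavior
    }

    public static void main(String[] args) {
        System.out.println(renderImage());
    }
}
```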

Why Feature Flags?

Feature flags are quite useful. For example, we can have two different customers sitting in opposite corners of the world, using the same version of a food delivery app. The user in San Francisco may be rendered an image of granola and berries, whereas a user in Bangalore may be shown an image of Idli Sambar. The ability to customize the user experience can be seamlessly achieved using feature flags. This customization may be due to geography, the OS on which the app is used, or the maturity of the built feature. Further, using feature flags optimizes software development costs. Features can be reused and custom features can be modulated according to requirements. Without any additional effort, it is possible to provide numerous perspectives of the application to the user and iterate quickly towards an optimal solution.

Benefits of Feature Flags

Apart from being beneficial to provide customized user experience, feature flags play a critical role in doing A/B testing. Software development organizations may want to evaluate the value proposition between two features, and can potentially roll out an experiment using flags to get real world data.

Feature flags can also be used for gradually rolling out features, and can be used as yet another step in the software release process. This ensures that issues that are undetected with static analysis and internal automated testing can be detected before the feature is widely available to our users. Finally, they can also serve as kill switches, where globally available features can be turned off by simply turning on the kill switch. Given these benefits with feature flags, it is not surprising to observe the increased popularity among software developers and organizations.

Using Feature Flags

Using a feature flag in code, assuming the presence of a feature flag management system, is straightforward. It is done in three steps. First, define the flag: here, a new flag, RIDES_NEW_FEATURE, is defined. Second, use the flag by invoking the flag API with the appropriate flag: here, isTreated on the previously defined flag is used to modulate the behavior of the application. Third, test the flag by decorating the tests appropriately: here, a unit test has an annotation that tests the code with respect to the newly defined flag.
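A hedged sketch of the three steps, assuming a Java-style flag SDK: isTreated and RIDES_NEW_FEATURE come from the talk, while the ExperimentClient interface and the @FlagTreated test annotation are hypothetical stand-ins for the actual SDK and test tooling.

```java
// 1. Define the flag.
enum Flag { RIDES_NEW_FEATURE }

// Stand-in for the flag SDK interface.
interface ExperimentClient {
    boolean isTreated(Flag flag);
}

// 2. Use the flag to modulate application behavior.
class RideFlow {
    private final ExperimentClient experiments;
    RideFlow(ExperimentClient experiments) { this.experiments = experiments; }

    String pickupScreen() {
        return experiments.isTreated(Flag.RIDES_NEW_FEATURE)
                ? "new_pickup_screen"   // treated behavior
                : "old_pickup_screen";  // control behavior
    }
}

// Hypothetical annotation that runs a test with the given flag forced to treated.
@interface FlagTreated { Flag value(); }

// 3. Test the flag by decorating the test appropriately.
class RideFlowTest {
    @FlagTreated(Flag.RIDES_NEW_FEATURE)
    void showsNewPickupScreen() {
        RideFlow flow = new RideFlow(flag -> true); // treated stub
        assert "new_pickup_screen".equals(flow.pickupScreen());
    }
}
```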

Code Base Evolution

Let us now understand the evolution of code with feature flags. Here we have our example corresponding to the feature flag SHOW_CUSTOM_IMAGE. The if branch contains code related to when the custom image is enabled, and the else branch corresponds to when the custom image feature is disabled. Subsequently, more code and additional features may be added. Here, a new feature flag, SHOW_CUSTOM_VIDEO, is added to differentiate between the presence and absence of the custom video feature. A code modification may restrict the custom image feature to only premium users, and a flag, PREMIUM_USER, may be added to handle that scenario, as shown in the code here.
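Continuing the earlier sketch, the evolved fragment might look something like the following; the flag names are from the talk, while the rendering logic is illustrative.

```java
// Hypothetical sketch of the evolved fragment described above.
class FeedRenderer {
    enum Flag { SHOW_CUSTOM_IMAGE, SHOW_CUSTOM_VIDEO, PREMIUM_USER }

    interface Flags { boolean isEnabled(Flag flag); }

    static String render(Flags flags) {
        String image;
        if (flags.isEnabled(Flag.SHOW_CUSTOM_IMAGE)) {
            // A later change restricted custom images to premium users.
            image = flags.isEnabled(Flag.PREMIUM_USER)
                    ? "custom_image.png" : "default_image.png";
        } else {
            image = "default_image.png";
        }
        String video = flags.isEnabled(Flag.SHOW_CUSTOM_VIDEO)
                ? "custom_video.mp4" : "default_video.mp4";
        return image + " | " + video;
    }
}
```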

Stale Feature Flags

With the addition of more features and flags, it is only a matter of time before feature flags become stale. A flag is considered stale when the same branch executes across all executions. This can happen when the purpose of the flag is accomplished. For example, when A/B testing is complete, one feature may be rolled out globally for all users and executions. Good software engineering practice demands that we delete code from the stale branch, the code region that will never be executed. Further, any other code artifact that is only reachable from this region needs to be deleted. Beyond production code deletions, related tests for stale feature flags need to be deleted too. Not deleting the code due to stale feature flags creates technical debt.

In our running example, assume that after a period of time, the team decides to move SHOW_CUSTOM_IMAGE and SHOW_CUSTOM_VIDEO to 100% enabled and disabled respectively. Then, on each execution of the application, the corresponding conditions will evaluate to true and false respectively, as shown in the slide. The source code continues to carry references to these stale flags and the related dead code. The presence of this code makes it unnecessarily complex for developers to reason about the application logic. Consider the code region at the end of the two if statements: a developer now has to reason about eight different paths reaching that region, when in reality there may be just two paths once the stale code is deleted.

Technical Debt

This additional complexity and technical debt can be reduced by cleaning up the code as shown here. When the code is deleted, we observe that the custom image feature is only available for premium users, and the custom video feature is disabled for all users, vastly simplifying the code. To further emphasize the problem due to stale flags beyond coding complexity, here is a real-world example of a bug due to a stale feature flag causing financial damage. The bug manifested due to repurposing a flag, triggering automated stock market trades and resulting in half a billion dollars of losses in 2012. In general, technical debt due to stale feature flags causes multiple problems. This graph abstracts a control flow graph with nodes corresponding to flag API call locations. In the presence of a stale branch, we see unnecessary code, which affects application reliability due to the reduced effectiveness of testing and analysis tools. Binary sizes are larger as they contain dead code, which cannot be removed by compiler optimizations, since the value of the flag is known only at runtime. There are additional costs associated with compiling and testing code due to stale feature flags. The payload of flags from the server is unnecessarily large, carrying flag values that are either globally true or false. Finally, it increases overall coding complexity. By deleting code due to stale feature flags, we can reduce the technical debt and the corresponding disadvantages. The elephant in the room is the manual effort required to address this technical debt.
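Continuing the FeedRenderer sketch, the cleaned-up version described at the start of this section could look like this, assuming SHOW_CUSTOM_IMAGE was rolled out 100% enabled and SHOW_CUSTOM_VIDEO 100% disabled; this is illustrative, not actual Uber code.

```java
// Hypothetical cleaned-up version of the earlier FeedRenderer sketch.
class FeedRendererCleaned {
    enum Flag { PREMIUM_USER }  // only the live flag remains

    interface Flags { boolean isEnabled(Flag flag); }

    static String render(Flags flags) {
        // Custom images are now simply gated on premium users.
        String image = flags.isEnabled(Flag.PREMIUM_USER)
                ? "custom_image.png" : "default_image.png";
        // The custom video feature is disabled for everyone.
        String video = "default_video.mp4";
        return image + " | " + video;
    }
}
```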

Challenges with Stale Flag Cleanup

When we dived deeper into this problem at Uber, there were a number of challenges, both technical and process oriented. There was ambiguity pertaining to the liveness of a flag, as there was no easy way to denote the expiry date for a given feature flag. This raised questions pertaining to when a flag becomes stale. One possibility is that if a flag is rolled out 100% one way, and its configuration has not been updated in a while, then it can be considered stale. That only serves as a heuristic, because kill switches may satisfy this criterion but can never be stale by definition. Second, because stale flags accumulated over time, and there was churn in the organization, cleaning up flags owned by former employees or teams was non-trivial, as knowledge pertaining to the current state of the flag and its usefulness in the code was unclear. Even in scenarios where the above two problems were not blockers, we observed that prioritizing code cleanup over feature development was always an ongoing discussion. Finally, the variation in coding styles raised questions on how to build an automated refactoring solution to address this problem effectively.

Piranha: Automated Refactoring

At Uber, we tried to address these challenges due to stale feature flags by building an automated code refactoring bot named Piranha. Piranha can be configured to accept feature flag APIs to support SDKs of various feature flag management systems. It accepts as input the source code that needs to be refactored, along with information pertaining to the stale flag and the expected behavior. It automatically refactors the code using static analysis and generates the modified source code. Under the hood, it analyzes abstract syntax trees and performs partial program evaluation based on the input stale flag and expected behavior. Using this, it rewrites the source by deleting unnecessary code due to the stale feature flag, and then subsequently deletes any other code that becomes unreachable as a result.
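The following is a purely illustrative sketch of the idea behind partial program evaluation, not Piranha's actual implementation: calls to the flag API on the stale flag are replaced by the known constant, after which a rewriter can drop the branch that is now dead.

```java
// Illustrative sketch only; types and names are hypothetical.
class PartialEvalSketch {
    // Drastically simplified AST condition: either a flag check or a literal.
    interface Expr {}
    record FlagCheck(String flag) implements Expr {}
    record Bool(boolean value) implements Expr {}

    // Substitute the stale flag's known value, leaving other flags untouched.
    static Expr evaluate(Expr e, String staleFlag, boolean treated) {
        if (e instanceof FlagCheck f && f.flag().equals(staleFlag)) {
            return new Bool(treated);
        }
        return e;
    }

    public static void main(String[] args) {
        Expr cond = new FlagCheck("SHOW_CUSTOM_VIDEO");
        Expr simplified = evaluate(cond, "SHOW_CUSTOM_VIDEO", false);
        // A rewriter would now delete the branch guarded by this condition.
        System.out.println(simplified); // Bool[value=false]
    }
}
```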

Piranha Pipelines

In our initial iterations, we noticed that building a standalone refactoring tool, while useful, was not the easiest way to gain traction among users. Therefore, we built a workflow surrounding the refactoring to ensure that the effort required by the end user is minimal. For this purpose, we set up a weekly job that queries the flag management system for potentially stale feature flags and triggers Piranha accordingly. Piranha performs the refactoring on the latest version of the source to generate stale flag cleanup diffs, and assigns them to the appropriate engineers for review. Diffs are equivalent to GitHub PRs, and are used internally for reviewing code as part of the CI workflow. If the reviewing engineer considers the flag to be stale, and the code changes to be precise and complete, they stamp the diff so that the changes are landed. If more changes are needed beyond the automated cleanup, they make more changes on top of this diff, get it reviewed, and land the changes. If the flag is not stale, then they can provide this information so that diff generation for this flag is snoozed.
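A hedged pseudocode sketch of that weekly pipeline is shown below; every type and method name is a hypothetical stand-in rather than an Uber-internal API.

```java
import java.util.List;

// Hypothetical sketch of the weekly cleanup pipeline described above.
class StaleFlagCleanupJob {

    record FlagInfo(String name, String expectedBehavior, String owner) {}
    record Patch(String description) {}

    interface FlagService {
        List<FlagInfo> likelyStaleFlags();   // e.g. 100% rolled out, config unchanged for a while
        boolean isSnoozed(FlagInfo flag);    // owner previously marked it "not stale"
    }

    interface PiranhaRunner {
        Patch refactor(String sourceRoot, String flagName, String expectedBehavior);
    }

    interface DiffService {
        void postForReview(Patch patch, String author, String reviewer);
    }

    void runWeekly(FlagService flags, PiranhaRunner piranha, DiffService diffs) {
        for (FlagInfo flag : flags.likelyStaleFlags()) {
            if (flags.isSnoozed(flag)) continue;
            // Refactor the latest source and post the cleanup diff to the flag owner.
            Patch patch = piranha.refactor("/repo/latest", flag.name(), flag.expectedBehavior());
            diffs.postForReview(patch, "piranha-bot", flag.owner());
        }
    }
}
```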

Example of a Generated Diff

This slide shows a snapshot of a real Piranha diff that was landed in the main branch some time back, where the red colored regions on the left correspond to deletion of code due to the stale flag. I would like to emphasize here that the problem of stale flag deletion is peculiar, because the stale code is interspersed with useful code. This is in contrast to dead code in the traditional sense, which is usually associated with specific functions or files.

Demo

I will now present a brief demo of triggering Piranha diff generation to clean up a stale flag. Here is a Jenkins job to run Piranha Swift for one flag. We want to clean up the flag named demo_stale_feature for this demo. The expected behavior is treated. The generated diff needs to be reviewed, and for demo purposes, I will provide my LDAP. Building this will generate a diff. For this demo, I already created a diff, as shown here. This diff is authored by piranha-bot, and the reviewer is my LDAP. There is a brief description of the purpose of this diff, and then the inputs used to generate the diff are listed: the flag name, the expected behavior, and who the reviewer is. Then, specific processing instructions are provided to the reviewer to validate various aspects: the flag is stale, the cleanup is complete, and it is correct. If the diff should not be landed, then there are instructions pertaining to that here as well. To handle merge conflicts, there is a link within the diff that can be used to refresh the diff.

This is followed by the code changes, which show deletion of unnecessary code. The unit test change removes references to the flag. There is a code change that deletes the else branch, as shown here. Then there is deletion of the definition of the flag along with the accompanying comments. Then there is deletion of a field declaration, because it was assigned the return value of a flag API invoked on the stale feature flag, followed by deletion of the related code that depended on it. The reviewer can review this diff, and if they are happy with it, they can simply accept the changes. There is a linked Jira task that contains a description of what the appropriate flag owner needs to do, with a reference to the generated diff. While in this case I have shown the demo by triggering a cleanup diff manually, the most common workflow is the automated one that generates diffs periodically.
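To make these kinds of deletions concrete, here is an illustrative before/after in Java (the demo itself was on Swift code); all names here are hypothetical.

```java
// Hypothetical illustration of the deletions described in the demo.
class FieldDeletionExample {
    enum Flag { DEMO_STALE_FEATURE }
    interface Experiments { boolean isTreated(Flag flag); }

    // Before cleanup: a field holds the result of a flag API call on the
    // stale flag, and an else branch carries the control behavior.
    static String renderBefore(Experiments experiments) {
        final boolean showNewCheckout = experiments.isTreated(Flag.DEMO_STALE_FEATURE);
        if (showNewCheckout) {
            return "new_checkout";
        } else {
            return "old_checkout";  // dead once the flag is 100% treated
        }
    }

    // After cleanup: the field, the else branch, and (elsewhere) the flag
    // definition and its tests are all deleted.
    static String renderAfter() {
        return "new_checkout";
    }
}
```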

Piranha Timeline for Cleanup

At Uber, Piranha has been used for more than three years to clean up code due to stale flags. This graph shows the number of cleaned-up stale flags versus time in months. The initial prototype was available from December 2017, and we extended Piranha to multiple languages over the course of a year and set up the automated workflow. General availability was announced in October 2018. Seeing the positive results, we decided to document our experiences as a technical report and open sourced all variants in January 2020. Since open sourcing Piranha, we have seen increased usage to clean up stale flags. More recently, usage has increased even further following an engineering-wide effort to clean up stale flags and improve the overall code quality of our mobile apps. As can be observed from this graph, we have seen increased stickiness over time.

Code Deletion Using Piranha

We have now deleted more than a quarter million lines of code using the Piranha workflow. This roughly corresponds to approximately 5000 stale flags being deleted from our mobile code bases. This graph shows the distribution of diff counts on the y-axis versus deleted lines per diff on the x-axis. A majority of the diffs contain code deletions of between 11 and 30 lines, followed by code deletions of fewer than 10 lines. We also noticed a non-trivial fraction of diffs with more than 500 lines deleted per diff.

What Piranha Users Think

Beyond quantitative statistics, we wanted to get the pulse of what Piranha users think about it and the potential pain points that we can help address. For this purpose, we conducted an internal survey. We received responses from engineers developing feature flag based code across different languages: Java, Kotlin, Swift, and Objective-C. At least 70% of them had processed more than 5 Piranha diffs. Processing a Piranha diff involves a few things: checking whether the flag is really stale and can be cleaned up, checking whether the generated code changes are precise and complete, and making any additional changes on top of the generated diff. We also received responses from an almost equal number of engineers who write code for Android and iOS apps respectively. We wanted to understand the average time taken to process a Piranha diff. We found that approximately 75% of the respondents consider that processing a Piranha diff takes less than 30 minutes. In fact, 35% of the respondents thought that on average a Piranha diff takes less than 10 minutes to process. We also wanted to understand whether our approach to assigning these diffs to users was accurate. Approximately 85% of users think that we get it right mostly or always. The responses on the staleness of the flags show that 90% of the diffs generated are to clean up genuinely stale flags. An automated workflow helps to reduce technical debt with reduced manual effort. The metadata corresponding to the stale state of a flag and its owner also plays an important role in this process.

The Top 3 Pain Points

We also wanted to understand the top 3 pain points that users face while processing Piranha diffs. The top pain point according to our survey was that processing diffs is affected when more manual cleanup is needed, because it requires context switching. This suggests refining the Piranha implementation to help reduce the manual effort. The required changes here range from handling simple coding patterns to more challenging problems pertaining to determining the unreachability of code regions. The second pain point corresponds to prioritizing the work related to cleanup, which can be addressed by setting up organizational initiatives. The third pain point was due to merge conflicts, where cleanup diffs conflicted with other changes to the same code regions. This was because of the time lag between diff generation and diff processing. We were able to resolve this by providing a refresh option within the diff, which regenerates the code cleanup on top of the latest version of the source, circumventing the problem. The remaining pain points were needing more info pertaining to the feature flag, and being blocked by additional reviews due to operating on a monorepo code base. We also received free text feedback from a few users, as seen here.

Learnings 1: Benefits of Automation

There have been many learnings with the usage of Piranha. We have noticed benefits due to automation, as it helped improve overall code quality and was able to reduce technical debt with minimal manual effort. Interestingly, we also noticed that the automation was able to detect many instances where flags were rolled out incorrectly, or not rolled out at all. These issues would not have been discovered in the absence of this workflow. Further, the automated cleanup handled complex refactorings seamlessly, where manual changes could potentially introduce errors. Finally, a surprising side effect of the Piranha workflow with regular usage was that it steered engineers into writing code in a specific manner, so that it becomes amenable to automated code cleanup. This ensured simplicity of feature flag related code, along with enabling better testing strategies.

Learnings 2: Automation vs. Motivation

The second set of learnings was on automation as compared to motivation. Initially, I found it quite surprising that even in instances where the code cleanup was not complete, developers were happy to make changes on top of the diff to clean up the code. The only downside was the slowing down of the process due to the additional manual effort. The second surprising aspect was that even when the code cleanup was complete, there were instances where the changes were not landed to the main branch. In these cases, organizational prioritization becomes important. It goes without saying that there is no replacement for a motivated engineer.

Learnings 3: Process Enablers

Finally, there are various aspects that ensured that technical debt due to feature flags is kept under control. Automated refactoring is one part, and there are other critical pieces. These include management of the ownership and the flag lifecycle, prioritizing source code quality, supporting infrastructure to validate the correctness of the changes, and review policies associated with automatically generated diffs. Integrating these aspects enables a better software developer experience with feature flags.

Open Questions in the Domain

While we had looked at reducing technical debt due to stale feature flags, there are still many open questions in this domain. What is the cost of adding a feature flag to code, given the complexities it introduces in the software development process? What are the costs incurred due to an incorrect rollout? There is additional developer complexity with the presence of feature flags. A more detailed investigation needs to be done to understand the software engineering cost of developing in code bases with many feature flags. Enabling 100% automated cleanup has many interesting technical challenges at the crossroads of static analysis and dynamic analysis, which merits further exploration. Then there is the question of performance. What are the runtime costs associated with widespread use of feature flags? Are there compiler optimizations that are disabled due to the presence of feature flags? Will they affect binary sizes and app execution times? Finally, how do we prioritize handling the software development costs related to feature flags and incentivize developers to address the technical debt associated with it, collectively?

Future Directions

Beyond the fundamental questions associated with flag based software development, there are many future directions of engineering work related to Piranha. Improving the automation by handling various coding patterns, implementing deep cleaning of code, extending to other languages, and building code rewriting for applications beyond stale flag cleanup using this framework correspond to one dimension of work. The second dimension is extending and applying Piranha workflows to various feature flag management systems, code review systems, and task management systems, which can help engineers tackle this problem in other software development organizations. The third dimension is work on improving the flag tooling to simplify the software development process with feature flags. Working on these future directions requires more engineering effort. Piranha is available as open source at this link, github.com/uber/piranha. We welcome engineering contributions to our efforts. In fact, the Piranha variants for JavaScript and Go are entirely external contributions.

Summary: Story of Piranha

In 2017, there were questions on designing solutions for handling stale flag debt. In 2019, Piranha was being used to clean up stale flags and was gaining popularity. In 2021, Piranha is part of the regular developer workflow, and the questions being asked are about how we can improve automation rates with Piranha. This change in narrative from 2017 to 2021 summarizes the story of Piranha.

Questions and Answers

Losio: I'm really impressed by the numbers you mentioned, a quarter of a million lines already removed, 5000 stale flags. That's quite impressive. I know it is a big company, but those are huge numbers.

Have you explored using Piranha to clean up dead code in general, not just code related to feature flags? Outdated libraries, any other code? I noticed that you're already partially addressing this, but it's mainly targeted at stale feature flags. I was wondering, what's the direction there? What are the options?

Ramanathan: We actually target Piranha at stale feature flags. It's a natural question, and we explored the problem of deleting dead code, because that's another major pain point for most organizations. One of the benefits of automating stale flag cleanup versus dead code cleanup is that, for stale flags, there are clear anchor points. There is a clear logical association for what constitutes dead code with respect to a feature flag. Whereas dead code in general can be spread across the entire code base, and splitting it into multiple chunks, assigning it, making it compilable, and having it run tests successfully is non-trivial. That's one challenge. The second challenge is that with stale feature flags, there is usually a specific owner, a single point of contact, but with general dead code, it's not that easy to find a specific point of contact, because that code could have been iterated upon by different folks. It probably falls within the purview of a team, and trying to drive the cleanup by working with a team, as opposed to working with a single point of contact, is harder and more challenging. We explored that. We have it on our roadmap, as you've seen, but haven't fully solved it.

Losio: Actually, you already partially addressed my next question: how do you define the owner of a feature? That's really important. How do you interact with them, in the sense that there are many scenarios where it's unclear whether something is really stale, or whether it might be rolled back or used again? Of course, it's a good point that there's an owner; it's the full engineering team, not just the developer. Do you have any case where you really need to involve the product owner, or have a further discussion about whether something can be removed?

Ramanathan: In fact, in this entire workflow, we post diffs to the appropriate owner and let them review them, so the decision falls under them. We have a heuristic of what is probably stale, and therefore we say, "This is a likely stale flag. We are going to do a cleanup." Then the eventual cleanup decision has to be made by the engineer who actually implemented it, or in consultation with the team or with the PPM associated with the product. The final decision to actually land the changes rests with the appropriate teams. We don't make that decision for them. We just simplify the process by bubbling it up, saying, "This flag has not had a configuration change for some time now. It's 100% rolled out one way or the other, so it may be a candidate for cleanup. Here's a change; you can land it with the click of a button and then you are done." Whether you want to do it or not is a decision that the team has to make.

Losio: Has Piranha made you rethink how you define a stale flag?

Ramanathan: It has, in some sense. One of our recommendations is that when a flag is created, there needs to be an expiry date associated with it, defined by the owner. That way, for kill switches, the expiry date can be set far in the future, whereas for gradual rollout features, it can be time bound. This will also help the workflow surrounding Piranha, because we don't need any heuristics about what could be considered stale; we can instead use the information or the intent that the owner has specified as part of the management system to say, "This is a stale flag based on what you have recorded, so now, let's clean this up."

Losio: You just mentioned coding styles and automation. How critical are specific coding styles at this point to enabling high automation rates in the way Piranha works?

Ramanathan: At Uber, we have noticed cleanup happening across Android and iOS apps, across multiple teams and multiple apps. It's very interesting: there are certain teams that follow a specific coding style, and the style itself makes the cleanup much easier and more automatable. There are teams that may not necessarily follow such strategies. In those cases, we would have to over-engineer Piranha to handle their code. What we have observed from our users is that, initially, certain code regions may not be cleaned up completely automatically. Then the engineers who are working on that code quickly see what automation is doable by the bot. When they build new features and add new flags, they follow the guidelines that we have laid out for creating feature flags, so that when it comes to deleting a test, it becomes much easier, for example by adding annotations, as opposed to not having a clear specification of what constitutes stale code. Initially, there will be some amount of friction, but eventually, with time, the coding style changes so that it becomes easy to integrate with the bot.

Losio: You're basically saying that code written after Piranha was introduced is somehow much easier to handle, because the developer already knows they want to use it, so they're already writing the code in a way that is going to be easier to clean up later. That's really a self-fulfilling process that is helping out there.

Ramanathan: Particularly, this happens after processing three or four diffs. Engineers are also reviewers on Piranha diffs for other teams, because they may have some reviewing responsibilities for those. They notice that certain Piranha diffs just land as is, and therefore they adopt those practices. In an earlier version of the Piranha workflow, we also had a link to the coding conventions associated with Piranha, and that had some traction, which influenced the coding style that teams adopted.

Losio: What was the most common way to deal with those problems before Piranha? I understand what you're addressing, I was wondering, if someone never had Piranha before, how were you basically working before that?

Ramanathan: There are multiple ways that tech debt due to feature flags was handled earlier. One was fix-it weeks. Yearly, there would be two or three fix-it weeks, where the entire team sits for the week and tries to get rid of the tech debt, which is essentially a lot of work during that period of time with not much getting done. That was one problem. Tech debt also keeps accruing between fix-it weeks, and it becomes much harder to reach completion. That's one thing. Second, there was some automated tooling built on just textual processing, considering code as text, as opposed to ASTs, and trying to rewrite code using Python scripts. This would work for very specific teams, but it would be very brittle, because with some additional spaces, or assignments and reassignments, or additional comments, suddenly the entire script would fail, and you would not even be able to get the code to compile. It was not a foolproof approach. What Piranha was able to do was use ASTs and partial program evaluation to ensure that we are able to do as much cleanup as feasible, and then provide the opportunity to add more engineering effort so that we can eventually reach a 100% automation rate at some point in the future.

Losio: I understand as well that it probably had a very positive impact on the morale of the team, because if I have an iteration where I'm just basically working on technical debt, cleaning up code, it's probably not as exciting as one where I just have to merge some code or fix something at the end, because most of the job has already been addressed in a semi-automatic or automatic way. Is that right?

Ramanathan: That's right. In fact, the initial variant of Piranha for Objective-C was prompted by a team reaching out to us saying, we have to stop all development work for the next two or three weeks, because we have a lot of nested feature flags, and just developing on this code base is becoming much more complex; they wanted tooling for that. We quickly whipped up a prototype. That's how it took off, in some sense.

Losio: What's the process of integrating Piranha with other third-party flag management systems?

Ramanathan: Piranha is configurable, as I mentioned. There are JSON configuration files present in the code base itself. If you look at the GitHub code base, for each variant, we make it configurable so that for each feature flag API, you can specify whether it is treated, what behavior is expected, and whether it's a testing API, an enabling API, or a disabling API. Once the APIs are configured, you will be able to run Piranha with that tool. Typically, it could involve maybe 50 to 60 lines of JSON file updates to support 5 to 6 different APIs. This can be done for pretty much any feature flag management system; out of the box, we only support the one that's available within Uber.
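As a rough illustration, a configuration entry could look something like the following. This is a hypothetical sketch in the spirit of what is described above, not the exact Piranha schema; consult github.com/uber/piranha for the real format and field names.

```json
{
  "methodProperties": [
    { "methodName": "isTreated",        "flagType": "treated", "argumentIndex": 0 },
    { "methodName": "isToggleEnabled",  "flagType": "treated", "argumentIndex": 0 },
    { "methodName": "isToggleDisabled", "flagType": "control", "argumentIndex": 0 }
  ]
}
```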

 


 

Recorded at:

Nov 27, 2021
