Feature flags are a useful tool to conduct A/B experiments and to roll out changes in a controlled way. To make sure that their use does not end up disappointing users when a change causes a crash or degrades the user experience, Lyft created Safe Mode, specifically aimed at preventing crash loops on launch.
When a crash on launch was introduced by turning on a feature flag or changing other remote configurations, we usually had to ship a hotfix to get users out of infinite crash loops since we had no way of pushing configuration updates to the app when it was crashing so early in its lifecycle.
Key to the Safe Mode implementation is Bugsnag, a platform for monitoring app stability that provides specific support for managing feature flags and A/B experiments. In particular, Bugsnag allows developers to declare the list of feature flags in use, which is then sent along with each crash report. Bugsnag can also identify crashes on launch by providing an API to mark launch-time events.
Lyft starts Bugsnag very early in the app lifecycle, right in its main function, and configures it so that it considers an app launch complete when applicationDidFinishLaunching returns on iOS and once the main screen is displayed on Android.
Safe Mode is also started in main, right after Bugsnag is initialized, and queries the latter to see whether the previous session crashed before the app fully launched. If so, it logs a safe_mode_engaged analytics event and enters a shadow state where it first detects which feature flags were consumed in the previous session and then locks their configurations to local default values.
This effectively puts the potentially problematic features/codepaths into their default “safe” state and allows the user to use the app as they normally would (albeit with some functionality disabled).
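The launch-time check described above can be sketched roughly as follows. This is a minimal, language-neutral illustration written in Python, not Lyft's actual code: `FlagStore`, `engage_safe_mode_if_needed`, and the `crashed_during_launch` and `consumed_flags` inputs are hypothetical names standing in for what the crash reporter knows about the previous session.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical flag store: remote values override local defaults unless locked."""
    defaults: dict
    remote: dict
    locked: set = field(default_factory=set)

    def value(self, name):
        # A locked flag ignores remote configuration and falls back to its default.
        if name in self.locked:
            return self.defaults[name]
        return self.remote.get(name, self.defaults[name])


def engage_safe_mode_if_needed(store, crashed_during_launch, consumed_flags, log_event):
    """Lock the flags consumed in the previous session if that session crashed on launch."""
    if not crashed_during_launch:
        return False
    log_event("safe_mode_engaged")
    # Only flags with a known local default can be locked to it.
    store.locked.update(name for name in consumed_flags if name in store.defaults)
    return True


# Assumed scenario: the previous session consumed "new_map" and crashed on launch.
events = []
store = FlagStore(
    defaults={"new_map": False, "fast_checkout": False},
    remote={"new_map": True, "fast_checkout": True},
)
engaged = engage_safe_mode_if_needed(
    store,
    crashed_during_launch=True,
    consumed_flags={"new_map"},
    log_event=events.append,
)
```

In this sketch only the flags consumed before the crash are locked, so a flag that was never read during the failed launch (here the made-up `fast_checkout`) keeps its remote value.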
Once an app has launched successfully after a crash, it refreshes its feature flags to give them another chance on the next launch. If the crashing feature has not been fixed before that moment, the app will crash and engage safe mode again.
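The resulting cycle can be illustrated with a small simulation. It is only a sketch under simplifying assumptions: a single remote flag is responsible for the crash, and `simulate_launches` and its labels are hypothetical names introduced for this example.

```python
def simulate_launches(flag_crashes_on_launch, launches):
    """Label each launch as 'crash', 'safe' (Safe Mode engaged), or 'ok'."""
    previous_crashed = False
    history = []
    for _ in range(launches):
        # Safe Mode engages only when the previous launch crashed.
        safe_mode = previous_crashed
        # In Safe Mode the flag is locked to its default, so the crashing path is off.
        crashed = flag_crashes_on_launch and not safe_mode
        if crashed:
            history.append("crash")
        elif safe_mode:
            history.append("safe")
        else:
            history.append("ok")
        # After a successful launch, flags are refreshed and get another chance.
        previous_crashed = crashed
    return history


unfixed = simulate_launches(True, 4)
healthy = simulate_launches(False, 3)
```

Under these assumptions, an unfixed flag makes launches alternate between a crash and a Safe Mode session, so a user is never caught in an uninterrupted crash loop while a fix is rolled out.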
A Grafana dashboard is fed with all crash-related events so engineers can easily detect a spike and take action quickly. Lyft engineers also made sure that Safe Mode would not itself become a cause of instability. Before rolling it out, they set up a dedicated feature flag that intentionally triggered a crash on launch, so they could test the whole approach in internal alpha versions of the app.
According to Lyft, Safe Mode has proved effective in reducing pain points with feature flags. They plan to extend it to include handling app hangs, which usually do not cause an app crash and are trickier to detect; to be more effective in determining which specific feature flag caused a crash and to avoid disabling all of them; and to automatically disable feature flags that provoked too many crashes.