Key Takeaways
- There are three levels of refactoring: code-level micro refactorings, refactoring to patterns, and refactoring to a deeper model.
- Making many small changes can compound into large changes, both in the system itself and in your understanding of it.
- The case study discussed here covers camera integration with Nexia Home Automation. The refactoring performed made it easier for developers to understand the domain model and to reason about the Java and Ruby code in the system.
- The refactoring reinforced good DDD practices, such as having strong bounded contexts, and explicitly translating across context boundaries.
- Using feature toggles and a phased rollout provides the option to defer decisions until you have enough knowledge to make informed choices.
This article was adapted from a presentation recorded at Explore DDD 2017.
In the world of chemistry, you can take different substances, each of which, on its own, is in a stable state, and then combine them and they react to become something greater than the sum of the reactants. Similarly, in software, there are different refactoring reagents, each with different effort, frequency and potency. When combined with domain-driven discovery and exploration process catalysts, these refactoring reagents provide code chemistry reactions that transform code towards a rich domain model.
This is a story of a refactoring within a system that's been around for a long time, the video camera support in Nexia Home Intelligence. Nexia is a large-scale Ruby on Rails application with a customer base using tens of thousands of video cameras.
I'm going to cover refactoring at three different levels. In Martin Fowler's Refactoring, he talks about micro refactorings, small changes that are continually being made at the code level to make incremental improvements. Good developers invest the time to memorize and build muscle memory for using their refactoring tools, so these micro refactorings become second nature.
In Refactoring to Patterns, Joshua Kerievsky talks about higher level patterns, such as the Strategy Pattern. He also identifies "smells" to watch for, like shotgun surgery, where making one small change requires lots of additional changes. These ideas brought me comfort that I didn't have to get my design exactly right from the beginning, and that I could start developing, and when I encountered one of the smells, I had the tools available to refactor, but only after it was necessary.
I'll be covering the third level, refactoring to a deeper model, which Eric Evans introduced to us in Domain-Driven Design. When I first read the book, part three really caught my attention. In it, he talks about a project where the model just wasn't working, and they came up with a way of refactoring the model to introduce new concepts, and it completely transformed the project.
Looking at these three levels, if you can introduce new concepts into your model, that's a far more powerful refactoring. I would contend that you need to get good at the micro refactorings and using patterns to fully leverage refactoring to a deeper model.
About Nexia Home Automation
The Nexia Home Automation system is written in Ruby, and allows you to do a variety of home automation tasks, such as knowing if windows are open or closed, integrating with motion sensors, and connecting to cameras. Dan Sharp and I worked on the camera system, which is the scope of the domain I'll be covering.
The domain of home automation isn't like other domains, such as banking or insurance. Instead, it's a highly technical domain, dealing with hardware and firmware. This means the customer isn't the obvious one, the homeowner, since you can't ask them questions about firmware.
Our goal was to continue to work on new features, while at the same time improving the support for new video cameras. When a new camera came on the market, it could take weeks or months, with a lot of shotgun surgery, to add support for that camera to Nexia. We hoped to greatly reduce that timeframe.
If you were to walk through the process of adding a camera to your Nexia setup, you would start to notice some of the language that is used. For example, instead of adding a camera, you enroll a new camera, and then you activate it. Enrollment just lets Nexia know that the camera exists, and it has to be done before the camera is able to connect to Nexia.
Architectural Walk-through
Thousands of cameras, installed in customer homes, talk to many Camera Manager components. The Camera Managers are written in Java, and communication is over HTTP and SSL. As messages come into the managers from the cameras, we place those messages into a Redis jobs queue. Those messages are pulled off the queue by many Portal Workers, which run in the background of the Portal. The Portal Workers are written in Ruby. The Nexia application also needs to talk back to the cameras, so we queue up those messages on a RabbitMQ message bus, and those messages are handled by the Camera Managers. Figure 1 shows this very high-level view of the architecture.
Figure 1: Nexia Architecture
The application itself is a Rails application, and a partial view of the codebase is shown in Figure 2. If you're not familiar with Rails development, the models folder is often where everything goes that isn't a controller, so don't assume that it implies a rich domain model. I've highlighted some of the workers I mentioned, as well as camera files related to automation. An example of automation in Nexia would be having a series of tasks happen at sunset, or dimming the lights at a specified time.
Figure 2: Rails Application Structure
Three Main Challenges
Our first main challenge was that the code was very difficult to reason about. The Java camera managers were over-architected. They used a meta-framework, with lots of abstractions, to potentially allow it to work with any type of camera that could ever connect into Nexia. In reality, most of the cameras were quite similar (for example, they all used SSL and HTTP), and we didn't need all the extra layers of abstraction.
As just one example of a systemic problem, Figure 3 shows a portion of the handleRequest() method. Any DDD practitioner should immediately have questions about the use of language in this code. Line 91 introduces the term Zombie. What's a Zombie? Line 93 says "if not authorized", but the comment on line 94 mentions authentication, which isn't the same thing. To make matters worse, line 98 declares a variable as auth, which could mean either term. While just a small sample, this code shows some of the issues that we encountered throughout the camera manager codebase.
Figure 3: Camera Manager handleRequest()
On the Ruby side, the support for cameras in the portal workers grew over time. Because most of this work was done as needed, by various developers (mostly contractors), it became overloaded with responsibilities, and was not purposefully modeled.
The great thing about Ruby code is that it's so concise, and you can express a lot of things in a very small number of lines. However, that was not the case in the CameraWorker, which is responsible for authenticating and closing camera connections. At over 130 lines, the full method would be excessive to show in this article, so Figure 4 shows just a portion of the code. In multiple places, the worker was doing "surgery" and reaching into the camera object to modify state, rather than declaring the desired behavior. We also encountered some unfortunate naming, such as the start_motion call on line 89, which looked like a command to start motion, but we learned it was not.
Figure 4: CameraWorker
As with the Java code, this is just a small snippet, but serves as an example of more systemic problems. All of this made it difficult to reason about the code.
The second challenge was that the camera manager was overly coupled to the device manager. Understanding this requires a bit of the architecture's history. The camera manager (CM) evolved from a general device manager (DM) that was designed to manage a wide variety of device types. This led to a shared kernel with the rest of Nexia. The shared kernel became a significant deployment issue, and basically meant we could not upgrade Java. We eventually learned that this coupling was unnecessary. While a camera is a device, it does not have a lot of similarity with other devices, like a door lock.
The third challenge, which really involves a DDD perspective, was the domain knowledge being in the wrong place. Most of the domain logic was inside the Java camera manager code. This meant adding new features was complex, time consuming, error prone, and hard to test. Also, modifying that code meant touching almost everything with shotgun surgery.
DDD Concerns
I've laid out all these issues not to simply complain about bad code, but to identify that these are not insurmountable challenges. Furthermore, DDD provides techniques that can greatly improve this situation. That starts with reviewing four main concerns of DDD.
First, we want to grow and express a deep domain model in the code. Number two, we want to refactor the code towards a ubiquitous language where things are consistent and understood, and that the intent in the code is clear. Third, we want to clearly delineate the boundaries and responsibilities of the model and modules. It's very difficult to have high cohesion and loose coupling without clearly defined boundaries. Lastly, we need to enforce model boundaries (aka bounded contexts) and explicitly translate across those boundaries.
Where to Start?
What do you do when you encounter code like this? There are a few options that I know some people have tried, but that I rarely see meet with much success. One option is to "drain the swamp" and try to remove all the old code and just start fresh. Sometimes this occurs when someone locks themselves away for a few weeks to just do a major refactoring, planning to emerge when it's "fixed." Option number two is to just say it's somebody else's problem, and not deal with it.
The option I like to support is to experiment and see what you can do. I like the story of how the British cycling team won the gold medal at the 2012 Summer Olympics. The team decided to experiment and find lots of ways to make small improvements, and make many of them. Sir Dave Brailsford, head of British cycling, said, "It struck me that we should think small, not big, and adopt a philosophy of continuous improvement through the aggregation of marginal gains. Forget about perfection; focus on progression, and compound the improvements." The idea of continuous improvement should sound familiar to anyone who's worked in an agile software development process.
Baby Steps
On our project, we tried a bunch of different things, and many of them didn't work. In March 2014, we took a "baby step." We realized that the domain concept of a camera is actually two different responsibilities. It is a stateful representation of a physical device, also known as an entity. But it also acts as a command handler, providing an interface for sending commands and queries to a physical device.
Looking first at the Ruby code, we saw that both responsibilities were in the Camera object. This was a subclass of Device, which had a lot of issues. Since there was no further subclassing, all logic was in one huge, "god" object for Camera.
Rather than changing the Camera, we first decided to add a new domain service. This follows the principle of being open to extension and closed to modification. This new Camera::CommandService became the new home for all camera commands and queries. Because we took the approach of writing this as an extension, we were able to create it using good TDD and pair-programming practices to create better quality design work without breaking anything. We also had the benefit of a good test suite that covered controllers, workers, and other collaborators, allowing us to make updates with a high degree of confidence. This baby step provided a small, but measurable improvement.
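To make the split between the two responsibilities concrete, here is a minimal sketch of what such a service might look like. It is not the Nexia code: apart from the Camera and Camera::CommandService names taken from the text, everything here (the transport object, its deliver method, the pan_tilt and snapshot commands, and the RecordingTransport test double) is an assumption made for illustration.

```ruby
# Hypothetical sketch. Only the names Camera and Camera::CommandService
# come from the article; all other names are illustrative assumptions.
class Camera
  # Entity responsibility: a stateful representation of the physical device.
  attr_reader :id, :name

  def initialize(id:, name:)
    @id = id
    @name = name
  end

  # Command-handler responsibility, pulled out into a domain service.
  class CommandService
    # transport: any object responding to #deliver(camera_id, command, params)
    def initialize(transport)
      @transport = transport
    end

    def pan_tilt(camera, direction:)
      @transport.deliver(camera.id, :pan_tilt, { direction: direction })
    end

    def snapshot(camera)
      @transport.deliver(camera.id, :snapshot, {})
    end
  end
end

# In-memory stand-in for whatever actually talks to the device.
class RecordingTransport
  attr_reader :sent

  def initialize
    @sent = []
  end

  def deliver(camera_id, command, params)
    @sent << [camera_id, command, params]
    :ok
  end
end

service = Camera::CommandService.new(RecordingTransport.new)
service.pan_tilt(Camera.new(id: 42, name: "Front door"), direction: :left)
```

Because the service is purely additive, the existing Camera class does not have to change for it to be introduced, and it can be test-driven in isolation with a stand-in transport like the one above.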
Finding a Seam
In his book "Working Effectively with Legacy Code", Michael Feathers talks about "finding a seam" in your code, someplace where you can insert something new. We did an EventStorming session on the various ways that devices are enrolled in Nexia, and that helped visualize a lot of the similarities in those workflows. Looking at the Java code, we decided that the Camera Manager component had too many "smarts". It should just be managing camera sessions, but it was doing so much more. Our idea was to make Camera Manager a generic HTTP proxy, since all commands are just HTTP calls, and consolidate all the camera logic on the Ruby side.
The seam we created was a new generic send_url() command on Camera Manager. We focused the HTTP proxy model on connection management, authentication, camera-to-portal messaging, and logging (to aid in troubleshooting and future metrics). On the Ruby side, we could build on the marginal gains from the year before, and use the Camera::CommandService to send arbitrary commands to cameras.
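The shape of that seam can be sketched in Ruby as follows. This is an illustration under stated assumptions, not the actual interface: the send_url name comes from the text, but its parameters, the "/ptz" path, the "move" query parameter, and the FakeCameraManager stand-in for the Java component are all hypothetical.

```ruby
require "uri"

# Hypothetical stand-in for the Java Camera Manager acting as a generic
# HTTP proxy. The real send_url signature is not shown in the article.
class FakeCameraManager
  attr_reader :requests

  def initialize
    @requests = []
  end

  def send_url(camera_id, url)
    @requests << [camera_id, url]
    "200 OK"
  end
end

class Camera
  class CommandService
    def initialize(camera_manager)
      @camera_manager = camera_manager
    end

    # The camera-specific knowledge (path and parameters) lives on the
    # Ruby side. The path and parameter name are made up for this sketch.
    def pan_tilt(camera_id, direction)
      send_command(camera_id, "/ptz", { "move" => direction })
    end

    private

    # Every command reduces to "ask the proxy to call this URL".
    def send_command(camera_id, path, params)
      url = "#{path}?#{URI.encode_www_form(params)}"
      @camera_manager.send_url(camera_id, url)
    end
  end
end

manager = FakeCameraManager.new
Camera::CommandService.new(manager).pan_tilt(7, "left")
```

With this arrangement, supporting a new command means adding one method on the Ruby side that builds a URL; the proxy itself does not need to know what the URL means.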
Migrating Domain Logic
With the baby step from a year ago, and this new seam we found, we could begin migrating the camera domain logic from the Java side to the Ruby side. One at a time, we moved Camera::CommandService commands (e.g. Pan-Tilt) to use the generic Camera Manager interface. What was really cool about this approach was it required no changes to the Java code. As an internal refactor on the Ruby side, we could work on a single command, test it, and iterate until it was working.
When I present this story, one question I hear often is, "How do you justify these kinds of refactorings?" I always point out that we were continuing to deliver features in the application, and this was something we would work on whenever we could. Also, these small steps led to some early wins. Because we provided generic URLs to the cameras and could send commands from Ruby, we could bulk upgrade the firmware on all installed cameras. Also, the ability to easily and quickly make changes on the Ruby side meant we could conduct experiments and find other marginal improvements. Previously, this required coordinated changes in both Java and Ruby.
I really want to emphasize the importance of looking for early wins. Non-technical stakeholders don't care that you refactor the code. I believe, in general, they assume you are a professional, and are doing your best to write maintainable code. That means you have to build up trust and credibility, and finding early wins is one way to do that.
We also found that the refactoring led to momentum in cleaning up the code. In the Camera::CameraWorker, there were three distinct aspects of authentication. First, authenticate an existing camera upon reconnect. Second, handle the creation and authentication of a new camera. The third aspect was to fail a "zombie" camera; one that was connected but not authenticated. By refactoring to a deeper model, the code became much easier to reason about, as shown in lines 8-14 in Figure 5.
Figure 5: Three distinct aspects of authentication
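For readers without the figure at hand, the following is a rough sketch of how three such aspects might read once each has its own intention-revealing name. It is not the code from Figure 5: the class is simplified to a plain CameraWorker, and the method names, the known_cameras collection, and the symbolic return values are all assumptions.

```ruby
# Hypothetical sketch of a worker that names the three aspects of
# authentication explicitly. None of these method names are from Nexia.
class CameraWorker
  # known_cameras: collection of identifiers for cameras already created
  def initialize(known_cameras)
    @known_cameras = known_cameras
  end

  def handle_connection(camera_id:, authenticated:)
    # A "zombie" is connected but not authenticated.
    return fail_zombie(camera_id) unless authenticated
    return authenticate_existing(camera_id) if @known_cameras.include?(camera_id)

    create_and_authenticate(camera_id)
  end

  private

  def authenticate_existing(camera_id)
    [:reauthenticated, camera_id]
  end

  def create_and_authenticate(camera_id)
    @known_cameras << camera_id
    [:created, camera_id]
  end

  def fail_zombie(camera_id)
    [:zombie_failed, camera_id]
  end
end

worker = CameraWorker.new(["cam-1"])
worker.handle_connection(camera_id: "cam-2", authenticated: true)
```

The point of the sketch is only that the top-level method reads as a statement of the three cases, with the details pushed down into named helpers.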
As we introduced more domain logic, the Ruby code started to use more of the ubiquitous language of Nexia. Instead of doing surgery on the camera object and setting lots of properties, we recognized a factory pattern would be more appropriate. The ubiquitous language included the concept of a heartbeat, the concept that the cameras connect to Nexia on a regular basis to say that they're still alive. We then created factory methods named update_from_heartbeat and create_from_heartbeat, to handle existing and new cameras, respectively.
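A minimal sketch of that idea is shown below. Only the two method names come from the text; the Heartbeat structure, its fields, and the attributes stored on the camera are assumptions chosen to keep the example small.

```ruby
# Hypothetical heartbeat payload; the real message fields are not
# described in the article.
Heartbeat = Struct.new(:mac_address, :firmware_version, :ip_address, keyword_init: true)

class Camera
  attr_reader :mac_address, :firmware_version, :ip_address, :last_seen_at

  # Factory for a camera that is not yet known.
  def self.create_from_heartbeat(heartbeat, now: Time.now)
    new(mac_address: heartbeat.mac_address).update_from_heartbeat(heartbeat, now: now)
  end

  def initialize(mac_address:)
    @mac_address = mac_address
  end

  # For an existing camera: the camera decides what a heartbeat means
  # for its own state, so callers no longer set properties one by one.
  def update_from_heartbeat(heartbeat, now: Time.now)
    @firmware_version = heartbeat.firmware_version
    @ip_address = heartbeat.ip_address
    @last_seen_at = now
    self
  end
end

beat = Heartbeat.new(mac_address: "00:11:22:33:44:55", firmware_version: "1.0", ip_address: "192.0.2.10")
camera = Camera.create_from_heartbeat(beat)
```

Callers such as a worker then state intent ("this camera sent a heartbeat") instead of performing surgery on individual attributes.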
The Java side also benefited from the refactoring. Compare the previous handleRequest() method, which is only partially shown in Figure 3, to the five lines in Figure 6. With just a few Extract Method refactorings, the functionality became much easier to understand.
Figure 6: New Java code sample (compare to Figure 3)
The camera class was huge, and that's something you'll always run into. A tip for dealing with that situation is to realize that you don't need to refactor the code to increase your level of clarity. In the presence of overwhelming clutter, simply rearranging code is an easy - but powerful - design technique. Start looking for patterns and put similar methods together. This will reduce the cognitive load when you're working on a huge area of code.
Refactoring to Deeper Insight
Approaching a big, messy codebase can be a lot like walking through a fog – you can't see everything around you, and may wonder if you're looking at a bunch of trees, or possibly a mountain. As you make little changes – reorganizing code, extracting methods – those marginal gains start to add up, and the fog starts to lift. Implementing the micro refactorings and patterns that Fowler and Kerievsky talk about creates a cumulative effect, which leads to a deeper insight about the model.
For example, on the Ruby side, we realized we were sending commands to the camera. So we followed Kerievsky's advice and introduced the command pattern. This dramatically simplified things. We set up a base class for commands with a standard execute() method. Then we created a camera/command folder and started writing each command, implementing that base class. Along with that, we introduced a feature toggle. This allowed the old code to continue to execute until the corresponding command had been converted. I highly recommend using feature toggles to help you refactor safely in this type of situation.
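The combination of a command base class and a per-command toggle might look something like the sketch below. It assumes a Camera::Command namespace (suggested by the camera/command folder), a hypothetical PanTilt command, a very simple in-memory FeatureToggles class, and string return values standing in for real side effects; none of these are taken from the Nexia codebase.

```ruby
class Camera
  module Command
    # Base class: every command exposes the same execute method.
    class Base
      def initialize(camera_id)
        @camera_id = camera_id
      end

      def execute
        raise NotImplementedError, "#{self.class} must implement #execute"
      end
    end

    # One converted command (hypothetical).
    class PanTilt < Base
      def initialize(camera_id, direction)
        super(camera_id)
        @direction = direction
      end

      def execute
        "new:pan_tilt:#{@camera_id}:#{@direction}"
      end
    end
  end
end

# Minimal in-memory toggle store, assumed for this example.
class FeatureToggles
  def initialize(enabled = [])
    @enabled = enabled
  end

  def enabled?(name)
    @enabled.include?(name)
  end
end

# Stand-in for the pre-refactoring code path.
def legacy_pan_tilt(camera_id, direction)
  "legacy:pan_tilt:#{camera_id}:#{direction}"
end

# The toggle decides, per command, whether the new object is used.
def pan_tilt(toggles, camera_id, direction)
  if toggles.enabled?(:pan_tilt_command)
    Camera::Command::PanTilt.new(camera_id, direction).execute
  else
    legacy_pan_tilt(camera_id, direction)
  end
end

pan_tilt(FeatureToggles.new([:pan_tilt_command]), 1, "left")
```

Keeping the old path callable behind the toggle means each command can be converted, shipped, and switched back independently if something goes wrong.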
Another approach I recommend is to perform a phased rollout. We wanted to avoid a situation where every camera suddenly disconnected, and most products with an existing customer base will probably have a similar desire to not impact the entire audience. In the first month, we deployed only to internal Nexia IP addresses, allowing QA, developers and support personnel to "dog food" the new system before any customer. The second month expanded the deployment to include select customers, but was still only using a single production server. It wasn't until the third month that we deployed to all cameras, for all customers, on all production servers. This went much smoother than other production rollouts I've been involved in during my career.
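The three phases described above can be expressed as a small gating check. The sketch below is an assumption-laden illustration: the PhasedRollout class, the phase names, the example networks, and the customer identifiers are all invented, and it does not model the single-production-server restriction of the second month.

```ruby
require "ipaddr"

# Hypothetical rollout gate: decides whether a connection should use
# the new code path, based on the current phase.
class PhasedRollout
  PHASES = [:internal, :select_customers, :everyone].freeze

  def initialize(phase:, internal_networks:, select_customer_ids:)
    raise ArgumentError, "unknown phase: #{phase.inspect}" unless PHASES.include?(phase)

    @phase = phase
    @internal_networks = internal_networks.map { |cidr| IPAddr.new(cidr) }
    @select_customer_ids = select_customer_ids
  end

  def enabled_for?(ip:, customer_id:)
    case @phase
    when :internal
      internal?(ip)
    when :select_customers
      internal?(ip) || @select_customer_ids.include?(customer_id)
    when :everyone
      true
    end
  end

  private

  def internal?(ip)
    address = IPAddr.new(ip)
    @internal_networks.any? { |network| network.include?(address) }
  end
end

rollout = PhasedRollout.new(
  phase: :internal,
  internal_networks: ["10.0.0.0/8"],  # example range, not Nexia's
  select_customer_ids: [1001]         # example customer, not real data
)
rollout.enabled_for?(ip: "10.1.2.3", customer_id: 5)
```

Moving from one month to the next is then a one-line configuration change rather than a code change, which is part of what makes the decision easy to defer or reverse.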
Another benefit of feature toggles and phased rollouts is that they create options, and options have value. I recommend the graphic business novel Commitment which covers the concept of Real Options. Usually, when someone in a meeting says, "we have to make a decision", it seems the only two choices are to either make a decision, with limited knowledge, or to not make a decision. Real options say there's a third option, to strategically decide to defer the decision until we have better understanding. Both feature toggles and phased rollouts allow you to defer some decisions until you can make a more informed decision, and that is key.
Looking Back
Looking back at where we started, there are some big wins that we achieved. Before, adding a new camera took weeks or even months. Now, we can add a new camera in a few hours. Instead of having to make changes in both Java and Ruby, and keep the code in sync, we now only need to make changes in Ruby. While the old code was inconsistent in where certain aspects belonged, the new code was far more cohesive, and easier to reason about, within clear contexts. The gnarly dependency between Camera Manager and Device Manager was removed, so we no longer were prevented from updating Java.
I have a few general refactoring tips based on this experience. Instead of picking just one idea of how to implement a change, such as naming something new, experiment with at least three language and/or model options. Also, in your day-to-day work, embrace marginal gains. We tend to dramatically overestimate the power of big changes, and underestimate the power of small, cumulative changes.
About the Author
Paul Rayner is a developer, coach, mentor, trainer, and popular international conference speaker. With over 25 years of hands-on software development experience in a variety of industries, Rayner is a seasoned software design coach and leadership mentor, helping teams ignite their design skills. His consulting company Virtual Genius LLC provides training and coaching in software design for agile teams. Rayner is from Perth, Australia, but lives, works and plays in Denver, Colorado, with his wife and two children. He tweets with an Australian accent at @ThePaulRayner and blogs at thepaulrayner.com.