Agility, Big Data, and Analytics
Agile and Lean techniques seem to be the best way we currently know to create complex software in the face of risk, uncertainty, and changing requirements. Agile hinges on embracing and adapting to change by enabling rapid feedback cycles and evolutionary development. However, bringing agility into big data (and small data) analytics has been a challenge for many very bright and talented data scientists and engineers. In this article we’ll explore what makes analytics different from application development, and how to adapt agile principles and practices to the nuances of analytics. We’ll also examine how the disciplines of data science and software development complement one another, and how they intersect in an agile project environment.
The data scientist, the software developer, and the data engineer
First let’s look at what differentiates analytics experts from software developers. C.F. Jeff Wu first introduced the term “data science” in 1998 as a discipline that encompasses statistical analysis, science, and advanced computing. The use of analytics by social media companies like LinkedIn, Facebook, and others in recent years has boosted the popularity of “data scientist” such that Harvard Business Review published an October 2012 article entitled “Data Scientist: The Sexiest Job of the 21st Century.” Simply put, a data scientist has a unique, and very deep, blend of the skills depicted in Figure 1.
Figure 1: The Disciplines of Data Science, Source: Calvin Andrus, Wikipedia
Data science skills are both complementary to, and overlapping with, software development skills. Data science requires programming, but data scientists are not often trained in modern software engineering practices. Conversely, many developers have skills in data engineering, advanced computing, and statistics, but these are not commonly their areas of deep expertise. Data scientists commonly code in multi-paradigm languages like R and Python, which have powerful statistics libraries and an active research community behind them.
Data engineering is the bridge between data science and software development. A data engineer supports the data scientist in data discovery, harvesting, and preparation. Data engineers support developers in operationalizing analytical models for production deployment, which we will discuss shortly. This role requires expertise in data management technologies (“big data”, NoSQL, and SQL), data modeling, data architectures, and data manipulation languages and techniques.
Analytics vs. App Dev
It’s worth noting that the nature of advanced analytics is fundamentally different from that of application development. While app dev is complicated by both technical complexity and changing requirements, the emphasis is on implementing chunks of functionality based on business requirements. Conversely, analytics is complicated by uncertainty about what, if any, actionable or insightful knowledge is hiding within the data.
For example, consider the goal of increasing customer footfall, the number of customers who enter a store. If we already know that special coffee deals on “Wake Up Wednesdays” increase footfall by coffee-buying customers, then software developers can implement functionality to enable registration for the “Coffee Rewards” program, and to gently remind customers about upcoming Wednesday events. The complexity is in providing an effective customer experience and applying the right logic to focus on the right customers.
However, advanced analytics are a likely precursor to marketing strategies such as the “Coffee Rewards” and “Wake Up Wednesday” programs. Marketing may lead the idea with a statement: “If we knew what buying behaviors highly profitable shoppers have in common, we could create a marketing program to increase the footfall by all shoppers who purchase these items.” This business goal might lead the analytics team to conduct customer profitability analysis to identify the most profitable customers, followed by a market basket analysis to identify which items these customers tend to buy in common, which in our example must have included coffee.
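The two-step analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy transactions, the function names `top_customers` and `pair_support`, and the choice of simple pair support as the basket measure are all hypothetical, not the method any particular team used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (customer_id, profit, items in the basket).
transactions = [
    ("c1", 40.0, {"coffee", "bagel", "milk"}),
    ("c2", 35.0, {"coffee", "bagel"}),
    ("c3", 5.0, {"gum"}),
    ("c1", 30.0, {"coffee", "milk"}),
    ("c4", 2.0, {"water", "gum"}),
]

def top_customers(rows, n):
    """Step 1: rank customers by total profit and return the n best ids."""
    totals = Counter()
    for customer, profit, _ in rows:
        totals[customer] += profit
    return {customer for customer, _ in totals.most_common(n)}

def pair_support(rows, customers):
    """Step 2: share of the selected customers' baskets containing each item pair."""
    baskets = [items for customer, _, items in rows if customer in customers]
    if not baskets:
        return {}
    counts = Counter()
    for items in baskets:
        # Sort so that each pair has one canonical (alphabetical) form.
        counts.update(combinations(sorted(items), 2))
    return {pair: count / len(baskets) for pair, count in counts.items()}

best = top_customers(transactions, 2)
support = pair_support(transactions, best)
```

Nothing guarantees that any pair will stand out. With real data the resulting table may show no strong pattern at all, which is exactly the uncertainty that separates this work from feature development.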
At the outset of this scenario it is unknown which, if any, products were regularly purchased in common by the most profitable customers. There is no assurance that a meaningful discovery will be made with respect to the marketing goal. Moreover, there is no assurance that a discovery will be particularly actionable or insightful. While application developers focus on designing the best solution for a particular business goal, data scientists focus on empirically determining whether the data exists to support the business goal. Both disciplines are complex, but in very different ways, and they require very different skills and experience.
Conventional analytics, while typically motivated by general business questions and hypotheses, is indirectly driven by business value. Many analytics endeavors begin with business questions like, “How many nights per month do frequent travelers stay in our hotels?” Such questions are not always tied to actionable or impactful intentions. It is easy for these types of analyses to provide interesting insights that are not necessarily actionable. Since interesting analytical discoveries beget more questions, it is also easy for such analyses to expand into months of analysis producing possibly interesting findings, but with little or no actionable outcomes.
Agile analytics is specifically focused on the frequent delivery of impactful, actionable results. The goal is to make a small discovery, get it in the hands of business decision makers, and evaluate the usefulness of the results. Data scientists typically use historical data to build predictive models. For example, a customer churn index might use Bayesian statistics to predict the likelihood that a customer will switch to a competitor. Agile analytics calls for a clear business purpose such as, “If we knew which customers are likely to churn, we could intervene with incentives to encourage them to stay.” This enables the agile data scientist to measure when a model is good enough, and whether the business actions are producing the expected results.
Minimizing Initial Investment
Conventional analytics calls for building the best predictive model possible, which may take an extended length of time, and may not reap the expected benefits. Agile analytics calls for developing an embryonic, but effective, model quickly; deploying it for business usage; and then iteratively replacing it with more refined/mature models. Agile analytics also calls for creating a measured feedback loop to monitor the impact of model usage. This approach leads to a much smaller initial investment before the first insights are put into action. Ongoing investments are based on the impact of these early actions and on the evolution of better analytical discoveries.
For our customer churn example, the data scientist may quickly build a model using linear regression. This model may accurately predict churn only 58% of the time, but it gives the business a better than 50/50 means of focusing intervention. While the business hones its churn intervention approach using the preliminary model, the data scientists work to refine and improve the model using a blend of other algorithmic techniques. They also monitor the actions of the business to see whether churn is being reduced. This cycle continues as long as model accuracy is improving and interventions are helping to reduce churn.
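One way to picture the decision to deploy an early model and later replace it is a small accuracy check against whatever is currently in use. The sketch below is hypothetical: the function names, the toy predictions, and the `min_gain` threshold of two percentage points are assumptions chosen for illustration, and a real team would pick its own measure and threshold.

```python
def accuracy(predictions, actuals):
    """Share of predictions that match the observed outcome."""
    if len(predictions) != len(actuals) or not actuals:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == a for p, a in zip(predictions, actuals))
    return hits / len(actuals)

def should_replace(current_accuracy, candidate_accuracy, min_gain=0.02):
    """Deploy the candidate only if it beats the current model by min_gain.

    min_gain is an assumed threshold, not a recommended value.
    """
    return candidate_accuracy - current_accuracy >= min_gain

# Hypothetical held-out outcomes (1 = churned) and an early model's predictions.
actuals = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
early_predictions = [1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

coin_flip = 0.50  # the "50/50" starting point before any model exists
early_accuracy = accuracy(early_predictions, actuals)
deploy_early_model = should_replace(coin_flip, early_accuracy)
```

The same check can be repeated each time the lab produces a refined model, with the deployed model's accuracy taking the place of `coin_flip`.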
Our goal is to operationalize analytical discoveries routinely and to measure their impact. Operationalizing analytics is the process of deploying an analytical model against live, production data. Credit fraud detection is a familiar example of this. The fraud scoring model is built and validated against known, historical data and then operationalized to evaluate new transactions in real time.
Not all analytics warrant operationalizing. Sometimes analytical models are built and validated purely for singular discoveries. Strategic decisions are based on these discoveries rather than deploying the model for operational analysis.
Lab and Factory Model
The data science “laboratory” is where experimentation and discovery occur, while the “factory” is where analytical discoveries are put into production. Lab experiments are fraught with uncertainty. We don’t know whether or not there are insights buried in the data. However, once insights are uncovered, the factory is where those insights are coded, tested, and deployed into production. Data scientist Will High describes these in detail in his two-part series, Discovery in the Data Lab and Deployment in Production. Agility is applied in each of these phases.
Agility in the Lab
Agile analytics focuses on the frequent creation of actionable, impactful insight. So, the business goal, “Identify high value customers who are about to leave” might trigger the question, “What action would you take if you knew a customer was about to leave?” By focusing on the intended business action, the agile data scientist can determine the first minimally sufficient analysis. For example, if the intended business action is a customer check-up phone call, then the first analysis might simply identify the customer segment whose profiles are similar to those of customers who have previously attrited. Later this might evolve into a more mature model that computes a higher fidelity attrition index, a score of the likelihood of attrition.
In some ways the analytics lab is like an experimental spike in agile software development. It’s highly exploratory and often early analytical models are disposable since they are for discovery purposes. Therefore, lab development is not generally test-driven and may be a bit more hacked than well-engineered. However, some of these models show enough promise to emerge from the lab and into the factory.
Agility in the Factory
When the agile data scientist recognizes these “keepers” it’s time to add tests and refactor toward a more sustainable design for production. By now the agile data scientist is pairing with a developer or data engineer. The analytics factory is principally concerned with how the analytical results are integrated into the broader solutions architecture. Incremental refinements to the analytical model may continue in the lab, but the factory is where the latest models are deployed against live, and possibly very large, data.
The best architectures support modular deployment and replacement of the models. This may involve a service such as a REST API to sequester the analytical model, enabling rapid replacement. In this way data scientists can continuously evolve and mature their models over time. This also enables models written in languages like Python to be integrated into applications developed in other languages like Java, Ruby, etc. Finally, some analytical models, such as regression models, may be parameterized so that it is easy to update the formula’s coefficients over time.
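The idea of a parameterized, replaceable model can be sketched as follows. Everything named here is hypothetical: the `ChurnScorer` class, the JSON layout, and the feature names `support_calls` and `months_inactive` are assumptions for illustration. In a real deployment the `score` method would typically sit behind a service endpoint such as a REST API, which this sketch leaves out.

```python
import json

class ChurnScorer:
    """Hypothetical wrapper that isolates a linear model from its callers."""

    def __init__(self, params_json):
        self.load(params_json)

    def load(self, params_json):
        """Swap in a new set of coefficients without changing calling code."""
        params = json.loads(params_json)
        self.version = params["version"]
        self.intercept = params["intercept"]
        self.coefficients = params["coefficients"]

    def score(self, features):
        """Linear combination: intercept + sum(coefficient * feature value)."""
        return self.intercept + sum(
            weight * features[name]
            for name, weight in self.coefficients.items()
        )

# First, embryonic model published by the lab (made-up numbers).
V1 = '{"version": "v1", "intercept": 0.1, "coefficients": {"support_calls": 0.05}}'
# A later refinement with an extra variable (made-up numbers).
V2 = ('{"version": "v2", "intercept": 0.0, '
      '"coefficients": {"support_calls": 0.04, "months_inactive": 0.1}}')

scorer = ChurnScorer(V1)
customer = {"support_calls": 4, "months_inactive": 2}
first_score = scorer.score(customer)

scorer.load(V2)  # the application code that calls score() is untouched
second_score = scorer.score(customer)
```

Because callers only see `score`, the lab can publish new coefficients, or a different model behind the same interface, without a change to the consuming application.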
Testing in Agile Analytics
Testing and test automation present a unique challenge in analytics. There are two aspects to testing: ensuring that the analytical code is written correctly, and ensuring that the model is both valid and accurate. The first of these is relatively straightforward. Tests can be reliably written for functionality such as loading a data file, deriving a new independent variable (data attribute), and ensuring data completeness, consistency, and correctness. These tests can be written in a test-first fashion by data scientists but, as mentioned earlier, many early models are disposable. Therefore, data scientists may not write tests until a model appears to have promise.
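The first kind of test might look like the following sketch. The CSV layout, the column names, and the helper functions `load_rows`, `add_spend_per_visit`, and `missing_ids` are hypothetical, chosen only to show one test each for loading a file, deriving a variable, and checking completeness.

```python
import csv
import io

def load_rows(handle):
    """Read CSV rows into dicts, converting the numeric fields."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append({
            "customer_id": raw["customer_id"],
            "visits": int(raw["visits"]),
            "spend": float(raw["spend"]),
        })
    return rows

def add_spend_per_visit(rows):
    """Derive a new variable; zero visits yields 0.0 instead of an error."""
    for row in rows:
        visits = row["visits"]
        row["spend_per_visit"] = row["spend"] / visits if visits else 0.0
    return rows

def missing_ids(rows):
    """Completeness check: positions of rows with an empty customer id."""
    return [i for i, row in enumerate(rows) if not row["customer_id"]]

# In-memory stand-in for a data file, so the tests need no fixtures on disk.
SAMPLE = "customer_id,visits,spend\nc1,4,20.0\nc2,0,0.0\n"

def test_load_rows():
    rows = load_rows(io.StringIO(SAMPLE))
    assert len(rows) == 2
    assert rows[0] == {"customer_id": "c1", "visits": 4, "spend": 20.0}

def test_spend_per_visit():
    rows = add_spend_per_visit(load_rows(io.StringIO(SAMPLE)))
    assert rows[0]["spend_per_visit"] == 5.0
    assert rows[1]["spend_per_visit"] == 0.0

def test_completeness():
    assert missing_ids(load_rows(io.StringIO(SAMPLE))) == []
```

Tests in this style are deterministic, so they fit naturally into an automated suite once a model is promoted out of the lab.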
The second aspect, validating the model itself, is more challenging. Analytical models are probabilistic by nature. This means that there will always be false positives and false negatives. For example, some customers who are predicted to attrite will never do so, and some who are not predicted to attrite will. Data scientists use the scientific method to validate analytical models and determine their accuracy, while business experts verify their utility. These model validation tests are not typically included in a continuous integration test suite, but the code correctness tests described above are.
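To make false positives and false negatives concrete, the sketch below counts them for a set of attrition predictions and derives two common summary measures. The function names and the toy data are assumptions for illustration; precision and recall are only two of several measures a data scientist might report.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives for boolean predictions."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["tp"] += 1   # predicted to attrite, and did
        elif p and not a:
            counts["fp"] += 1   # predicted to attrite, but stayed
        elif not p and a:
            counts["fn"] += 1   # not predicted to attrite, but left
        else:
            counts["tn"] += 1   # not predicted to attrite, and stayed
    return counts

def precision_recall(counts):
    """Precision: flagged customers who left. Recall: leavers who were flagged."""
    flagged = counts["tp"] + counts["fp"]
    left = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / left if left else 0.0
    return precision, recall

# Hypothetical predictions versus what customers actually did.
predicted = [True, True, False, False, True, False]
actual = [True, False, True, False, True, False]

counts = confusion_counts(predicted, actual)
precision, recall = precision_recall(counts)
```

Whether a given mix of false positives and false negatives is acceptable is a business judgment. A wasted check-up call and a lost customer rarely cost the same.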
There is a temptation to think of data science as an isolated precursor to application development. This isolation is akin to doing big design up front and is anathema to agile development. If your software solution includes an advanced analytics capability, then data science is best viewed as a role within a cross-functional agile team rather than an isolated specialty. When possible, it helps to enable the data scientists to work an iteration or two ahead of the delivery team. Data scientists collaborate with customers to verify that the model is insightful, actionable, and impactful. When that is confirmed, the model can move into the factory for deployment. However, be careful to ensure that the team does not perceive data science as a separate activity. Like evolutionary design, data science may run just a little ahead of development, but must involve collaboration to be effective.
About the Author
Ken Collier is the Director of Agile Analytics at ThoughtWorks, a global technology company that provides fresh thinking to solve some of the world's toughest problems. Under Ken’s leadership, ThoughtWorks is building deep expertise in Agile Analytics, especially focused on advanced analytics using a polyglot blend of relational and nonrelational (aka NoSQL) persistence. He incorporates the Agile Project Management concepts developed by Jim Highsmith and the Innovation Games™ developed by Luke Hohmann into many of his client engagements. He holds a Ph.D. in Computer Science Engineering from Arizona State University where he studied software engineering, database theory, and artificial intelligence/machine learning. His most recent book is Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing.