The Problem with AI

Sep 13, 2017 17 min read

Key Takeaways

Machine learning in turn is frequently fueled by big data but can also be fueled by traditional data sources.
No matter what the scope is, we have to select data that is appropriate to the domain of the problem space.
Information from highly diverse sources needs to be parsed, curated, packaged, contextualized, and componentized for consumption by users or ingested by systems.
While machine algorithms play an important role in both the preparation of data and interpretation of user intent, these types of applications require a significant amount of knowledge engineering to be successful.
Thinking about data as a service and the platform as an orchestration layer between business problems and technology solutions can help organizations achieve dramatic improvement in data scientists' productivity.

This article first appeared in IEEE IT Professional magazine. IEEE IT Professional offers solid, peer-reviewed information about today's strategic technology issues. To meet the challenges of running reliable, flexible enterprises, IT managers and technical leads rely on IT Pro for state-of-the-art solutions.

Science is actually pretty messy. When I was a chemistry undergraduate, I loved the theory behind biochemistry: the endless complexity allowed by simple rules, and how massive, complex cellular machines could arise from a few building blocks. In lab, however, I struggled to make the simplest reactions work. Starting with pure crystalline compounds and expensive laboratory equipment, when the result was also expected to be crystalline, I ended up with piles of brown goo, with my instructor concluding, "Well, it could be in there" in reference to the experiment's objective.

Data science is also very messy. Frequently the starting point is the data equivalent of brown goo (messy, poor-quality, inconsistent data), with the expectation that pure crystalline results will come out the other end: the next best action, personalized marketing campaigns, highly effective custom email campaigns, or a cross-department, cross-functional, 360-degree understanding of customers and their needs.

Artificial intelligence (AI), though broadly applied these days to mean almost any algorithm, is primarily dependent on some form of machine learning. Machine learning in turn is frequently fueled by what is called big data (high-velocity, high-volume, highly variable data sources) but can also be fueled by traditional data sources.

Variable Does Not Mean Poor Quality

There is a common misconception that "variable" data can mean "messy" data and that "messy" data can mean "poor-quality" data. Simply put, variable does not mean messy, and messy does not mean poor quality. Variable data is data that has different formats and structures. To use it, we need to understand how the different types of data can be used as signals to achieve a result. Twitter data is very different from transactional data, yet the two together can provide insights about how social trends impact sales. Messy data can have missing values or can be in formats that are difficult to ingest and process. The data can be very good but still require work to get it into a format for processing.

A recent article in Sloan Management Review stated that

Organizations can now load all of the data and let the data itself point the direction and tell the story. Unnecessary or redundant data can be culled … [This process is] often referred to … as 'load and go'¹.

While conceptually accurate, there is much left open to misinterpretation. "All the data" needs to be defined. Does it mean all product data, social media data, accounting data, transactional data, knowledge base data? Clearly "all" is an overgeneralization. And this approach has its drawbacks. Sandy Pentland, MIT professor, remarked at the recent MIT CIO Symposium that "Putting all of your data in a data lake makes it convenient for hackers to go to one place to steal all of your data".

No matter what the scope is, we have to select data that is appropriate to the domain of the problem space. The data needs to be in a consistent format. It cannot contain incorrect values. If the data is incorrect or missing, then the algorithm cannot function correctly unless we are making accommodations for those issues. "It's an absolute myth that you can send an algorithm over raw data and have insights pop up", according to Jeffrey Heer, a professor of computer science at the University of Washington, as quoted in the New York Times².
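The accommodations described above can be made concrete with a small preparation step. The following is a minimal sketch in Python, not a production pipeline: the records, field names, and the two date formats are hypothetical and chosen only for illustration. It normalizes dates to one format and separates usable records from those with missing or impossible values, keeping the reason for each rejection.

```python
from datetime import datetime

# Hypothetical raw records: mixed date formats, a missing amount,
# and an impossible (negative) amount.
RAW_ROWS = [
    {"customer_id": "C001", "order_date": "2017-03-01", "amount": "19.99"},
    {"customer_id": "C002", "order_date": "03/02/2017", "amount": ""},
    {"customer_id": "C003", "order_date": "2017-03-04", "amount": "-5.00"},
]

# Assumed set of date formats present in the sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(row):
    """Normalize one record and list the problems found in it."""
    problems = []
    date = parse_date(row["order_date"])
    if date is None:
        problems.append("unparseable order_date")
    amount = None
    if row["amount"].strip() == "":
        problems.append("missing amount")
    else:
        amount = float(row["amount"])
        if amount < 0:
            problems.append("negative amount")
    cleaned = {
        "customer_id": row["customer_id"],
        "order_date": date,
        "amount": amount,
    }
    return cleaned, problems


def prepare(rows):
    """Split rows into usable records and rejected records with reasons."""
    usable, rejected = [], []
    for row in rows:
        cleaned, problems = clean_row(row)
        if problems:
            rejected.append((cleaned, problems))
        else:
            usable.append(cleaned)
    return usable, rejected


usable, rejected = prepare(RAW_ROWS)
```

Keeping the rejected records alongside their reasons, rather than silently dropping them, is one way to make the "data janitor" work visible and repeatable.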

Technology writer Rick Delgado notes that "many data scientists jokingly refer to themselves as data janitors, with a lot of time spent getting rid of the bad data so that they can finally get around to utilizing the good data. After all, bad data can alter results, leading to incorrect and inaccurate insights"³.

In a recent conversation I had with Laks Srinivasan, chief operating officer (COO) of Opera Solutions, he asserted that "80 percent of the work the data scientists are doing is data cleaning, linking, and organizing, which is an information architecture (IA) task, not a data scientist function".

Opera, founded in 2004, was one of the firms that tied for first prize in a Netflix contest that was offering US$1 million to the company that could beat its recommendation engine by 10 percent or more. (The three-year contest, which ended in August 2009, awarded the prize to a team from AT&T Labs, which submitted its response just minutes before Opera.) Opera is an example of a company that developed a platform to help data scientists in many aspects of analysis, feature engineering, modeling, data preparation, and algorithm operationalization.

A Range of AI Applications

AI applications exist along a spectrum. At one end lies embedded AI, which is transparent to the user but makes applications work better and easier for them. Spelling correction is an example that people take for granted. Machine translation is another. Search engines use machine learning and AI, as does speech recognition, which has made enormous progress in recent years.

At the other end of the spectrum are the applications that require deep data science and algorithm development expertise. The people who develop these applications are technical experts with deep mathematical and data science knowledge. They devise and tune the algorithms that provide advanced functionality.

Along the continuum are the platforms and development environments that make use of the tools (many of which are open source). These applications require various levels of configuration and integration to provide capabilities.

Types of Cognitive Computing

For example, consider a type of "cognitive computing" application. Cognitive computing is a class of application that helps humans interface with computers in a more streamlined, natural way. Such applications are also capable of processing information in a less traditionally structured manner to provide a range of answers, with probabilities based on the user's context and details about the data sources.

One type of cognitive computing application is the processing of large amounts of patient observational data and providing a "second opinion" about a diagnosis. Physicians are using this approach to augment their knowledge and experience when developing treatment regimens. Another type is creation of an intelligent virtual assistant (IVA) that retrieves answers to procedural questions rather than lists of documents. IVA functionality requires various mechanisms that are powered by machine learning. The first is speech recognition, which translates spoken language into text. The next is a mechanism for deriving intent from the user query or utterance. Intent can be based on training sets and examples of phrase variations, or it can be from parsing language to derive meaning.

The Role of Machine Learning

Each of these approaches leverages machine learning. Some dialog management approaches can use mechanisms akin to language translation. Given enough questions and enough answers, a machine learning algorithm can "translate" questions into the correct responses. When the intent is derived via natural language understanding or training set classification, a response can be retrieved from a corpus of content via a ranking algorithm that uses signals generated through determining the intent of the user as well as additional metadata that can inform the user's context: anything from purchased products, to configured applications, to demographic or social media data.
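To illustrate deriving intent from a training set of phrase variations, here is a minimal Python sketch. It is not a trained classifier: simple token overlap (Jaccard similarity) stands in for the machine learning model, and the intent names and example phrases are hypothetical. It shows only the shape of the step: score an utterance against the examples for each intent, then rank the intents.

```python
import re

# Hypothetical training set: each intent with example phrase variations.
TRAINING_PHRASES = {
    "pair_device": [
        "how do I pair my tracker with my phone",
        "connect tracker to phone over bluetooth",
    ],
    "reset_device": [
        "how do I reset my tracker",
        "tracker is frozen and will not restart",
    ],
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of tokens the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_intents(utterance):
    """Return (intent, score) pairs, best match first."""
    tokens = tokenize(utterance)
    scores = []
    for intent, phrases in TRAINING_PHRASES.items():
        # An intent scores as well as its closest example phrase.
        best = max(jaccard(tokens, tokenize(p)) for p in phrases)
        scores.append((intent, best))
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranking = rank_intents("my tracker will not pair with my phone")
```

With these examples, `ranking[0][0]` is `"pair_device"`. In a real system the score would come from a trained model, and context metadata such as purchased products could be added as further ranking signals.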

Inference can use relationships mapped in an ontology, for example, products associated with a particular solution or steps to troubleshoot a specific device configuration. Some of this knowledge is inferred from the data and some is intentionally structured, which is the knowledge engineering approach to AI.

Contextualizing Endless Knowledge Sources

Organizations have enormous repositories of knowledge in the form of processes, procedures, manufacturing techniques, research methodologies, embedded designs, programming code, configured applications, technical documentation, knowledge bases of various kinds, engineering libraries, expert systems, traditional libraries, technical publications, and scientific, engineering, and trade journals. The list of explicit knowledge sources is endless. Historically, humans have always limited the scope of the information that they consume, for example, by picking up a book on a topic, searching for a specific area in a library, pursuing a specialized library, or seeking out a particular journal. Even in our digital age, engineers will go to engineering sites for nuanced, specialized information. Scientists will go to scientific sites, and so on.

Information from highly diverse sources cannot be processed as raw data inputs for any purpose without restriction. It needs to be parsed, curated, packaged, contextualized, and componentized for consumption by users or ingested by systems for application to a limited number of scenarios. As powerful as it was, the Jeopardy-playing Watson program required specific information sources to function correctly.

Can Curation Be Automated?

Machines can help when given the correct scaffolding and representative training sets. Data and content sources can be processed by machine algorithms, overlaying the structure and identifying patterns in the information to assist in componentization and contextualization. The process is iterative and requires human judgment and inputs to fine-tune results. Those results might be the componentized information containing specific answers to questions rather than large amounts of text. When the content is fine-tuned and componentized, the specific answers can be more readily retrieved. A user looking for an answer does not want a list of documents, but the answer to the question. Bots and intelligent virtual assistants are designed to respond with an answer or a short list of suggestions presented in the correct context (the user's query or intent). Auto-tagging and autoclassification machine learning algorithms can apply the correct metadata to content to allow for those contextualized results.

The Role of Ontologies

Ontologies are the containers of metadata: the knowledge scaffolds or structures that can be abstracted from systems of knowledge and applied to other bodies of information for organization and contextualization. The ontology can capture the relationships between knowledge elements and ways of organizing those elements, for example, the list of user intents with corresponding actions. A taxonomy of products can be related to a taxonomy of solutions composed of those products. Or a list of problem types can be associated with corresponding troubleshooting approaches.

Tools such as virtual assistants become channels for knowledge structured with an ontology, along with rules and contexts that apply to specific problem sets. Take, for example, the task of servicing a customer who is trying to set up and operate a new fitness tracker. Instead of searching on the website or calling the help desk, the customer might try typing a question into the company's support chat bot. The bot interprets the natural language question as an intent, and the ontology allows retrieval of the correct responses from a knowledge repository. The ontology manages intents and responses as well as terminology and phrase variations for algorithm training.

The advantage of a natural language question over a search is that it becomes easier to derive the user's intent when they ask a fully formed question rather than typing a few ambiguous keywords. A bot can also be programmed to further disambiguate intent by requesting more detail from the user. This type of natural interface can also be used to access corporate information sources, for example, to run a financial analysis or retrieve underwriting procedures.
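The fitness tracker scenario can be sketched as a small lookup structure. The Python below is an illustrative assumption, not a description of any particular product: the intent names, product identifiers, and answer texts are all hypothetical. It shows an ontology fragment that relates intents to actions and products to answers, and a bot that asks a follow-up question when the intent alone is not enough to pick a response.

```python
# Hypothetical ontology fragment for a fitness-tracker support bot.
ONTOLOGY = {
    # Each intent maps to an action and says whether the product must be known.
    "intents": {
        "pair_device": {"action": "show_pairing_steps", "product_required": True},
        "check_warranty": {"action": "show_warranty_policy", "product_required": False},
    },
    # Responses keyed by (action, product); None means "applies to all products".
    "responses": {
        ("show_pairing_steps", "tracker_basic"):
            "Hold the side button for 3 seconds, then open the app.",
        ("show_pairing_steps", "tracker_pro"):
            "Open Settings > Bluetooth on the tracker, then open the app.",
        ("show_warranty_policy", None):
            "All trackers carry a one-year limited warranty.",
    },
}


def respond(intent, product=None):
    """Return an answer, or a clarifying question when context is missing."""
    entry = ONTOLOGY["intents"].get(intent)
    if entry is None:
        return "Sorry, I did not understand. Could you rephrase the question?"
    if entry["product_required"] and product is None:
        # Disambiguate by requesting more detail from the user.
        return "Which tracker model do you have?"
    key = (entry["action"], product if entry["product_required"] else None)
    return ONTOLOGY["responses"].get(key, "I could not find an answer for that model.")
```

Calling `respond("pair_device")` returns the clarifying question, while `respond("pair_device", "tracker_pro")` returns the model-specific steps. Because the relationships live in data rather than in code, new products or intents can be added by editing the ontology alone.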

Maturing Algorithms Still Necessitate Data Clean-Up

While machine algorithms play an important role in both the preparation of data and interpretation of user intent, these types of applications require a significant amount of knowledge engineering to be successful.

As machine learning algorithms mature, the heavy lifting will become more invisible and behind the scenes, and data or content preparation as well as application tuning and configuration will constitute the bulk of the work and require the greatest effort. With data scientists increasingly in short supply, business users will need to perform more analysis so that a backlog does not develop behind scarce data science resources. Data preparation is a major challenge, and operationalizing capabilities is an even bigger one. This is because knowledge of deep analysis approaches is becoming lost in translation from the laboratory environment to the operational environment. Given that detailed machine learning approaches are less accessible to business people, there is increasingly a gulf between the business world and the IT world. However, two trends are in play. Sophisticated tools are becoming more commoditized, while more advanced capabilities are being made available to business people through platform approaches. The key component of data preparation, data operationalization, and translation between business challenges and analytical tools is the semantic layer: the glossaries, thesaurus structures, metadata standards, data architectures, and quality mechanisms.

As the tools get more mature, organizations will get value from them only if they take control of the things that will not be commoditized by the marketplace: their data, content, processes, and semantic translation layers. For example, organizations will not get a competitive advantage by building speech recognition. That problem has been solved (for the most part; it is still improving, but building the algorithms from scratch would not have business value). They will, however, gain a competitive advantage from servicing their customers uniquely with a speech recognition agent that accesses the knowledge they have about their customers and serves up the products and content they need.

Rethinking High-Power Analytics

As demand is exploding for big data analytics, data scientists are increasingly in short supply. When a company is building predictive models or machine learning models, a few factors stand out.

The first is that every journey starts out with raw data, so if a company is doing multiple projects for the same client and the same department, multiple teams start with the same raw data, which can be inefficient. The second factor is that so much of the work data scientists are doing is data cleaning, linking, and organizing, which, as Srinivasan mentioned, is an IA task, not a data scientist function.

The third factor is that even after the data is cleaned up and models are developed that accurately predict (for example) who is likely to buy a certain product, it takes a lot of time to go from the data science sandbox to actually operationalizing the analytics that create a business impact.

This disconnect occurs because the development environment and the production environment are very different. As Srinivasan explains, "The data scientists might build a model using SAS in the sandbox and using certain datasets, but the IT department needs to re-code the variables and models in Java or optimize R code to scale in Hadoop when the application goes into production. At this point, the data is also very different because it goes beyond the test datasets, so the data scientists have to retest it against the model.
Finally, even when the projects are in production, all these insights and know-how [are] fragmented into documents or code or people’s heads. As staff turns over, knowledge is lost".

Re-Imagining the Analytics Lifecycle

When Opera began considering how to address these issues, it came up with the approach of fundamentally re-imagining the analytic development lifecycle by developing a "semantic layer" between the data layer (raw data) and the use case, application, and UI layer. The thought was that the company could preprocess the data to a point, independent of its future use, and then apply AI and machine learning in converting big data to small data. By putting a semantic layer around analytic models and tools, all the users can find them once the semantic layer is operationalized.

According to Srinivasan, "By making data independent of use cases and operationalizing it, and then making it machine-learning-driven and AI-driven, the signals learn about the data. The system becomes a learning system, not a static, one-time data modeling system. It becomes a continuous feedback, loop-based, living, breathing kind of a central nervous system, in the enterprise".

In other words, the semantic layer acts as a way to translate business problems into the inputs needed to query a big dataset. The technical predictive algorithms operate under the covers, and this complexity is hidden from the user. The algorithms simply have to point to the big data sources (that are correctly cleansed and prepared, of course) and then provide their parameters as inputs to predict their outcomes, run simulations, segment audiences, customize campaigns, and so on.
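One way to picture this translation is as a glossary that maps business vocabulary to fields in a prepared dataset. The Python sketch below is a simplified assumption and does not describe how Signal Hub or any other platform works: the business terms, table name, column names, and thresholds are hypothetical. It turns a list of business terms into a parameterized query that a downstream tool could run.

```python
# Hypothetical semantic layer: business vocabulary mapped to physical fields.
SEMANTIC_LAYER = {
    "high-value customers": {
        "table": "customer_signals",
        "filter": ("lifetime_value", ">=", 1000),
    },
    "likely to churn": {
        "table": "customer_signals",
        "filter": ("churn_score", ">=", 0.8),
    },
}


def translate(business_terms):
    """Turn business terms into a parameterized query and its parameters."""
    tables = set()
    clauses, params = [], []
    for term in business_terms:
        # A KeyError here signals a term that is missing from the glossary.
        entry = SEMANTIC_LAYER[term]
        tables.add(entry["table"])
        column, op, value = entry["filter"]
        clauses.append(f"{column} {op} ?")
        params.append(value)
    if len(tables) != 1:
        raise ValueError("this sketch only handles terms that map to a single table")
    sql = f"SELECT customer_id FROM {tables.pop()} WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = translate(["high-value customers", "likely to churn"])
```

The business user supplies only the terms; the column names and thresholds stay in the glossary, where they can be governed and reused across projects.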

Developing an Orchestration Layer

In the case of Opera, the company went on to build a platform from the ground up to create and manage the signal layer, and ran mission-critical applications on it. The platform, called Signal Hub, processes data from about 500 million consumers for global blue-chip clients across industries. This approach allowed Opera to essentially outsource the data science work, operate on its platform, and sell solutions to business buyers. When Opera developed and then productized the platform as an orchestration layer in 2013, many organizations did not have the IT or data science resources to fully exploit the power of advanced tools. The market has matured since then, and that strategy (to productize as an end-to-end AI and machine learning enterprise platform by hardening with security, scalability, and governance capabilities) provides valuable lessons for organizations building data-driven solutions.

Thinking about data as a service and the platform as an orchestration layer between business problems and technology solutions can help organizations achieve dramatic improvement in data scientists' productivity, and in the productivity of business analysts and business intelligence workers. "The maturing of technologies and emergence of platforms is democratizing insights derived through machine learning and capabilities provided by AI in a way that we say makes ordinary people extraordinary", says Srinivasan. "If all the insights and expertise are buried in a small team within a company, it doesn't really leverage the value of AI tools to be used by an average call center rep".

The concepts of data as a service and platforms as an orchestration layer have far-reaching implications for the future of AI-driven enterprises. Not only can data be more fully exploited by this paradigm, but so can knowledge and content, the raw material on which cognitive applications are being developed. According to Henry Truong, CTO of TeleTech, a $1.4 billion call center services firm, "Organizations can normalize knowledge in the same way that they normalize data: through componentizing knowledge into the building blocks that provide solutions to problems. The knowledge ontology becomes the data source to orchestrate more and more process actions, that, in our case, prevents service disruptions". This approach is beginning to be exploited in ways that allow for interoperability between platforms that are exposing functionality through a services layer. Those "normalized knowledge bases" are powering chat bots that are driving the next-generation digital worker.

Leveraging Platforms and Orchestration Layers

Many organizations are attempting to build their own platforms and believe this is required to create a competitive advantage from machine learning and AI capabilities. The key decision point is whether the platform is the differentiator or whether it is the data and orchestration layer that will be the differentiator. "I frequently hear CIOs say they have a platform or that they are building machine learning. The problem is that it is easy to go through $100 million or more, and a lot of pain and suffering. I say, 'Do not try this at home' in my presentations and hope they take it to heart", cautions Srinivasan.

A core premise for success with advanced analytics is that organizations need to build metadata structures and ontologies to define relationships among data elements relevant to their companies. Srinivasan continues: "That is the investment that organizations should be making rather than building their own platforms. They should be building their own representation of the core of the business, the soul of the business, which is the ontology that can embody all that knowledge of processes and customers. Insights can then be fed back into the ontology, so it becomes that living, breathing thing. It is a semantic layer that evolves around that".

Most of the work that data scientists do is "data janitorial" work, as opposed to science work, and there is a gulf between prototype and sandbox, and innovation and production. In addition, having pockets of knowledge and expertise throughout the enterprise, which may be gone when an employee leaves, poses a problem when the knowledge is not institutionalized or captured in a system. Organizations are best off if they focus on understanding their own data, focus on the business problems they are trying to solve, and build the semantic layers that can allow for data portability across various platforms. This lets them take advantage of best-of-breed solutions and not become locked into a particular vendor that does not abstract the business problem, analytic, data, and platform layers required to operationalize the fast-evolving advanced machine learning analytic and AI technologies.

References

1. R. Bean, "How Big Data Is Empowering AI and Machine Learning at Scale", MIT Sloan Management Rev., 8 May 2017.
2. S. Lohr, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights", New York Times, 17 Aug. 2014.
3. R. Delgado, "Why Your Data Scientist Isn't Being More Inventive", Dataconomy, 15 Mar. 2016.

About the Author

Seth Earley is CEO of Earley Information Science. He's an expert in knowledge processes, enterprise data architecture, and customer experience management strategies. His interests include customer experience analytics, knowledge management, structured and unstructured data systems and strategy, and machine learning.
