Machine Learning with Spark: Book Review and Interview
Machine learning is about making data-driven decisions or predictions about the future by building models from existing data. Machine learning has been getting a lot of attention in recent years because of its power to help organizations with business decision making.
Nick starts the discussion with the Spark programming model and its components, such as SparkContext and Resilient Distributed Datasets (RDDs). He also talks about how to write Spark programs in different programming languages such as Scala, Java, and Python.
He goes on to discuss how to build a recommendation engine with the Spark framework using content-based filtering and collaborative filtering techniques. Building classification, regression, and clustering models, as well as dimensionality reduction, with Spark is also covered.
Machine learning solutions are more effective when data processing and analytics can be performed on real-time data, not just static data sets. This is what's discussed in the last chapter of the book. Topics covered include streaming data analytics and streaming regression and k-means models.
InfoQ spoke with Nick about data science and machine learning concepts and his book.
InfoQ: Could you please define Machine Learning for our readers?
Nick Pentreath: There are many definitions of “machine learning”, but I tend to think of it as simply learning from data to make predictions about the future. In this sense it has many similarities to statistics, and indeed the machine learning and statistical fields overlap significantly. However, machine learning is also heavily influenced by the fields of computer science and artificial intelligence. This combination of ideas and techniques from many disciplines is one of the aspects that most excites me about machine learning.
InfoQ: What are some Machine Learning business use cases?
Pentreath: Until fairly recently, machine learning tended to be quite theoretical, and was certainly not in the public mind. With advances in both the theory and computational power, it almost seems as though machine learning is appearing everywhere. It now powers applications including online search, recommendation engines, targeted advertising, fraud detection, image and video recognition, self-driving cars and various other artificial intelligence use cases.
InfoQ: Please define data science and the role of a data scientist in Big Data projects.
Pentreath: The term “data science” is fairly new. Again, just like machine learning, one will find many definitions. I don’t think there is one definition of data science. Instead, it encompasses a blend of skills from different disciplines, including statistics, machine learning, programming, data visualization, and communication.
I particularly like a recent article which introduces “Type A” and “Type B” data scientists.
“Type A” data science is more focused on analysis and experimentation. In this sense, a data scientist may fall more towards the “statistician” or “data analyst” end of the data science spectrum. Examples may include running A/B tests to decide which new features to introduce to a web application, or performing a customer segmentation exercise for a retail store. A core skill here, apart from the technical, is communication and presentation of results and outcomes to a wide (often non-technical) audience.
“Type B” data science is more focused on creating systems that use machine learning to make decisions, often in an automated and real-time environment. Examples may include search and recommendation engines and fraud detection models. Core skills often emphasize software engineering and distributed systems for larger scale problems.
The role of a data scientist in “Big Data” projects depends on the nature of the project, and usually aligns with the two camps outlined above. However, both types of data scientist need to possess specific skills related to working with large data volumes, which may include distributed data processing, scalable machine learning approaches, and large-scale data visualization.
InfoQ: Can you discuss different Machine Learning models and what use cases or problems they solve?
Pentreath: Machine learning is an extremely broad field. In some sense, almost any problem involving making predictions under uncertainty could potentially be tackled using machine learning techniques.
The broad types of machine learning models include:
- Supervised learning – this involves predicting a given outcome, for example fraud detection, or the probability that a customer will buy a product;
- Unsupervised learning – this is focused on trying to uncover hidden structure in raw data; for example, learning the relationships between words and the structure of language from raw text data;
- Reinforcement learning – this essentially learns how to maximize some concept of a “reward” by continuously selecting an action to take from a set of available actions. Examples include many Artificial Intelligence applications like self-flying helicopters and computers that learn to play video games.
Within each broad type, there are many different models and algorithms, each with its own advantages and disadvantages.
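To make the supervised case concrete, here is a minimal sketch (not from the book) of one of the simplest possible supervised models, a nearest-neighbour classifier, in plain Python. The feature vectors and labels are invented purely for illustration:

```python
import math

def euclidean(a, b):
    """Distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(training_data, point):
    """Return the label of the training example closest to `point`."""
    nearest = min(training_data, key=lambda ex: euclidean(ex[0], point))
    return nearest[1]

# (feature vector, label) pairs: e.g. (transaction amount, hour of day)
training_data = [
    ((5.0, 14.0), "no fraud"),
    ((7.0, 12.0), "no fraud"),
    ((950.0, 3.0), "fraud"),
    ((880.0, 4.0), "fraud"),
]

print(predict(training_data, (900.0, 2.0)))  # -> fraud
```

The model "learns" nothing beyond storing the labelled examples, but it captures the essence of supervised learning: known outcomes on past data are used to predict the outcome for a new, unseen example.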
InfoQ: What are different technologies to implement Machine Learning solutions? How does Spark compare with these technologies?
Pentreath: There are probably almost as many machine learning libraries and frameworks as there are models! Among the most widely used are R and its many libraries, scikit-learn in Python, Weka in Java and Vowpal Wabbit in C++. Some more recent additions include H2O and various deep learning frameworks such as Caffe and Deeplearning4J.
Apache Spark core itself is a framework for distributed data processing. Spark’s MLlib library provides distributed implementations of various machine learning algorithms, and is focused on solving large-scale learning problems, often involving many hundreds of millions or billions of training examples. Therefore, it may not cover as many algorithms as some of the other general libraries. This is partly because implementing distributed versions of machine learning models is often difficult to do efficiently, but also because MLlib is still a young project undergoing heavy development.
InfoQ: What are the design considerations and best practices when designing a Machine Learning system?
Pentreath: The considerations when designing a machine learning system (as opposed to ad-hoc exploration and analysis) are much the same as for designing any complex software system. These may include: data storage and schema design (for example, storing and managing models as well as various input data sources); modularity for different components (for example, the data processing and model building component is often separate from the model serving component); architecting for independent scalability of each component; system and performance testing (both traditional software testing, as well as testing and monitoring model performance), and data visualization (for example, model performance and analytics dashboards).
In addition, machine learning systems in most cases interoperate with various other systems, such as web services, reporting systems, payment processing systems, and so on. In these cases approaches such as service-oriented architectures or “micro-services” may be useful to provide clean APIs for communication between the machine learning system and other systems.
InfoQ: In Chapter 4 of the book, you discussed recommendation engines. Can you talk about different recommendation models and when to choose each of the options?
Pentreath: Recommendation models typically fall into three main types: collaborative filtering, content-based, and model-based methods.
“Collaborative filtering” approaches use the “wisdom of the crowds” to find users (or items) that are similar to a given user (or item), based on the behaviour of many other users. This drives recommendations such as “people who viewed this product also viewed...” commonly seen on ecommerce sites. The assumption underlying collaborative filtering is that people who display similar behaviour have similar preferences for items (for example, movies). Thus, when recommending movies to a user, we can find other users that are similar to the user, and find the movies they have watched or rated. We can then recommend these movies to the user.
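The "similar users" idea can be sketched in a few lines of plain Python. This is an illustration of the concept only, not Spark or MLlib code, and the users, movies, and ratings are invented:

```python
import math

ratings = {
    "alice": {"Matrix": 5, "Inception": 4, "Titanic": 1},
    "bob":   {"Matrix": 5, "Inception": 5, "Memento": 4},
    "carol": {"Titanic": 5, "Notebook": 4},
}

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Recommend items liked by the most similar user, best-rated first."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, neighbour = max(others)
    unseen = [(r, m) for m, r in ratings[neighbour].items()
              if m not in ratings[user]]
    return [m for r, m in sorted(unseen, reverse=True)]

print(recommend("alice"))  # bob is most similar -> ['Memento']
```

Alice and Bob both rated Matrix and Inception highly, so Bob is Alice's nearest neighbour, and his unseen-by-Alice favourite is recommended. Real systems aggregate over many neighbours and many more users, but the mechanics are the same.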
“Content-based” models use the content attributes of items (such as categories, tags, descriptions and other data) to generate recommendations. Content-based models generally don’t take into account the overall behaviour of other users.
“Model-based” methods try to directly model the preference of users for items (for example, modelling the expected rating of a user for a movie given the set of ratings given by all users to various movies). Model-based approaches often incorporate some form of collaborative filtering, and may also include content-based methods.
Collaborative filtering (and model-based approaches that use it) often perform very well in practice.
However, one downside is that these models require quite a bit of data in order to work well. These methods also don’t handle the “cold start problem” well – this is when a new user or item appears, for which our model has no historical data, and so cannot recommend to that user (or recommend that item) until some preference data has been collected. Finally, collaborative filtering is often fairly expensive in terms of computation (particularly when the number of users and items is very large).
Content-based methods are somewhat less “personalized” than collaborative filtering models, and often have been shown to not perform as well. However, they can handle the cold-start problem since they don’t require preference data for new items.
Model-based approaches often try to blend the power and performance of collaborative filtering with the flexibility and adaptability of content-based filtering. Recent techniques such as deep learning for content-based feature extraction, factorization machines, tensor factorization and other hybrid models have achieved strong performance (at least on benchmark datasets!).
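As a rough illustration of the model-based idea (and of the kind of problem Spark's ALS solves at much larger scale), here is a toy matrix factorization trained by stochastic gradient descent in plain Python. The ratings, dimensions, and learning-rate settings are all invented for the sketch:

```python
import random

random.seed(0)
n_users, n_items, k = 3, 3, 2  # k latent factors per user/item

# Observed (user, item, rating) triples; several cells are unobserved.
observed = [(0, 0, 5), (0, 1, 4), (1, 0, 5), (1, 1, 5),
            (1, 2, 1), (2, 1, 4)]

U = [[random.uniform(0, 1) for _ in range(k)] for _ in range(n_users)]
V = [[random.uniform(0, 1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    """Predicted rating = dot product of user and item factor vectors."""
    return sum(U[u][f] * V[i][f] for f in range(k))

lr, reg = 0.02, 0.01  # learning rate and L2 regularization
for _ in range(4000):
    for u, i, r in observed:
        err = r - predict(u, i)
        for f in range(k):
            U[u][f] += lr * (err * V[i][f] - reg * U[u][f])
            V[i][f] += lr * (err * U[u][f] - reg * V[i][f])

# The learned factors reproduce the observed ratings closely, and
# predict(0, 2) gives a guess for an unobserved cell.
print(round(predict(0, 0), 2), round(predict(0, 2), 2))
```

The point of the sketch is that once users and items live in the same low-dimensional factor space, any missing user-item preference can be estimated, which is exactly what a recommender needs.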
In practice, the choice of approach and model used will depend on the domain, the data (and volume of data) available, as well as time, costs and other constraints. Often a blend of multiple approaches (or a more structured combination such as a model ensemble) is used in a real-world system. As with any machine learning system, it is important to test and evaluate the performance of different approaches both offline and on live data, and monitor and adjust accordingly.
InfoQ: One of the popular use cases for machine learning is fraud detection. Can you discuss how this use case could be implemented using the MLlib library?
Pentreath: Fraud detection is a good example of a binary classification problem. For example, we might wish to build a model that can predict whether a given online transaction is fraudulent or not. The potential outcomes are therefore binary – “fraud” or “no fraud”.
MLlib provides a number of algorithms suitable for binary classification, including linear SVMs, logistic regression, decision trees, naïve Bayes, and the multilayer perceptron. In addition, ensemble models (which combine the predictions of a set of models), such as random forests and gradient-boosted models, are also available. These ensemble models often achieve very good performance on binary classification tasks.
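To make the binary-classification framing concrete, here is a small plain-Python sketch of logistic regression on an invented one-feature fraud dataset (illustrative only, not MLlib code):

```python
import math

# (transaction amount in dollars, label: 1 = fraud, 0 = not fraud)
data = [(10, 0), (25, 0), (40, 0), (60, 0),
        (700, 1), (850, 1), (900, 1), (1200, 1)]

# Scale the feature so gradient descent is well behaved.
scaled = [(amount / 1000.0, y) for amount, y in data]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.01
for _ in range(3000):
    for x, y in scaled:
        p = sigmoid(w * x + b)   # predicted probability of fraud
        w += lr * (y - p) * x    # gradient ascent on the log-likelihood
        b += lr * (y - p)

def is_fraud(amount):
    """Classify a transaction: probability of fraud above 0.5?"""
    return sigmoid(w * (amount / 1000.0) + b) > 0.5

print(is_fraud(900), is_fraud(30))
```

A real fraud model would use many features (merchant, time of day, device, history) rather than a single amount, but the shape of the problem, learning a decision function that outputs a fraud probability, is the same.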
As with any machine learning problem, the algorithm is only part of the solution. In many cases the input data (or “features”) that is used for training is even more important. It is often said that a high percentage of a data scientist’s time is spent on cleansing and transforming raw input data into useful features for machine learning models.
In addition to the variety of binary classification algorithms above, MLlib provides a rich set of processing functions and transformations that can be applied to datasets in order to create features for these algorithms.
Another key component is to rigorously evaluate and compare the performance of different feature transformation and model pipelines, using tools such as cross-validation (also available in MLlib) and, if possible, A/B testing on live data.
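The cross-validation idea can be sketched in plain Python; a toy majority-class "model" stands in here for a real training pipeline, and the data is invented:

```python
from collections import Counter

def cross_validate(data, k=4):
    """k-fold cross-validation: average held-out accuracy over k splits."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [ex for j in range(k) if j != i for ex in folds[j]]
        # "Train": pick the most common label in the training folds.
        majority = Counter(label for _, label in train).most_common(1)[0][0]
        # "Evaluate": accuracy of always predicting the majority label.
        correct = sum(1 for _, label in test if label == majority)
        scores.append(correct / len(test))
    return sum(scores) / k

data = ([(x, "no fraud") for x in range(9)]
        + [(9, "fraud"), (10, "fraud"), (11, "fraud")])
print(cross_validate(data))  # -> 0.75
```

Swapping the majority-class stub for a real model-fitting call gives the standard procedure: every example is used for evaluation exactly once, so the averaged score is a much more honest estimate of generalization than a single train/test split.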
InfoQ: How can we leverage Machine Learning with other Spark libraries like Spark Streaming and Spark SQL?
Pentreath: The original version of MLlib in Spark typically operated on RDDs (Resilient Distributed Datasets, the core data structure abstraction of Spark), where the RDD contained the feature vectors (and, where relevant, a "label" or "target variable").
With DataFrames becoming the core abstraction of Spark SQL, MLlib has introduced a newer API called Spark ML. In particular, Spark ML is focused on using the rich, higher-level DataFrame API to create machine learning pipelines.
A typical machine learning workflow may use DataFrames functionality to read data from various sources. It may then use Spark’s SQL capabilities to filter, aggregate and perform other initial processing on the dataset. The next step could involve applying Spark ML transformations to the processed data to create feature vectors, followed by training and evaluating a model. So in this sense, machine learning in Spark is now deeply integrated with Spark SQL and DataFrames.
Spark Streaming provides implementations of clustering and linear models for streaming data. Other Spark ML models could be integrated with a Spark Streaming program, for example to continuously update the model with new data, or to monitor the performance of our models on live data.
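The streaming k-means idea, updating cluster centres incrementally as each point arrives, can be sketched in plain Python. This is a conceptual toy, not the Spark Streaming implementation; the starting centroids and the point stream are invented:

```python
def nearest(centroids, point):
    """Index of the centroid closest to `point` (squared distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((c - p) ** 2
                                 for c, p in zip(centroids[i], point)))

centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]  # points seen per centroid so far

stream = [(0.5, 0.2), (9.5, 10.1), (0.1, 0.4), (10.2, 9.8), (0.3, 0.3)]
for point in stream:
    i = nearest(centroids, point)
    counts[i] += 1
    # Running-average update: the centroid moves a fraction 1/counts[i]
    # of the way towards the new point, so it tracks the mean of its
    # assigned points without storing them.
    centroids[i] = [c + (p - c) / counts[i]
                    for c, p in zip(centroids[i], point)]

print(centroids)
```

Because each update touches only one centroid and one point, the model can keep learning indefinitely from an unbounded stream; Spark's streaming k-means additionally supports a decay factor so that older points gradually lose influence.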
Nick also talked about the future of machine learning and how the Spark MLlib library helps with developing machine learning applications.
Pentreath: Though it may seem like machine learning is everywhere, I believe we have just started to scratch the surface in terms of applying machine learning techniques to solve real-world problems. As the need for automating decision making becomes greater, so will machine learning become more and more widely used across many different industries.
Similarly, as data volumes continue to grow, tools for distributed machine learning and large-scale data processing will become more and more critical. I see Apache Spark as a core piece of the “Big Data” puzzle, and with it MLlib and Spark ML as a key element in making large-scale machine learning easier to develop and use and more accessible to a wider range of users.
I’m very excited about what the future holds for both machine learning in general and Apache Spark in particular!
You can find more about Nick’s book at the Packt website and, if you are interested in purchasing a copy, the discount code “MLWSPK50” provides a 50% discount on the ebook.
About the Interviewee
Nick Pentreath is a cofounder of Graphflow, a big data and machine learning company focused on recommendations and customer intelligence. Nick has a background in financial markets, machine learning, and software development. He has worked at Goldman Sachs and as a research scientist at the online ad targeting startup Cognitive Match in London, and led the Data Science and Analytics team at Mxit, Africa’s largest social network. He is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add value to the bottom line. Nick has been involved in the Apache Spark project since 2013 and is a member of the Apache Spark PMC.