MLConf was going strong on Friday April 11 in New York City. This was the first ever MLConf happening in NYC, and it was met with a resounding success, so much that it quickly sold out. The conference was a full day packed with sessions around Machine Learning, and a good portion of the talks also talked about Big Data topics and how to apply this machine intelligence at very large scale.
This is another good indication that the field of Data Science is stronger than ever, and that today's data scientists need a mix of two skills: traditional research skills with a background in mathematics and statistics, and more engineering skills to be able to work with the most popular Big Data frameworks.
Machine Learning and Big Data
Corinna Cortes from Google kicked off the conference with a description of how some of Google's services are built, with a focus on how they scale to millions of users. Take for example the problem of image browsing, where Google is applying research in image processing to cluster similar images together in its product titled Google Image Swirl. According to Corinna, this is done by computing all the pair-wise distances between related images and form clusters, but at Google scale this becomes quickly impractical. To solve that, Google started representing images by short vectors using the Kernel-PCA algorithm, and making random projections to grow the resulting tree top-down. At query time, all that is left is browing the tree to return the related images.
Another example described by Corinna is the Google Flu Trends where they are essentially looking at correlation between query searches. This is similar to another product called Google Correlate which allows users to correlate search patterns with arbitrary real-world events. Corinna described using the K-means algorithm to solve this problem by splitting time series into smaller chunks, and representing each chunk with a set of cluster centers.
Ted Willke from Intel came on stage to stress an important paradigm shift: we are living in a world where context and semantics matter and are being analyzed more and more, for example through the use of RDF. Scaling at web scale remaing a challenge, and Ted described using the Titan distributed database to store RDF data, and applying the LDA algorithm that can be simply expressed using Gremlin.
In a very different talk, Yael Elmatad from Tapad described the algorithms involved in the construction of Tapad's device graph that links consumer devices together and contains around 2 Billion nodes representing more than 100 Million households and 250 Million individuals. The weighting algorithm for the edges between devices in particular took several tries to get right. Focusing initially on segmentation data (traits associated to various individuals), Yael found that this approach performed poorly, barely 10% better than a random guess because of the nature of segment data itself that is filled with randomness and noise, and also the fact that long-lived devices tend to accumulate a lot of segments. Instead of trying to correct the biases, Yael took a different approach to use Tapad's in-house browsing data by considering unique domains, even if that data is harder to come by and so much sparser. The results were much more encouraging and performed 40% better than random, which is close to ideal since Tapad considers 50% better than random to be the benchmark. Yael makes an interesting statement about what data should be considered when building models:
The best pieces of data may be scarce and raw because they are often less fraught with hidden biases and unnecessary processing.
Edo Liberty from Yahoo came on stage to describe a type of streaming computational model which is often useful when dealing with large-scale data when memory is limited. In the context of email threading for Yahoo Mail, Edo was tasked to come up with an algorithm that would scale to Millions of users with limited resources. Edo used the Misra-Gries algorithm described in these lecture notes by clustering all available emails, and computing conditional probabilities to determine how likely two emails can occur one after another. This method goes back to a common theme about trading accuracy for space.
In terms of scalability there are many databases on the market like Amazon Redshift which are built for performance and try to be a generic platform to fit all use cases. According to Shan Shan Huang from LogicBlox, there are times when using a database built for a very specific purpose can outperform a generic database like Redshift. LogicBlox implemented their own join algorithm in the form of the leapfrog triejoin which is a kind of worst-case optimal join, and is available in their platform.
Perhaps the most impressive talk of the day in terms of scale since it encompasses more than 30% of US traffic, Justin Basilico from Netflix described how the company built a recommendation system for the homepage to show movies based on user tastes. For Netflix everything is actionable to make recommendations, including play data, ratings, movie metadata, impressions, interactions or even social data. These events are then fed to various models which are evaluated through a rigorous testing process before it is rolled out to all users, as described below:
And for users wanting to leverage their existing Hadoop infrastructure to apply Machine Learning algorithms, but looking at something more than Mahout, there is H2O. Speakers Irene Lang and Anqi Fu from 0xdata went into details on how H2O adds an in-memory layer on top of Hadoop, as well as an R interface to keep it familiar among the data science community.
The Rise and Rise of Spark
Apache Spark is an open-source project initially developed at UC Berkeley and only really started about two years ago. It is now one of the most popular Big Data frameworks with the most active contributor community - even more than Apache Hadoop. Spark was a common theme to many of the talks at MLConf, reinforcing its position as one of the leading frameworks to process data at scale.
It was not surprising to see a talk from Databricks, the company behind Spark where a lot of contributors are working. Xiangrui Meng gave a sensible overview of the Spark framework and stressed one of its key points: Spark is becoming the most powerful platform for data scientists because it unifies everything into a single platform whose foundation is Spark. MLlib is one of the projects that builds on top of Spark to provide a reliable and scalable Machine Learning platform, which can be integrated with other projects from the UC Berkeley software stack. For example, to integrate Spark SQL with MLlib, Xiangrui gave the following example:
// Data can easily be extracted from existing sources // such as Apache Hive. val trainingTable = sql(""" SELECT e.action, u.age, u.latitude, u.longitude, FROM Users u JOIN Events e ON u.userId = e.userId""") // Since `sql` returns an RDD, the results of the above // query can be easily used in MLlib. val training = trainingTable.map { row => val features = Vectors.dense(row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = SVMWithSGD.train(training)
And to integrate with Spark Streaming, the following example shows how to collect and process tweets in real-time:
// train a k-means model val model: KMeansModel = ... // apply model to filter tweets val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0))) val statuses = tweets.map(_.getText) val filteredTweets = statuses.filter(t => model.predict(featurize(t)) == clusterNumber) // print tweets within this particular cluster filteredTweets.print()
In a different example, Xiangrui showed how to integrate MLlib with the new Graph processing engine GraphX based on Spark with a simple spam detector example:
// assemble link graph val graph = Graph(pages, links) val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices // load page labels (spam or not) and content features val labelAndFeatures: RDD[(Long, (Double, Seq((Int, Double))))] = ... val training: RDD[LabeledPoint] = labelAndFeatures.join(pageRank).map { case (id, ((label, features), pageRank)) => LabeledPoint(label, Vectors.sparse(features ++ (1000, pageRank)) } // train a spam detector using logistic regression val model = LogisticRegressionWithSGD.train(training)
According to Josh Wills from Cloudera, one of the real strengths of Spark is that it is as good for data cleansing as it is for data modeling. Since everything can be done in the same environment, this also means less context-switching for developers. But according to Josh this is still too raw as developers have to manually go from the raw data to the model, and Josh would like to do a feature extraction DSL for Spark, similarly to the R formula specification. But the real challenge is that it needs to operate in a distributed context, and can be quite labor-intensive to develop.
Spark is still a young technology, and several other speakers mentioned that they were looking at bringing Spark in their infrastructure, including Spotify and Netflix.
Analyzing Human Behavior with Machine Learning
Human behavior is very rich and a driving force behind many companies such as social networks. Todd Merrill from HireIQ described how the same applies to making hiring decisions in call centers. By using audio analytics to process the voice signal recorded in interviews, HireIQ is able to extract meaningful features such as the tone of the voice to build models used to minimize the probability that a potential employee might quit in a short period. While initially flawed because of overfitting, the tool now agrees with human experts most of the time and is able to meaningfully reduce hiring-based costs for call centers.
A big area where human behavior is of paramount importance is online advertising. Claudia Perlich from Dstillery gave a detailed talk around the challenges associated to applying models on high-dimensional data in display advertising. One interesting highlight from this talk is that you have to find a good tradeoff between bias and variance, and sometimes optimizing a different variable can yield better results. For example, when building a model to determine who buys on a given website, trying to optimize for clicks gives almost random results because you could be optimizing for people with bad eyesight, whereas optimizing for something simpler like site visits gives much better results. Another good lesson from Claudia is that when doing sampling, it is important to sample the positive examples in the same proportion as the negative examples, otherwise the model will be skewed.
Spotify brings a complete different use case on the table, which is still about analyzing human behavior. Erik Bernhardsson explained how the company uses collaborative filtering to structure music understanding by finding patterns in usage data. One of the key elements is figuring out how similar two tracks or albums are. Erik recalled a great quote from George Box staging that:
Essentially, all models are wrong, but some are useful.
This is exactly the approach that Spotify took, by using the latent factors method on play counts to bring the problem back to a matrix factorization problem. The idea is putting all the users and tracks into a big sparse matrix, and then running the PLSA algorithm. The results are in the form of vectors which are very small and compact fingerprints of a user's musical tastes. Aside from having a solid underlying model, this method is also very fast and easy to scale, and measuring similarities becomes as easy as computing the cosine similarity. While PLSA initially couldn't fit in memory and the team had to resort to using Hadoop, most newer models don't need a lot of latent variables and can fit in RAM, so the Hadoop overhead becomes avoidable.
Machine Learning for the Greater Good
Machine Learning also enables scientists to apply their algorithms for the greater good, as is the case with Pek Lum from Ayasdi. Pek applies Topological Data Analysis (TDA) techniques to work around the limitations of most query-based systems available today. When looking at the human genome in the context of cancer for example, Ayasdi is able to create networks in the form of graphs and find patterns in this data. This particular type of analysis is also applicable to other high-dimensional complex datasets. According to Pek, there are three fundamental properties for any kind of TDA system to work:
- Coordinate invariance: the topologies measure properties of the shapes that don't change even as shapes get rotated, translated, or more generally change the coordinates system.
- Deformation invariance: a measure should not change even if a shape gets stretched or crushed.
- Compressed representation: it is important when dealing with a shape to bring it back to its fundamentals and keeping only the relevant part of the signal and stripping out the noise.
The IBM Watson team is also applying Machine Learning to healthcare by adapting the system behind Watson who beat Jeopardy to real-world problems, in particular healthcare. Watson is built on semantic relation extraction, and speaker Chang Wang described how this was adapted to the medical domain. The main idea behind is to first generate candidate answers by detecting relations in the question and doing lookups in a knowledge base such as DBpedia or FreeBase. Once these candidates are generated, the second step is to look at the sentences and determine if they mean the same thing or not. And in a final step Watson constructs and updates the knowledge base, which is critical in a field such as medical knowledge where assumptions can change very quickly.
Healthcare data can be uncertain by nature, and speaker Samantha Kleinberg from Stevens Institute of Technology described how this can affect models. The particular problem at-hand was that of insulin intake based on glucose levels in blood, with a sensor in the body for auto-regulation. Sensors by nature can give incorrect results, for example when they are calibrating, or even missing values when they are turned off. This leads to very noisy data, which has to be discretized.
About the Author
Charles Menguy is a Software Engineer at heart with a strong bias towards Data Science. He has been using Big Data technologies such as Hadoop in the online advertising industry, and is currently working at Adobe to help scale its infrastructure in response to the massive amounts of data generated by modern web usage. Passionate about data analysis and machine learning, Charles is also an active contributor to StackOverflow, and recently started blogging here.