Interview and Video Review: Working with Big Data: Infrastructure, Algorithms, and Visualizations
Get Ready for Hands on Coding
Software Installations Made Simple
The first three lessons of the training covers the installations of the major server software technologies in a Big Data architecture. The main server software selected for the training include: Hadoop, Cassandra, and Kafka. Each server system is introduced with enough detail to allow users to follow along by doing the same hands on work as Paul. The technologies complement each other and can run easily side by side as services on a single server instance. Paul uses a single m1.large instance of Ubuntu 12.04 LTS in Amazon's EC2. When needed Paul uses minor additional supporting software and code to complete the training from a conceptual perspective without taking away the focus from the major server software systems in the Big Data architecture. Supporting software includes ZooKeeper, Redis, and Sinatra.
Accessible Big Data Architecture
Paul made it easy for users of the training to get started by choosing open source software exclusively and the data used in the training is accessible over the internet as well. The beginning of the training focuses on unstructured data processing with Hadoop. The training uses a dump of Stack Exchange posts from September of 2011, the most recent posts from March 2013 can be used just the same. Following Hadoop, the training continues with Cassandra, a highly scalable column oriented database that allows for efficient read and writes for Hyperscale web technology. It is pointed out that Cassandra is used by Netflix and Twitter. Next, Paul introduces high throughput messaging and its role of moving data into a Big Data architecture. With the main server systems covered, Paul then leads into a discussion on machine learning algorithms. This part consists of multiple algorithms that calculate k-nearest neighbors and in turn allow for predictive analysis of tags on Stack Exchange posts. The final two lessons cover integrating the Big Data architecture components into production environments along with visualization techniques for doing macro level analysis.
Processing Unstructured data with Hadoop
The training makes it clear that in every Big Data conversation these days Hadoop is included. Paul recommends the Cloudera distribution because of its relative ease of installation. Doug Cutting, Cloudera Chief Architect, helped create Hadoop as an open source software framework that runs on servers as a group of services for processing large amounts of unstructured data. Google's MapReduce and Google File System provided the inspiration for Hadoop. Its main components include a Hadoop kernel, Hadoop Distributed File System (or HDFS), and MapReduce. The pseudo distributed version used in the training enables the starting of the component services in a way similar to the way a distributed Hadoop installation would run. HDFS provides a resilient data storage spread across many commodity servers. The MapReduce technology provides data location aware processing that follows this standard sequence: read in data, map data thereby creating output, shuffle the data to create a sort ordering over a key, and then reduce the data into final output. The map and reduce phases provide opportunities for integrating Hadoop with other software systems including: databases, message buses, and other file systems. Additionally, there is an extended ecosystem that includes an array of technologies: Sqoop, Flume, Hive, Pig, Mahout, Datafu, Hue, and many others.
Storing Structured data with Cassandra
Cassandra fits into the Big Data architecture in the role of structured data storage. Paul uses and recommends the DataStax distribution of Cassandra in the training. Cassandra is a column-oriented database providing high availability and durability through a distributed architecture. It achieves hyperscale clustering by providing a type of consistency known as eventual consistency. This means that at any given time there can be different values for the same database entry in different machines. Cassandra will eventually converge the data to a consistent state, however in the mean time two reads from different parts of the system can return different values for the same database entry. Software engineers can mitigate the risks associated with consistency by tuning Cassandra to meet the expected usage patterns. In the training Paul uses two methods for interacting with Cassandra through Ruby, a native client gem "cassandra" and a CQL client gem "cassandra-cql". The training takes the users through writing Map and Reduce jobs in Ruby to interact with Cassandra. Paul teaches that a Big Data architecture is comprised of both unstructured data processing, Hadoop in the training, and structure data processing, Cassandra in the training.
Messaging with Kafka
According to Paul, "a proper Big Data architecture utilizes a messaging system to handle incoming data and events in real time". In the training Paul has chosen Kafka, an open source software platform, to explain the role of messaging. LinkedIn originally developed Kafka and now it is maintained as an Apache project. The originators created the platform for performance and if there was ever a tradeoff to be made between performance or functionality, they chose performance. Kafka can be installed on a single machine while still maintaining its division of producers, brokers, and consumers. Kafka depends on ZooKeeper to assist with determining which brokers are up and which are down. Kafka has an API that is usable by both producers and consumers through the Ruby gem "kafka-rb". Paul makes sure it is understood that consumers of messages are responsible for keeping up with the messages they have read. Additionally, he elaborates on an algorithm that uses Redis as an external synchronization system for preventing multiple consumers from processing the same messages.
Machine Learning Algorithms to Predict Outcomes
Big Data architectures distill data into usable information that can be leveraged in models for predictive classification. Paul covers machine learning algorithms for K-nearest neighbors (or k-NN) in order to predict the tags associated with a Stack Exchange post. The training includes Ruby implementations of two distance algorithms, Euclidian and cosine similarity. Paul notes that while Ruby is the language being used in exploring Big Data architecture, in real systems a language like C that has better performance should be used. The training includes a lot of hands on coding in this section, including the creation of a word dictionary, the before mentioned distance algorithms, conversion of posts to sparse arrays, a complete end to end algorithm that predicts tags.
Testing Models in Production
Paul explains cross validation and provides Ruby code for doing so with the posts problem domain that has continued to be used throughout the training. He adds to the code from the previous sections and combines it with an in process http server, a Ruby gem called Sinatra, to provide access to a tag recommender. The code provides a basis for the conceptual diagramming and engineering Paul elaborates on. Paul talks about how each set of data makes up a model for a tag recommender and each set is considered a version. A service can provide distinct access to each version of the tag recommender without users being aware. All the users are aware of is that they are getting tag recommendations. This allows switching versions of the tag recommender at runtime. A website could then use each version of the tag recommender service in combination with the tags chosen by an end user and output the sum of that data to a message bus for processing out of band. The messages collected through the message bus would then be put into Hadoop and cross validated for accuracy in combination with the database of posts. The results would then be fed back into newer versions of the model.
Visualization in the Browser
InfoQ has also interviewed Paul Dix on the video training lessons.
InfoQ: What are the most significant challenges a company faces when executing a Big Data project?
Paul: I think the two biggest challenges are usually in operations and learning how to program against the new tools. Operations are tricky because with big data you’re talking about a distributed system. For many organizations, it’s their first attempt at running distributed clusters of machines that aren’t just web servers or typical master/slave RDBMS configurations. For programming with big data systems, you’re usually looking at either a MapReduce paradigm for batch processing, or patterns with messaging systems and stream processing for real-time processing. They’re both pretty different from the standard web request, cache, database lifecycle.
InfoQ: How does continuous delivery as a software development practice integrate with the development and production release of a Big Data architecture?
Paul: Making continuous delivery work in a big data environment can be difficult or, at the very least, expensive. The reason is that many things in big data seem to work well in small scale, but completely break when dealing with the full dataset. The only way to be absolutely sure that your continuous delivery system is checking everything properly is to have an integration environment that mirrors your production data. That can get quite costly since you’re basically running two clusters at the same time. One method to bring down the requirements for the integration environment is to sample the production data so you’re not keeping an exact mirror. However, you need to be sure that the dataset is large enough that it triggers performance problems before you deploy into production.
InfoQ: What automated quality gates should Big Data Architecture components be processed through before being released into production?
Paul: Test driven development with all components is obviously a good practice. However, there are a few other things that you’d want to do that TDD may not catch. The first is to ensure that anything you’re releasing into production has been tested for performance at scale. The other is only relevant if you’re creating predictive models like classification or clustering. For that you’ll want to have known datasets from your production data so you can run each iteration of your models against this data to compare for improvements against previous versions.
About the Video Training Author
Paul Dix is CEO and co-founder at Errplane, a Y Combinator backed startup providing a service for instrumenting and monitoring applications and infrastructure that automatically sends alerts based on detected anomalies. Paul is the author of “Service Oriented Design with Ruby and Rails.” He is a frequent speaker at conferences and user groups including Bacon, Web 2.0, RubyConf, RailsConf, and GoRuCo. Paul is the founder and organizer of the NYC Machine Learning Meetup, which has over 3,600 members. In the past he has worked at startups and larger companies like Google, Microsoft, and McAfee. He lives in New York City.