BT

Interview and Video Review: Working with Big Data: Infrastructure, Algorithms, and Visualizations

Posted by Aslan Brooke on May 02, 2013 |

Paul Dix leads a practical exploration into Big Data in the video training series "Working with Big Data LiveLessons: Infrastructure, Algorithms, and Visualizations" The first five lessons of the training span multiple server systems with a focus on the end to end processing of large quantities of XML data from real Stack Exchange posts. He completes the training with a lesson on developing visualizations for gaining insights from the macro level analysis of Big Data. Paul uses Ruby as the primary programming language and JavaScript for browser programming. He also created a GitHub repository where the commands executed in the training can be viewed. Paul makes it practical for people to follow along by installing all the software in a Ubuntu Linux instance in AWS and he does all the software development on that same computer.

Get Ready for Hands on Coding

The first five lessons of the training use the Ruby programming language for almost all coding examples in order to level set the transition from one technology to the next in the Big Data architecture. Paul acknowledges during the training that in some cases that instead of Ruby other programming languages should be used for optimization purposes, however maintaining one language across the lessons helps accomplish the goal of focusing on the architecture and not the low level details. In the last lesson Paul necessarily switches the programming language to JavaScript because of its providence in web browser programming.

Software Installations Made Simple

The first three lessons of the training covers the installations of the major server software technologies in a Big Data architecture. The main server software selected for the training include: Hadoop, Cassandra, and Kafka. Each server system is introduced with enough detail to allow users to follow along by doing the same hands on work as Paul. The technologies complement each other and can run easily side by side as services on a single server instance. Paul uses a single m1.large instance of Ubuntu 12.04 LTS in Amazon's EC2. When needed Paul uses minor additional supporting software and code to complete the training from a conceptual perspective without taking away the focus from the major server software systems in the Big Data architecture. Supporting software includes ZooKeeper, Redis, and Sinatra.

Accessible Big Data Architecture

Paul made it easy for users of the training to get started by choosing open source software exclusively and the data used in the training is accessible over the internet as well. The beginning of the training focuses on unstructured data processing with Hadoop. The training uses a dump of Stack Exchange posts from September of 2011, the most recent posts from March 2013 can be used just the same. Following Hadoop, the training continues with Cassandra, a highly scalable column oriented database that allows for efficient read and writes for Hyperscale web technology. It is pointed out that Cassandra is used by Netflix and Twitter. Next, Paul introduces high throughput messaging and its role of moving data into a Big Data architecture. With the main server systems covered, Paul then leads into a discussion on machine learning algorithms. This part consists of multiple algorithms that calculate k-nearest neighbors and in turn allow for predictive analysis of tags on Stack Exchange posts. The final two lessons cover integrating the Big Data architecture components into production environments along with visualization techniques for doing macro level analysis.

Processing Unstructured data with Hadoop

The training makes it clear that in every Big Data conversation these days Hadoop is included. Paul recommends the Cloudera distribution because of its relative ease of installation. Doug Cutting, Cloudera Chief Architect, helped create Hadoop as an open source software framework that runs on servers as a group of services for processing large amounts of unstructured data. Google's MapReduce and Google File System provided the inspiration for Hadoop. Its main components include a Hadoop kernel, Hadoop Distributed File System (or HDFS), and MapReduce. The pseudo distributed version used in the training enables the starting of the component services in a way similar to the way a distributed Hadoop installation would run. HDFS provides a resilient data storage spread across many commodity servers. The MapReduce technology provides data location aware processing that follows this standard sequence: read in data, map data thereby creating output, shuffle the data to create a sort ordering over a key, and then reduce the data into final output. The map and reduce phases provide opportunities for integrating Hadoop with other software systems including: databases, message buses, and other file systems. Additionally, there is an extended ecosystem that includes an array of technologies: Sqoop, Flume, Hive, Pig, Mahout, Datafu, Hue, and many others.

Storing Structured data with Cassandra

Cassandra fits into the Big Data architecture in the role of structured data storage. Paul uses and recommends the DataStax distribution of Cassandra in the training. Cassandra is a column-oriented database providing high availability and durability through a distributed architecture. It achieves hyperscale clustering by providing a type of consistency known as eventual consistency. This means that at any given time there can be different values for the same database entry in different machines. Cassandra will eventually converge the data to a consistent state, however in the mean time two reads from different parts of the system can return different values for the same database entry. Software engineers can mitigate the risks associated with consistency by tuning Cassandra to meet the expected usage patterns. In the training Paul uses two methods for interacting with Cassandra through Ruby, a native client gem "cassandra" and a CQL client gem "cassandra-cql". The training takes the users through writing Map and Reduce jobs in Ruby to interact with Cassandra. Paul teaches that a Big Data architecture is comprised of both unstructured data processing, Hadoop in the training, and structure data processing, Cassandra in the training.

Messaging with Kafka

According to Paul, "a proper Big Data architecture utilizes a messaging system to handle incoming data and events in real time". In the training Paul has chosen Kafka, an open source software platform, to explain the role of messaging. LinkedIn originally developed Kafka and now it is maintained as an Apache project. The originators created the platform for performance and if there was ever a tradeoff to be made between performance or functionality, they chose performance. Kafka can be installed on a single machine while still maintaining its division of producers, brokers, and consumers. Kafka depends on ZooKeeper to assist with determining which brokers are up and which are down. Kafka has an API that is usable by both producers and consumers through the Ruby gem "kafka-rb". Paul makes sure it is understood that consumers of messages are responsible for keeping up with the messages they have read. Additionally, he elaborates on an algorithm that uses Redis as an external synchronization system for preventing multiple consumers from processing the same messages.

Machine Learning Algorithms to Predict Outcomes

Big Data architectures distill data into usable information that can be leveraged in models for predictive classification. Paul covers machine learning algorithms for K-nearest neighbors (or k-NN) in order to predict the tags associated with a Stack Exchange post. The training includes Ruby implementations of two distance algorithms, Euclidian and cosine similarity. Paul notes that while Ruby is the language being used in exploring Big Data architecture, in real systems a language like C that has better performance should be used. The training includes a lot of hands on coding in this section, including the creation of a word dictionary, the before mentioned distance algorithms, conversion of posts to sparse arrays, a complete end to end algorithm that predicts tags.

Testing Models in Production

Paul explains cross validation and provides Ruby code for doing so with the posts problem domain that has continued to be used throughout the training. He adds to the code from the previous sections and combines it with an in process http server, a Ruby gem called Sinatra, to provide access to a tag recommender. The code provides a basis for the conceptual diagramming and engineering Paul elaborates on. Paul talks about how each set of data makes up a model for a tag recommender and each set is considered a version. A service can provide distinct access to each version of the tag recommender without users being aware. All the users are aware of is that they are getting tag recommendations. This allows switching versions of the tag recommender at runtime. A website could then use each version of the tag recommender service in combination with the tags chosen by an end user and output the sum of that data to a message bus for processing out of band. The messages collected through the message bus would then be put into Hadoop and cross validated for accuracy in combination with the database of posts. The results would then be fed back into newer versions of the model.

Visualization in the Browser

In the last lesson, Paul covers the visualizations that result from Big Data architectures aggregating data into human understandable information. The training details the JavaScript programming necessary to view aggregate information about post tags across time in a web browser. Paul uses the D3 javascript library to create both a bar chart and a time series data graph. The post data drives the visualizations that Paul codes in D3. He covers the following D3 functionality: extent, scale, selecting html elements, and retrieving data.

InfoQ has also interviewed Paul Dix on the video training lessons.

InfoQ: What are the most significant challenges a company faces when executing a Big Data project?

Paul: I think the two biggest challenges are usually in operations and learning how to program against the new tools. Operations are tricky because with big data you’re talking about a distributed system. For many organizations, it’s their first attempt at running distributed clusters of machines that aren’t just web servers or typical master/slave RDBMS configurations. For programming with big data systems, you’re usually looking at either a MapReduce paradigm for batch processing, or patterns with messaging systems and stream processing for real-time processing. They’re both pretty different from the standard web request, cache, database lifecycle.

InfoQ: How does continuous delivery as a software development practice integrate with the development and production release of a Big Data architecture?

Paul: Making continuous delivery work in a big data environment can be difficult or, at the very least, expensive. The reason is that many things in big data seem to work well in small scale, but completely break when dealing with the full dataset. The only way to be absolutely sure that your continuous delivery system is checking everything properly is to have an integration environment that mirrors your production data. That can get quite costly since you’re basically running two clusters at the same time. One method to bring down the requirements for the integration environment is to sample the production data so you’re not keeping an exact mirror. However, you need to be sure that the dataset is large enough that it triggers performance problems before you deploy into production.

InfoQ: What automated quality gates should Big Data Architecture components be processed through before being released into production?

Paul: Test driven development with all components is obviously a good practice. However, there are a few other things that you’d want to do that TDD may not catch. The first is to ensure that anything you’re releasing into production has been tested for performance at scale. The other is only relevant if you’re creating predictive models like classification or clustering. For that you’ll want to have known datasets from your production data so you can run each iteration of your models against this data to compare for improvements against previous versions.

About the Video Training Author

Paul Dix is CEO and co-founder at Errplane, a Y Combinator backed startup providing a service for instrumenting and monitoring applications and infrastructure that automatically sends alerts based on detected anomalies. Paul is the author of “Service Oriented Design with Ruby and Rails.” He is a frequent speaker at conferences and user groups including Bacon, Web 2.0, RubyConf, RailsConf, and GoRuCo. Paul is the founder and organizer of the NYC Machine Learning Meetup, which has over 3,600 members. In the past he has worked at startups and larger companies like Google, Microsoft, and McAfee. He lives in New York City.

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread
Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT