BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Articles Chris Fregly on the PANCAKE STACK Workshop and Data Pipelines

Chris Fregly on the PANCAKE STACK Workshop and Data Pipelines

Key takeaways

  • The fully-immersive and interactive workshop lets attendees get up and running quickly using an all-in-one Docker image. This allows attendees to play in the sandbox with many integrations covered in detail during the workshop.
  • The Apache Spark Community is doing a good job of listening to user feedback and integrating with the most common and productive tools available including Parquet, Avro, Elasticsearch, Cassandra, Redshift, Kafka, Kinesis, where Spark Streaming is the key moving forward. These integrations are covered during the workshop as well.
  • Spark is particularly successful due to the combination of ongoing open-source and industry support that's bringing data processing, machine learning, and advanced analytics to the masses.
  • Model refinement, testing, and deployment in a heavily metered and automated manner is the basis for optimal model discovery, but there aren't a lot of well-vetted solutions that do it well at the moment.

Chris Fregly's Advanced Apache Spark PANCAKE STACK workshop, Advanced Spark and TensorFlow Meetup, videos, and presentations span a wide range of data science and engineering components one might evaluate or adopt in a Spark-based end-to-end recommendation engine and data pipeline.

The workshop is timely for data scientists and engineers searching for an integrated perspective on how one could go about building a Spark data pipeline.

The source code is packaged into a single Docker image so attendees can set up and immerse themselves in the interactive workshop while minimizing the need to focus on operations and infrastructure during the workshop.

The integrated sandbox environment, or playground, forms the central focus of the workshop hosted in several countries around the world, and is available on Github.

The workshop attendance and feedback, as well as continuous refinement of the workshop material became the basis for Pipeline.IO, Fregly's start-up.

The company's concept is to productionize what he refers to as the last-mile of the Apache Spark machine learning pipeline, namely the aspects involving model refinement, testing and deployment in a heavily metered and automated manner so data scientists can quickly measure the efficacy of their models and reduce the overall time to optimal model discovery.

InfoQ spoke with Chris Fregly about the PANCAKE STACK workshop and some of the trends ocurring in the data pipeline engineering space he captures in the playground code and the Advanced Spark workshop.

InfoQ: How did the PANCAKE STACK workshop come about, what does it mean and what were your motivations?

Fregly: The PANCAKE STACK is the next generation SMACK Stack which I co-developed back in 2015 with some colleagues from Mesosphere ('M'esos), Typesafe (now Lightbend) ('A'kka), Datastax ('C'assandra), and Confluent ('K'afka). I was working for Databricks at the time, so I provided content for Apache 'S'park as well as the baseline reference app that we used for all demos based on my playground repo which is now the foundation of my new company, PipelineIO. The extended PANCAKE STACK acronym started off as a joke between me and some old BEA Systems co-workers at a bar in Boulder, Colorado earlier this year. High-altitude drinking is a funny thing sometimes! Plus, the domain name pancake-stack.com was available, so I snagged it and the rest is history! The workshop is very dynamic. In addition to new demos and datasets that we're continuously adding, we try to adapt each session to the flow and knowledge level of each group. Sometimes we go beyond the tech represented by the acronym, sometimes we spend a half-day on a single letter like 'T'ensorFlow. Sometimes I bring in guest speakers, sometimes I do the whole thing myself. It all depends on the the background of the attendees, new release announcements, current trends, the city, speaker availability, and even the venue arrangement. As far as motivations go, the workshop is very timely and relevant because of the often bleeding edge technologies used in the stack and people's desire to learn and play with them in an integrated environment. Fortunately, attendees are very resilient and understanding when demo's crash and burn. We've all been there, of course.

InfoQ: What is Spark exceptionally good at, what is is really not good at and how does your workshop demonstrate these attributes?

Fregly: Apache Spark has made big data processing, machine learning, and advanced analytics accessible to the masses. This is awesome. For me, personally, Spark has activated neurons in my brain that haven't been activated in years. Topics like statistics and probabilistic data structures are extremely fascinating to me. As such, I have many machine learning and streaming approximation demos in the PANCAKE STACK Workshop. Other key factors to Spark's success include accessibility of examples, training videos, demos, meetups, etc. Early on, the Berkeley AMPLab created AMPCamp - a series of workshops highlighting the various libraries of Spark including Spark SQL, ML, GraphX, and Streaming. Continuing this accessibility trend, Databricks, founded by the creators of Spark, recently released a Community Edition of their Databricks Notebook product including many Apache Spark training videos and notebooks. This not only increases signups for their product, but also helps the Spark community as a whole.

InfoQ: You mentioned in a recent workshop that Spark Streaming has some limitations and may not satisfy all use cases. Can you expand on this?

FreglySpark Streaming, by nature, is a mini-batch system unlike Apache Storm, Apache Flink, and other event-based (per record) streaming systems. As data flows into Spark Streaming, a batch of received data is submitted to the Spark engine as a job. This is similar to a batch job, but with much less data, of course. The argument is that a mini-batch system will provide higher throughput than an event-based system like Storm - at the expense of higher latency for a single event to make its way through the system. Always a fan of giving the developer control, I would prefer Spark Streaming to let the developer choose their tradeoffs - throughput vs. latency - by enabling both mini-batch and per-record implementations. And as we know from physics, it's much easier to combine atoms (combine events into mini-batches) than split atoms (break up mini-batches into individual events). Mini-batches also limit Complex Event Processing (CEP) functionality offered by event-based streaming systems like Apache Flink, for example. I've also found it difficult to reason through the delivery guarantees and state management of Spark Streaming. Everytime I think I get it right, an edge case pops up that invalidates my thinking. This isn't Spark Streaming's fault, per se, as many other systems such as Apache Kafka (Event Source) and Apache Cassandra (Event Sink) are involved in the complete end-to-end streaming workflow - each with their own transactional and replay characteristics.

The last limitation is lack of consistent support. Unfortunately, a lot of Spark Streaming questions are left unanswered on the Spark mailing list. Again, this is partially due to the complexity of the overall, end-to-end nature of a production-ready Spark Streaming workflow involving many disparate event sources and sinks.

On the positive side, the creator of Spark SQL from Databricks, Michael Armbrust, from Databricks has recently taken more of a lead role in the Spark Streaming project as Spark SQL and Spark Streaming move closer together with the new "Spark SQL on Streaming" Structured Streaming effort. I've already noticed a huge uptick in responses to Spark Streaming questions on the Apache Spark user and dev list. This will likely make Spark Streaming the clear winner over the next few Spark 2.x releases. And I'm very excited about the upcoming "Spark ML on Streaming" effort that has been discussed recently. I recently productionised an incremental "Matrix Factorization on Streaming" Collaborative Filtering Recommendation Algorithm with Spark Streaming for a customer of PipelineIO. This is pretty complex stuff that can't be done with Apache Spark alone. We chose Apache Cassandra, Redis and Dynomite - an open source project from my former employer, Netflix, based on the famous Dynamo Paper from Amazon.

The Redis + Dynomite combination enables fault-tolerant and scalable caching to power the prediction/serving layer for the recommendations. Netflix replicates their Redis and Memcached clusters across 3 AWS regions - and 3 Availability Zones (data centers) per region - for a total of 9 replicas of every user's recommendation list. With this kind of global, highly-available, in-memory cache replication, serving recommendations to end users no longer requires disk I/O. This is an incredible performance advantage. And Apache Spark, while the defacto standard for data processing and ML Model training, cannot fulfill the "last mile" of ML Model serving, unfortunately. And it shouldn't. One thing I emphasize in all of my talks, meetups, and workshops: Apache Spark is a batch-oriented system (including Spark Streaming's mini-batches). Spark should not be placed on the user request/response hot path - especially for ML model serving which requires single-digit response latency.

Unfortunately, Spark Streaming gives you the illusion that Apache Spark is capable of supporting user requests/responses. I see this mistake a lot when working with customers. The fundamental mini-batch nature of Spark Streaming creates intolerably-high latency responses on even the simplest requests. A better choice for low-latency, user request-response is Redis or Memcached for the caching layer and a real data store such as MySQL, Apache Cassandra, or even Elasticsearch. It's worth noting that Elasticsearch is more than just a search engine. It's used heavily as a NoSQL datastore, an analytics engine, and more-recently, a graph database.

InfoQ: Can you tell us about your pipeline.io project and how it differs from the PANCAKE workshop? You mentioned productionization as a focus, as well as single-click deployments from Jupyter notebooks to containerized jobs or services. What problem does this solve?

Fregly: The idea for PipelineIO dates back a few years to my time at Netflix where we were forced to build a custom ML prediction/serving layer. There wasn't - and still isn't, in my opinion - a production-ready, fault-tolerant, and low latency open source system to serve Netflix-scale predictions and recommendations in real-time. This idea was further validated while working with many ML and AI customers at both Databricks and the IBM Spark Technology Center. These pipelines all ended at the training step. The last mile to deploy a model to production was cobbled together, hard to maintain, and offered no model performance insight. This is the exact void that we fill with PipelineIO for Apache Spark ML, TensorFlow, Caffe, and Theano AI Models. We've architected the system to extend any ML or AI pipeline in any on-premise or cloud environment that supports Kubernetes and Docker. At the moment, we have reference pipelines and demos for AWS, Google Platform, and Azure originating from either a Jupyter or Zeppelin Notebook. Other features include streaming A/B and Multi-armed Bandit Model Testing as well as continuous model deployment using industry standard, battle-tested open source projects including Spinnaker.io, Kubernetes, and Docker. In true Netflix "Freedom and Responsibility" fashion, every step of the pipeline can be controlled and monitored by the Data Scientist and Data Engineer directly in a NoOps environment from a notebook or command line. While this empowerment may mortify some people with traditional "hand-it-over-the-wall" mentality, we strongly believe that this is the right environment to embrace given our first-hand experiences.

 

InfoQ: In your experience, what drives people to drop off Spark after they've prototyped with it for a while before going all-in with productionizing a Spark-based pipeline? The one-processor-per-job behavior in Spark comes to mind- do you have specific examples you can elaborate on?

Fregly: I honestly haven't seen too many "drop-off" points with Apache Spark. People want Spark to succeed - there's no question. More than once I've talked a customer out of using Spark for a particular use case (ie. using Spark GraphX for real-time, incremental PageRank) - only to find them using Spark for four other use cases.

Here are a few examples of drop-off points broken down by Spark high-level library:

Spark SQL

  • Lack of full ANSI SQL support - Spark 2.0 boasts full ANSI SQL 2003 support
  • Slow performance - usually caused by undersized cluster, undersized number of partitions, or data skew overloading a single partition

Spark Streaming

  • Mini-batch causing high latency
  • Stability for long-running streaming jobs
  • Difficulty reasoning through data flow, delivery guarantees, fault-tolerance, and interactions with external event sources and sinks

Spark ML

  • Model availability -- most Spark ML algos are distributed and take advantage of Spark's distributed runtime capabilities
  • Performance -- Spark ML is a generalized ML framework and not always optimal for very specific use cases

Spark GraphX

  • Batch graph analytics engine is not a replacement for online, incremental graph algorithms and databases like Neo4J and TitanDB
  • Lack of first-class Spark SQL DataFrame support similar to Spark ML (GraphFrames support may come soon)
  • GraphX is notoriously understaffed by the Apache Spark community

InfoQ: The workshop repository has a number of non-trivial services like Apache Spark, Kafka Native, Cassandra, Redis, PrestoDB, AirFlow and MySQL running in the same Docker container derived from the same image. What might you do differently if you were to productionize these services, assuming the same integration points and using Docker?

Fregly: A single Docker Container is convenient for training and demo purposes, but does not scale, of course. We have a production-ready version with each service defined it its own Docker image. Docker 1.10+ drastically improved Docker Container communication. And Docker 1.12+ offers first-class, native support for MacOS and Windows. Additionally, we utilize Spinnaker.io and Kubernetes heavily to orchestrate, deploy, and version our clusters. A lot of these components are falling into place at the right time for us. We're very fortunate to have the Apache Software Foundation, Google, Docker, Netflix, and others contributing game-changing technologies to the open source community. We're truly standing on the shoulders of giants.

InfoQ: Did you discover anything novel or new about Docker behavior while running the stack in a single container? If so, what are they?

Fregly: Nothing we didn't expect from a scalability/configuration standpoint. Obviously, there are many challenges of a production-ready, distributed system versus a single-host system including logging, monitoring, consistency/availability/performance (CAP), configuration, etc. These are well-known - and we've addressed these in the distributed version, of course. Iteratively rebuilding the Docker image has become an overnight type of thing which we didn't expect. The size of the single Docker image has grown enormously as we pack in more and more real-world demos and datasets. We prefer this bloated approach over the piecemeal, download-on-demand approach as it puts the workshop on better rails during execution. Reducing the number of moving parts has allowed us to scale the workshop to 200+ concurrent people - giving us some really large Spark and TensorFlow clusters to play with during the workshop! We recently had an ~800 CPU, ~5TB Spark Cluster crunch through some pretty CPU and Network intensive Spark ML jobs.

InfoQ: Do you foresee the Kafka 0.10 release including ZooKeeper natively as a necessary abstraction that'll help drive Kafka adoption, or would you still like to see ZooKeeper as a separate integration piece and why?

Fregly: ZooKeeper gets a bad rap, unfortunately. And nobody likes moving parts if they can be avoided. Unfortunately, ZooKeeper - or equivalent Paxos or RAFT implementation - is mandatory for any serious distributed system - especially at large scale. You can tinker around with a shared file system or database to manage shared system state for a while, but very quickly, this component will become a bottleneck at any significant uptick in scale. ZooKeeper is actually quite stable - and has been for a while. Unfortunately, systems that rely on ZooKeeper will produce every edge case imaginable under heavy load. And these edge cases expose obscure bugs. Fortunately, because ZooKeeper is so critical, the community responds to bug fixes relatively quickly. A recent blog post by my buddy at PagerDuty, Evan Gilman, highlights one recent notable obscure bug they discovered in 2015. I love this blog post not only because it hit the Hacker News front page (congrats, Evan!), but because it demonstrates the methodical dissection of a bug that involves both high-level ZooKeeper Java user-space code and low-level IPSec/TCP system-space code. Back to the original question! I don't think anything can slow Kafka's adoption. As funny-man data scientist from Slack (formerly Cloudera), Josh Wills, put it, "Kafka is like oxygen. Saying you use Kafka is like saying you use Linux… It's the least interesting thing you can say in San Francisco."

InfoQ: During one of your recent workshops you spoke about using Elasticsearch as a database or analytics engine. What range of use-cases is Elasticsearch an especially good fit for regarding database and analytics functionality?

Fregly: The Datastax/Cassandra folks aren't going to like this statement, but it's the reality: I'm seeing a lot of small-medium scale shops migrate off Cassandra and on to Elasticsearch. I'm not saying Apple or Netflix will stop using Cassandra - this would be absolutely ridiculous. But Elasticsearch is very developer friendly, easy to setup, distributed, and scalable. Most commonly, I see teams that start developing against a datastore like MySQL or Cassandra. Once the initial threads of functionality are stood up, the business quickly asks for full-text or geo-based search functionality. The team starts investigating Elasticsearch and realizes that it's more than just a search engine. They start load testing Elasticsearch and realize there's no need to support both Cassandra and Elasticsearch when Elasticsearch can handle both. It's worth noting that Elasticsearch has a very sophisticated Spark SQL Connector complete with partition-aware data locality, predicate pushdowns, access to TF/IDF document features, etc. I was fortunate enough to have the creator of the Elasticsearch Spark SQL Connector from Elastic, Costin Leau, come speak at my Advanced Spark and TensorFlow Meetup earlier this year right before ElasticCon in San Francisco. Here's the video.

InfoQ: What do you foresee as the biggest challenge to the Spark development community, technically-speaking, and community-wise?

Fregly: I think the Apache Spark Community is doing a good job of listening to user feedback and adapting. I've seen many Spark users get frustrated with just about every aspect of Spark, but they continue to use the tool knowing that things will get better. They're also doing a good job integrating with the most common and productive tools available including Parquet, Avro, Elasticsearch, Cassandra, Redshift, Kafka, Kinesis, etc. I think Spark Streaming will be the key moving forward. It needs a makeover including better day-to-day community support, more stability, better state management, more flexibility (allow event-based in addition to mini-batch), and a more clear and consistent roadmap. I'm very excited that Michael Armbrust is taking more of a lead role on Spark Streaming. I think we'll see some great things moving forward as Streaming, SQL, and ML converge. We're trying to stay in front and clear the path for that innovation.

InfoQ: If there were one theme across all the workshops you've hosted around the PANCAKE STACK where attendees experience either the steepest learning curve, or hurdles to running the demo code, what would it be?

Fregly: The core theme is end-to-end ML pipeline integration from model training and evaluation to model serving/predicting in production. Every workshop is different, but that's the main takeaway for attendees. The repo that backs the workshop contains a complete end-to-end reference ML pipeline for attendees to take home to share with their friends and co-workers. The variance of attendee experience and skillset is very high, yes. This sometimes leads to some negative feedback as I try to address the median of the distribution. We continually use the feedback, of course, to refine the experience within reason. We've accepted that we can't please everyone so we offer a full 100% refund to those that feel it wasn't worth the money. We have no problem doing this as long as the person provides constructive feedback. In the end, we always have fun. This is why we keep doing these workshops. And we're excited to take the workshop on tour to Europe and Asia in a couple months.

About the Interviewee

Chris Fregly is currently a Research Scientist and Founder at PipelineIO. He's also a Netflix OSS Committer, Apache Spark Contributor, organizer for the San Francisco-based, Advanced Spark and TensorFlow Meetup, and author of the upcoming book Advanced Spark. Formerly, he was a Solutions Architect at Databricks, Research Engineer at the IBM Spark Technology Center, and Streaming Data Engineer at Netflix.

Rate this Article

Adoption
Style

BT