Fast Data with Dean Wampler

In this podcast, Dean Wampler discusses fast data, streaming, microservices, and the paradox of choice when it comes to the options available today for building data pipelines.

Key Takeaways

  • Apache Beam is fast becoming the de-facto standard API for stream processing
  • Spark is great for batch processing, but Flink is tackling the low-latency streaming processing market
  • Avoid running blocking REST calls from within a stream processing system - have them asynchronously launched and communicate over Kafka queues
  • Visibility into telemetry of streaming processing systems is still a new field and under active development
  • The fast data platform is easily launched on an existing or new Mesosphere DC/OS runtime

What got you from nuclear physics into big data?

  • 2:05 It was a long circuitous arc - today, a lot of people who come from that background (often physics in general) jump into the big data world or data science.
  • 2:30 You learn a lot about probabilities and statistics with physics anyway, and about not being intimidated by big problems that are hard to solve - so it’s a pretty natural transition.
  • 2:40 When I left, the data science concept wasn’t fully formed, and I had been writing software in part because I was doing a lot of programming to do calculations for my work.
  • 3:05 When the big data concept kicked off, it seemed like a natural fit for my maths skills and was an interesting space to get into.

What is fast data engineering at Lightbend?

  • 3:30 It’s a made up title - I might be the only VP of fast data engineering in the world - but I’m focussing on building stream-oriented tooling for people in the big data world.
  • 4:00 Increasingly people need answers quicker - you can’t just capture that data and process it later, but you need to respond at scale.
  • 4:30 It’s not only a competitive advantage to move from batch processing to stream-oriented processing, but sometimes you have no choice.

What is the Lightbend fast data platform?

  • 5:05 If you know Hadoop, you have a curated distribution of a bunch of tools, each of which plays a particular role.
  • 5:20 But you tend to go with a commercial distribution (like Cloudera, or Hortonworks) because you get support.
  • 5:30 The analogy here is that we’re doing something similar, but for stream processing.
  • 5:35 It bundles tools like Kafka, Spark and Spark Streaming - but we also provide HDFS because people do want to use this platform for batch processing as well.
  • 5:50 We’re seeing a convergence of architectures where people need a lot more services in a streaming context.
  • 6:00 Lightbend’s historic strength is in micro-service development with a reactive platform using tools like Akka, Play and Lagom and of course Scala as well.

How do Flink and Spark fit in?

  • 6:20 I’m concerned with the paradox of choice - if you go into a shop to buy a new refrigerator then you panic because you don’t know which one to buy - so I hesitate to give too many choices.
  • 6:40 If you’d asked me a year or two ago, I wouldn’t have advised these, but we’ve since realised that Flink fills a couple of important niches.
  • 7:00 Obviously Spark is older and much more established, but the streaming support is newer.
  • 7:10 The initial implementation was a bit of a hack - by running micro-batches, you can gain the appearance of streaming support.
  • 7:20 If you need low latency - below 10ms, for example - Spark isn’t there yet.
  • 7:30 Flink started as a stream processing engine and so has been better at these low-latency problems, even though Spark is more mature.
  • 7:40 The other thing that Flink is doing is the state of the art for streaming semantics - very accurate calculations of data in flight.
  • 7:55 The people who have been thinking about this in detail are Google and the Apache Beam team - Beam was open-sourced from Dataflow - and they have been defining what streaming should be, and Flink is the closest tool for those data flows.
  • 8:10 The way Beam works is you define a dataflow in Beam, and you provide a runner that materialises it in your cluster, and that’s what Flink is doing for us.
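
The split between defining a dataflow and materialising it with a runner can be illustrated with a small toy. This is only a sketch of the idea in plain Python: `Dataflow`, `PerEventRunner` and `MicroBatchRunner` are hypothetical names invented here, and none of this is the Apache Beam API.

```python
from typing import Callable, Iterable, List, Tuple


class Dataflow:
    """Engine-independent description of a pipeline (hypothetical toy type)."""

    def __init__(self) -> None:
        self.steps: List[Tuple[str, Callable]] = []

    def map(self, fn: Callable) -> "Dataflow":
        self.steps.append(("map", fn))
        return self

    def filter(self, pred: Callable) -> "Dataflow":
        self.steps.append(("filter", pred))
        return self


class PerEventRunner:
    """Materialises the dataflow one event at a time."""

    def run(self, flow: Dataflow, source: Iterable) -> list:
        out = []
        for event in source:
            keep = True
            for kind, fn in flow.steps:
                if kind == "map":
                    event = fn(event)
                elif not fn(event):
                    keep = False
                    break
            if keep:
                out.append(event)
        return out


class MicroBatchRunner:
    """Materialises the same dataflow over fixed-size batches."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size

    def run(self, flow: Dataflow, source: Iterable) -> list:
        items = list(source)
        out = []
        for start in range(0, len(items), self.batch_size):
            batch = items[start:start + self.batch_size]
            for kind, fn in flow.steps:
                if kind == "map":
                    batch = [fn(e) for e in batch]
                else:
                    batch = [e for e in batch if fn(e)]
            out.extend(batch)
        return out


# The pipeline is defined once and handed to either runner.
flow = Dataflow().map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
events = range(1, 10)
per_event = PerEventRunner().run(flow, events)
micro_batch = MicroBatchRunner(batch_size=4).run(flow, events)
print(per_event, micro_batch)
```

Both runners produce the same results from the same definition; what differs is how the work is scheduled, which is where the latency trade-off between per-event and micro-batch engines comes from.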

Does the platform use Beam and the micro-runners so that you can use micro-batch as well as a stream?

  • 8:45 An emerging possible platform is to use Beam as the uber-API, in the same way that SQL is an uber-API for databases, which can be provided by different platforms and providers.
  • 9:00 We’re looking at that seriously - it’s not something that we’re providing out of the box yet, because it’s a bit early - but we’re thinking about it.

Are you using Spark to train a model, and then Flink to score it?

  • 9:30 That’s one possibility - but I don’t expect most people to run all four parts of our platform.
  • 9:45 However, for large organisations who have a diverse workset, you might be using them together.
  • 10:00 People have data science teams, who are building new models (they don’t necessarily have to be neural network models).
  • 10:10 The question is how they then deploy that model into production environment at scale.
  • 10:20 One possibility is to use an engine like Spark to do periodic training - for example, you need to update the data every minute.
  • 10:30 Spark is a great choice for that because it can handle whatever load and is aimed at micro-batches such as these training sessions.
  • 10:40 I can then use Flink (or Akka streams or Kafka streams) for low-latency scoring with whatever the latest model is.
  • 10:50 We’re thinking how we can share such models in that deployment environment.
  • 10:55 The fast data platform is going to come with sample data applications to show you how to do these things.
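
The pattern of periodic retraining plus low-latency scoring against the latest model might look roughly like the following. It is a minimal single-process sketch: `Model`, `ModelStore` and `train` are hypothetical names, the "model" is just a mean and a spread, and in a real deployment the training and scoring sides would run in separate engines.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


@dataclass(frozen=True)
class Model:
    """A deliberately trivial model: centre and spread of recent values."""
    version: int
    center: float
    spread: float

    def score(self, value: float) -> float:
        """Distance from the centre, in units of spread."""
        if self.spread == 0:
            return 0.0
        return abs(value - self.center) / self.spread


class ModelStore:
    """Holds whichever model was published most recently (hypothetical)."""

    def __init__(self) -> None:
        self._latest: Optional[Model] = None

    def publish(self, model: Model) -> None:
        self._latest = model

    def latest(self) -> Optional[Model]:
        return self._latest


def train(batch: List[float], version: int) -> Model:
    """Stands in for the periodic batch training job."""
    return Model(version, mean(batch), pstdev(batch))


store = ModelStore()

# First training run, then score one incoming record.
store.publish(train([10.0, 12.0, 11.0, 9.0], version=1))
first = store.latest().score(20.0)

# A later training run replaces the model; scoring picks it up immediately.
store.publish(train([20.0, 22.0, 21.0, 19.0], version=2))
second = store.latest().score(20.0)

print(first, second)
```

The scoring path only ever reads the current model, so it stays cheap regardless of how long training takes.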

You mentioned the paradox of choice - when does it make sense to use which product?

  • 11:40 Stream is a very general term - it’s a buzzword at the moment.
  • 11:55 The right way to approach these problems is to think about the criteria that are going to force you to choose one over another.
  • 12:05 If you’re trying to do something very low latency, then using Spark would not be a great choice.
  • 12:30 If you’re trying to score models on the lookout for fraud, for example, then you would want something with low latency.
  • 12:35 You then have to ask what volume you’re expecting - if I’m trying to score every tweet from the Twitter firehose, then I have to use something that’s good at sharding data across a big cluster, like Flink.
  • 12:50 If I had to do this at low latency and high volume, then that might push me towards Flink.
  • 13:00 If it’s not enormous volume, or I can break the pipeline into a number of channels (such as Kafka topics) then maybe I have a bunch of micro-services that are running, but they don’t need to shard their data - that would be a good fit for Akka streams or Kafka streams.
  • 13:25 So you need to consider the requirements and what the SLAs are, but also know what can be run or can be managed.
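
Sharding data across a cluster usually comes down to routing each record by a key so that the same key always lands on the same worker. The sketch below assumes a simple hash-modulo scheme with hypothetical helper names (`shard_for`, `partition`); real engines use their own partitioners.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shard_for(key: str, num_shards: int) -> int:
    """Stable shard assignment: the same key always maps to the same shard."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def partition(events: Iterable[Tuple[str, str]],
              num_shards: int) -> Dict[int, List[Tuple[str, str]]]:
    """Group (key, payload) events by the shard their key hashes to."""
    shards: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for key, payload in events:
        shards[shard_for(key, num_shards)].append((key, payload))
    return dict(shards)


tweets = [
    ("alice", "first tweet"),
    ("bob", "hello"),
    ("alice", "second tweet"),
    ("carol", "hi"),
    ("bob", "again"),
]
shards = partition(tweets, num_shards=3)
for shard_id, items in sorted(shards.items()):
    print(shard_id, items)
```

A pipeline that is already split across channels such as Kafka topics gets a similar effect without the processing engine having to shard anything itself.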

What is happening with Beam as a standard API today?

  • 13:55 Beam is a natural API to be an abstraction layer across the streaming engines that I have mentioned.
  • 14:10 For example, SQL has suddenly become popular again - both Spark and Flink offer a SQL-like API, and everyone knows SQL, so why not use that as the scripting language for stream processing?
  • 14:30 It’s a 90% solution - it doesn’t do everything but it can be a good starting point.
  • 14:40 That’s emerging as the alternative for the core business logic - the alternative domain specific language for streaming.
  • 14:50 If you need something more flexible or concrete, that’s where you’ll see these other APIs shine.
  • 15:05 Beam is a very good example of this, because it has a lot of concepts that are real-world centric.
  • 15:15 For example, if you have an accounting system, and you want to know how much of every item in the store you sold per hour; you don’t know when you’ll have all of the data.
  • 15:45 It takes a while for every network packet to arrive at your machine in your network cluster.
  • 15:50 It might only be a few seconds at worst in the general case - but what if there’s a cut in the network cable somewhere?
  • 16:00 How do I deal with late or delayed packets - do I have a means of doing an estimate, and then be able to follow up with corrections later if necessary?
  • 16:10 These are the kinds of questions that the Beam guys have thought about, and their streaming API has lots of mechanisms for defining these kinds of things, which are typically not seen in SQL.
  • 16:35 The main limitation in the Beam API is that it is oriented to windows of events.
  • 16:50 If you’re doing an old-school Extract, Transform and Load process, then something like Beam is overkill for that purpose - just use a micro-service that takes one at a time and then moves it downstream.
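
The hourly sales example, with an early estimate followed by corrections for late data, can be sketched as below. This is an illustration in plain Python under stated assumptions, not Beam code: `HourlyCounter`, the `allowed_lateness` parameter and the output tuples are hypothetical, and the watermark is advanced by hand.

```python
from collections import defaultdict

WINDOW = 3600  # one-hour tumbling windows, in seconds of event time


class HourlyCounter:
    """Per-item hourly totals with estimates, corrections and dropped data."""

    def __init__(self, allowed_lateness: int) -> None:
        self.allowed_lateness = allowed_lateness
        self.totals = defaultdict(int)  # (window_start, item) -> quantity
        self.closed = set()             # window starts already reported
        self.watermark = 0              # "no older data expected" marker
        self.output = []                # (kind, window_start, item, value)

    def on_event(self, event_time: int, item: str, qty: int) -> None:
        start = event_time - event_time % WINDOW
        if start + WINDOW + self.allowed_lateness <= self.watermark:
            # Too late to be worth correcting: record that it was dropped.
            self.output.append(("dropped", start, item, qty))
            return
        self.totals[(start, item)] += qty
        if start in self.closed:
            # The window was already reported, so send an updated total.
            self.output.append(
                ("correction", start, item, self.totals[(start, item)]))

    def advance_watermark(self, watermark: int) -> None:
        self.watermark = max(self.watermark, watermark)
        newly_closed = sorted({
            start for (start, _item) in self.totals
            if start + WINDOW <= self.watermark and start not in self.closed
        })
        for start in newly_closed:
            for (s, item), total in sorted(self.totals.items()):
                if s == start:
                    self.output.append(("estimate", start, item, total))
            self.closed.add(start)


counter = HourlyCounter(allowed_lateness=600)
counter.on_event(100, "apple", 2)
counter.on_event(200, "apple", 3)
counter.advance_watermark(3700)    # first hour is reported as an estimate
counter.on_event(3500, "apple", 1) # late but within allowed lateness
counter.advance_watermark(5000)
counter.on_event(3000, "apple", 4) # too late, dropped
for record in counter.output:
    print(record)
```

The choice of `allowed_lateness` is the trade-off discussed above: a longer allowance means more accurate totals but more state kept around and more corrections sent downstream.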

How can you join multiple streams together?

  • 17:30 Normally when you think of doing a join, you have some kind of window concept - whether it’s time, or sessions, or something like that.
  • 18:10 If you go to the Apache Beam website you will see a table where you see the runners under development [https://beam.apache.org/documentation/runners/capability-matrix/] and the features that they support.
  • 18:30 When you pick a runner, you’ll have to find out what it can and cannot do.
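
A window-based join can be sketched by bucketing both streams by window and key and pairing what falls into the same bucket. The function name `window_join`, the event shape `(timestamp, key, value)` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import List, Tuple

Event = Tuple[int, str, str]  # (timestamp in seconds, key, value)


def window_join(left: List[Event], right: List[Event],
                window_size: int) -> List[Tuple[int, str, str, str]]:
    """Inner-join two streams on key within the same tumbling window."""
    buckets = defaultdict(lambda: ([], []))
    for ts, key, value in left:
        buckets[(ts // window_size, key)][0].append(value)
    for ts, key, value in right:
        buckets[(ts // window_size, key)][1].append(value)

    joined = []
    for (window, key), (left_vals, right_vals) in sorted(buckets.items()):
        for lv in left_vals:
            for rv in right_vals:
                joined.append((window, key, lv, rv))
    return joined


clicks = [(5, "u1", "click-a"), (65, "u1", "click-b"), (70, "u2", "click-c")]
purchases = [(20, "u1", "buy-x"), (80, "u2", "buy-y"), (130, "u1", "buy-z")]
result = window_join(clicks, purchases, window_size=60)
print(result)
```

Events for the same key that fall into different windows do not match here, which is why the choice of window (time, session, or something else) determines what a streaming join actually returns.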

Coming back to the fast data platform - it supports micro-services, queues and streams?

  • 19:40 If you think about the problems faced by Hadoop ecosystem with mostly batch processing - you didn’t think about running batch processes for weeks or months.
  • 20:00 Maybe the nightly jobs run for a few hours, but then they’re over.
  • 20:05 The advantage was that the sizing that the job had to have was done at the beginning - but if you switched to a continuous stream, then you’re deploying something for weeks or months.
  • 20:30 If you’re running something for long enough, then you’re going to run into obscure problems eventually, like hardware failures, network partitions, big spikes of traffic and so on.
  • 20:40 The streaming processors now start to look a lot more like the kinds of services you might find in a micro-service world, that know how to scale, deal with failure, and so on.
  • 20:50 That forced the architectures to look more like micro-services architectures - and at the same time, we’re writing more micro-services to bring data capabilities into the pipeline.
  • 21:20 We like Kafka as a backplane for data, because it gives you the nice capturing and durability (multi-producer, multi-consumer).
  • 21:35 It could be that the data is coming from a micro-service, or it could be coming from a network socket (like the Twitter firehose).
  • 21:45 Another example is a bunch of micro-services that are getting REST requests from the stream processor to score a batch of data using a model.
  • 22:10 There are ways you can see these in a data pipeline now.

Micro-services can scale horizontally - but it sounds like the streaming processing scales vertically?

  • 22:50 You would not want to do a lot of blocking REST calls from your streaming processor, because that could slow down and become a bottleneck. 
  • 23:00 Maybe what you do is fire off requests asynchronously, like in the scoring requests previously, but the answer gets written into a Kafka topic that is subscribed to by another micro-service that sends the events out.
  • 23:30 I like the Actor model - most people either love or hate the Actor model, and most people who hate it don’t like the fact that it is loosely typed.
  • 23:45 If you think of them as a lightweight processor model - like micro-services - then it’s not a big deal.
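
The non-blocking pattern described above can be sketched with `asyncio` queues standing in for Kafka topics. Everything here is a simplified assumption: the coroutine names, the `None` end-of-stream marker and the placeholder score are invented for the example.

```python
import asyncio


async def stream_processor(events, requests: asyncio.Queue) -> None:
    """Publish scoring requests and move on without waiting for answers."""
    for event in events:
        await requests.put(event)
    await requests.put(None)  # end-of-stream marker used only in this sketch


async def scoring_service(requests: asyncio.Queue,
                          results: asyncio.Queue) -> None:
    """Consume requests and write answers to a separate results queue."""
    while True:
        event = await requests.get()
        if event is None:
            await results.put(None)
            return
        await asyncio.sleep(0)  # stands in for slow model inference
        await results.put((event, len(event)))  # placeholder "score"


async def downstream(results: asyncio.Queue) -> list:
    """A separate consumer picks up the answers and sends them onwards."""
    seen = []
    while True:
        item = await results.get()
        if item is None:
            return seen
        seen.append(item)


async def main(events) -> list:
    requests: asyncio.Queue = asyncio.Queue()
    results: asyncio.Queue = asyncio.Queue()
    _, _, seen = await asyncio.gather(
        stream_processor(events, requests),
        scoring_service(requests, results),
        downstream(results),
    )
    return seen


scored = asyncio.run(main(["tx-1", "tx-22", "tx-333"]))
print(scored)
```

The stream processor never waits on the scoring service, so a slow model delays the results queue rather than stalling the stream.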

There’s a lot of discussion of meshes and fabrics - what are the capabilities of the fast data platform for observability?

  • 24:20 We do anticipate that a lot of these stream processing frameworks will look a lot like that which you’d deploy in a fabric.
  • 24:40 A lot of times when you do stream processing, you’re only working with a small amount of data at any one time (unless you’re doing big joins).
  • 24:50 I’m not looking at terabytes of data per second, in the same league as doing an overnight batch.
  • 25:20 That might be one reason why you go to fabrics to do these kinds of streaming jobs.
  • 25:30 As far as understanding what’s going on, we’re thinking about the obvious need to see what’s going on at runtime.
  • 25:40 We bought a monitoring company this year called OpsClarity that does a nice job of correlating problems in Spark with problems in Kafka.
  • 26:00 That’s at a coarse grained level, and it’s not giving you very low level visibility about what your job is doing versus what Kafka is doing.
  • 26:05 This platform is going GA soon and will be a 1.0 version without much of this visibility, but we aim to raise the abstraction level so that you can see what your job is doing, and to drill down on the services later as you need it.
  • 26:30 We’re also looking at how you deploy things in a fine-grained way.
  • 26:35 I can’t tell you what serverless looks like in this context, but that’s something we have to take seriously.

What capabilities does the OpsClarity product have?

  • 26:50 They’re doing some anomaly detection, and they have some other things on their roadmap.
  • 27:00 In general, we have ambitions to push that as far as we can so that we can minimise the amount of manual work involved in doing this.

This is all an open-source stack; what does the fast data platform do on top of this?

  • 27:30 In the same way that if you license Cloudera, you’re getting their expertise and their managed and monitoring capabilities around open-source software - it’s very similar.

What does the getting started experience look like?

  • 27:45 We standardised on Mesosphere DC/OS, because it gives a lot of the enterprise capabilities on top of a next-generation clustering engine which can run anything, as opposed to Hadoop/Yarn which can run Spark jobs but doesn’t know about HDFS, for example.
  • 28:05 We picked that clustering infrastructure - and are now targeting Kubernetes, so if you’ve already got that you can slot right in.
  • 28:15 The way it would work for the fast data platform is that you would decide on your clustering configuration - we’d give you some help with that - stand it up, whether on-premise or in a cloud environment, install Mesosphere DC/OS, and then bootstrap a manager application into DC/OS which would then install our components.
  • 28:40 Once you have DC/OS running, it’s a fairly quick process to get the fast data platform up and running.
