Spark Summit EU Highlights: TensorFlow, Structured Streaming and GPU Hardware Acceleration

by Alexandre Rodrigues on Nov 13, 2016

Apache Spark integration with the deep learning library TensorFlow, online learning using Structured Streaming, and GPU hardware acceleration were the highlights of Spark Summit EU 2016, held last week in Brussels.

The first day featured a walkthrough of the innovations introduced by Spark 2.0. The API was simplified to a single interface for DataFrames and Datasets, making it easier to develop big data applications. The second generation of the Tungsten engine moves processing closer to the hardware by applying ideas from MPP databases to data processing queries: the generated bytecode keeps intermediate data in CPU registers, and in-memory data is stored in a space-efficient column-oriented format.

Regardless of the API used, the graph of data operations is optimized by the Catalyst Optimizer, which generates the execution plan for the computations across the cluster as well as optimized bytecode for each operation.

Structured Streaming, a new high-level API for streaming released as alpha, was also covered at the conference. The API is integrated into Spark’s Dataset and DataFrame APIs and lets developers express reads from and writes to external systems in much the same way as they would with Spark’s batch APIs. It provides strong consistency guarantees by compiling the streaming computation as a batch computation, and it allows transactional integration with storage systems (such as HDFS and AWS S3).
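The idea of treating a streaming computation as a batch computation can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark code: the names `batch_count` and `IncrementalCount` and the sample events are hypothetical, chosen only to show that a query updated one micro-batch at a time can end in the same result as the same query run once over the complete data.

```python
from collections import Counter
from typing import Iterable, List


def batch_count(events: Iterable[str]) -> Counter:
    """Batch view: count every event in a complete, bounded dataset."""
    return Counter(events)


class IncrementalCount:
    """Streaming view: the same count, updated one micro-batch at a time.

    Hypothetical helper for illustration only; this is not a Spark API.
    """

    def __init__(self) -> None:
        self.state: Counter = Counter()

    def process(self, micro_batch: List[str]) -> Counter:
        # Fold the newly arrived events into the running result.
        self.state.update(micro_batch)
        return self.state


# Made-up events arriving in three micro-batches.
micro_batches = [["click", "view"], ["view", "view"], ["click"]]

stream = IncrementalCount()
for micro_batch in micro_batches:
    stream.process(micro_batch)

# The incremental result matches the batch result over all the data.
all_events = [event for micro_batch in micro_batches for event in micro_batch]
assert stream.state == batch_count(all_events)
```

In this sketch the consistency property is simply that the running state after the last micro-batch equals the batch answer over everything seen so far.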

On the second day, Databricks CEO Ali Ghodsi portrayed Spark as a tool to democratize AI by facilitating data preparation for ML algorithms and the management of computation infrastructure. Earlier this year, the deep learning library TensorFlow was integrated with Spark through a library called TensorFrames, which allows data to be passed between DataFrames and the TensorFlow runtime.

The data science track had a session on how Structured Streaming brings resilience to machine learning and enables online learning: it will be possible to update some machine learning models with data as it arrives, rather than training the model in an offline batch job.
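To make the contrast with offline batch training concrete, here is a minimal plain-Python sketch of online learning. It is not the method presented in the session and uses no Spark API: the class `OnlineLinearRegression`, the learning rate, and the synthetic data (points on the line y = 2x + 1) are all assumptions made for illustration. The model is a linear regression updated by stochastic gradient descent each time a new micro-batch arrives.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (features, target)


class OnlineLinearRegression:
    """Linear model updated incrementally with stochastic gradient descent.

    Hypothetical class for illustration only; not part of Spark or MLlib.
    """

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def update(self, micro_batch: Sequence[Example]) -> None:
        """Adjust the model using only the newly arrived examples."""
        for x, y in micro_batch:
            error = self.predict(x) - y
            # Gradient step for the squared error of this single example.
            self.weights = [
                w - self.learning_rate * error * xi
                for w, xi in zip(self.weights, x)
            ]
            self.bias -= self.learning_rate * error


# Synthetic, noise-free data on the line y = 2x + 1 (an assumption).
def make_batch() -> List[Example]:
    return [([i / 10], 2.0 * (i / 10) + 1.0) for i in range(11)]


model = OnlineLinearRegression(n_features=1)
for _ in range(1000):  # each iteration stands in for one arriving micro-batch
    model.update(make_batch())
```

The model never needs the full history: each call to `update` sees only the latest micro-batch, and on this synthetic data the weight and bias approach 2 and 1.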

The last highlight was the announcement of GPU support on the Databricks platform and the integration of more deep learning libraries. GPU support relies on libraries such as CUDA, and having them pre-built in Databricks is said to lower the cluster setup cost.
