
Real-time Data Analytics at Pinterest using MemSQL and Spark Streaming

by Srini Penchikala on Mar 29, 2015. Estimated reading time: 1 minute

Pinterest, the company behind the visual bookmarking tool that helps you discover and save creative ideas, is using real-time data analytics to support data-driven decision making. It is experimenting with MemSQL and Spark to analyze user engagement in real time.

Using MemSQL and Spark, Pinterest built a data pipeline that ingests event data through Apache Kafka, processes it in Spark via the Spark Streaming API, and stores the results in MemSQL. This solution provides insight into how users around the globe are engaging with Pins in real time. That insight helps Pinterest become a better recommendation engine for showing related Pins as people use the service for purposes such as planning products to buy, places to go, and recipes to cook.

The Pin engagement data is fed into a Kafka topic, which is then consumed by a Spark Streaming job. In this job, each Pin is filtered and then enriched with geolocation and Pin category information. The enriched data is persisted to a MemSQL database using the MemSQL Spark Connector and is made available for query serving. The connector provides tools for reading from and writing to a MemSQL database using Spark, and it uses MemSQLRDD to read data out of the database.
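The filter-and-enrich step described above can be sketched in plain Python as the kind of per-record functions a streaming job might apply to each micro-batch. This is only an illustration, not Pinterest's code: the event fields (`pin_id`, `action`, `ip`), the lookup tables, and every function name are assumptions, and no Spark or MemSQL API is used.

```python
from typing import Optional

# Hypothetical lookup tables; the article does not describe the real sources.
GEO_BY_IP_PREFIX = {"203.0.113.": "AU", "198.51.100.": "US"}
CATEGORY_BY_PIN = {"pin-1": "recipes", "pin-2": "travel"}


def is_valid(event: dict) -> bool:
    """Filter step: keep only events that carry a pin id and an action."""
    return bool(event.get("pin_id")) and bool(event.get("action"))


def lookup_country(ip: str) -> Optional[str]:
    """Assumed geolocation lookup by IP prefix (illustrative only)."""
    for prefix, country in GEO_BY_IP_PREFIX.items():
        if ip.startswith(prefix):
            return country
    return None


def enrich(event: dict) -> dict:
    """Enrich step: return a copy with geolocation and category added."""
    enriched = dict(event)
    enriched["country"] = lookup_country(event.get("ip", ""))
    enriched["category"] = CATEGORY_BY_PIN.get(event["pin_id"], "unknown")
    return enriched


def process_batch(events: list) -> list:
    """Apply the filter, then the enrichment, to one batch of events."""
    return [enrich(e) for e in events if is_valid(e)]
```

In a real deployment the output of `process_batch` would be the records handed to the connector for writing to the database; here it is simply a list of dictionaries.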

This solution supports an infrastructure that collects, stores and processes user engagement data in real time. It provides the following capabilities:

•    High performance event logging using an agent called Singer to collect event logs and ship them to a centralized repository.
•    Reliable log transport and storage using Apache Kafka and a log persistence service called Secor, which reliably writes these events to Amazon S3, the long-term data store. Secor is designed to work around S3's weak eventual consistency model without losing data. It also scales horizontally and can optionally partition data by date.
•    Fast query execution on real-time data, so that SQL queries can run against events as they arrive.
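To make the last capability concrete, here is a minimal sketch of the kind of SQL query that could be run over enriched engagement events. It uses Python's built-in `sqlite3` module purely as a stand-in so the example is self-contained; it is not MemSQL, and the table name, columns, and sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the real-time store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pin_engagement ("
    " pin_id TEXT, action TEXT, country TEXT, category TEXT, ts INTEGER)"
)

# Hypothetical enriched events: (pin_id, action, country, category, timestamp).
rows = [
    ("pin-1", "repin", "US", "recipes", 100),
    ("pin-2", "click", "AU", "travel", 110),
    ("pin-3", "repin", "US", "recipes", 120),
    ("pin-4", "click", "JP", "travel", 50),
]
conn.executemany("INSERT INTO pin_engagement VALUES (?, ?, ?, ?, ?)", rows)


def top_categories(connection, since_ts):
    """Count events per category at or after since_ts, busiest first."""
    cursor = connection.execute(
        "SELECT category, COUNT(*) AS events FROM pin_engagement "
        "WHERE ts >= ? GROUP BY category "
        "ORDER BY events DESC, category ASC",
        (since_ts,),
    )
    return cursor.fetchall()
```

Calling `top_categories(conn, 100)` restricts the aggregation to recent events, which mirrors the idea of querying data shortly after it arrives.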

 
