Real-time Data Analytics at Pinterest using MemSQL and Spark Streaming

Pinterest, the company behind the visual bookmarking tool that helps people discover and save creative ideas, is using real-time data analytics to support data-driven decision making. It is experimenting with MemSQL and Spark to analyze user engagement in real time.

Using MemSQL and Spark, Pinterest built a data pipeline that ingests event data through Apache Kafka, processes it in Spark via the Spark Streaming API, and stores the results in MemSQL. This solution provides real-time insight into how users around the globe are engaging with Pins. That insight helps Pinterest make better recommendations when showing related Pins as people use the service for purposes such as planning products to buy, places to go, and recipes to cook.

The Pin engagement data is fed into a Kafka topic, which is then consumed by a Spark Streaming job. In this job, each Pin is filtered and then enriched with geolocation and Pin category information. The enriched data is persisted to a MemSQL database using the MemSQL Spark Connector and made available for query serving. The connector provides tools for reading from and writing to a MemSQL database using Spark, and it uses MemSQLRDD to read data out of the database.
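
The filter-and-enrich step described above can be sketched in plain Python. This is only an illustration of the idea, not Pinterest's code and not the Spark or MemSQL Spark Connector API: the event fields (`pin_id`, `ip`, `action`), the lookup tables, and every function name are assumptions made up for the example.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical lookup tables (assumptions for this sketch). A real job would
# call a geolocation service and a Pin metadata store instead.
COUNTRY_BY_IP_PREFIX = {"203.0.113.": "AU", "198.51.100.": "US"}
CATEGORY_BY_PIN = {"pin-1": "recipes", "pin-2": "travel"}
KNOWN_ACTIONS = {"repin", "click", "like"}


def is_valid(event: dict) -> bool:
    """Filter step: keep only events that carry the fields we need."""
    return (
        bool(event.get("pin_id"))
        and bool(event.get("ip"))
        and event.get("action") in KNOWN_ACTIONS
    )


def lookup_country(ip: str) -> Optional[str]:
    """Toy geolocation: match the IP against the hypothetical prefix table."""
    for prefix, country in COUNTRY_BY_IP_PREFIX.items():
        if ip.startswith(prefix):
            return country
    return None


def enrich(event: dict) -> dict:
    """Enrich step: add geolocation and Pin category to a copy of the event."""
    enriched = dict(event)
    enriched["country"] = lookup_country(event["ip"])
    enriched["category"] = CATEGORY_BY_PIN.get(event["pin_id"], "unknown")
    return enriched


def process(events: Iterable[dict]) -> Iterator[dict]:
    """Filter, then enrich, each incoming event (one batch of the stream)."""
    for event in events:
        if is_valid(event):
            yield enrich(event)


if __name__ == "__main__":
    batch = [
        {"pin_id": "pin-1", "ip": "203.0.113.7", "action": "repin"},
        {"pin_id": "", "ip": "198.51.100.1", "action": "click"},  # dropped
    ]
    for row in process(batch):
        print(row)
```

In a streaming job, a function like `process` would be applied to each micro-batch read from the Kafka topic, and the enriched rows would then be written to the database.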

This solution supports an infrastructure that collects, stores, and processes user engagement data in real time. It also provides the following capabilities:

•    High-performance event logging, using an agent called Singer to collect event logs and ship them to a centralized repository.
•    Reliable log transport and storage, using Apache Kafka and a log persistence service called Secor that writes these events to Amazon S3, the long-term data store. Secor is designed to work around S3's eventual consistency model without losing data; it also scales horizontally and can optionally partition data by date.
•    Fast query execution on real-time data, allowing SQL queries to run on events as they arrive.
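
To give a feel for the kind of SQL query the last capability refers to, here is a minimal sketch that counts engagement events per country and category. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for MemSQL; the `pin_events` table, its columns, and the `top_categories` helper are assumptions invented for this example, not the schema Pinterest uses.

```python
import sqlite3


def top_categories(rows, limit=3):
    """Return (country, category, event_count) tuples, most active first.

    `rows` is an iterable of (pin_id, country, category, action) tuples,
    a hypothetical shape for enriched engagement events.
    """
    conn = sqlite3.connect(":memory:")  # stand-in for the real-time store
    try:
        conn.execute(
            "CREATE TABLE pin_events "
            "(pin_id TEXT, country TEXT, category TEXT, action TEXT)"
        )
        conn.executemany("INSERT INTO pin_events VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT country, category, COUNT(*) AS events "
            "FROM pin_events "
            "GROUP BY country, category "
            "ORDER BY events DESC, country, category "
            "LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("pin-1", "US", "recipes", "repin"),
        ("pin-2", "US", "recipes", "click"),
        ("pin-3", "AU", "travel", "repin"),
    ]
    for country, category, events in top_categories(sample):
        print(country, category, events)
```

The same `GROUP BY` aggregation, run against a store that is being written to continuously, is what lets a dashboard reflect engagement as events arrive.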
