Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Goldsky’s Streaming-First Architecture for Blockchain Data with Flink, Redpanda and Kubernetes

Goldsky has built a platform for real-time processing of blockchain data. The platform lets clients extract data from blockchains into their own databases to support product features, without running the data pipeline infrastructure themselves. Goldsky's event-driven architecture (EDA) builds on Apache Flink, Redpanda, Kubernetes, and cloud provider services.

Goldsky’s platform provides blockchain indexing, subgraphs, and data streaming pipelines for developers building dApps (decentralized applications) who may not be versed in data engineering or familiar with key technologies such as Apache Kafka or Apache Flink.

Yaroslav Tkachenko, principal software engineer at Goldsky, talks about data engineering for blockchain applications:

There has been a paradigm shift in the industry recently where people are realizing that you can use data platform technology that was previously used by internal analytics teams to power customer-facing features. The sort of data pipelines that previously only supported reporting and dashboards are now supporting web application functionality.

The architecture of Goldsky's platform consists of control plane and data plane components. The control plane exposes configuration management APIs for setting up data processing pipelines, including blockchain data sources, client database sinks, access credential secrets, and other configuration options. UI and CLI applications use these APIs to let clients configure their pipelines. The data plane executes the configured pipelines: it pulls raw data from source blockchains, transforms it, and inserts it into client data-store sinks.
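The control plane and data plane split can be illustrated with a minimal sketch. Everything here is hypothetical and is not Goldsky's actual API: the `PipelineConfig` fields, the `ControlPlane` class, the `run_pipeline` function, and the lowercase sink identifiers are assumptions chosen only to show how stored configuration might be separated from execution.

```python
from dataclasses import dataclass

# Sink types named in the article; the lowercase identifiers are assumptions.
SUPPORTED_SINKS = {"postgresql", "s3", "elasticsearch", "clickhouse", "rockset", "kafka"}


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical pipeline definition that a control plane might store."""
    name: str
    source: str      # identifier of a blockchain data source
    sink_type: str   # one of SUPPORTED_SINKS
    secret_ref: str  # reference to stored credentials, not the secret itself


class ControlPlane:
    """Validates and stores pipeline configuration; it never touches the data."""

    def __init__(self):
        self._pipelines = {}

    def create_pipeline(self, config):
        if config.sink_type not in SUPPORTED_SINKS:
            raise ValueError(f"unsupported sink type: {config.sink_type}")
        if config.name in self._pipelines:
            raise ValueError(f"pipeline already exists: {config.name}")
        self._pipelines[config.name] = config
        return config

    def get_pipeline(self, name):
        return self._pipelines[name]


def run_pipeline(config, read_source, transform, write_sink):
    """Data plane: pull records from the source, transform them, write to the sink.

    The three callables stand in for real connectors and are assumptions.
    Returns the number of records written.
    """
    written = 0
    for record in read_source(config.source):
        write_sink(config.sink_type, transform(record))
        written += 1
    return written
```

In this sketch a UI or CLI would only ever call `ControlPlane` methods, while `run_pipeline` would be invoked by whatever runtime executes the stored configuration.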

Goldsky’s Streaming Data Architecture (Source: Redpanda Technology Blog)

Goldsky supports two ways of extracting data from blockchain sources. Direct indexing is based on the Ethereum ETL (extract, transform, load) project and works by connecting to blockchain nodes directly and extracting low-level data such as logs and transactions. Subgraphs, on the other hand, rely on processing event telemetry from smart contracts using simple TypeScript applications.
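To give a feel for what extracting logs from a node directly involves, the sketch below builds request payloads for the standard Ethereum JSON-RPC method `eth_getLogs` over batched block ranges. This is not Ethereum ETL's or Goldsky's code; the helper names `block_ranges` and `build_get_logs_request` and the batching scheme are assumptions, and nothing is sent over the network.

```python
def block_ranges(start, end, batch_size):
    """Yield inclusive (from_block, to_block) ranges covering start..end."""
    current = start
    while current <= end:
        upper = min(current + batch_size - 1, end)
        yield current, upper
        current = upper + 1


def build_get_logs_request(from_block, to_block, request_id=1, address=None):
    """Build a JSON-RPC 2.0 payload for eth_getLogs.

    Block numbers are encoded as hex quantities, as the JSON-RPC API expects.
    """
    log_filter = {"fromBlock": hex(from_block), "toBlock": hex(to_block)}
    if address is not None:
        log_filter["address"] = address  # optional contract address filter
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getLogs",
        "params": [log_filter],
    }


# One request per batch of blocks; each payload would be POSTed to a node.
requests_to_send = [
    build_get_logs_request(lo, hi, request_id=i)
    for i, (lo, hi) in enumerate(block_ranges(256, 511, 128), start=1)
]
```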

Goldsky uses Redpanda, a Kafka-compatible message broker written in C++, for direct indexing, with blockchain data serialized using Avro. Redpanda handles both messaging and data storage; its S3-compatible tiered storage allows much longer data retention in a cost-effective way.

The transformation layer leverages Flink SQL, letting customers define custom SQL transformations that perform filtering, projections, complex joins, or aggregations. Flink jobs run on Kubernetes using the Flink Kubernetes Operator. Customers can choose from many pipeline sink types, including PostgreSQL, S3, Elasticsearch, ClickHouse, Rockset, and Apache Kafka.
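As a rough illustration of what such a transformation computes, the sketch below mirrors a simple filter-and-aggregate query in plain Python. The SQL in the comment, the `logs` table, and the `event` and `address` columns are hypothetical and are not taken from Goldsky's schemas; in the real platform the query would run as a Flink job rather than as a Python loop.

```python
from collections import defaultdict

# Hypothetical Flink SQL a customer might write (table and column names assumed):
#
#   SELECT address, COUNT(*) AS transfers
#   FROM logs
#   WHERE event = 'Transfer'
#   GROUP BY address


def count_transfers_by_address(logs):
    """Plain-Python equivalent of the query above, over an iterable of dicts."""
    counts = defaultdict(int)
    for log in logs:
        if log["event"] == "Transfer":   # WHERE: filtering
            counts[log["address"]] += 1  # GROUP BY address with COUNT(*)
    return dict(counts)


sample_logs = [
    {"event": "Transfer", "address": "0xaaa", "block": 100},
    {"event": "Approval", "address": "0xaaa", "block": 100},
    {"event": "Transfer", "address": "0xbbb", "block": 101},
    {"event": "Transfer", "address": "0xaaa", "block": 101},
]
result = count_transfers_by_address(sample_logs)
```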

Tkachenko summarises the benefits of data streaming for blockchain:

Data streaming concepts work extremely well for blockchain data—we’re able to solve challenging problems, such as blockchain reorgs, by modeling them as well-known stream-processing problems like retractions. Building on this further lets us support even more advanced use cases, like enriching on-chain data with off-chain data, reliably calculating Top-N aggregations, and combining data from multiple blockchains together.
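The idea of treating a reorg as a retraction can be sketched in a few lines. The example below assumes a changelog of `(kind, row)` pairs using the `+I` (insert) and `-D` (delete) row kinds from Flink's changelog model; the row shape, the function names, and the sample addresses are hypothetical. When a block is reorganized out of the chain, its rows are retracted and the replacement block's rows are inserted, so a downstream Top-N stays correct.

```python
from collections import Counter

INSERT, DELETE = "+I", "-D"


def apply_changelog(changelog):
    """Fold a changelog into per-address counts, honoring retractions."""
    counts = Counter()
    for kind, row in changelog:
        if kind == INSERT:
            counts[row["address"]] += 1
        elif kind == DELETE:
            counts[row["address"]] -= 1
            if counts[row["address"]] == 0:
                del counts[row["address"]]  # fully retracted, drop the key
        else:
            raise ValueError(f"unknown row kind: {kind}")
    return counts


def top_n(counts, n):
    """Return the n highest counts, ties broken by address for determinism."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


changelog = [
    (INSERT, {"block": 100, "address": "0xaaa"}),
    (INSERT, {"block": 101, "address": "0xaaa"}),
    (INSERT, {"block": 101, "address": "0xbbb"}),
    # Reorg: the original block 101 is dropped, so its rows are retracted...
    (DELETE, {"block": 101, "address": "0xaaa"}),
    (DELETE, {"block": 101, "address": "0xbbb"}),
    # ...and the rows of the replacement block 101 are inserted.
    (INSERT, {"block": 101, "address": "0xccc"}),
    (INSERT, {"block": 101, "address": "0xccc"}),
]
leaders = top_n(apply_changelog(changelog), 2)
```

After the reorg is applied, `leaders` ranks `0xccc` first with two rows and `0xaaa` second with one, and `0xbbb` no longer appears at all.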
