Canva evaluated different messaging solutions for its Product Analytics Platform, including the combination of AWS SNS and SQS, Amazon MSK, and Amazon KDS, and eventually chose the latter, primarily based on its much lower cost. The company compared the solutions across many aspects, including performance, maintenance effort, and cost.
Canva processes around 25 billion product analytics events per day to power many user-facing features, such as personalization, recommendations, and usage statistics and insights. The captured data is also key to supporting A/B testing of new product features.
The data pipeline that collects and distributes product analytics events needs to support not only very high throughput but also high availability (99.999% uptime), while being cost-effective, reliable, and user-friendly. The team responsible for delivering the event-driven architecture (EDA) for product analytics used the combination of AWS SQS and SNS in the early stages of the MVP. These services were easy to set up and provided excellent resiliency and scalability, but they accounted for 80% of the cost of running the architecture.
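For illustration, a minimal sketch of that MVP-style publish path, assuming the AWS SDK for Java v2; the topic ARN and event payload are hypothetical placeholders, not Canva's actual configuration:

```java
// Sketch of an SNS-based publish path: events published to the topic fan out
// to the SQS queues subscribed to it, giving each consumer durable delivery.
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public class AnalyticsEventPublisher {
    private static final String TOPIC_ARN =
            "arn:aws:sns:us-east-1:123456789012:analytics-events"; // hypothetical

    public static void main(String[] args) {
        try (SnsClient sns = SnsClient.create()) {
            sns.publish(PublishRequest.builder()
                    .topicArn(TOPIC_ARN)
                    .message("{\"type\":\"page_view\",\"userId\":\"u-42\"}") // hypothetical event
                    .build());
        }
    }
}
```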
Product Analytics Data Pipeline Using Amazon KDS (Source: Canva Engineering Blog)
Based on the initial MVP experience, the team decided to look for alternatives that would meet the performance requirements at lower cost and considered two other AWS services: Amazon Managed Streaming for Apache Kafka (MSK) and Amazon Kinesis Data Streams (KDS). Engineers compared the cost, performance, and maintenance effort of these services and opted for KDS for its low cost (85% cheaper than SNS+SQS) and minimal maintenance burden, despite latency 10-20ms higher than MSK's, which was deemed acceptable.
To improve the cost-effectiveness of the KDS-based solution, the team batched events together and compressed them with zstd, achieving a 10x compression ratio at the cost of around 100ms of compression latency per batch. Engineers estimate that compression alone resulted in $600k of annual savings.
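A minimal sketch of the batching-plus-compression idea, assuming the AWS SDK for Java v2 and the zstd-jni binding (com.github.luben:zstd-jni); the stream name, newline-delimited batch framing, and partition-key choice are illustrative assumptions rather than Canva's actual implementation:

```java
import com.github.luben.zstd.Zstd;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.UUID;

public class CompressedBatchProducer {
    private final KinesisClient kinesis = KinesisClient.create();

    public void sendBatch(List<String> events) {
        // Concatenate the batch into one payload: many small events become a
        // single Kinesis record, so per-record overhead is paid only once.
        byte[] raw = String.join("\n", events).getBytes(StandardCharsets.UTF_8);

        // zstd trades a little CPU for a large size reduction; on text-like
        // analytics events, ratios around 10x are plausible.
        byte[] compressed = Zstd.compress(raw);

        kinesis.putRecord(PutRecordRequest.builder()
                .streamName("product-analytics")            // hypothetical stream name
                .partitionKey(UUID.randomUUID().toString()) // spread load across shards
                .data(SdkBytes.fromByteArray(compressed))
                .build());
    }
}
```

Since a shard's throughput limit applies to the bytes actually written, compressing before the put effectively multiplies per-shard capacity by the compression ratio, which is where the cost savings come from.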
One area that required special attention with KDS was high tail latency (over 500ms) and throttling when throughput spikes exceeded the hard limit of 1MB/s per shard. Engineers implemented fallback logic that diverts traffic to an SQS queue and, as a result, achieved p99 latency below 20ms while paying less than $100 per month for SQS. The fallback path additionally doubles as a failover mechanism in case KDS experiences severe service degradation or an outage.
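The source does not detail the exact routing logic, but a simplified version of the fallback could look like the following, again assuming the AWS SDK for Java v2; the queue URL and stream name are hypothetical, and a downstream consumer would drain the queue back into the pipeline:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.ProvisionedThroughputExceededException;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

import java.util.Base64;

public class FallbackProducer {
    private final KinesisClient kinesis = KinesisClient.create();
    private final SqsClient sqs = SqsClient.create();
    private static final String FALLBACK_QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/analytics-fallback"; // hypothetical

    public void send(byte[] payload, String partitionKey) {
        try {
            kinesis.putRecord(PutRecordRequest.builder()
                    .streamName("product-analytics") // hypothetical stream name
                    .partitionKey(partitionKey)
                    .data(SdkBytes.fromByteArray(payload))
                    .build());
        } catch (ProvisionedThroughputExceededException e) {
            // The shard is throttling: divert the record to SQS instead of
            // retrying against Kinesis, keeping tail latency low. SQS bodies
            // are text, so the binary payload is Base64-encoded.
            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(FALLBACK_QUEUE_URL)
                    .messageBody(Base64.getEncoder().encodeToString(payload))
                    .build());
        }
    }
}
```

Because the queue only receives traffic during throttling spikes or a KDS outage, its steady-state cost stays negligible, which is consistent with the sub-$100 monthly SQS bill.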
Fallback to SQS in Case of KDS Throttling (Source: Canva Engineering Blog)
The team used Protocol Buffers to ensure the architecture could describe and evolve event definitions over time. Canva was already using Protocol Buffers to define contracts between microservices, but event definitions additionally required full backward and forward compatibility. Engineers also created Datumgen, a home-grown code generation tool built on top of protoc.
Datumgen is used to verify compatibility requirements and generate code in multiple languages. Furthermore, the tool extracts metadata from event definitions to enhance the event catalog data with details about technical and business owners, as well as field descriptions. Well-documented and up-to-date event schemas help Canva maintain data quality, avoid costly issues with schema incompatibility at runtime, and empower engineers to discover available product analytics events.
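As an illustration of the kind of rule such a tool can enforce, the hypothetical checker below flags schema evolutions that would break full compatibility; the FieldType enum and map-based schema model are stand-ins for real protobuf descriptors, not Datumgen's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SchemaCompatibilityChecker {
    enum FieldType { STRING, INT64, BOOL, MESSAGE }

    /** Returns the violations found when evolving oldSchema into newSchema. */
    static List<String> check(Map<Integer, FieldType> oldSchema,
                              Map<Integer, FieldType> newSchema) {
        List<String> violations = new ArrayList<>();
        for (var entry : oldSchema.entrySet()) {
            FieldType newType = newSchema.get(entry.getKey());
            if (newType == null) {
                // Removing a field silently breaks consumers that rely on it;
                // flag the removal for review.
                violations.add("field " + entry.getKey() + " was removed");
            } else if (newType != entry.getValue()) {
                // Changing a field's type breaks both old readers of new data
                // and new readers of old data.
                violations.add("field " + entry.getKey() + " changed type from "
                        + entry.getValue() + " to " + newType);
            }
        }
        return violations; // new fields under fresh numbers are always allowed
    }

    public static void main(String[] args) {
        var v1 = Map.of(1, FieldType.STRING, 2, FieldType.INT64);
        var v2 = Map.of(1, FieldType.STRING, 2, FieldType.BOOL, 3, FieldType.STRING);
        check(v1, v2).forEach(System.out::println); // field 2 changed type from INT64 to BOOL
    }
}
```

Running such a check in CI catches incompatible schema changes at build time rather than as data-quality incidents at runtime.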