Zendesk reduced its data storage costs by over 80% by migrating from DynamoDB to a tiered storage solution using MySQL and S3. The company evaluated several storage technologies and decided to combine a relational database with an object store, striking a balance between queryability and scalability while keeping costs down.
Zendesk originally built its persistence solution for event-stream data on DynamoDB. The initial design worked well, but the solution became increasingly expensive to operate. The team switched to a provisioned billing model and reduced costs by 50%, but with the growth of the customer base and the need for Global Secondary Indexes (GSIs) to support new query patterns, the cost of running the architecture became unsustainable.
The Initial Architecture Using DynamoDB (Source: Zendesk Engineering Blog)
As Zendesk runs its platform on AWS, the team looked for alternative storage solutions that could meet the functional and technical requirements while reducing costs. They considered S3, Hudi (the data lake used at Zendesk), Elasticsearch, and MySQL, but ruled out Hudi because of its complexity and 24-hour ingestion delay, and Elasticsearch because its costs would have been similar to DynamoDB's. In the end, the team decided to use MySQL to buffer logs from Apache Kafka and to store metadata, and S3 to store the raw data in batches of 10,000 logs per file.
The ingestion flow stores log data consumed from Kafka in a buffer table in MySQL. Every hour, a background job uploads new records from the buffer table to S3 in batches of 10,000 logs per file and inserts a metadata record per S3 file. Another hourly job deletes logs older than four hours from the buffer table.
The New Architecture Using MySQL (AuroraDB) and S3 (Source: Zendesk Engineering Blog)
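The article describes the archival jobs only at a high level. The following Python sketch illustrates how they could look, assuming hypothetical `log_buffer` and `log_files` tables and an `archived` flag for tracking uploaded rows; none of these names come from Zendesk:

```python
import datetime
import json
import uuid

import boto3
import pymysql

BATCH_SIZE = 10_000                      # logs per S3 file, per the article
RETENTION = datetime.timedelta(hours=4)  # buffer retention, per the article

s3 = boto3.client("s3")
db = pymysql.connect(host="mysql", user="app", password="secret", database="logs")


def archive_buffer(bucket: str) -> None:
    """Hourly job: move new buffer rows to S3 and record file metadata."""
    with db.cursor() as cur:
        cur.execute(
            "SELECT id, payload, created_at FROM log_buffer "
            "WHERE archived = 0 ORDER BY created_at LIMIT %s",
            (BATCH_SIZE,),
        )
        rows = cur.fetchall()
        if not rows:
            return
        # One JSON-lines object per batch so S3-Select can scan it later.
        key = f"logs/{uuid.uuid4()}.jsonl"
        body = "\n".join(
            json.dumps({"id": r[0], "ts": r[2].isoformat(), "payload": r[1]})
            for r in rows
        )
        s3.put_object(Bucket=bucket, Key=key, Body=body.encode())
        # The metadata row maps a time range back to the S3 file at query time.
        cur.execute(
            "INSERT INTO log_files (s3_key, min_ts, max_ts, log_count) "
            "VALUES (%s, %s, %s, %s)",
            (key, rows[0][2], rows[-1][2], len(rows)),
        )
        placeholders = ",".join(["%s"] * len(rows))
        cur.execute(
            f"UPDATE log_buffer SET archived = 1 WHERE id IN ({placeholders})",
            [r[0] for r in rows],
        )
    db.commit()


def purge_buffer() -> None:
    """Hourly job: drop buffer rows older than four hours; S3 now holds them."""
    cutoff = datetime.datetime.utcnow() - RETENTION
    with db.cursor() as cur:
        cur.execute("DELETE FROM log_buffer WHERE created_at < %s", (cutoff,))
    db.commit()
```

Batching 10,000 logs into a single JSON-lines object keeps the S3 request count, and therefore the cost, low while still allowing individual records to be scanned later.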
To handle queries, the new solution performs a look-up in the MySQL metadata table, followed by a set of parallel S3-Select queries against the files returned by the look-up, roughly as sketched below. Since the data layout was optimized for chronological retrieval, the team ran into problems executing more complex queries.
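A rough sketch of this read path, reusing the hypothetical tables and schema from the snippet above; the concurrency model and the S3-Select expression are illustrative, not Zendesk's actual code:

```python
from concurrent.futures import ThreadPoolExecutor


def files_in_range(start, end) -> list:
    """Metadata look-up: which S3 files overlap the requested time range?"""
    with db.cursor() as cur:
        cur.execute(
            "SELECT s3_key FROM log_files WHERE max_ts >= %s AND min_ts <= %s",
            (start, end),
        )
        return [row[0] for row in cur.fetchall()]


def select_file(bucket: str, key: str, start_iso: str, end_iso: str) -> list:
    """Run one S3-Select query against a single JSON-lines file."""
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        # ISO-8601 timestamps compare correctly as plain strings.
        Expression=(
            "SELECT * FROM s3object s "
            f"WHERE s.ts >= '{start_iso}' AND s.ts < '{end_iso}'"
        ),
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )
    records = []
    for event in resp["Payload"]:
        if "Records" in event:
            records.append(event["Records"]["Payload"].decode())
    return records


def query_logs(bucket, start, end):
    keys = files_in_range(start, end)
    # Fan out one S3-Select per file and merge the results.
    with ThreadPoolExecutor(max_workers=16) as pool:
        batches = pool.map(
            lambda k: select_file(bucket, k, start.isoformat(), end.isoformat()),
            keys,
        )
    return [rec for batch in batches for rec in batch]
```

This works well for time-range predicates, but filtering on any other field would still require scanning every file in the range, which is exactly the problem Hender describes below.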
Shane Hender, group tech lead at Zendesk, explains the challenges with flexible querying in the new architecture:
Once we had the above all working we encountered a performance problem when the client wanted to filter results by any field besides the timestamp. For example, if the client wanted logs for a specific user-id, then we’d at worst have to scan ALL the S3 data within the time-range to find any relevant logs, which made it difficult to figure out what queries to do in parallel.
The engineers considered duplicating the data in S3 to support more filterable fields, but the approach wasn't feasible given the number of field combinations. Eventually, they turned to Bloom filters combined with Count-Min Sketch data structures, which together offered an effective way to support multi-field filter queries. The improved solution requires an additional table storing serialized data structures used to determine which S3 files to query.
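The blog post doesn't spell out the exact construction, but the following toy Python sketch shows the general idea: a per-file Bloom filter answers "might this file contain user_id=X?" with no false negatives, while a Count-Min Sketch provides over-estimated match counts for ranking the remaining files. Serialized versions of such structures would live in the additional metadata table mentioned above.

```python
import hashlib


class BloomFilter:
    """Set-membership filter: False means definitely absent, True means maybe."""

    def __init__(self, size: int = 1 << 16, hashes: int = 4):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class CountMinSketch:
    """Frequency estimator that only over-counts, so it is safe for ranking."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item: str):
        for i in range(self.depth):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.width

    def add(self, item: str) -> None:
        for row, col in zip(self.table, self._cols(item)):
            row[col] += 1

    def estimate(self, item: str) -> int:
        return min(row[col] for row, col in zip(self.table, self._cols(item)))
```

During archival, each record's filterable values (for example, the string "user_id=42") would be added to both structures before they are serialized into the file's metadata row; at query time, any file whose filter returns False can be skipped outright, and the sketch's estimates can prioritize the rest.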
After the migration, Zendesk's storage costs dropped to less than 20% of the provisioned DynamoDB costs, with MySQL (AuroraDB) accounting for over 90% of the spend and S3, together with S3-Select, for less than 10%. The new solution offers query latencies of around 200-500 milliseconds, with occasional spikes into seconds, which the team is looking to optimize further.