RevenueCat extensively uses caching to improve the availability and performance of its product API while ensuring consistency. The company shared the techniques it uses to deliver the platform, which handles over 1.2 billion API requests per day. The team at RevenueCat created an open-source memcache client that provides several advanced features.
Caching is a well-known way to retrieve data faster than fetching records from the database or recomputing them on the fly. There are many approaches to utilizing caching in modern distributed architectures, but they all strive to ensure low latency, data availability, freshness, and consistency.
RevenueCat achieves low latency in its caching layer by pooling connections between the client library and cache servers to avoid unnecessary TCP handshakes during cache data retrieval. The memcache client library the company open-sourced promotes best practices and provides sensible default configuration values. The team established that the best results were achieved using low timeouts and avoiding client retries. Instead, in case of cache availability problems, the client marks servers as down and falls back to the data source. It also relies on the TCP protocol to reestablish healthy connections to cache servers.
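As a rough sketch of that configuration philosophy, not RevenueCat's actual client, the pattern could look as follows with the pymemcache library; the timeout values, the down-marking cooldown, and the load_from_database fallback are hypothetical choices for illustration.

```python
import time
from pymemcache.client.base import PooledClient
from pymemcache.exceptions import MemcacheError

# Hypothetical values: pooled connections, sub-100 ms socket timeouts,
# and no client-side retries -- on any error we fall back to the source.
client = PooledClient(
    ("cache-host", 11211),
    connect_timeout=0.05,   # 50 ms to establish a connection
    timeout=0.1,            # 100 ms per request
    max_pool_size=32,       # reuse TCP connections instead of re-handshaking
    no_delay=True,
)

_marked_down_until = 0.0    # crude "server is down" flag with a cooldown


def get_with_fallback(key: str, load_from_database):
    """Return the cached value (bytes/str), or fall back to the database on any cache issue."""
    global _marked_down_until
    if time.monotonic() < _marked_down_until:
        return load_from_database(key)               # server marked down: skip the cache
    try:
        value = client.get(key)
    except (MemcacheError, OSError):
        _marked_down_until = time.monotonic() + 5.0  # back off for a few seconds
        return load_from_database(key)
    if value is None:
        value = load_from_database(key)
        try:
            client.set(key, value, expire=300)       # best effort; never retried
        except (MemcacheError, OSError):
            pass
    return value
```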
The company planned its caching infrastructure to minimize the impact of inevitable failures, also considering that smaller cache servers are more affected by the distribution of hot keys. The planning includes determining the number and size of cache servers based on how much load the backend services can absorb if cache nodes fail.
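As a back-of-the-envelope illustration of that sizing exercise (the numbers below are made up), the extra load a cache node failure pushes onto the backend is roughly the failed node's share of the cache hit traffic:

```python
def backend_load_after_failure(total_rps: int, num_cache_servers: int,
                               failed_servers: int, hit_rate: float) -> float:
    """Estimate requests per second hitting the backend when some cache nodes fail.

    Assumes keys spread evenly across servers, which hot keys will skew in
    practice -- the reason smaller servers suffer more from hot-key distribution.
    """
    normally_missed = total_rps * (1 - hit_rate)
    lost_share = failed_servers / num_cache_servers
    newly_missed = total_rps * hit_rate * lost_share
    return normally_missed + newly_missed


# Example: 14,000 rps, 10 cache servers, 95% hit rate, one server down
# -> the backend must absorb roughly 700 + 1,330 = 2,030 rps instead of 700.
print(backend_load_after_failure(14_000, 10, 1, 0.95))
```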
Impact of Failure with Mirrored Pools (Source: RevenueCat Engineering Blog)
RevenueCat uses various techniques to manage and operate its caching layer. These include fallback pools (the client falls back to a mirror pool, or to a gutter pool aimed at storing the hottest keys), dedicated pools for specific use cases, key splitting or local caches for hot keys, and recache, stale, or lease policies to avoid the thundering-herd effect on very hot keys.
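Key splitting, for example, spreads a very hot key across several cache entries so that a single server does not absorb all of its reads. A minimal sketch of the idea, with an illustrative split factor and naming scheme rather than RevenueCat's implementation:

```python
import random

SPLIT_FACTOR = 8   # hypothetical: number of copies a hot key is split into


def split_key(key: str) -> str:
    """Pick one of N copies of a hot key, so reads spread across cache servers."""
    return f"{key}#split{random.randrange(SPLIT_FACTOR)}"


def set_hot_key(client, key: str, value, expire: int = 300) -> None:
    """Writes must update every copy so readers never hit a stale split."""
    for i in range(SPLIT_FACTOR):
        client.set(f"{key}#split{i}", value, expire=expire)


def get_hot_key(client, key: str):
    """Reads pick a random copy, spreading the hot key's load."""
    return client.get(split_key(key))
```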
The engineering team offers some tips regarding resharding or migrating cache clusters. A consistent hashing algorithm helps keep the cache warm while extra servers are added to increase capacity. Migrations can be driven by the cache clients themselves to avoid additional migration tooling and ensure smooth execution.
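The benefit of consistent hashing is that adding a server remaps only a small fraction of the keys instead of reshuffling the entire key space and leaving the cache cold. A toy hash ring (the virtual-node count and hash function are arbitrary choices here) demonstrates the effect:

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring: adding a server moves only ~1/N of the keys."""

    def __init__(self, servers, vnodes: int = 100):
        self._ring = []                      # sorted list of (hash, server)
        for server in servers:
            self.add_server(server, vnodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_server(self, server: str, vnodes: int = 100) -> None:
        for i in range(vnodes):
            self._ring.append((self._hash(f"{server}-{i}"), server))
        self._ring.sort()

    def server_for(self, key: str) -> str:
        point = self._hash(key)
        index = bisect.bisect(self._ring, (point,)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["cache1", "cache2", "cache3"])
before = {k: ring.server_for(k) for k in map(str, range(1000))}
ring.add_server("cache4")
moved = sum(1 for k, s in before.items() if ring.server_for(k) != s)
print(f"{moved / 10:.1f}% of keys moved")   # roughly 25%, not 100%
```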
Race Conditions Affecting Cache Consistency (Source: RevenueCat Engineering Blog)
Lastly, the RevenueCat engineers describe common strategies for dealing with cache consistency challenges, including compare-and-swap, leases, recache policies, marking keys as stale, lowering the key TTL, or write failure tracking. The company's Python memcache client library provides support for handling cache consistency for CRUD operations.
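Compare-and-swap, for instance, uses memcached's gets/cas commands to reject writes that raced with another update. A hedged sketch of the pattern, assuming pymemcache's gets/cas interface and an illustrative read-modify-write:

```python
from pymemcache.client.base import Client

client = Client(("cache-host", 11211))


def increment_visits(key: str, max_attempts: int = 3) -> bool:
    """Read-modify-write guarded by CAS: if we lose the race, retry with fresh data."""
    for _ in range(max_attempts):
        value, cas_token = client.gets(key)
        if value is None:
            return client.add(key, b"1", expire=300)       # nothing cached yet
        new_value = str(int(value) + 1).encode()
        if client.cas(key, new_value, cas_token, expire=300):
            return True            # no concurrent writer got in between
        # cas returned False: someone else updated the key; loop and retry
    return False
```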
InfoQ contacted Guillermo Pérez, the director of engineering (SRE) at RevenueCat, with some questions.
InfoQ: Cache consistency is considered one of the most challenging things in computing. What are the key things to consider when caching data at scale?
Guillermo Pérez: Well, the first thing I would recommend is to determine what level of consistency you really need. A simple cache with a reasonably low TTL can be enough in many situations, and you don't need to build anything complex for that. At RevenueCat, we needed strong consistency due to the kind of data we manage. A key thing, if you need good consistency, is having good visibility: detecting and reporting inconsistencies by re-checking cache items with some cadence, so you can monitor, track, and detect regressions.
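One possible shape for that kind of visibility, not necessarily RevenueCat's, is a background job that samples cached keys, re-reads the source of truth, and reports the mismatch rate as a metric; all names below are illustrative:

```python
import random


def audit_cache_consistency(client, load_from_database, candidate_keys,
                            sample_size: int = 100) -> float:
    """Re-check a random sample of cached keys against the database and
    return the mismatch rate, which can be exported as a monitoring metric."""
    sample = random.sample(candidate_keys, min(sample_size, len(candidate_keys)))
    mismatches = 0
    for key in sample:
        cached = client.get(key)
        if cached is not None and cached != load_from_database(key):
            mismatches += 1
    return mismatches / len(sample)
```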
InfoQ: RevenueCat's Memcache Python library provides many excellent features described in your post. Can you recommend libraries or materials for engineers using different language stacks?
Guillermo Pérez: I would recommend starting by looking at mcrouter, which implements some helpful routing strategies and speaks the memcache protocol, so you can use it from any memcache client in any language.
While the new memcache protocol really helps with implementing some advanced semantics, the majority of cache management can be implemented with the current protocol.
The write failure tracker is important but easy to implement, and you can just reuse whichever reliable logging you have in your service.
Finally, I would encourage building a data access layer in your application code that abstracts all details of data fetching, so you can evolve, add caching strategies, etc., without complex application refactors. The sooner you do this, the better you will be prepared to scale.
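A hedged sketch of such a data access layer, where the caching strategy stays hidden behind a single interface the rest of the application calls (the repository name and database methods are illustrative):

```python
class SubscriberRepository:
    """All reads and writes go through here, so caching strategies can change
    without refactoring application code."""

    def __init__(self, cache_client, db, ttl: int = 300):
        self._cache = cache_client
        self._db = db
        self._ttl = ttl

    def get(self, subscriber_id: str):
        key = f"subscriber:{subscriber_id}"
        value = self._cache.get(key)
        if value is None:
            value = self._db.fetch_subscriber(subscriber_id)
            self._cache.set(key, value, expire=self._ttl)
        return value

    def update(self, subscriber_id: str, value) -> None:
        self._db.store_subscriber(subscriber_id, value)
        # Invalidate rather than write through; callers never need to know.
        self._cache.delete(f"subscriber:{subscriber_id}")
```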
You can also read on InfoQ about how DoorDash rearchitected its cache to improve scalability and performance.