InfoQ Homepage Presentations Ubiquitous Caching: a Journey of Building Efficient Distributed and In-Process Caches at Twitter

Architecture & Design

Ubiquitous Caching: a Journey of Building Efficient Distributed and In-Process Caches at Twitter

Bookmarks

View Presentation

Speed:

51:48

Summary

Juncheng Yang discusses three trends in hardware, workload, and cache usage that shape the design of modern caches.

Bio

Juncheng Yang is a Ph.D. student at Carnegie Mellon, focused on Efficiency and Performance, Previously at Twitter & Cloudflare, a Facebook Fellow.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Yang: My name is Juncheng Yang. I'm a fifth year PhD student at Carnegie Mellon University. I started working on caching since 2015. I'm going to talk about ubiquitous caching, a journey of building efficient distributed and in-process caches at Twitter. Today's talk is going to have three parts. First, I'm going to talk about three trends in hardware, workload, and cache usage, and how they motivate the design of Segcache, a new cache store design that is more efficient and more performant. Second, I'm going to discuss the design philosophies and tradeoffs that we make in Segcache. Third, I'm going to describe how you reduce microservice tax by using small in-process caches, to build up our Segcache.

Caching Is Everywhere

Let's start with the microservice architecture. Here I'm showing a typical architecture. You have an application. The application sends requests to an API gateway. The API gateway forwards requests to different microservices. These microservices are usually stateless. Serving some of the requests require fetching data from the backend. However, always reading from the storage backend can be slow, and it can be less scalable. A typical practice is to add a distributed cache in between the service and to the backend, so that reading data can be directly from the cache, much faster and scalable always. At first glance, it seems there's only one cache in this diagram. That's not true. To the right of the distributed cache, we have storage service, whether the storage is running the Linux or BSD they will have page cache, which reduce a significant amount of [inaudible 00:02:21]. To the left of the distributed cache, they've placed application services. These services even though they are stateless, they may also have cache, in-process caches, that stores a subset of most frequently used data, so that using this data to run across the network stack. Further to that, in API gateway, we're increasingly seeing usage of cache to cache dynamic response, that's just got QL response. Your API gateway may also have a cache embedded. Going out of your core data center, the edge data center, which is closer to the user, and the edge data center such as the CDN edge data center, serve static objects for you, such as image or videos, that's out of big cache. We will look at this diagram. Almost each box or each part of that diagram have a cache, and the cache is almost everywhere.

What's New in Caching?

The study and use of cache has a long history, dating back to '60s and '70s. Unlike other areas, such as AI and machine learning, or even databases, there hasn't been huge news in caching. What's new in caching these days? Maybe there's some new eviction algorithms, or maybe there's some new architecture, or maybe there's some new features, or maybe there's some new hardware we can deploy a cache on. If you look closely, it's actually a mixture of all of that. All this excitement, all these advancement and improvements boil down to three things: performance, efficiency, and new use cases. First, performance. As a cache user, cache designer, and a cache operator, we all want better performance. There are works to reduce the latency of cache is much tail latency, and increase the throughput and improve the scalability of the cache design. Besides a common case of performance, tail performance and the predictable performance, especially predictable tail performance is very important if we run cache at large scale. There are some works to improve the logging, the monitoring, and the observability towards more caching, so that when your cache goes wrong, you will know it. It will be easier for you to diagnose and find out what goes wrong.

The second is efficiency. There are quite some works like product stuff to actually improve the efficiency or the cost effectiveness of caching. For example, Meta open sourced CacheLib last year. A CacheLib enables you to cache both small objects and large objects on flash. At Twitter, we open sourced Segcache last year too. Segcache is a better storage design that allows you to cache more data in your cache. Besides this, there are also some staff who work on workload modeling. The workload modeling allows cache operators to do better capacity planning and resource provisioning. Besides performance and efficiency, we are increasingly seeing new cache use cases, mostly in caching dynamic data. For example, we have some works on products to do cache API responses, while some others trying to cache dynamic data structures.

At Twitter, we created Pelikan. Pelikan is an open source caching framework that will replace Memcached and Redis. Pelikan adopts a modular design with clear boundary between each module of components. Pelikan was written in C a few years ago. In the last two years, they have migrated Pelikan from C to Rust. By we here, I really mean we. I personally only contributed a very limited amount of code to the code base. The project is led by Yao and Brian, who are the core developers of Pelikan that's contributed to majority of the code base and has led the transition from C to Rust.

Hardware Trend

I'm going to focus on the new datastore on storage component, which we introduced two years ago, which is called Segcache. The design of Segcache was motivated by three trends: hardware, workload, and cache usage. First, let's talk about hardware trend. We observed that the memory capacity scaling had slowed down, but the computing has not. This figure comes from Jim Handy, the memory guy expert being semiconductor memories. It shows the DRAM price per gigabyte in the past 30 years. We observe that for the first 20 years in this figure, we see that the DRAM price dropped significantly, which is almost exponential. Notice log scale on y axis. In the past decade, from around 2011, the DRAM price per gigabyte remains almost constant, of course with some fluctuations. In contrast, the computing power has been constantly increasing. This figure shows the number of CPU cores in a high-end storage CPU from 2007 to 2014. We see that from 2007 to 2014 the number of cores increased by 8 times, which doubles almost every 2 years. From 2014 up to 2022, AMD released CPUs with over 100 cores, which if you consider hyper-threads as 1 core, which is more than 200, another 8 times increase since 2014.

Also, there has been some debate on whether Moore's Law is dead or not. Whether it's dead or not, will not stop that computing power continues to increase. Especially consider the use and wide deployment of accelerators. We believe that the computing power will continue to increase for the foreseeable future. The memory's capacity stopped scaling, and the computing power continued to increase, makes the memory capacity become more important. Making better use of your limited memory capacity or limited cache size becomes increasingly important, especially in an economy environment where cutting costs have become the new law. According to Wanclouds, close to 40% of the companies do not want to spend more money on cloud computing on the infrastructure cost. While more than 40% of the companies want to cut the cost either dramatically, or marginally on cloud computing, and only less than 20% of companies are willing to spend more money on cloud computing. Then comes the first takeaway, making a good use of limited cache space becomes increasingly important. That's the first trend, the hardware trend.

Workload Trend

The second trend is workload trend. We are in a digital era. Many of our social activities have moved online, especially after the pandemic. As a result of this, we're seeing exploding data creation rate. For example, according to IDC, the volume of data created and replicated worldwide increased exponentially. For example, compared to 9 years ago, 2013, in 2022, the volume of data is almost 10 times higher. If we further compared to 2025, just in 3 years, the amount of data created worldwide will almost double again. What does this mean for us as a cache researcher and storage engineers? It means that when you think about the capacity needed for this data, and we consider how to serve this data, maybe separate the cold data from the hot data, and what else. If we look closer at this hot data and how they're accessed over time, we observe that popularity decay has become the new trend.

In this slide, I'm comparing the relative popularity of two workloads, one collected in 2007, and one collected in 2021. The y axis shows the relative popularity measuring popularity of getting a request in a 5-minute window, while x shows time. The figure is created by averaging medians of objects. We can see that the relative popularity of objects created in 2007 does not change much after the data being created. While for our workload from 2021, objects show significant popularity decay over its lifetime. For example, compared to 5 minutes old, the data around 1 day old have 10 times less request popularity, or 10 times less popularity. Of course, we see spikes of some time points like 1 day, 2 day, 3 day, and the 4 days, the congestion is caused by a recommendation system rewiring the data. What's the takeaway for this popularity decay? It means that objects become less malleable over time, and LRU may not be the best eviction algorithm anymore, especially if we consider the scalability problem it introduced and its property of being sized [inaudible 00:14:44]. This makes LRU much less attractive compared to other more fancier algorithms.

The second workload trend is object size. Technically speaking, it's not really a trend but it was a property of key-value cache, or communication workload. We observe that object size are often small. For example, the five largest cache clusters at Twitter have mean object size between 50 bytes and 200 bytes. Across all cache clusters, which is more than 100, the 25 percentile mean object size is 100 bytes, and the median is 300 bytes. What problem does small object brings? The problem is metadata overhead. For example, Memcached uses 56 bytes metadata per object. If you have an object like workload with object size around 100 bytes, so what set of your cache size will be span metadata. If you examine your cache usage, you will see a universe of metadata, and we can call this Metaverse. Besides the object size being small, object size are also non-uniform. This is in contrast to the page cache and CPU cache where the granularity of data being cached, it's usually big size. In key-value cache, object size is non-uniform. Every one of us knows about it.

What's the problem of being non-uniform? The problem is fragmentation. If you have ever run large scale cache clusters, especially Linux clusters, we had run into memory usage I wasn't expecting, or even out-of-memory errors. This is most likely true because of fragmentation. Fragmentation itself is an important problem. Because we always run cache in full capacity, so we never run in 50% capacity, and you shouldn't run at 50% capacity, because it's cache. We run the cache at 100% capacity, creating a new object plus evicting an object. Putting this in terms of allocation, or memory allocation, a cache constantly allocates and deallocates objects. The design of memory allocator have a big impact on the efficiency and performance of the cache. In fact, we are seeing some staff working to improve the performance of memory allocator. The third takeaway from the trend is that memory allocation and memory management is critical for cache design.

Usage Trend

The last trend is usage trend. We're seeing increasing use of time-to-live in professional code. Time-to-live stands for TTL. This is set when the object is written into the cache, and when an object expires it cannot be used. Why do we use TTL? There are a couple of reasons. First, because a distributed cache has different ways to have stale data, or inconsistent data, which is bad. TTL is introduced as an approach to guarantee that stale data do not stay for a long time. Second, TTL is used for periodic refresh. For example, some machine learning prediction service may cache the compute itself in the cache for a few seconds or a few minutes. It does not need to recompute certain data, and save some computation. Third, TTL can be used for implicit deletion. For example, where you implement rate limiters, you can use TTL for time window. Or if your company enforces GDPR regulation, you may have to use TTL to guarantee that the data of this user will not stay for more than a certain amount of time in your cache. Last, we see some engineers use TTLs to signal objects' lifetime, especially when the objects have a short lifetime. In production, we observe some short TTLs are widely used, and this figure shows the distribution of the smallest TTL given the cache cluster service. We observe that for more than 30% of the cache cluster service at Twitter, the smallest TTL is less than 5 minutes, or for more than 70% of the cache clusters, the smallest TTL is less than 6 hours. It's pretty small compared to the data retention time in your backend key-value store.

What does this imply? We measure the working set size of the same workload when we consider TTL, and when we don't consider TTL. We see that if we don't consider TTL, the working set size constantly grow. While if we consider TTL, and if the cache can remove expired objects in time, then the working set size is bounded, and we do not need a huge cache size for this workload. Of course, this is assuming the expired objects can be removed entirely, but it's just not the reality in most cases. When we compare expiration and eviction, expiration removes objects that cannot be used in the future, and eviction removes objects that may potentially be used in the future. In the past, I saw some confusion and misconceptions around expiration and eviction. Some people think that expiration is a way for eviction, or eviction is one type of expiration. I don't think this is true. Expiration and eviction are two different concepts. Expiration are talking about user, while eviction is an internal operation to the cache that's used to make room to run new objects. Don't mix expiration and eviction. The first takeaway of the trend is that removing expired objects is actually more important than eviction, because expired objects are using it.

Let's look at the existing approach, a trend that's being used in production systems to remove expired objects. We see that existing approaches are good, but they're not sufficient, or efficient. For example, modern approach uses the Memcached scanning, which scan the cache periodically to remove expired objects. This is sufficient because it can remove all the expired objects if you scan parsing out. However, it's not efficient because it consumes a lot of CPU cycles, and memory penalty. Similarly, the approach used in Redis are sampling, which is both not efficient and not sufficient, because it cannot remove all the expired objects without sampling from the cluster. Random memory access is also very expensive in terms of computation. How do we find and remove expired objects?

Segcache - Segment-Structured Cache

In the second part, I want to talk about Segcache. Segcache was motivated by the trends I just talked about, and by the problems I just mentioned. Segcache stands for segment-structured cache. It is a highly efficient and high-performance cache storage design that has the following features. First, it supports efficient TTL expiration, meaning that once an object expires, it will be removed from the cache within a few seconds. Second, it has a tiny object metadata, which is only 5 bytes. This is 90% reduction compared to Memcached. Third, Segcache uses an efficient segment-based memory allocator which has almost no memory fragmentation. Third, Segcache uses a popularity-based eviction, which allows Segcache to perform fewer bookkeeping operations while remaining efficient. As a result of this, Segcache use less memory, but it gives a lower miss ratio. It uses fewer cores, but it provides you a higher throughput. Segcache can be deployed both as a distributed cache and as an embedded cache. We open sourced Segcache in the following URLs, https:/open-source.segcache.com, https:/paper.segcache.com. The first URL links to the open source code. The second URL links to a paper which, in detail, describes the design of Segcache.

Segcache Design

I do want to briefly describe the high-level architecture of Segcache. Segcache has three components. The first one is called object store. Object store is where objects are stored. We divide object store into segments, where each segment is a small log, storing objects of similar TTL. This is somewhat similar to the slab storage in Memcached. Slab storage stores objects of similar size, whereas the segment stores objects of similar TTLs, but can be of any size. The second component of Segcache is called TTL buckets. We break all possible TTL [inaudible 00:26:09] into ranges and assign each range to one TTL bucket. The first TTL bucket links out to a segment chain. When we write objects into segments, it is append only. The segment chain is from altering an append only way. All objects in the segment chain are sorted by creation time. Because objects in the same segment chain have similar TTLs, they're also sorted by expiration time. This allows Segcache to quickly discover and remove expired objects. The third component of Segcache is the hash table. Hash table is common in many cache designs. In contrast to the object chaining hash table in Memcached, Segcache use a packet chaining hash table, where objects are first going to hash buckets, then we perform chaining. When we write an object into Segcache, it first finds the overhead TLL buckets, so we find the last segment in this segment chain, then we append to that segment. When it's reading an object, we first look up in the hash table, it finds a pointer to the segment and we read the data from the segment. That's how read and write works in Segcache.

Key Insight

The design of Segcache is driven by one key insight, that key-value cache is not the same as key-value store plus eviction. Why? A couple of reasons. First, key-value cache and key-value store have different design requirements. When we design a key-value cache, the top priority is performance and efficiency. You want your cache to be as fast as possible, and you want your cache to store as many objects as possible. While if you design a key-value store, you want durability, you want consistency. You don't want data loss and you don't want data corruption. You may [inaudible 00:28:31] key-value store. This different design requirement makes key-value cache and key-value store to have different designs and different focus. In addition to design requirements, key-value cache and key-value store also have these different workloads. As we show in the microservice architecture, key-value cache usually sits in front of the key-value store, meaning that all the traffic taken by the key-value store will be filtered first by the key-value cache. The key-value cache usually sees more skewed workloads. Key-value cache workloads usually use shorter TTLs. Because of this difference, key-value cache and key-value store are completely different things. Next time when someone asks you what's the difference between Memcached and RocksDB, not only the storage media is different, it's not only the indexing structure is different, the core difference is one being a key-value cache, the other one being a key-value store.

Segcache Design Philosophies

I want to quote Thomas Sowell, a famous economist. Once he said that, there are no solutions, there are only tradeoffs. This also applies to computer system design. In Segcache design, we try to make the right tradeoff for caching. We have three design philosophies. First, we want to maximize metadata sharing via approximation. You notice that many metadata in cache designs does not need to be fully precise, in other words, they can be approximate. For example, timestamps, TTLs, they do not need to be fully accurate. Our design choice is to share metadata between objects in the same segment, and in the same hash bucket. The tradeoff here is the metadata precision versus space saving. We believe that space saving is more important for cache. As a result, let's see what it looks like. This a Memcached object store. We have a huge object metadata that is the data. Here is Segcache object store. We still have the object metadata, but it's much smaller, and we lift most of the metadata to the segment level. Such metadata includes all timestamps and your fence counters and some of the pointers. Besides sharing in the service segment, we also have sharing between objects in the same hash bucket. This type of sharing allows Segcache to have tiny object metadata.

The second design philosophy of Segcache is be proactive, do not be lazy. Segcache wants to remove expired objects, timely and efficient. It uses approximate TTL bucketing and indexing to discover expired objects and remove them. The tradeoff here is that because the TTL is approximate, Segcache may remove or may expire some objects slightly earlier than what the user specifies. This is an example for cache, because the alternative of removing objects slightly earlier is evicting an object. The object that's being evicted may still have future use, but the object that's being expired slightly earlier, it's highly likely it won't have future use before it expires. Expiring objects slightly earlier is often a better choice than evicting the object. That's why we made this design decision.

The third design philosophy in Segcache is macro management. Segcache manages segments instead of objects. The insight here is that many objects in a cache are short-lived and do not need to be promoted. If you use LRU, you will have object LRU chain and a free chunk chain such as in Memcached. Each operation needs some bookkeeping operations, and also needs logging. This significantly limits the scalability of Memcached. By managing segments, Segcache performs expiration and eviction on the segment level, this will reduce a lot of the unnecessary bookkeeping operations, and also reduce the locking frequency by thousands to ten thousand times because each segment has a thousand to ten thousand objects. The tradeoff here is that it may slightly increase the miss ratio in some rare workloads where they do not exhibit popularity decay. The benefit we get here is we obtain a higher throughput and a better scalability. Meanwhile, we reduce miss ratios in the common cases where workload do exhibit popularity decay. That's the tradeoff we made here. Those are the three design philosophies.

Evaluation Setup

Let's look at some results. We evaluate Segcache together with Twemcache, Memcached, and Hyperbolic. Hyperbolic is a cache design from Princeton. We were using 5-week long production traces when we performed all evaluations of Twitter production fleet. This evaluation only evaluates the storage, so it does not involve the RPC stack. In the first evaluation, we want to understand how much memory is needed to achieve production miss ratios. We use Twemcache as the baseline, which is 100%. This result is the lower the better. Compared to Twemcache, both Memcached and Hyperbolic reduce the footprint to some extent. However, if we look at Segcache, it further reduces the memory footprint that's needed to achieve production miss ratio compared to both Memcached and Hyperbolic. Specifically, compared to Twemcache, it can reduce the memory footprint by 40% to 90%. The largest cache cluster at Twitter, which is the [inaudible 00:35:49] cache cluster, it can reduce memory footprint by 60%. Compared to state of art, which is the best of all these systems, Segcache can reduce memory footprint by 22% to 60%. That's a huge saving, especially for the monitoring workload, which have many different TTLs that's used and different types of objects with different lifetime. This is where Segcache shines the most.

Next, we evaluate the single-threaded throughput performance of different systems. First, I'm showing Twemcache, which has pretty high throughput. Next, I show Memcached and Hyperbolic. Compared to Twemcache, these two are slightly slower or sometimes significantly slower. This is because both Memcached and Hyperbolic need to perform a lot of bookkeeping and expensive operations doing each eviction or doing each request, even Get request need to perform some operations, while Twemcache perform almost no operation upon each category. This is why Twemcache is faster. However, compared to Segcache, Segcache and Twemcache have similar models in terms of throughput. When we compare Segcache to other systems, such as Memcached and Hyperbolic, we observed that Segcache is significantly faster with up to 40% higher throughput compared to Memcached. That's single-threaded throughput.

Let's look at multi-threaded throughput. We compare Segcache with Memcached at different number of threads. We observe that Memcached stopped scaling at around 8 threads, while Segcache continued to scale up to 24 threads. At 24 threads, Segcache provides 8 times higher throughput compared to Memcached. Notice that this result is slightly different from the result we can see online [inaudible 00:38:12], this is because production traffic shows more locality. It causes more contention compared to synthetic workloads such as YCSB, or different workload. That's Segcache and how Segcache performs compared to other systems.

JSegcache - Microservice Tax

In the last part, I'm going to talk about JSegcache, which is Segcache for JVM. Earlier, I showed this slide of microservice architecture and I said that cache is everywhere. We just talk about the big orange box, the distributed cache. In this part, I want to talk about the in-process cache, which sits inside each service. Why do we use in-process cache? There are a couple of reasons. The biggest reason why we use in-process cache is we want to reduce microservice tax. Microservice architecture does not come for free. You need to pay tax for it. Here the tax means, first, long data access latency. Because we've separated out the state from the stateless service, each time when we need to access the data, we need to use RPC cost. Compared to DRAM access, RPC cost is thousands of times slower. Second, using a microservice architecture requires a lot of RPC calls, and these RPC calls consume a lot of CPU cycles. According to a report from Meta, up to 82% of their CPU cycles are spent on microservice tax on data orchestration. Here data orchestration includes I/O processing, encryption and decryption, serialization and deserialization, compression and decompression. It's a lot of CPU cycles being consumed by the microservice tax. By caching a small subset of data in-process, we can reduce the amount of RPC calls we need to make. Therefore, you can significantly reduce the CPU cycles. Last, because usually we shard our stateful service, we shard this storage servers into different sizes. When the size is small, hotspot, load imbalance have happened very frequently. It can cause long tail latency or even timeout and cascading failures. By adding a small in-process cache, it can reduce the microservice tax because it can reduce the data access latency, it can reduce the CPU cycles used on RPC communication, and it also can provide provable load balancing on the backend. Previously, researchers have already shown that by caching NlogN objects in your frontend, can guarantee that there's going to be no hotspot in your backend. Here, N is the number of backend servers. N is not the number of objects. N is the number of backend servers. If you're interested in this, here are two papers, you can take a look.

Caching in JVM

Caching in-process can reduce microservice tax. However, Twitter is a JVM shop and JVM caching is expensive because of two reasons. First, caching in JVM incurs huge garbage collection overhead. Many garbage collectors or garbage collection episodes are generational GC, meaning that they assume that objects either have a short or long lifetime. However, caching in JVM breaks this assumption, because objects in a cache have median lifetime to do eviction, expiration, and refresh. Therefore, caching in JVM consumes a lot of CPU cycles and leads to long pause time. Second, caching in JVM is expensive because it has huge metadata. For example, [inaudible 00:42:59] int uses 4 bytes, while an integer object uses 16 bytes in JVM. An empty string uses 40 bytes in JVM. That's huge. If you're familiar with JVM caching, you're probably familiar with Guava and Caffeine. Caffeine is a popular cache library for JVM or Java. A caffeine cache entry uses 88 bytes, or 152 bytes for each cache entry. Many cache workloads have mean object size from tens of bytes to hundreds of bytes. Such huge metadata will cause you to spend a lot of space on storing the metadata. That's why it's expensive.

Off-Heap Caching for JVM

We've designed and built JSegcache, which is off-heap caching for JVM services. JSegcache is a generalized wrapper around Segcache. It allows caching in JVM based microservices, with shorter GC pause with smaller metadata. It also allows us to have larger caches because the cache is no longer on a JVM heap, and because Twitter uses 32-bit JVM, which limits the heap size to 72 gigabytes. By moving the cache off heap, now the cache size is no longer bounded by the heap size. It looks like it's an amazing idea, but why haven't people used it in the past? There are a couple of reasons. First is native code is less portable. Java code is very portable, and you run it once and you run it everywhere. This is not a big problem for data center applications, because in data centers, the architecture is mostly homogeneous, especially at Twitter.

Second is serialization overhead. Because JVM memory layout is different from native memory layout. To cache a Java object, you need to perform serialization, and you find a common language that can be spoken by both the JVM and native language. In this case, it's the binary. Caching off-heap requires serialization, and this adds CPU overhead. However, we see that this still boils down to a tradeoff that we need to make, a tradeoff between GC and serialization overhead. With off-heap caching, we need to pay serialization overhead for each request, while with on-heap caching, we need to pay the GC overhead, which is less predictable, which can cause long tail latency, and which can also consume more CPU cycles. Which one is better? It depends on your use case and usage scenario. In our case, we find that paying the serialization overhead is actually better than paying the GC overhead. That's why we need to bring Segcache to JVM based microservice using JNI.

Evaluation: JSegcache Microbenchmark

Now let's look at some results. First, we evaluate JSegcache and we compare with Guava and Caffeine, two popular JVM cache libraries. We replay a production workflow. The first figure shows the throughput at different concurrency. We observe that compared to Guava and caffeine, JSegcache has higher throughput compared to Guava, but lower throughput compared to Caffeine. This is because JSegcache needs to pay the serialization overhead. Second, we show the number of cores being used in the benchmark. Because Guava and Caffeine needs to pay the GC overhead which consumes a lot of CPU cycles, they consume significantly more CPU cycles compared to JSegcache, especially when the concurrency is low, or the load is low, such as at 1 or 2. Let's see the DRAM requirement to run this microbenchmark. Compared to Guava and Caffeine, which requires GC to collect garbage, JSegcache does not require GC to collect garbage and manages object lifetime within Segcache. It has significantly lower DRAM requirements.

Staging Experiment 1: Double Cache Size

Moving from microbenchmark to production. I'm going to show you two staging experiments. The first shows the case where we have low DRAM, we want to use more hardware resources if we want to get better performance. This figure shows by doubling the cache size, JSegcache can further reduce miss ratio by 15%. It enables faster data access. JSegcache does not change GC frequency, it reduces the GC pause time by around 50%. You start off cheaper GC, the CPU utilization actually goes down, even though we pay serialization overhead for each request.

Staging Experiment 2: Shrink Heap Size by 30%

While the first staging experiment shows the scenario where we want to spend more hardware resources. The second staging experiment shows the scenario where we want to cut cost in order to shrink the DRAM usage. We shrink heap size by 30%. We observe that JSegcache achieves a similar miss ratio by using a smaller heap and it has a lower GC pause. In conclusion, JSegcache enables better performance and lower cost compared to off-heap cache.

Takeaways

I talked about the three trends that motivates the design of Segcache: the hardware trend, the workload trend, and the cache usage trend. I'm going to talk about the three design philosophies and tradeoffs in Segcache: maximize sharing, be proactive, and macro management. Then I talked about how we reduce microservice tax by adding a small in-process cache in our JVM based microservices. There are two other important messages I want you to walk away with. First, expiration and eviction are different. Some people mix this together, and think expiration is a way of eviction, or eviction is a way of expiration. This is not true. Expiration is triggered by a user but eviction is an internal operation through the cache. We shouldn't mix them together. These are two different things. Second, key-value cache is not a key-value store plus eviction. Key-value cache and key-value store have completely different design requirements, even though they share the same interface. Next time, if someone says Memcached is a key-value store in DRAM, it's correct then because Memcached never offers you an option to disable eviction, it's never meant to be a key-value store. Rather, it's a key-value cache. Last, Segcache is open source and JSegcache will be open sourced.

See more presentations with transcripts

Recorded at:

Jun 29, 2023

Juncheng Yang

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?