
Caching Beyond RAM: The Case for NVMe


Summary

Alan Kasindorf explores the possibility of using new storage devices to reduce DRAM dependency for cache workloads and talks about use cases that optimize for different cache workloads.

Bio

Alan (Dormando) Kasindorf is the maintainer of the open source Memcached project. Previously, he worked on Memcache/Mcrouter at Facebook and was Director of Edge Engineering at Fastly.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

I'm Alan [Kasindorf], also known as @dormando on most social networks and services. I've run the memcached project since Brad Fitzpatrick left it about 10 years ago. There's the link to the blog. The blog is relatively new; it's actually on the official memcached website. Consider it a companion to this talk. The posts on there are also about this excursion into NVMe storage devices, and they go into more technical detail, more examples, more use cases, and lots of extra information.

First off, who here is actually running memcached in production? Show of hands? A couple. That's good. How many of you used to run it in production? Nice, ok. How many of you know the average size of the items in your data cache, whether it's Redis or memcached or anything else? I know you do. That's good. We'll come back to that later. The only other point I want to make before we get started is that you can go to memcached.org, download the source tarball, and run just about everything we talk about today. It's already in production at some large companies and small companies and seems to work.

Why RAM?

First, I want to go back over why we ended up on RAM in the first place for distributed cache systems - a little bit of history. Back in 2001, 2002, working for LiveJournal, we had a shared memory cache on every individual web server. This was about 64 megs, 120 megs - not a lot in there. The most we could put in there were translation strings, things about the site layout, icon paths, URLs, those sorts of things.

The problem was that if we started doing active invalidation, which would be required for caching much larger amounts of data, we were still relatively limited by the amount we could put in any of these processes. And if we started doing broadcast invalidation, we'd start burning CPU on our already fairly loaded web nodes. So all we really did instead was let it expire: after five minutes, we'd reload the translation strings. And that actually still had a fairly high hit rate. You'll recall that memcached's primary innovation is just hash the key and go directly to the node you want to talk to. Pretty much everything since then is based off of this basic idea, just much more complicated. There's multi-hashing, there are things like CRUSH hashing to change the layout, there are lots of databases that are based around this same central concept. Unfortunately, it was created after I left LiveJournal, so I didn't get the benefit of it at that company, but I used it everywhere else.
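
To make that routing idea concrete, here is a minimal sketch, assuming a hypothetical three-node pool; the modulo scheme keeps the example tiny, while real clients use consistent hashing so that adding or removing a node only remaps a fraction of the keys.

```python
import hashlib

# Hypothetical pool; in practice this comes from your client configuration.
SERVERS = ["cache1:11211", "cache2:11211", "cache3:11211"]

def server_for(key: str) -> str:
    # hash(key) -> one node: go directly to the server that owns the key.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]

print(server_for("user:1234:session"))  # always lands on the same node
```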

Also back then, we didn't have 64-bit operating systems. This was terrible. Really, really awful - I can't express how awful this was. Our web servers had almost as much RAM, or could easily have as much RAM, as our databases. And our databases were much more expensive - say, $20,000 versus $1,000. This got a little bit better with time, but it took a long time, until x86-64 and AMD's Opterons arrived. So what we did initially was just fill the extra RAM slots on these web nodes. We were only using about a gigabyte of RAM before.

And so we had two to three more gigs we could burn. Put a nice two-gig memcached process in each box. It worked really well with the memcached layout, because we had lots of web servers, and if we lost one it wasn't that big of a deal - just a small hit to the cache hit rate. And the CPU usage of the early versions of memcached was especially low. It did absolutely no background processing; it just cared about being an LRU and stayed out of the way. It didn't actually affect the web processing that much.

We threw in all this extra memory, and now we had a whole ton of RAM compared to the databases. And this had a lot of effects that weren't obvious at the time. Primarily, the databases were more free to do their writes, which meant we could actually use all the disk space we were buying. See, we had these nice, big, expensive SCSI drives, and we were running them in RAID 10, cutting the storage space in half, and still only filling half of them before having to buy more. And most of our master databases had lots of replicas. So we got to throw away the replicas, or turn them into more primaries, and really just save money all around - not just because we were throwing RAM onto the web nodes.

Finally, we got flash - not just for heavy machinery anymore. Up until a certain date, these sorts of things were the standard for flash drives in computers. This one is heavy, it's metal, it was designed to run in factory machines. The primary selling point was that it could withstand 50 Gs of force before breaking. I'm not going to demonstrate that, but that was state of the art. It was slower than a normal laptop hard drive, and it could do maybe one write op; if you tried to write to it for a while, it would drop to one op for a minute. It wasn't really all that good. I still used them as boot disks - they were more reliable, they didn't spin, they didn't make heat.

Then Intel one day got the bright idea to make something that didn't suck. This is the X25-M, 80 gigs. I spent $700 of my own money on it in 2008, 2009, put it in my personal laptop, and suddenly my personal laptop was running SQL queries faster than my $20,000 database server at the data center. This was amazing and terrifying at the same time. We had fast, consistent read and write speeds, and much improved - but still kind of limited - total writes. The main drawback of flash, or one of the main drawbacks, is that you can only write to the device so many times before it burns out. The X25-Ms, and then the X25-Es, made a huge jump in the consistency of performance, the amount you could write, the amount you could read. But you can see the total number of writes was still pretty low. So not a great fit for small cache items, but really good for databases and some other cache platforms.

Fast forward a year or two. I was putting Varnish in front of TypePad, which was a popular blogging service run by Six Apart. We had recently rewritten the website to not generate all of its blogs statically. Before, blogs were generated in the background: every time you made a change, maybe half an hour later your blog would get updated with the new post. Now it was dynamic, except you could also still do really complicated templates - these nested sidebars that did all sorts of things and pulled in blog posts and embedded them. What you got with that was a 30-second render time on some people's blogs. So I decided to throw Varnish in front of it, because it's a reverse caching proxy, and I had HAProxy in front of that doing L7 load balancing, to create an analog to what we have with memcached, where a specific URL goes to a specific server.

This actually worked pretty well. I had a demo of it running. The CPU usage on the servers was almost nil, maybe 4 or 5%, and all we really needed in the machines was RAM. Coincidentally, this is similar to how memcached is deployed now: if you replace the load balancer with a router - Twemproxy, EVCache, that kind of thing - and you replace Varnish with memcached, you have something very similar. You have a big pool of machines, mostly idle, behind some kind of router or smart client.

So what I did at the time: I got these X25-Es, which could withstand a lot more writes and write a little bit faster, and I threw one in each of the servers, and suddenly I didn't need nearly as many. I turned my demo cluster into a full cluster, quadrupling the amount of data I could store cache-wise. It was fine, it was load tested, and I was proud of it. Then Fargo one day decided that every page load would also load their blog homepage for the release notes, and we didn't notice for a couple of hours - whereas before, it would have just taken down all of TypePad for that entire couple of hours.

So suddenly, we had a CDN. It turns out other people had similar ideas. One of them hired me, and I spent the next five years building Fastly, which was a similar idea: we flattened out the load balancer, so it's just Varnish servers behind Varnish servers. That kind of proved the concept that you could use flash devices with cache patterns - distributed cache patterns, globally distributed cache patterns.

These days - fast forward another seven or eight years - DigitalOcean, Linode, AWS, Google Cloud, pretty much every cloud provider gives you an SSD with your $5, $10, $20 VM. And they also have block storage behind that. I forget what they all call the different products, but they're similar: you can get a network block device, you can get several of them, you can RAID them together, and those things are backed by SSDs again. So I finally wanted to come back, take everything I'd learned from designing these systems and building these companies, and pull it back down into standard memcached in a way that everybody can deploy immediately - you don't have to think about it as much and you get immediate benefits.

Tradeoffs

To do this, we have to talk about the tradeoffs. After much thought, I decided that these flash devices - I showed you the one from 2008, which could do 3,300 writes; now they can do 200,000 to 400,000 to 500,000 writes and reads per second - still weren't quite good enough for every scenario. So I decided to accept a tradeoff and go for it. To fully understand how to use this thing, we're going to discuss the tradeoffs, and then hopefully you'll have a good idea of how to deploy it in your own systems.

First, I want to give an example of how people used to solve this problem. I didn't come up and say, "I'm going to solve this problem for the first time" - it had been done for a while. A great example is the Moneta open source system from Netflix. I think I see Scott in the audience - thanks for this. This thing is great, and it's pretty similar to a lot of the other designs I've seen. Instead of just memcached or a database, you have a proxy that runs locally on a machine, there's also a memcached that runs on the machine, and there's a separate instance running a database - an LSM tree or another disk-backed key-value store.

This is great for a lot of use cases. It actually solves the small item problem, where you can just store lots and lots of data. If you get a miss in RAM, you just go back to the storage and pull it back in. That's great. Unfortunately, there are a lot of steps in the miss process. It's still pretty performant - hot items will tend to stay in RAM. The downsides are the LSM tree's fairly inconsistent read performance - you might hit the top level, the bottom level, something in the middle, you might have to bounce off the Bloom filter - and that every miss has to go to the database, every delete has to go to the database, every mutation has to go to the database. There's also a little bit of overhead from the proxy, which can hurt at higher levels of performance.

And in general: I heard you like databases and caches, so I put a database in front of your database for your cache to cache. I just wanted to see if I could squish that all down and simplify it. To do that I asked one big question: are the small items that make up a good portion of people's workloads actually useful to put on disk? I went with no. So this is extstore's fundamental trade-off: I just don't optimize for small items. It's mostly optimized for items of one kilobyte or larger, although I've seen folks go all the way down to 300 bytes. Some of the data stays in RAM, some goes on disk. Compare this to databases, which have really good clustered indexes: if you're fetching lots and lots of items all at the same time, they're often related to each other in some way, so you're probably going to fetch them much faster with a range query on disk than by fetching them one at a time from 1,000 different memcached servers. If you have that problem, you're going to be stuck optimizing for it no matter what you do, so I'm going to leave that one alone.

Also, if small items make sense for memcached, if they make sense for your distributed cache, RAM is actually fairly cost-effective for them. You can write lots and lots of little items without burning out your flash, you can read them back very consistently, and it's actually cost-performant. So what do we do from here?

I took a very simple approach. There's a little bit of metadata - the previous and next LRU pointers, the key, and a pointer to disk - and those stay in RAM. So depending on your keys - if you have 200-byte keys with lots and lots crammed into them, you're going to burn a lot of RAM. You can shorten them fairly easily and get a lot more out of the system. The metadata, the key, and then the value go on disk; the metadata and the key are copied there, which helps with compaction. Eventually we do need to compact - everything that touches flash has a compaction process.
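
To get a feel for what keeping keys and metadata in RAM costs, here is a rough, illustrative calculation; the per-item header size is an assumed round number, not extstore's exact figure, and the point is simply that the RAM cost of the index scales with key length.

```python
# Rough arithmetic for the RAM/disk split described above.
ITEM_HEADER_BYTES = 48  # LRU pointers, flags, pointer to disk, etc. (assumed)

def ram_per_item(key_len: int) -> int:
    return ITEM_HEADER_BYTES + key_len

items = 100_000_000
for key_len in (20, 80, 200):
    gib = items * ram_per_item(key_len) / 2**30
    print(f"{items:,} items with {key_len}-byte keys -> ~{gib:.1f} GiB of RAM for the index")
```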

Fortunately, this one is pretty fast. So, writes: this is the little LRU we have on the left here, and the tail of the LRU gets flushed into a sequential write buffer. That write buffer gets flushed to disk periodically, or whenever it's full. We can read back from the write buffer if you get a hit on it. Many SSDs really don't like mixed reads and writes. If you have a fairly expensive SSD, or kind of a midline one - say it'll do 250,000 write IOPS on spec - and you mix all those reads and writes, it's going to drop to 20K or less, maybe 12K. You can spend $3,000 on a card, get 12K out of it, and be really confused for quite a lot of your life. This is similar to LSM trees: they buffer, they have an SSTable, and they flush it out, and that's all great.
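
Here is a toy model of the write path just described - values appended to a sequential buffer, hits served straight from that buffer, and one large sequential flush to the device. This is a sketch only, not extstore's actual code.

```python
class SequentialWriteBuffer:
    """Toy model: values coming off the LRU tail are appended to an in-memory
    buffer, hits on unflushed data are served from that buffer, and the whole
    buffer goes to the device as one large sequential write, so reads and
    small writes never interleave on the SSD."""

    def __init__(self, device, capacity=16 * 2**20):  # 16 MB, as in the talk
        self.device = device          # file object opened in binary append mode
        self.capacity = capacity
        self.buf = bytearray()
        self.offsets = {}             # key -> (offset, length) inside buf

    def append(self, key, value: bytes):
        if len(self.buf) + len(value) > self.capacity:
            self.flush()
        self.offsets[key] = (len(self.buf), len(value))
        self.buf += value

    def read(self, key):
        off, length = self.offsets[key]   # hit on not-yet-flushed data
        return bytes(self.buf[off:off + length])

    def flush(self):
        self.device.write(self.buf)       # one big sequential write
        self.buf = bytearray()
        self.offsets.clear()
```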

The downside is that you're still limited by RAM. If you have really large keys, or your items are fairly small, maybe you can't put as much on disk - but if you're using these VMs, maybe that doesn't matter that much. You might get a VM with four gigs of RAM and 20 gigs of SSD; you can still double the amount of data you can store - or, better phrased, halve the amount of RAM required for your workload - by using the SSD already available to your VM. So you can cut your costs in half even if you're only doubling the amount of data you can store.

And the compaction algorithm I mentioned earlier is very similar to that write buffer: pages get read back in through it. Say you have a 16-meg write buffer; when we compact something, we just pull back 16 megs all at once. We blindly read through it, because we have the metadata and the keys in there, and we check back in RAM: is this thing still valid? OK, save it. If it's not still valid, throw it away. We even have a pre-calculated hash value in there, so compaction can saturate the disk very, very quickly. Compaction on an LSM tree, as many of you probably know, is fairly slow and burns a lot of CPU.
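
A sketch of that compaction pass, assuming a hypothetical parse_entries helper and in-RAM index object; the point it illustrates is that validity is decided purely from RAM, so the disk only sees large sequential reads and writes.

```python
def compact_page(page_bytes, parse_entries, ram_index, write_buffer):
    """Pull one whole page (e.g. 16 MB) back from disk, walk its
    (key, precomputed hash, value) entries, and ask the in-RAM index whether
    each item is still valid. Survivors are re-appended to the current write
    buffer; stale items are simply dropped. The helper and the index's fields
    are assumptions for illustration, not extstore internals."""
    for key, key_hash, value in parse_entries(page_bytes):
        entry = ram_index.get(key_hash)          # precomputed hash: no re-hashing
        if entry is not None and entry.key == key and entry.points_at_this_page:
            write_buffer.append(key, value)      # still valid: relocate it
        # otherwise: deleted, overwritten, or expired - nothing to do
```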

And finally, the performance is very consistent. In the earlier case with Moneta, if you get a miss, you go to the database. If you get a miss in this thing, since the hash table in RAM is authoritative, you're done. If you do a delete, it just removes the metadata from RAM, forgets about the position on disk, and it's done. If you're doing a few overwrites of something that's in memory, you're done - it doesn't have to touch disk. There's a little bit of extra metadata for extstore so it knows how much valid data is on disk and when to compact. This removes a lot of the writing that a cache pattern normally requires. If you compare a cache pattern to a database pattern, with a cache you're normally writing a lot more: overwriting data, reloading it, things falling out of their TTLs. You're just throwing things away, because the TTLs tend to be lower and it's pre-calculated data - and you can burn out your flash drives that way. If we take this tradeoff, we cut a lot of that burn, a lot of those little writes, and the only writes we do are the flushes to disk. We don't mix in reads and writes, so we keep very consistent performance.

This works great with a lot of workloads. I've seen Netflix throw everything very aggressively onto this. For the VM situation: everybody is putting sessions in their caches. The number one use case for memcached is probably PHP sessions, and you can probably quadruple the number of PHP sessions you can hold on your $5-a-month VM with this thing. There's even a blog post on the site detailing how to set that up very quickly, so you can go try it out, throw it behind something, and see what happens.

Also, a lot of people, when they actually go look - telnet to their memcached and run the "stats slabs" command - will come back and see that 50% of their RAM is in maybe 40,000 items, and all that memory could just be pushed to disk and reclaimed. Because generally, in the long tail, with the tradeoff we took - we have these large items, they take a while to recompute, a while to read over the network. Adding a little bit of disk latency, so long as it's consistent, so long as we don't have to do a secondary lookup, bounce off a Bloom filter, do all these extra hashes and multi-level reads - that little bit of extra latency from going to disk doesn't add up to much compared to the hit you're already paying on the network: the syscall overhead of reading multiple packets, decompressing that huge value, parsing it before you actually use it. Some people are serving almost everything off of disk - 99% of their hits are coming from disk - and they say, "We don't even notice latency differences versus RAM," which is amazing.
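
If you don't know your average item size (the question from the start of the talk), one quick way to estimate it is to divide the "bytes" stat by "curr_items" from a plain "stats" call. A minimal sketch, assuming your server is reachable at the default host and port:

```python
import socket

def average_item_size(host="127.0.0.1", port=11211):
    """Estimate the average cached item size from memcached's general stats
    (bytes stored divided by current items)."""
    with socket.create_connection((host, port)) as s:
        s.sendall(b"stats\r\n")
        data = b""
        while b"END\r\n" not in data:
            data += s.recv(4096)
    stats = dict(line.split()[1:3] for line in data.decode().splitlines()
                 if line.startswith("STAT "))
    return int(stats["bytes"]) / max(1, int(stats["curr_items"]))

if __name__ == "__main__":
    print(f"average item size: {average_item_size():.0f} bytes")
```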

The Future

I want to talk a little bit about the future - things I've recently gotten working, or am getting working, or planning to do - and then we'll talk about performance, because if I had front-loaded the performance numbers, everybody could have walked out early. Performance for this thing is better than I expected. Netflix has been putting a lot of stuff on it; I keep pushing them, they've been pushing back, and we've been working together pretty closely for over a year on this.

Another company I was working with just took all their Google Cloud instances and swapped them for ones with less RAM and a ton of disk. They 10x'd their cache storage and took load off their backend just by rebooting all their cache servers over a week, and it more or less worked. And there's another company - I wish I could have gotten permission to say their names before getting up here - that I was talking to last week; they're finishing their testing and about to cut over. They're putting terabytes of cache behind this and it seems to be pretty great for them.

One thing I recently got working is JBOD storage. Initially, this thing could only really use one device, and for most people you could do MD RAID and it didn't matter that much. But now you can do JBOD, where it takes all the pages that make up the file on disk and stripes them over multiple devices. This is useful if all your devices have similar speed and similar size - say you're taking network block devices on a cloud service and you want to stitch a couple of them together to get a little more performance, which is how people have historically deployed MySQL on EBS and so forth. You can do that with this. It works in 1.5.10 and newer, so you can go download the tarball and try it out.
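
A toy illustration of the JBOD striping idea - pages assigned round-robin across devices of similar size and speed. The device paths are made up for the example.

```python
# The 64 MB pages that make up the storage file are spread round-robin
# across the available devices.
DEVICES = ["/mnt/nvme0/extstore", "/mnt/nvme1/extstore", "/mnt/nvme2/extstore"]

def device_for_page(page_number: int) -> str:
    return DEVICES[page_number % len(DEVICES)]

for page in range(6):
    print(f"page {page} -> {device_for_page(page)}")
```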

Tiered storage. I have a pull request that's mostly working; I did an experiment with it a couple of months ago, and I've still got another big dent to put in the code, but this is kind of exciting. Earlier, I talked about compacted pages. There are actually multiple streams of pages inside extstore. Extstore organizes pages into buckets: there's a default bucket, and you have 64-meg pages that are broken down into these buckets. When you try to write to a page, it says, "OK, this item goes to default; I'm going to go assign a default page to hold all these similar items." When I start compacting something, I co-locate all the compacted items into similar pages - that's the compaction bucket.

What happens over time is that the items in the cache that survive compaction tend to be longer-lived. They have longer TTLs and are less likely to get overwritten. They're likely to get hit now and again, but not very hard. So you end up lowering the amount of fragmentation and lowering the amount of compaction you have to run, because these older items just don't fragment as quickly.

There are also low-TTL buckets. If you have items that will expire in less than a day, you don't ever want to re-compact them, so you can string those into their own bucket. There are a couple of these, and hopefully I'll expand on that in the future. With tiered storage, you can say: I have this one-terabyte cheap SSD, put my compacted pages on it - the performance is a little bit lower, but that's okay. Or: I have a one-terabyte cheap SSD and I want to put the low-TTL stuff on it; I don't care about burning it out all that much. Or: I have an Optane device from Intel, which is very, very fast; I want to put my default items there and compact onto something else, since the Optane is fast but small. So this is very exciting, I think, and I'm hoping we'll be able to get more features into it, where we can dynamically add and remove devices.
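
Here is a sketch of how that page-bucket routing might look; the bucket names, TTL threshold, and device paths are illustrative, not extstore's actual configuration syntax.

```python
LOW_TTL_SECONDS = 24 * 3600

def bucket_for(ttl_remaining: int, survived_compaction: bool) -> str:
    if ttl_remaining < LOW_TTL_SECONDS:
        return "low_ttl"       # short-lived: never recompact, just let it die
    if survived_compaction:
        return "compacted"     # long-lived, lightly hit: fine on cheaper flash
    return "default"           # fresh writes: keep on the fastest device

# With tiered storage, each bucket could then be pinned to its own device:
BUCKET_DEVICE = {
    "default":   "/mnt/optane/extstore",     # fast but small
    "compacted": "/mnt/cheap_ssd/extstore",  # big, slower, cheap
    "low_ttl":   "/mnt/cheap_ssd/extstore",  # okay to burn out
}
```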

And finally, persistent memory - the thing from Intel Optane. The DRAM-form-factor DIMMs are finally starting to happen after years and years of delays, and they're very exciting. I've been doing a lot of performance testing and development work on this with the help of Intel, and I'm hoping we can evolve this memory-and-disk split with persistent memory. The way I'm going to attempt to do this initially is to take the hash table we already have in memory and make it a little bit wider - make the buckets a little bit wider, put a little bit of the metadata in there, like a 64-bit hash of the key inside the actual hash bucket - and then point the keys and the values onto persistent memory. We do that because persistent memory still has write limits, still has IOP limits. They're much, much higher than a flash drive's, and the latency is much, much lower, but still, in designing this cache system we go for broke, we go for the maximum. That means we get the same performance benefits as extstore: if you get a miss, you're done; you do a delete, you're done; you don't have to rewrite in a lot of these use cases. So we get a lot more out of this very expensive persistent memory.
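
A rough sketch of that layout, with assumed field and function names: the DRAM bucket carries a 64-bit hash of the key, so lookups are mostly resolved in DRAM and only a confirmed candidate follows the pointer into persistent memory.

```python
from dataclasses import dataclass

@dataclass
class BucketEntry:
    key_hash64: int    # 64-bit hash of the key, stored inline in the DRAM bucket
    pmem_offset: int   # where the full key and value live in persistent memory

def lookup(bucket_entries, key, key_hash64, read_pmem):
    """The scan happens entirely in DRAM; persistent memory is only touched to
    confirm a candidate match, so misses and hash rejections spend no pmem
    read bandwidth. (Illustrative sketch, not the actual design.)"""
    for entry in bucket_entries:
        if entry.key_hash64 != key_hash64:
            continue                                  # cheap reject in DRAM
        stored_key, value = read_pmem(entry.pmem_offset)
        if stored_key == key:                         # confirm against real key
            return value
    return None                                       # miss resolved from DRAM
```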

Wrap up: Performance

I guess I talked a bit fast, but we have quite a lot of data we can talk through, and we can have a little bit of a discussion if you folks are up to it. First of all, how many of you are running a Redis or a memcached and do more than 100,000 IOPS - queries or keys per second - per server? One person, two people, three people. So the rest of you are in for a surprise. This is a graph I did - it's going to be a little bit hard for me to see - where we tested one Optane, two Optanes, three Optanes in the JBOD configuration - let me get rid of these for a second. This is the 99th percentile latency versus a target request rate.

What I did was I had a benchmark program where I say: in increments of 50,000 QPS, add more load. You can see about here, around 500K, there's a big jump in latency; the real load starts to differ from the target load and the latency starts to go up. This is when the devices are saturated. The pale line here is Optane, the blue line is the SSD, and about 1,000 here is where - actually, let me just make sure this is showing up on the other screen. I'm going to rearrange so I can demo this properly. Here we go - clicking off stuff and it's not really reflecting on here. There, now I can see it a little bit better.

So 1,000 here - that's your one-millisecond target. A lot of people have this: if you serve the request in one millisecond or less, you're happy. You can do that up to - let's see - quite a lot for an Optane, and still down here for an SSD, about 250,000 QPS. And it isn't an expensive SSD - it's about $1 per gig for four terabytes - but still. There's the RAM baseline down here as well. This is the very worst-case-scenario benchmark, where my pacing isn't very good, so sometimes it's batching up a whole bunch of stuff, and I turned off all the internal batching in memcached. So if you're actually running this, you have a pretty good chance of getting better performance than this - if you do multi-key fetches, let extstore do some internal batching, or with some future improvements to the way we do internal batching or direct IO. This benchmark was done on buffered IO with no fancy async features. You can just start this, run it, not worry about what devices it's sitting on top of, and get this kind of performance.

With two Optane devices, it scales a little bit better before we hit the cliff, and gets to about 800,000 - I'm going to stop doing that. But with three devices, we've hit the limit of buffered IO; we start to hit contention inside the Linux kernel and things start to slow down again. The picture is a little bit different if we switch from the 99th percentile - I think that's the 90th - and suddenly the less contended devices are doing even better on average latency.

This is a preview of the next blog post I'll be doing - I've been writing it for quite a while. It will come with full scatter plots of all the latency samples taken during the tests, plus the complete test script and how it was actually run. We can do better from here. The Optane is giving very exciting latency numbers: less than a millisecond at way, way higher throughput, very close to RAM in these worst-case scenarios. And just a simple cost analysis: let's assume RAM costs about $9 a gig, which apparently would be a deal - it seems like $12 to $14 right now - and look at the cost, percentage-wise, of having your cache on Optane versus SSD. The red line is the SSD, the blue line is the Optane, and your cost drops quite dramatically as you put more of your cache on these devices. The blog posts go a little more into this; they show the breakdowns of what you need from the network and from item sizes to reach these numbers, these targets.
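
A back-of-the-envelope version of that cost comparison, using the prices mentioned in the talk (roughly $9/GB for RAM, $1/GB for a midrange SSD; both figures are illustrative):

```python
RAM_PER_GB = 9.0
SSD_PER_GB = 1.0

def blended_cost_per_gb(flash_fraction: float) -> float:
    # flash_fraction = share of cached bytes kept on disk instead of RAM
    return (1 - flash_fraction) * RAM_PER_GB + flash_fraction * SSD_PER_GB

for frac in (0.0, 0.5, 0.8, 0.95):
    cost = blended_cost_per_gb(frac)
    print(f"{frac:>4.0%} of cache on flash -> ${cost:.2f}/GB "
          f"({cost / RAM_PER_GB:.0%} of all-RAM cost)")
```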

That's actually it. There's the blog again, a wiki with the detailed documentation on how extstore works, and my Twitter handle again.

Questions and Answers

Man: For your benchmark, was the SSD SATA or PCIe-attached?

Kasindorf: PCIe-attached. You'll get relatively similar IOPS from the really good SATA controllers - not quite that high, but they won't be terrible. At Fastly, all our devices were SATA, but we had 20-plus of them per server, so you'd be getting similar performance if you JBOD all of those together - or probably more, actually. You can tell me I'm wrong, too; I'll start a fight if I designed this wrong.

Man: Do you still have to pay attention to how many PCIe lanes your particular device [inaudible 00:30:26] for the SSDs? Or what do you recommend there?

Kasindorf: Actually, an important detail: this particular benchmark was run pinned to one NUMA node. The test server has two 16-core CPUs, and I pinned everything to one node, so I'm using about half the bandwidth, and in the two- and three-Optane tests some of those devices were on the far end of the NUMA fabric. Generally it wasn't oversubscribed - especially since these servers now have lots and lots of PCIe lanes - but I wasn't doing anything terribly optimal. I didn't really try very hard. I'm very, very good at getting very good benchmark results, but when people go and install this, they're not going to get that - they'd have to go learn how to be a systems engineer for 10 years in order to optimize it.

I'm trying to do these benchmarks in a way that, if you really flubbed up everything, this is roughly what you would see. I didn't even turn off turbo or adjust the CPU governor or anything, so you can see the wobble in the benchmark from me not actually optimizing this properly.

Man: You said there are kernel things you're running into around the [inaudible 00:31:54]. What are the things affecting you there?

Kasindorf: Good question - and don't say "things" in a talk, that's a good point. Let me pull up the 99th percentile here. What's happening here is kswapd sitting at 100%: every time we try to read in something that's not in the page cache, it has to flush something out or throw something away and pull something else in, and that's more or less single-threaded in the Linux kernel. So this is a limit we can bypass once I add direct IO support and async IO support. I'm waiting as long as possible on that, trying to make sure I'm adding features that people actually want before they deploy it.

And also, as we pointed out earlier, only a couple of people raised their hands for more than 100K IOPS - this already blows those numbers out of the water. So what we'll get when we push harder is probably better, more consistent latency; that's what I'll be targeting more than just trying to get more throughput out of it. Write-wise, it can flush between 500,000 and 750,000 objects per second, and it does that off a single thread, so I'm not really too worried about that right now.

Man: You talked about usability features and things - what does someone have to do to set this up?

Kasindorf: So, usability features: this syntax is a new usability feature. It used to be that you had to specify the page size, the number of pages, and so on. Now you just say, "Here's the amount of storage I want at this path - just do it." If you download the source tarball, you add an extra configure option to enable it, compile it, and start it, and this just works. This command line really is all you need to start it and get it working. You'll probably want to add -m and more RAM, or higher numbers here. And there are a bunch of other things you can tune: the write buffer size, the page size, a couple of other things, the number of extstore IO threads - it has a static pool of background threads that actually write to and read from the disk.

I'm trying to have sane defaults. For most people, if you're using VMs or smaller systems, you probably don't need to change the defaults - this really is all you need. I'm always adding more stats, more statistical counters, more documentation; as people ask questions, I'm able to answer more of them as time goes on. Usability-wise, there are still a couple of gotchas: if you have lots and lots of small items that never expire, they can clog things up and remove some of the room that would normally be available for the item headers.


Recorded at:

Jan 12, 2019
