Big Memory .NET Part 1 – The Challenges in Handling 1 Billion Resident Business Objects
This article describes the concept of Big Memory and concentrates on its applicability to managed execution models like the one used in Microsoft's Common Language Runtime (CLR). A few different approaches are suggested to resolve the garbage collection (GC) pause issues that arise when a managed process starts to store more than a few million resident objects.
Why do we need to store so many objects in memory?
Say we store several hundred million addresses in an application that needs to calculate routes. The application builds an object graph and stores it together with the addresses, then keeps these objects in memory for several days while processing thousands of queries a second.
Or consider social data with lots of messages and responses. An average site may easily add tens of thousands of "social records" daily. Fetching those bits from the database is slow. Many people use out-of-process caches, but my data is already here, in-process. I want a Dictionary with millions of entries.
Then there is the "Internet of Things", known for producing countless records as devices generate data from sensors (such as fitness trackers) on their own, without human intervention. When users log in to a portal they want to quickly see an "overview" of daily activities, e.g. a pulse plot. A Big Memory cache is a perfect fit for this kind of problem, as the whole daily series from every user device can be kept in RAM.
About Big Memory
The term "Big Data" is nothing new. It describes huge data volumes in general, whether on disk, on the network, or anywhere else. Big Memory facilitates Big Data activities by doing more processing on a server, or a tight cluster of servers, while keeping the data in RAM. The Big Memory approach is also conducive to real-time querying, aggregation, and analytics. Think map/reduce in real time: not a Hadoop job that has to "start" and then run until done, but a "real-time Hadoop" that keeps working on data in RAM.
Big Memory comes in different forms, primarily heaps and composite data structures. Just like in any "regular" programming language, all complex/composite data structures like lists, trees, dictionaries, sets etc. are built around heap primitives such as alloc/read/write/free.
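The alloc/read/write/free primitives above can be captured in a small interface. The sketch below is hypothetical (the names `IBigMemoryHeap`, `MemRef` and the toy dictionary-backed implementation are illustrative, not a real API), but it shows the contract that composite structures are built on:

```csharp
using System;
using System.Collections.Generic;

// A handle into the Big Memory heap; plays the role of a "reference".
public readonly struct MemRef
{
    public readonly int Segment;   // which backing chunk
    public readonly int Offset;    // position within that chunk
    public MemRef(int segment, int offset) { Segment = segment; Offset = offset; }
}

// The four heap primitives mentioned in the article.
public interface IBigMemoryHeap
{
    MemRef Alloc(object value);                  // store an object, get a reference back
    object Read(MemRef reference);               // materialize the object
    void   Write(MemRef reference, object val);  // replace the object in place
    void   Free(MemRef reference);               // release the slot
}

// Toy single-segment implementation, just to make the contract concrete.
public sealed class ToyHeap : IBigMemoryHeap
{
    private readonly Dictionary<int, object> _slots = new Dictionary<int, object>();
    private int _next;
    public MemRef Alloc(object value) { _slots[_next] = value; return new MemRef(0, _next++); }
    public object Read(MemRef r) => _slots[r.Offset];
    public void Write(MemRef r, object val) => _slots[r.Offset] = val;
    public void Free(MemRef r) => _slots.Remove(r.Offset);
}
```

A dictionary, a queue, or a tree can then be layered on top by storing `MemRef` handles instead of CLR references.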
Big Memory heaps provide a familiar object model: you obtain references when you allocate objects, dereference them to access data, and free up space by deallocating by reference. Heaps are very useful for things that need to be traversed, like graph databases or neural networks. In these kinds of systems a reference is all you need to access another object; there is no need to look up by key.
Big Memory data structures, including lookup tables, queues, and trees, are then modeled on top of the Big Memory heap.
Big Memory caches are used for lookups; they are actually a form of dictionary/lookup table. Caches are used everywhere these days, and are usually built on top of heaps. Like many other data structures, a cache is a layer of indirection between a caller and a heap.
What is interesting is that both Big Memory heaps and caches may be either local (one machine with 64 gigabytes of RAM or more) or distributed. An important requirement is speed: since we are in memory, access can be very fast. Terracotta has been pushing the Big Memory concept in the Java world for a few years now, and Redis and memcached are good examples of quick in-RAM storage.
So, what qualifies a solution as a Big Memory one? It is the purposeful memory utilization pattern. If your solution is purposely built to store/cache lots of data at the expense of memory consumption, so that it can do real-time tasks that would otherwise have been deferred or batched, it is a Big Memory solution. For example, instead of running a Hadoop job for 5 hours every night against many shards in your DB, you may get the end result at any time in seconds from your web servers if they maintain all the rollups in RAM. Storing 100 million pre-computed keys in RAM, updated with every transaction, is orders of magnitude faster than re-querying any data store, provided those updates are themselves very fast. In order for this pattern to be effective, the memory utilization has to be specifically designed for this purpose. It isn't enough to simply allocate multiple gigabytes of memory in a haphazard fashion.
Garbage Collection and Big Memory? There is a Problem!
GC-managed memory is acceptable for most applications and use cases. But GC does not work well with big in-memory heaps, caches, and other data structures. This issue has been well known in the Java community for many years; here is a video about low-latency JVM tuning that describes similar problems and solutions.
Periodically the GC makes "major" scans. With enough RAM, those scans can "stop the world" for seconds. Not only does this suspend all operations for the duration of the GC cycle, it also empties the CPU's low-latency caches, so it takes even longer before you return to full speed. Worse, from the user's perspective these pauses happen at random. While there are heuristics involved that consider current memory pressure, low-water marks, and other checks and balances to determine when the GC will run, they are hidden from the user.
Here is the non-obvious problem: the more physical memory your host has, the more prone it is to GC pauses. Why? When you give the GC more RAM to work with, it procrastinates on collecting Gen2; that is, it keeps postponing the "major work" as long as it can. By allocating many objects very fast you create enough GC pressure to start a sweep even when 50% of memory is still free. If, on the other hand, you allocate objects slowly, say around 250 per second, the GC cycle is delayed. On my machine with 64 GB, the GC does not kick in with a full scan until I use around 60 GB. Then there is a pause lasting between 3 and 30 seconds, depending on the size of each object.
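You can get a feel for this on your own hardware with a few lines of code. The sketch below retains a few million small objects and then times a forced, blocking Gen2 collection; the absolute numbers are entirely machine- and runtime-dependent, and a forced collection is only an approximation of the pauses the GC induces on its own schedule:

```csharp
using System;
using System.Diagnostics;

class GcPauseDemo
{
    // Retain `count` small long-lived objects, then time a forced blocking Gen2 scan.
    public static long MeasureFullGc(int count)
    {
        var retained = new object[count];
        for (int i = 0; i < count; i++)
            retained[i] = new byte[32];   // many small objects for the GC to visit

        var sw = Stopwatch.StartNew();
        GC.Collect(2, GCCollectionMode.Forced, blocking: true);
        sw.Stop();

        GC.KeepAlive(retained);           // keep the graph alive through the collection
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        const int n = 5_000_000;
        Console.WriteLine($"Full GC over {n:N0} retained objects took {MeasureFullGc(n)} ms");
    }
}
```

The pause grows with the number of live objects the collector has to trace, not with how busy your code is, which is exactly the problem for a process holding hundreds of millions of resident objects.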
This can be particularly troubling when the workload dramatically increases, for example when the servers are mostly idle throughout the night and then kick into high gear as employees log in at the beginning of their shift. This is exactly when the GC tries to collect as little as possible (so that no long stop-the-world pause happens), yet it is already almost out of resources because the heap has been slowly growing for hours and is now at 90% RAM utilization.
What makes matters worse, those patterns are very hard to predict, as the business workload may change dynamically. Modern APIs such as GCLatencyMode do help in some cases, but they make the whole system more complex and brittle.
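For completeness, this is roughly what using that knob looks like. The wrapper below (the `LatencyWindow` helper is an illustrative name, not a framework type) switches the runtime into `SustainedLowLatency` mode for a latency-sensitive window; note that this only discourages blocking Gen2 collections, it does not forbid them:

```csharp
using System;
using System.Runtime;

static class LatencyWindow
{
    // Run `work` with the GC asked to avoid blocking full collections.
    public static void RunCritical(Action work)
    {
        GCLatencyMode previous = GCSettings.LatencyMode;
        try
        {
            // Hint, not a guarantee: full GCs may still occur under memory pressure.
            GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
            work();
        }
        finally
        {
            // Restore the old mode, otherwise the whole process keeps the hint.
            GCSettings.LatencyMode = previous;
        }
    }
}
```

This is the "more complex and brittle" part: every latency-sensitive path has to be wrapped, and the setting interacts with memory pressure in ways that are hard to reason about.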
But we need to store many objects for real life Big Memory applications!
IO bandwidth is a major limitation on application throughput and latency. For more complex queries, the database may take several minutes to fetch the data from disk. This simply isn't acceptable to business users who expect real-time responses.
On one hand, we want to use .NET and managed memory because we don't want to step back to the unsafe, costly development of unmanaged applications in languages such as C++, C, and Delphi.
On the other hand, GC does not allow us to efficiently work with big caches that store a large number of native objects (i.e. reference types) and keep them around for a long time, while still preserving OOP goodies such as virtual methods and interfaces. You can densely pack structs into arrays, but lose the ability to work with subclasses.
The proposed solutions are roughly based on two approaches:
- Make objects, or at least their data/state, “invisible” to GC - this way GC has nothing to scan, or
- Keep objects visible to GC but reuse them at the same location (object pool).
So, one could use:
- Structs instead of classes. Since the GC does not "see" structs (unless they are boxed), this relieves the pressure on the managed GC system. Keeping data this way requires working with large arrays of structs.
The disadvantage of this approach is that struct types cannot represent referential models (e.g. parent/child links) that may be needed within the business logic. Furthermore, structs do not support inheritance, which means they can only be used for the simplest of DTOs. There are workarounds for this, but they entail simulating OOP features in a rather opaque fashion rarely seen outside of advanced C programming (i.e. the Linux kernel).
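A minimal sketch of the struct approach, with hypothetical names: one large array of a plain value type holds the same data as millions of objects, yet from the GC's point of view it is a single allocation with no references to trace. The price is visible immediately: no subclassing, no virtual methods, no object references between records.

```csharp
using System;

// Plain value type: no inheritance, no virtual methods, no reference fields.
struct PricePoint
{
    public long InstrumentId;
    public double Bid;
    public double Ask;
}

static class StructStore
{
    // One array object instead of `count` heap objects. Because PricePoint
    // contains no reference-type fields, the GC never scans the elements.
    public static PricePoint[] Allocate(int count) => new PricePoint[count];
}
```

Cross-record relationships then have to be encoded as array indices rather than references, which is exactly the "simulating OOP in C style" workaround the article mentions.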
- Pre-allocated object pools. This can stop the GC from shifting objects around in RAM: the objects are "almost pinned", as they just keep being recycled at the same address (most of the time).
Just like the structs approach, this one is not transparent enough. These objects would have to be designed so they are able to recycle their internal state; you would not be able to do a regular new and Dispose/using (if needed). All transitive references would have to be "pool-prepped" as well - again, extra work. Worse, an object pool still keeps those objects as CLR objects: the GC sees them and has to visit them, so this does not really relieve any GC pressure.
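A minimal pool sketch, with hypothetical type names, makes the extra chores concrete: every pooled type needs a reset routine to scrub its state, and callers must rent/return instead of using new and using:

```csharp
using System;
using System.Collections.Concurrent;

sealed class Order
{
    public long Id;
    public decimal Amount;

    // The "recycle internal state" chore: the pool, not the GC, is
    // responsible for making a returned instance look freshly constructed.
    public void Reset() { Id = 0; Amount = 0m; }
}

sealed class OrderPool
{
    private readonly ConcurrentBag<Order> _bag = new ConcurrentBag<Order>();

    // Reuse a recycled instance when one is available; allocate otherwise.
    public Order Rent() => _bag.TryTake(out var o) ? o : new Order();

    public void Return(Order order)
    {
        order.Reset();
        _bag.Add(order);
    }
}
```

Long-lived pooled instances migrate to Gen2 and stay put, which avoids churn; but as the article notes, they remain CLR objects that every full GC scan still has to visit.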
- Unmanaged RAM + marshalling. The only thing I like about this is that "marshal" reminds me of Tommy Lee Jones' stellar performances in "The Fugitive" and "US Marshals". The idea of copying memory to and from an unmanaged heap is interesting, but without true "walking" of references it is really not a generic solution. One cannot just copy a buffer from one RAM chunk to another.
Yes, there are "zero-copy" serializers, but they use special memory formats incompatible with CLR objects. To use zero-copy, one needs to adopt those object layouts in memory - so this actually shifts serialization work from the "zero-copy serializer" that does nothing, to your business code. Keep in mind, the cost is still there. This approach may be great for some cases, but it cannot be used as a generic solution for polymorphic CLR objects.
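The marshalling approach can be sketched in a few lines (the `Reading` record and `UnmanagedSlot` helper are illustrative names): the data lives outside the GC heap, but every access pays a copy, and only blittable state survives the trip - reference graphs cannot be stored this way.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Reading
{
    public long DeviceId;
    public double Value;
}

static class UnmanagedSlot
{
    // Copy a value out of the GC heap into unmanaged memory; the GC
    // never sees this allocation, but we now own its lifetime.
    public static IntPtr Store(Reading r)
    {
        IntPtr p = Marshal.AllocHGlobal(Marshal.SizeOf<Reading>());
        Marshal.StructureToPtr(r, p, fDeleteOld: false);
        return p;
    }

    // Copy the value back into the managed world - this is the per-access cost.
    public static Reading Load(IntPtr p) => Marshal.PtrToStructure<Reading>(p);

    public static void Free(IntPtr p) => Marshal.FreeHGlobal(p);
}
```

Note that `Reading` holds only primitives; add a string or a child object and the simple copy no longer works, which is the "not a generic solution" limitation above.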
- A limited number of pre-allocated byte arrays into which smaller data objects are serialized. This would require a true memory manager that allocates chunks. It is probably very slow due to serialization overhead and most likely unusable as-is.
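To make the last idea concrete, here is a toy sketch (hypothetical `ByteSegmentStore` name, fixed-size records only): a handful of large byte[] segments look like just a few objects to the GC, records are serialized into offsets within them, and the offsets act as "pointers". The serialization on every read and write is precisely the overhead the article warns about.

```csharp
using System;

sealed class ByteSegmentStore
{
    private readonly byte[] _segment;   // one big allocation; the GC sees one object
    private int _free;                  // bump-pointer allocator, no free() in this toy

    public ByteSegmentStore(int size) => _segment = new byte[size];

    // Serialize a fixed-size record (two longs here) and return its offset.
    public int Put(long key, long value)
    {
        int at = _free;
        BitConverter.TryWriteBytes(_segment.AsSpan(at, 8), key);
        BitConverter.TryWriteBytes(_segment.AsSpan(at + 8, 8), value);
        _free += 16;
        return at;
    }

    // Deserialize the record back - a copy on every access.
    public (long Key, long Value) Get(int offset) =>
        (BitConverter.ToInt64(_segment, offset),
         BitConverter.ToInt64(_segment, offset + 8));
}
```

A real implementation would also need variable-size records, deallocation, and segment management - a genuine memory manager, as the bullet point says.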
So what do we do now? Can a CLR process efficiently keep hundreds of millions of my objects resident beyond Gen 0 without stalls?
In Part 2, Dmitriy Khmaladze introduces his solution to this problem: NFX Pile. Pile is a hybrid memory manager written in 100% managed C# code, designed to act as an in-memory object warehouse.
About the Author
Leonid Ganeline has 15 years of experience as an integration developer, has received Microsoft MVP awards in integration, and is a blogger. He enjoys everything about software, as well as traveling, reading, and running.