Big Memory .NET Part 2 - Pile, Our Big Memory Solution for .NET

In part one, Leonid Ganeline, of IT Adapter, introduced the concept of big memory and discussed why it is so hard to deal with in a .NET environment. In part two, Dmitriy Khmaladze, of IT Adapter, describes their solution, NFX Pile: a hybrid memory manager written in C# with 100% managed code.

I could not let the problem of garbage collection pauses go. Something was vexing me: the task I had in one of our projects involved a huge graph of users and addresses along with social messages, and all of that had to be traversed. A friend of mine told me LinkedIn used to have a C++ app where the whole blob of a social network was taken into RAM and kept there for months… hm... I just could not let it go. We started doing many experiments.

After pondering for a few days, we laid out some different techniques outlined in the prior article. None of them looked good to us.

DTOs suck. They create code you don’t need, chaff. I already have my objects nice and clean: Customer, Address, Link, and Message. Why would I duplicate them into structs? And then there are the references between Customer, Address, and Messages to deal with.

Pools would not have helped; I need to store hundreds of millions of entries. GC would just start pausing for five, sometimes ten, seconds every time. This was unacceptable.

So we started down the serialization-into-byte[] path. Yes, I know: slow. But is it?

This was a true “jump off the cliff” experiment. Who could estimate the practical benefit without building a true memory manager first? So we did. We built a memory manager à la C/C++, with free blocks and chunk headers - 100% managed code. No IntPtr-style unmanaged pointers, just an Int32 handle we call a PilePointer. We used our Slim serializer, which is akin to Microsoft BinaryFormatter, only about 5 times faster with 10x smaller payloads.

The end result was really good. We initially attained around 50,000 inserts/sec on a single thread while reading around 100,000 ops/sec on a single thread. Then we spent time on multi-threading synchronization and better allocation strategies, and everything started to perform really well.

The key piece, of course, is the Slim serializer. It is a non-versioning, CLR-only, packing native serializer based on dynamic expression-tree generation. The performance is close to protobuf-net; however, Slim does not impose any limits on the payload as long as it is serializable (ISerializable). The OnSerializing/OnDeserialized (OnSer/Deser) attribute family is supported as well.

We called the memory manager the “Pile” to differentiate from “heap” but retain a similar meaning. So, “The Pile” is a memory manager that brings an unmanaged memory flavor to the managed memory paradigm. As tautological as it may sound, the thing turned out to have a very cool set of practical benefits.

IPile Interface

IPile Implementation
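
The exact declarations live in the NFX source (see the links above and under Links). As a rough sketch of the surface this article describes - the names and members below are illustrative, not the verbatim NFX signatures:

    // Illustrative sketch only - see the NFX source for the real declarations.
    public struct PilePointer          // a value type, not a class - critical for GC behavior
    {
      public int Segment;              // which byte[] segment holds the data
      public int Address;              // offset of the chunk within that segment
    }

    public interface IPile
    {
      PilePointer Put(object graph);   // serialize an object graph into a segment
      object Get(PilePointer ptr);     // deserialize and return a fresh copy
      bool Remove(PilePointer ptr);    // free the chunk for reuse
    }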

The Pile is 100% managed code. It works as follows:

  1. A user needs to save a CLR object graph: var ptr = pile.Put(myGraph). The graph may contain cycles and complex objects like Dictionary<string, ConcurrentDictionary<...>>… you get the picture. Anything [Serializable] per the .NET definition should be Pile-able! On a 6-core, 3 GHz machine with 64 GB we were able to put 1.5 million simple objects (with 8 fields) every second!
  2. Pile serializes the payload using an intricate type-compression scheme, so if a CLR object takes 50 bytes, the Pile serializer produces 20 bytes. Then it finds a segment. Segments are byte[] chunks of at least 64 MB and up to 2 GB in size. Pile manages memory within a segment with free lists, scanning when needed, and fits the serialized data into the segment at Address (which is just an index into the byte[]). This way we get a PilePointer = struct{int Segment, int Address}. It is a struct, not a class. This is VERY important.
  3. You can “dereference” the PilePointer: var myData = pile.Get(ptr) (see the usage sketch after this list). The pointer returned by Put() is a form of “object ID”. If the pointer is invalid you get a “Pile access violation”: either the segment is not found, the address is out of segment bounds, or the byte position does not start with the proper chunk prefix - so there are some protection mechanisms. These are not 100% perfect, as theoretically one could dereference a freed pointer, BUT this is exactly what we wanted… the unmanaged flavor! I can dereference around 1-2 million objects per second this way.
  4. You can delete an entry to free its Pile chunk: pile.Remove(ptr). The memory is released so it can be reused.
  5. When segments become empty, they get truncated - the whole byte[] is “forgotten”. The GC can deallocate it in less than 10 ms. YES!!!!
  6. While Pile works, even under the heaviest load (with 60 GB allocated), a complete GC usually takes under 30 ms. Most of the time it is closer to 10-15 ms. Despite the fact that we churn through hundreds of thousands of objects a second (put/get), the GC stays very fast - because those CLR objects are all in Gen0: they live for <1 ms and are forgotten. The resident data copy is stored in the Pile, untouched by the GC. Instead of seeing 500,000,000 resident objects, the GC sees 200 resident objects (the segments) and 1,000,000 Gen0 objects that die instantly!
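
Putting the steps together, a minimal round trip looks like this (a usage sketch based on the calls above, assuming an IPile instance named pile; error handling omitted):

    [Serializable]
    public class Customer
    {
      public string Name;
      public List<string> Addresses;
    }

    // Put: the graph is serialized into a segment and we get back a value-type handle.
    var ptr = pile.Put(new Customer
    {
      Name = "Frank Drebin",
      Addresses = new List<string> { "123 Main St" }
    });

    // Get: every dereference deserializes a brand-new copy of the object.
    var copy = (Customer)pile.Get(ptr);

    // Remove: the chunk goes back to the segment's free list for reuse.
    pile.Remove(ptr);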

This is thread-safe, of course, and it also has a side benefit for caching: since Pile makes a new instance on every Get(), you can safely give out a copy of the object to each requesting thread, even when threads compete for the same object (the same PilePointer). This is like the “actor model” in functional languages, where a worker takes an order, mutates it into something new, then passes it along the chain - no need to lock().

Internally, Pile is 100% thread-safe for Put()/Get()/Remove() operations. The implementation is based on some clever ref swaps and custom-designed spin waits instead of lock(), as the latter is too slow for performance-sensitive operations.
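
The exact synchronization code is in NFX, but the general idea of replacing lock() with an interlocked flag and a spin wait can be sketched like this (a generic illustration, not the NFX implementation):

    using System.Threading;

    sealed class SpinGate
    {
      private int _gate;   // 0 = free, 1 = taken

      public void Enter()
      {
        var spinner = new SpinWait();
        // Atomically try to flip the gate; spin in user mode until we win,
        // avoiding the kernel transition a contended lock() can incur.
        while (Interlocked.CompareExchange(ref _gate, 1, 0) != 0)
          spinner.SpinOnce();
      }

      public void Exit()
      {
        Volatile.Write(ref _gate, 0);   // publish the release to other threads
      }
    }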

Test Results

We constructed a test app and tried to do similar work in both “native object format” and Pile.

An important note to keep in mind: when you compare the test results below against, say, Redis or Memcached, ask yourself: am I comparing apples to apples? Does the API give you back a .NET object, or a string you need to parse somehow to access “fields”? If it gives you a string or some other form of “raw” data, do not forget to account for the serialization time needed to turn the raw data into a usable .NET object. In other words, the string ‘{“name”: ”Frank Drebin”, “salary”: “1000000”}’ does not let you access “salary” without parsing. Pile returns a real CLR object - nothing to parse - so ser/deser time is already accounted for in the numbers below. When Pile stores raw byte[] or strings, its throughput figures should be multiplied at least five, if not ten, times.

Test machine: 6-core Intel Core i7 (Sandy Bridge), 3 GHz, 64 GB RAM.

Test object: 7 fields

Native .NET + GC on Windows, GC in “Server” mode, 1 thread:

Object size: around 144 bytes

Stable operation with <10 million objects in RAM before slowdowns start

Speed deteriorates (SpeeDet) after: 25 million objects added

Allocated memory at SpeeDet: 10 GB

Writes: average 0.8 million objects / sec (starts at millions+/sec then slows down significantly after SpeeDet), interrupted by GC pauses

Reads while writing: >1 million objects / sec interrupted by GC pauses

Garbage Collection stop-all pauses near SpeeDet: 1,000 - 2,500 ms (2,000 ms average)

Full Garbage Collection near SpeeDet: 1,800 - 5,900 ms

Pile on Windows, 1 thread:

Here is the source code for the test application.

Object size: around 75 bytes

Stable operation with >1,000 million serialized objects in RAM (yes, ONE BILLION objects)

Slow down after: 600 million objects (as they stop fitting in 64 GB physical RAM)

Pile memory: 75 GB

Allocated memory: 84 GB

Writes: 0.5 million objects / sec without interruptions

Reads while writing: 0.7 million objects / sec without interruptions

Garbage Collector stop-all pauses at 600M objects: none

Full Garbage Collection: stable time of <30 ms (15 ms average)

How is This Possible?

At first it may seem strange that our Pile solution is as fast or faster than the native GC, especially when you factor in the cost for serialization.

But when you think about it, it really isn’t that unexpected. The native GC approach, with native objects and no serialization, is thousands of times faster in a simple benchmark. But once you overload the GC with millions of resident objects and keep allocating, the process starts to pause. That is why the seemingly impossible test result is possible in a practical scenario: the GC just can’t handle tens of millions of objects if they stick around.
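
You can observe this effect with a trivial benchmark (a minimal sketch; absolute numbers will vary with machine, GC mode, and runtime version):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    class ResidentObjectsDemo
    {
      static void Main()
      {
        // Keep tens of millions of small objects resident, so they get
        // promoted and must be traversed by every full collection.
        var resident = new List<string[]>();
        for (int i = 0; i < 30000000; i++)
          resident.Add(new[] { "field1", i.ToString() });

        var sw = Stopwatch.StartNew();
        GC.Collect(2, GCCollectionMode.Forced, true);   // blocking full GC
        Console.WriteLine("Full GC: {0} ms with {1:N0} resident objects",
                          sw.ElapsedMilliseconds, resident.Count);
      }
    }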

As I have described above, there are several technical achievements in the NFX framework that move this solution from sci-fi to reality:

  • A very fast serializer (NFX.Slim). It is optimized for bulk serialization and deserialization; it does not slow down with large numbers of objects and packs serialized data very efficiently.
  • The serializer works with .NET classes without any additional class treatment. We don’t have to describe classes in an IDL. To send any object to Pile we don’t have to do additional boilerplate work; we send .NET serializable objects directly. This is cheap in terms of developer labor for business objects.
  • The serializer works with any .NET classes, including polymorphic classes of any complexity. Only Microsoft BinaryFormatter can do such things, as far as we know. Unfortunately, BinaryFormatter is very slow and inefficient at packing serialized data. NFX.Slim also does stateful type-registry compression - it remembers the types written/read. Instead of emitting the type name into the stream every time, it writes a number from the pool (sketched below).
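
The type-registry idea can be illustrated like this (a simplified conceptual sketch, not the NFX.Slim code): the first time a type is written, its full name goes into the stream and into a pool; every later occurrence is written as a small integer.

    using System;
    using System.Collections.Generic;
    using System.IO;

    // Simplified illustration of stateful type-registry compression.
    sealed class TypeRegistry
    {
      private readonly Dictionary<string, int> _pool = new Dictionary<string, int>();

      public void WriteType(BinaryWriter writer, Type type)
      {
        var name = type.AssemblyQualifiedName;
        int id;
        if (_pool.TryGetValue(name, out id))
        {
          writer.Write(id);      // seen before: emit just a small integer
        }
        else
        {
          _pool.Add(name, _pool.Count);
          writer.Write(-1);      // marker: full type name follows, once
          writer.Write(name);
        }
      }
    }

The reader keeps a mirror pool, appending each newly seen name in the same order, so the integer indexes resolve identically during deserialization.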

Cache

After we achieved some very cool performance results with Pile, it was time for a cache. After all, Pile as-is is not enough to “look up” data by key.

The Big Memory Pile cache is abstracted into an interface that supports priorities, maximum age, absolute expiration timestamps, and memory limits. When you approach a limit, objects start to get overwritten if their priorities allow.

The cache evicts old data (auto-delete) and expires objects at a certain timestamp if one was set. It also clones detailed per-table settings from a cache-wide setup where everything is configurable (index size, LWM, HWM, grow/shrink percentages, etc.).
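
As a usage sketch (the method and parameter names below are illustrative, not the verbatim NFX cache API; see the Big Memory Cache link under Links):

    // Illustrative only - consult the NFX cache API for the actual names.
    var users = cache.GetOrCreateTable("users");

    users.Put(userId, user,
              priority: 2,                                  // evicted later than low-priority items
              maxAgeSec: 3600,                              // age limit in seconds
              absoluteExpirationUTC: DateTime.UtcNow.AddDays(1));

    var cached = users.Get(userId);                         // a fresh copy, or null on miss/eviction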

The cache is described in the linked video.

Test Screens

Here you have it: 1+ billion objects allocated on a Pile. The write throughput is now around a paltry 300,000 inserts a second from 10 threads. This is because we have also allocated 80+ GB on a 64 GB machine. Actually, I was surprised swapping did not kill it altogether; it still works, AND a full-scan GC takes 11 ms! This machine is a 3 GHz, 6-core i7 with 64 GB physical RAM running Windows 7 64-bit.

And here we are on Ubuntu Linux under Mono, with 300M objects taking around 27 GB of RAM. The machine is a 4-core i7, 3 GHz, with 32 GB physical RAM. The 8 threads are still inserting at 500K/sec.

After I purge the Pile, notice the GC deallocating 28 GB in 72 ms (shown in the window title):

And this is how it looks on a Windows box:

Links

NFX Source

Big Memory Pile

Big Memory Cache

Serializer Benchmark Suite (NFX.Slim serializer among others):

Source

Test results

Test result table: Typical Person

Test result charts: Typical Person

NFX Pile Video - How to store 300,000,000 objects in CLR/.NET

NFX Pile - .NET Managed memory Manager Video

NFX Pile Cache Video

About the Authors

Dmitriy Khmaladze has over 20 years of IT experience in the US, working with startups and Fortune 500 clients. He pioneered SaaS for the medical industry in 1998 with Galaxy Hosted, and has 15+ years of research in language and compiler design and distributed architecture. His background spans systems programming and architecture, C/C++, .NET, Java, Android, iOS, web design (HTML5, CSS, JavaScript), RDBMSs, and NoSQL/NewSQL.

Leonid Ganeline has 15 years of experience as an integration developer, has received Microsoft MVP Awards in Integration, and blogs. He enjoys everything about software, as well as traveling, reading, and running.


Community comments

  • Nice solution

    by weq qew,

    Was interested to see how this solution was implemented after reading the first article, as I have struggled with GC issues myself with large object graphs in an application where I require 100k-1000k+ objects to be in memory. Thanks for sharing your thought process and methodology, as well as the source!

  • Memory-Mapped Files

    by And Med,

    Very interesting project.... any support for Memory-Mapped Files?

  • Re: Memory-Mapped Files

    by Dmitriy Khmaladze,

    We were planning to add it; so far the most beneficial use case has been the big in-memory CACHE that we use in front of the DB. The Pile on its own is also used for "network"/graph traversal. The API conceptually supports "upgradable memory" via the IPile interfaces; a memory-mapped file, though, would have required a versioning serializer, which is much slower than Slim. But I think the future will show that MMFs are still very beneficial.

  • Re: Nice solution

    by Dmitriy Khmaladze,

    Thank you for the kind words.
    Try using it in your project and see how stuff can get cached; even things like forum posts are good candidates.

  • Pile is really great!

    by Mariano Vicario,

    This project is really good! Excellent explanation, thanks!

    But I think Pile should be a separate project/repo on GitHub, with a proper readme with examples and documentation. And a NuGet package :) That would be the strawberry on top of the cake of an excellent work!

  • Memento

    by Keith Robertson,

    Very cool solution! No one yet has observed that you're using the Memento pattern. A Pile doesn't actually store objects, but representations of objects which you quickly re-hydrate as needed.

    Regarding MMF (memory-mapped files), wouldn't one possible benefit of this be to extend the available virtual address space beyond the system page file/partition (on systems where supported)? MMF might have value even without a versioning serializer if used for that purpose and not for persistence.

    Finally, I rather disagree with the Unistack philosophy. Isn't it better to pick and choose the best frameworks for your purpose? Would've preferred to see Pile and other parts broken out from NFX into independent modules. Just a friendly thought.

  • Re: Memento

    by Dmitriy Khmaladze,

    An MMF solution would actually transform this into a persistent DB. As far as virtual address space is concerned, the Pile has benefits even when allocating more RAM than is physically present (in that case the OS does the swapping). What's interesting is that 1,000,000,000 objects can be quite usable on a 64 GB machine (we have tried).

    As for Unistack: Pile relies on many services of NFX, e.g. instrumentation - you can visualize the cache and pile, plus logging (as the video demonstrates). UNISTACK has already saved us tons of $$$ and developer time. I agree that it is not for everyone, but since the DLL is small you can use it as-is or fork it out for your purpose!

  • Re: Pile is really great!

    by Dmitriy Khmaladze,

    Mariano, Pile is a part of the unistack - it relies on it. For example, the special-purpose read/write synchronizer uses App.Active to detect certain conditions. Instrumentation is integrated for a detailed view of the memory stack statistics.

    To make the Pile 100% independent, one would need to repeat hundreds of abstractions.
    Pile is only about 5% of what NFX has (a full network + web stack) used for cluster programming.
    If you need just the Pile, you can yank it out in your fork! :)

  • Run error

    by Kok How Teh,

    Hi;

    I pulled the latest code and ran into a runtime error in the NetJSONSerializer.cs Serialize() function:

    Additional information: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

    Regards,
    KH

  • How to make SlimSerializer to ignore certain properties / fields?

    by Kok How Teh,

    Hi, as indicated in the subject title. Thanks.


  • What about single block / object ?

    by Gabriel Rabhi,

    Great work !

    But I think it is a bit slower than the most advanced pure native solutions. If you start to read the entire database, you'll run into the same problem again:
    1) the GC will struggle to free the deserialized objects, and will need to release each one after usage.
    2) Deserialization is a slow, really slow process compared to simply reading the fields in memory.

    If you create 100k+ tables of objects for long-lived computation, your GC will need to pause the entire process again. On the other side, the advantage is that you're working with POCOs.

    In a project, my goal was to make traversal operations as fast as possible with a huge number (billions) of POCOs. I solved this problem with object data encoded in native memory and unsafe code. The code was tool-generated. Each object is a pooled local instance without any fields (each thread had a few of them). Each object had only a single void* (IntPtr), in which all fields are encoded (including sub-tables and strings). All fields are inlined in a single block of native memory. This block is reallocated and copied only if, for example, a string is changed and the memory block is too small to hold all the encoded fields. But we always do a lot more reads than writes. So a class, in my system, is a struct with one field: the pointer. Each property is a reader of a field inlined in the memory block. There is strictly no penalty for reading structs or native fields, and only a slow offset computation to retrieve a string or sub-table content.

    The BIG, really BIG advantage of this strategy is that every object can be immutable: and immutability is the key to fast, really fast multithreaded processing. No locks, only simple pointer values updated with interlocked functions. This permits managing MVCC (Multi-Version Concurrency Control) directly in RAM, without the "copy" strategy. Each block of memory begins with its reference count. The last object to stop using a block frees it.

    I hope to be able to publish the code next year, as this system is not yet in production.

    My benchmarks were similar in read/write, but access to the objects is near zero-overhead: huge traversals with access to the entire data set in a second. On the other hand, your solution has other great qualities (blocks are movable, a simpler implementation, the possibility of a versioning-compatible serializer...).

  • Re: What about single block / object ?

    by Dmitriy Khmaladze,

    Definitely the unmanaged version would work faster. HOWEVER, this defeats the purpose of 100% CLR transparency - that is, we store Dictionaries of Rows and other objects. NFX.Pile is 100% transparent - you get back what you put in (a copy), which is good for MVCC like you mentioned. We store everything - even session state (yada yada, stateless web...). As far as loading the whole database - it is VERY fast; the GC deals only with Gen 0, and rehydrating a few million rows from MongoDB takes under one minute.

  • Re: How to make SlimSerializer to ignore certain properties / fields?

    by Dmitriy Khmaladze,

    Properties are not serialized at all; fields are. You can use the standard .NET [NonSerialized] attribute for that.
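
    For example (a minimal sketch using the standard attribute; it applies to fields only, since properties are not serialized):

        [Serializable]
        public class Person
        {
          public string Name;           // serialized

          [NonSerialized]
          private object _scratchPad;   // skipped by the serializer
        }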

  • Re: What about single block / object ?

    by Gabriel Rabhi,

    You're right. Objects are Gen 0 most of the time. It's great work you've done. I think your approach can be combined with mine (a single native memory block per object, with properties reading values directly from the memory block), because in that case an object composed of 35 string fields does not allocate 35 GC blocks with 35 pointers to analyze during collection. In my system, accessing an object does not require deserializing first: you simply tell an instance "now your data is at this pointer".

    To enumerate 100M+ objects, I need only one object instance (yes, only one), for which I just change the base data pointer while iterating over the cache memory. No copy, no deserialization. I can say my strategy is amazing from a performance point of view. But NFX.Pile is a lot more compliant with the GC and the .NET philosophy, and that's a big advantage.

    The problem I've had with [one object = one native memory block] is that the coexistence of billions of native memory blocks with the GC-supervised ones can cause severe memory fragmentation. Native blocks are not relocatable. Your solution avoids the GC heap-compaction limitations caused by fixed native memory.

    I will try to combine both strategies: the extremely fast computation of my approach, able to randomly access hundreds of millions of objects per second, with the GC-friendliness of NFX.Pile.

    Thank you !

  • 2.5 million objects / sec

    by Gabriel Rabhi,

    I've taken some time to code a proof of concept showing my approach, in a separate, updated version.

    1) I've re-written a memory manager. It works as a sub-allocator in native or .NET blocks. But allocating memory from byte[] is still slow because .NET zeroes each block. So I'm going back to native allocation, and it is really, really fast! (just a little bit slower than Marshal.AllocHGlobal)

    2) I've redeveloped the "one object / one block of memory" concept, and it works very well.

    The benchmark creates "Person" objects, with one or two strings in them, and adds them to a repository which maintains the relation between the ID and the memory block (like a Dictionary). My first results, on an i7 3.4 GHz, with one thread only:

    -> 2.5 million objects created, initialized, and added to the repository per second.

    -> 2 million object instances re-created by the repository (using a precompiled lambda constructor) per second.

    I hope to optimize it further and improve this performance.

    Objects are immutable. If we start to modify one, the memory manager checks whether it needs to clone the object's memory block to ensure immutability for all threads. The object's memory block is not cloned each time a field is modified, so modifications are (nearly) as fast as with native POCO objects. The garbage pressure is really low, because an instance of an object only contains one field of type byte*, and most of the time objects are Gen 0. The GC frees my memory blocks naturally when object destructors are called.

    There is no serialization; this is the reason why the system is so fast.

    Enumerating or inserting all the objects at 2.5 million objects/sec is a good starting point. I've created 50 million objects and collections still take only milliseconds, because all structures (arrays, dictionaries, memory blocks) form a tree of unmanaged memory blocks.

    DRAWBACKS

    1) The code of the objects is a little more complicated and needs to describe internal variable-length fields like strings. But in most cases business objects are not so simple, and there are only a few hundred business object classes in most software solutions.

    2) In certain situations, modifying objects can cause allocations and memory copies to expand or shrink the block which contains the variable-length field values (strings, arrays...). In most cases the memory block has room to absorb variations, and modifications are not affected.

    FIRST CONCLUSION

    If we accept writing a little more code in business classes, we can jump to unlimited memory usage, with performance unreachable by other approaches:

    - Low-level immutability is a great thing; it permits accessing and modifying objects without any performance penalty, with as many threads as we want.

    - Zero-serialization systems are dramatically fast.

    - Creating huge temporary and long-lived object collections without any garbage pauses is a great thing for writing fast business-process code.

    I started a GitHub repository to publish the executables; work is in (slow) progress. I will probably write a paper on this technique, complete with graph-oriented unmanaged code, a Disruptor-like inter-thread communication mechanism I've written (15M calls/s), and a small persistence layer using the really good Windows embedded ESENT engine. Together these form a full-featured database toolkit, ready to use for in-memory server-side business app development.

  • Excellent, Amazing Work!

    by James Hutchison,

    Dmitriy, this is awesome stuff! We've set up a test and it works absolutely perfectly.

    Did you ever get any further with the memory mapped implementation?

  • Re: Excellent, Amazing Work!

    by Leo Gan,

    Awesome!
    Yes, we have a version-tolerant serializer in the plans that would be ideal for a materialized pile (i.e. via memory-mapped files).

  • Re: Excellent, Amazing Work!

    by Dmitriy Khmaladze,

    Yes, we are considering a version-enabled serializer which can be used with a persistence layer.

    I wonder if that fix with MemoryUtilizationModel did the job for your case?
    Best,

    DKh

  • Re: What about single block / object ?

    by Dmitriy Khmaladze,

    Is your approach similar to Zero-Copy serializers?

  • 2017 NFX Pile Update

    by Dmitriy Khmaladze,

    A 2017 update:
    ==============
    We have increased the insert rate in the test to 2.5M inserts/sec (a 2x speedup)
    and also improved the multi-threading performance.

    The Pile now allows pre-allocating pointer buffer sizes and reusing object pointers - editing objects in place:

    pile.Put( existingPointer, newObject );

  • A fast .NET serializer

    by Vadim Kantorov,

    Dmitriy, you may be interested in the very little-known github.com/skbkontur/GroBuf - claiming a 2x perf improvement over protobuf (though I don't know the details, e.g. whether they support all of ISerializable and such).
