
LMAX - How to Do 100K TPS at Less than 1ms Latency


Summary

Martin Thompson and Michael Barker talk about building an HPC financial system handling over 100K TPS at less than 1ms latency by taking a new approach to infrastructure and software. Some of the tips include: understand the platform, model the domain, create a clear separation of concerns, choose data structures wisely, and run business logic on a single thread.

Bio

Martin Thompson worked at Betfair, one of the largest sports betting exchanges, and then co-founded LMAX/Tradefair, leading the building of the world's highest-performance financial exchange. Michael Barker is currently a lead developer at the London Multi-Asset eXchange (LMAX) and a sporadic Open Source contributor to projects like PostgreSQL, JBoss, GNU Classpath and, most recently, Mono.

About the conference

QCon is a conference that is organized by the community, for the community. The result is a high-quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community. QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.

Recorded at:

Dec 16, 2010


Community comments

  • It's good to get a refresher on how modern CPUs work

    by Faisal Waris,

    I admit my concepts were really outdated

  • Re: It's good to get a refresher on how modern CPUs work

    by Mark Wutka,

    If you want more info on modern CPUs, I thought Cliff Click's Crash Course on Modern Hardware was great.

  • Excellent presentation and some remarks

    by Peter Veentjer,

    Hey guys,

    Excellent presentation. Good insights, in that you need to know what is happening at a lower level instead of believing that the JVM is smart enough to make everything perfect.

    a few remarks:

    - short methods: if a method is short, it is easier to inline. But if you make your method big (yes I know.. normally a very bad practice), you give the JIT more area to do its work. Especially if you add virtual method calls to the picture (with multiple runtime implementations of that method), the JIT has a very hard time inlining it.

    - what are you using to look inside the CPU? Normally profilers are only able to show information at the method level, and if a method is called extremely frequently, you will only see information about parent calls. I have very good experience with Intel VTune 9 (10 doesn't provide support for Java :'( ), which makes it possible to see at the instruction level what is going on, e.g. cache misses.

    - array vs linked lists. You say that arrays can perform better than linked lists since with a linked list you keep on jumping all over the place in the heap. But if that array doesn't contain primitives, you will have the same problems with arrays, since the actual content of the array could still be spread all over the heap. With object arrays you can only access the reference addresses very cheaply; the content of the objects can still be very expensive to obtain.

    - you can do locking with CAS without problems (the amount of stuff you can jam into a long with some smart bit shifting is just amazing). Only if you want to introduce blocking next to locking do things get more complicated.
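
    As a rough illustration of the bit-packing idea above (a hypothetical sketch, not code from either project), a 32-bit version counter and a 32-bit owner id can share one long and be updated atomically with a single CAS:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical example: high 32 bits hold a version, low 32 bits hold the owner id.
    public final class PackedLock {
        private final AtomicLong state = new AtomicLong(0L);

        // Try to acquire for a non-zero ownerId; returns false if already held.
        public boolean tryAcquire(int ownerId) {
            while (true) {
                long current = state.get();
                if ((int) current != 0) {
                    return false;                                // low 32 bits non-zero: already owned
                }
                long next = (((current >>> 32) + 1) << 32) | (ownerId & 0xFFFFFFFFL);
                if (state.compareAndSet(current, next)) {
                    return true;
                }
                // lost the race to another CAS, retry
            }
        }

        public void release() {
            while (true) {
                long current = state.get();
                long next = ((current >>> 32) + 1) << 32;        // bump version, clear owner
                if (state.compareAndSet(current, next)) {
                    return;
                }
            }
        }
    }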

    - object layout: with lower level programming languages you have control over how data structures are mapped to memory. But with Java, you don't have any guarantee about how something is laid out.

    - if you are using CAS, there is support for it on the CPU, but you need to realize that a CAS can still be very expensive if there is contention on it. Even a volatile write can limit performance considerably. I did some performance explorations last month and by removing a single volatile write, I could get the performance up from 20M transactions/second to 30M.

    - what are you doing to guarantee read consistency? If you are working with plain locks or CAS, it still is possible to get all kinds of read anomalies, e.g. a non-repeatable read. If you are using a database and set the correct isolation level, the database is going to take care of it. But when working with locks/CAS, you could still get stuck.

    - what are you doing to guarantee failure atomicity? Normally a database takes care of rolling back bad changes, but when done in memory, you don't have this feature out of the box.

    - preventing garbage: I couldn't agree more. Creating garbage still kills performance, example:

    git.codehaus.org/gitweb.cgi?p=multiverse.git;a=...

    This is a simple benchmark where 2 different tests are run: one with boxing/unboxing and one without. So with the boxing version you have 1 object per transaction and with the primitive version you are creating zero waste. As you can see, the scalability of the primitive version is almost linear and with the boxing version it is a mess. So doing object pooling (or preventing garbage by design) is one of the easiest ways to improve extremely high-performance code.
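
    For illustration, here is a minimal sketch of the boxing effect being described (hypothetical code, not the Multiverse benchmark itself): the primitive loop allocates nothing, while the boxed loop creates a new Long on every iteration and so generates garbage in proportion to the work done:

    public final class BoxingGarbage {
        public static void main(String[] args) {
            long sumPrimitive = 0L;
            for (long i = 0; i < 10_000_000L; i++) {
                sumPrimitive += i;               // no allocation
            }

            Long sumBoxed = 0L;
            for (long i = 0; i < 10_000_000L; i++) {
                sumBoxed += i;                   // unboxes, adds, then boxes a brand new Long
            }

            System.out.println(sumPrimitive + " " + sumBoxed);
        }
    }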

    I really like the concept of the disruptor.

    Peter Veentjer
    Multiverse: Software transactional memory for Java and the JVM
    multiverse.codehaus.org

  • Re: It's good to get a refresher on how modern CPUs work

    by Iouri Goussev,

    Yes, 100K TPS with 1ms latency is amazing!

  • Re: Excellent presentation and some remarks

    by Michael Barker,

    - short methods:

    The point we were trying to make was less about short methods specifically, but more generally about simple code. The JIT will always do better on simpler code. As an example, we noticed a jump in performance in a particularly hot area of code by simply removing a couple of unnecessary try/catch blocks. Doing a bit more research we found that it wasn't that exception handling was slow, but that the resulting complexity of the method was such that Hotspot failed to do a good job of optimising it (we suspect the resulting code was too large to in-line). If your tests show a significant boost from building a large method, that's good, but for the same performance profile I would prefer shorter, simpler code.

    - looking inside the CPU:

    We haven't had much need to delve that deep yet. Mostly we build the high-performance bits of our code using the principles we talked about then measure the results. Having quite a clean conceptual design has made it easy to reason about the system, such that finding bottlenecks hasn't been too hard (so far). Often it's not our code that's the problem, e.g. we spent some time hunting down issues with UDP Multicast on a virtualised (KVM) OS.

    - array vs linked lists & object layout:

    That was mostly an example to illustrate the point about cache-striding. Yes, you'll still need to dereference the object held in the array, still that's only half as many as a linked list :-). Having little control over object layout and requiring that all complex structures (i.e. non-primitives) are heap allocated is really annoying. My most desired feature for Java is value types (blogs.sun.com/jrose/entry/tuples_in_the_vm). While value types won't give you full control over memory layout, you could reasonably expect that the VM implementation will be able to layout the structures in an efficient way, e.g. an array of value types as a single contiguous block of memory.

    - if you are using cas ... contention

    You are more right than you realise, it's not just CAS that suffers under contention, it's everything. We have a design mantra of 1 thread == 1 resource. We have 1 thread writing to the journal, 1 thread for business logic, 1 thread publishing to a topic (multicast address), 1 thread replicating, etc... The disruptor pattern is a whole system pattern that tries to avoid contention wherever possible.

    In fact, in our system there is only 1 contended CAS per service. In our highest throughput system it's only between 2 threads where one thread has about 3-4 orders of magnitude more volume than the other, so in practical terms it's fairly inconsequential.

    - read consistency

    In the business logic, nothing, it's single threaded. In the infrastructure where all the concurrent code is, we rely on the Java Memory Model to provide visibility of memory between threads.

    - what are you doing to guarantee failure atomicity?

    Good input validation, deterministic logic and lots of tests. We've been toying with Clojure-style concurrency model, i.e. fully immutable data model & a single volatile reference as an entry point. However, the heavy garbage from that approach makes it a bit of a non-starter with most of the current JVM technology. The only GC technology I've seen that would make that model viable in a soft real-time environment would be Azul's pauseless compacting GC.

  • Re: Excellent presentation and some remarks

    by Peter Veentjer,

    - The point we were trying to make was less about short methods specifically, but more generally about simple code. The JIT will always do better on simpler code. As an example, we noticed a jump in performance in a particularly hot area of code by simply removing a couple of unnecessary try/catch blocks. Doing a bit more research we found that it wasn't that exception handling was slow, but that the resulting complexity of the method was such that Hotspot failed to do a good job of optimising it (we suspect the resulting code was too large to in-line). If your tests show a significant boost from building a large method, that's good, but for the same performance profile I would prefer shorter, simpler code.

    True.

    - We haven't had much need to delve that deep yet. Mostly we build the high-performance bits of our code using the principles we talked about then measure the results. Having quite a clean conceptual design has made easy to reason about the system, such that finding bottlenecks hasn't been too hard (so far). Often it's not our code that's the problem, e.g. we spent some time hunting down issues with UDP Multicast on a virtualised (KVM) OS.

    My experience with high performance is that it feels like a juggling act where each feature or performance improvement is another ball you need to keep in the air. What are you using to integrate the whole performance aspect into the build process? I have written a small framework this month for benchmarking so that you can track performance over time. It seems that there are not many such test frameworks.

    - That was mostly an example to illustrate the point about cache-striding. Yes, you'll still need to dereference the object held in the array, still that's only half as many as a linked list :-).

    Not if you integrate the list in the object itself (so adding some next pointer).

    - Having little control over object layout and requiring that all complex structures (i.e. non-primitives) are heap allocated is really annoying. My most desired feature for Java is value types (blogs.sun.com/jrose/entry/tuples_in_the_vm). While value types won't give you full control over memory layout, you could reasonably expect that the VM implementation will be able to layout the structures in an efficient way, e.g. an array of value types as a single contiguous block of memory.

    That would be a very cool feature to have.


    - In the business logic, nothing, it's single threaded. In the infrastructure where all the concurrent code is, we rely on the Java Memory Model to provide visibility of memory between threads.

    If the JMM is applied correctly, you will only get sequential consistency, which is a very low form of read consistency. Stuff like dirty reads, non-repeatable reads, phantom reads and write skew is still possible. For small segments of code it is in most cases doable to reason about it, but for larger systems it is quite hard. Another problem is that in my experience only a few people (read: one or zero) are able to understand what is going on, and all those who don't understand it are likely to break something.

    - Good input validation, deterministic logic and lots of tests. We've been toying with Clojure-style concurrency model, i.e. fully immutable data model & a single volatile reference as an entry point. However, the heavy garbage from that approach makes it a bit of a non-starter with most of the current JVM technology. The only GC technology I've seen that would make that model viable in a soft real-time environment would be Azul's pauseless compacting GC.

    Perhaps I can help you here. The idea of STM is that you separate the values from your entities (you could do it using managed refs but could also rely on fully transactional objects). Since the entities only contain the 'identity' you don't need to recreate them (just put them in memory and they can safely be shared by others since they don't contain state). But the containers for the actual values are not interesting in your business code, so these can be reused. In Multiverse I pool everything; the goal is to have zero waste. So these containers are reused. One of the performance explorations showed that even the whole pooling could be simplified by making certain transaction implementations (I have different ones which are selected using speculation) a container pool (and the transaction itself is pooled).

    One of the problems of STM is that it often is very optimistic, so you need to keep retrying your transaction and in combination with some back-off policy (that causes waiting) the latencies could become unpredictable.

    I'm currently working on finishing up Multiverse 0.7 (also an STM implementation) which has support for lower isolation levels like read committed or repeatable read, which reduce the need for retrying. And it also is possible to put transactions in pessimistic mode, which gives you the ability to 'claim' objects. For the 0.8 release I'm going to take the whole pessimistic stuff to the next level by introducing blocking, read/write locks etc. so that it can be aligned better with GigaSpaces (the company I work for uses it). With these improvements one is able to get some of the benefits of the STM in a more traditional concurrent application.

    You and your team are doing some very cool stuff :)

  • Re: Excellent presentation and some remarks

    by Martin Thompson,

    - My experience with high performance is that it feels like a juggling act where each feature or performance improvement is another ball you need to keep in the air. What are you using to integrate the whole performance aspect into the build process? I have written a small framework this month for benchmarking so that you can track performance over time. It seems that there are not many such test frameworks.

    Totally agree. We had to develop our own micro performance testing framework.

    code.google.com/p/jmicrobench/

    However be aware that micro benchmarks can be very misleading. For example, the polymorphic call you pointed out depends on the number of implementations for an interface seen during class loading. If 1 or 2 it can still be optimised away but for 3 or more implementations it usually ends up as a vtable lookup. Macro level benchmarks on realistic data have much more value in my experience.
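
    As a hypothetical illustration of the point about call-site morphism (invented names, not a benchmark from this discussion): the same interface call site below sees one, two and then three implementations; with one or two Hotspot can typically inline or use an inline cache, with three or more it usually falls back to a vtable dispatch:

    interface Pricer { long price(long qty); }

    final class FixedPricer  implements Pricer { public long price(long qty) { return qty * 10; } }
    final class TieredPricer implements Pricer { public long price(long qty) { return qty > 100 ? qty * 8 : qty * 10; } }
    final class FreePricer   implements Pricer { public long price(long qty) { return 0; } }

    public final class CallSiteDemo {
        // The single call site whose optimisation depends on how many receiver types it has seen.
        static long total(Pricer p, long qty) {
            return p.price(qty);
        }

        public static void main(String[] args) {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += total(new FixedPricer(), i);
                sum += total(new TieredPricer(), i);
                sum += total(new FreePricer(), i);   // a third type makes the call site megamorphic
            }
            System.out.println(sum);
        }
    }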

    - Not if you integrate the list in the object itself (so adding some next pointer).

    I see what you mean but this can be the beginning of the road into design evil that comes from conflating concerns into an object. Should a Customer class extend a LinkedList just as an optimisation??? Yes this can also be achieved with bytecode weaving and other techniques but is it really the right approach as a general design principle?

    - If the JMM is applied correctly, you will only get sequential consistency, which is a very low form of read consistency. Stuff like dirty reads, non-repeatable reads, phantom reads and write skew is still possible. For small segments of code it is in most cases doable to reason about it, but for larger systems it is quite hard. Another problem is that in my experience only a few people (read: one or zero) are able to understand what is going on, and all those who don't understand it are likely to break something.

    I believe these problems all come from having multiple writers hitting the same data. If the design is taken back to the original principles of OO then all communication is via message passing. We use the Disruptor pattern to ensure we get very efficient message passing and the non-functional qualities of service we require. In some ways we believe in the principles behind the Actor pattern, we just don’t believe in it at the micro granularity that is often discussed in examples. Within a component we have a clean single threaded OO model with all communication between components via message passing.

    It sounds like what you are doing on STM is very interesting and a great approach for some classes of problem. When trying to achieve real-time characteristics it is so important to understand what concurrency is going on and in particular what the behaviour is when the data is contended under heavy load. My personal view is that STM is a great approach for extremely large data sets where the potential for contention is very low. This allows lots of work to progress in parallel. Other problem domains where contention is high require an approach where the concurrency management cost does not exceed the cost of doing the work. As soon as you have more than one thread trying to write the same data you need, at a minimum, CAS and a state machine with retries, or locks at worst. If the model is a tree that is not too deep you can CAS the root to swap in the changes after a copy. On contention you keep retrying which can be very costly on GC even for the identity of the nodes. If the entities are indirect from the identity nodes then the indirection can cause a cache miss. If the model is a more complex graph the state machine for retries quickly becomes very complex. That is what the Disruptor is about avoiding ;-)

    Thanks for the feedback. This is an area I feel deserves some debate because the current approaches are running out of steam.

    Regards,
    Martin...

  • Re: Excellent presentation and some remarks

    by Peter Veentjer,

    Hi Martin,

    - jmicrobench

    I'll have a look at it.

    - I see what you mean but this can be the beginning of the road into design evil

    I agree. In my case I can get away with it: I wrote a lot of specialized datastructures for the stm, so it doesn't leak into the other parts of the system. I would not like to see leaky stuff in normal code.

    - I believe these problems all come from having multiple writers hitting the same data.

    It can also happen when different data gets written. E.g. there are 2 bank accounts; one with 2 euro and one with 0 euro. 1 euro is moved from one account to the other, so if someone looks at the system, they should see 2/0 or 1/1. But 2/1 or 1/0 is not correct.

    - It sounds like what you are doing on STM is very interesting and a great approach for some classes of problem. When trying to achieve real-time characteristics it is so important to understand what concurrency is going on and in particular what the behaviour is when the data is contended under heavy load. My personal view is that STM is a great approach for extremely large data sets where the potential for contention is very low.

    Contention management is one subject that needs a lot of research. Apart from it being difficult to add, it is even more difficult to get fast.

    Another problem with stm is what happens with long running transactions; if there is some contention, a long running transaction could be obstructed by shorter ones.

    - If the model is a tree that is not too deep you can CAS the root to swap in the changes after a copy. On contention you keep retrying which can be very costly on GC even for the identity of the nodes.

    The cool thing about stm is that you don't need to swap the whole tree. If you are going to update, you only need to replace the nodes affected (either by the direct update or rotation). Other transactions can still continue and depending on the stm implementation they can have a stable view of reality even though an update is in progress.

    But I agree that STM is not the holy grail for concurrent problems. My experience is that performance is not the most important thing in a lot of applications. Reliability/simplicity and team performance are often more important.

    But if you want to get the most out of performance, (almost) any kind of framework is going to be in the way and nothing beats doing it yourself.

  • Re: Excellent presentation and some remarks

    by Peter Veentjer,

    PS: This is an example of what the Benchy framework (a 0.1 version will be released at the end of the year) looks like:

    Example of a benchmark:

    git.codehaus.org/gitweb.cgi?p=multiverse.git;a=...

    A benchmark consists of a lot of test cases and is configured in Groovy, so you have a full-blown programming language available for building it. And the benchmark/testcase essentially is a key/value bag, so you can pass all the properties you need to it.

    The actual driver contains the logic you want to test:

    git.codehaus.org/gitweb.cgi?p=multiverse.git;a=...

    The driver can be written in any language because a normal Java driver jar is provided that doesn't rely on Groovy.

    All results are stored in a database so you can track performance over time. And you can also create graphs like this:

    git.codehaus.org/gitweb.cgi?p=multiverse.git;a=...

    There is no connection to any kind of rendering framework, although I have included JFreeChart in the Benchy artifacts for ease of use. But if GnuPlot or any other kind of rendering framework is needed, you can use that instead.

    And there are some command line tools for listing, displaying, deleting benchmarks. Works like a charm.

    The microbenchmark frameworks I have seen so far:
    - Japex: japex.java.net
    - Caliper: code.google.com/p/caliper/wiki/JavaMicrobenchmarks
    didn't fit my needs (either a crummy DSL for creating the actual benchmark or too many restrictions on publishing data). So that is why I decided to write my own. And I'm glad I did.

  • Re: Excellent presentation and some remarks

    by Martin Thompson,

    - If the model is a tree that is not too deep you can CAS the root to swap in the changes after a copy. On contention you keep retrying which can be very costly on GC even for the identity of the nodes.

    - The cool thing about stm is that you don't need to swap the whole tree. If you are going to update, you only need to replace the nodes affected (either by the direct update or rotation). Other transactions can still continue and depending on the stm implementation they can have a stable view of reality even though an update is in progress.

    The point I was making is that you do not need to replace the whole tree, as this is the benefit of a tree where you always navigate from the root. You need to replace the modified node, and all nodes up to the root node, copying references on the way, aka path copy. You then CAS the root node and retry if someone else beats you to it. This 'trick' is not so easy/possible with a graph structure. The CAS needs a memory barrier for others to see the changes. If there is only one writer then a memory barrier without CAS would suffice. The nice thing about this approach is everyone can hold on to a read-consistent copy. The downside is the garbage it creates, which is also often promoted, incurring further cost and requiring later compaction.
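
    A minimal sketch of the path-copy-then-CAS-the-root technique described here (illustrative only, not LMAX or Multiverse code), on an immutable binary search tree:

    import java.util.concurrent.atomic.AtomicReference;

    final class PathCopyTree {
        static final class Node {
            final int key;
            final Node left, right;
            Node(int key, Node left, Node right) { this.key = key; this.left = left; this.right = right; }
        }

        private final AtomicReference<Node> root = new AtomicReference<>();

        // Insert by copying only the nodes on the path from the root down to the new leaf,
        // then CAS the new root in. On contention, retry from the freshly read root.
        public void insert(int key) {
            while (true) {
                Node current = root.get();
                Node updated = insertInto(current, key);
                if (root.compareAndSet(current, updated)) {
                    return;              // readers holding the old root keep a consistent snapshot
                }
                // another writer won the CAS; retry (this retry loop is the GC cost mentioned above)
            }
        }

        private static Node insertInto(Node node, int key) {
            if (node == null) return new Node(key, null, null);
            if (key < node.key) return new Node(node.key, insertInto(node.left, key), node.right);
            if (key > node.key) return new Node(node.key, node.left, insertInto(node.right, key));
            return node;                 // key already present, path unchanged
        }
    }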

  • Please explain few things about the disruptor implementation

    by dominic laporte,

    1) What do you mean by the :MIN sign on the 101,101, 102 ??
    2) How does the Bus logic keep track of where the {journaler,replicator,unmarshaller} is ?
    3) what's the mechanism used between the Bus logic and {journaler,replicator,unmarshaller} ?

    3) If the replicator is stalling on 101.
    3.1) Can the replicator move to 102 before 101 is finished ?
    3.2) How do you stop the whole system from stalling then ?

    4) If the Bus logic is stalling on 101.
    4.1) Can the Bus logic move to 102 before 101 is finished ?
    4.2) Can the Bus logic come back to 101 if it can proceed with 102 before 101 is finished ? How does it do that ?

  • Re: Please explain few things about the disruptor implementation

    by dominic laporte,

    replace
    4.2) Can the Bus logic come back to 101 if it can proceed with 102 before 101 is finished ? How does it do that ?
    with
    4.2) Can the Bus logic come back to 101 if it can proceed with 102 before 101 is finished ? How does it come back to 101 ?

  • The LMAX Architecture

    by Edward Hoffman,

    Java architecture does not support cache coherency in hardware. This essentially means that whenever a thread enters a monitor, all memory changes made before must be committed to main memory, and the entire cache must be invalidated and all registers flushed (please read Chapter 17 of the Java Language Specification). Assuming LMAX is written in Java, is this presentation a true reflection of its implementation?

  • Loved it

    by Martin Perez,

    Hi guys,

    It was a really nice presentation. I loved it. I've got a couple of questions in my mind.

    1st. You made a valid point saying that a retail trading exchange has a bigger volume than a corporate trading exchange. So you've got here a damn fast business logic able to manage 6M TPS and therefore you are now able to cope with a larger volume of customers. And as you mention in the talk, the smaller the latency the smaller the spread, which is much better for the customer. But now, I'm curious about what happens when you put your average retail customer into the equation. He may well have 200ms latency from his trading office/house to your servers. So although you've got there a really fast setup, your retail customer is still exposed to Internet latency. How do you handle that? How does it affect the pricing engine (I guess the spread should increase)? And how does it impact the final volume (I would expect much less volume)?

    Second. I find the whole proposal very attractive and really liked the concept of the disruptor. If I understood properly, the whole point is to have the business logic in a single thread, as compact and simple as possible, while all the other services should be multi-threaded. You mentioned that JEE was not a valid option here but still all the talk is within the Java context. So I'm curious, which kind of server setup are you using (if this information may be disclosed)? (i.e. tomcat, plain sockets setup, plain jms, ...)

    And finally, how do you handle failover? I mean, if you have a super-fast single-threaded system which will also probably have all the pricing data and customer positions in memory, you will still need a pretty damned fast replication system. And that replication system should be in the closest possible state to the single-threaded business logic system, as otherwise in a failover you will depend on how fast you can recover from the journaling system and pray that this journaling system is up to date. I'm quite interested in how it works and whether you have experienced any real failover situations which you could share with us.

    Thanks,
    Martin

  • Great presentation!

    by Matthias Weiser,

    I have been working in this area for more than 10 years and also spent several years as a programmer at a major derivatives exchange.
    Those days the business logic was done in Cobol and the architecture in C. Still a lot of the concepts from the old days there show up in your presentation.
    Even though I am working in a .NET environment now, I can confirm most of your statements.
    Could you elaborate a bit how you are determining that your code is machine friendly even though you are using a virtual machine? I end up writing a lot of small test apps and benchmark them, but this is not close to determining that something is cache-friendly. Btw which OS are you using?

  • Re: Please explain few things about the disruptor implementation

    by Martin Thompson,

    1) What do you mean by the :MIN sign on the 101,101, 102 ??
    2) How does the Bus logic keep track of where the {journaler,replicator,unmarshaller} is ?
    3) what's the mechanism used between the Bus logic and {journaler,replicator,unmarshaller} ?

    For some context, we call all consumers of events an EventProcessor. The business logic, journaller, replicator and unmarshaller are all examples of EventProcessors. EventProcessors maintain their own high water mark of what event they have processed up to. When an EventProcessor wants to make progress it makes a call on what we call a ThresholdBarrier using the signature 'long waitFor(long sequence)', which will effectively block until the sequence that EventProcessor is waiting for is available. The actual available sequence is what is returned, which may be ahead of the requested sequence, allowing batch progress. The MIN: on the presentation is a means of expressing that the ThresholdBarrier on which the business logic is waiting will return the minimum sequence of all the EventProcessors and the RingBuffer it is waiting on. A ThresholdBarrier represents the concern of maintaining the dependencies. The journaller, replicator and unmarshaller only depend on the RingBuffer progressing, whereas the business logic depends on the RingBuffer and the journaller, replicator and unmarshaller before it can progress a given sequence.

    All this happens without locks.
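
    As a hypothetical sketch of the waitFor/MIN idea described above (class and method names are illustrative, not the LMAX implementation): each dependency exposes a monotonically increasing sequence, and the barrier waits until the minimum of those sequences reaches the value requested, returning whatever is actually available so the caller can batch:

    // Each EventProcessor (and the ring buffer cursor) owns one of these; single writer per sequence.
    final class Sequence {
        private volatile long value = -1;
        long get() { return value; }
        void set(long newValue) { value = newValue; }
    }

    final class ThresholdBarrierSketch {
        private final Sequence[] dependencies;   // e.g. ring buffer cursor, journaller, replicator, unmarshaller

        ThresholdBarrierSketch(Sequence... dependencies) { this.dependencies = dependencies; }

        // Blocks (here by yielding) until all dependencies have reached 'sequence'.
        long waitFor(long sequence) {
            long available;
            while ((available = minimumOf(dependencies)) < sequence) {
                Thread.yield();                  // a real implementation might busy-spin or park instead
            }
            return available;                    // may be ahead of 'sequence', allowing batch progress
        }

        private static long minimumOf(Sequence[] sequences) {
            long min = Long.MAX_VALUE;
            for (Sequence s : sequences) min = Math.min(min, s.get());
            return min;
        }
    }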

    3) If the replicator is stalling on 101.
    3.1) Can the replicator move to 102 before 101 is finished ?
    3.2) How do you stop the whole system from stalling then ?

    The replicator has two threads internally. One is sending messages to its cluster pair without blocking. A second thread is listening for acknowledgements and will update its high water mark for this EventProcessor when the ack arrives. It effectively pipelines requests so it can progress with replicating 102 before it receives the ack from 101. However the dependent business logic will not progress until it gets the ack.

    4) If the Bus logic is stalling on 101.
    4.1) Can the Bus logic move to 102 before 101 is finished ?
    4.2) Can the Bus logic come back to 101 if it can proceed with 102 before 101 is finished ? How does it come back to 101 ?

    The whole point of the business logic running as a single thread working on in-memory data structures is that it will never stall unless a catastrophic issue occurs, such as an OutOfMemoryError. The business logic does not do IO or contend on locks. The business logic will not progress to 102 before 101 is finished.

  • Re: The LMAX Architecture

    by Martin Thompson,

    - Java architecture does not support cache coherency in hardware. This essentially means that whenever a thread enters a monitor, all memory changes made before must be committed to main memory, and the entire cache must be invalidated and all registers flushed (please read Chapter 17 of the Java Language Specification). Assuming LMAX is written in Java, is this presentation a true reflection of its implementation?

    The whole point to the talk is that it is best not to use locks on highly contended data. This is why we use a single thread to own the data writes with other threads only reading this data as it is consumed. The single thread only needs a memory barrier to make the changes visible when the updates are complete.

    A monitor is surfaced in Java via the synchronized keyword. Not only does a monitor cause the changes on a thread to be made visible, it also prevents instruction reordering and has the potential to context switch to the kernel for arbitration. For a low-latency system monitors should be avoided.

    Flushing registers and marking cache lines as modified does not mean they are evicted from the cache of the core that is the sole writer. These changes can be effectively propagated via the L3 cache and socket interlinks to other cores reading the data. The big performance hit comes from cores contending on trying to write the same cache line. They have to contend on claiming ownership. For me this comes down to the mechanical sympathy for how caches work regardless of language. We use the cache coherency model as a very efficient means of propagating changes between cores.
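
    A minimal sketch of the single-writer idea in this reply (hypothetical names, not LMAX code): the owning thread writes plain fields and then performs one volatile store, which acts as the memory barrier that publishes the changes; readers check the volatile sequence before reading the data:

    final class SingleWriterSlot {
        private long price;                      // plain fields, written only by the single owner thread
        private long quantity;
        private volatile long published = -1;    // the only barrier needed to make the writes visible

        // Called by the owner thread only.
        void publish(long sequence, long newPrice, long newQuantity) {
            price = newPrice;
            quantity = newQuantity;
            published = sequence;                // volatile store: everything written above becomes visible
        }

        // Called by reader threads; wrap-around/overwrite protection is assumed to be handled elsewhere.
        boolean read(long expectedSequence, long[] out) {
            if (published < expectedSequence) {
                return false;                    // not yet published
            }
            out[0] = price;
            out[1] = quantity;
            return true;
        }
    }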

  • Re: The LMAX Architecture

    by dominic laporte,

    Thank you so much for your reply.

    I am very impressed with your architecture. However I am still not sure how you are handling your error conditions. I would appreciate it if you could answer more questions regarding error handling.

    1) Am I correct in thinking that the logic between all these components is based on the sole fact that once a request *passes* the receiver it cannot fail in any of the subsequent stages? All the components have sequenced logic, so getting a return of 200 from 'long waitFor(190)' in the bus logic means that all transactions between 190 and 200 have *passed*.

    2) if one of the {journaler,replicator,unmarshaller} fails on a sequence (for whatever reason) the bus logic will never proceed past that sequence. Is this correct ?

  • Re: Loved it

    by Martin Thompson,

    - 1st. You made a valid point saying that a retail trading exchange has a bigger volume than a corporate trading exchange. So you've got here a damn fast business logic able to manage 6M TPS and therefore you are now able to cope with a larger volume of customers. And as you mention in the talk, the smaller the latency the smaller the spread, which is much better for the customer. But now, I'm curious about what happens when you put your average retail customer into the equation. He may well have 200ms latency from his trading office/house to your servers. So although you've got there a really fast setup, your retail customer is still exposed to Internet latency. How do you handle that? How does it affect the pricing engine (I guess the spread should increase)? And how does it impact the final volume (I would expect much less volume)?

    Retail customers and the wholesale market makers are putting orders into the same exchange. The main proposition for our exchange is for wholesale market makers to compete offering the best prices to the retail customers. Each comes in over its own channels. Once past our firewalls everyone traverses the same infrastructure. Outside our firewalls we have no control over the latency we can offer. Wholesale market makers can use the financial extranets and retail customers can use the Internet. If retail customers locate their servers geographically close to ours then they can get low latency. None of this affects our matching engine. The spreads are tight because we have multiple market makers competing with each other to be top of book and in addition retail customers can place their own orders inside the spread to become top of book themselves, thus further reducing the spread. This is just a free market and drives natural price efficiency. What is very important for us is that we offer predictable latency to the wholesale market makers so they can offer wholesale prices.

    - Second. I find the whole proposal very attractive and really liked the concept of the disruptor. If I understood properly, the whole point is to have the business logic in a single thread, as compact and simple as possible, while all the other services should be multi-threaded. You mentioned that JEE was not a valid option here but still all the talk is within the Java context. So I'm curious, which kind of server setup are you using (if this information may be disclosed)? (i.e. tomcat, plain sockets setup, plain jms, ...)

    We have our own custom application server that is launched as a Java process. The Disruptor pattern is at the heart of this. For the web clients we take HTTP traffic from Servlets running in the Caucho Resin web container. All our messaging, reliable delivery, consistency models, resource management, and HA are our own intellectual property.

    - And finally, how do you handle failover? I mean, if you have a super-fast single-threaded system which will also probably have all the pricing data and customer positions in memory, you will still need a pretty damned fast replication system. And that replication system should be in the closest possible state to the single-threaded business logic system, as otherwise in a failover you will depend on how fast you can recover from the journaling system and pray that this journaling system is up to date. I'm quite interested in how it works and whether you have experienced any real failover situations which you could share with us.

    All services are replicated to multiple backup nodes. This is a master and multi-slave system. One slave is co-located and kept in lock step with the master so it can instantly take over in the event of failure. No IP addresses need to migrate because we use IP Multicast for all communications. We have other slave nodes in our disaster recovery location running less than one second behind the master.

    The journals never fall behind because we write with forced synchronisation to storage. We have spent a lot of time getting this to perform very fast by developing our own algorithms. The only time we need to go to the journals is if we have a complete system failure and we need to recover. To date we have successfully coped with single node failures and a complete switch-over to our disaster recovery location, which is geographically remote.

  • Re: The LMAX Architecture

    by Martin Thompson,

    1) Am I correct in thinking that the logic between all these components is based on the sole fact that once a request *passes* the receiver it cannot fail in any of the subsequent stages? All the components have sequenced logic, so getting a return of 200 from 'long waitFor(190)' in the bus logic means that all transactions between 190 and 200 have *passed*.

    It is perfectly possible for a request to fail at any stage. Getting a return of 200 when 190 is requested means that the events from 190 to 200 in the RingBuffer are now available to be processed by that EventProcessor. Each EventProcessor has a top level exception handler that can decide on the appropriate course of action to take, which could be anything from removing the node from the cluster to simply marking the event as dead and not for further processing.

    2) if one of the {journaler,replicator,unmarshaller} fails on a sequence (for whatever reason) the bus logic will never proceed past that sequence. Is this correct ?

    It depends on the class of failure. For example if the journaller failed to write a log entry this would likely be because the storage device has failed or become full. In this case the node should be marked for removal from the cluster. If the failure is more benign such as the marshaller parsing a corrupt message then it would be marked so and ignored by the business logic. If this is the master node and the replicator is not getting acks from the lockstep slave then the cluster is informed and the master can run at risk until a new lockstep slave is promoted.

  • Re: Loved it

    by Martin Perez,

    - 1st. You made a valid point saying that a retail trading exchange has a bigger volume than a corporate trading exchange. So you've got here a damn fast business logic able to manage 6M TPS and therefore you are now able to cope with a larger volume of customers. And as you mention in the talk, the smaller the latency the smaller the spread, which is much better for the customer. But now, I'm curious about what happens when you put your average retail customer into the equation. He may well have 200ms latency from his trading office/house to your servers. So although you've got there a really fast setup, your retail customer is still exposed to Internet latency. How do you handle that? How does it affect the pricing engine (I guess the spread should increase)? And how does it impact the final volume (I would expect much less volume)?

    Retail customers and the wholesale market makers are putting orders into the same exchange. The main proposition for our exchange is for wholesale market makers to compete offering the best prices to the retail customers. Each comes in over its own channels. Once past our firewalls everyone traverses the same infrastructure. Outside our firewalls we have no control over the latency we can offer. Wholesale market makers can use the financial extranets and retail customers can use the Internet. If retail customers locate their servers geographically close to ours then they can get low latency. None of this affects our matching engine. The spreads are tight because we have multiple market makers competing with each other to be top of book and in addition retail customers can place their own orders inside the spread to become top of book themselves, thus further reducing the spread. This is just a free market and drives natural price efficiency. What is very important for us is that we offer predictable latency to the wholesale market makers so they can offer wholesale prices.


    Nice.


    - Second. I find the whole proposal very attractive and really liked the concept of the disruptor. If I understood properly, the whole point is to have the business logic in a single thread, as compact and simple as possible, while all the other services should be multi-threaded. You mentioned that JEE was not a valid option here but still all the talk is within the Java context. So I'm curious, which kind of server setup are you using (if this information may be disclosed)? (i.e. tomcat, plain sockets setup, plain jms, ...)

    We have our own custom application server that is launched as a Java process. The Disruptor pattern is at the heart of this. For the web clients we take HTTP traffic from Servlets running in the Caucho Resin web container. All our messaging, reliable delivery, consistency models, resource management, and HA are our own intellectual property.


    That is quite interesting. I'm not a big fan of bespoke infrastructure development myself, given the tons of proven Open Source application servers, messaging platforms, protocols, frameworks, etc. out there. In fact I would be quite skeptical if it weren't for the fact that your company comes out of Betfair as a spin-off. Those are good credentials and an impressive volume. So surely it must be top-notch stuff.


    - And finally, how do you handle failover? I mean, if you have a super-fast single-threaded system which will also probably have all the pricing data and customer positions in memory, you will still need a pretty damned fast replication system. And that replication system should be in the closest possible state to the single-threaded business logic system, as otherwise in a failover you will depend on how fast you can recover from the journaling system and pray that this journaling system is up to date. I'm quite interested in how it works and whether you have experienced any real failover situations which you could share with us.

    All services are replicated to multiple backup nodes. This is a master and multi-slave system. One slave is co-located and kept in lock step with the master so it can instantly take over in the event of failure. No IP addresses need to migrate because we use IP Multicast for all communications. We have other slave nodes in our disaster recovery location running less than one second behind the master.

    The journals never fall behind because we write with forced synchronisation to storage. We have spent a lot of time getting this to perform very fast by developing our own algorithms. The only time we need to go to the journals is if we have a complete system failure and we need to recover. To date we have successfully coped with single node failures and a complete switch-over to our disaster recovery location, which is geographically remote.


    Pretty impressive, Martin. Well done!

    Thanks for your answers,
    Martin

  • Re: Excellent presentation and some remarks

    by Martin Thompson,

    - short methods: if a method is short, it is easier to inline. But if you make your method big (yes I know.. normally a very bad practice), you give the JIT more area to do its work. Especially if you add virtual method calls to the picture (with multiple runtime implementations of that method), the JIT has a very hard time inlining it.

    -XX:MaxInlineSize=value is used to set the maximum size of method that Hotspot will inline. The default for this is 35 bytes of bytecode, which is pretty small.

    Hotspot can do both call-site and called-method inlining; if you make your methods small and not complex it has more options.

    Simple elegant code wins on so many levels :-)
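
    For anyone who wants to watch these decisions being made, a small hypothetical test program can be run with the flags mentioned above, for example: java -XX:MaxInlineSize=70 -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InlineDemo

    public final class InlineDemo {
        // Small and simple, so it sits well under the 35-bytecode default and should inline.
        private static long tiny(long x) { return x * 31 + 7; }

        public static void main(String[] args) {
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) {
                sum += tiny(i);          // hot call site; its inlining decision appears in the -XX:+PrintInlining output
            }
            System.out.println(sum);
        }
    }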

  • Re: Great presentation!

    by Martin Thompson,

    - I have been working in this area for more than 10 years and also spent several years as a programmer at a major derivatives exchange.
    Those days the business logic was done in Cobol and the architecture in C. Still a lot of the concepts from the old days there show up in your presentation. Even though I am working in a .NET environment now, I can confirm most of your statements.

    It is a fun space with lots of computer science challenges :-)

    - Could you elaborate a bit how you are determining that your code is machine friendly even though you are using a virtual machine? I end up writing a lot of small test apps and benchmark them, but this is not close to determining that something is cache-friendly. Btw which OS are you using?

    -XX:+PrintOptoAssembly with a debug build can be useful for seeing what assembly code is being produced when Hotspot compiles a method. This is also available in JDK 7 as -XX:+PrintAssembly. -XX:+LogCompilation and -XX:+PrintCompilation can be useful to see what is being compiled and why.

    It is however very difficult to get exact behaviour and I often end up running a series of experiments and measuring the result so I can confirm the theories I want to test. Often it is not just the JVM which surprises us. We run on Linux and we often find behaviours in file systems and other sub-systems that have us begin with, "Hmmmmm that's interesting..."

  • Re: Excellent presentation and some remarks

    by Peter Veentjer,

    Hi Martin,

    I agree that a short method has many advantages, but once true virtual methods enter the picture, the total result is something that is much harder to optimize (so not purely the inlining optimization) since the JIT has less area to do its magic.

    And with a nesting of method calls, you are still going to end up with some parts being inlined, and some parts not, since at some point the resulting method body exceeds some limit.

    - Could you elaborate a bit how you are determining that your code is machine friendly even though you are using a virtual machine? I end up writing a lot of small test apps and benchmark them, but this is not close to determining that something is cache-friendly. Btw which OS are you using?

    If you want to look under the hood (so the generated machine instructions and the measured performance impact of this code), Intel VTune really is the only tool I know about that helps. It gives feedback on all kinds of behaviour like branch mispredictions or cache misses. It also is great that it can combine the Java code and the generated assembler code in a single view.

  • If you restart every night, isn't that more expensive than performing a GC?

    by Peter Lawrey,

    Do you warm up your application after you restart? Or do you restart at a quiet time and let it warm itself?

    Have you tried direct byte buffers to reduce GC cost?

    I am surprised you can restart and warmup your JVM in less time than it takes to do a GC.

  • Re: If you restart every night isn't that more expensive than performing a

    by Martin Thompson,

    - Do you warm up your application after you restart? Or do you restart at a quiet time and let it warm itself?

    Financial exchanges have an end of trading day period. During this period a lot of things are reset such as FIX sequence numbers. We use this opportunity to bounce our application because we cannot guarantee that a full GC will compact the old generation with CMS. We can warm up the application before making it available again. Our end of day window is 20 mins which is much less than other exchanges, e.g. Equity exchanges are closed for many hours. We can go for over 5 days without needing this but do it each day for good measure. I can see us doing this only once per week in future.

    - Have you tried direct byte buffers to reduce GC cost?

    With the Disruptor pattern we do not incur GC cost from buffers as we reuse buffers pre-allocated in the RingBuffer for both incoming and outgoing messages. The buffers in our RingBuffer of the Disruptor can be heap or direct byte buffers depending on how we want to use them.
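
    As a minimal sketch of the pre-allocation being described (sizes and names are illustrative, not the LMAX RingBuffer): every slot is allocated once at start-up and handed out again on wrap-around, so steady-state message flow creates no garbage:

    final class PreallocatedRingBuffer {
        private final byte[][] slots;
        private final int mask;

        PreallocatedRingBuffer(int sizePowerOfTwo, int slotBytes) {
            slots = new byte[sizePowerOfTwo][];
            for (int i = 0; i < sizePowerOfTwo; i++) {
                slots[i] = new byte[slotBytes];      // allocated once, up front
            }
            mask = sizePowerOfTwo - 1;
        }

        byte[] slotFor(long sequence) {
            return slots[(int) (sequence & mask)];   // the same buffer is reused every time the sequence wraps
        }
    }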

  • Re: The LMAX Architecture

    by Carl Byström,


    The whole point to the talk is that it is best not to use locks on highly contended data. This is why we use a single thread to own the data writes with other threads only reading this data as it is consumed. The single thread only needs a memory barrier to make the changes visible when the updates are complete.


    Could you elaborate some more on how you do this? What kind of barriers are you using to synchronize the read threads and the write thread?

    There also seems to be quite some interest in your architecture and your philosophy regarding this. Would love to see a detailed blog post/article on this subject.

  • Re: The LMAX Architecture

    by Martin Thompson,

    - Could you elaborate some more on how you do this? What kind of barriers are you using to synchronize the read threads and the write thread?

    At a low level we are simply using "volatile" longs tracking sequence counters for synchronising progress between components. I'm guessing your question is a bit more generic. If by barrier you mean memory barrier or fence then we use volatile, or the private sun.misc.Unsafe. If you mean our higher level ThresholdBarrier class then this is a proprietary concept that is part of the Disruptor.

    - There also seems to be quite some interest in your architecture and your philosophy regarding this. Would love to see a detailed blog post/article on this subject.

    Thanks for the interest. Later this year we may write in more detail about this or even open source some of the code.

  • Re: The LMAX Architecture

    by Matt Passell,

    Thanks for the interest. Later this year we may write in more detail about this or even open source some of the code.


    If you do put up a post/article, I'd also be interested. Is there an LMAX/Tradefair blog? If not, where might you post it?

  • Re: The LMAX Architecture

    by Michael Barker,

    We have a community site (community.lmaxtrader.com/), which is very finance/trading related. We hope to have a tech blog up soon and it will probably be linked from the community site or our home page. My personal blog is available at mikes-tech.blogspot.com.

  • Re: The LMAX Architecture

    by Matt Passell,

    Great, thanks! Looking forward to it.

  • Comments and Questions

    by Scott Smith,

    This is a very nice architecture. I have some general comments and some questions.

    Comments:

    C1: I think there is a typo on the slide that says “3 billion instructions per second”. I think it should say “3 billion cycles per second”. As your own slides indicate, a single read from L3 takes 42 cycles.

    C2: I think some of the comments regarding pipeline architectures are not accurate (see below). In fact, I tend to interpret your architecture as a three-stage pipelined system in which the first stage is a parallel stage with three threads. I really like the idea of using a single ring buffer to pass information between the stages of a pipeline. I could use this technique today in a pipelined system I work on and I would only need to modify a small amount of code that is concerned with how data gets passed between stages.

    C3: I think another advantage of your architecture is that since the threads all access the ring buffer in a regular manner, the CPU's pre-fetching of data can be more effective, meaning that in many cases the L3 fetch cost (42 cycles) effectively goes to zero, since it is done in parallel with the work being done by the thread.

    Questions:

    Q1: How does a thread know what it can work on next (based on the preconditions)? Are there flags in the ring buffer entries that indicate what work has been done to the entry?

    Q2: When a thread is waiting for work, I assume it spins? This seems power-intensive, but may be the right approach, based on your needs.

    Q3: Why are there separate ring buffers on either side of the “business logic”? It seems like the business logic and the marshaller can just be on the same ring buffer as the other tasks. I think it may be because you have a rule that data in the ring buffer cannot be modified. So each thread that needs to modify data adds another ring buffer. Is that correct?

    Q4: I don’t understand why it is said that if there is a problem with throughput of a pipelined system, you have to change everything. I think you just have to “fix the slowest stage” just like in your system. I don’t see any difference.

    Q5: A TPS rate of 6 M/sec equates to the slowest thread in the “pipeline” doing its work in 500 clock cycles. As your slides state, a fetch from L3 takes 42 cycles (fetch from RAM takes 182 cycles according to your slides). So I assume the 6M/sec number must be the speed at which a producer/consumer pair can pass data via the ring buffer, not the speed at which the entire system runs.

  • Re: Comments and Questions

    by Martin Thompson,

    Your message is awaiting moderation. Thank you for participating in the discussion.


    C1: I think there is a typo on the slide that says “3 billion instructions per second”. I think it should say “3 billion cycles per second”. As your own slides indicate, a single read from L3 takes 42 cycles.


    You are correct that you do have 3 billion cycles per second per thread on a 3GHz chip. Instructions per second is a function of cycles per second times instructions per cycle (IPC). IPC is not a constant, given the mix of instructions executed. Modern CPUs have multiple execution units, so if just doing integer math then IPC can be as high as 3. However, given the instruction processing pipeline and the number of execution units, a 1:1 ratio is not a bad approximation.

    I'm deliberately avoiding a debate about the cost of cache misses or divide instructions which can change the mix significantly. A large part of this talk was about how to avoid cache misses.
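
    As a back-of-the-envelope illustration only (my own figures, not from the slides), the relationship is simply cycles per second multiplied by IPC:

        // Rough illustration: instructions/s = cycles/s * IPC.
        // The 1:1 IPC figure is the approximation mentioned above.
        public class IpcEstimate {
            public static void main(String[] args) {
                double cyclesPerSecond = 3e9; // 3 GHz core
                double ipc = 1.0;             // approximate average IPC
                System.out.printf("~%.0f instructions per second%n",
                                  cyclesPerSecond * ipc);
            }
        }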


    C2: I think some of the comments regarding pipeline architectures are not accurate (see below). In fact, I tend to interpret your architecture as a three-stage pipelined system in which the first stage is a parallel stage with three threads. I really like the idea of using a single ring buffer to pass information between the stages of a pipeline. I could use this technique today in a pipelined system I work on and I would only need to modify a small amount of code that is concerned with how data gets passed between stages.


    Logical pipelines are a very useful model for this type of problem. We chose this pipeline as one example to demonstrate use of the Disruptor. We have others that range from parsers to network packet capture devices. Some have no dependent stages and all run in parallel. The point we are trying to make is that queues, as an implementation for synchronising activities, are not well suited to modern hardware. We believe the Disruptor is a much better fit.


    C3: I think another advantage of your architecture is that since the threads all access the ring buffer in a regular manner, the CPU's pre-fetching of data can be more effective, meaning that in many cases, the L3 fetch cost (42 cycles) effectively goes to zero, since it is done in parallel with the work being done by the thread.


    I do not believe this is correct. What it does is share the cost of fetching the same shared data from main memory into the L3 in some cases. The cost of 42 cycles from the L3 to the executing core is the same for each thread. However, the thread that creates the data that the others are reading will likely just push the data to the other cores via the L3, so main memory is not involved.


    Q1: How does a thread know what it can work on next (based on the preconditions)? Are there flags in the ring buffer entries that indicate what work has been done to the entry?


    The ring buffer contains a strictly increasing sequence which it uses to index every message. Each event processor contains a sequence number indicating the latest message it has processed. So if an EventProcessor is gated on another, like the business logic gating on the journaller, then it looks at the minimum of the RingBuffer's cursor and the journal EventProcessor's sequence (which is updated after it processes each batch of messages). The business logic EventProcessor can then process from its own sequence value (exclusive) up to the calculated minimum value (inclusive).

    The RingBuffer internally maintains a 2-phase commit protocol for claiming the next slot and then publishing the fact that it is now available for processing.
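
    A rough sketch of that gating calculation (hypothetical names, not the real classes): the dependent processor may only advance to the minimum of the ring buffer's cursor and every sequence it is gated on.

        // Sketch of the gating described above (hypothetical names).
        public final class GatingExample {
            // The dependent processor may consume up to the minimum of the
            // ring buffer cursor and every sequence it is gated on.
            public static long availableSequence(long ringBufferCursor,
                                                 long... gatingSequences) {
                long minimum = ringBufferCursor;
                for (long sequence : gatingSequences) {
                    minimum = Math.min(minimum, sequence);
                }
                return minimum;
            }

            // Business logic loop: process from its own sequence (exclusive)
            // up to the calculated minimum (inclusive).
            public static long process(long ownSequence,
                                       long ringBufferCursor,
                                       long journallerSequence) {
                long available = availableSequence(ringBufferCursor, journallerSequence);
                for (long s = ownSequence + 1; s <= available; s++) {
                    // handle the entry in slot s ...
                }
                return available; // becomes the new business logic sequence
            }
        }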


    Q2: When a thread is waiting for work, I assume it spins? This seems power-intensive, but may be the right approach, based on your needs.


    We have swappable strategies for this. In a strict low-latency environment with sufficient cores the ThresholdBarrier will busy spin on a volatile variable. We have other strategies depending on the use case: at one extreme they spin with a yield to release the quantum, and at the other they use condition variables for signalling, when latency is not so important or CPU cores are restricted.
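
    As a sketch of what such swappable strategies might look like (entirely my own naming, not the production code), the three points on the spectrum are busy spin, yield, and block on a condition variable:

        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.locks.Condition;
        import java.util.concurrent.locks.Lock;
        import java.util.concurrent.locks.ReentrantLock;

        // Hypothetical wait strategies, swappable behind one interface.
        interface WaitStrategy {
            long waitFor(long sequence, SequenceSource cursor) throws InterruptedException;
        }

        interface SequenceSource {
            long get(); // e.g. a volatile read of the ring buffer cursor
        }

        // Lowest latency: burn a core spinning on the volatile cursor.
        final class BusySpinStrategy implements WaitStrategy {
            public long waitFor(long sequence, SequenceSource cursor) {
                long available;
                while ((available = cursor.get()) < sequence) {
                    // spin
                }
                return available;
            }
        }

        // Middle ground: give up the scheduling quantum between checks.
        final class YieldingStrategy implements WaitStrategy {
            public long waitFor(long sequence, SequenceSource cursor) {
                long available;
                while ((available = cursor.get()) < sequence) {
                    Thread.yield();
                }
                return available;
            }
        }

        // Lowest CPU use: park on a condition variable, re-checking periodically.
        final class BlockingStrategy implements WaitStrategy {
            private final Lock lock = new ReentrantLock();
            private final Condition notEmpty = lock.newCondition();

            public long waitFor(long sequence, SequenceSource cursor) throws InterruptedException {
                long available;
                while ((available = cursor.get()) < sequence) {
                    lock.lock();
                    try {
                        notEmpty.await(1, TimeUnit.MILLISECONDS);
                    } finally {
                        lock.unlock();
                    }
                }
                return available;
            }
        }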


    Q3: Why are there separate ring buffers on either side of the “business logic”? It seems like the business logic and the marshaller can just be on the same ring buffer as the other tasks. I think it may be because you have a rule that data in the ring buffer cannot be modified. So each thread that needs to modify data adds another ring buffer. Is that correct?


    The Ring Buffers provide a logical separation between the inputs and outputs of the system. In many cases we have multiple outputs separated according to the type of outbound message, e.g. market data, execution reports (the diagram is simplified to indicate a single output). A single message into the business logic may result in multiple messages sent out. We don't significantly change the data in the Ring Buffer other than to decorate it, e.g. marshalling a byte array into a real message object.
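
    The decoration step could be sketched roughly like this (names are mine, purely illustrative): the unmarshalling stage fills in an object view of the raw bytes in place, so downstream stages never touch the wire format.

        // Sketch of decorating a ring buffer entry in place (hypothetical names).
        final class InputEvent {
            byte[] rawBytes;      // written by the receiving stage
            OrderMessage decoded; // filled in by the unmarshalling stage
        }

        final class OrderMessage {
            long instrumentId;
            long price;
            long quantity;
        }

        final class Unmarshaller {
            // Decode the bytes once and leave both views in the slot,
            // so the business logic only ever sees the decoded message.
            void onEvent(InputEvent event) {
                java.nio.ByteBuffer buffer = java.nio.ByteBuffer.wrap(event.rawBytes);
                OrderMessage message = new OrderMessage();
                message.instrumentId = buffer.getLong();
                message.price = buffer.getLong();
                message.quantity = buffer.getLong();
                event.decoded = message;
            }
        }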


    Q4: I don’t understand why it is said that if there is a problem with throughput of a pipelined system, you have to change everything. I think you just have to “fix the slowest stage” just like in your system. I don’t see any difference.


    A pipeline model at its most basic is a series of stages with the output of one becoming the input of another. The initial version of our system used that model, however we struggled to get the code to look right. We frequently tied ourselves in knots trying to make the business logic wait on the results of journalling and replication but use the output of marshalling, further complicated by having to dynamically change the pipeline (adding or removing the replication stage as cluster state changes). By thinking in terms of EventProcessors and ThresholdBarriers, starting from the premise of parallel execution rather than a series of stages (some potentially parallel), the code became simpler and cleaner.

    If stages run in parallel then, as you say, you can focus on improving the slowest stage. If they cannot run in parallel then the overall cost of the pipeline can dominate. However, most people design pipelines without parallel running because it is very difficult to synchronise multiple parallel stages in an efficient and uncontended manner with queues between the stages. The things to focus on from a design perspective are how Little's and Amdahl's laws apply and the cost of coordination.
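
    For reference, these are the standard formulas for the two laws mentioned (not from the talk), with a small illustration of how they might be applied:

        // Standard formulas only, shown for reference.
        public class QueueingLaws {
            // Little's law: average items in flight L = arrival rate (per s) * time in system (s).
            static double littlesLaw(double arrivalRatePerSec, double timeInSystemSec) {
                return arrivalRatePerSec * timeInSystemSec;
            }

            // Amdahl's law: speedup from n workers when only fraction p of the work is parallel.
            static double amdahlSpeedup(double parallelFraction, int workers) {
                return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
            }

            public static void main(String[] args) {
                // e.g. 100,000 msgs/s with 1 ms in the system -> ~100 in flight on average.
                System.out.println(littlesLaw(100000, 0.001));
                // e.g. 90% parallel work on 8 cores -> roughly 4.7x speedup, not 8x.
                System.out.println(amdahlSpeedup(0.9, 8));
            }
        }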


    Q5: A TPS rate of 6 M/sec equates to the slowest thread in the “pipeline” doing its work in 500 clock cycles. As your slides state, a fetch from L3 takes 42 cycles (fetch from RAM takes 182 cycles according to your slides). So I assume the 6M/sec number must be the speed at which a producer/consumer pair can pass data via the ring buffer, not the speed at which the entire system runs.


    The 6M TPS relates purely to the business logic thread running our matching engine logic. This is running in a component performance test case in isolation. The whole pipeline cannot run at this speed and is greatly affected by the network or storage employed by the journaller and replicator. For example, the difference between 10GigE with kernel bypass and standard 1GigE is nearly an order of magnitude in both throughput and latency.

  • Re: The LMAX Architecture

    by Michael Barker,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    LMAX blogs are now available at: blogs.lmax.com/

  • Re: The LMAX Architecture

    by Michael Barker,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    The Disruptor is now open source: code.google.com/p/disruptor/

  • Great presentation and a question about other development languages

    by Greg Saunders,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Although I am not doing anything of this nature "yet" I was curious as to your thoughts on this being implemented in the Delphi Object Pascal language from Embarcadero. I know this site is geared to Java, .NET, and Ruby... but I would be interested in thoughts on this being implemented in other languages.

    Again... very informative presentation!
