TPL Dataflow – The Successor to CCR
TPL Dataflow is Microsoft’s new library for highly concurrent applications. Using asynchronous message passing and pipelining, it promises to offer more control than thread pools and better performance than manual threading. The downside is that you have to adhere to design patterns that may be unfamiliar to .NET programmers.
A dataflow consists of a series of “blocks”. Each block, known as a port in the older CCR, can be a source and/or target for data. Data usually enters a dataflow by being posted to a propagation block, which is a block that implements both ISourceBlock<T> and ITargetBlock<T>. Since this block is a source, it may be linked to other target or propagation blocks. Data flows from one block to the next asynchronously, often being buffered at the source or target until it is needed.
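The linking described above can be sketched roughly as follows. This is a minimal illustration, not code from the CTP: it assumes the API as it later shipped in the System.Threading.Tasks.Dataflow package (BufferBlock, TransformBlock, ActionBlock, LinkTo), which may differ from the preview bits, and the class name PipelineSketch is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // assumed: the later System.Threading.Tasks.Dataflow package

public static class PipelineSketch // hypothetical name, for illustration only
{
    // Posts the inputs to a buffer, squares them in a transform block,
    // and collects the results in a terminal action block.
    public static async Task<List<int>> RunAsync(IEnumerable<int> inputs)
    {
        var results = new List<int>();

        var buffer = new BufferBlock<int>();                     // propagation block: both source and target
        var square = new TransformBlock<int, int>(n => n * n);   // propagation block that transforms each message
        var collect = new ActionBlock<int>(n => results.Add(n)); // target only; handles one message at a time by default

        // Let completion travel down the chain so the last block knows when to finish.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        buffer.LinkTo(square, link);
        square.LinkTo(collect, link);

        foreach (var n in inputs)
            buffer.Post(n);

        buffer.Complete();
        await collect.Completion;
        return results;
    }

    public static void Main()
    {
        var results = RunAsync(new[] { 1, 2, 3 }).GetAwaiter().GetResult();
        Console.WriteLine(string.Join(", ", results)); // prints "1, 4, 9"
    }
}
```

The caller only ever touches the first block; everything downstream is driven by the links.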
As the name implies, the infrastructure beneath TPL Dataflow is .NET 4’s Task Parallel Library. As with the TPL itself, you can replace the default scheduler with a custom implementation. Out of the box you get one based on .NET’s thread pool and one that uses the synchronization context framework. The latter would be used when you want the dataflow to run on a particular thread, for example when using dataflow to manipulate a GUI. The documentation is unclear on this point, but it appears that you can set the scheduler on a block-by-block basis. If true, this would provide an excellent means for marshaling data back onto the GUI thread.
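For what it is worth, the library as later released in the System.Threading.Tasks.Dataflow package does expose a TaskScheduler property on each block's options, so a per-block scheduler can be sketched as below. Whether the CTP behaves the same way is an assumption. To keep the sample runnable in a console, a ConcurrentExclusiveSchedulerPair stands in for the GUI scheduler; in a real UI application one would more likely pass TaskScheduler.FromCurrentSynchronizationContext(), captured on the UI thread.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // assumed: the later System.Threading.Tasks.Dataflow package

public static class SchedulerSketch // hypothetical name, for illustration only
{
    // Returns true if the final block's delegate ran on the scheduler we assigned to it.
    public static async Task<bool> RunAsync()
    {
        // Stand-in for a GUI scheduler (assumption made so the sample runs without a UI).
        TaskScheduler uiLike = new ConcurrentExclusiveSchedulerPair().ExclusiveScheduler;

        bool ranOnChosenScheduler = false;

        // No options given: this block uses the default, thread-pool-based scheduler.
        var compute = new TransformBlock<int, int>(n => n * 2);

        // Only this block is moved to the custom scheduler.
        var present = new ActionBlock<int>(
            n => { ranOnChosenScheduler = TaskScheduler.Current == uiLike; },
            new ExecutionDataflowBlockOptions { TaskScheduler = uiLike });

        compute.LinkTo(present, new DataflowLinkOptions { PropagateCompletion = true });

        compute.Post(21);
        compute.Complete();
        await present.Completion;
        return ranOnChosenScheduler;
    }

    public static void Main()
    {
        Console.WriteLine(RunAsync().GetAwaiter().GetResult());
    }
}
```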
By default a dataflow is tuned for performance over fairness. Implementation-wise, this means that once a block is activated it will continue to process data until it runs out. To prevent one block from consuming all available resources, you can put a message count limit on it. When set, the block will process at most that many messages before ending its current task and allowing another to take over.
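In the later released package this limit is exposed as MaxMessagesPerTask on the block options; treat that name as an assumption with respect to the CTP. The sketch below counts how many messages each underlying task handled, so the cap can be observed directly. The class name FairnessSketch is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // assumed: the later System.Threading.Tasks.Dataflow package

public static class FairnessSketch // hypothetical name, for illustration only
{
    // Processes `messages` items and returns a map of task id -> messages handled by that task.
    public static async Task<Dictionary<int, int>> RunAsync(int messages, int limit)
    {
        var perTask = new Dictionary<int, int>();

        var worker = new ActionBlock<int>(
            n =>
            {
                // The delegate runs inside the block's current processing task.
                int id = Task.CurrentId ?? -1;
                perTask.TryGetValue(id, out int count);
                perTask[id] = count + 1;
            },
            // After `limit` messages the block ends its task and schedules a fresh one.
            new ExecutionDataflowBlockOptions { MaxMessagesPerTask = limit });

        for (int i = 0; i < messages; i++)
            worker.Post(i);

        worker.Complete();
        await worker.Completion;
        return perTask;
    }

    public static void Main()
    {
        var perTask = RunAsync(messages: 10, limit: 2).GetAwaiter().GetResult();
        Console.WriteLine($"tasks used: {perTask.Count}, largest share: {perTask.Values.Max()}");
    }
}
```

A lower limit trades some throughput for giving other blocks on the same scheduler a turn.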
Blocks can consume data in a greedy or non-greedy fashion. The latter is very important in situations where multiple targets are vying for messages from the same source. For example, when using multiple targets to load-balance a single source, you want to ensure those targets are non-greedy lest everything be dumped into the first greedy block. Another reason to use non-greedy consumers is to allow messages to be dropped when they are replaced by newer versions. An example of this would be the Broadcast block. It offers each message to every one of its targets. If the targets don’t accept the message, it remains available only until the next time the Broadcast block receives a message.
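A rough sketch of the load-balancing case follows. It assumes the later released System.Threading.Tasks.Dataflow API, where execution blocks such as ActionBlock have no explicit greedy flag; instead, setting BoundedCapacity to 1 makes a busy target decline further offers, which gives the non-greedy effect described above. The class name and the artificial 5 ms delay are illustrative only.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // assumed: the later System.Threading.Tasks.Dataflow package

public static class LoadBalanceSketch // hypothetical name, for illustration only
{
    // Returns how many messages each of the two workers handled.
    public static async Task<int[]> RunAsync(int messages)
    {
        var handled = new int[2];
        var source = new BufferBlock<int>();

        // A worker that holds at most one message at a time. While it is busy it
        // declines offers, so the source can hand the message to the other worker.
        ActionBlock<int> MakeWorker(int index) => new ActionBlock<int>(
            async n =>
            {
                await Task.Delay(5);   // simulated work
                handled[index]++;      // each worker only touches its own slot
            },
            new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

        var first = MakeWorker(0);
        var second = MakeWorker(1);

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        source.LinkTo(first, link);
        source.LinkTo(second, link);

        for (int i = 0; i < messages; i++)
            source.Post(i);

        source.Complete();
        await Task.WhenAll(first.Completion, second.Completion);
        return handled;
    }

    public static void Main()
    {
        var handled = RunAsync(20).GetAwaiter().GetResult();
        Console.WriteLine($"first: {handled[0]}, second: {handled[1]}");
    }
}
```

Removing the BoundedCapacity setting makes the first linked worker accept every message, which is the "dumped into the first greedy block" outcome.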
Turning to the issue of locking, TPL Dataflow has an interesting design to avoid deadlocks. When a given block needs input from multiple sources, it will wait until all sources have data available. At that point it will use a two-phase commit to ensure it can actually acquire data from each source. It is highly recommended that you rely on this mechanism rather than trying to manually lock data that is inside the dataflow.
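As an illustration of a block that waits on several sources, here is a sketch using JoinBlock in non-greedy mode, as found in the later released System.Threading.Tasks.Dataflow package (an assumption relative to the CTP). In this mode the join leaves an offered message in its source until every input has one available, and only then takes them together. The class and variable names are invented.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // assumed: the later System.Threading.Tasks.Dataflow package

public static class JoinSketch // hypothetical name, for illustration only
{
    // Returns the joined pair plus how many ids were still waiting in their
    // source before the second input arrived.
    public static async Task<Tuple<Tuple<int, string>, int>> RunAsync()
    {
        var ids = new BufferBlock<int>();
        var names = new BufferBlock<string>();

        // Non-greedy: do not take a message from one source until all sources can supply one.
        var join = new JoinBlock<int, string>(new GroupingDataflowBlockOptions { Greedy = false });
        ids.LinkTo(join.Target1);
        names.LinkTo(join.Target2);

        ids.Post(1);
        int stillInSource = ids.Count; // the id stays in its buffer; the join has only postponed it

        names.Post("one");             // now both inputs are available, so the join can take both

        var pair = await join.ReceiveAsync(TimeSpan.FromSeconds(5));
        return Tuple.Create(pair, stillInSource);
    }

    public static void Main()
    {
        var result = RunAsync().GetAwaiter().GetResult();
        Console.WriteLine($"{result.Item1.Item1} / {result.Item1.Item2}, waiting before join: {result.Item2}");
    }
}
```

No user code takes a lock here; the coordination between the two sources is left to the join block.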
This design, especially when combined with blocks like Broadcast, can lead to race conditions. Essentially the problem is that one message may be delivered to, and acted upon by, multiple blocks. Since locks can interfere with the scheduler, it is much safer to use immutable data types as messages.
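One way to follow that advice is sketched below, again assuming the later released System.Threading.Tasks.Dataflow API. PriceTick is a made-up message type with read-only properties; because it cannot change after construction, the same instance can be handed to several blocks without locks, and a block that needs a different value creates a new instance instead.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // assumed: the later System.Threading.Tasks.Dataflow package

// Hypothetical immutable message: all state is set once in the constructor.
public sealed class PriceTick
{
    public PriceTick(string symbol, decimal price) { Symbol = symbol; Price = price; }
    public string Symbol { get; }
    public decimal Price { get; }

    // "Modification" yields a new object and leaves the original untouched.
    public PriceTick WithPrice(decimal price) => new PriceTick(Symbol, price);
}

public static class ImmutableBroadcastSketch // hypothetical name, for illustration only
{
    public static async Task<Tuple<List<decimal>, List<decimal>>> RunAsync()
    {
        var seenRaw = new List<decimal>();
        var seenDoubled = new List<decimal>();

        // The clone function can return the same instance because the message is immutable.
        var broadcast = new BroadcastBlock<PriceTick>(tick => tick);

        var raw = new ActionBlock<PriceTick>(t => seenRaw.Add(t.Price));
        var doubled = new ActionBlock<PriceTick>(t => seenDoubled.Add(t.WithPrice(t.Price * 2).Price));

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        broadcast.LinkTo(raw, link);
        broadcast.LinkTo(doubled, link);

        broadcast.Post(new PriceTick("ABC", 10m));
        broadcast.Post(new PriceTick("ABC", 11m));

        broadcast.Complete();
        await Task.WhenAll(raw.Completion, doubled.Completion);
        return Tuple.Create(seenRaw, seenDoubled);
    }

    public static void Main()
    {
        var result = RunAsync().GetAwaiter().GetResult();
        Console.WriteLine(string.Join(", ", result.Item1)); // prints "10, 11"
        Console.WriteLine(string.Join(", ", result.Item2)); // prints "20, 22"
    }
}
```

With a mutable message type, the second block's change would have been visible to the first, which is exactly the race the paragraph above warns about.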
The TPL Dataflow library is available with the Async CTP.
Roy Rapoport Aug 28, 2014