Affects Version/s: 0.9.2
Fix Version/s: None
There are use cases of the MDA framework, such as federation metadata processing, which are almost embarrassingly parallel: for example, when thousands of SAML metadata items are passed through a series of stages which require no interaction between the items. It would make sense to add a stage which allows this to be parallelised explicitly.
Such a stage would most obviously have a Pipeline property, an ExecutorService and some kind of partitioning strategy determining how many tasks to create.
Minimising copies of the items is important, so partitioning would have to work as follows:
- Figure out how many tasks to create
- Create that many new empty collections of items
- Partition the existing collection into those new collections (without copying the items)
- Clear the existing collection (the new collections now have the only references to the items)
- Create and submit tasks to run the pipeline on each of the new collections (results in a happens-before relation and the items are no longer "ours")
- Wait for each task's Future in order, and return the resulting collection's items to the original collection when it is done
- This should result in the same item ordering as in the original collection, assuming that the pipeline doesn't reorder or delete items (it's fine for it to do so, though)
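The steps above can be sketched in Java roughly as follows. This is only a minimal illustration under stated assumptions, not the MDA framework's API: the class name ParallelStageSketch, the execute method and the use of a Consumer to stand in for "run the pipeline on this collection" are all hypothetical, and error handling is deliberately thin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names come from the MDA framework. */
final class ParallelStageSketch {

    /**
     * Splits {@code items} into at most {@code taskCount} contiguous chunks,
     * runs {@code pipeline} on each chunk via {@code executor}, then puts the
     * results back into {@code items} in chunk order.
     * {@code items} must be a mutable list.
     */
    static <T> void execute(List<T> items, int taskCount,
            ExecutorService executor, Consumer<List<T>> pipeline)
            throws InterruptedException, ExecutionException {
        if (items.isEmpty()) {
            return;
        }
        // Never create more tasks than there are items.
        int n = Math.max(1, Math.min(taskCount, items.size()));

        // Build n new collections holding contiguous slices of the original.
        // Only references are copied; the items themselves are not.
        List<List<T>> partitions = new ArrayList<>(n);
        int base = items.size() / n;
        int extra = items.size() % n;
        int from = 0;
        for (int i = 0; i < n; i++) {
            int to = from + base + (i < extra ? 1 : 0);
            partitions.add(new ArrayList<>(items.subList(from, to)));
            from = to;
        }

        // The new collections now hold the only references to the items.
        items.clear();

        // Submit one task per partition; each task returns its own partition.
        List<Future<List<T>>> futures = new ArrayList<>(n);
        for (List<T> partition : partitions) {
            futures.add(executor.submit(() -> {
                pipeline.accept(partition);
                return partition;
            }));
        }

        // Wait for the futures in submission order so that the original
        // ordering is preserved if the pipeline does not reorder items.
        // Note: if a task fails, get() throws and later partitions are
        // not returned to items; a real stage would need to handle that.
        for (Future<List<T>> future : futures) {
            items.addAll(future.get());
        }
    }
}
```

Contiguous chunks are used here because they make the "same ordering as the original" property fall out of simply appending results in submission order; a round-robin partition would need an extra merge step.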
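I'm fairly sure that the guarantees we get from an ExecutorService and from Future mean that this is thread safe even if the item classes aren't: actions taken before submitting a task happen-before the task runs, and the task's actions happen-before Future.get() returns, so each item is only ever touched by one thread at a time.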
The critical issues are the choice of ExecutorService (which determines how many threads will be used) and the choice of how many tasks to use (which should be larger than the number of threads, but not too large). There are probably some decent heuristics around this, but the two choices interact.
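As a starting point for discussion, one possible heuristic for the task count is sketched below. It is an assumption rather than a recommendation: the class TaskCountSketch and the tasksPerThread and minItemsPerTask parameters are hypothetical tuning knobs, and sensible values would need to be found by measurement.

```java
/** Hypothetical heuristic; the parameters are illustrative tuning knobs. */
final class TaskCountSketch {

    /**
     * Aims for a few tasks per thread so that uneven chunks can balance out,
     * but caps the count so that each task still gets a worthwhile number
     * of items.
     */
    static int taskCount(int itemCount, int threads,
            int tasksPerThread, int minItemsPerTask) {
        if (itemCount <= 0) {
            return 0;
        }
        // "Bigger than the thread count": several tasks per thread.
        int wanted = Math.max(1, threads) * Math.max(1, tasksPerThread);
        // "But not too big": never fewer than minItemsPerTask items per task.
        int maxUseful = Math.max(1, itemCount / Math.max(1, minItemsPerTask));
        return Math.min(wanted, maxUseful);
    }
}
```

For example, with 8 threads, 4 tasks per thread and a minimum of 50 items per task, 10000 items would give 32 tasks, while 100 items would give only 2. The thread count could come from the pool's configuration or from Runtime.getRuntime().availableProcessors().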