Based on my own experience with batch processing of small matrices, I would recommend having each thread handle one, or a few, matrices. This means that the per-thread program is essentially simple scalar code for the various matrix operations. You may want to download the “batched solver” code from the registered developer website for an example of how to do this for the matrix inverse. The code is under a BSD license, so you could simply use it as a building block in your processing pipeline.
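
To make the one-matrix-per-thread idea concrete, here is a rough sketch. It is not the "batched solver" code, just an illustration under a few assumptions of mine: 3x3 single-precision matrices stored row-major and back to back, and a plain Gauss-Jordan inverse with partial pivoting. The names `invert_small`, `invert_batch_kernel` and `invert_batch_host` are made up for this example. The kernel is only compiled under nvcc; the host loop runs the same scalar routine so you can check results on the CPU.

```cpp
#include <math.h>
#include <cstddef>

// Under nvcc the scalar routine is usable from both host and device code;
// a plain C++ compiler sees an ordinary inline function.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

constexpr int N = 3;  // assumed: small dimension known at compile time

// Invert one row-major N x N matrix in place (Gauss-Jordan, partial pivoting).
// Returns false if a pivot is exactly zero; the contents of `a` are then
// unspecified. This is ordinary scalar code with no thread cooperation.
HOST_DEVICE inline bool invert_small(float* a) {
    float inv[N * N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            inv[i * N + j] = (i == j) ? 1.0f : 0.0f;

    for (int col = 0; col < N; ++col) {
        // Choose the row with the largest magnitude in this column as pivot.
        int piv = col;
        for (int r = col + 1; r < N; ++r)
            if (fabsf(a[r * N + col]) > fabsf(a[piv * N + col])) piv = r;
        if (a[piv * N + col] == 0.0f) return false;

        // Swap the pivot row into place in both matrices.
        if (piv != col) {
            for (int j = 0; j < N; ++j) {
                float t = a[col * N + j];
                a[col * N + j] = a[piv * N + j];
                a[piv * N + j] = t;
                t = inv[col * N + j];
                inv[col * N + j] = inv[piv * N + j];
                inv[piv * N + j] = t;
            }
        }

        // Scale the pivot row so the pivot becomes 1.
        float s = 1.0f / a[col * N + col];
        for (int j = 0; j < N; ++j) {
            a[col * N + j] *= s;
            inv[col * N + j] *= s;
        }

        // Eliminate this column from every other row.
        for (int r = 0; r < N; ++r) {
            if (r == col) continue;
            float f = a[r * N + col];
            for (int j = 0; j < N; ++j) {
                a[r * N + j] -= f * a[col * N + j];
                inv[r * N + j] -= f * inv[col * N + j];
            }
        }
    }
    for (int i = 0; i < N * N; ++i) a[i] = inv[i];
    return true;
}

#ifdef __CUDACC__
// One thread per matrix: copy it into thread-local storage, invert, write back.
__global__ void invert_batch_kernel(float* mats, int* ok, int count) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= count) return;
    std::size_t base = static_cast<std::size_t>(idx) * N * N;
    float a[N * N];
    for (int i = 0; i < N * N; ++i) a[i] = mats[base + i];
    bool good = invert_small(a);
    for (int i = 0; i < N * N; ++i) mats[base + i] = a[i];
    ok[idx] = good ? 1 : 0;
}
#endif

// CPU reference: the same scalar routine, with a loop in place of the grid.
inline void invert_batch_host(float* mats, int* ok, int count) {
    for (int idx = 0; idx < count; ++idx)
        ok[idx] = invert_small(mats + static_cast<std::size_t>(idx) * N * N) ? 1 : 0;
}
```

A launch would look something like `invert_batch_kernel<<<(count + 127) / 128, 128>>>(d_mats, d_ok, count);` with `d_mats` and `d_ok` being device buffers you have allocated yourself. With the back-to-back layout assumed here, neighbouring threads read addresses `N * N` floats apart, so the loads are not coalesced. Interleaving the matrices, so that element `i` of consecutive matrices is adjacent in memory, is one common way to address that if memory throughput turns out to be the bottleneck.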