11 Nested for loops in CUDA

Hello, I am converting a Java simulation program to CUDA for a friend. His program has 11 nested for loops, and I'm trying to get as much performance as possible out of it. I am planning to use a grid-stride indexing scheme, and I would appreciate any snippets and suggestions for maximizing throughput. Currently there are 1.3 quadrillion combinations to iterate through and test, and each combination needs about 64 bytes of local storage. I'm not too great at CUDA yet, so sorry if there is an obvious solution.
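To make the question concrete, here is a minimal sketch of the kind of grid-stride flattening I have in mind. It treats the 11 loops as one 64-bit flat index and decodes that index back into 11 counters inside the kernel. Everything specific in it is an assumption for illustration: the loop bounds are tiny made-up values, `test_combination` is a placeholder for the real per-combination check, and the launch configuration is arbitrary.

```cuda
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_LOOPS = 11;

// Hypothetical loop bounds; the real ones would come from the Java program.
__constant__ int d_bounds[NUM_LOOPS];

// Decode a flat index into 11 loop counters (mixed radix, innermost loop varies fastest).
__host__ __device__ inline void decode_index(uint64_t flat, const int* bounds, int* idx) {
    for (int d = NUM_LOOPS - 1; d >= 0; --d) {
        idx[d] = static_cast<int>(flat % static_cast<uint64_t>(bounds[d]));
        flat /= static_cast<uint64_t>(bounds[d]);
    }
}

// Placeholder for the real test; here a combination "passes" if its counters sum to a multiple of 7.
__host__ __device__ inline bool test_combination(const int* idx) {
    int sum = 0;
    for (int d = 0; d < NUM_LOOPS; ++d) sum += idx[d];
    return sum % 7 == 0;
}

// Grid-stride loop: each thread starts at its global id and advances by the total thread count.
__global__ void search_kernel(uint64_t total, unsigned long long* hits) {
    const uint64_t stride = static_cast<uint64_t>(gridDim.x) * blockDim.x;
    unsigned long long local_hits = 0;
    for (uint64_t flat = static_cast<uint64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
         flat < total; flat += stride) {
        int idx[NUM_LOOPS];
        decode_index(flat, d_bounds, idx);
        if (test_combination(idx)) ++local_hits;
    }
    // One atomic per thread rather than one per hit.
    if (local_hits > 0) atomicAdd(hits, local_hits);
}

int main() {
    // Made-up bounds, small enough to verify against the CPU.
    const int h_bounds[NUM_LOOPS] = {2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2};
    uint64_t total = 1;
    for (int d = 0; d < NUM_LOOPS; ++d) total *= static_cast<uint64_t>(h_bounds[d]);

    cudaMemcpyToSymbol(d_bounds, h_bounds, sizeof(h_bounds));
    unsigned long long* d_hits = nullptr;
    cudaMalloc(&d_hits, sizeof(unsigned long long));
    cudaMemset(d_hits, 0, sizeof(unsigned long long));

    // Deliberately fewer threads than combinations so the stride loop is exercised.
    search_kernel<<<8, 128>>>(total, d_hits);

    unsigned long long gpu_hits = 0;
    cudaError_t err = cudaMemcpy(&gpu_hits, d_hits, sizeof(gpu_hits), cudaMemcpyDeviceToHost);
    cudaFree(d_hits);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // CPU reference using the same decode and test functions.
    unsigned long long cpu_hits = 0;
    for (uint64_t flat = 0; flat < total; ++flat) {
        int idx[NUM_LOOPS];
        decode_index(flat, h_bounds, idx);
        if (test_combination(idx)) ++cpu_hits;
    }

    printf("combinations: %llu, GPU hits: %llu, CPU hits: %llu\n",
           static_cast<unsigned long long>(total), gpu_hits, cpu_hits);
    assert(gpu_hits == cpu_hits);
    return 0;
}
```

This should build with something like `nvcc -o search search.cu`. The flat index is 64-bit because 1.3 quadrillion does not fit in 32 bits. I assume the real run would have to be split across many kernel launches, and I am not sure whether decoding all 11 counters per combination is too expensive compared with flattening only the outer loops and keeping the inner ones as ordinary loops inside each thread.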