How to use CUDA to solve a complex graph problem

Given a DAG that may have 10,000 nodes, each node performs complex processing. E.g., each node receives millions of independent data items from its predecessors, and processing each data item requires querying 2 large tables (1e5 items). The nodes are different and need to perform different operations on the data. How should I use CUDA to solve this problem? Any suggestion is welcome! Thanks!

Is it possible to group those nodes into several subsets that perform identical operations on their input data?

If each node indeed performs different processing, that means a lot of divergent paths have to be taken on the GPU, which is really bad for performance.
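
For illustration, here is a minimal sketch (the node types and the per-type operations are made up) of the kind of per-thread dispatch that diverges when threads in the same warp handle different node types:

```
__global__ void processItems(const int *nodeType, const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Neighbouring threads may take different branches; the warp then
    // executes those branches serially, which is what hurts performance.
    switch (nodeType[i]) {
    case 0:  out[i] = in[i] * 2.0f;        break;  // stand-in for node type 0
    case 1:  out[i] = sqrtf(fabsf(in[i])); break;  // stand-in for node type 1
    default: out[i] = in[i];               break;
    }
}
```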

Depending on their contents, the tables might be stored in textures for cached read access, or alternatively read from global memory via __ldg(), which also goes through the caches. Is the read order completely random, or is there some locality in the access patterns?
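
As a minimal sketch (the flat table layout and the key array are assumptions on my part), a read-only lookup through __ldg() on compute capability 3.5+ would look roughly like this:

```
// Cached read-only table lookup via __ldg() (requires cc 3.5 or newer).
// 'table' is assumed to be a plain array of float entries in global memory.
__global__ void lookupKernel(const float * __restrict__ table,
                             const int   * __restrict__ keys,
                             float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&table[keys[i]]);  // goes through the read-only (texture) cache
}
```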

This description is so vague that I doubt any specific recommendation can be given based on it. Is there a particular algorithm or specific use case you could name that describes or exemplifies the data structures and processing you have in mind?

My only thought here is that you would want to try to think outside the box. To use a simple analogy: to perform list processing, you could either do it the classical way (following links, etc.), which likely has poor performance on a GPU, or you could implement the list on top of an array (pointers then turn into array indexes) that is kept compact at all times, which permits more easily parallelizable, array-oriented processing.
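
A toy sketch of what I mean (all names made up): each element stores the array index of its successor instead of a pointer, and the whole list lives in one compact array that a kernel can sweep over in parallel:

```
// A "linked list" kept on top of a compact array: 'next' is an array index
// rather than a pointer, -1 marks the end of the list.
struct ListNode {
    float value;
    int   next;
};

// Array-oriented processing: one thread per element, no pointer chasing.
__global__ void scaleAll(ListNode *nodes, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        nodes[i].value *= factor;
}
```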

You might want to look into the details of GPU-accelerated in-memory databases, which sounds like a problem loosely related to yours.

I doubt the nodes can be grouped into several subsets, because the nodes have to be processed level by level.
There are around 10 different types of nodes in the DAG. If I process one node at a time and store the results in global memory for the next kernel, will it be slower than a multi-core processor, e.g., a single Intel Xeon processor?
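
Roughly what I have in mind, as a minimal sketch with only two made-up node kernels (the real DAG would launch one kernel per node in topological order):

```
#include <cuda_runtime.h>

// Stand-ins for two different node types.
__global__ void nodeA(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

__global__ void nodeB(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// One launch per node; the intermediate result stays in global memory and
// nothing is copied back to the host between nodes.
void runTwoNodeChain(float *d_buf0, float *d_buf1, int n)  // both buffers from cudaMalloc
{
    int block = 256, grid = (n + block - 1) / block;
    nodeA<<<grid, block>>>(d_buf0, d_buf1, n);  // node A: d_buf0 -> d_buf1
    nodeB<<<grid, block>>>(d_buf1, d_buf0, n);  // node B: d_buf1 -> d_buf0
    cudaDeviceSynchronize();
}
```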

The tables used are 2 hash tables. I have no idea how to store them on the GPU at the moment.

GPU-accelerated in-memory databases are a helpful example. Thank you.