Overcoming misaligned memory accesses

Here I am describing a simplified version of my problem, but it captures the essential operations.

Let us say that we have an integer array which can hold 1000 numbers between 0 and 70,000. Call it int sen[1000]. So, sen could look like this: [23, 7000, 567, ...]. Now, for each element of sen, I need to do some processing, and the processing looks like this:

for (c = 0; c < N; c++)
    neu1[c] += syn0[c + word * N];

In the actual code, there are multiple such loops for each element of sen.
Data structures:
1. float neu1[N]
2. float syn0[V * N]

Here, V = 70,000 (the range of numbers that can be found in sen) and N = 400.
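
To make the indexing concrete, this is the sequential shape I have in mind (the outer loop and the variable w are my own reconstruction; sen, syn0, neu1, N are from above):

/* Sequential reference (my reconstruction): the accumulation loop above
   runs once per element of sen. syn0 is laid out as V rows of N floats,
   so sen[w] selects a row of syn0. Whether neu1 is reset between words
   depends on the real code; the access pattern is the same either way. */
for (int w = 0; w < 1000; w++) {      /* 1000 = length of sen  */
    int word = sen[w];                /* a value in [0, 70000) */
    for (int c = 0; c < N; c++)
        neu1[c] += syn0[c + word * N];
}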

I am thinking that the best way to parallelize this would be to process the elements of sen in parallel. Considering all the loops, I think I should use N threads per word, so that a single thread accesses the global memory (syn0) only once per loop. Also, since all the neu1 updates are independent, they can reside in the private memory of the threads and be updated independently.
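
Here is a minimal CUDA sketch of that layout, assuming one block per element of sen and N threads per block. The kernel name, the per-word neu1 output buffer, and the single write-back at the end are my assumptions, not the actual code:

/* One block per word of sen, one thread per dimension c.
   Each thread keeps its neu1 slot in a register and touches
   global memory (syn0) once per accumulation loop. */
__global__ void process_sen(const int *sen, const float *syn0,
                            float *neu1, int N)
{
    int w = blockIdx.x;           /* which element of sen      */
    int c = threadIdx.x;          /* which of the N dimensions */
    int word = sen[w];

    float acc = 0.0f;             /* private neu1[c] for this word */
    acc += syn0[c + word * N];    /* the loop body from above      */
    /* ...the other per-word accumulation loops would go here...   */
    neu1[w * N + c] = acc;        /* single write back at the end  */
}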

My main concern right now is the following:

Global memory accesses happen in a random fashion, because syn0 is accessed based on the values of the elements in sen, and, as we can see, those values do not appear in any order. Is this a big problem? Or can we hide the memory latency by giving the GPU a large enough number of threads?
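
To spell the pattern out with the example values above (N = 400): within one word, consecutive threads read consecutive addresses, so the randomness is only in which row of syn0 a block lands on:

/* Rows of syn0 touched for sen = [23, 7000, 567, ...]:
     word 0: syn0[  23 * 400 + 0 ..   23 * 400 + 399]   (contiguous)
     word 1: syn0[7000 * 400 + 0 .. 7000 * 400 + 399]   (contiguous)
     word 2: syn0[ 567 * 400 + 0 ..  567 * 400 + 399]   (contiguous)
   The 400 reads inside each row are consecutive; only the row start
   jumps around with the values in sen. */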

If my sen array is 1000 elements long, then I will be launching 400,000 threads (N = 400).
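
With the simplified sizes, the launch for the hypothetical kernel sketched above would look like this (d_sen, d_syn0, d_neu1 are device copies of the arrays; the names are illustrative):

process_sen<<<1000, 400>>>(d_sen, d_syn0, d_neu1, 400);  /* 1000 blocks x 400 threads */
cudaDeviceSynchronize();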

Cross-posting: cuda - Parallelizing the pseudocode to work on a GPU: overcoming misaligned memory accesses - Stack Overflow