Segmenting randomly sized nodes
Hi

I'm having a stab at implementing [url="http://research.microsoft.com/apps/pubs/default.aspx?id=70568"]Zhou et al.[/url]'s paper on real-time kd-tree construction on the GPU. I have a fairly basic implementation up and running, but right now I'm copying the number of photons covered by each node back to the host in order to schedule a kernel for each node with an optimal number of threads/blocks. Obviously lots of copying and several kernel launches per node isn't a fast solution, so now I want to extend it with GPU-computed segments, and this is where I'm stuck at the moment.

I figure that for kd-nodes with segment sizes of

| 4 | 1 | 5 | 2 |

I can use CUDPP's scan to compute where each node's segments should start:

| 0 | 4 | 5 | 10 |, giving me a total of 12 segments.

But now I need to write those segments into my arrays, and this is where I can't find a reasonable solution.

I could of course start a thread for each node and let that thread handle segment creation for that node. That would work reasonably well for the example above, but since nodes are created using spatial median split, SAH or VVH, they can become pretty unbalanced, and for 65k photons I could end up with nodes looking like

| 253 | 1 | 1 | 1 |

where one thread per node really won't perform well.

Instead I would like to 'flip' the algorithm and start a thread per segment, which would scale well, but what I can't figure out is how such a thread would find out which node it belongs to.

So in the first example, which resulted in 12 segments, I would start 12 threads belonging to nodes

| 0 | 0 | 0 | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 |

So does anyone have a solution for going from | 0 | 4 | 5 | 10 | to | 0 | 0 | 0 | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 |, or another way to perform this segmentation of arbitrarily sized nodes?

Hoping you can help

/papaboo
#1
Posted 09/11/2010 09:31 AM   
[quote name='papaboo' post='1115723' date='Sep 11 2010, 11:31 AM']So does anyone have a solution to going from | 0 | 4 | 5 | 10 | to | 0 | 0 | 0 | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | or another way to perform this segmentation of arbitrarily sized nodes?[/quote]

Could the threads in a block work from a todo-list in shared memory, or would that be unhelpful because of the SIMD architecture? Since you sometimes have to synchronise threads, can they run in an uncoordinated fashion? I'm not quite clear on this; maybe someone can clarify it for us.

#3
Posted 09/13/2010 02:57 PM   