CUDA for SQL GROUP BY: can CUDA perform SQL GROUP BY efficiently?
Is it at all possible for CUDA to perform a SQL GROUP BY-like operation? If so, how would it be done, and would it be fast enough to be more beneficial than using SQL?

For example, with 150,000 rows of decimal values:

Combination 1: Column1 & Column2
SELECT Column1, Column2, COUNT(*), COUNT(Column3), AVG(Column4)
FROM Table
GROUP BY Column1, Column2

Combination 2: Column1 & Column3
SELECT Column1, Column3, COUNT(*), COUNT(Column4), AVG(Column5)
FROM Table
GROUP BY Column1, Column3

Currently, SQL averages about 800 ms per grouping, but since there are 72 columns, there are roughly 800,000 possible combinations of 2 to 4 columns. As a little arithmetic shows, the full run would take about 7 days to complete.

CUDA has a similar example called Histogram, which only counts the number of occurrences; that is essentially a basic GROUP BY with a COUNT(*).

Any help on how to accomplish this with CUDA would be appreciated.

#1
Posted 04/25/2012 05:52 AM   
This could work well on CUDA, but the choice between a histogram and a sort plus segmented reduction depends on how many histogram buckets you end up with. If the number is small (<< 1,000), use the histogram; if it is large (>> 1,000), try sorting the rows by the columns you are reducing over to create many small buckets, then do a reduction within each bucket.

#2
Posted 04/25/2012 09:50 PM   
What interests you is to COUNT(*) the entries for each (ColumnA, ColumnB) pair: 150,000 rows of 2 doubles (2 x 8 bytes) each, so about 2.3 MB of data.

I would bet on single-threaded processing by the CPU, since the CPU cache can hold the whole dataset. And instead of sorting, I would go for a hash table mapping the 2 values to a counter, which may fit into 4 MB of cache (any basic CPU).

One direct loop, no multithreading, NO SORTING: you just build your hash table, adding a new element (count = 1) or incrementing the count when the key is already present. I bet it will be faster than sending the data back and forth to the GPU, not to mention executing a single line of kernel code!

If you had 1 GB+ of data, I would have chosen the GPU, where the GPU memory subsystem would make it far faster than its CPU counterpart, perhaps with as many counters per entry as there are CUDA devices, to avoid atomics on global memory :)

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#3
Posted 04/26/2012 07:40 PM   
[quote name='parallelis' date='26 April 2012 - 08:40 PM' timestamp='1335469228' post='1401305']
If you had 1GB+ of data, I would have chosen the GPU
[/quote]

I might have interpreted the question incorrectly, but I thought he was accessing combinations of columns from a table with 72 columns, so the array size would be 72 x 150,000 x 16 bytes = ~165 MB.

"Currently SQL would average about 800 ms for each group but since there are 72 columns"

#4
Posted 04/26/2012 10:13 PM   
That's right, the GPU would have been my choice for a much larger dataset!


#5
Posted 05/01/2012 03:09 PM   
Perhaps take a look at the Alenka GPU database. I believe it has this functionality and is open source.
http://sourceforge.net/projects/alenka/

#6
Posted 05/01/2012 03:45 PM   
I just took a look and it's extremely promising for small databases (a few GB) with heavy concurrent access. I would like to try it on a staging server to see how it compares with MongoDB!


#7
Posted 05/03/2012 06:32 PM   
[quote name='parallelis' date='03 May 2012 - 07:32 PM' timestamp='1336069929' post='1403908']
I just took a look and it's extremely promising for little databases (few GB) with huge concurrent access, would like to try it on a staging server to see how it compares with MongoDB!
[/quote]

Alenka is currently built on thrust primitives. We're hoping to make the individual set and join operations faster by a factor of 10 compared to what is currently in thrust.

#8
Posted 05/03/2012 09:14 PM   