Nonlinear least-squares?
Hi,

Have any CUDA samples been published/posted for nonlinear least-squares regression (Nelder-Mead, Levenberg-Marquardt, Gauss-Newton, simulated annealing, etc.)?

I have a problem with ~10^7 independent nonlinear regression tasks, each of which is small (between 3 and 16 floats being compared to a 2-parameter model). Seems perfect for GPU computing, no? Any help would be appreciated!

David

#1
Posted 06/14/2008 07:38 PM   
Sounds pretty good. Try having one thread do each task, or 3-16 threads per task, with each thread handling one subpart of the task. Then align the tasks in memory so that you can read/write quickly. Basically you want a stride of 16 floats, so you may want some extra "space" (padding) between the smaller tasks.

Haven't heard of any publications like you're asking for, but I just wanted to chime in and tell you that it looks promising. At least from 30 thousand feet.

Btw -- Approximately what does the task calculation look like? How big is all the task data? Are all the tasks independent?
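To make the suggestion concrete, here is a minimal sketch (all names and the model are assumptions for illustration, not from any published sample) of the one-thread-per-task layout with tasks padded to a fixed 16-float stride:

```cuda
// Hypothetical sketch of the one-thread-per-task layout suggested above.
// Each task holds up to 16 observations, padded to a fixed stride of 16
// floats so every task starts at a predictable, aligned offset.
#include <cuda_runtime.h>

#define TASK_STRIDE 16   // padded length of one task's data, in floats

// Placeholder residual sum for a 2-parameter model y = a * exp(b * i);
// the actual model from the original post is not specified.
__device__ float residual_sum(const float* y, int n, float a, float b)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        float r = y[i] - a * expf(b * (float)i);
        s += r * r;
    }
    return s;
}

// One thread per task: thread t works on its own padded 16-float slot.
__global__ void fit_tasks(const float* data, const int* lengths,
                          float* params, int numTasks)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTasks) return;

    const float* y = data + (size_t)t * TASK_STRIDE;
    int n = lengths[t];            // actual observation count, 3..16

    // The chosen optimizer (Gauss-Newton, LM, ...) would iterate here;
    // as a stub, just store an initial guess.
    params[2 * t + 0] = 1.0f;      // a
    params[2 * t + 1] = 0.0f;      // b
    (void)residual_sum(y, n, 1.0f, 0.0f);
}
```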

#2
Posted 06/14/2008 07:55 PM   
[quote name='kristleifur' date='Jun 14 2008, 12:55 PM']Sounds pretty good. Try having one thread do each task, or 3-16 threads per task, each thread performing each subpart of the task. Then align the tasks in memory, so that you can read/write quickly. Basically you want a stride of 16 floats, so you may want some extra "space" between small tasks.

Haven't heard of any publications like you ask, but I just wanted to chime in and tell you that it looks promising. At least from 30 thousand feet.
[/quote]

Thanks for the thoughts, kristleifur! Would be great if someone out there could port an existing parallel implementation to CUDA...

[quote name='kristleifur' date='Jun 14 2008, 12:55 PM']Btw -- Approximately what does the task calculation look like? How big is all the task data? Are all the tasks independent?
[/quote]

A typical problem would take a stack of 16 1-megapixel float images (64MB total), and solve the above inverse problem on each 16-pixel (64 byte) "column" independently. Additionally, I may have 2-8 sets of this data (128MB-512MB) to solve in parallel.
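For that data shape, the natural layout of the stack (16 contiguous image planes) already gives well-coalesced reads when one thread handles one pixel "column". A sketch, with assumed names:

```cuda
// Sketch of reading one 16-pixel "column" per thread. With the stack
// stored as 16 contiguous 1-megapixel planes, neighbouring threads read
// neighbouring pixels of the same image at each step -- exactly the
// coalesced access pattern the GPU wants.
#include <cuda_runtime.h>

#define NUM_IMAGES 16

__global__ void fit_columns(const float* stack, // NUM_IMAGES planes of numPixels floats
                            float* params,      // 2 fitted floats per pixel
                            int numPixels)
{
    int pix = blockIdx.x * blockDim.x + threadIdx.x;
    if (pix >= numPixels) return;

    float column[NUM_IMAGES];
    for (int img = 0; img < NUM_IMAGES; ++img)
        column[img] = stack[(size_t)img * numPixels + pix]; // coalesced across threads

    // Fit the 2-parameter model to 'column' here; stub output only.
    params[2 * pix + 0] = column[0];
    params[2 * pix + 1] = column[NUM_IMAGES - 1];
}
```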

#3
Posted 06/14/2008 08:26 PM   
If you have code of an existing parallel implementation, it is probably not too difficult to port.

greets,
Denis

#4
Posted 06/15/2008 06:07 AM   
Did you make any progress with this problem?

I am currently working on a similar problem in medical imaging: basically, I have a set of roughly 10^5 vectors of length 200 and need to fit a model to each of these vectors.

In principle, this should be ideally suited for CUDA, but one of my problems is that the library I use for this task on the CPU makes heavy use of function pointers, so I can't port it easily to the GPU.
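A common workaround for the function-pointer limitation is to make the model a compile-time template parameter instead of a runtime pointer. A minimal sketch, with an assumed model and illustrative names (not from any particular library):

```cuda
// Workaround sketch: the model is a struct with a static __device__
// method, passed as a template parameter, so no function pointer is
// ever needed in device code.
#include <cuda_runtime.h>

struct ExpModel {
    __device__ static float eval(float x, const float* p) {
        return p[0] * expf(p[1] * x);  // y = a * exp(b*x), an assumed model
    }
};

template <typename Model>
__global__ void residuals(const float* x, const float* y,
                          const float* p, float* r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = y[i] - Model::eval(x[i], p);
}

// Launch with the model fixed at compile time:
//   residuals<ExpModel><<<grid, block>>>(d_x, d_y, d_p, d_r, n);
```

The cost is that each model needs its own kernel instantiation, but for a fixed fitting problem that is usually acceptable.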

So if anybody has a Levenberg-Marquardt implementation in CUDA, I'd be very glad to hear about it.

Best regards,

Michi

#5
Posted 07/08/2008 09:16 AM   
I'm just chiming in to say I would also like to see someone port some of these algorithms to CUDA.

With my current employment conditions though, if I were to write it even in my free time it would become company property. :P

#6
Posted 12/15/2008 10:18 PM   
Well, I can't offer anything for this yet, but as part of a university project, I'm implementing a mathematics library in C# that focuses on linear algebra and optimization algorithms. Once the library is complete in C#, I'm planning to port as many algorithms as possible to CUDA, to seamlessly speed up the computations when a CUDA-compatible card is detected in the machine.

Like I said, I don't really have a timeframe for this to be done, other than to say that I'll need it for use in another project within a few months, so it shouldn't be longer than that. Also, the complete source code (C# and CUDA) will be released under the BSD license, so you'll be able to grab it and use it in whatever you like. I'll post back here in the future when I have something to share...


#7
Posted 12/16/2008 05:25 PM   
[quote name='profquail' post='477561' date='Dec 16 2008, 10:55 PM']Well, I can't offer anything for this yet, but as part of a school (university) related project, I'm implementing a mathematics library in C# that is focusing on linear algebra and optimization algorithms. Once the library is completed in C#, I'm planning to port as many algorithms to CUDA as possible to seamlessly speed up the computations when a CUDA-compatible card is detected in the machine.

Like I said, I don't really have a timeframe for this to be done, other than to say that I'll need it for use in another project within a few months, so it shouldn't be longer than that. Also, the complete source code (C# and CUDA) will be released under the BSD license, so you'll be able to grab it and use it in whatever you like. I'll post back here in the future when I have something to share...[/quote]

Was wondering if anybody actually ported an implementation to CUDA. I am planning on writing an implementation of the Levenberg-Marquardt algorithm myself, but it'd be nice to know if someone else has done it. If I do complete it, I'll post a link here.

#8
Posted 08/15/2010 10:35 AM   
Is Levenberg-Marquardt a search algorithm, or does it calculate the best fit directly?

#10
Posted 08/15/2010 01:07 PM   
[url="http://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm"]http://en.wikipedia.org/wiki/Levenberg%E2%...uardt_algorithm[/url]
[url="http://www.ics.forth.gr/~lourakis/levmar/"]http://www.ics.forth.gr/~lourakis/levmar/[/url]
A roadmap could be:
Use the Lourakis source and replace the BLAS/LAPACK libraries with CUBLAS, MAGMA or whatever works;
alternatively,
use the Lourakis source without libraries and replace the matrix stuff with material from the SDK etc.
That way, one should be able to keep working code at each step towards a final CUDA version where the data is no longer unnecessarily swapped between host and GPU.
As function pointers go, AFAIK templates will help at least to a point; see the discussion in [url="http://forums.nvidia.com/lofiversion/index.php?t87781.html"]http://forums.nvidia.com/lofiversion/index.php?t87781.html[/url].
A platform solution seems to exist (mainly) for Linux: [url="http://www.plm.eecs.uni-kassel.de/CuPP/"]http://www.plm.eecs.uni-kassel.de/CuPP/[/url].
Kernels that have to be called through a function pointer can be supplied with an ordinary C function which calls the kernel directly; the C function can obviously be called by pointer.
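That last idea can be sketched in a few lines (names are illustrative only): the kernel itself cannot sit behind a function pointer on pre-2.0 hardware, but a plain host function that launches it can.

```cuda
// Wrapper sketch: a host function launches the kernel, and that host
// function is what gets passed around by pointer.
#include <cuda_runtime.h>

__global__ void square_kernel(float* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= v[i];
}

// Ordinary host function wrapping the launch...
void square(float* d_v, int n)
{
    square_kernel<<<(n + 255) / 256, 256>>>(d_v, n);
}

// ...which can then be used like any C function pointer:
typedef void (*step_fn)(float*, int);

void run_step(step_fn f, float* d_v, int n) { f(d_v, n); }
// e.g. run_step(square, d_v, n);
```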

#12
Posted 08/15/2010 02:41 PM   
Hi!

It has been a while since the last posting here...

I was wondering if anyone knows whether there is a nonlinear least-squares implementation for CUDA available somewhere?

Thanks in advance for any hint :)

#14
Posted 04/01/2011 02:26 PM   
[quote name='DELUXEnized' date='01 April 2011 - 03:26 PM' timestamp='1301668007' post='1217397']
hi!

has been a while since the last posting here...

I was wondering if anyone knows, if there is already a nonlinear least-square implementation for cuda somewhere available?

Thanks in advance for any hint :)
[/quote]


I'm porting [url="http://devernay.free.fr/hacks/cminpack/cminpack.html"]cminpack[/url], which includes lmder and lmdif
(Levenberg-Marquardt optimization for a known Jacobian and an approximated Jacobian), to CUDA.

In particular, those functions are ported and tested OK for compute capability 1.1 (with float, i.e. single precision, and without pointers to functions).

I would like it to support double precision and pointers to functions, but I don't have a GPU card to test on and can't find any proper way to simulate, say, compute capability 2.0.

Any idea?
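For anyone following along, the core of what lmder/lmdif iterate boils down, for a 2-parameter model, to a small damped normal-equations solve. A minimal sketch (the model y = a*exp(b*x) is an assumption for illustration; the real routines handle the general case, step acceptance, and lambda updates), written `__host__ __device__` so it can be sanity-checked on the CPU first:

```cuda
// One Levenberg-Marquardt step for a 2-parameter model y = a * exp(b*x).
// Accumulates J^T J (2x2) and J^T r (2x1), damps the diagonal by lambda,
// and solves the 2x2 system in closed form.
#include <math.h>
#include <cuda_runtime.h>

__host__ __device__
void lm_step(const float* x, const float* y, int n,
             float* a, float* b, float lambda)
{
    float JTJ00 = 0, JTJ01 = 0, JTJ11 = 0, JTr0 = 0, JTr1 = 0;
    for (int i = 0; i < n; ++i) {
        float e  = expf((*b) * x[i]);
        float r  = y[i] - (*a) * e;       // residual
        float j0 = e;                     // -dr/da
        float j1 = (*a) * x[i] * e;       // -dr/db
        JTJ00 += j0 * j0;  JTJ01 += j0 * j1;  JTJ11 += j1 * j1;
        JTr0  += j0 * r;   JTr1  += j1 * r;
    }
    // Damped normal equations: (J^T J + lambda * diag(J^T J)) * delta = J^T r.
    JTJ00 *= 1.0f + lambda;
    JTJ11 *= 1.0f + lambda;
    float det = JTJ00 * JTJ11 - JTJ01 * JTJ01;
    if (fabsf(det) < 1e-20f) return;      // singular system; skip the update
    *a += ( JTJ11 * JTr0 - JTJ01 * JTr1) / det;
    *b += (-JTJ01 * JTr0 + JTJ00 * JTr1) / det;
}
```

With lambda near zero this reduces to a Gauss-Newton step; large lambda shrinks it toward a scaled gradient step, which is the interpolation the algorithm is built on.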

#15
Posted 01/04/2012 05:04 PM   