Have any CUDA samples been published/posted for nonlinear least-squares regression (Nelder–Mead, Levenberg–Marquardt, Gauss–Newton, simulated annealing, etc.)?
I have a problem with ~10^7 independent nonlinear regression tasks, each of which is small (3–16 floats being fit to a 2-parameter model). Seems perfect for GPU computing, no? Any help would be appreciated!
Sounds pretty good. Try having one thread handle each task, or 3–16 threads per task with each thread handling one data point of the task. Then align the tasks in memory so that you can read/write quickly. Basically you want a stride of 16 floats, so you may want some padding between the smaller tasks.
Haven’t heard of any publications like you ask, but I just wanted to chime in and tell you that it looks promising. At least from 30 thousand feet.
Btw – Approximately what does the task calculation look like? How big is all the task data? Are all the tasks independent?
Thanks for the thoughts, kretleifur! Would be great if someone out there could port an existing parallel implementation to CUDA…
A typical problem would take a stack of 16 1-megapixel float images (64MB total), and solve the above inverse problem on each 16-pixel (64 byte) “column” independently. Additionally, I may have 2-8 sets of this data (128MB-512MB) to solve in parallel.
I am currently working on a similar problem in medical imaging: basically, I have a set of roughly 10^5 vectors of length 200 and need to fit a model to each of them.
In principle, this should be ideally suited for CUDA, but one of my problems is that the library I use for this task on the CPU makes heavy use of function pointers, so I can’t port it easily to the GPU.
So if anybody has a Levenberg–Marquardt implementation in CUDA, I’d be very glad to hear about it.
Well, I can’t offer anything for this yet, but as part of a school (university) related project, I’m implementing a mathematics library in C# focused on linear algebra and optimization algorithms. Once the library is completed in C#, I’m planning to port as many algorithms to CUDA as possible, to seamlessly speed up the computations when a CUDA-compatible card is detected in the machine.
Like I said, I don’t really have a timeframe for this to be done, other than to say that I’ll need it for use in another project within a few months, so it shouldn’t be longer than that. Also, the complete source code (C# and CUDA) will be released under the BSD license, so you’ll be able to grab it and use it in whatever you like. I’ll post back here in the future when I have something to share…
Was wondering if anybody actually ported an implementation to CUDA. I am planning on writing an implementation of the Levenberg–Marquardt algorithm myself, but it’d be nice to know if someone else has done it. If I do complete it, I’ll post a link here.
I’m also looking into a non-linear least squares implementation on GPU.
I have just ordered my GTX 680, which has the latest compute capability, 3.0. I would be happy to help test your code on my card once it arrives and is installed.
There does not seem to be any GPU package out there for parallel continuous minimization. The only thing I could find is this article, where section 6.3 covers the continuous Weierstrass function. The authors are part of the so-called ParadisEO project, which just put their GPU version of the code online, but I was not able to find the continuous case on their svn.
Do you plan to share this work as open source software? If so, I’d be very interested in checking out what you have so far. I have compute capability 2.0 and above devices that I could test on.
The forums are back up again and my card has arrived (I’ve been playing around a bit with it meanwhile).
If you - and others, of course - are interested to develop this into an open source project, count me in.
I have experience working with optimization problems and have started on the CUDA documentation. If I were doing it myself (which I still plan to do), I would start with levmar for a known Jacobian, box-constrained, using PyCUDA. Single-precision floats are sufficient for my purposes.
I haven’t worked with cminpack, but I just saw that it has a less restrictive license, which should add some momentum if any of you are working on commercial software.
Also, if something similar has been started somewhere else, please mention it here.