This looks awesome. I've been itching to try out some ideas I have after having gone through Bishop's book, but I've been hesitant to write the algorithms from scratch. Now I'll have to decide between learning matlab or a library such as this.
You should definitely use a scriptable ML library. The process is very iterative and not suited to a compiled language like C++. I use scikit-learn a lot, but the matlab toolboxes and R are great too. At its heart ML is a lot of stats, so use something built for maths, not C++. It doesn't really make sense to break out C++ until you know exactly what algorithm and settings you need and your application is real time.
I very much agree with this. I had even been thinking of making a DSL (probably Scheme based) oriented to ML instead of a library. I would find that more useful in the exploratory phase.
If the OP was thinking of writing his own algorithms, and this is a linkable library with that heavy math already implemented, couldn't he write bindings for Python/Lua/Tcl/Ruby and have everything he needs for script-ability, or am I missing something?
I'm in the Stanford/Coursera machine learning course right now, and something like this is nearly exactly what I've been looking for.
As some others have said, GPLv3 is off-putting, but there is the LGPL mlpack lib (http://www.mlpack.org/) (also C++). Personally, the only way this project could be improved is if it were pure C and under a BSD, MIT, or similar license. Quite looking forward to checking these out, though.
Honestly, you guys are crazy if you think Shark is going to help you learn machine learning. It's ideal for deploying ML on things like embedded computers, robotics, games etc. where real-time learning is required. Machine learning requires a lot of experimentation and C++ is a terrible medium for that.
There are loads of good machine learning libraries implemented for Python and matlab. Pretty much every good paper in machine learning is accompanied by an algorithm implemented in matlab, Python or R. Learn using those reference implementations. Once you've figured out what you want, then by all means deploy on a system in C++ using Shark. I do robotics for a living, and I do go from scripting to C++. Unless it's absolutely necessary I avoid C++. I only require C++ for things like vision, which is so CPU hungry that it has its own computer; every other algorithm stays in Python.
Actually you forget that performance is critical when you need to train for days at a time. If I used Octave/Matlab/R, my current project might take months to train instead of weeks. All my ML code is high performance threaded C++. I recommend you use a good template linear algebra library like Eigen; you can do plenty of experimentation in C++. I find that with a few modern libraries and the required experience, a C++ programmer is just as efficient as, if not more efficient than, a Python/R/Matlab programmer. It comes down to the skill of the programmer and the proper choice of libraries.
True, matlab, Octave and R are all rubbish for performance. I use Python + NumPy, which delegates to BLAS for the hardcore linear algebra stuff. I don't normally find C++ gains me all that much.
You can also do GPU acceleration pretty easily using Theano (e.g. http://deeplearning.net/software/theano/tutorial/using_gpu.h...)
So I reckon my GPU accelerated Python still beats a C++ pthreads approach, and is a lot faster to develop on.
Your mileage may vary. From what you said you probably know what you are doing, and maybe GPU is not applicable. I was really replying to the initial comments from people who want to start learning machine learning on a C++ system. Training for days suggests you are doing something hardcore like MCMC/DBN/Gaussian Processes; learners should not start there though.
I'm doing deep belief networks with dropout, and don't have access to GPUs with good double precision performance. I used to write graphics device drivers, so GPU computing has a special place in my heart and I definitely agree with you there performance wise. It is funny though that my little laptop is hitting training times similar to some papers where people are using low end GPUs; it's amazing what you can do when you pay attention to performance.
I suspect my tuned C++ code will work quite well on an Intel MIC, and that is probably where I'm going to go when I have more resources to throw at the problem. I do know that Theano uses Alex's C++ CUDA code under the covers, and I have done lots of reading of Theano's code, looking at implementation details to help develop my own. I'm just not a big fan of Python (or most scripting languages, actually); perhaps I'm just too old school and have written C, C++, C# and Java too long. If it doesn't smell or feel like C, I feel like Scotty in Star Trek 4 when he was making the transparent aluminum on the Mac.
Implement the algorithm yourself, first, in Python+NumPy. The only reason I feel comfortable with Gaussian Processes and SVMs is that I wrote the code to solve them by hand.
Once you're happy with the basics, and can test your ideas with code you intimately understand, optimise for speed by using a library like this.
The only tricky part would be writing a quadratic programming solver. Alternatives: either solve a linear SVM using gradient descent (simpler to write), or offload the core of the algorithm to an existing solver like cvxopt.
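To give a feel for the gradient descent route, here is a minimal sketch rather than a tuned implementation. It minimises the regularised hinge loss of a linear SVM by full-batch subgradient descent. It uses plain Python lists instead of NumPy purely so it runs with no dependencies; the function names, the toy data, and the hyperparameters (`lam`, `lr`, `epochs`) are all made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w.x_i + b)))
    by full-batch subgradient descent. Labels in y must be +1 or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Gradient of the L2 regulariser (the bias is not regularised).
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Only points inside the margin contribute a hinge subgradient.
            if yi * (dot(w, xi) + b) < 1:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify x as +1 or -1 by the sign of the decision function."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
     [-2.0, -2.0], [-3.0, -1.5], [-1.5, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, xi) for xi in X]
```

With NumPy the inner loops collapse into a couple of vectorised expressions, and a kernel SVM would still need the QP (or something like SMO), which is where a solver such as cvxopt comes in.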
On reflection I guess I might have had more free time to spend on this than a normal person - I did the SVM as a [small] part of my masters project, so if you're time constrained with a real job and a life then it might be best to disregard me.
If you had the quadratic solver, I would think it would be reasonable to add the rest of the code. Once you started adding costs, gammas, etc., I would think it would take a while. I spent hours looking at the source code of libSVM at my last job and never really understood what the hell was going on.