Jason Brownlee published an excerpt from his “Small Projects Methodology: Learn and Practice Applied Machine Learning” focusing on the process of implementing machine learning algorithms:
Implementing a machine learning algorithm in code can teach you a lot about
the algorithm and how it works.
In this post you will learn how to be effective at implementing machine
learning algorithms and how to maximize your learning from these projects.
If you think about it, the process of implementing machine learning algorithms is in many ways similar to how machine learning works.
Original title and link: How to Implement a Machine Learning Algorithm
Found by Daniel Gutierrez from Inside BigData:
Vowpal Wabbit (aka VW) is an open source, fast, out-of-core learning system
library and program started and led by John Langford, who works at Microsoft
Research New York. Vowpal Wabbit is notable for its efficient, scalable
implementation of online machine learning and for its support of a number of
machine learning reductions, importance weighting, and a selection of
different loss functions and optimization algorithms.
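Online, out-of-core learning — the core of what VW implements — can be sketched in a few lines. Below is an illustrative single-pass SGD for logistic regression over a streamed data set; this is a toy sketch of the idea, not VW’s actual code, and all names in it are made up for the example:

```python
import math
import random

def sgd_logistic(stream, dim, lr=0.1):
    """One pass of online SGD for logistic regression.

    `stream` yields (label, features) pairs where label is 0/1 and
    features is a dict of index -> value. Each example is seen once,
    so memory use is independent of the data set size (out-of-core).
    """
    w = [0.0] * dim
    for y, x in stream:
        # prediction via the logistic function on a sparse dot product
        z = sum(w[i] * v for i, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))
        # gradient step on the logistic loss for this single example
        for i, v in x.items():
            w[i] -= lr * (p - y) * v
    return w

# tiny synthetic stream: label is 1 exactly when feature 0 is positive
random.seed(0)
data = []
for _ in range(1000):
    v = random.uniform(-1, 1)
    data.append((1 if v > 0 else 0, {0: v, 1: 1.0}))

w = sgd_logistic(iter(data), dim=2)
```

VW adds the things that make this practical at scale — feature hashing, adaptive learning rates, reductions — but the single-pass, constant-memory loop is the essential shape of online learning.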
The project is on GitHub, there’s a short wiki page, and a presentation.
Continue to read ➤
Bill Gates in a tweet-based interview:
Q: @fesja: @BillGates if you were 20 years old now, what would you do? which area?
A: Bill Gates: When it comes to technology, there are four areas where I think a lot of exciting things will happen in the coming decades: big data, machine learning, genomics, and ubiquitous computing. So if I were 20 years old today, I’d be looking into one (or maybe more!) of those fields.
To say that Bill Gates always had a great understanding of technology trends would be an understatement.
Original title and link: Bill Gates: Four Areas of Technology I’d look into ( ©myNoSQL)
Below is a compilation of APIs that have benefited from machine learning in
one way or another. We truly are living in the future, so strap into your
rocketship and prepare for blastoff.
Original title and link: List of Machine Learning APIs ( ©myNoSQL)
Created by Andreas Mueller:
Then you can head to this Quora thread to read a bit more about the pros and cons of the different classification algorithms.
Original title and link: Machine Learning Cheatsheets ( ©myNoSQL)
Aria Haghighi about the present and future of products based on machine learning:
But I think there’s an even bigger barrier beyond ingenious model design and engineering skills. In the case of machine translation and speech recognition, the problem being solved is straightforward to understand and well-specified. Many of the NLP technologies that I think will revolutionize consumer products over the next decade are much vaguer. How, exactly, can we take the excellent research in structured topic models, discourse processing, or sentiment analysis and make a mass-appeal consumer product?
Original title and link: Machine Learning: Interesting Problems Are Never Off the Shelf ( ©myNoSQL)
Skytree Server connects to any number of existing data stores, including Hadoop, and, says Hack, is tens of thousands of times faster than existing tools, performing in minutes tasks that would have taken hours or days. As of now, it’s tuned to five specific use cases the company says are the most common — recommendation systems, anomaly/outlier identification, predictive analytics, clustering and market segmentation, and similarity search.
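Skytree’s internals aren’t public, but the simplest form of one of the use cases listed above — anomaly/outlier identification — fits in a few lines. A hedged sketch using a z-score cutoff (the function, data, and threshold are all illustrative, not anything from Skytree):

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean,
    the simplest version of anomaly/outlier identification.

    Note: a single extreme point inflates the standard deviation
    (the "masking" effect), so a cutoff of 2.0 is used here instead
    of the more common 3.0."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# sensor-style readings with one obvious anomaly
data = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 25.0, 10.0]
outliers = zscore_outliers(data)
```

Production systems like the one described would use far more robust statistics and scale to billions of points, but the use case itself is this recognizable.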
There’s a limited but free Skytree version available on demand, so I expect to read some more about it soon.
Original title and link: Skytree Launches a Machine Learning Server ( ©myNoSQL)
Ricky Ho published yet another great article giving a high-level summary of the algorithms used by different machine learning models:
- decision trees
- linear regression methods
- neural networks
- Bayesian networks
- support vector machines
- nearest neighbor
For classification and regression problems, there are different choices of machine learning models, each of which can be viewed as a black box that solves the same problem. However, each model comes from a different algorithmic approach and will perform differently on different data sets. The best way is to use cross-validation to determine which model performs best on test data.
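The cross-validation recipe in that quote needs no library at all; here is a minimal k-fold loop scoring a toy nearest-neighbour model on synthetic 1D data (every name and parameter here is illustrative):

```python
import random

def knn_predict(train, x, k=3):
    """1D k-nearest-neighbour vote: classify x by its k closest training points."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)

def cross_validate(data, k_folds=5):
    """k-fold cross-validation: average held-out accuracy over the folds."""
    random.shuffle(data)
    fold = len(data) // k_folds
    accs = []
    for i in range(k_folds):
        test = data[i * fold:(i + 1) * fold]
        train = data[:i * fold] + data[(i + 1) * fold:]
        correct = sum(1 for x, y in test if knn_predict(train, x) == y)
        accs.append(correct / len(test))
    return sum(accs) / k_folds

# synthetic two-class data: class 0 centred at -1, class 1 at +1
random.seed(42)
data = [(random.gauss(-1, 0.5), 0) for _ in range(100)] + \
       [(random.gauss(1, 0.5), 1) for _ in range(100)]
acc = cross_validate(data)
```

Swapping `knn_predict` for any other model from the list above — a decision tree, a linear model — and comparing the resulting accuracies is exactly the model-selection procedure the article recommends.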
Original title and link: Characteristics of Machine Learning Models ( ©myNoSQL)
The presentation the Cloudera Data Science team (Josh Wills, Tom Pierce, Jeff Hammerbacher) gave a couple of days ago on the state of machine learning and Hadoop.
Continue to read ➤
Trying to combine MPI and Hadoop MapReduce to eliminate the drawbacks of each:
- MPI: The Allreduce function. The starting state for Allreduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
- MapReduce: Conceptual simplicity. One easy to understand function is enough.
- MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
- MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
- MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
- MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
- MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be worked around, but that’s annoying.
- MapReduce: Robustness. We don’t capture all the robustness of MapReduce, which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around startup failures, giving you something with a >99% success rate and a running time reliable to within a factor of 2.
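The Allreduce operation described in the first bullet is easy to model: n nodes each start with a number and all end with the sum. A minimal sketch of that end state (a real MPI implementation uses a reduce up a spanning tree followed by a broadcast down it, which this simulation does not attempt):

```python
def allreduce(node_values):
    """Simulated Allreduce (sum): every node starts with its own number
    and ends with the sum of all nodes' numbers."""
    total = sum(node_values)           # reduce phase
    return [total] * len(node_values)  # broadcast phase

# five nodes, each holding one partial value (e.g. a local gradient component)
before = [1.0, 2.0, 3.0, 4.0, 5.0]
after = allreduce(before)
```

In the hybrid setup described above, each mapper would run the learning code on its node-local data and call allreduce on its gradient, so that every node sees the global gradient while MapReduce still handles scheduling, data locality, and cleanup.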
Original title and link: Combining Hadoop MapReduce and MPI for Terascale Learning ( ©myNoSQL)