Otherwise, if there’s a move that creates two lines of two in a row, play that.

The Industrial Revolution automated manual work and the Information Revolution did the same for mental work, but machine learning automates automation itself. Without it, programmers become the bottleneck holding up progress. With it, the pace of progress picks up. If you’re a lazy and not-too-bright computer scientist, machine learning is the ideal occupation, because learning algorithms do all the work but let you take all the credit. On the other hand, learning algorithms could put us out of our jobs, which would only be poetic justice.

Machine learning also has a growing role on the battlefield. Learners can help dissipate the fog of war, sifting through reconnaissance imagery, processing after-action reports, and piecing together a picture of the situation for the commander. Learning powers the brains of military robots, helping them keep their bearings, adapt to the terrain, distinguish enemy vehicles from civilian ones, and home in on their targets. DARPA’s AlphaDog carries soldiers’ gear for them. Drones can fly autonomously with the help of learning algorithms; although they are still partly controlled by human pilots, the trend is for one pilot to oversee larger and larger swarms. In the army of the future, learners will greatly outnumber soldiers, saving countless lives.

If so few learners can do so much, the logical question is: Could one learner do everything? In other words, could a single algorithm learn all that can be learned from data? This is a very tall order, since it would ultimately include everything in an adult’s brain, everything evolution has created, and the sum total of all scientific knowledge.
But in fact all the major learners (including nearest-neighbor, decision trees, and Bayesian networks, a generalization of Naïve Bayes) are universal in the following sense: if you give the learner enough of the appropriate data, it can approximate any function arbitrarily closely, which is math-speak for learning anything. The catch is that “enough data” could be infinite. Learning from finite data requires making assumptions, as we’ll see, and different learners make different assumptions, which makes them good for some things but not others.

So… what shall it be? Date or no date? Is there a pattern that distinguishes the yeses from the nos? And, most important, what does that pattern say about today?

Accuracy you can believe in.

We can learn decision trees using a variant of the “divide and conquer” algorithm. First we pick an attribute to test at the root. Then we focus on the examples that went down each branch and pick the next test for those. (For example, we check whether tax-cutters are pro-life or pro-choice.) We repeat this for each new node we induce until all the examples in a branch have the same class, at which point we label that branch with the class.

Backprop is an instance of a strategy that is very common in both nature and technology: if you’re in a hurry to get to the top of the mountain, climb the steepest slope you can find. The technical term for this is gradient ascent (if you want to get to the top) or gradient descent (if you’re looking for the valley bottom). Bacteria can find food by swimming up the concentration gradient of, say, glucose molecules, and they can flee from poisons by swimming down their gradient. All sorts of things, from aircraft wings to antenna arrays, can be optimized by gradient descent. Backprop is an efficient way to do it in a multilayer perceptron: keep tweaking the weights so as to lower the error, and stop when all tweaks fail.
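The “keep tweaking the weights so as to lower the error” loop can be sketched in a few lines. This is a minimal illustration of plain gradient descent, not of backprop itself; the function names, the learning rate, the stopping tolerance, and the bowl-shaped error surface are all assumptions chosen for the example.

```python
def gradient_descent(grad, start, learning_rate=0.1, tolerance=1e-8, max_steps=10_000):
    """Repeatedly step downhill; stop once the steps become negligibly small."""
    weights = list(start)
    for _ in range(max_steps):
        # Each tweak moves every weight a little against its gradient component.
        step = [learning_rate * g for g in grad(weights)]
        weights = [w - s for w, s in zip(weights, step)]
        # "Stop when all tweaks fail": no weight moved by more than the tolerance.
        if max(abs(s) for s in step) < tolerance:
            break
    return weights


# Hypothetical error surface: error(w) = (w0 - 3)^2 + (w1 + 1)^2,
# a single valley whose bottom is at w = (3, -1).
def error_gradient(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


best = gradient_descent(error_gradient, start=[0.0, 0.0])
print(best)  # close to [3.0, -1.0]
```

On this surface there is only one valley, so the loop always reaches the bottom; on a bumpier surface the same loop would stop in whichever valley it happened to descend into.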
With backprop, you don’t have to figure out how to tweak each neuron’s weights from scratch, which would be too slow; you can do it layer by layer, tweaking each neuron based on how you tweaked the neurons it connects to. If you had to throw out your entire machine-learning toolkit in an emergency save for one tool, gradient descent is probably the one you’d want to hold on to.

Hyperspace is a double-edged sword. On the one hand, the higher-dimensional the space, the more room it has for highly convoluted surfaces and local optima. On the other hand, to be stuck in a local optimum you have to be stuck in every dimension, so it’s more difficult to get stuck in many dimensions than it is in three. In hyperspace there are mountain passes all over the (hyper) place. So, with a little help from a human sherpa, backprop can often find its way to a perfectly good set of weights. It may be only the mystical valley of Shangri-La, not the sea, but why complain if in hyperspace there are millions of Shangri-Las, each with billions of mountain passes leading to it?

If the states and observations are continuous variables instead of discrete ones, the HMM becomes what’s known as a Kalman filter. Economists use Kalman filters to remove noise from time series of quantities like GDP, inflation, and unemployment. The “true” GDP values are the hidden states; at each time step, the true value should be similar to the observed one, but also to the previous true value, since the economy seldom makes abrupt jumps. The Kalman filter trades off these two, yielding a smoother curve that still accords with the observations. When a missile cruises to its target, it’s a Kalman filter that keeps it on track. Without it, there would have been no man on the moon.

Organizing the world into objects and categories is second nature to an adult but not to an infant, and even less to Robby the robot.
We could endow him with a visual cortex in the form of a multilayer perceptron and show him labeled examples of all the objects and categories in the world (here’s Mommy close up, here’s Mommy far away), but we’d never be done. What we need is an algorithm that will spontaneously group together similar objects, or different images of the same object. This is the problem of clustering, and it’s one of the most intensively studied in machine learning.

Suppose we zoom out from Palo Alto, and I give you the GPS coordinates of the main cities in the Bay Area:

Again, you can probably surmise just by looking at this plot that the cities are on a bay, and if you draw a line running through them, you can locate each city using just one number: how far it is from San Francisco along that line. But PCA can’t find this curve; instead, it draws a straight line running down the middle of the bay, where there are no cities at all. Far from elucidating the shape of the data, PCA obscures it.

All this power comes at a cost, however. In an ordinary classifier, such as a decision tree or a perceptron, inferring an entity’s class from its attributes is a matter of a few lookups and a bit of arithmetic. In a network, each node’s class depends indirectly on all the others’, and we can’t infer it in isolation. We can resort to the same kinds of inference techniques we used for Bayesian networks, like loopy belief propagation or MCMC, but the scale is different. A typical Bayesian network has perhaps thousands of variables, but a typical social network has millions of nodes or more. Luckily, because the model of the network consists of many repetitions of the same features with the same weights, we can often condense the network into “supernodes,” each consisting of many nodes that we know will have the same probabilities, and solve a much smaller problem with the same result.

Out of many models, one.