Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more. Each of these is used by different communities and has different associations. Some have a long half-life, some less so. In this book I use the term machine learning to refer broadly to all of them.

My first direct experience of rule learning in action was when, having just moved to the United States to start graduate school, I applied for a credit card. The bank sent me a letter saying "We regret that your application has been rejected due to INSUFFICIENT-TIME-AT-CURRENT-ADDRESS and NO-PREVIOUS-CREDIT-HISTORY" (or some other all-caps words to that effect). I knew right then that there was much research left to do in machine learning.

Learning to cure cancer

Hyperspace is a double-edged sword. On the one hand, the higher dimensional the space, the more room it has for highly convoluted surfaces and local optima. On the other hand, to be stuck in a local optimum you have to be stuck in every dimension, so it's more difficult to get stuck in many dimensions than it is in three. In hyperspace there are mountain passes all over the (hyper) place. So, with a little help from a human sherpa, backprop can often find its way to a perfectly good set of weights. It may be only the mystical valley of Shangri-La, not the sea, but why complain if in hyperspace there are millions of Shangri-Las, each with billions of mountain passes leading to it?

The exploration-exploitation dilemma

If we measure not just the probability of vowels versus consonants, but the probability of each letter in the alphabet following each other, we can have fun generating new texts with the same statistics as Onegin: choose the first letter, then choose the second based on the first, and so on.
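This letter-by-letter procedure can be sketched in a few lines of Python; the training text below is just a placeholder standing in for Onegin, and the order (how many previous letters each new letter depends on) is adjustable:

```python
import random
from collections import defaultdict, Counter

def train(text, order=1):
    """Count which character follows each context of `order` characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context, nxt = text[i:i + order], text[i + order]
        model[context][nxt] += 1
    return model

def generate(model, length, order=1, seed=None):
    """Start from a random context, then repeatedly sample the next character
    in proportion to how often it followed the current context in training."""
    rng = random.Random(seed)
    context = rng.choice(list(model))
    out = list(context)
    while len(out) < length:
        counts = model.get(context)
        if not counts:          # context never seen with a follower: stop early
            break
        nxt = rng.choices(list(counts), weights=counts.values())[0]
        out.append(nxt)
        context = (context + nxt)[-order:]
    return "".join(out)

sample = "the quick brown fox jumps over the lazy dog " * 5
model = train(sample, order=2)
print(generate(model, 40, order=2, seed=0))
```

With `order=1` every letter depends only on its predecessor, as in the simplest version described above; raising the order makes the output locally more coherent.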
The result is complete gibberish, of course, but if we let each letter depend on several previous letters instead of just one, it starts to sound more like the ramblings of a drunkard, locally coherent even if globally meaningless. Still not enough to pass the Turing test, but models like this are a key component of machine-translation systems, like Google Translate, which lets you see the whole web in English (or almost), regardless of the language the pages were originally written in.

The Bayesian method is not just applicable to learning Bayesian networks and their special cases. (Conversely, despite their name, Bayesian networks aren't necessarily Bayesian: frequentists can learn them, too, as we just saw.) We can put a prior distribution on any class of hypotheses (sets of rules, neural networks, programs) and then update it with the hypotheses' likelihood given the data. The Bayesians' view is that it's up to you what representation you choose, but then you have to learn it using Bayes' theorem. In the 1990s, they mounted a spectacular takeover of the Conference on Neural Information Processing Systems (NIPS for short), the main venue for connectionist research. The ringleaders (so to speak) were David MacKay, Radford Neal, and Michael Jordan. MacKay, a Brit who was a student of John Hopfield's at Caltech and later became chief scientific advisor to the UK's Department of Energy, showed how to learn multilayer perceptrons the Bayesian way. Neal introduced the connectionists to MCMC, and Jordan introduced them to variational inference. Finally, they pointed out that in the limit you could "integrate out" the neurons in a multilayer perceptron, leaving a type of Bayesian model that made no reference to them. Before long, the word neural in the title of a paper submitted to NIPS became a good predictor of rejection. Some researchers joked that the conference should change its name to BIPS, for Bayesian Information Processing Systems.
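The Bayesian recipe mentioned above, prior times likelihood and then normalize, can be spelled out for a toy hypothesis class. Here the two hypotheses are candidate biases of a coin, and all the numbers are invented purely for illustration:

```python
# Bayesian update over a toy hypothesis class: two candidate coin biases.
hypotheses = {"fair": 0.5, "biased": 0.8}      # P(heads) under each hypothesis
prior = {"fair": 0.9, "biased": 0.1}           # belief before seeing any data

def likelihood(p_heads, flips):
    """Probability of the observed flip sequence under a given bias."""
    prob = 1.0
    for f in flips:
        prob *= p_heads if f == "H" else 1 - p_heads
    return prob

data = "HHHT"  # observed flips

# Bayes' theorem: posterior is proportional to prior times likelihood.
unnormalized = {h: prior[h] * likelihood(p, data) for h, p in hypotheses.items()}
z = sum(unnormalized.values())
posterior = {h: v / z for h, v in unnormalized.items()}
print(posterior)
```

The same three lines of arithmetic apply whether the hypotheses are coin biases, rule sets, or neural networks; only the likelihood computation changes.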
In reality, we usually let SVMs violate some constraints, meaning classify some examples incorrectly or by less than the margin, because otherwise they would overfit. If there's a noisy negative example somewhere in the middle of the positive region, we don't want the frontier to wind around inside the positive region just to get that example right. But the SVM pays a penalty for each example it gets wrong, which encourages it to keep those to a minimum. SVMs are like the sandworms in Dune: big, tough, and able to survive a few explosions from slithering over landmines, but not too many.

The most important question in any analogical learner is how to measure similarity. It could be as simple as Euclidean distance between data points, or as complex as a whole program with multiple levels of subroutines whose final output is a similarity value. Either way, the similarity function controls how the learner generalizes from known examples to new ones. It's where we insert our knowledge of the problem domain into the learner, making it the analogizers' answer to Hume's question. We can apply analogical learning to all kinds of objects, not just vectors of attributes, provided we have a way of measuring the similarity between them. For example, we can measure the similarity between two molecules by the number of identical substructures they contain. Methane and methanol are similar because they have three carbon-hydrogen bonds in common and differ
only in the replacement of a hydrogen atom by a hydroxyl group.

And so we have traveled through the territories of the five tribes, gathering their insights, negotiating the border crossings, wondering how the pieces might fit together. We know immensely more now than when we started out. But something is still missing. There's a gaping hole in the center of the puzzle, making it hard to see the pattern. The problem is that all the learners we've seen so far need a teacher to tell them the right answer. They can't learn to distinguish tumor cells from healthy ones unless someone labels them "tumor" or "healthy." But humans can learn without a teacher; they do it from the day they're born. Like Frodo at the gates of Mordor, our long journey will have been in vain if we don't find a way around this barrier. But there is a path past the ramparts and the guards, and the prize is near. Follow me…

This direction, known as the first principal component of the data, is also the direction along which the spread of the data is greatest. (Notice how, if you project the shops onto the x axis, they're farther apart in the right figure than in the left one.) After you've found the first principal component, you can look for the second one, which in this case is the direction of greatest variation at right angles to University Avenue. On a map, there's only one possible direction left (the direction of the cross streets). But if Palo Alto were on a hillside, one or both of the first two principal components would be partly uphill, and the third and last one would point up into the air. We can apply the same idea to data in thousands or millions of dimensions, like face images, successively looking for the directions of greatest variation until the remaining variability is small, at which point we can stop.
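One common way to carry out this procedure is to take the eigenvectors of the data's covariance matrix, ordered by eigenvalue. A sketch on synthetic two-dimensional "shop" positions (the points and numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "shops": spread widely along one street direction,
# with only a little sideways scatter across it.
along = rng.normal(0, 10, size=200)
across = rng.normal(0, 1, size=200)
points = np.column_stack([along, 0.5 * along + across])

# Principal components: eigenvectors of the covariance matrix,
# ordered by eigenvalue (the variance captured along each direction).
centered = points - points.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]             # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("fraction of variance per component:", eigvals / eigvals.sum())
projected = centered @ eigvecs[:, 0]          # coordinate along "University Avenue"
```

Keeping only the components with large eigenvalues, and dropping the rest, is exactly the "stop when the remaining variability is small" step described above.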
For example, after rotating the axes in the figure above, most shops have y = 0, so the average y is very small, and we don't lose too much information by ignoring the y coordinate altogether. And if we decide to keep y, surely z (up into the air) is insignificant. As it turns out, the whole process of finding the principal components can be accomplished in one shot with a bit of linear algebra. Best of all, a few dimensions often account for the bulk of the variation even in very high-dimensional data. Even if that's not the case, eyeballing the data in the top two or three dimensions often yields a lot of insight, because it takes advantage of your visual system's amazing powers of perception.

CHAPTER NINE: The Pieces of the Puzzle Fall into Place

Most of all, though, Alchemy addresses the problems that each of the five tribes of machine learning has worked on for so long. Let's look at each of them in turn.

A company like this could quickly become one of the most valuable in the world. As Alexis Madrigal of the Atlantic points out, today your profile can be bought for half a cent or less, but the value of a user to the Internet advertising industry is more like $1,200 per year. Google's sliver of your data is worth about $20, Facebook's $5, and so on. Add to that all the slivers that no one has yet, plus the fact that the whole is more than the sum of its parts (a model of you based on all your data is much better than a thousand models based on a thousand slivers), and we're looking at easily over a trillion dollars per year for an economy the size of the United States. It doesn't take a large cut of that to make a Fortune 500 company. If you decide to take up the challenge and wind up becoming a billionaire, remember where you first got the idea.

In any case, banning robot warfare may not be viable.
Far from banning drones (the precursors of tomorrow's warbots), countries large and small are busy developing them, presumably because in their estimation the benefits outweigh the risks. As with any weapon, it's safer to have robots than to trust the other side not to. If in future wars millions of kamikaze drones will destroy conventional armies in minutes, they'd better be our drones. If World War III will be over in seconds, as one side takes control of the other's systems, we'd better have the smarter, faster, more resilient network. (Off-grid systems are not the answer: systems that aren't networked can't be hacked, but they can't compete with networked systems, either.) And, on balance, a robot arms race may be a good thing, if it hastens the day when the Fifth Geneva Convention bans humans in combat. War will always be with us, but the casualties of war need not be.