Once the inevitable happens and learning algorithms become the middlemen, power becomes concentrated in them. Google's algorithms largely determine what information you find, Amazon's what products you buy, and Match.com's who you date. The last mile is still yours - choosing from among the options the algorithms present you with - but 99.9 percent of the selection was done by them. The success or failure of a company now depends on how much the learners like its products, and the success of a whole economy - whether everyone gets the best products for their needs at the best price - depends on how good the learners are.

If so few learners can do so much, the logical question is: Could one learner do everything? In other words, could a single algorithm learn all that can be learned from data? This is a very tall order, since it would ultimately include everything in an adult's brain, everything evolution has created, and the sum total of all scientific knowledge. But in fact all the major learners - including nearest-neighbor, decision trees, and Bayesian networks, a generalization of Naïve Bayes - are universal in the following sense: if you give the learner enough of the appropriate data, it can approximate any function arbitrarily closely - which is math-speak for learning anything. The catch is that "enough data" could be infinite. Learning from finite data requires making assumptions, as we'll see, and different learners make different assumptions, which makes them good for some things but not others.

Of course, we don't have to start from scratch in our hunt for the Master Algorithm. We have a few decades of machine learning research to draw on. Some of the smartest people on the planet have devoted their lives to inventing learning algorithms, and some would even claim that they already have a universal learner in hand. We will stand on the shoulders of these giants, but take such claims with a grain of salt.
Which raises the question: how will we know when we've found the Master Algorithm? When the same learner, with only parameter changes and minimal input aside from the data, can understand video and text as well as humans, and make significant new discoveries in biology, sociology, and other sciences. Clearly, by this standard no learner has yet been demonstrated to be the Master Algorithm, even in the unlikely case one already exists.

Fortunately, something happens in learning that kills off one of the exponentials, leaving only an "ordinary" singly exponential intractable problem. Suppose you have a bag full of concept definitions, each written on a piece of paper, and you take out a random one and see how well it matches the data. A bad definition is no more likely to get, say, all thousand examples in your data right than a coin is likely to come up heads a thousand times in a row. "A chair has four legs and is red or has a seat but no legs" will probably match some but not all chairs you've seen, and also match some but not all other things. So if a random definition correctly matches a thousand examples, then it's extremely unlikely to be the wrong definition, or at least it's pretty close to the real one. And if the definition agrees with a million examples, then it's practically certain to be the right one. How else would it get all those examples right?

Of course, it's not enough to be able to tell when we're overfitting; we need to avoid it in the first place. That means stopping short of perfectly fitting the data even if we're able to. One method is to use statistical significance tests to make sure the patterns we're seeing are really there.
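The arithmetic behind the coin analogy is easy to check. Here is a quick sketch, assuming a bad definition is right on each example with probability one-half, like a fair coin:

```python
import math

# Chance that a fair coin comes up heads n times in a row -- the same as
# the chance that a definition with no real connection to the concept
# happens to match all n examples. We work in log10 to avoid underflow.
def log10_p_all_matches(n: int) -> float:
    return n * math.log10(0.5)

print(round(log10_p_all_matches(1_000)))      # -301: about one chance in 10^301
print(round(log10_p_all_matches(1_000_000)))  # -301030: beyond astronomical
```

So a random definition that survives a thousand examples is almost certainly close to the mark, and one that survives a million all but certainly is.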
For example, a rule covering three hundred positive examples versus one hundred negatives and a rule covering three positives versus one negative are both 75 percent accurate on the training data, but the first rule is almost certainly better than coin flipping, while the second isn't, since four flips of an unbiased coin could easily result in three heads. When constructing a rule, if at some point we can't find any conditions that significantly improve its accuracy, then we just stop, even if it still covers some negative examples. This reduces the rule's training-set accuracy, but probably makes it a more accurate generalization, which is what we really care about.

Geoff Hinton went on to try many variations on Boltzmann machines over the following decades. Hinton, a psychologist turned computer scientist and great-great-grandson of George Boole, the inventor of the logical calculus used in all digital computers, is the world's leading connectionist. He has tried longer and harder to understand how the brain works than anyone else. He tells of coming home from work one day in a state of great excitement, exclaiming "I did it! I've figured out how the brain works!" His daughter replied, "Oh, Dad, not again!" Hinton's latest passion is deep learning, which we'll meet later in this chapter. He was also involved in the development of backpropagation, an even better algorithm than Boltzmann machines for solving the credit-assignment problem that we'll look at next. Boltzmann machines could solve the credit-assignment problem in principle, but in practice learning was very slow and painful, making this approach impractical for most applications. The next breakthrough involved getting rid of another oversimplification that dated all the way back to McCulloch and Pitts.

From Eugene Onegin to Siri
Looking around for applications, Vapnik and his coworkers soon alighted on handwritten digit recognition, which their connectionist colleagues at Bell Labs were the world experts on. To everyone's surprise, SVMs did as well out of the box as multilayer perceptrons that had been carefully crafted for digit recognition over the years. This set the stage for a long-running, wide-ranging competition between the two. SVMs can be seen as a generalization of the perceptron, because a hyperplane boundary between classes is what you get when you use a particular similarity measure (the dot product between vectors). But SVMs have a major advantage compared to multilayer perceptrons: the weights have a single optimum instead of many local ones, and so learning them reliably is much easier. Despite this, SVMs are no less expressive than multilayer perceptrons; the support vectors effectively act as a hidden layer and their weighted average as the output layer. For example, an SVM can easily represent the exclusive-OR function by having one support vector for each of the four possible configurations. But the connectionists didn't give up without a fight. In 1995, Larry Jackel, the head of Vapnik's department at Bell Labs, bet him a fancy dinner that by 2000 neural networks would be as well understood as SVMs. He lost. But in return, Vapnik bet that by 2005 no one would use neural networks anymore, and he also lost. (The only one to get a free dinner was Yann LeCun, their witness.) Moreover, with the advent of deep learning, connectionists have regained the upper hand. Provided you can learn them, networks with many layers can express many functions more compactly than SVMs, which always have just one layer, and this can make all the difference.

Analogizers' neatest trick, however, is learning across problem domains.
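The exclusive-OR claim is easy to verify by hand. The sketch below is not a trained SVM; it just hard-codes one support vector per input configuration with equal weights and a Gaussian (RBF) similarity, and shows that the sign of the weighted sum reproduces XOR:

```python
import math

# One support vector per XOR configuration, labeled +1 or -1.
support_vectors = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]

def rbf(a, b, gamma=1.0):
    # Gaussian similarity: 1 when a == b, decaying with squared distance.
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * d2)

def predict(x):
    # Sign of the label-weighted sum of similarities to the support vectors.
    score = sum(label * rbf(x, sv) for sv, label in support_vectors)
    return +1 if score > 0 else -1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, predict(x))  # reproduces XOR: -1, +1, +1, -1
```

Each point is most similar to its own support vector, so the nearby opposite-labeled vectors never outvote it; the hyperplane in the kernel's feature space separates the two diagonals that no line in the original plane can.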
Humans do it all the time: an executive can move from, say, a media company to a consumer-products one without starting from scratch, because many of the same management skills still apply. Wall Street hires lots of physicists because physical and financial problems, although superficially very different, often have a similar mathematical structure. Yet all the learners we've seen so far would fall flat if we, say, trained them to predict Brownian motion and then asked them to predict the stock market. Stock prices and the velocities of particles suspended in a fluid are just different variables, so the learner wouldn't even know where to start. But analogizers can do this using structure mapping, an algorithm invented by Dedre Gentner, a psychologist at Northwestern University. Structure mapping takes two descriptions, finds a coherent correspondence between some of their parts and relations, and then, based on that correspondence, transfers further properties from one structure to the other. For example, if the structures are the solar system and the atom, we can map planets to electrons and the sun to the nucleus and conclude, as Bohr did, that electrons
revolve around the nucleus. The truth is more subtle, of course, and we often need to refine analogies after we make them. But being able to learn from a single example like this is surely a key attribute of a universal learner. When we're confronted with a new type of cancer - and that happens all the time, because cancers keep mutating - the models we've learned for previous ones don't apply. Neither do we have time to gather data on the new cancer from a lot of patients; there may be only one, and she urgently needs a cure. Our best hope is then to compare the new cancer with known ones and try to find one whose behavior is similar enough that some of the same lines of attack will work.

If we endow Robby the robot with all the learning abilities we've seen so far in this book, he'll be pretty smart but still a bit autistic. He'll see the world as a bunch of separate objects, which he can identify, manipulate, and even make predictions about, but he won't understand that the world is a web of interconnections. Robby the doctor would be very good at diagnosing someone with the flu based on his symptoms, but unable to suspect that the patient has swine flu because he has been in contact with someone infected with it. Before Google, search engines decided whether a web page was relevant to your query by looking at its content - what else? Brin and Page's insight was that the strongest sign a page is relevant is that relevant pages link to it. Similarly, if you want to predict whether a teenager is at risk of starting to smoke, by far the best thing you can do is check whether her close friends smoke. An enzyme's shape is as inseparable from the shapes of the molecules it brings together as a lock is from its key. Predator and prey have deeply entwined properties, each evolved to defeat the other's properties.
In all of these cases, the best way to understand an entity - whether it's a person, an animal, a web page, or a molecule - is to understand how it relates to other entities. This requires a new kind of learning that doesn't treat the data as a random sample of unrelated objects but as a glimpse into a complex network. Nodes in the network interact; what you do to one affects the others and comes back to affect you. Relational learners, as they're called, may not quite have social intelligence, but they're the next best thing. In traditional statistical learning, every man is an island, entire of itself. In relational learning, every man is a piece of the continent, a part of the main. Humans are relational learners, wired to connect, and if we want Robby to grow into a perceptive, socially adept robot, we need to wire him to connect, too.

Here's a challenge: you have fifteen minutes to combine decision trees, multilayer perceptrons, classifier systems, Naïve Bayes, and SVMs into a single algorithm possessing the best properties of each. Quick - what can you do? Clearly, it can't involve the details of the individual algorithms; there's no time for that. But how about the following? Think of each learner as an expert on a committee. Each looks carefully at the instance to be classified - what is the diagnosis for this patient? - and confidently makes its prediction. You're not an expert yourself, but you're the chair of the committee, and your job is to combine their recommendations into a final decision. What you have on your hands is in fact a new classification problem, where instead of the patient's symptoms, the input is the experts' opinions. But you can apply machine learning to this problem in the same way the experts applied it to the original one. We call this metalearning, because it's learning about the learners. The metalearner can itself be any learner, from a decision tree to a simple weighted vote.
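A weighted vote, the simplest metalearner, fits in a few lines. In this toy sketch the experts' held-out predictions are made up by hand rather than produced by real learners:

```python
# Held-out predictions of three hypothetical experts on five examples,
# plus the true classes (toy data, not a real dataset).
expert_predictions = [
    [1, 1, 0, 1, 0],  # expert A: always right
    [1, 0, 0, 0, 0],  # expert B: right 3 times out of 5
    [0, 1, 1, 1, 1],  # expert C: right 2 times out of 5
]
true_classes = [1, 1, 0, 1, 0]

def accuracy(preds, truth):
    # Fraction of examples an expert got right on the held-out data.
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

# Weight each expert by its held-out accuracy.
weights = [accuracy(p, true_classes) for p in expert_predictions]

def weighted_vote(opinions):
    # The metalearner: class 1 wins if the weighted evidence is positive.
    score = sum(w if o == 1 else -w for w, o in zip(weights, opinions))
    return 1 if score > 0 else 0

# Combine the experts on a new instance where they disagree.
print(weighted_vote([1, 0, 1]))  # 1: the most accurate expert prevails
```

Each expert's weight is its held-out accuracy, so a reliable expert can outvote two unreliable ones when they disagree.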
To learn the weights, or the decision tree, we replace the attributes of each original example by the learners' predictions. Learners that often predict the correct class will get high weights, and inaccurate ones will tend to be ignored. With a decision tree, the choice of whether to use a learner can be contingent on other learners' predictions. Either way, to obtain a learner's prediction for a given training example, we must first apply it to the original training set excluding that example and use the resulting classifier - otherwise the committee risks being dominated by learners that overfit, since they can predict the correct class just by remembering it. The Netflix Prize winner used metalearning to combine hundreds of different learners. Watson uses it to choose its final answer from the available candidates. Nate Silver combines polls in a similar way to predict election results.

If you don't like the idea of a profit-making entity holding the keys to your kingdom, you can join a data union instead. (If there isn't one in your neck of the cyberwoods yet, consider starting it.) The twentieth century needed labor unions to balance the power of workers and bosses. The twenty-first needs data unions for a similar reason. Corporations have a vastly greater ability to gather and use data than individuals. This leads to an asymmetry in power, and the more valuable the data - the better and more useful the models that can be learned from it - the greater the asymmetry. A data union lets its members bargain on equal terms with companies about the use of their data. Perhaps labor unions can get the ball rolling, and shore up their membership, by starting data unions for their members. But labor unions are organized by occupation and location; data unions can be more flexible. Join up with people you have a lot in common with; the models learned will be more useful to you that way.
Notice that being in a data union does not mean letting other members see your data; it just means letting everyone use the models learned from the pooled data. Data unions can also be your vehicle for telling politicians what you want. Your data can influence the world as much as your vote - or more - because you only go to the polls on election day. On all other days, your data is your vote. Stand up and be counted!

The main problem with this scenario, as you may have already guessed, is that letting robots learn ethics by observing humans may not be such a good idea. The robot is liable to get seriously confused when it sees that humans' actions often violate their ethical principles. We can clean up the training data by including only the examples where, say, a panel of ethicists agrees that the soldier made the right decision, and the panelists can also inspect and tweak the model post-learning to their satisfaction. Agreement may be hard to reach, however, particularly if the panel includes all the different kinds of people it should. Teaching ethics to robots, with their logical minds and lack of baggage, will force us to examine our assumptions and sort out our contradictions. In this, as in many other areas, the greatest benefit of machine learning may ultimately be not what the machines learn but what we learn by teaching them.

Google + Master Algorithm = Skynet?

This does not mean that there is nothing to worry about, however. The first big worry, as with any technology, is that AI could fall into the wrong hands. If a criminal or prankster programs an AI to take over the world, we'd better have an AI police capable of catching it and erasing it before it gets too far. The best insurance policy against vast AIs gone amok is vaster AIs keeping the peace.