Otherwise, if the center square is free, play there.

The second thing is that machine learning is a sword with which to slay the complexity monster. Given enough data, a learning program that's only a few hundred lines long can easily generate a program with millions of lines, and it can do this again and again for different problems. The reduction in complexity for the programmer is phenomenal. Of course, like the Hydra, the complexity monster sprouts new heads as soon as we cut off the old ones, but they start off smaller and take a while to grow, so we still get a big leg up.

The Master Algorithm is for induction, the process of learning, what the Turing machine is for deduction. It can learn to simulate any other algorithm by reading examples of its input-output behavior. Just as there are many models of computation equivalent to a Turing machine, there are probably many different equivalent formulations of a universal learner. The point, however, is to find the first such formulation, just as Turing found the first formulation of the general-purpose computer.

Even test-set accuracy is not foolproof. According to legend, in an early military application a simple learner detected tanks with 100 percent accuracy in both the training set and the test set, each consisting of one hundred images. Amazing, or suspicious? Turns out all the tank images were lighter than the nontank ones, and that's all the learner was picking up. These days we have larger data sets, but the quality of data collection isn't necessarily better, so caveat emptor.

Hard-nosed empirical evaluation played an important role in the growth of machine learning from a fledgling field into a mature one. Up to the late 1980s, researchers in each tribe mostly believed their own rhetoric, assumed their paradigm was fundamentally better, and communicated little with the other camps. Then symbolists like Ray Mooney and Jude Shavlik started to systematically compare the different algorithms on the same data sets, and, surprise, surprise, no clear winner emerged. Today the rivalry continues, but there is much more cross-pollination. Having a common experimental framework and a large repository of data sets maintained by the machine-learning group at the University of California, Irvine, did wonders for progress. And as we'll see, our best hope of creating a universal learner lies in synthesizing ideas from different paradigms.

But first he had to graduate. Prudently, he picked a more conservative topic for his dissertation (Boolean circuits with cycles), and in 1959 he earned the world's first PhD in computer science. His PhD advisor, Arthur Burks, nevertheless encouraged Holland's interest in evolutionary computation and was instrumental in getting him a faculty job at Michigan and shielding him from senior colleagues who didn't think that stuff was computer science. Burks himself was so open-minded because he had been a close collaborator of John von Neumann, who had proved the possibility of self-reproducing machines. Indeed, it had fallen to him to complete the work when von Neumann died of cancer in 1957. That von Neumann could prove that such machines are possible was quite remarkable, given the primitive state of genetics and computer science at the time. But his automaton just made exact copies of itself; evolving automata had to wait for Holland.

For the first few decades, the genetic algorithms community consisted mainly of John Holland, his students, and their students. Circa 1983, the biggest problem genetic algorithms had been able to solve was learning to control gas pipeline systems. But then, at around the same time neural networks were making their comeback, interest in evolutionary computation took off. The first international conference on genetic algorithms was held in Pittsburgh in 1985, and a Cambrian explosion of genetic algorithm variants was under way. Some of these tried to model evolution more closely (the basic genetic algorithm was only a very crude approximation, after all), and others radiated in very different directions, crossing over evolutionary ideas with computer science concepts that would have bemused Darwin.

On a lighter note, Microsoft's Xbox Live uses a Bayesian network to rate players and match players of similar skill. The outcome of a game is a probabilistic function of the opponents' skill levels, and using Bayes' theorem we can infer a player's skill from the outcomes of his games.
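Microsoft's production system (known as TrueSkill) represents each player's skill as a Gaussian and updates it by message passing in a factor graph. The sketch below is a deliberately minimal, assumed setup rather than the real thing: skill is one of five discrete levels, the chance of winning is a logistic function of the skill gap, and Bayes' theorem turns game outcomes into a posterior belief about skill.

```python
import math

# Minimal sketch (not Microsoft's actual TrueSkill): skill is one of five
# discrete levels, and the probability of beating an opponent is a logistic
# function of the skill difference.
SKILL_LEVELS = [1, 2, 3, 4, 5]

def p_win(my_skill, opp_skill):
    """Chance of winning given the skill gap."""
    return 1.0 / (1.0 + math.exp(opp_skill - my_skill))

def infer_skill(prior, games):
    """Bayes' theorem: posterior over skill given observed game outcomes.

    prior: dict mapping skill level to probability
    games: list of (opponent_skill, won) pairs
    """
    posterior = dict(prior)
    for opp_skill, won in games:
        for s in SKILL_LEVELS:
            likelihood = p_win(s, opp_skill) if won else 1.0 - p_win(s, opp_skill)
            posterior[s] *= likelihood
    total = sum(posterior.values())
    return {s: p / total for s, p in posterior.items()}

# Uniform prior; then we beat a level-3 player twice and lose to a level-5 player.
uniform = {s: 1.0 / len(SKILL_LEVELS) for s in SKILL_LEVELS}
print(infer_skill(uniform, [(3, True), (3, True), (5, False)]))
# Most of the probability mass shifts to skill levels 4 and 5.
```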
Superficially, an SVM looks a lot like weighted k-nearest-neighbor: the frontier between the positive and negative classes is defined by a set of examples and their weights, together with a similarity measure. A test example belongs to the positive class if, on average, it looks more like the positive examples than the negative ones. The average is weighted, and the SVM remembers only the key examples required to pin down the frontier. If you look back at the Posistan/Negaland example, once we throw away all the towns that aren't on the border, all that's left is a map of the border towns themselves.
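That decision rule is easy to write down. The sketch below assumes the support vectors, their labels, and their learned weights are already given (all data and parameter values here are made up for illustration), and uses a Gaussian kernel as the similarity measure.

```python
import math

def gaussian_kernel(x, z, gamma=1.0):
    """Similarity measure: near 1 for close points, near 0 for distant ones."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * dist_sq)

def svm_predict(x, support_vectors, labels, weights, bias=0.0):
    """Weighted vote of the support vectors: positive if, on (weighted)
    average, x looks more like the positive examples than the negative ones."""
    score = bias
    for sv, y, alpha in zip(support_vectors, labels, weights):
        score += alpha * y * gaussian_kernel(x, sv)
    return +1 if score >= 0 else -1

# Four "border towns," two on each side of the frontier (illustrative data).
svs     = [(0.0, 1.0), (1.0, 2.0), (3.0, 1.0), (4.0, 2.0)]
labels  = [+1, +1, -1, -1]
weights = [0.8, 0.6, 0.7, 0.7]  # learned multipliers, assumed given here

print(svm_predict((0.5, 1.5), svs, labels, weights))  # +1: near the positive towns
print(svm_predict((3.5, 1.5), svs, labels, weights))  # -1: near the negative towns
```

The only difference from plain weighted nearest-neighbor is which examples get remembered: every example with zero weight has been thrown away, leaving just the border towns.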
Suppose I give you the GPS coordinates of all the shops in Palo Alto, California, and you plot a few of them on a piece of paper.

If we endow Robby the robot with all the learning abilities we've seen so far in this book, he'll be pretty smart but still a bit autistic. He'll see the world as a bunch of separate objects, which he can identify, manipulate, and even make predictions about, but he won't understand that the world is a web of interconnections. Robby the doctor would be very good at diagnosing someone with the flu based on his symptoms but unable to suspect that the patient has swine flu because he has been in contact with someone infected with it. Before Google, search engines decided whether a web page was relevant to your query by looking at its content (what else?). Brin and Page's insight was that the strongest sign a page is relevant is that relevant pages link to it. Similarly, if you want to predict whether a teenager is at risk of starting to smoke, by far the best thing you can do is check whether her close friends smoke. An enzyme's shape is as inseparable from the shapes of the molecules it brings together as a lock is from its key. Predator and prey have deeply entwined properties, each evolved to defeat the other's properties. In all of these cases, the best way to understand an entity, whether it's a person, an animal, a web page, or a molecule, is to understand how it relates to other entities. This requires a new kind of learning that doesn't treat the data as a random sample of unrelated objects but as a glimpse into a complex network. Nodes in the network interact; what you do to one affects the others and comes back to affect you. Relational learners, as they're called, may not quite have social intelligence, but they're the next best thing. In traditional statistical learning, every man is an island, entire of itself. In relational learning, every man is a piece of the continent, a part of the main. Humans are relational learners, wired to connect, and if we want Robby to grow into a perceptive, socially adept robot, we need to wire him to connect, too.

CHAPTER NINE: The Pieces of the Puzzle Fall into Place

What's that statue up ahead? Aristotle, looking rather disapprovingly toward the tangled mess of the gradient descent quarter. You've come full circle. You have the unified optimizer you need for the Master Algorithm, but this is no time to congratulate yourself. Night has fallen, and you still have much to do. You enter the Citadel of Evaluation through the imposing but rather narrow Accuracy Gate. The inscription above it says "Abandon all hope of overfitting, ye who enter here." As you circle past the palaces of the five tribes' evaluators, you mentally snap the pieces into place. You use accuracy to evaluate yes-or-no predictions and squared error for continuous ones. Fitness is just the evolutionaries' name for the scoring function; you can make it anything you want, including accuracy and squared error. Posterior probability reduces to squared error if you ignore the prior probability and the errors follow a normal distribution. The margin, if you allow it to be violated for a price, becomes a softer version of accuracy: instead of paying no penalty for a correct prediction and a penalty of one for an incorrect prediction, the penalty is zero until you get inside the margin, at which point it starts to go up steadily. Whew! Combining the evaluators was a lot easier than combining the optimizers. But the Towers of Representation, looming above you, fill you with a sense of foreboding.
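In standard notation (assumed here, not from the text), where y is the true label, ŷ is the prediction, and f(x) is the margin score, those evaluators line up as loss functions:

```latex
% Loss functions corresponding to the evaluators above.
\begin{align*}
L_{0\text{-}1}(y,\hat{y}) &= \mathbf{1}[y \neq \hat{y}]
    && \text{accuracy is one minus the average of this}\\
L_{\mathrm{sq}}(y,\hat{y}) &= (y - \hat{y})^2
    && \text{squared error, for continuous predictions}\\
-\log P(h \mid \mathrm{data}) &= \tfrac{1}{2\sigma^2} \textstyle\sum_i (y_i - \hat{y}_i)^2 + \mathrm{const}
    && \text{flat prior, Gaussian errors}\\
L_{\mathrm{hinge}}(y, f(x)) &= \max\{0,\; 1 - y\,f(x)\}
    && \text{zero outside the margin, then climbs steadily}
\end{align*}
```

The third line is the one-step derivation behind the claim above: with a flat prior, maximizing the posterior under normally distributed errors is the same as minimizing squared error.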
You can download the learner I've just described from alchemy.cs.washington.edu. We christened it Alchemy to remind ourselves that, despite all its successes, machine learning is still in the alchemy stage of science. If you do download it, you'll see that it includes a lot more than the basic algorithm I've described, but also that it is still missing a few things I said the universal learner ought to have, like crossover. Nevertheless, let's use the name Alchemy to refer to our candidate universal learner for simplicity.

Alchemy also enables the five types of unsupervised learning we saw in the previous chapter. It does relational learning, obviously, and in fact that's where most of its applications to date have been. Alchemy uses logic to represent relations among entities and Markov networks to let them be uncertain. We can turn Alchemy into a reinforcement learner by wrapping delayed rewards around it and using it to learn the value of each state in the same way that traditional reinforcement learners use, say, a neural network. We can do chunking in Alchemy by adding a new operation that condenses chains of rules into single rules. (For example, If A then B and If B then C become If A then C; see the sketch at the end of this section.) An MLN with a single unobserved variable connected to all the observable ones does clustering. (An unobserved variable is a variable whose values we never see in the data; it's "hidden," so to speak, and can only be inferred.) MLNs with more than one unobserved variable do a kind of discrete dimensionality reduction by inferring the values of those (fewer) variables from the (more numerous) observable ones. Alchemy can also handle MLNs with continuous unobserved variables, which would be needed to do things like principal-component analysis and Isomap. So Alchemy can in principle do all the things we want Robby the robot to do, or at least all the things we've discussed in this book. Indeed, we've used Alchemy to let a robot learn a map of its environment, figuring out from its sensors where the walls and doors are, their angles and distances, and so on, which is the first step in building a competent housebot.
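Alchemy's real chunking operates on weighted first-order formulas, so the following is only a toy illustration under that simplification: a rule is just an (antecedent, consequent) pair, and any two rules that chain are condensed into a single rule.

```python
# Toy illustration of chunking: two rules that chain (If A then B,
# If B then C) are condensed into a single rule (If A then C).
def chunk(rules):
    """Repeatedly add the composition of every pair of chaining rules."""
    rules = set(rules)
    changed = True
    while changed:
        changed = False
        for a, b in list(rules):
            for b2, c in list(rules):
                if b == b2 and a != c and (a, c) not in rules:
                    rules.add((a, c))  # If A then B + If B then C => If A then C
                    changed = True
    return rules

print(chunk({("A", "B"), ("B", "C")}))
# {('A', 'B'), ('B', 'C'), ('A', 'C')}
```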