If programmers are minor gods, the complexity monster is the devil himself. Little by little, it's winning the war.

Cyberwar is an instance of asymmetric warfare, where one side can't match the other's conventional military power but can still inflict grievous damage. A handful of terrorists armed with little more than box cutters can knock down the Twin Towers and kill thousands of innocents. All the biggest threats to US security today are in the realm of asymmetric warfare, and there's an effective weapon against all of them: information. If the enemy can't hide, he can't survive. The good news is that we have plenty of information, and that's also the bad news.

Of course, we don't have to start from scratch in our hunt for the Master Algorithm. We have a few decades of machine learning research to draw on. Some of the smartest people on the planet have devoted their lives to inventing learning algorithms, and some would even claim that they already have a universal learner in hand. We will stand on the shoulders of these giants, but take such claims with a grain of salt. Which raises the question: how will we know when we've found the Master Algorithm? When the same learner, with only parameter changes and minimal input aside from the data, can understand video and text as well as humans, and make significant new discoveries in biology, sociology, and other sciences. Clearly, by this standard no learner has yet been demonstrated to be the Master Algorithm, even in the unlikely case one already exists.

Machine learning has an unavoidable element of gambling. In the first Dirty Harry movie, Clint Eastwood chases a bank robber, repeatedly firing at him. Finally, the robber is lying next to a loaded gun, unsure whether to spring for it. Did Harry fire six shots or only five?
Harry sympathizes (so to speak): "You've got to ask yourself one question: 'Do I feel lucky?' Well, do you, punk?" That's the question machine learners have to ask themselves every day when they go to work: Do I feel lucky today? Just like evolution, machine learning doesn't get it right every time; in fact, errors are the rule, not the exception. But it's OK, because we discard the misses and build on the hits, and the cumulative result is what matters. Once we acquire a new piece of knowledge, it becomes a basis for inducing yet more knowledge. The only question is where to begin.

Newton's principle is only the first step, however. We still need to figure out what is true of everything we've seen: how to extract the regularities from the raw data. The standard solution is to assume we know the form of the truth, and the learner's job is to flesh it out. For example, in the dating problem you could assume that your friend's answer is determined by a single factor, in which case learning just consists of checking each known factor (day of week, type of date, weather, and TV programming) to see if it correctly predicts her answer every time. The problem, of course, is that none of them do! You gambled and failed. So you relax your assumptions a bit. What if your friend's answer is determined by a conjunction of two factors? With four factors, each with two possible values, there are twenty-four possibilities to check (six pairs of factors to pick from, times two choices for each factor's value). Now we have an embarrassment of riches: four conjunctions of two factors correctly predict the outcome! What to do? If you're feeling lucky, you can just pick one of them and hope for the best. A more sensible option, though, is democracy: let them vote, and pick the winning prediction.

Einstein's general relativity was only widely accepted once Arthur Eddington empirically confirmed its prediction that the sun bends the light of distant stars.
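The two-factor search described above is easy to make concrete. Here is a minimal sketch in Python, with entirely made-up factor names and dating history (the original example's data is not given here): it enumerates all twenty-four candidate conjunctions, keeps the ones that predict every past answer correctly, and lets the survivors vote on a new instance.

```python
# Hypothetical sketch of the conjunction search: factor names and
# history below are invented for illustration.
from itertools import combinations

FACTORS = ["weekend", "dinner_date", "nice_weather", "good_tv"]

# Each entry: (the factors that day, whether she went out).
HISTORY = [
    ({"weekend": True,  "dinner_date": True,  "nice_weather": True,  "good_tv": False}, True),
    ({"weekend": True,  "dinner_date": False, "nice_weather": True,  "good_tv": True},  False),
    ({"weekend": False, "dinner_date": True,  "nice_weather": False, "good_tv": False}, False),
    ({"weekend": True,  "dinner_date": True,  "nice_weather": False, "good_tv": True},  True),
]

def consistent_pairs(history):
    """Return every two-factor conjunction that predicts the outcome on all past dates."""
    winners = []
    for f1, f2 in combinations(FACTORS, 2):      # six pairs of factors...
        for v1 in (True, False):                 # ...times two values for the first...
            for v2 in (True, False):             # ...times two for the second = 24 candidates
                if all((x[f1] == v1 and x[f2] == v2) == y for x, y in history):
                    winners.append((f1, v1, f2, v2))
    return winners

def vote(rules, instance):
    """Democracy: let the surviving conjunctions vote on a new instance."""
    yes = sum(instance[f1] == v1 and instance[f2] == v2 for f1, v1, f2, v2 in rules)
    return yes > len(rules) / 2
```

On this toy history only one conjunction survives (a weekend dinner date), rather than the four in the text; which and how many survive depends entirely on the data.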
But you don't need to wait around for new data to arrive to decide whether you can trust your learner. Rather, you take the data you have and randomly divide it into a training set, which you give to the learner, and a test set, which you hide from it and use to verify its accuracy. Accuracy on held-out data is the gold standard in machine learning. You can write a paper about a great new learning algorithm you've invented, but if your algorithm is not significantly more accurate than previous ones on held-out data, the paper is not publishable.

Another limitation of inverse deduction is that it's very computationally intensive, which makes it hard to scale to massive data sets. For these, the symbolist algorithm of choice is decision tree induction. Decision trees can be viewed as an answer to the question of what to do if rules of more than one concept match an instance. How do we then decide which concept the instance belongs to? If we see a partly occluded object with a flat surface and four legs, how do we decide whether it is a table or a chair? One option is to order the rules, for example by decreasing accuracy, and choose the first one that matches. Another is to let the rules vote. Decision trees instead ensure a priori that each instance will be matched by exactly one rule. This will be the case if each pair of rules differs in at least one attribute test, and such a rule set can be organized into a decision tree. For example, consider these rules:

A spin glass is still a very unrealistic model of the brain, though. For one, spin interactions are symmetric, and connections between neurons in the brain are not. Another big issue that Hopfield's model ignored is that real neurons are statistical: they don't deterministically turn on and off
as a function of their inputs; rather, as the weighted sum of inputs increases, the neuron becomes more likely to fire, but it's not certain that it will. In 1985, David Ackley, Geoff Hinton, and Terry Sejnowski replaced the deterministic neurons in Hopfield networks with probabilistic ones. A neural network now had a probability distribution over its states, with higher-energy states being exponentially less likely than lower-energy ones. In fact, the probability of finding the network in a particular state was given by the well-known Boltzmann distribution from thermodynamics, so they called their network a Boltzmann machine.

The key input to a genetic algorithm, as Holland's creation came to be known, is a fitness function. Given a candidate program and some purpose it is meant to fill, the fitness function assigns the program a numeric score reflecting how well it fits the purpose. In natural selection, it's questionable whether fitness can be interpreted this way: while the fitness of a wing for flight makes intuitive sense, evolution as a whole has no known purpose. Nevertheless, in machine learning having something like a fitness function is a no-brainer. If we need a program that can diagnose a patient, one that correctly diagnoses 60 percent of the patients in our database is better than one that only gets it right 55 percent of the time, and thus a possible fitness function is the fraction of correctly diagnosed cases.

Evolution searches for good structures, and neural learning fills them in: this combination is the easiest of the steps we'll take toward the Master Algorithm. This may come as a surprise to anyone familiar with the never-ending twists and turns of the nature versus nurture controversy, 2,500 years old and still going strong. Seeing life through the eyes of a computer clarifies a lot of things, however. "Nature" for a computer is the program it runs, and "nurture" is the data it gets.
The question of which one is more important is clearly absurd; there's no output without both program and data, and it's not like the output is, say, 60 percent caused by the program and 40 percent by the data. That's the kind of linear thinking that a familiarity with machine learning immunizes you against.

Again, you can probably surmise just by looking at this plot that the cities are on a bay, and if you draw a line running through them, you can locate each city using just one number: how far it is from San Francisco along that line. But PCA can't find this curve; instead, it draws a straight line running down the middle of the bay, where there are no cities at all. Far from elucidating the shape of the data, PCA obscures it.

Instead, imagine for a moment that we're going to develop the Bay Area from scratch. We've decided where each city will be located, and our budget allows us to build a single road connecting them. Naturally, we lay down a road that goes from San Francisco to San Bruno, from there to San Mateo, and so on all the way to Oakland. This road is a pretty good one-dimensional representation of the Bay Area and can be found by a simple algorithm: build a road between each pair of nearby cities. Of course, in general this will result in a network of roads, not a single road running by every city. But we can force the latter by building the single road that best approximates the network, in the sense that the distances between cities along this road are as close as possible to the distances along the network.

Our first step toward the Master Algorithm will be surprisingly simple. As it turns out, it's not hard to combine many different learners into one, using what is known as metalearning. Netflix, Watson, Kinect, and countless others use it, and it's one of the most powerful arrows in the machine learner's quiver. It's also a stepping-stone to the deeper unification that will follow.
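The simplest form of metalearning is just the democracy idea again, one level up: ask several already-trained learners for a prediction and take the majority answer. A minimal sketch, with stand-in threshold functions playing the role of trained models:

```python
# Majority-vote metalearning sketch. The "learners" below are toy
# stand-ins, not real trained models.
from collections import Counter

def majority_vote(learners, instance):
    """Return the prediction most of the learners agree on."""
    votes = Counter(learner(instance) for learner in learners)
    return votes.most_common(1)[0][0]

# Three toy classifiers that disagree only on borderline inputs.
learners = [
    lambda x: "big" if x > 10 else "small",
    lambda x: "big" if x > 5 else "small",
    lambda x: "big" if x > 20 else "small",
]
```

On an input of 12, two of the three stand-ins say "big," so the vote returns "big." More sophisticated metalearners learn how much weight to give each individual learner's vote instead of counting them all equally.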
The first step accomplished, you hurry on to the Bayesian district. Even from a distance, you can see how it clusters around the Cathedral of Bayes' Theorem. MCMC Alley zigzags randomly along the way. This is going to take a while. You take a shortcut onto Belief Propagation Street, but it seems to loop around forever. Then you see it: the Most Likely Avenue, rising majestically toward the Posterior Probability Gate. Rather than average over all models, you can head straight for the most probable one, confident that the resulting predictions will be almost the same. And you can let genetic search pick the model's structure and gradient descent its parameters. With a sigh of relief, you realize that's all the probabilistic inference you'll need, at least until it's time to answer questions using the model.

I'm grateful to everyone who read and commented on drafts of the book at various stages, including Mike Belfiore, Thomas Dietterich, Tiago Domingos, Oren Etzioni, Abe Friesen, Rob Gens, Alon Halevy, David Israel, Henry Kautz, Chloé Kiddon, Gary Marcus, Ray Mooney, Kevin Murphy, Franzi Roesner, and Ben Taskar. Thanks also to everyone who gave me pointers, information, or help of various kinds, including Tom Griffiths, David Heckerman, Hannah Hickey, Albert-László Barabási, Yann LeCun, Barbara Mones, Mike Morgan, Peter Norvig, Judea Pearl, Gregory Piatetsky-Shapiro, and Sebastian Seung.