The power of machine learning is perhaps best explained by a low-tech analogy: farming. In an industrial society, goods are made in factories, which means that engineers have to figure out exactly how to assemble them from their parts, how to make those parts, and so on, all the way to raw materials. It’s a lot of work. Computers are the most complex goods ever invented, and designing them, the factories that make them, and the programs that run on them is a ton of work. But there’s another, much older way in which we can get some of the things we need: by letting nature make them. In farming, we plant the seeds, make sure they have enough water and nutrients, and reap the grown crops. Why can’t technology be more like this? It can, and that’s the promise of machine learning. Learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine-learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.

Cyberwar is an instance of asymmetric warfare, where one side can’t match the other’s conventional military power but can still inflict grievous damage. A handful of terrorists armed with little more than box cutters can knock down the Twin Towers and kill thousands of innocents. All the biggest threats to US security today are in the realm of asymmetric warfare, and there’s an effective weapon against all of them: information. If the enemy can’t hide, he can’t survive. The good news is that we have plenty of information, and that’s also the bad news.

Even more astonishing than the breadth of applications of machine learning is that it’s the same algorithms doing all of these different things. Outside of machine learning, if you have two different problems to solve, you need to write two different programs.
They might use some of the same infrastructure, like the same programming language or the same database system, but a program to, say, play chess is of no use if you want to process credit-card applications. In machine learning, the same algorithm can do both, provided you give it the appropriate data to learn from. In fact, just a few algorithms are responsible for the great majority of machine-learning applications, and we’ll take a look at them in the next few chapters.

Bottom line: learning is a race between the amount of data you have and the number of hypotheses you consider. More data exponentially reduces the number of hypotheses that survive, but if you start with a lot of them, you may still have some bad ones left at the end. As a rule of thumb, if the learner only considers an exponential number of hypotheses (for example, all possible conjunctive concepts), then the data’s exponential payoff cancels it and you’re OK, provided you have plenty of examples and not too many attributes. On the other hand, if it considers a doubly exponential number (for example, all possible rule sets), then the data cancels only one of the exponentials and you’re still in trouble. You can even figure out in advance how many examples you’ll need to be pretty sure that the learner’s chosen hypothesis is very close to the true one, provided it fits all the data; in other words, for the hypothesis to be probably approximately correct. Harvard’s Leslie Valiant received the Turing Award, the Nobel Prize of computer science, for inventing this type of analysis, which he describes in his book entitled, appropriately enough, Probably Approximately Correct.

For each pair of facts, we construct the rule that allows us to infer the second fact from the first one and generalize it by Newton’s principle. When the same general rule is induced over and over again, we can have some confidence that it’s true.

We can also induce rules purely from other rules.
If we know that all philosophers are human and mortal, we can induce that all humans are mortal. (We don’t induce that all mortals are human, because we know other mortal creatures, like cats and dogs. On the other hand, scientists, artists, and so on are also human and mortal, reinforcing the rule.) In general, the more rules and facts we start out with, the more opportunities we have to induce new rules using “inverse deduction.” And the more rules we induce, the more rules we can induce. It’s a virtuous circle of knowledge creation, limited only by overfitting risk and computational cost. But here, too, having initial knowledge helps: if instead of one large hole we have many small ones to fill, our induction steps will be less risky and therefore less likely to overfit. (For example, given the same number of examples, inducing that all philosophers are human is less risky than inducing that all humans are mortal.)

In the meantime, one important application of inverse deduction is predicting whether new drugs will have harmful side effects. Failure during animal testing and clinical trials is the main reason new drugs take many years and billions of dollars to develop. By generalizing from known toxic molecular structures, we can form rules that quickly weed out many apparently promising compounds, greatly increasing the chances of successful trials on the remaining ones.

In an early demonstration of the power of backprop, Terry Sejnowski and Charles Rosenberg trained a multilayer perceptron to read aloud. Their NETtalk system scanned the text, selected the correct phonemes according to context, and fed them to a speech synthesizer. NETtalk not only generalized accurately to new words, which knowledge-based systems could not, but it learned to speak in a remarkably human-like way.
Sejnowski used to mesmerize audiences at research meetings by playing a tape of NETtalk’s progress: babbling at first, then starting to make sense, then speaking smoothly with only the occasional error. (You can find samples on YouTube by typing “sejnowski nettalk.”)

The path to optimal learning begins with a formula that many people have heard of: Bayes’ theorem. But here we’ll see it in a whole new light and realize that it’s vastly more powerful than you’d guess from its everyday uses. At heart, Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down. For example, if you test positive for AIDS, your probability of having it goes up. Things get more interesting when you have many pieces of evidence, such as the results of multiple tests. To combine them all without suffering a combinatorial explosion, we need to make simplifying assumptions. Things get even more interesting when we consider many hypotheses at once, such as all the different possible diagnoses for a patient. Computing the probability of each disease from the patient’s symptoms in a reasonable amount of time can take a lot of smarts. Once we know how to do all these things, we’ll be ready to learn the Bayesian way. For Bayesians, learning is “just” another application of Bayes’ theorem, with whole models as the hypotheses and the data as the evidence: as you see more data, some models become more likely and some less, until ideally one model stands out as the clear winner. Bayesians have invented fiendishly clever kinds of models. So let’s get started.

There’s a big snag in all of this, unfortunately. Just because a Bayesian network lets us compactly represent a probability distribution doesn’t mean we can also reason efficiently with it.
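To make the difficulty concrete, here is a brute-force sketch in Python. The network structure and all the probabilities below are invented for illustration (the text gives no numbers): a burglary may set off an alarm, and Bob and Claire each may call when it sounds. Answering a query means summing the joint distribution over every hidden variable, which is exactly the computation that becomes intractable when there are many of them.

```python
from itertools import product

# A tiny alarm-style network with invented probabilities: a burglary may set
# off the alarm, and Bob and Claire each may call when the alarm sounds.
P_burglary = 0.001                    # P(Burglary)
P_alarm = {True: 0.95, False: 0.01}   # P(Alarm | Burglary)
P_bob = {True: 0.90, False: 0.05}     # P(Bob calls | Alarm)
P_claire = {True: 0.80, False: 0.02}  # P(Claire calls | Alarm)

def joint(burglary, alarm, bob, claire):
    """P(Burglary, Alarm, Bob, Claire), read off the network's local tables."""
    p = P_burglary if burglary else 1 - P_burglary
    p *= P_alarm[burglary] if alarm else 1 - P_alarm[burglary]
    p *= P_bob[alarm] if bob else 1 - P_bob[alarm]
    p *= P_claire[alarm] if claire else 1 - P_claire[alarm]
    return p

def prob_burglary(bob, claire):
    """P(Burglary | evidence) = P(Burglary, evidence) / P(evidence),
    summing the hidden Alarm variable out of the joint."""
    num = sum(joint(True, a, bob, claire) for a in (True, False))
    den = sum(joint(b, a, bob, claire)
              for b, a in product((True, False), repeat=2))
    return num / den

print(prob_burglary(bob=True, claire=False))
```

With these made-up numbers the posterior works out to roughly 0.003, a few times the 0.001 prior: Bob's call raises the probability of a burglary more than Claire's silence lowers it. Summing out one hidden variable is trivial; with dozens of them, the number of terms grows exponentially.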
Suppose you want to compute P(Burglary | Bob called, Claire didn’t). By Bayes’ theorem, you know this is just P(Burglary) P(Bob called, Claire didn’t | Burglary) / P(Bob called, Claire didn’t), or equivalently, P(Burglary, Bob called, Claire didn’t) / P(Bob called, Claire didn’t). If you had the full table with the probabilities of all states, you could obtain both of these probabilities by adding up the corresponding lines in the table. For example, P(Bob called, Claire didn’t) is the sum of the probabilities of all the lines where Bob calls and Claire doesn’t. But the Bayesian network doesn’t give you the full table. You could always construct it from the individual tables, but that takes exponential time and space. What we really want is to compute P(Burglary | Bob called, Claire didn’t) without building the full table. That, in a nutshell, is the problem of inference in Bayesian networks.

Practical successes aside, SVMs also turned a lot of machine-learning conventional wisdom on its head. For example, they gave the lie to the notion, sometimes misidentified with Occam’s razor, that simpler models are more accurate. On the contrary, an SVM can have an infinite number of parameters and still not overfit, provided it has a large enough margin.

Here’s a challenge: you have fifteen minutes to combine decision trees, multilayer perceptrons, classifier systems, Naïve Bayes, and SVMs into a single algorithm possessing the best properties of each. Quick: what can you do? Clearly, it can’t involve the details of the individual algorithms; there’s no time for that. But how about the following? Think of each learner as an expert on a committee. Each looks carefully at the instance to be classified (what is the diagnosis for this patient?) and confidently makes its prediction. You’re not an expert yourself, but you’re the chair of the committee, and your job is to combine their recommendations into a final decision.
What you have on your hands is in fact a new classification problem, where instead of the patient’s symptoms, the input is the experts’ opinions. But you can apply machine learning to this problem in the same way the experts applied it to the original one. We call this metalearning because it’s learning about the learners. The metalearner can itself be any learner, from a decision tree to a simple weighted vote. To learn the weights, or the decision tree, we replace the attributes of each original example by the learners’ predictions. Learners that often predict the correct class will get high weights, and inaccurate ones will tend to be ignored. With a decision tree, the choice of whether to use a learner can be contingent on other learners’ predictions. Either way, to obtain a learner’s prediction for a given training example, we must first apply it to the original training set excluding that example and use the resulting classifier; otherwise the committee risks being dominated by learners that overfit, since they can predict the correct class just by remembering it. The Netflix Prize winner used metalearning to combine hundreds of different learners. Watson uses it to choose its final answer from the available candidates. Nate Silver combines polls in a similar way to predict election results.

Sasha Issenberg’s The Victory Lab (Broadway Books, 2012) dissects the use of data analysis in politics. “How President Obama’s campaign used big data to rally individual voters,” by the same author (MIT Technology Review, 2013), tells the story of its greatest success to date. Nate Silver’s The Signal and the Noise (Penguin Press, 2012) has a chapter on his poll aggregation method.

“The unreasonable effectiveness of data,” by Alon Halevy, Peter Norvig, and Fernando Pereira (IEEE Intelligent Systems, 2009), argues for machine learning as the new discovery paradigm.
Benoît Mandelbrot explores the fractal geometry of nature in the eponymous book* (Freeman, 1982). James Gleick’s Chaos (Viking, 1987) discusses and depicts the Mandelbrot set. The Langlands program, a research effort that seeks to unify different subfields of mathematics, is described in Love and Math, by Edward Frenkel (Basic Books, 2014). The Golden Ticket, by Lance Fortnow (Princeton University Press, 2013), is an introduction to NP-completeness and the P = NP problem. The Annotated Turing,* by Charles Petzold (Wiley, 2008), explains Turing machines by revisiting Turing’s original paper on them.
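Returning to the metalearning idea from earlier: the scheme in which a committee of learners is combined by a metalearner trained on their held-out predictions can be sketched in a few lines of Python. Everything below is invented for illustration. The base “learners” are toy one-feature threshold rules, the dataset is made up, and the metalearner is a simple weighted vote based on leave-one-out accuracy, the simplest version of the idea described in the text.

```python
# A minimal stacking (metalearning) sketch using only the standard library.
# Everything is invented for illustration: the base "learners" are one-feature
# threshold rules, the data is made up, and the metalearner is a weighted vote
# whose weights are each learner's leave-one-out accuracy.

# Toy dataset: (features, class).
data = [((0.2, 0.9), 0), ((0.8, 0.1), 1), ((0.7, 0.8), 1),
        ((0.1, 0.2), 0), ((0.9, 0.9), 1), ((0.3, 0.4), 0)]

def train_rule(train, feature):
    """'Train' a base learner: pick the threshold on one feature that
    classifies the training set best."""
    best = max((t / 10 for t in range(11)),
               key=lambda t: sum((x[feature] > t) == y for x, y in train))
    return lambda x, t=best, f=feature: int(x[f] > t)

def loo_predictions(feature):
    """Predict each example with a learner trained WITHOUT that example,
    so the metalearner sees honest predictions rather than memorized ones."""
    return [train_rule(data[:i] + data[i + 1:], feature)(x)
            for i, (x, _) in enumerate(data)]

# Metalearner: weight each base learner by its held-out accuracy.
features = [0, 1]
weights = [sum(p == y for p, (_, y) in zip(loo_predictions(f), data)) / len(data)
           for f in features]

def ensemble(x):
    """Weighted vote of the base learners, each retrained on all the data."""
    votes = [train_rule(data, f)(x) for f in features]
    score = sum(w * (1 if v else -1) for w, v in zip(weights, votes))
    return int(score > 0)

print(ensemble((0.85, 0.2)))
```

A real stacked ensemble would combine different algorithms (trees, multilayer perceptrons, SVMs, and so on) rather than two threshold rules, and the metalearner could itself be a decision tree, but the structure is the same: the base learners’ held-out predictions become the metalearner’s inputs.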