We can think of machine learning as the inverse of programming, in the same way that the square root is the inverse of the square, or integration is the inverse of differentiation. Just as we can ask "What number squared gives 16?" or "What is the function whose derivative is x + 1?" we can ask, "What is the algorithm that produces this output?" We will soon see how to turn this insight into concrete learning algorithms.

The National Security Agency (NSA) has become infamous for its bottomless appetite for data: by one estimate, every day it intercepts over a billion phone calls and other communications around the globe. Privacy issues aside, however, it doesn't have millions of staffers to eavesdrop on all these calls and e-mails or even just keep track of who's talking to whom. The vast majority of calls are perfectly innocent, and writing a program to pick out the few suspicious ones is very hard. In the old days, the NSA used keyword matching, but that's easy to get around. (Just call the bombing a "wedding" and the bomb the "wedding cake.") In the twenty-first century, it's a job for machine learning. Secrecy is the NSA's trademark, but its director has testified to Congress that mining of phone logs has already halted dozens of terrorism threats.

The computations taking place within the brain's architecture are also similar throughout. All information in the brain is represented in the same way, via the electrical firing patterns of neurons. The learning mechanism is also the same: memories are formed by strengthening the connections between neurons that fire together, using a biochemical process known as long-term potentiation. All this is not just true of humans: different animals have similar brains. Ours is unusually large, but it seems to be built along the same principles as other animals'.
In a famous 1959 essay, the physicist and Nobel laureate Eugene Wigner marveled at what he called "the unreasonable effectiveness of mathematics in the natural sciences." By what miracle do laws induced from scant observations turn out to apply far beyond them? How can the laws be many orders of magnitude more precise than the data they are based on? Most of all, why is it that the simple, abstract language of mathematics can accurately capture so much of our infinitely complex world? Wigner considered this a deep mystery, in equal parts fortunate and unfathomable. Nevertheless, it is so, and the Master Algorithm is a logical extension of it.

Unlike the theories of a given field, which only have power within that field, the Master Algorithm has power across all fields. Within field X, it has less power than field X's prevailing theory, but across all fields, when we consider the whole world, it has vastly more power than any other theory. The Master Algorithm is the germ of every theory; all we need to add to it to obtain theory X is the minimum amount of data required to induce it. (In the case of physics, that would be just the results of perhaps a few hundred key experiments.) The upshot is that, pound for pound, the Master Algorithm may well be the best starting point for a theory of everything we'll ever have. Pace Stephen Hawking, it may ultimately tell us more about the mind of God than string theory.

Having a branch for each value of an attribute is fine if the attribute is discrete, but what about numeric attributes? If we had a branch for every value of a continuous variable, the tree would be infinitely wide. A simple solution is to pick a few key thresholds by entropy and use those. For example, is the patient's temperature above or below 100 degrees Fahrenheit? That, combined with other symptoms, may be all the doctor needs to know about the patient's temperature to decide if he has an infection.
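The threshold-picking step above can be sketched in a few lines of Python. This is only an illustrative sketch, not any particular learner's implementation: it tries each candidate split of a numeric attribute and keeps the one that minimizes the weighted entropy of the two resulting branches (equivalently, maximizes information gain). The patient temperatures and diagnoses are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result

def best_threshold(values, labels):
    """Pick the split on a numeric attribute that minimizes the
    weighted entropy of the two branches. Candidate thresholds are
    the midpoints between consecutive distinct sorted values."""
    pairs = sorted(zip(values, labels))
    best_t, best_entropy = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for v, label in pairs if v < t]
        right = [label for v, label in pairs if v >= t]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if w < best_entropy:
            best_t, best_entropy = t, w
    return best_t

# Hypothetical patient temperatures (Fahrenheit) and diagnoses:
temps = [98.2, 98.6, 99.1, 100.4, 101.2, 102.0]
flu   = ["no", "no", "no", "yes", "yes", "yes"]
print(best_threshold(temps, flu))  # splits between 99.1 and 100.4
```

A real decision-tree learner would repeat this at every node and for every attribute, but the core calculation, weighing how much a split purifies the branches, is exactly this.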
In an early demonstration of the power of backprop, Terry Sejnowski and Charles Rosenberg trained a multilayer perceptron to read aloud. Their NETtalk system scanned the text, selected the correct phonemes according to context, and fed them to a speech synthesizer. NETtalk not only generalized accurately to new words, which knowledge-based systems could not, but it learned to speak in a remarkably human-like way. Sejnowski used to mesmerize audiences at research meetings by playing a tape of NETtalk's progress: babbling at first, then starting to make sense, then speaking smoothly with only the occasional error. (You can find samples on YouTube by typing "sejnowski nettalk.")

As you stare uncomprehendingly at it, your Google Glass helpfully flashes: "Bayes' theorem." Now the crowd starts to chant "More data! More data!" A stream of sacrificial victims is being inexorably pushed toward the altar. Suddenly, you realize that you're in the middle of it. Too late. As the crank looms over you, you scream, "No! I don't want to be a data point! Let me gooooo!"

PageRank, the algorithm that gave rise to Google, is itself a Markov chain. Larry Page's idea was that web pages with many incoming links are probably more important than pages with few, and links from important pages should themselves count for more. This sets up an infinite regress, but we can handle it with a Markov chain. Imagine a web surfer going from page to page by randomly following links: the states of this Markov chain are web pages instead of characters, making it a vastly larger problem, but the math is the same. A page's score is then the fraction of the time the surfer spends on it, or, equivalently, his probability of landing on the page after wandering around for a long time.

In reality, we usually let SVMs violate some constraints, meaning classify some examples incorrectly or by less than the margin, because otherwise they would overfit.
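In code, that penalty takes the form of the hinge loss. Below is a minimal, hedged sketch of a linear soft-margin SVM trained by plain subgradient descent; real SVM solvers use quadratic programming or far more refined optimizers, and the toy data, learning rate, and regularization weight here are all made up for illustration. The parameter `lam` plays the role of the trade-off between a wide margin and the penalty paid for violated constraints.

```python
def train_soft_margin(points, labels, lam=0.01, epochs=200, lr=0.1):
    """Linear soft-margin SVM via subgradient descent on the hinge loss.
    An example with y * (w.x + b) < 1 is misclassified or inside the
    margin; each such example adds to the penalty term."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        # Gradient of the regularizer (keeps the margin wide):
        gw = [lam * w[0], lam * w[1]]
        gb = 0.0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:  # violates the margin
                gw[0] -= y * x1 / n
                gw[1] -= y * x2 / n
                gb -= y / n
        w[0] -= lr * gw[0]
        w[1] -= lr * gw[1]
        b -= lr * gb
    return w, b

# Toy data: positives on the right, negatives on the left,
# plus one noisy negative deep inside the positive region.
points = [(2, 0), (3, 1), (3, -1), (-2, 0), (-3, 1), (-3, -1), (2.5, 0.5)]
labels = [1, 1, 1, -1, -1, -1, -1]  # the last label is the noisy negative

w, b = train_soft_margin(points, labels)
print(w, b)  # a frontier near x1 = 0, not one contorted around the outlier
```

Because chasing the noisy negative would cost more margin (and more hinge loss on the clean examples) than it saves, the learned frontier stays straight and simply pays the penalty for that one point.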
If there's a noisy negative example somewhere in the middle of the positive region, we don't want the frontier to wind around inside the positive region just to get that example right. But the SVM pays a penalty for each example it gets wrong, which encourages it to keep those to a minimum. SVMs are like the sandworms in Dune: big, tough, and able to survive a few explosions from slithering over landmines, but not too many.

So far we've only seen how to learn one level of clusters, but the world is, of course, much richer than that, with clusters within clusters all the way down to individual objects: living things cluster into plants and animals, animals into mammals, birds, fishes, and so on, all the way down to Fido the family dog. No problem: once we've learned one set of clusters, we can treat them as objects and cluster them in turn, and so on up to the cluster of all things. Alternatively, we can start with a coarse clustering and then further divide each cluster into subclusters: Robby's toys divide into stuffed animals, construction toys, and so on; stuffed animals into teddy bears, plush kittens, and so on. Children seem to start out in the middle and then work their way up and down. For example, they learn dog before they learn animal or beagle. This might be a good strategy for Robby as well.

Notice that the network has a separate feature for each pair of people: Alice and Bob both have the flu, Alice and Chris both have the flu, and so on. But we can't learn a separate weight for each pair, because we only have one data point per pair (whether it's infected or not), and we wouldn't be able to generalize to members of the network we haven't diagnosed yet (do Yvette and Zach both have the flu?). What we can do instead is learn a single weight for all features of the same form, based on all the instances of it that we've seen.
In effect, X and Y have the flu is a template for features that can be instantiated with each pair of acquaintances (Alice and Bob, Alice and Chris, etc.). The weights for all the instances of a template are "tied together," in the sense that they all have the same value, and that's how we can generalize despite having only one example (the whole network). In nonrelational learning, the parameters of a model are tied in only one way: across all the independent examples (e.g., all the patients we've diagnosed). In relational learning, every feature template we create ties the parameters of all its instances.

You've reached the final stage of your quest. You knock on the door of the Tower of Support Vectors. A menacing-looking guard opens it, and you suddenly realize that you don't know the password. "Kernel," you blurt out, trying to keep the panic from your voice. The guard bows and steps aside. Regaining your composure, you step in, mentally kicking yourself for your carelessness. The entire ground floor of the tower is taken up by a lavishly appointed circular chamber, with what seems to be a marble representation of an SVM occupying pride of place at the center. As you walk around it, you notice a door on the far side. It must lead to the central tower, the Tower of the Master Algorithm. The door seems unguarded. You decide to take a shortcut. Slipping through the doorway, you walk down a short corridor and find yourself in an even larger pentagonal chamber, with a door in each wall. In the center, a spiral staircase rises as high as the eye can see. You hear voices above and duck into the doorway opposite. This one leads to the Tower of Neural Networks. Once again you're in a circular chamber, this one with a sculpture of a multilayer perceptron as the centerpiece. Its parts are different from the SVM's, but their arrangement is remarkably similar.
Suddenly you see it: an SVM is just a multilayer perceptron with a hidden layer composed of kernels instead of S curves and an output that's a linear combination instead of another S curve.

If you'd like to learn more about machine learning in general, one good place to start is online courses. Of these, the closest in content to this book is, not coincidentally, the one I teach (www.coursera.org/course/machlearning). Two other options are Andrew Ng's course (www.coursera.org/course/ml) and Yaser Abu-Mostafa's (http://work.caltech.edu/telecourse.html). The next step is to read a textbook. The closest to this book, and one of the most accessible, is Tom Mitchell's Machine Learning (McGraw-Hill, 1997). More up-to-date, but also more mathematical, are Kevin Murphy's Machine Learning: A Probabilistic Perspective (MIT Press, 2012), Chris Bishop's Pattern Recognition and Machine Learning (Springer, 2006), and An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani (Springer, 2013). My article "A few useful things to know about machine learning" (Communications of the ACM, 2012) summarizes some of the "folk knowledge" of machine learning that textbooks often leave implicit and was one of the starting points for this book. If you know how to program and are itching to give machine learning a try, you can start from a number of open-source packages, such as Weka (www.cs.waikato.ac.nz/ml/weka).

The two main machine-learning journals are Machine Learning and the Journal of Machine Learning Research. Leading machine-learning conferences, with yearly proceedings, include the International Conference on Machine Learning, the Conference on Neural Information Processing Systems, and the International Conference on Knowledge Discovery and Data Mining. A large number of machine-learning talks are available on http://videolectures.net.
The www.KDnuggets.com website is a one-stop shop for machine-learning resources, and you can sign up for its newsletter to keep up-to-date with the latest developments.