If you're a student of any age, a high schooler wondering what to major in, a college undergraduate deciding whether to go into research, or a seasoned professional considering a career change, my hope is that this book will spark in you an interest in this fascinating field. The world has a dire shortage of machine-learning experts, and if you decide to join us, you can look forward to not only exciting times and material rewards but also a unique opportunity to serve society. And if you're already studying machine learning, I hope the book will help you get the lay of the land; if in your travels you chance upon the Master Algorithm, that alone makes it worth writing.

In retrospect, we can see that the progression from computers to the Internet to machine learning was inevitable: computers enable the Internet, which creates a flood of data and the problem of limitless choice; and machine learning uses the flood of data to help solve the limitless-choice problem. The Internet by itself is not enough to move demand from "one size fits all" to the long tail of infinite variety. Netflix may have one hundred thousand DVD titles in stock, but if customers don't know how to find the ones they like, they will default to choosing the hits. It's only when Netflix has a learning algorithm to figure out your tastes and recommend DVDs that the long tail really takes off.

The best way for a company to ensure that learners like its products is to run them itself. Whoever has the best algorithms and the most data wins. A new type of network effect takes hold: whoever has the most customers accumulates the most data, learns the best models, wins the most new customers, and so on in a virtuous circle (or a vicious one, if you're the competition).
Switching from Google to Bing may be easier than switching from Windows to Mac, but in practice you don't, because Google, with its head start and larger market share, knows better what you want, even if Bing's technology is just as good. And pity a new entrant into the search business, starting with zero data against engines with over a decade of learning behind them.

It's true that some things are predictable and some aren't, and the first duty of the machine learner is to distinguish between them. But the goal of the Master Algorithm is to learn everything that can be known, and that's a vastly wider domain than Taleb and others imagine. The housing bust was far from a black swan; on the contrary, it was widely predicted. Most banks' models failed to see it coming, but that was due to well-understood limitations of those models, not limitations of machine learning in general. Learning algorithms are quite capable of accurately predicting rare, never-before-seen events; you could even say that's what machine learning is all about. What's the probability of a black swan if you've never seen one? How about the fraction of known species that belatedly turned out to have black specimens? This is only a crude example; we'll see many deeper ones in this book.

For analogizers, the key to learning is recognizing similarities between situations and thereby inferring other similarities. If two patients have similar symptoms, perhaps they have the same disease. The key problem is judging how similar two things are. The analogizers' master algorithm is the support vector machine, which figures out which experiences to remember and how to combine them to make new predictions.

The power of rule sets is a double-edged sword. On the upside, you know you can always find a rule set that perfectly matches the data. But before you start feeling lucky, realize that you're at severe risk of finding a completely meaningless one.
Remember the "no free lunch" theorem: you can't learn without knowledge. And assuming that the concept can be defined by a set of rules is tantamount to assuming nothing.

In practice, Valiant-style analysis tends to be very pessimistic and to call for more data than you have. So how do you decide whether to believe what a learner tells you? Simple: you don't believe anything until you've verified it on data that the learner didn't see. If the patterns the learner hypothesized also hold true on new data, you can be pretty confident that they're real. Otherwise you know the learner overfit. This is just the scientific method applied to machine learning: it's not enough for a new theory to explain past evidence, because it's easy to concoct a theory that does that; the theory must also make new predictions, and you only accept it after they've been experimentally verified. (And even then only provisionally, because future evidence could still falsify it.)

Your friend Ben is also pretty good, but he's had a bit too much to drink. His darts are all over, but he loudly points out that on average he's hitting the bull's-eye. (Maybe he should have been a statistician.) This is the low-bias, high-variance case, shown in the bottom right corner. Ben's girlfriend, Ashley, is very steady, but she has a tendency to aim too high and to the right. She has low variance and high bias (top left corner). Cody, who's visiting from out of town and has never played darts before, is both all over and off center. He has both high bias and high variance (top right).

Eliminating sex would leave evolutionaries with only mutation to power their engine. If the size of the population is substantially larger than the number of genes, chances are that every point mutation is represented in it, and the search becomes a type of hill climbing: try all possible one-step variations, pick the best one, and repeat.
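The hill-climbing loop just described can be written in a few lines. Here is a minimal sketch in Python; the bit-string state and the fitness function (which simply counts 1s) are made-up stand-ins for whatever genome and objective a real problem would use:

```python
import random

def fitness(bits):
    # Made-up objective for illustration: reward bit strings with more 1s
    return sum(bits)

def hill_climb(n_genes=20, seed=0):
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(n_genes)]
    while True:
        # Try all possible one-step variations (single point mutations)
        neighbors = []
        for i in range(n_genes):
            variant = state[:]
            variant[i] = 1 - variant[i]
            neighbors.append(variant)
        # Pick the best one and repeat, until no variation improves on the
        # current state (a local maximum)
        best = max(neighbors, key=fitness)
        if fitness(best) <= fitness(state):
            return state
        state = best
```

With an objective this simple there are no local maxima to get trapped in, so the plain loop suffices; on harder landscapes it would need the enhancements discussed next.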
(Or pick several of the best variations, in which case it's called beam search.) Symbolists, in particular, use this all the time to learn sets of rules, although they don't think of it as a form of evolution. To avoid getting trapped in local maxima, hill climbing can be enhanced with randomness (make a downhill move with some probability) and random restarts (after a while, jump to a random state and continue from there). Doing this is enough to find good solutions to problems; whether the benefit of adding crossover justifies the extra computational cost remains an open question.

The Bayesian method is not just applicable to learning Bayesian networks and their special cases. (Conversely, despite their name, Bayesian networks aren't necessarily Bayesian: frequentists can learn them, too, as we just saw.) We can put a prior distribution on any class of hypotheses (sets of rules, neural networks, programs) and then update it with the hypotheses' likelihood given the data. The Bayesians' view is that it's up to you what representation you choose, but then you have to learn it using Bayes' theorem. In the 1990s, they mounted a spectacular takeover of the Conference on Neural Information Processing Systems (NIPS for short), the main venue for connectionist research. The ringleaders (so to speak) were David MacKay, Radford Neal, and Michael Jordan. MacKay, a Brit who was a student of John Hopfield's at Caltech and later became chief scientific advisor to the UK's Department of Energy, showed how to learn multilayer perceptrons the Bayesian way. Neal introduced the connectionists to MCMC, and Jordan introduced them to variational inference. Finally, they pointed out that in the limit you could "integrate out" the neurons in a multilayer perceptron, leaving a type of Bayesian model that made no reference to them. Before long, the word neural in the title of a paper submitted to NIPS became a good predictor of rejection.
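The Bayesian recipe above, a prior over a class of hypotheses updated with each example's likelihood, can be sketched on the simplest possible hypothesis class: candidate biases of a coin. The grid of nine biases and the toy data below are made up purely for illustration:

```python
# Hypothesis class: nine candidate coin biases; prior: uniform over them
hypotheses = [i / 10 for i in range(1, 10)]
prior = {h: 1 / len(hypotheses) for h in hypotheses}

def update(dist, outcome):
    # One step of Bayes' theorem: posterior is proportional to
    # prior times the likelihood of the observed outcome
    unnorm = {h: p * (h if outcome == 1 else 1 - h)
              for h, p in dist.items()}
    z = sum(unnorm.values())  # normalizing constant
    return {h: p / z for h, p in unnorm.items()}

posterior = prior
for outcome in [1, 1, 0, 1, 1, 1, 0, 1]:  # six heads, two tails
    posterior = update(posterior, outcome)

# The most probable hypothesis ends up near the observed frequency 6/8
best = max(posterior, key=posterior.get)
```

Nothing here depends on the hypotheses being coin biases; the same prior-times-likelihood loop applies whether the hypotheses are rule sets, networks, or programs, which is exactly the Bayesians' point.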
Some researchers joked that the conference should change its name to BIPS, for Bayesian Information Processing Systems.

Planetary-scale machine learning

Google + Master Algorithm = Skynet?

But the journey is far from over. We don't have the Master Algorithm yet, just a glimpse of what it might look like. What if something fundamental is still missing, something all of us in the field, steeped in its history, can't see? We need new ideas, and ideas that are not just variations on the ones we already have. That's why I wrote this book: to start you thinking. I teach an evening class on machine learning at the University of Washington. In 2007, soon after the Netflix Prize was announced, I proposed it as one of the class projects. Jeff Howbert, a student in the class, got hooked and continued to work on it after the class was over. He wound up being a member of one of the two winning teams, two years after learning about machine learning for the first time. Now it's your turn. To learn more about machine learning, check out the section on further readings at the end of the book. Download some data sets from the UCI repository (archive.ics.uci.edu/ml/) and start playing. When you're ready, check out Kaggle.com, a whole website dedicated to running machine-learning competitions, and pick one or two to enter. Of course, it'll be more fun if you recruit a friend or two to work with you. If you're hooked, like Jeff was, and wind up becoming a professional data scientist, welcome to the most fascinating job in the world. If you find yourself dissatisfied with today's learners, invent new ones, or just do it for fun. My fondest wish is that your reaction to this book will be like my reaction to the first AI book I read, over twenty years ago: there's so much to do here, I don't know where to start. If one day you invent the Master Algorithm, please don't run to the patent office with it. Open-source it.
The Master Algorithm is too important to be owned by any one person or organization. Its applications will multiply faster than you can license it. But if you decide instead to do a startup, remember to give a share in it to every man, woman, and child on Earth.

Further Readings

Model Ensembles: Foundations and Algorithms,* by Zhi-Hua Zhou (Chapman and Hall, 2012), is an introduction to metalearning. The original paper on stacking is "Stacked generalization,"* by David Wolpert (Neural Networks, 1992). Leo Breiman introduced bagging in "Bagging predictors"* (Machine Learning, 1996) and random forests in "Random forests"* (Machine Learning, 2001). Boosting is described in "Experiments with a new boosting algorithm," by Yoav Freund and Rob Schapire (Proceedings of the Thirteenth International Conference on Machine Learning, 1996).