Once we look at machine learning this way, two things immediately jump out. The first is that the more data we have, the more we can learn. No data? Nothing to learn. Big data? Lots to learn. That's why machine learning has been turning up everywhere, driven by exponentially growing mountains of data. If machine learning was something you bought in the supermarket, its carton would say: "Just add data."

Sets of rules are popular with retailers who are deciding which goods to stock. Typically, they use a more exhaustive approach than "divide and conquer," looking for all rules that strongly predict the purchase of each item. Walmart was a pioneer in this area. One of their early findings was that if you buy diapers you are also likely to buy beer. Huh? One interpretation of this is that Mom sends Dad to the supermarket to buy diapers, and as emotional compensation, Dad buys a case of beer to go with them. Knowing this, the supermarket can now sell more beer by putting it next to the diapers, which would never have occurred to it without rule mining. The "beer and diapers" rule has acquired legendary status among data miners (although some claim the legend is of the urban variety). Either way, it's a long way from the digital circuit design problems Michalski had in mind when he first started thinking about rule induction in the 1960s. When you invent a new learning algorithm, you can't even begin to imagine all the things it will be used for.

Geoff Hinton went on to try many variations on Boltzmann machines over the following decades. Hinton, a psychologist turned computer scientist and great-great-grandson of George Boole, the inventor of the logical calculus used in all digital computers, is the world's leading connectionist. He has tried longer and harder to understand how the brain works than anyone else. He tells of coming home from work one day in a state of great excitement, exclaiming "I did it!
I've figured out how the brain works!" His daughter replied, "Oh, Dad, not again!"

Hinton's latest passion is deep learning, which we'll meet later in this chapter. He was also involved in the development of backpropagation, an even better algorithm than Boltzmann machines for solving the credit-assignment problem that we'll look at next. Boltzmann machines could solve the credit-assignment problem in principle, but in practice learning was very slow and painful, making this approach impractical for most applications. The next breakthrough involved getting rid of another oversimplification that dated all the way back to McCulloch and Pitts.

Nurturing nature

Here's the crucial point: Bob calling depends on Burglary and Earthquake, but only through Alarm. Bob's call is conditionally independent of Burglary and Earthquake given Alarm, and so is Claire's. If the alarm doesn't go off, your neighbors sleep soundly, and the burglar proceeds undisturbed. Also, Bob and Claire are independent given Alarm. Without this independence structure, you'd need to learn 2⁵ = 32 probabilities, one for each possible state of the five variables. (Or 31, if you're a stickler for details, since the last one can be left implicit.) With the conditional independencies, all you need is 1 + 1 + 4 + 2 + 2 = 10, a savings of 68 percent. And that's just in this tiny example; with hundreds or thousands of variables, the savings would be very close to 100 percent.

You'd think that Bayesians and symbolists would get along great, given that they both believe in a first-principles approach to learning, rather than a nature-inspired one. Far from it. Symbolists don't like probabilities and tell jokes like "How many Bayesians does it take to change a lightbulb? They're not sure. Come to think of it, they're not sure the lightbulb is burned out." More seriously, symbolists point to the high price we pay for probability.
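Stepping back to the alarm example for a moment, the parameter tally above (1 + 1 + 4 + 2 + 2 = 10 versus 2⁵ = 32) can be checked with a quick sketch. The variable names below are illustrative, and each variable is assumed binary, so a node needs one probability per joint state of its parents:

```python
# Each binary variable needs one probability per joint state of its parents,
# i.e. 2 ** (number of parents). Names follow the burglar-alarm story.
parents = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "BobCalls": ["Alarm"],
    "ClaireCalls": ["Alarm"],
}

# With the independence structure: sum the per-node counts (1+1+4+2+2).
with_structure = sum(2 ** len(p) for p in parents.values())

# Without it: one probability per joint state of all five variables.
without_structure = 2 ** len(parents)

print(with_structure, without_structure)  # 10 32
```

The gap widens exponentially: each extra variable doubles the unstructured count but adds only a handful of numbers to the structured one.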
Inference suddenly becomes a lot more expensive, all those numbers are hard to understand, we have to deal with priors, and hordes of zombie hypotheses chase us around forever. The ability to compose pieces of knowledge on the fly, so dear to symbolists, is gone. Worst of all, we don't know how to put probability distributions on many of the things we need to learn. A Bayesian network is a distribution over a vector of variables, but what about distributions over networks, databases, knowledge bases, languages, plans, and computer programs, to name a few? All of these are easily handled in logic, and an algorithm that can't learn them is clearly not the Master Algorithm.

Principal-component analysis (PCA), as this process is known, is one of the key tools in the scientist's toolkit. You could say PCA is to unsupervised learning what linear regression is to the supervised variety. The famous hockey-stick curve of global warming, for example, is the result of finding the principal component of various temperature-related data series (tree rings, ice cores, etc.) and assuming it's the temperature. Biologists use PCA to summarize the expression levels of thousands of different genes into a few pathways. Psychologists have found that personality boils down to five dimensions (extroversion, agreeableness, conscientiousness, neuroticism, and openness to experience), which they can infer from your tweets and blog posts. (Chimps supposedly have one more dimension, reactivity, but Twitter data for them is not available.) Applying PCA to congressional votes and poll data shows that, contrary to popular belief, politics is not mainly about liberals versus conservatives. Rather, people differ along two main dimensions: one for economic issues and one for social ones. Collapsing these into a single axis mixes together populists and libertarians, who are polar opposites, and creates the illusion of lots of moderates in the middle.
Trying to appeal to them is an unlikely winning strategy. On the other hand, if liberals and libertarians overcame their mutual aversion, they could ally themselves on social issues, where both favor individual freedom.

One of the most popular algorithms for nonlinear dimensionality reduction, called Isomap, does just this. It connects each data point in a high-dimensional space (a face, say) to all nearby points (very similar faces), computes the shortest distances between all pairs of points along the resulting network, and finds the reduced coordinates that best approximate these distances. In contrast to PCA, faces' coordinates in this space are often quite meaningful: one may represent which direction the face is facing (left profile, three quarters, head on, etc.); another how the face looks (very sad, a little sad, neutral, happy, very happy, etc.); and so on. From understanding motion in video to detecting emotion in speech, Isomap has a surprising ability to zero in on the most important dimensions of complex data.

The evaluation component is a scoring function that says how good a model is. Symbolists use accuracy or information gain. Connectionists use a continuous error measure, such as squared error, which is the sum of the squares of the differences between the predicted values and the true ones. Bayesians use the posterior probability. Analogizers (at least of the SVM stripe) use the margin. In addition to how well the model fits the data, all tribes take into account other desirable properties, such as the model's simplicity.

Chief among these tools is the Master Algorithm. Whether it arrives sooner or later, and whether or not it looks like Alchemy, is less important than what it encapsulates: the essential capabilities of a learning algorithm, and where they'll take us.
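To make the connectionists' scoring function concrete, here is a minimal sketch of the squared-error measure described above; the predictions and true values are made up for illustration:

```python
def squared_error(predicted, actual):
    """Sum of squared differences between predicted and true values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

# Made-up example: three predictions scored against their true values.
predicted = [2.0, 3.5, 1.0]
actual = [2.5, 3.0, 1.0]
print(squared_error(predicted, actual))  # 0.5
```

The lower the score, the better the model fits the data, which is why a learner adjusts its parameters to drive this number down.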
We can equally well think of the Master Algorithm as a composite picture of current and future learners, which we can conveniently use in our thought experiments in lieu of the specific algorithm inside product X or website Y, which the respective companies are unlikely to share with us anyway. Seen in this light, the learners we interact with every day are embryonic versions of the Master Algorithm, and our task is to understand them and shape their growth to better serve our needs.

One Algorithm to rule them all, One Algorithm to find them,

The trajectory we're on is not a singularity but a phase transition. Its critical point, the Turing point, will come when machine learning overtakes the natural variety. Natural learning itself has gone through three phases: evolution, the brain, and culture. Each is a product of the previous one, and each learns faster. Machine learning is the logical next stage of this progression. Computer programs are the fastest replicators on Earth: copying them takes only a fraction of a second. But creating them is slow, if it has to be done by humans. Machine learning removes that bottleneck, leaving a final one: the speed at which humans can absorb change. This too will eventually be removed, but not because we'll decide to hand things off to our "mind children," as Hans Moravec calls them, and go gently into the good night. Humans are not a dying twig on the tree of life. On the contrary, we're about to start branching.

First of all, I thank my companions in scientific adventure: students, collaborators, colleagues, and everyone in the machine-learning community. This is your book as much as mine. I hope you will forgive my many oversimplifications and omissions, and the somewhat fanciful way in which parts of the book are written.
Sasha Issenberg's The Victory Lab (Broadway Books, 2012) dissects the use of data analysis in politics. "How President Obama's campaign used big data to rally individual votes," by the same author (MIT Technology Review, 2013), tells the story of its greatest success to date. Nate Silver's The Signal and the Noise (Penguin Press, 2012) has a chapter on his poll aggregation method.

The Cyc project is described in "Cyc: Toward programs with common sense,"* by Douglas Lenat et al. (Communications of the ACM, 1990). Peter Norvig discusses Noam Chomsky's criticisms of statistical learning in "On Chomsky and the two cultures of statistical learning" (http://norvig.com/chomsky.html). Jerry Fodor's The Modularity of Mind (MIT Press, 1983) summarizes his views on how the mind works. "What big data will never explain," by Leon Wieseltier (New Republic, 2013), and "Pundits, stop sounding ignorant about data," by Andrew McAfee (Harvard Business Review, 2013), give a flavor of the controversy surrounding what big data can and can't do. Daniel Kahneman explains why algorithms often beat intuitions in chapter 21 of Thinking, Fast and Slow (Farrar, Straus and Giroux, 2011). David Patterson makes the case for the role of computing and data in the fight against cancer in "Computer scientists may have what it takes to help cure cancer" (New York Times, 2011).