In retrospect, we can see that the progression from computers to the Internet to machine learning was inevitable: computers enable the Internet, which creates a flood of data and the problem of limitless choice; and machine learning uses the flood of data to help solve the limitless choice problem. The Internet by itself is not enough to move demand from “one size fits all” to the long tail of infinite variety. Netflix may have one hundred thousand DVD titles in stock, but if customers don’t know how to find the ones they like, they will default to choosing the hits. It’s only when Netflix has a learning algorithm to figure out your tastes and recommend DVDs that the long tail really takes off.

The larger outcome is that democracy works better because the bandwidth of communication between voters and politicians increases enormously. In these days of high-speed Internet, the amount of information your elected representatives get from you is still decidedly nineteenth century: a hundred bits or so every two years, as much as fits on a ballot. This is supplemented by polling and perhaps the occasional e-mail or town-hall meeting, but that’s still precious little. Big data and machine learning change the equation. In the future, provided voter models are accurate, elected officials will be able to ask voters what they want a thousand times a day and act accordingly, without having to pester the actual flesh-and-blood citizens.

Machine learners versus knowledge engineers

Since perceptrons can only learn linear boundaries, they can’t learn XOR. And if they can’t do even that, they’re not a very good model of how the brain learns, or a viable candidate for the Master Algorithm.

A living cell is a quintessential example of a nonlinear system. The cell performs all of its functions by turning raw materials into end products through a complex web of chemical reactions.
We can discover the structure of this network using symbolist methods like inverse deduction, as we saw in the last chapter, but to build a complete model of a cell we need to get quantitative, learning the parameters that couple the expression levels of different genes, relate environmental variables to internal ones, and so on. This is difficult because there is no simple linear relationship between these quantities. Rather, the cell maintains its stability through interlocking feedback loops, leading to very complex behavior. Backpropagation is well suited to this problem because of its ability to efficiently learn nonlinear functions. If we had a complete map of the cell’s metabolic pathways and enough observations of all the relevant variables, backprop could in principle learn a detailed model of the cell, with a multilayer perceptron to predict each variable as a function of its immediate causes.

We can get even fancier by allowing rules for intermediate concepts to evolve, and then chaining these rules at performance time. For example, we could evolve the rules “If the e-mail contains the word loan, then it’s a scam” and “If the e-mail is a scam, then it’s spam.” Since a rule’s consequent is no longer always spam, this requires introducing additional bits in rule strings to represent their consequents. Of course, the computer doesn’t literally use the word scam; it just comes up with some arbitrary bit string to represent the concept, but that’s good enough for our purposes. Sets of rules like this, which Holland called classifier systems, are one of the workhorses of the machine-learning tribe he founded: the evolutionaries. Like multilayer perceptrons, classifier systems face the credit-assignment problem (what is the fitness of rules for intermediate concepts?), and Holland devised the so-called bucket brigade algorithm to solve it. Nevertheless, classifier systems are much less widely used than multilayer perceptrons.
No one is sure why sex is pervasive in nature, either. Several theories have been proposed, but none is widely accepted. The leader of the pack is the Red Queen hypothesis, popularized by Matt Ridley in the eponymous book. As the Red Queen said to Alice in Through the Looking Glass, “It takes all the running you can do, to keep in the same place.” In this view, organisms are in a perpetual arms race with parasites, and sex helps keep the population varied, so that no single germ can infect all of it. If this is the answer, then sex is irrelevant to machine learning, at least until learned programs have to vie with computer viruses for processor time and memory. (Intriguingly, Danny Hillis claims that deliberately introducing coevolving parasites into a genetic algorithm can help it escape local maxima by gradually ratcheting up the difficulty, but no one has followed up on this yet.) Christos Papadimitriou and colleagues have shown that sex optimizes not fitness but what they call mixability: a gene’s ability to do well on average when combined with other genes. This can be useful when the fitness function is either not known or not constant, as in natural selection, but in machine learning and optimization, hill climbing tends to do better.

The hypothesis can be as complex as a whole Bayesian network, or as simple as the probability that a coin will come up heads. In the latter case, the data is just the outcome of a series of coin flips. If, say, we obtain seventy heads in a hundred flips, a frequentist would estimate the probability of heads as 0.7. This is justified by the so-called maximum likelihood principle: of all the possible probabilities of heads, 0.7 is the one under which seeing seventy heads in a hundred flips is most likely. The likelihood of a hypothesis is P(data | hypothesis), and the principle says we should pick the hypothesis that maximizes it. Bayesians do something more subtle, though.
They point out that we never know for sure which hypothesis is the true one, and so we shouldn’t just pick one hypothesis, like a value of 0.7 for the probability of heads; rather, we should compute the posterior probability of every possible hypothesis and entertain all of them when making predictions. The sum of the probabilities of all the hypotheses must be one, so if one becomes more likely, the others become less. For a Bayesian, in fact, there is no such thing as the truth; you have a prior distribution over hypotheses, after seeing the data it becomes the posterior distribution, as given by Bayes’ theorem, and that’s all.

The reason lazy learning wins is that forming a global model, such as a decision tree, is much harder than just figuring out where specific query points lie, one at a time. Imagine trying to define what a face is with a decision tree. You could say it has two eyes, a nose, and a mouth, but what is an eye and how do you find it in an image? What if the person’s eyes are closed? Reliably defining a face all the way down to individual pixels is extremely difficult, particularly given all the different expressions, poses, contexts, and lighting conditions a face could appear in. Instead, nearest-neighbor takes a shortcut: if the image in its database most similar to the one Jane just uploaded is of a face, then so is Jane’s. For this to work, the database needs to contain an image that’s similar enough to the new one (for example, a face with similar pose, lighting, and so on), so the bigger the database, the better. For a simple two-dimensional problem like guessing the border between two countries, a tiny database suffices. For a very hard problem like identifying faces, where the color of each pixel is a dimension of variation, we need a huge database. But these days we have them. Learning from them may be too costly for an eager learner, which explicitly draws the border between faces and nonfaces.
For nearest-neighbor, however, the border is implicit in the locations of the data points and the distance measure, and the only cost is at query time.

Superficially, an SVM looks a lot like weighted k-nearest-neighbor: the frontier between the positive and negative classes is defined by a set of examples and their weights, together with a similarity measure. A test example belongs to the positive class if, on average, it looks more like the positive examples than the negative ones. The average is weighted, and the SVM remembers only the key examples required to pin down the frontier. If you look back at the Posistan/Negaland example, once we throw away all the towns that aren’t on the border, all that’s left is this map:

A cluster is a set of similar entities, or at a minimum, a set of entities that are more similar to each other than to members of other clusters. It’s human nature to cluster things, and it’s often the first step on the road to knowledge. When we look up at the night sky, we can’t help seeing clusters of stars, and then we fancifully name them after shapes they resemble. Noticing that certain sets of elements had very similar chemical properties was the first step in discovering the periodic table. Each of those sets is now a column in it. Everything we perceive is a cluster, from friends’ faces to speech sounds. Without them, we’d be lost: children can’t learn a language before they learn to identify the characteristic sounds it’s made of, which they do in their first year of life, and all the words they then learn mean nothing without the clusters of real things they refer to. Confronted with big data (a very large number of objects), our first recourse is to group them into a more manageable number of clusters. A whole market is too coarse, and individual customers are too fine, so marketers divide markets into segments, which is their word for clusters.
Even objects themselves are at bottom clusters of their observations, from all the different angles light falls on Mommy’s face to all the different sound waves baby hears as the word mommy. And we can’t think without objects, which is perhaps why quantum mechanics is so unintuitive: we want to visualize the subatomic world as particles colliding, or waves interfering, but it’s not really either.

You’d probably be disappointed if you looked at the principal components of a face data set, though. They’re not what you’d expect, such as facial expressions or features, but more like ghostly faces, blurred beyond recognition. This is because PCA is a linear algorithm, and so all that the principal components can be is weighted pixel-by-pixel averages of real faces. (Also known as eigenfaces because they’re eigenvectors of the centered covariance matrix of the data, but I digress.) To really understand faces, and most shapes in the world, we need something else: nonlinear dimensionality reduction.

The first step accomplished, you hurry on to the Bayesian district. Even from a distance, you can see how it clusters around the Cathedral of Bayes’ Theorem. MCMC Alley zigzags randomly along the way. This is going to take a while. You take a shortcut onto Belief Propagation Street, but it seems to loop around forever. Then you see it: the Most Likely Avenue, rising majestically toward the Posterior Probability Gate. Rather than average over all models, you can head straight for the most probable one, confident that the resulting predictions will be almost the same. And you can let genetic search pick the model’s structure and gradient descent its parameters. With a sigh of relief, you realize that’s all the probabilistic inference you’ll need, at least until it’s time to answer questions using the model.

A company like this could quickly become one of the most valuable in the world.
As Alexis Madrigal of the Atlantic points out, today your profile can be bought for half a cent or less, but the value of a user to the Internet advertising industry is more like $1,200 per year. Google’s sliver of your data is worth about $20, Facebook’s $5, and so on. Add to that all the slivers that no one has yet, and the fact that the whole is more than the sum of the parts (a model of you based on all your data is much better than a thousand models based on a thousand slivers), and we’re looking at easily over a trillion dollars per year for an economy the size of the United States. It doesn’t take a large cut of that to make a Fortune 500 company. If you decide to take up the challenge and wind up becoming a billionaire, remember where you first got the idea.

Judea Pearl’s pioneering work on Bayesian networks appears in his book Probabilistic Reasoning in Intelligent Systems* (Morgan Kaufmann, 1988). “Bayesian networks without tears,”* by Eugene Charniak (AI Magazine, 1991), is a largely nonmathematical introduction to them. “Probabilistic interpretation for MYCIN’s certainty factors,”* by David Heckerman (Proceedings of the Second Conference on Uncertainty in Artificial Intelligence, 1986), explains when sets of rules with confidence estimates are and aren’t a reasonable approximation to Bayesian networks. “Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data,” by Eran Segal et al. (Nature Genetics, 2003), is an example of using Bayesian networks to model gene regulation. “Microsoft virus fighter: Spam may be more difficult to stop than HIV,” by Ben Paynter (Fast Company, 2012), tells how David Heckerman took inspiration from spam filters and used Bayesian networks to design a potential AIDS vaccine. The probabilistic or “noisy” OR is explained in Pearl’s book.* “Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base,” by M. A.
Shwe et al. (Parts I and II, Methods of Information in Medicine, 1991), describes a noisy-OR Bayesian network for medical diagnosis. Google’s Bayesian network for ad placement is described in Section 26.5.4 of Kevin Murphy’s Machine Learning* (MIT Press, 2012). Microsoft’s player rating system is described in “TrueSkill™: A Bayesian skill rating system,”* by Ralf Herbrich, Tom Minka, and Thore Graepel (Advances in Neural Information Processing Systems 19, 2007).