Traditionally, the only way to get a computer to do something - from adding two numbers to flying an airplane - was to write down an algorithm explaining how, in painstaking detail. But machine-learning algorithms, also known as learners, are different: they figure it out on their own, by making inferences from data. And the more data they have, the better they get. Now we don't have to program computers; they program themselves.

In retrospect, we can see that the progression from computers to the Internet to machine learning was inevitable: computers enable the Internet, which creates a flood of data and the problem of limitless choice; and machine learning uses the flood of data to help solve the limitless choice problem. The Internet by itself is not enough to move demand from "one size fits all" to the long tail of infinite variety. Netflix may have one hundred thousand DVD titles in stock, but if customers don't know how to find the ones they like, they will default to choosing the hits. It's only when Netflix has a learning algorithm to figure out your tastes and recommend DVDs that the long tail really takes off.

The National Security Agency (NSA) has become infamous for its bottomless appetite for data: by one estimate, every day it intercepts over a billion phone calls and other communications around the globe. Privacy issues aside, however, it doesn't have millions of staffers to eavesdrop on all these calls and e-mails, or even just keep track of who's talking to whom. The vast majority of calls are perfectly innocent, and writing a program to pick out the few suspicious ones is very hard. In the old days, the NSA used keyword matching, but that's easy to get around. (Just call the bombing a "wedding" and the bomb the "wedding cake.") In the twenty-first century, it's a job for machine learning.
Secrecy is the NSA's trademark, but its director has testified to Congress that mining of phone logs has already halted dozens of terrorism threats.

In physics, the same equations applied to different quantities often describe phenomena in completely different fields, like quantum mechanics, electromagnetism, and fluid dynamics. The wave equation, the diffusion equation, Poisson's equation: once we discover it in one field, we can more readily discover it in others; and once we've learned how to solve it in one field, we know how to solve it in all. Moreover, all these equations are quite simple and involve the same few derivatives of quantities with respect to space and time. Quite conceivably, they are all instances of a master equation, and all the Master Algorithm needs to do is figure out how to instantiate it for different data sets.

So, if the Master Algorithm exists, what is it? A seemingly obvious candidate is memorization: just remember everything you've seen; after a while you'll have seen everything there is to see, and therefore know everything there is to know. The problem with this is that, as Heraclitus said, you never step in the same river twice. There's far more to see than you ever could. No matter how many snowflakes you've examined, the next one will be different. Even if you had been present at the Big Bang and everywhere since, you would still have seen only a tiny fraction of what you could see in the future. If you had witnessed life on Earth up to ten thousand years ago, that would not have prepared you for what was to come. Someone who grew up in one city doesn't become paralyzed when they move to another, but a robot capable only of memorization would. Besides, knowledge is not just a long list of facts. Knowledge is general, and has structure. "All humans are mortal" is much more succinct than seven billion statements of mortality, one for each human. Memorization gives us none of these things.
Philosophers have debated Hume's problem of induction ever since he posed it, but no one has come up with a satisfactory answer. Bertrand Russell liked to illustrate the problem with the story of the inductivist turkey. On his first morning at the farm, the turkey was fed at 9:00 a.m., but being a good inductivist, he didn't jump to conclusions. He first collected many observations on many different days under many different circumstances. Having been fed consistently at 9:00 a.m. for many consecutive days, he finally concluded that yes, he would always be fed at 9:00 a.m. Then came the morning of Christmas Eve, and his throat was cut.

Whatever you want to predict, there's a good chance someone has used a decision tree for it. Microsoft's Kinect uses decision trees to figure out where various parts of your body are from the output of its depth camera; it can then use their motions to control the Xbox game console. In a 2002 head-to-head competition, decision trees correctly predicted three out of every four Supreme Court rulings, while a panel of experts got less than 60 percent correct. Thousands of decision tree users can't be wrong, you think, and sketch one to predict your friend's reply when you ask her out.

This is how Rosenblatt's perceptron algorithm learns weights.

One consequence of crossing over program trees instead of bit strings is that the resulting programs can have any size, making the learning more flexible. The overall tendency is for bloat, however, with larger and larger trees growing as evolution goes on longer (also known as "survival of the fattest"). Evolutionaries can take comfort from the fact that human-written programs are no different (Microsoft Windows: forty-five million lines of code and counting), and that human-made code doesn't allow a solution as simple as adding a complexity penalty to the fitness function.

The problems for genetic programming do not end there.
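Rosenblatt's weight-learning rule, mentioned in passing above, can be sketched in a few lines: nudge each weight toward every example the current weights misclassify, and repeat until nothing is misclassified. The training data below (the AND function) is a hypothetical stand-in, not an example from the text.

```python
# A minimal sketch of Rosenblatt's perceptron learning rule.
# Hypothetical data: the linearly separable AND function.

def train_perceptron(examples, labels, rate=0.1, epochs=100):
    w = [0.0] * len(examples[0])   # one weight per input
    b = 0.0                        # bias (threshold)
    for _ in range(epochs):
        errors = 0
        for x, y in zip(examples, labels):      # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if activation > 0 else -1
            if pred != y:                       # misclassified: nudge toward it
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
                errors += 1
        if errors == 0:                         # converged
            break
    return w, b

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
               for x in X]
print(predictions)  # matches y once training has converged
```

Because AND is linearly separable, the perceptron convergence theorem guarantees the loop terminates with every example classified correctly.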
Indeed, even its successes might not be as genetic as evolutionaries would like. Take circuit design, which was genetic programming's emblematic success. As a rule, even relatively simple designs require an enormous amount of search, and it's not clear how much the results owe to brute force rather than genetic smarts. To address the growing chorus of critics, Koza included in his 1992 book Genetic Programming experiments showing that genetic programming beat randomly generating candidates on Boolean circuit synthesis problems, but the margin of victory was small. Then, at the 1995 International Conference on Machine Learning (ICML) in Lake Tahoe, California, Kevin Lang published a paper showing that hill climbing beat genetic programming on the same problems, often by a large margin. Koza and other evolutionaries had repeatedly tried to publish papers in ICML, a leading venue in the field, but to their increasing frustration they kept being rejected due to insufficient empirical validation. Seeing Lang's paper made Koza blow his top. In short order, he produced a twenty-three-page paper in two-column ICML format refuting Lang's conclusions and accusing the ICML reviewers of scientific misconduct. He then placed a copy on every seat in the conference auditorium. Depending on your point of view, either Lang's paper or Koza's response was the last straw; regardless, the Tahoe incident marked the final divorce between the evolutionaries and the rest of the machine-learning community, with the evolutionaries moving out of the house. Genetic programmers started their own conference, which merged with the genetic algorithms conference to form GECCO, the Genetic and Evolutionary Computing Conference. For its part, the machine-learning mainstream largely forgot them. A sad dénouement, but not the first time in history that sex is to blame for a breakup.
This is ironic, since Laplace was also the father of probability theory, which he believed was just common sense reduced to calculation. At the heart of his explorations in probability was a preoccupation with Hume's question. For example, how do we know the sun will rise tomorrow? It has done so every day until today, but that's no guarantee it will continue. Laplace's answer had two parts. The first is what we now call the principle of indifference, or principle of insufficient reason. We wake up one day - at the beginning of time, let's say, which for Laplace was five thousand years or so ago - and after a beautiful afternoon, we see the sun go down. Will it come back? We've never seen the sun rise, and there is no particular reason to believe it will or won't. Therefore we should consider the two scenarios equally likely and say that the sun will rise again with a probability of one-half. But, Laplace went on, if the past is any guide to the future, every day that the sun rises should increase our confidence that it will continue to do so. After five thousand years, the probability that the sun will rise yet again tomorrow should be very close to one, but not quite there, since we can never be completely certain. From this thought experiment, Laplace derived his so-called rule of succession, which estimates the probability that the sun will rise again after having risen n times as (n + 1) / (n + 2). When n = 0, this is just ½; and as n increases, so does the probability, approaching 1 as n approaches infinity.

Markov assumed (wrongly but usefully) that the probabilities are the same at every position in the text. Thus we need to estimate only three probabilities: P(Vowel_1 = True), P(Vowel_i+1 = True | Vowel_i = True), and P(Vowel_i+1 = True | Vowel_i = False). (Since probabilities sum to one, from these we can immediately obtain P(Vowel_1 = False), etc.)
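Both of these ideas reduce to counting. The sketch below computes Laplace's (n + 1) / (n + 2) directly, then estimates Markov's three probabilities from a text; the word "onegin" is a hypothetical stand-in for Markov's actual corpus (Pushkin's Eugene Onegin), used only so the counts fit on a page.

```python
# Laplace's rule of succession, and Markov's vowel-transition counts.
from fractions import Fraction

def rule_of_succession(n):
    # Probability of one more success after n successes in a row
    return Fraction(n + 1, n + 2)

print(rule_of_succession(0))                  # 1/2: no sunrise seen yet
print(float(rule_of_succession(5000 * 365)))  # close to, but short of, 1

# Markov's three probabilities, estimated by counting over a text.
text = "onegin"  # hypothetical stand-in for Markov's corpus
is_vowel = [c in "aeiou" for c in text]
pairs = list(zip(is_vowel, is_vowel[1:]))     # consecutive-letter pairs

p_first = 1.0 if is_vowel[0] else 0.0  # P(Vowel_1 = True), one sample
p_vv = sum(a and b for a, b in pairs) / sum(a for a, b in pairs)
p_cv = sum(not a and b for a, b in pairs) / sum(not a for a, b in pairs)
print(p_first, p_vv, p_cv)  # the three estimates Markov needs
```

On this toy string the letters strictly alternate, so the vowel-to-vowel estimate comes out 0 and the consonant-to-vowel estimate comes out 1; a real corpus gives fractions in between.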
As with Naïve Bayes, we can have as many variables as we want without the number of probabilities we need to estimate going through the roof, but now the variables actually depend on each other.

But something funny happened on the way to world domination. Researchers using Bayesian models kept noticing that you got better results by tweaking the probabilities in illegal ways. For example, raising P(words) to some power in speech recognizers improved accuracy, but then it wasn't Bayes' theorem any more. What was going on? The culprit, it turns out, was the false independence assumptions that generative models make. The simplified graph structure makes the models learnable and is worth keeping, but then we're better off just learning the best parameters we can for the task at hand, irrespective of whether they're probabilities. The real strength of, say, Naïve Bayes is that it provides a small, informative set of features from which to predict the class and a fast, robust way to learn the corresponding parameters. In a spam filter, each feature is the occurrence of a particular word in spam, and the corresponding parameter is how often it occurs; and similarly for nonspam. Viewed in this way, Naïve Bayes can be optimal, in the sense of making the best predictions possible, even in many cases where its independence assumptions are wildly violated. When I realized this and published a paper about it in 1996, people's suspicion of Naïve Bayes melted away, helping it to take off. But it was also a step on the way to a different kind of model, which in the last two decades has increasingly replaced Bayesian networks in machine learning: Markov networks.

Logic and probability: The star-crossed couple

Recommender systems, as they're also called, are big business: a third of Amazon's business comes from its recommendations, as does three-quarters of Netflix's.
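The spam-filter view of Naïve Bayes described earlier can be sketched in a few lines: count how often each word occurs in spam and in nonspam, then multiply the per-word parameters together and compare. The six training messages below are hypothetical, and the (count + 1) / (n + 2) smoothing is the rule of succession once again.

```python
# A minimal Naive Bayes spam filter: word occurrences as features,
# word frequencies as parameters. Training messages are hypothetical.
from collections import Counter

spam = ["win money now", "free money offer", "win a free prize"]
ham = ["meeting at noon", "lunch at the cafe", "notes from the meeting"]

def doc_counts(docs):
    # In how many documents does each word occur?
    return Counter(w for d in docs for w in set(d.split()))

spam_counts, n_spam = doc_counts(spam), len(spam)
ham_counts, n_ham = doc_counts(ham), len(ham)

def classify(message):
    # Compare P(spam) * prod_w P(w | spam) with the same product for ham.
    # (count + 1) / (n + 2) is Laplace smoothing (the rule of succession).
    p_spam = n_spam / (n_spam + n_ham)
    p_ham = n_ham / (n_spam + n_ham)
    for w in set(message.split()):
        p_spam *= (spam_counts[w] + 1) / (n_spam + 2)
        p_ham *= (ham_counts[w] + 1) / (n_ham + 2)
    return "spam" if p_spam > p_ham else "ham"

print(classify("free money"))     # words seen mostly in spam
print(classify("meeting notes"))  # words seen mostly in nonspam
```

Note that the two products are compared, never normalized into true probabilities, which is exactly why the "illegal tweaks" described above can still improve accuracy: only the ordering of the two scores matters.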
It's a far cry from the early days of nearest-neighbor, when it was considered impractical because of its memory requirements. Back then, computer memories were made of small iron rings, one per bit, and storing even a few thousand examples was taxing. How times have changed. Nevertheless, it's not necessarily smart to remember all the examples you've seen and then have to search through them, particularly since most are probably irrelevant. If you look back at the map of Posistan and Negaland, you may notice that if Positiville disappeared, nothing would change. The metro areas of nearby cities would expand into the land formerly occupied by Positiville, but since they're all Posistan cities, the border with Negaland would stay the same. The only cities that really matter are the ones across the border from a city in the other country; all others we can omit. So a simple way to make nearest-neighbor more efficient is to delete all the examples that are correctly classified by their neighbors. This and other tricks enable nearest-neighbor methods to be used in some surprising areas, like controlling robot arms in real time. But needless to say, they're still not the first choice for things like high-frequency trading, where computers buy and sell stocks in fractions of a second. In a race between a neural network, which can be applied to an example with only a fixed number of additions, multiplications, and sigmoids, and an algorithm that needs to search a large database for the example's nearest neighbors, the neural network is sure to win.
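The pruning trick above can be sketched directly: keep only the examples whose nearest neighbor disagrees with them, i.e., the ones sitting at the border. The one-dimensional data set below is hypothetical; points below 5 are negative, points above are positive.

```python
# A sketch of pruning for nearest-neighbor: delete every example that its
# neighbors already classify correctly. Hypothetical 1-D data set.

def nearest_neighbor(x, examples):
    # The stored example closest to x (examples are (position, label) pairs)
    return min(examples, key=lambda e: abs(e[0] - x))

def prune(examples):
    kept = []
    for i, (x, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        if nearest_neighbor(x, rest)[1] != label:
            kept.append((x, label))   # near the border: keep it
    return kept

data = [(1, "-"), (2, "-"), (4, "-"), (5, "+"), (7, "+"), (8, "+")]
border = prune(data)
print(border)  # only the two examples flanking the boundary survive
```

Here the interior examples vanish - like Positiville - yet the two survivors, at 4 and 5, still classify every original point correctly, so the border with "Negaland" is unchanged.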