Traditionally, the only way to get a computer to do something, from adding two numbers to flying an airplane, was to write down an algorithm explaining how, in painstaking detail. But machine-learning algorithms, also known as learners, are different: they figure it out on their own, by making inferences from data. And the more data they have, the better they get. Now we don't have to program computers; they program themselves.

Another potential source of objections to the Master Algorithm is the notion, popularized by the psychologist Jerry Fodor, that the mind is composed of a set of modules with only limited communication between them. For example, when you watch TV your "higher brain" knows that it's only light flickering on a flat surface, but your visual system still sees three-dimensional shapes. Even if we believe in the modularity of mind, however, that does not imply that different modules use different learning algorithms. The same algorithm operating on, say, visual and verbal information may suffice.

Crucially, the Master Algorithm is not required to start from scratch in each new problem. That bar is probably too high for any learner to meet, and it's certainly very unlike what people do. For example, language does not exist in a vacuum; we couldn't understand a sentence without our knowledge of the world it refers to. Thus, when learning to read, the Master Algorithm can rely on having previously learned to see, hear, and control a robot. Likewise, a scientist does not just blindly fit models to data; he can bring all his knowledge of the field to bear on the problem. Therefore, when making discoveries in biology, the Master Algorithm can first read all the biology it wants, relying on having previously learned to read. The Master Algorithm is not just a passive consumer of data; it can interact with its environment and actively seek the data it wants, like Adam, the robot scientist, or like any child exploring her world.
Overfitting is seriously exacerbated by noise. Noise in machine learning just means errors in the data, or random events that you can't predict. Suppose that your friend really does like to go clubbing when there's nothing interesting on TV, but you misremembered occasion number 3 and wrote down that there was something good on TV that night. If you now try to come up with a set of rules that makes an exception for that night, you'll probably wind up with a worse answer than if you'd just ignored it. Or suppose that your friend had a hangover from going out the previous night and said no when ordinarily she would have said yes. Unless you know about the hangover, learning a set of rules that gets this example right is actually counterproductive: you're better off "misclassifying" it as a no. It gets worse: noise can make it impossible to come up with any consistent set of rules. Notice that occasions 2 and 3 are in fact indistinguishable: they have exactly the same attributes. If your friend said yes on occasion 2 and no on occasion 3, there's no rule that will get them both right.

Consider the grandmother cell, a favorite thought experiment of cognitive neuroscientists. The grandmother cell is a neuron in your brain that fires whenever you see your grandmother, and only then. Whether or not grandmother cells really exist is an open question, but let's design one for use in machine learning. A perceptron learns to recognize your grandmother as follows. The inputs to the cell are either the raw pixels in the image or various hardwired features of it, like brown eyes, which takes the value 1 if the image contains a pair of brown eyes and 0 otherwise. In the beginning, all the connections from features to the neuron have small random weights, like the synapses in your brain at birth. Then we show the perceptron a series of images, some of your grandmother and some not.
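A minimal sketch of such a perceptron in Python; the features and training examples here are invented for illustration, not drawn from any real image data:

```python
# Toy perceptron: binary features in, fire (1) or not (0) out.
import random

random.seed(0)

def fire(weights, features, threshold=1.0):
    """The neuron fires if the weighted sum of active inputs clears the threshold."""
    return sum(w * x for w, x in zip(weights, features)) > threshold

def train(examples, n_features, rate=0.5, epochs=100):
    # Start with small random weights, like synapses at birth.
    weights = [random.uniform(-0.1, 0.1) for _ in range(n_features)]
    for _ in range(epochs):
        for features, is_grandma in examples:
            if fire(weights, features) != is_grandma:
                # Error-driven learning: raise the weights of the active inputs
                # after a miss, lower them after a false alarm.
                delta = rate if is_grandma else -rate
                weights = [w + delta * x for w, x in zip(weights, features)]
    return weights

# Hypothetical features: [brown eyes, gray hair, wears glasses].
examples = [([1, 1, 1], True), ([1, 0, 0], False),
            ([0, 1, 0], False), ([0, 0, 1], False)]
weights = train(examples, 3)
```

After training, the neuron fires on the grandmother example and stays quiet on the rest.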
If it fires upon seeing an image of your grandmother, or doesn't fire upon seeing something else, then no learning needs to happen. (If it ain't broke, don't fix it.) But if the perceptron fails to fire when it's looking at your grandmother, that means the weighted sum of its inputs should have been higher, so we increase the weights of the inputs that are on. (For example, if your grandmother has brown eyes, the weight of that feature goes up.) Conversely, if the perceptron fires when it shouldn't, we decrease the weights of the active inputs. It's the errors that drive the learning. Over time, the features that are indicative of your grandmother acquire high weights, and the ones that aren't get low weights. Once the perceptron always fires upon seeing your grandmother, and only then, the learning is complete.

Nonlinear models are important far beyond the stock market. Scientists everywhere use linear regression because that's what they know, but more often than not the phenomena they study are nonlinear, and a multilayer perceptron can model them. Linear models are blind to phase transitions; neural networks soak them up like a sponge.

We haven't seen any deep learning yet, though. The next clever idea is to stack sparse autoencoders on top of each other like a club sandwich. The hidden layer of the first autoencoder becomes the input/output layer of the second one, and so on. Because the neurons are nonlinear, each hidden layer learns a more sophisticated representation of the input, building on the previous one. Given a large set of face images, the first autoencoder learns to encode local features like corners and spots, the second uses those to encode facial features like the tip of a nose or the iris of an eye, the third one learns whole noses and eyes, and so on.
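A structural sketch in Python of how such a stack plugs together. The weights are random and training is omitted, so this shows only the plumbing, not learned features; the layer sizes are made up:

```python
# Stacking autoencoders: the hidden layer of one becomes the input of the next,
# so each layer computes a smaller, nonlinear re-encoding of the one below.
import math
import random

random.seed(1)

def encode(weights, inputs):
    """One nonlinear hidden layer: sigmoid of the weighted sums."""
    return [1 / (1 + math.exp(-sum(w * x for w, x in zip(row, inputs))))
            for row in weights]

def random_layer(n_in, n_hidden):
    return [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]

# Hypothetical sizes: 64 "pixels" -> 32 -> 16 -> 8 high-level features.
sizes = [64, 32, 16, 8]
stack = [random_layer(n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]

image = [random.random() for _ in range(64)]
representation = image
for layer in stack:            # each hidden layer builds on the previous one
    representation = encode(layer, representation)
```

The final `representation` is the compact top-level code that a conventional classifier can consume.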
Finally, the top layer can be a conventional perceptron that learns to recognize your grandmother from the high-level features provided by the layer below it, which is much easier than using only the crude information provided by a single hidden layer or than trying to backpropagate through all the layers at once. The Google Brain network of New York Times fame is a nine-layer sandwich of autoencoders and other ingredients that learns to recognize cats from YouTube videos. At one billion connections, it was at the time the largest network ever learned. It's no surprise that Andrew Ng, one of the project's principals, is also one of the leading proponents of the idea that human intelligence boils down to a single algorithm, and all we need to do is figure it out. Ng, whose affability belies a fierce ambition, believes that stacked sparse autoencoders can take us closer to solving AI than anything that came before.

In machine learning, as elsewhere in computer science, there's nothing better than getting such a combinatorial explosion to work for you instead of against you. What's clever about genetic algorithms is that each string implicitly contains an exponential number of building blocks, known as schemas, and so the search is a lot more efficient than it seems. This is because every subset of the string's bits is a schema, representing some potentially fit combination of properties, and a string has an exponential number of subsets. We can represent a schema by replacing the bits in the string that aren't part of it with *. For example, the string 110 contains the schemas ***, **0, *1*, 1**, *10, 11*, 1*0, and 110. We get a different schema for every different choice of bits to include; since we have two choices for each bit (include/don't include), a string of n bits contains 2^n schemas. Conversely, a particular schema may be represented in many different strings in a population, and is implicitly evaluated every time they are.
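That count is easy to verify by enumerating every choice of bits to include; a quick sketch:

```python
# Every subset of a bit string's positions yields one schema: kept positions
# show their bit, the rest become *. An n-bit string therefore has 2^n schemas.
from itertools import product

def schemas(s):
    result = set()
    for mask in product([True, False], repeat=len(s)):
        result.add("".join(c if keep else "*" for c, keep in zip(s, mask)))
    return result

print(sorted(schemas("110")))
# 2^3 = 8 schemas: ***, **0, *1*, *10, 1**, 1*0, 11*, 110
```

Each schema appears exactly once in the set even though several masks could never collide here, because every position of "110" is distinct from *.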
Suppose that a hypothesis's probability of surviving into the next generation is proportional to its fitness. Holland showed that, in this case, the fitter a schema's representatives in one generation are compared to the average, the more of them we can expect to see in the next generation. So, while the genetic algorithm explicitly manipulates strings, it implicitly searches the much larger space of schemas. Over time, fitter schemas come to dominate the population, and so unlike the drunkard, the genetic algorithm finds its way home.

That's just a statement of the theorem, not a proof, of course. But the proof is surprisingly simple. We can illustrate it with an example from medical diagnosis, one of the "killer apps" of Bayesian inference. Suppose you're a doctor, and you've diagnosed a hundred patients in the last month. Fourteen of them had the flu, twenty had a fever, and eleven had both. The conditional probability of fever given flu is therefore eleven out of fourteen, or 11/14. Conditioning reduces the size of the universe that we're considering, in this case from all patients to only patients with the flu. In the universe of all patients, the probability of fever is 20/100; in the universe of flu-stricken patients, it's 11/14. The probability that a patient has the flu and a fever is the fraction of patients that have the flu times the fraction of those that have a fever: P(flu, fever) = P(flu) × P(fever | flu) = 14/100 × 11/14 = 11/100. But we could equally well have done this the other way around: P(flu, fever) = P(fever) × P(flu | fever). Therefore, since they're both equal to P(flu, fever), P(fever) × P(flu | fever) = P(flu) × P(fever | flu). Divide both sides by P(fever), and you get P(flu | fever) = P(flu) × P(fever | flu) / P(fever). That's it! That's Bayes' theorem, with flu as the cause and fever as the effect.

Markov assumed (wrongly but usefully) that the probabilities are the same at every position in the text.
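Under that assumption, the transition probabilities can be pooled across all positions of a sequence; a toy sketch (the example text is made up, and since a single sequence gives only one first letter, just the two conditional probabilities are estimated):

```python
# Estimating a homogeneous first-order Markov chain over vowel/consonant flags:
# the same two conditional probabilities govern every position in the text.
def estimate(seq):
    # prev -> [count of vowel successors, count of all successors]
    trans = {True: [0, 0], False: [0, 0]}
    for prev, nxt in zip(seq, seq[1:]):
        trans[prev][1] += 1
        if nxt:
            trans[prev][0] += 1
    # P(Vowel_{i+1} = True | Vowel_i = prev), pooled across all i
    return {prev: hits / total for prev, (hits, total) in trans.items()}

text = "theraininspain"                     # toy data
seq = [c in "aeiou" for c in text]
probs = estimate(seq)
```

With enough text, these pooled counts converge, which is exactly what makes the homogeneity assumption useful even where it is wrong.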
Thus we need to estimate only three probabilities: P(Vowel1 = True), P(Voweli+1 = True | Voweli = True), and P(Voweli+1 = True | Voweli = False). (Since probabilities sum to one, from these we can immediately obtain P(Vowel1 = False), etc.) As with Naïve Bayes, we can have as many variables as we want without the number of probabilities we need to estimate going through the roof, but now the variables actually depend on each other.

There's a saving grace, however, and some major reasons to prefer the Bayesian way. The saving grace is that, most of the time, almost all hypotheses wind up with a tiny posterior probability, and we can safely ignore them. In fact, just considering the single most probable hypothesis is usually a very good approximation. Suppose our prior distribution for the coin flip problem is that all probabilities of heads are equally likely. The effect of seeing the outcomes of successive flips is to concentrate the distribution more and more on the hypotheses that best agree with the data. For example, if h ranges over the possible probabilities of heads and a coin comes out heads 70 percent of the time, the posterior distribution concentrates ever more sharply around h = 0.7 as the flips accumulate.

Help desks are currently the most popular application of case-based reasoning. Most still employ a human intermediary, but IPsoft's Eliza talks directly to the customer. Eliza, who comes complete with a 3-D interactive video persona, has solved over twenty million customer problems to date, mostly for blue-chip US companies. "Greetings from Robotistan, outsourcing's cheapest new destination," is how an outsourcing blog recently put it. And, just as outsourcing keeps climbing the skills ladder, so does analogical learning. The first robo-lawyers that argue for a particular verdict based on precedents have already been built. One such system correctly predicted the outcomes of over 90 percent of the trade secret cases it examined.
Perhaps in a future cyber-court, in session somewhere on Amazon's cloud, a robo-lawyer will beat the speeding ticket that RoboCop issued to your driverless car, all while you go to the beach, and Leibniz's dream of reducing all argument to calculation will finally have come true.

How much of your brain does your job use? The more it does, the safer you are. In the early days of AI, the common view was that computers would replace blue-collar workers before white-collar ones, because white-collar work requires more brains. But that's not quite how things turned out. Robots assemble cars, but they haven't replaced construction workers. On the other hand, machine-learning algorithms have replaced credit analysts and direct marketers. As it turns out, evaluating credit applications is easier for machines than walking around a construction site without tripping, even though for humans it's the other way around. The common theme is that narrowly defined tasks are easily learned from data, but tasks that require a broad combination of skills and knowledge aren't. Most of your brain is devoted to vision and motion, which is a sign that walking around is much more complex than it seems; we just take it for granted because, having been honed to perfection by evolution, it's mostly done subconsciously. The company Narrative Science has an AI system that can write pretty good summaries of baseball games, but not novels, because, pace George F. Will, there's a lot more to life than to baseball games. Speech recognition is hard for computers because it's hard to fill in the blanks (literally, the sounds speakers routinely elide) when you have no idea what the person is talking about. Algorithms can predict stock fluctuations but have no clue how they relate to politics. The more context a job requires, the less likely a computer will be able to do it soon. Common sense is important not just because your mom taught you so, but because computers don't have it.
Google + Master Algorithm = Skynet?

The transformation of science by data-intensive computing is surveyed in The Fourth Paradigm, edited by Tony Hey, Stewart Tansley, and Kristin Tolle (Microsoft Research, 2009). "Machine science," by James Evans and Andrey Rzhetsky (Science, 2010), discusses some of the different ways computers can make scientific discoveries. Scientific Discovery: Computational Explorations of the Creative Processes,* by Pat Langley et al. (MIT Press, 1987), describes a series of approaches to automating the discovery of scientific laws. The SKICAT project is described in "From digitized images to online catalogs," by Usama Fayyad, George Djorgovski, and Nicholas Weir (AI Magazine, 1996). "Machine learning in drug discovery and development,"* by Niki Wale (Drug Development Research, 2001), gives an overview of just that. Adam, the robot scientist, is described in "The automation of science," by Ross King et al. (Science, 2009).