Traditionally, the only way to get a computer to do something, from adding two numbers to flying an airplane, was to write down an algorithm explaining how, in painstaking detail. But machine-learning algorithms, also known as learners, are different: they figure it out on their own, by making inferences from data. And the more data they have, the better they get. Now we don't have to program computers; they program themselves.

If you're a machine-learning expert, you're already familiar with much of what the book covers, but you'll also find in it many fresh ideas, historical nuggets, and useful examples and analogies. Most of all, I hope the book will provide a new perspective on machine learning and maybe even start you thinking in new directions. Low-hanging fruit is all around us, and it behooves us to pick it, but we also shouldn't lose sight of the bigger rewards that lie just beyond. (Apropos of which, I hope you'll forgive my poetic license in using the term "master algorithm" to refer to a general-purpose learner.)

The other reason machine learners are the über-geeks is that the world has far fewer of them than it needs, even by the already dire standards of computer science. According to tech guru Tim O'Reilly, "data scientist" is the hottest job title in Silicon Valley. The McKinsey Global Institute estimates that by 2018 the United States alone will need 140,000 to 190,000 more machine-learning experts than will be available, and 1.5 million more data-savvy managers. Machine learning's applications have exploded too suddenly for education to keep up, and it has a reputation for being a difficult subject. Textbooks are liable to give you math indigestion. This difficulty is more apparent than real, however. All of the important ideas in machine learning can be expressed math-free. As you read this book, you may even find yourself inventing your own learning algorithms, with nary an equation in sight.
Even more astonishing than the breadth of applications of machine learning is that it's the same algorithms doing all of these different things. Outside of machine learning, if you have two different problems to solve, you need to write two different programs. They might use some of the same infrastructure, like the same programming language or the same database system, but a program to, say, play chess is of no use if you want to process credit-card applications. In machine learning, the same algorithm can do both, provided you give it the appropriate data to learn from. In fact, just a few algorithms are responsible for the great majority of machine-learning applications, and we'll take a look at them in the next few chapters.

Crucially, the Master Algorithm is not required to start from scratch in each new problem. That bar is probably too high for any learner to meet, and it's certainly very unlike what people do. For example, language does not exist in a vacuum; we couldn't understand a sentence without our knowledge of the world it refers to. Thus, when learning to read, the Master Algorithm can rely on having previously learned to see, hear, and control a robot. Likewise, a scientist does not just blindly fit models to data; he can bring all his knowledge of the field to bear on the problem. Therefore, when making discoveries in biology, the Master Algorithm can first read all the biology it wants, relying on having previously learned to read. The Master Algorithm is not just a passive consumer of data; it can interact with its environment and actively seek the data it wants, like Adam, the robot scientist, or like any child exploring her world.

This is how Rosenblatt's perceptron algorithm learns weights.

So does backprop solve the machine-learning problem? Can we just throw together a big pile of neurons, wait for it to do its magic, and on the way to the bank collect a Nobel Prize for figuring out how the brain works?
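Rosenblatt's perceptron rule mentioned above can be sketched in a few lines: adjust each weight only when the prediction is wrong, in proportion to the input. The data, learning rate, and epoch count below are illustrative choices, not fixed by the algorithm.

```python
# Sketch of Rosenblatt's perceptron learning rule. The training data
# (the logical AND function), learning rate, and epoch count are all
# hypothetical choices for illustration.

def predict(weights, bias, x):
    # Fire (output 1) if the weighted sum of inputs clears the threshold.
    total = sum(w * xi for w, xi in zip(weights, x))
    return 1 if total + bias > 0 else 0

def train_perceptron(examples, n_inputs, rate=0.1, epochs=100):
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            # Mistake-driven update: nudge weights toward the right answer.
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias

# AND is linearly separable, so the updates are guaranteed to converge.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data, n_inputs=2)
print([predict(w, b, x) for x, _ in data])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees these mistake-driven updates eventually stop making errors; on data that is not linearly separable, they never settle, which is exactly the limitation backprop was later invented to overcome.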
Alas, life is not that easy. Suppose your network has only one weight, and this is the graph of the error as a function of it:

[Image: pic_12.jpg]

No one is sure why sex is pervasive in nature, either. Several theories have been proposed, but none is widely accepted. The leader of the pack is the Red Queen hypothesis, popularized by Matt Ridley in the eponymous book. As the Red Queen said to Alice in Through the Looking Glass, "It takes all the running you can do, to keep in the same place." In this view, organisms are in a perpetual arms race with parasites, and sex helps keep the population varied, so that no single germ can infect all of it. If this is the answer, then sex is irrelevant to machine learning, at least until learned programs have to vie with computer viruses for processor time and memory. (Intriguingly, Danny Hillis claims that deliberately introducing coevolving parasites into a genetic algorithm can help it escape local maxima by gradually ratcheting up the difficulty, but no one has followed up on this yet.) Christos Papadimitriou and colleagues have shown that sex optimizes not fitness but what they call mixability: a gene's ability to do well on average when combined with other genes. This can be useful when the fitness function is either not known or not constant, as in natural selection, but in machine learning and optimization, hill climbing tends to do better.

There's a big snag in all of this, unfortunately. Just because a Bayesian network lets us compactly represent a probability distribution doesn't mean we can also reason efficiently with it. Suppose you want to compute P(Burglary | Bob called, Claire didn't). By Bayes' theorem, you know this is just P(Burglary) P(Bob called, Claire didn't | Burglary) / P(Bob called, Claire didn't), or equivalently, P(Burglary, Bob called, Claire didn't) / P(Bob called, Claire didn't).
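The second form of the query, P(Burglary, Bob called, Claire didn't) / P(Bob called, Claire didn't), can be checked numerically. This is a minimal sketch using a tiny made-up network in which Burglary directly influences whether Bob and Claire call; all the probabilities are hypothetical, chosen only to make the arithmetic concrete.

```python
# A made-up three-variable Bayesian network (all probabilities hypothetical):
# Burglary -> Bob calls, Burglary -> Claire calls.
p_burglary = 0.01
p_bob = {True: 0.9, False: 0.05}     # P(Bob calls | Burglary)
p_claire = {True: 0.8, False: 0.02}  # P(Claire calls | Burglary)

def joint(burglary, bob, claire):
    # The network's factorization: multiply the local conditional tables.
    p = p_burglary if burglary else 1 - p_burglary
    p *= p_bob[burglary] if bob else 1 - p_bob[burglary]
    p *= p_claire[burglary] if claire else 1 - p_claire[burglary]
    return p

# P(Burglary, Bob called, Claire didn't) / P(Bob called, Claire didn't):
numerator = joint(True, True, False)
evidence = sum(joint(b, True, False) for b in (True, False))
posterior = numerator / evidence
print(round(posterior, 4))  # 0.0358
```

With three variables the sum over the joint has only a handful of terms; the snag described next is that with hundreds of variables this enumeration blows up exponentially, which is why inference needs to avoid building the full table.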
If you had the full table with the probabilities of all states, you could obtain both of these probabilities by adding up the corresponding lines in the table. For example, P(Bob called, Claire didn't) is the sum of the probabilities of all the lines where Bob calls and Claire doesn't. But the Bayesian network doesn't give you the full table. You could always construct it from the individual tables, but that takes exponential time and space. What we really want is to compute P(Burglary | Bob called, Claire didn't) without building the full table. That, in a nutshell, is the problem of inference in Bayesian networks.

Given all this, it's not surprising that analogy plays a prominent role in machine learning. It got off to a slow start, though, and was initially overshadowed by neural networks. Its first algorithmic incarnation appeared in an obscure technical report written in 1951 by two Berkeley statisticians, Evelyn Fix and Joe Hodges, and was not published in a mainstream journal until decades later. But in the meantime, other papers on Fix and Hodges's algorithm started to appear and then to multiply until it was one of the most researched in all of computer science. The nearest-neighbor algorithm, as it's called, is the first stop on our tour of analogy-based learning. The second is support vector machines, an idea that took machine learning by storm around the turn of the millennium and was only recently overshadowed by deep learning. The third and last is full-blown analogical reasoning, which has been a staple of psychology and AI for several decades, and a background theme in machine learning for nearly as long.

In general, we have to deal with many constraints at once (one per example, in the case of SVMs). Suppose you wanted to get as close as possible to the North Pole but couldn't leave your room.
Each of the room's four walls is a constraint, and the solution is to follow the compass until you bump into the corner where the northeast and northwest walls meet. We say that these two walls are the active constraints because they're what prevents you from reaching the optimum, namely the North Pole. If your room has a wall facing exactly north, that's the sole active constraint, and the solution is a point in the middle of it. And if you're Santa and your room is already over the North Pole, all constraints are inactive, and you can just sit there pondering the optimal toy distribution problem instead. (Traveling salesmen have it easy compared to Santa.) In an SVM, the active constraints are the support vectors, since their margin is already the smallest it's allowed to be; moving the frontier would violate one or more constraints. All other examples are irrelevant, and their weight is zero.

In my PhD thesis, I designed an algorithm that unifies instance-based and rule-based learning in this way. A rule doesn't just match entities that satisfy all its preconditions; it matches any entity that's more similar to it than to any other rule, in the sense that it comes closer to satisfying its conditions. For instance, someone with a cholesterol level of 220 mg/dL comes closer than someone with 200 mg/dL to matching the rule "If your cholesterol is above 240 mg/dL, you're at risk of a heart attack." RISE, as I called the algorithm, learns by starting with each training example as a rule and then gradually generalizing each rule to absorb the nearest examples. The end result is usually a combination of very general rules, which between them match most examples, with more specific rules that match exceptions to those, and so on all the way to a "long tail" of specific memories.
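The analogical matching just described can be sketched as follows: an entity is claimed by whichever rule it comes closest to satisfying, measured by how far its attributes fall short of the rule's thresholds. The rules, thresholds, and distance measure below are invented for illustration; they are in the spirit of the idea, not RISE's actual definitions.

```python
# Sketch of analogical rule matching: an entity matches the rule it comes
# closest to satisfying, not only rules whose preconditions it meets
# outright. Rules and thresholds below are hypothetical.

def shortfall(value, op, threshold):
    # Distance by which `value` misses the condition `value op threshold`
    # (zero if the condition is already satisfied).
    if op == ">=":
        return max(0.0, threshold - value)
    return max(0.0, value - threshold)  # op == "<="

def nearest_rule(entity, rules):
    # Pick the rule whose preconditions the entity comes closest to meeting.
    return min(rules, key=lambda r: sum(
        shortfall(entity[attr], op, t) for attr, (op, t) in r["if"].items()))

rules = [
    {"if": {"cholesterol": (">=", 240.0)}, "then": "at risk of heart attack"},
    {"if": {"cholesterol": ("<=", 180.0)}, "then": "low risk"},
]

# 220 mg/dL is 20 short of the risk rule but 40 past the low-risk rule,
# so the risk rule claims it even though neither matches exactly.
print(nearest_rule({"cholesterol": 220.0}, rules)["then"])
print(nearest_rule({"cholesterol": 190.0}, rules)["then"])
```

This is what makes such rules no longer brittle: an example outside every rule's literal preconditions still gets a prediction from the rule it most nearly satisfies.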
RISE made better predictions than the best rule-based and instance-based learners of the time, and my experiments showed that this was precisely because it combined the best features of both. Rules can be matched analogically, and so they're no longer brittle. Instances can select different features in different regions of space and so combat the curse of dimensionality much better than nearest-neighbor, which can only select the same features everywhere.

Learning an MLN means discovering formulas that are true in the world more often than random chance would predict, and figuring out the weights for those formulas that cause their predicted probabilities to match their observed frequencies. Once we've learned an MLN, we can use it to answer questions like "What is the probability that Bob has the flu, given that he's friends with Alice and she has the flu?" And guess what? It turns out that the probability is given by an S curve applied to the weighted sum of features, much as in a multilayer perceptron. And an MLN with long chains of rules can represent a deep neural network, with one layer per link in the chain.

Finally, we can turn Alchemy into a metalearner like stacking by encoding the individual classifiers as MLNs and adding or learning formulas to combine them. This is what DARPA did in its PAL project. PAL, the Personalized Assistant that Learns, was the largest AI project in DARPA history and the progenitor of Siri. PAL's goal was to build an automated secretary. It used Markov logic as its overarching representation, combining the outputs from different modules into the final decisions on what to do. This also allowed PAL's modules to learn from each other by evolving toward a consensus.
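The "S curve applied to the weighted sum of features" can be made concrete in a few lines. The formulas and weights below about Bob's flu are entirely hypothetical, invented to illustrate the shape of the computation rather than taken from a real learned MLN.

```python
import math

def s_curve(t):
    # The logistic function: squashes any weighted sum into a probability.
    return 1 / (1 + math.exp(-t))

def query_probability(features, weights):
    # Probability of the query as an S curve of the weighted sum of the
    # formulas that hold, much as in a multilayer perceptron's output unit.
    return s_curve(sum(weights[f] for f, holds in features.items() if holds))

# Hypothetical weighted formulas bearing on "Bob has the flu":
weights = {"friend_has_flu": 1.5, "friends_share_flu": 0.8, "base_rate": -2.0}
features = {"friend_has_flu": True, "friends_share_flu": True, "base_rate": True}

print(round(query_probability(features, weights), 3))  # 0.574
```

Notice that if a formula's weight changes, only the sum inside the S curve changes; that is what lets weight learning nudge the predicted probabilities toward the observed frequencies.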