Take cancer. Curing it is hard because cancer is not one disease, but many. Tumors can be triggered by a dizzying array of causes, and they mutate as they metastasize. The surest way to kill a tumor is to sequence its genome, figure out which drugs will work against it (without harming you, given your genome and medical history), and perhaps even design a new drug specifically for your case. No doctor can master all the knowledge required for this. Sounds like a perfect job for machine learning: in effect, it's a more complicated and challenging version of the searches that Amazon and Netflix do every day, except it's looking for the right treatment for you instead of the right book or movie. Unfortunately, while today's learning algorithms can diagnose many diseases with superhuman accuracy, curing cancer is well beyond their ken. If we succeed in our quest for the Master Algorithm, it will no longer be.

In retrospect, we can see that the progression from computers to the Internet to machine learning was inevitable: computers enable the Internet, which creates a flood of data and the problem of limitless choice; and machine learning uses the flood of data to help solve the limitless choice problem. The Internet by itself is not enough to move demand from "one size fits all" to the long tail of infinite variety. Netflix may have one hundred thousand DVD titles in stock, but if customers don't know how to find the ones they like, they will default to choosing the hits. It's only when Netflix has a learning algorithm to figure out your tastes and recommend DVDs that the long tail really takes off.

Our quest will take us across the territory of each of the five tribes. The border crossings, where they meet, negotiate, and skirmish, will be the trickiest part of the journey. Each tribe has a different piece of the puzzle, which we must gather.
Machine learners, like all scientists, resemble the blind men and the elephant: one feels the trunk and thinks it's a snake, another leans against the leg and thinks it's a tree, yet another touches the tusk and thinks it's a bull. Our aim is to touch each part without jumping to conclusions; and once we've touched all of them, we will try to picture the whole elephant. It's far from obvious how to combine all the pieces into one solution (impossible, according to some), but this is what we will do.

For each pair of facts, we construct the rule that allows us to infer the second fact from the first one and generalize it by Newton's principle. When the same general rule is induced over and over again, we can have some confidence that it's true.

Just knowing which genes regulate which genes and how proteins organize the cell's web of chemical reactions is not enough, though. We also need to know how much of each molecular species is produced. DNA microarrays and other experiments can provide this type of quantitative information, but inverse deduction, with its "all or none" logical character, is not very good at dealing with it. For that we need the connectionist methods that we'll meet in the next chapter.

As with rule learning, we don't want to induce a tree that perfectly predicts the classes of all the training examples, because it would probably overfit. As before, we can use significance tests or a penalty on the size of the tree to prevent this.

[Image: pic_10.jpg]

Backprop is an instance of a strategy that is very common in both nature and technology: if you're in a hurry to get to the top of the mountain, climb the steepest slope you can find. The technical term for this is gradient ascent (if you want to get to the top) or gradient descent (if you're looking for the valley bottom).
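To make the idea concrete, here is a minimal numeric sketch of gradient descent. The function, starting point, and step size are all invented for illustration; nothing here comes from the text itself.

```python
# Toy gradient descent: find the x that minimizes f(x) = (x - 3)^2
# by repeatedly stepping against the slope.

def gradient_descent(df, x0, step=0.1, iterations=100):
    """Follow the negative gradient df downhill from starting point x0."""
    x = x0
    for _ in range(iterations):
        x -= step * df(x)  # move downhill, by an amount proportional to the slope
    return x

# f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 3))  # converges very close to 3.0
```

The only ingredients are a slope and a step size: each step moves against the slope, and the steps shrink as the valley bottom flattens out, which is why the process settles at the minimum.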
Bacteria can find food by swimming up the concentration gradient of, say, glucose molecules, and they can flee from poisons by swimming down their gradient. All sorts of things, from aircraft wings to antenna arrays, can be optimized by gradient descent. Backprop is an efficient way to do it in a multilayer perceptron: keep tweaking the weights so as to lower the error, and stop when all tweaks fail. With backprop, you don't have to figure out how to tweak each neuron's weights from scratch, which would be too slow; you can do it layer by layer, tweaking each neuron based on how you tweaked the neurons it connects to. If you had to throw out your entire machine-learning toolkit in an emergency save for one tool, gradient descent is probably the one you'd want to hold on to.

A complete model of a cell. The problem is worse than it seems, because Bayesian networks in effect have "invisible" arrows to go along with the visible ones. Burglary and Earthquake are a priori independent, but the alarm going off entangles them: the alarm makes you suspect a burglary, but if now you hear on the radio that there's been an earthquake, you assume that's what caused the alarm. The earthquake has explained away the alarm, making a burglary less likely, and the two are therefore dependent. In a Bayesian network, all parents of the same variable are interdependent in this way, and this in turn introduces further dependencies, making the resulting graph often much denser than the original one.

Analogizers took this line of reasoning to its logical conclusion, as we'll see in the next chapter. In the first decade of the new millennium, they in turn took over NIPS. Now the connectionists dominate once more, under the banner of deep learning. Some say that research goes in cycles, but it's more like a spiral, with loops winding around the direction of progress. In machine learning, the spiral converges to the Master Algorithm.
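The burglary/earthquake story above can be checked with a toy calculation. The sketch below builds the three-variable network with made-up probabilities (the numbers are illustrative, not from the text) and confirms that hearing about the earthquake lowers the probability of a burglary, even though the two are independent a priori.

```python
# A toy Bayesian network for the burglary/earthquake/alarm story,
# to show "explaining away" numerically. All probabilities are invented.

P_B = 0.01   # prior probability of a burglary
P_E = 0.02   # prior probability of an earthquake
# P(alarm goes off | burglary, earthquake):
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a):
    """P(Burglary=b, Earthquake=e, Alarm=a) under the network's factorization."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return pb * pe * pa

def prob_burglary(evidence):
    """P(Burglary=True | evidence) by summing the joint over the hidden variables."""
    num = sum(joint(True, e, a) for e in (True, False) for a in (True, False)
              if evidence(e, a))
    den = sum(joint(b, e, a) for b in (True, False) for e in (True, False)
              for a in (True, False) if evidence(e, a))
    return num / den

p_given_alarm = prob_burglary(lambda e, a: a)        # the alarm went off
p_given_both = prob_burglary(lambda e, a: a and e)   # alarm AND earthquake report
print(p_given_alarm > p_given_both)  # True: the earthquake explains the alarm away
```

With these numbers, the alarm alone makes a burglary quite likely, but adding the earthquake report drops that probability sharply: the two causes have become dependent through their shared effect, exactly the "invisible arrow" the text describes.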
When you arrange books on a shelf so that books on similar topics are close to each other, you're doing a kind of dimensionality reduction, from the vast space of topics to the one-dimensional shelf. Unavoidably, some books that are closely related will wind up far apart on the shelf, but you can still order them in a way that minimizes such occurrences. That's what dimensionality reduction algorithms do.

An important precursor of reinforcement learning was a checkers-playing program created by Arthur Samuel, an IBM researcher, in the 1950s. Board games are a great example of a reinforcement learning problem: you have to make a long series of moves without any feedback, and the whole reward or punishment comes at the very end, in the form of a win or loss. Yet Samuel's program was able to teach itself to play as well as most humans. It did not directly learn which move to make in each board position, because that would have been too difficult. Rather, it learned how to evaluate each board position (how likely am I to win starting from this position?) and chose the move that led to the best position. Initially, the only positions it knew how to evaluate were the final ones: a win, a tie, or a loss. But once it knew that a certain position was a win, it also knew that positions from which it could move to it were good, and so on. Thomas J. Watson Sr., IBM's president, predicted that when the program was demonstrated IBM stock would go up by fifteen points. It did. The lesson was not lost on IBM, which went on to build a chess champion and a Jeopardy! one.

In perception and memory, a chunk is just a symbol that stands for a pattern of other symbols, as AI stands for artificial intelligence. Newell and Rosenbloom adapted this notion to the theory of problem solving that Newell and Simon had developed earlier.
Newell and Simon asked experimental subjects to solve problems (for example, derive one mathematical formula from another on the blackboard) while narrating aloud how they were going about it. They found that humans solve problems by decomposing them into subproblems, sub-subproblems, and so on, systematically reducing the differences between the initial state (the first formula, say) and the goal state (the second formula). Doing so requires searching for a sequence of actions that will work, however, and that takes time. Newell and Rosenbloom's hypothesis was that each time we solve a subproblem, we form a chunk that allows us to go directly from the state before we solve it to the state after. A chunk in this sense has two parts: the stimulus (a pattern you recognize in the external world or in your short-term memory) and the response (the sequence of actions you execute as a result). Once you've learned a chunk, you store it in long-term memory. The next time you have to solve the same subproblem, you can just apply the chunk and save the time spent searching. This happens at all levels until you have a chunk for the whole problem and can solve it automatically. To tie your shoelaces, you tie the starting knot, make a loop with one end, wrap the other end around it, and pull it through the hole in the middle. Each of these steps is far from trivial for a five-year-old, but once you've acquired the corresponding chunks, you're almost there.

Learning an MLN means discovering formulas that are true in the world more often than random chance would predict, and figuring out the weights for those formulas that cause their predicted probabilities to match their observed frequencies. Once we've learned an MLN, we can use it to answer questions like "What is the probability that Bob has the flu, given that he's friends with Alice and she has the flu?" And guess what?
It turns out that the probability is given by an S curve applied to the weighted sum of features, much as in a multilayer perceptron. And an MLN with long chains of rules can represent a deep neural network, with one layer per link in the chain.
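That final step can be sketched in a few lines. The rule weights and feature values below are invented for illustration (the text does not give any numbers); the point is only the shape of the computation: an S curve squashing a weighted sum.

```python
# The MLN's answer, per the text, is an S curve applied to a weighted sum of
# features, just like a single neuron. Weights and features here are made up.

import math

def s_curve(t):
    """The logistic S curve, squashing any real number into (0, 1)."""
    return 1 / (1 + math.exp(-t))

# Hypothetical grounded formulas bearing on "Bob has the flu": each feature is
# 1 if the formula holds, and each carries the learned weight of its rule.
weights = [1.5, 0.8, -0.4]   # e.g., friend-has-flu, flu-season, got-vaccinated
features = [1, 1, 1]         # all three formulas hold in this scenario

probability = s_curve(sum(w * f for w, f in zip(weights, features)))
print(round(probability, 3))
```

A rule with a large positive weight pushes the probability toward 1 when its formula holds, a negative weight pulls it down, and the S curve keeps the result a valid probability, which is exactly what a neuron in a multilayer perceptron computes.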