Prologue. A different theory of everything.

Rationalists believe that the senses deceive and that logical reasoning is the only sure path to knowledge. Empiricists believe that all reasoning is fallible and that knowledge must come from observation and experimentation. The French are rationalists; the Anglo-Saxons (as the French call them) are empiricists. Pundits, lawyers, and mathematicians are rationalists; journalists, doctors, and scientists are empiricists. Murder, She Wrote is a rationalist TV crime show; CSI: Crime Scene Investigation is an empiricist one. In computer science, theorists and knowledge engineers are rationalists; hackers and machine learners are empiricists.

Rationalism versus empiricism is a favorite question of philosophers. Plato was an early rationalist, and Aristotle an early empiricist. But the debate really took off during the Enlightenment, with a trio of great thinkers on each side: Descartes, Spinoza, and Leibniz were the leading rationalists; Locke, Berkeley, and Hume were their empiricist counterparts. Trusting in their powers of reasoning, the rationalists concocted theories of the universe that, to put it gently, did not stand the test of time, but they also invented fundamental mathematical techniques like calculus and analytical geometry. The empiricists were altogether more practical, and their influence is everywhere from the scientific method to the Constitution of the United States.

[Image: pic_4.jpg]

The next step is to turn it into an algorithm.

More immediately, we know we can use inverse deduction to infer the structure of the cell’s networks from data and previous knowledge, but there’s a combinatorial explosion of ways to apply it, and we need a strategy. Since metabolic networks were designed by evolution, perhaps simulating it in our learning algorithms is the way to go. In the next chapter, we’ll see how to do just that.
When backprop first hit the streets, connectionists had visions of quickly learning larger and larger networks until, hardware permitting, they amounted to artificial brains. It didn’t turn out that way. Learning networks with one hidden layer was fine, but after that things soon got very difficult. Networks with a few layers worked only if they were carefully designed for the application (character recognition, say). Beyond that, backprop broke down. As we add layers, the error signal becomes more and more diffuse, like a river branching into smaller and smaller tributaries, until we’re down to individual raindrops that just don’t register. Learning with dozens or hundreds of hidden layers, like the brain, remained a distant dream, and by the mid-1990s, the excitement for multilayer perceptrons had petered out. A hard core of connectionists soldiered on, but by and large the attention of the machine-learning field moved elsewhere. (We’ll survey those lands in Chapters 6 and 7.)

Today, however, connectionism is resurgent. We’re learning deeper networks than ever before, and they’re setting new standards in vision, speech recognition, drug discovery, and other areas. The new field of deep learning is on the front page of the New York Times. Look under the hood, and… surprise: it’s the trusty old backprop engine, still humming. What changed? Nothing much, say the critics: just faster computers and bigger data. To which Hinton and others reply: exactly, we were right all along!

The hypothesis can be as complex as a whole Bayesian network, or as simple as the probability that a coin will come up heads. In the latter case, the data is just the outcome of a series of coin flips. If, say, we obtain seventy heads in a hundred flips, a frequentist would estimate the probability of heads as 0.7.
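To make the coin example concrete, here is a minimal sketch that scores each candidate probability of heads by how likely it makes the observed flips, and compares the best-scoring candidate with the simple ratio of heads to flips. The function names and the grid of 999 candidate values are illustrative assumptions, not anything prescribed by the text.

```python
import math


def log_likelihood(p, heads, flips):
    """Log of P(data | p) for one particular sequence of flips
    containing `heads` heads, assuming independent flips."""
    tails = flips - heads
    return heads * math.log(p) + tails * math.log(1.0 - p)


def grid_mle(heads, flips, steps=999):
    """Return the candidate probability, strictly between 0 and 1,
    under which the observed flips are most likely."""
    # Candidates 0.001, 0.002, ..., 0.999 when steps == 999 (assumed grid).
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(candidates, key=lambda p: log_likelihood(p, heads, flips))


heads, flips = 70, 100
ratio_estimate = heads / flips          # the frequentist's estimate
grid_estimate = grid_mle(heads, flips)  # best-scoring candidate on the grid
print(ratio_estimate, grid_estimate)
```

Under these assumptions the best-scoring candidate on the grid coincides with the ratio of heads to flips, so the search and the simple count agree on the same estimate.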
This is justified by the so-called maximum likelihood principle: of all the possible probabilities of heads, 0.7 is the one under which seeing seventy heads in a hundred flips is most likely. The likelihood of a hypothesis is P(data | hypothesis), and the principle says we should pick the hypothesis that maximizes it. Bayesians do something more subtle, though. They point out that we never know for sure which hypothesis is the true one, and so we shouldn’t just pick one hypothesis, like a value of 0.7 for the probability of heads; rather, we should compute the posterior probability of every possible hypothesis and entertain all of them when making predictions. The sum of the probabilities of all the hypotheses must be one, so if one becomes more likely, the others become less. For a Bayesian, in fact, there is no such thing as the truth; you have a prior distribution over hypotheses, after seeing the data it becomes the posterior distribution, as given by Bayes’ theorem, and that’s all.

Analogical reasoning has a distinguished intellectual pedigree. Aristotle expressed it in his law of similarity: if two things are similar, the thought of one will tend to trigger the thought of the other. Empiricists like Locke and Hume followed suit. Truth, said Nietzsche, is a mobile army of metaphors. Kant was also a fan. William James believed that “this sense of sameness is the very keel and backbone of our thinking.” Some contemporary psychologists even argue that human cognition in its entirety is a fabric of analogies. We rely on it to find our way around a new town and to understand expressions like “see the light” and “stand tall.” Teenagers who insert “like”
into every sentence they say would probably, like, agree that analogy is important, dude.

This type of metalearning is called stacking and is the brainchild of David Wolpert, whom we met in Chapter 3 as the author of the “no free lunch” theorem. An even simpler metalearner is bagging, invented by the statistician Leo Breiman. Bagging generates random variations of the training set by resampling, applies the same learner to each one, and combines the results by voting. The reason to do this is that it reduces variance: the combined model is much less sensitive to the vagaries of the data than any single one, making this a remarkably easy way to improve accuracy. If the models are decision trees and we further vary them by withholding a random subset of the attributes from consideration at each node, the result is a so-called random forest. Random forests are some of the most accurate classifiers around. Microsoft’s Kinect uses them to figure out what you’re doing, and they regularly win machine-learning competitions.

How much of your brain does your job use? The more it does, the safer you are. In the early days of AI, the common view was that computers would replace blue-collar workers before white-collar ones, because white-collar work requires more brains. But that’s not quite how things turned out. Robots assemble cars, but they haven’t replaced construction workers. On the other hand, machine-learning algorithms have replaced credit analysts and direct marketers. As it turns out, evaluating credit applications is easier for machines than walking around a construction site without tripping, even though for humans it’s the other way around. The common theme is that narrowly defined tasks are easily learned from data, but tasks that require a broad combination of skills and knowledge aren’t.
Most of your brain is devoted to vision and motion, which is a sign that walking around is much more complex than it seems; we just take it for granted because, having been honed to perfection by evolution, it’s mostly done subconsciously. The company Narrative Science has an AI system that can write pretty good summaries of baseball games, but not novels, because, pace George F. Will, there’s a lot more to life than to baseball games. Speech recognition is hard for computers because it’s hard to fill in the blanks (literally, the sounds speakers routinely elide) when you have no idea what the person is talking about. Algorithms can predict stock fluctuations but have no clue how they relate to politics. The more context a job requires, the less likely a computer will be able to do it soon. Common sense is important not just because your mom taught you so, but because computers don’t have it.

It’s not hard to state general principles like military necessity, proportionality, and sparing civilians. But there’s a gulf between them and concrete actions, which the soldier’s judgment has to bridge. Asimov’s three laws of robotics quickly run into trouble when robots try to apply them in practice, as his stories memorably illustrate. General principles are usually contradictory, if not self-contradictory, and they have to be, lest they turn all shades of gray into black and white. When does military necessity outweigh sparing civilians? There is no universal answer and no way to program a computer with all the eventualities. Machine learning, however, provides an alternative. First, teach the robot to recognize the relevant concepts, for example with data sets of situations where civilians were and were not spared, armed response was and was not proportional, and so on. Then give it a code of conduct in the form of rules involving these concepts.
Finally, let the robot learn how to apply the code by observing humans: the soldier opened fire in this case but not in that case. By generalizing from these examples, the robot can learn an end-to-end model of ethical decision-making, in the form of, say, a large MLN. Once the robot’s decisions agree with a human’s as often as one human agrees with another, the training is complete, meaning the model is ready for download into thousands of robot brains. Unlike humans, robots don’t lose their heads in the heat of combat. If a robot malfunctions, the manufacturer is responsible. If it makes a wrong call, its teachers are.

The k-means algorithm was originally proposed by Stuart Lloyd at Bell Labs in 1957, in a technical report entitled “Least squares quantization in PCM” (which later appeared as a paper in the IEEE Transactions on Information Theory in 1982). The original paper on the EM algorithm is “Maximum likelihood from incomplete data via the EM algorithm,” by Arthur Dempster, Nan Laird, and Donald Rubin (Journal of the Royal Statistical Society B, 1977). Hierarchical clustering and other methods are described in Finding Groups in Data: An Introduction to Cluster Analysis, by Leonard Kaufman and Peter Rousseeuw (Wiley, 1990).