Some learners learn knowledge, and some learn skills. “All humans are mortal” is a piece of knowledge. Riding a bicycle is a skill. In machine learning, knowledge is often in the form of statistical models, because most knowledge is statistical: all humans are mortal, but only 4 percent are Americans. Skills are often in the form of procedures: if the road curves left, turn the wheel left; if a deer jumps in front of you, slam on the brakes. (Unfortunately, as of this writing Google’s self-driving cars still confuse windblown plastic bags with deer.) Often, the procedures are quite simple, and it’s the knowledge at their core that’s complex. If you can tell which e-mails are spam, you know which ones to delete. If you can tell how good a board position in chess is, you know which move to make (the one that leads to the best position).

With big data and machine learning, you can understand much more complex phenomena than before. In most fields, scientists have traditionally used only very limited kinds of models, like linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most phenomena in the world are nonlinear. (Or fortunately, since otherwise life would be very boring; in fact, there would be no life.) Machine learning opens up a vast new world of nonlinear models. It’s like turning on the lights in a room where only a sliver of moonlight filtered before.

The “no free lunch” theorem

Inverting an operation is often difficult because the inverse is not unique. For example, a positive number has two square roots, one positive and one negative (2² = (-2)² = 4). Most famously, integrating the derivative of a function only recovers the function up to a constant. The derivative of a function tells us how much that function goes up or down at each point.
Adding up all those changes gives us the function back, except we don’t know where it started; we can “slide” the integrated function up or down without changing the derivative. To make life easy, we can “clamp down” the function by assuming the additive constant is zero. Inverse deduction has a similar problem, and Newton’s principle is one solution. For example, from “All Greek philosophers are human” and “All Greek philosophers are mortal” we can induce that “All humans are mortal,” or just that “All Greeks are mortal.” But why settle for the more modest generalization? Instead, we can assume that all humans are mortal until we meet an exception. (Which, according to Ray Kurzweil, will be soon.)

Autoencoders were known in the 1980s, but they were very hard to learn, even though they had a single hidden layer. Figuring out how to pack a lot of information into the same few bits is a hellishly difficult problem (one code for your grandmother, a slightly different one for your grandfather, another one for Jennifer Aniston, etc.). The landscape in hyperspace is just too rugged to get to a good peak; the hidden units need to learn what amounts to too many exclusive-ORs of the inputs. So autoencoders didn’t really catch on. The trick that took over a decade to discover was to make the hidden layer larger than the input and output ones. Huh? Actually, that’s only half the trick: the other half is to force all but a few of the hidden units to be off at any given time. This still prevents the hidden layer from just copying the input, and, crucially, it makes learning much easier. If we allow different bits to represent different inputs, the inputs no longer have to compete to set the same bits. Also, the network now has many more parameters, so the hyperspace you’re in has many more dimensions, and you have many more ways to get out of what would otherwise be local maxima. This is called a sparse autoencoder, and it’s a neat trick.
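To make the two halves of the trick concrete, here is a minimal sketch of a single forward pass through an autoencoder whose hidden layer is larger than its input, with all but a few hidden units forced off. The text does not say how the sparsity is enforced, so the rule used here (keep only the `k` strongest activations) is an assumption, as are the function name, the layer sizes, and the random untrained weights. It shows only the shape of the computation, not how such a network is learned.

```python
import random

def sparse_forward(x, w_enc, w_dec, k):
    """Encode x, switch off all but the k strongest hidden units, then decode.

    w_enc has one row of input weights per hidden unit;
    w_dec has one row of hidden weights per output unit.
    The top-k rule is an illustrative assumption, not the book's method.
    """
    # Hidden activations with a simple rectifier (negative sums become 0).
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_enc]
    # Indices of the k largest activations; every other unit is forced off.
    ranked = sorted(range(len(hidden)), key=lambda j: hidden[j], reverse=True)
    keep = set(ranked[:k])
    sparse_hidden = [h if j in keep else 0.0 for j, h in enumerate(hidden)]
    # Reconstruct the input from the few units that stayed on.
    reconstruction = [sum(w * h for w, h in zip(row, sparse_hidden)) for row in w_dec]
    return sparse_hidden, reconstruction

random.seed(0)
n_inputs, n_hidden, k = 8, 32, 4  # hidden layer larger than the input; at most 4 units on
w_enc = [[random.gauss(0, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
w_dec = [[random.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_inputs)]
x = [random.gauss(0, 1) for _ in range(n_inputs)]

code, x_hat = sparse_forward(x, w_enc, w_dec, k)
print(sum(1 for h in code if h != 0.0), "of", n_hidden, "hidden units are on")
```

Because at most `k` of the 32 hidden units are active for any one input, the hidden layer cannot simply copy the input even though it is wider, and different inputs are free to switch on different units.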
Stacked autoencoders are not the only kind of deep learner. Another is based on Boltzmann machines, and another, convolutional neural networks, on a model of the visual cortex. Despite their remarkable successes, however, all of these are still a far cry from the brain. The Google network can recognize cat faces seen head on; humans can recognize cats in any pose and even when the face is hard to make out. The Google network is still pretty shallow; only three of its nine layers are autoencoders. A multilayer perceptron is a passable model of the cerebellum, the part of the brain responsible for low-level motor control, but the cortex is another story. It’s missing the backward connections needed to propagate errors, for one, and yet it’s where the real learning wizardry resides. In his book On Intelligence, Jeff Hawkins advocated designing algorithms closely based on the organization of the cortex, but so far none of these algorithms can compete with today’s deep networks.

Like many other early machine-learning researchers, Holland started out working on neural networks, but his interests took a different turn when, while a graduate student at the University of Michigan, he read Ronald Fisher’s classic treatise The Genetical Theory of Natural Selection. In it, Fisher, who was also the founder of modern statistics, formulated the first mathematical theory of evolution. Brilliant as it was, Holland felt that Fisher’s theory left out the essence of evolution. Fisher considered each gene in isolation, but an organism’s fitness is a complex function of all its genes. If genes are independent, the relative frequencies of their variants rapidly converge to the maximum fitness point and remain in equilibrium thereafter. But if genes interact, evolution, the search for maximum fitness, is vastly more complex.
With one thousand genes, each with two variants, the genome has 2¹⁰⁰⁰ possible states, and no planet in the universe is remotely large or ancient enough to have tried them all out. Yet on Earth evolution has managed to come up with some remarkably fit organisms, and Darwin’s theory of natural selection explains how, at least qualitatively. Holland decided to turn it into an algorithm.

In this regard, genetic algorithms are a lot like selective breeding. Darwin opened The Origin of Species with a discussion of it, as a stepping-stone to the more difficult concept of natural selection. All the domesticated plants and animals we take for granted today are the result of selecting and mating, generation after generation, the organisms that best served our purposes: the corn with the largest corncobs, the sweetest fruit trees, the shaggiest sheep, the hardiest horses. Genetic algorithms do the same, except they breed programs instead of living creatures, and a generation is a few seconds of computer time instead of a creature’s lifetime.

On the other hand, you may be wondering why we’re not done at this point. Surely if we’ve combined nature’s two master algorithms, evolution and the brain, that’s all we could ask for. Unfortunately, what we have so far is only a very crude cartoon of how nature learns, good enough for a lot of applications but still a pale shadow of the real thing. For example, the development of the embryo is a crucial part of life,
but there’s no analog of it in machine learning: the “organism” is a very straightforward function of the genome, and we may be missing something important there. But another reason is that we wouldn’t be satisfied even if we had completely figured out how nature learns. For one thing, it’s too slow. Evolution takes billions of years to learn, and the brain takes a lifetime. Culture is better: I can distill a lifetime of learning into a book, and you can read it in a few hours. But learning algorithms should be able to learn in minutes or seconds. He who learns fastest wins, whether it’s the Baldwin effect speeding up evolution, verbal communication speeding up human learning, or computers discovering patterns at the speed of light. Machine learning is the latest chapter in the arms race of life on Earth, and swifter hardware is only half the equation. The other half is smarter software.

As you stare uncomprehendingly at it, your Google Glass helpfully flashes: “Bayes’ theorem.” Now the crowd starts to chant “More data! More data!” A stream of sacrificial victims is being inexorably pushed toward the altar. Suddenly, you realize that you’re in the middle of it, too late. As the crank looms over you, you scream, “No! I don’t want to be a data point! Let me gooooo!”

Humans, it turns out, are not very good at Bayesian inference, at least when verbal reasoning is involved. The problem is that we tend to neglect the cause’s prior probability. If you test positive for HIV, and the test only gives 1 percent false positives, should you panic? At first sight, it seems like your chances of having AIDS are now 99 percent. Yikes!
But let’s keep a cool head and apply Bayes’ theorem step-by-step: P(HIV | positive) = P(HIV) × P(positive | HIV) / P(positive). P(HIV) is the prevalence of HIV in the general population, which is about 0.3 percent in the United States. P(positive) is the probability that the test comes out positive whether or not you have AIDS; let’s say that’s 1 percent. So P(HIV | positive) = 0.003 × 0.99 / 0.01 = 0.297. That’s very different from 0.99! The reason is that HIV is rare in the general population. The test coming out positive increases your chances of having AIDS by two orders of magnitude, but they’re still less than half. If you test positive for HIV, the right thing to do is to stay calm and take another, more definitive test. Chances are you’ll be fine.

But something funny happened on the way to world domination. Researchers using Bayesian models kept noticing that you got better results by tweaking the probabilities in illegal ways. For example, raising P(words) to some power in speech recognizers improved accuracy, but then it wasn’t Bayes’ theorem any more. What was going on? The culprit, it turns out, was the false independence assumptions that generative models make. The simplified graph structure makes the models learnable and is worth keeping, but then we’re better off just learning the best parameters we can for the task at hand, irrespective of whether they’re probabilities. The real strength of, say, Naïve Bayes is that it provides a small, informative set of features from which to predict the class and a fast, robust way to learn the corresponding parameters. In a spam filter, each feature is the occurrence of a particular word in spam, and the corresponding parameter is how often it occurs; and similarly for nonspam. Viewed in this way, Naïve Bayes can be optimal, in the sense of making the best predictions possible, even in many cases where its independence assumptions are wildly violated.
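The spam-filter description above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production filter: the four training messages are made up, the function names are hypothetical, only words present in a message are scored, and add-one smoothing is an added assumption so that unseen words do not zero out a class. Each parameter is simply how often a word occurs in spam or in nonspam, and the prediction is whichever class scores higher.

```python
import math
from collections import Counter

def train(messages):
    """Count, per class, how many messages contain each word."""
    word_counts = {"spam": Counter(), "nonspam": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(set(text.lower().split()))
    return word_counts, class_counts

def log_score(text, label, word_counts, class_counts):
    """Log of the class prior times the per-word frequencies (naive independence)."""
    n = class_counts[label]
    score = math.log(n / sum(class_counts.values()))
    for word in set(text.lower().split()):
        # Add-one smoothing (an assumption of this sketch).
        score += math.log((word_counts[label][word] + 1) / (n + 2))
    return score

def classify(text, word_counts, class_counts):
    """Pick the class with the higher score; only the ranking matters."""
    return max(class_counts, key=lambda c: log_score(text, c, word_counts, class_counts))

# Made-up training data for illustration only.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "nonspam"),
    ("lunch on monday", "nonspam"),
]
word_counts, class_counts = train(training)
print(classify("free money", word_counts, class_counts))      # spam
print(classify("monday meeting", word_counts, class_counts))  # nonspam
```

Note that `classify` compares scores and never needs them to be well-calibrated probabilities, which is the sense in which the predictions can be right even when the independence assumption is not.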
When I realized this and published a paper about it in 1996, people’s suspicion of Naïve Bayes melted away, helping it to take off. But it was also a step on the way to a different kind of model, which in the last two decades has increasingly replaced Bayesian networks in machine learning: Markov networks.

[Figure: pic_25.jpg]

Machine learning is both a science and a technology, and both characteristics give us hints on how to unify it. On the science side, unifying theories often begin with a deceptively simple observation. Two seemingly unrelated phenomena turn out to be just two faces of the same coin, and like the first domino to fall, that realization sets off a cascade of others. An apple falling to the ground, the moon hanging in the sky: both are caused by gravity, and, apocryphal story or not, once Newton figured out how, gravity turned out to also account for the tides, the precession of the equinoxes, the trajectories of comets, and much else. In everyday experience, electricity and magnetism are never seen together: a lightning spark here, a rock that attracts iron objects there, both quite rare. But once Maxwell figured out how a changing electric field gives rise to magnetism and vice versa, it became clear that light itself is an intimate marriage of the two, and today we know that, far from rare, electromagnetism pervades all matter. Mendeleev’s periodic table not only organized all the known elements into just two dimensions, it also predicted where new elements would be found. Darwin’s observations aboard the Beagle suddenly began to make sense when Malthus’s Essay on Population suggested natural selection as the organizing principle. When Crick and Watson hit on the double helix structure as an explanation for the puzzling properties of DNA, they immediately saw how it might replicate itself, and biology’s transition from stamp collecting (in Rutherford’s pejorative words) to unified science had begun.
In each of these cases, a bewildering variety of observations turned out to have a common cause, and once scientists identified it, they could in turn use it to predict many new phenomena. Similarly, even though the learners we’ve met in this book seem quite disparate (some based on the brain, some on evolution, some on abstract mathematical principles), they in fact have much in common, and the resulting theory of learning yields many new insights.

“Support vector machines and kernel methods: The new generation of learning machines,”* by Nello Cristianini and Bernhard Schölkopf (AI Magazine, 2002), is a mostly nonmathematical introduction to SVMs. The paper that started the SVM revolution was “A training algorithm for optimal margin classifiers,”* by Bernhard Boser, Isabel Guyon, and Vladimir Vapnik (Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992). The first paper applying SVMs to text classification was “Text categorization with support vector machines,”* by Thorsten Joachims (Proceedings of the Tenth European Conference on Machine Learning, 1998). Chapter 5 of An Introduction to Support Vector Machines,* by Nello Cristianini and John Shawe-Taylor (Cambridge University Press, 2000), is a brief introduction to constrained optimization in the context of SVMs.