The argument from neuroscience.

Moving on to more sophisticated alternatives, what about the queries that any good database engine can answer, or the simple algorithms in a statistical package? Aren't those enough? These are bigger Lego bricks, but they're still only bricks. A database engine never discovers anything new; it just tells you what it knows. Even if all the humans in a database are mortal, it doesn't occur to it to generalize mortality to other humans. (Database engineers would blanch at the thought.) Much of statistics is about testing hypotheses, but someone has to formulate them in the first place. Statistical packages can do linear regression and other simple procedures, but these have a very low limit on what they can learn, no matter how much data you feed them. The better packages cross into the gray zone between statistics and machine learning, but there are still many kinds of knowledge they can't discover.

Bayesians are concerned above all with uncertainty. All learned knowledge is uncertain, and learning itself is a form of uncertain inference. The problem then becomes how to deal with noisy, incomplete, and even contradictory information without falling apart. The solution is probabilistic inference, and the master algorithm is Bayes' theorem and its derivatives. Bayes' theorem tells us how to incorporate new evidence into our beliefs, and probabilistic inference algorithms do that as efficiently as possible.

Physicist makes brain out of glass.

This curve, which looks like an elongated S, is variously known as the logistic, sigmoid, or S curve. Peruse it closely, because it's the most important curve in the world. At first the output increases slowly with the input, so slowly it seems constant. Then it starts to change faster, then very fast, then slower and slower until it becomes almost constant again. The transfer curve of a transistor, which relates its input and output voltages, is also an S curve.
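The shape just described has a standard formula, the logistic function 1/(1 + e^-x). A minimal sketch, showing how it stays nearly flat far from zero and rises fastest at the midpoint:

```python
import math

def logistic(x: float) -> float:
    """The S curve: nearly constant at the extremes, steepest at x = 0."""
    return 1.0 / (1.0 + math.exp(-x))

# Slow, then fast, then slow again:
for x in (-6, -2, 0, 2, 6):
    print(f"{x:+d} -> {logistic(x):.3f}")
```

Note the symmetry: logistic(-x) = 1 - logistic(x), which is why the curve looks the same approaching either plateau.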
So both computers and the brain are filled with S curves. But it doesn't end there. The S curve is the shape of phase transitions of all kinds: the probability of an electron flipping its spin as a function of the applied field, the magnetization of iron, the writing of a bit of memory to a hard disk, an ion channel opening in a cell, ice melting, water evaporating, the inflationary expansion of the early universe, punctuated equilibria in evolution, paradigm shifts in science, the spread of new technologies, white flight from multiethnic neighborhoods, rumors, epidemics, revolutions, the fall of empires, and much more. The Tipping Point could equally well (if less appealingly) be entitled The S Curve. An earthquake is a phase transition in the relative position of two adjacent tectonic plates. A bump in the night is just the sound of the microscopic tectonic plates in your house's walls shifting, so don't be scared. Joseph Schumpeter said that the economy evolves by cracks and leaps: S curves are the shape of creative destruction. The effect of financial gains and losses on your happiness follows an S curve, so don't sweat the big stuff. The probability that a random logical formula is satisfiable (the quintessential NP-complete problem) undergoes a phase transition from almost 1 to almost 0 as the formula's length increases. Statistical physicists spend their lives studying phase transitions.

Beware of attaching too much meaning to the weights backprop finds, however. Remember that there are probably many very different ones that are just as good. Learning in multilayer perceptrons is a chaotic process in the sense that starting in slightly different places can cause you to wind up at very different solutions. The phenomenon is the same whether the slight difference is in the initial weights or the training data, and it manifests itself in all powerful learners, not just backprop.

Backprop's applications are now too many to count.
As its fame has grown, more of its history has come to light. It turns out that, as is often the case in science, backprop was invented more than once. Yann LeCun in France and others hit on it at around the same time as Rumelhart. A paper on backprop was rejected by the leading AI conference in the early 1980s because, according to the reviewers, Minsky and Papert had already proved that perceptrons don't work. In fact, Rumelhart is credited with inventing backprop by the Columbus test: Columbus was not the first person to discover America, but the last. It turns out that Paul Werbos, a graduate student at Harvard, had proposed a similar algorithm in his PhD thesis in 1974. And in a supreme irony, Arthur Bryson and Yu-Chi Ho, two control theorists, had done the same even earlier: in 1969, the same year that Minsky and Papert published Perceptrons! Indeed, the history of machine learning itself shows why we need learning algorithms. If algorithms that automatically find related papers in the scientific literature had existed in 1969, they could have potentially helped avoid decades of wasted time and accelerated who knows what discoveries.

A genetic algorithm works by mimicking this process. In each generation, it mates the fittest individuals, producing two offspring from each pair of parents by crossing over their bit strings at a random point. After applying point mutations to the new strings, it lets them loose in its virtual world. Each one returns with a fitness score, and the process repeats. Each generation is fitter than the previous one, and the process terminates when the desired fitness is reached or time runs out.

Survival of the fittest programs.

Despite their successes, and the insights they've provided on issues like gradualism versus punctuated equilibria, genetic algorithms have left one great mystery unsolved: the role of sex in evolution.
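The generation loop described above (select the fittest, cross over at a random point, apply point mutations, repeat until the desired fitness is reached or time runs out) can be sketched in a few lines. The fitness function here is a placeholder of my choosing, the classic "count the 1 bits" toy problem, not anything from the text:

```python
import random

def fitness(bits):
    """Toy fitness: number of 1 bits (the 'OneMax' problem)."""
    return sum(bits)

def crossover(a, b):
    """Cross two bit strings at a random point, producing two offspring."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(bits, rate=0.01):
    """Flip each bit with a small probability (point mutation)."""
    return [1 - b if random.random() < rate else b for b in bits]

def evolve(pop_size=40, length=20, generations=60):
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == length:      # desired fitness reached
            break
        parents = pop[: pop_size // 2]     # mate the fittest individuals
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            for child in crossover(a, b):
                children.append(mutate(child))
        pop = children[:pop_size]
    return max(pop, key=fitness)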
Evolutionaries set great store by crossover, but members of the other tribes think it's not worth the trouble. None of Holland's theoretical results show that crossover actually helps; mutation suffices to exponentially increase the frequency of the fittest schemas in the population over time. And the "building blocks" intuition is appealing but quickly runs into trouble, even when genetic programming is used. As larger blocks evolve, crossover also becomes increasingly likely to break them up. Also, once a highly fit individual appears, its descendants tend to quickly take over the population, crowding out potentially better schemas that were trapped in overall less fit individuals. This effectively reduces the search to variations of the fitness champ. Researchers have come up with a number of schemes for preserving diversity in the population, but the results so far are inconclusive. Engineers certainly use building blocks extensively, but combining them involves, well, a lot of engineering; it's not just a matter of throwing them together any old way, and it's not clear crossover can do the trick.

The path to optimal learning begins with a formula that many people have heard of: Bayes' theorem. But here we'll see it in a whole new light and realize that it's vastly more powerful than you'd guess from its everyday uses. At heart, Bayes' theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down. For example, if you test positive for AIDS, your probability of having it goes up. Things get more interesting when you have many pieces of evidence, such as the results of multiple tests. To combine them all without suffering a combinatorial explosion, we need to make simplifying assumptions.
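The usual simplifying assumption is that the pieces of evidence are conditionally independent given the hypothesis: then Bayes' theorem reduces to multiplying the prior by each test's likelihood and normalizing. A minimal sketch, with hypothetical numbers (a 1% prior, tests with 99% sensitivity and a 5% false-positive rate) chosen only for illustration:

```python
def posterior(prior, likelihoods_pos, likelihoods_neg):
    """Bayes' theorem under conditional independence of the evidence:
    multiply the prior by each piece of evidence's likelihood under the
    hypothesis (and its negation), then normalize."""
    p_h, p_not = prior, 1.0 - prior
    for l_pos, l_neg in zip(likelihoods_pos, likelihoods_neg):
        p_h *= l_pos      # P(evidence | hypothesis)
        p_not *= l_neg    # P(evidence | not hypothesis)
    return p_h / (p_h + p_not)

# One positive test barely moves a 1% prior; two agreeing tests move it a lot.
one_test = posterior(0.01, [0.99], [0.05])
two_tests = posterior(0.01, [0.99, 0.99], [0.05, 0.05])
```

Each extra test multiplies in one more likelihood ratio, so the cost grows linearly with the number of tests instead of combinatorially.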
Things get even more interesting when we consider many hypotheses at once, such as all the different possible diagnoses for a patient. Computing the probability of each disease from the patient's symptoms in a reasonable amount of time can take a lot of smarts. Once we know how to do all these things, we'll be ready to learn the Bayesian way. For Bayesians, learning is "just" another application of Bayes' theorem, with whole models as the hypotheses and the data as the evidence: as you see more data, some models become more likely and some less, until ideally one model stands out as the clear winner. Bayesians have invented fiendishly clever kinds of models. So let's get started.

In reality, we usually let SVMs violate some constraints, meaning classify some examples incorrectly or by less than the margin, because otherwise they would overfit. If there's a noisy negative example somewhere in the middle of the positive region, we don't want the frontier to wind around inside the positive region just to get that example right. But the SVM pays a penalty for each example it gets wrong, which encourages it to keep those to a minimum. SVMs are like the sandworms in Dune: big, tough, and able to survive a few explosions from slithering over landmines, but not too many.

A neural network stole my job.

Three Algorithms for the Scientists under the sky,

Chapter Two.