To see the future of science, take a peek inside a lab at the Manchester Institute of Biotechnology, where a robot by the name of Adam is hard at work figuring out which genes encode which enzymes in yeast. Adam has a model of yeast metabolism and general knowledge of genes and proteins. It makes hypotheses, designs experiments to test them, physically carries them out, analyzes the results, and comes up with new hypotheses until it's satisfied. Today, human scientists still independently check Adam's conclusions before they believe them, but tomorrow they'll leave it to robot scientists to check each other's hypotheses.

In politics, as in business and war, there is nothing worse than seeing your opponent make moves that you don't understand and don't know what to do about until it's too late. That's what happened to the Romney campaign. They could see the other side buying ads on particular cable stations in particular towns but couldn't tell why; their crystal ball was too fuzzy. In the end, Obama won every battleground state save North Carolina, and by larger margins than even the most accurate pollsters had predicted. The most accurate pollsters, in turn, were the ones (like Nate Silver) who used the most sophisticated prediction techniques; they were less accurate than the Obama campaign because they had fewer resources. But they were a lot more accurate than the traditional pundits, whose predictions were based on their expertise.

How much of the character of physical law percolates up to higher domains like biology and sociology remains to be seen, but the study of chaos provides many tantalizing examples of very different systems with similar behavior, and the theory of universality explains them. The Mandelbrot set is a beautiful example of how a very simple iterative procedure can give rise to an inexhaustible variety of forms. If the mountains, rivers, clouds, and trees of the world are all the result of such procedures (and fractal geometry shows they are), perhaps those procedures are just different parametrizations of a single one that we can induce from them.

If gene C is expressed, gene D is not.

The higher an input's weight, the stronger the corresponding synapse. The cell body adds up all the weighted inputs, and the axon applies a step function to the result. The axon's box in the diagram shows the graph of a step function: 0 for low values of the input, abruptly changing to 1 when the input reaches the threshold.
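To make the arithmetic concrete, here is a minimal Python sketch of the neuron just described: a weighted sum followed by a step function. The inputs, weights, and threshold are made-up numbers for illustration, not values from any real network.

```python
import numpy as np

def perceptron(inputs, weights, threshold):
    """One artificial neuron: a weighted sum, then a step function."""
    # The "cell body" adds up the weighted inputs.
    total = np.dot(inputs, weights)
    # The "axon" applies the step function: 0 below the threshold,
    # abruptly changing to 1 once the threshold is reached.
    return 1 if total >= threshold else 0

# Made-up example: three inputs with different synapse strengths.
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, 0.3, 0.2])  # a higher weight means a stronger synapse
print(perceptron(x, w, threshold=0.6))  # prints 1, since 0.7 >= 0.6
```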
The Bayesian method is not just applicable to learning Bayesian networks and their special cases. (Conversely, despite their name, Bayesian networks aren't necessarily Bayesian: frequentists can learn them, too, as we just saw.) We can put a prior distribution on any class of hypotheses (sets of rules, neural networks, programs) and then update it with the hypotheses' likelihood given the data. The Bayesians' view is that it's up to you what representation you choose, but then you have to learn it using Bayes' theorem. In the 1990s, they mounted a spectacular takeover of the Conference on Neural Information Processing Systems (NIPS for short), the main venue for connectionist research. The ringleaders (so to speak) were David MacKay, Radford Neal, and Michael Jordan. MacKay, a Brit who was a student of John Hopfield's at Caltech and later became chief scientific advisor to the UK's Department of Energy and Climate Change, showed how to learn multilayer perceptrons the Bayesian way. Neal introduced the connectionists to MCMC, and Jordan introduced them to variational inference. Finally, they pointed out that in the limit you could "integrate out" the neurons in a multilayer perceptron, leaving a type of Bayesian model that made no reference to them. Before long, the word "neural" in the title of a paper submitted to NIPS became a good predictor of rejection. Some researchers joked that the conference should change its name to BIPS, for Bayesian Information Processing Systems.

In my PhD thesis, I designed an algorithm that unifies instance-based and rule-based learning in this way. A rule doesn't just match entities that satisfy all its preconditions; it matches any entity that's more similar to it than to any other rule, in the sense that it comes closer to satisfying its conditions. For instance, someone with a cholesterol level of 220 mg/dL comes closer than someone with 200 mg/dL to matching the rule "If your cholesterol is above 240 mg/dL, you're at risk of a heart attack." RISE, as I called the algorithm, learns by starting with each training example as a rule and then gradually generalizing each rule to absorb the nearest examples. The end result is usually a combination of very general rules, which between them match most examples, with more specific rules that match exceptions to those, and so on all the way to a "long tail" of specific memories. RISE made better predictions than the best rule-based and instance-based learners of the time, and my experiments showed that this was precisely because it combined the best features of both. Rules can be matched analogically, and so they're no longer brittle. Instances can select different features in different regions of space and so combat the curse of dimensionality much better than nearest-neighbor, which can only select the same features everywhere.
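The real RISE system is considerably more elaborate, but a toy sketch shows the nearest-rule idea at work. Everything below is invented for illustration: the rules, the attributes, and the crude distance measure, which simply counts how many of a rule's preconditions an entity fails to satisfy.

```python
def rule_distance(example, rule):
    """Toy distance: the number of the rule's preconditions the example fails.

    Zero means the example satisfies the rule outright; larger values
    mean it is further from matching.
    """
    return sum(1 for attr, value in rule["conditions"].items()
               if example.get(attr) != value)

def classify(example, rules):
    """Nearest-rule matching: the example takes the class of the rule it
    comes closest to satisfying, even if it satisfies none exactly."""
    nearest = min(rules, key=lambda rule: rule_distance(example, rule))
    return nearest["class"]

# Invented rules and example, for illustration only.
rules = [
    {"conditions": {"fur": "brown", "size": "small"}, "class": "toy"},
    {"conditions": {"legs": 4, "material": "wood"}, "class": "furniture"},
]
# Fails one condition of the first rule but two of the second.
print(classify({"fur": "brown", "size": "large"}, rules))  # prints "toy"
```

With a numeric attribute like cholesterol, the distance would instead measure how close the value comes to the rule's threshold, which is what makes 220 mg/dL a nearer miss than 200 mg/dL.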
Suppose we decide that letting Robby roam around in the real world is too slow and cumbersome a way to learn. Instead, like a would-be pilot learning in a flight simulator, we'll have him look at computer-generated images. We know what clusters the images come from, but we're not telling Robby. Instead, we create each image by first choosing a cluster at random (toys, say) and then synthesizing an example of that cluster (small, fluffy, brown teddy bear with big black eyes, round ears, and a bow tie). We also choose the properties of the example at random: the size comes from a normal distribution with a mean of ten inches, the fur is brown with 80 percent probability and white otherwise, and so on. After Robby has seen lots of images generated in this way, he should have learned to cluster them into people, furniture, toys, and so on, because people are more like other people than like furniture, and so on. But the interesting question is: if we look at it from Robby's point of view, what's the best algorithm to discover the clusters? The answer is surprising: Naïve Bayes, which we first met as an algorithm for supervised learning. The difference is that now Robby doesn't know the classes, so he'll have to guess them!
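Here is a toy sketch of that setup, assuming just two clusters and three yes-or-no attributes (all the numbers are invented): we generate examples exactly as described, by first picking a hidden cluster and then sampling attributes whose probabilities depend on it, and then use expectation-maximization with a Naïve Bayes model to guess the clusters back.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate toy "images": pick a hidden cluster, then sample binary
# attributes whose probabilities depend on the cluster. Invented numbers.
true_probs = np.array([[0.9, 0.8, 0.1],   # cluster 0, e.g. "toys"
                       [0.2, 0.1, 0.9]])  # cluster 1, e.g. "furniture"
z = rng.integers(0, 2, size=500)          # hidden cluster of each example
X = (rng.random((500, 3)) < true_probs[z]).astype(float)

# Naive Bayes with unknown classes, learned by expectation-maximization:
# guess each example's cluster (E-step), then re-estimate the per-cluster
# attribute probabilities from those guesses (M-step), and repeat.
k = 2
prior = np.full(k, 1 / k)
probs = rng.random((k, 3))  # random initial guess

for _ in range(50):
    # E-step: posterior probability of each cluster for each example.
    like = (probs[None] ** X[:, None] *
            (1 - probs[None]) ** (1 - X[:, None])).prod(axis=2)
    post = like * prior
    post /= post.sum(axis=1, keepdims=True)
    # M-step: update the priors and attribute probabilities.
    prior = post.mean(axis=0)
    probs = np.clip((post.T @ X) / post.sum(axis=0)[:, None], 0.01, 0.99)

# Should roughly recover true_probs, up to swapping the two rows
# (the cluster labels themselves are arbitrary).
print(np.round(probs, 2))
```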
You gaze intently at the map, trying to decipher its secret. The fifteen pieces all match quite precisely, but you need to figure out how they combine to form just three: the representation, evaluation, and optimization components of the Master Algorithm. Every learner has these three elements, but they vary from tribe to tribe.

Despite its successes, Alchemy has some significant shortcomings. It does not yet scale to truly big data, and someone without a PhD in machine learning will find it hard to use. Because of these problems, it's not yet ready for prime time. But let's see what we can do about them.

Sex, lies, and machine learning.

One Algorithm to bring them all and in the darkness bind them.

This does not mean that there is nothing to worry about, however. The first big worry, as with any technology, is that AI could fall into the wrong hands. If a criminal or prankster programs an AI to take over the world, we'd better have an AI police capable of catching it and erasing it before it gets too far. The best insurance policy against vast AIs gone amok is vaster AIs keeping the peace.

The trajectory we're on is not a singularity but a phase transition. Its critical point, the Turing point, will come when machine learning overtakes the natural variety. Natural learning itself has gone through three phases: evolution, the brain, and culture. Each is a product of the previous one, and each learns faster. Machine learning is the logical next stage of this progression. Computer programs are the fastest replicators on Earth: copying them takes only a fraction of a second. But creating them is slow, if it has to be done by humans. Machine learning removes that bottleneck, leaving a final one: the speed at which humans can absorb change. This too will eventually be removed, but not because we'll decide to hand things off to our "mind children," as Hans Moravec calls them, and go gently into the good night. Humans are not a dying twig on the tree of life. On the contrary, we're about to start branching.