Inevitably, however, there is a serpent in this Eden. It's called the complexity monster. Like the Hydra, the complexity monster has many heads. One of them is space complexity: the number of bits of information an algorithm needs to store in the computer's memory. If the algorithm needs more memory than the computer can provide, it's useless and must be discarded. Then there's the evil sister, time complexity: how long the algorithm takes to run, that is, how many steps of using and reusing the transistors it has to go through before it produces the desired results. If it's longer than we can wait, the algorithm is again useless. But the scariest face of the complexity monster is human complexity. When algorithms become too intricate for our poor human brains to understand, when the interactions between different parts of the algorithm are too many and too involved, errors creep in, we can't find them and fix them, and the algorithm doesn't do what we want. Even if we somehow make it work, it winds up being needlessly complicated for the people using it and doesn't play well with other algorithms, storing up trouble for later.

Examining the cortex under a microscope leads to the same conclusion. The same wiring pattern is repeated everywhere. The cortex is organized into columns with six distinct layers, feedback loops running to another brain structure called the thalamus, and a recurring pattern of short-range inhibitory connections and longer-range excitatory ones. A certain amount of variation is present, but it looks more like different parameters or settings of the same algorithm than different algorithms. Low-level sensory areas have more noticeable differences, but as the rewiring experiments show, these are not crucial.
The cerebellum, the evolutionarily older part of the brain responsible for low-level motor control, has a clearly different and very regular architecture, built out of much smaller neurons, so it would seem that at least motor learning uses a different algorithm. If someone's cerebellum is injured, however, the cortex takes over its function. Thus it seems that evolution kept the cerebellum around not because it does something the cortex can't, but just because it's more efficient.

If the world were just a blooming, buzzing confusion, there would be reason to doubt the existence of a universal learner. But if everything we experience is the product of a few simple laws, then it makes sense that a single algorithm can induce all that can be induced. All the Master Algorithm has to do is provide a shortcut to the laws' consequences, replacing impossibly long mathematical derivations with much shorter ones based on actual observations.

Another line of evidence comes from optimization, the branch of mathematics concerned with finding the input to a function that produces its highest output. For example, finding the sequence of stock purchases and sales that maximizes your total returns is an optimization problem. In optimization, simple functions often give rise to surprisingly complex solutions. Optimization plays a prominent role in almost every field of science, technology, and business, including machine learning. Each field optimizes within the constraints defined by optimizations in other fields. We try to maximize our happiness within economic constraints, which are firms' best solutions within the constraints of the available technology, which in turn consists of the best solutions we could find within the constraints of biology and physics. Biology, in turn, is the result of optimization by evolution within the constraints of physics and chemistry, and the laws of physics themselves are solutions to optimization problems.
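The core idea of optimization, finding the input to a function that produces its highest output, can be sketched in a few lines of Python; the function and the grid search below are invented purely for illustration, not a real trading strategy:

```python
# A minimal sketch of optimization: find the input x that maximizes
# a function f, here by exhaustive grid search over a range of inputs.
def f(x):
    return -(x - 3.0) ** 2 + 5.0   # a toy function that peaks at x = 3

# Evaluate f on a grid of candidate inputs and keep the best one found.
candidates = [i / 100.0 for i in range(-1000, 1001)]  # x in [-10, 10]
best_x = max(candidates, key=f)
print(best_x, f(best_x))  # prints: 3.0 5.0
```

Grid search is the crudest possible optimizer; real problems, like the stock-trading example, have far too many candidate inputs to enumerate, which is why gradient methods and other shortcuts matter.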
Perhaps, then, everything that exists is the progressive solution of an overarching optimization problem, and the Master Algorithm follows from the statement of that problem.

The first is that, in reality, we never have enough data to completely determine the world. Even ignoring the uncertainty principle, precisely knowing the positions and velocities of all particles in the world at some point in time is not remotely feasible. And because the laws of physics are chaotic, uncertainty compounds over time, and pretty soon they determine very little indeed. To accurately describe the world, we need a fresh batch of data at regular intervals. In effect, the laws of physics only tell us what happens locally. This drastically reduces their power.

For symbolists, all intelligence can be reduced to manipulating symbols, in the same way that a mathematician solves equations by replacing expressions by other expressions. Symbolists understand that you can't learn from scratch: you need some initial knowledge to go with the data. They've figured out how to incorporate preexisting knowledge into learning, and how to combine different pieces of knowledge on the fly in order to solve new problems. Their master algorithm is inverse deduction, which figures out what knowledge is missing in order to make a deduction go through, and then makes it as general as possible.

That's only the beginning, however. Most cancers involve a combination of mutations, or can only be cured by drugs that haven't been discovered yet. The next step is to learn rules with more complex conditions, involving the cancer's genome, the patient's genome and medical history, known side effects of drugs, and so on. But ultimately what we need is a model of how the entire cell works, enabling us to simulate on the computer the effect of a specific patient's mutations, as well as the effect of different combinations of drugs, existing or speculative.
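A learned rule with complex conditions of the kind just described might be sketched like this in Python; every gene, drug, and threshold below is a hypothetical placeholder chosen for illustration, not medical fact:

```python
# Hypothetical sketch: one learned rule whose conditions combine the
# cancer's genome, the patient's history, and the proposed treatment.
def rule_predicts_remission(patient):
    """Fire only if all of the rule's conditions hold for this patient."""
    return (
        "BRAF_V600E" in patient["mutations"]   # condition on the cancer's genome
        and patient["age"] < 70                # condition on medical history
        and "dabrafenib" in patient["drugs"]   # condition on the drug regimen
    )

patient = {
    "mutations": {"BRAF_V600E", "TP53_R175H"},
    "age": 62,
    "drugs": {"dabrafenib", "trametinib"},
}
print(rule_predicts_remission(patient))  # prints: True
```

A real learner would induce thousands of such rules from data rather than have anyone write them by hand; the point of the sketch is only the shape of a rule with multiple interacting conditions.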
Our main sources of information for building such models are DNA sequencers, gene expression microarrays, and the biological literature. Combining these is where inverse deduction can shine.

Adam, the robot scientist we met in Chapter 1, gives a preview. Adam's goal is to figure out how yeast cells work. It starts with basic knowledge of yeast genetics and metabolism and a trove of gene expression data from yeast cells. It then uses inverse deduction to hypothesize which genes are expressed as which proteins, designs microarray experiments to test them, revises its hypotheses, and repeats. Whether each gene is expressed depends on other genes and conditions in the environment, and the resulting web of interactions can be represented as a set of rules relating a gene's expression to the expression of other genes and to environmental conditions.

Most of all, the goal of machine learning is to find the best possible learning algorithm, by any means available, and evolution and the brain are unlikely to provide it. The products of evolution have many obvious faults. For example, the mammalian optic nerve attaches to the front of the retina instead of the back, causing an unnecessary (and egregious) blind spot right next to the fovea, the area of sharpest vision.

One solution is to combine "The Times reports it" and "The Journal reports it" into a single megavariable with four values: YesYes if they both do, YesNo if the Times reports a landing and the Journal doesn't, and so on. This turns the graph into a chain of three variables, and all is well. However, every time you add a news source, the number of values of the megavariable doubles. If instead of two news sources you have fifty, the megavariable has 2⁵⁰ values. So this method can only get you so far, and no other known method does any better.

Markov weighs the evidence.

Nearest-neighbor is the simplest and fastest learning algorithm ever invented. In fact, you could even say it's the fastest algorithm of any kind that could ever be invented.
It consists of doing exactly nothing, and therefore takes zero time to run. Can't beat that. If you want to learn to recognize faces and have a vast database of images labeled face/not face, just let it sit there. Don't worry, be happy. Without knowing it, those images already implicitly form a model of what a face is. Suppose you're Facebook and you want to automatically identify faces in photos people upload as a prelude to tagging them with their friends' names. It's nice to not have to do anything, given that Facebook users upload upward of three hundred million photos per day. Applying any of the learners we've seen so far to them, with the possible exception of Naïve Bayes, would take a truckload of computers. And Naïve Bayes is not smart enough to recognize faces.

With nearest-neighbor, each data point is its own little classifier, predicting the class for all the query examples it wins. Nearest-neighbor is like an army of ants, in which each soldier by itself does little, but together they can move mountains. If an ant's load is too heavy, it can share it with its neighbors. In the same spirit, in the k-nearest-neighbor algorithm, a test example is classified by finding its k nearest neighbors and letting them vote. If the nearest image to the new upload is a face but the next two nearest ones aren't, three-nearest-neighbor decides that the new upload is not a face after all. Nearest-neighbor is prone to overfitting: if we have the wrong class for a data point, it spreads to its entire metro area. K-nearest-neighbor is more robust because it only goes wrong if a majority of the k nearest neighbors is noisy. The price, of course, is that its vision is blurrier: fine details of the frontier get washed away by the voting. When k goes up, variance decreases, but bias increases.

Another disturbing example is what happens with our good old friend, the normal distribution, aka a bell curve.
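The k-nearest-neighbor voting described above fits in a few lines of Python; the feature vectors and labels here are tiny invented stand-ins for real image features:

```python
# A minimal sketch of k-nearest-neighbor classification over numeric
# feature vectors, using only the standard library.
from collections import Counter
import math

def knn_classify(train, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Sort the training examples by Euclidean distance to the query.
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "face"), ((0.1, 0.2), "face"),
         ((5.0, 5.0), "not face"), ((5.2, 4.8), "not face")]
print(knn_classify(train, (0.2, 0.1), k=3))  # prints: face
```

Note that "learning" happened entirely at query time: the training data just sat there until a prediction was needed, exactly as the text describes.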
What a normal distribution says is that data is essentially located at a point (the mean of the distribution), but with some fuzz around it (given by the standard deviation). Right? Not in hyperspace. With a high-dimensional normal distribution, you're more likely to get a sample far from the mean than close to it. A bell curve in hyperspace looks more like a doughnut than a bell. And when nearest-neighbor walks into this topsy-turvy world, it gets hopelessly confused. All examples look equally alike, and at the same time they're too far from each other to make useful predictions. If you sprinkle examples uniformly at random inside a high-dimensional hypercube, most are closer to a face of the cube than to their nearest neighbor. In medieval maps, uncharted areas were marked with dragons, sea serpents, and other fantastical creatures, or just with the phrase "here be dragons." In hyperspace, the dragons are everywhere, including at your front door. Try to walk to your next-door neighbor's house, and you'll never get there; you'll be forever lost in strange lands, wondering where all the familiar things went.
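A quick simulation makes the doughnut effect concrete: sampling from a standard normal distribution in increasing dimension, the typical distance from the mean grows roughly as the square root of the dimension, so in high dimensions almost no samples land near the mean. This sketch uses only the standard library, and the sample size is an arbitrary choice:

```python
# Sample n points from a d-dimensional standard normal and report the
# average Euclidean distance from the mean (the origin).
import math, random

random.seed(0)

def mean_distance_from_origin(d, n=2000):
    total = 0.0
    for _ in range(n):
        point = [random.gauss(0.0, 1.0) for _ in range(d)]
        total += math.sqrt(sum(x * x for x in point))
    return total / n

for d in (1, 10, 100):
    print(d, round(mean_distance_from_origin(d), 2))
```

In one dimension the average distance is under 1 standard deviation, as the bell-curve picture suggests; by 100 dimensions it is close to 10, and essentially every sample sits in a thin shell far from the mean.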