In biology, learning algorithms figure out where genes are located in a DNA molecule, where superfluous bits of RNA get spliced out before proteins are synthesized, how proteins fold into their characteristic shapes, and how different conditions affect the expression of different genes. Rather than testing thousands of new drugs in the lab, learners predict whether they will work, and only the most promising get tested. They also weed out molecules likely to have nasty side effects, like cancer. This avoids expensive failures, like candidate drugs being nixed only after human trials have begun.

Starting with restrictive assumptions and gradually relaxing them if they fail to explain the data is typical of machine learning, and the process is usually carried out automatically by the learner, without any help from you. First, it tries all single factors, then all conjunctions of two factors, then all conjunctions of three, and so on. But now we run into a problem: there are a lot of conjunctive concepts and not enough time to try them all out.

Most importantly for us, S curves lead to a new solution to the credit-assignment problem. If the universe is a symphony of phase transitions, let's model it with one. That's what the brain does: it tunes the system of phase transitions inside to the one outside. So let's replace the perceptron's step function with an S curve and see what happens.

The path to optimal learning begins with a formula that many people have heard of: Bayes' theorem. But here we'll see it in a whole new light and realize that it's vastly more powerful than you'd guess from its everyday uses. At heart, Bayes' theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down. For example, if you test positive for AIDS, your probability of having it goes up.
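To make the updating rule concrete, here is a minimal sketch of a single Bayesian update in Python. The prevalence and test-accuracy numbers are invented purely for illustration; they are not real medical statistics.

```python
# Bayes' theorem for a single positive test result:
#   P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
# All numbers below are illustrative assumptions, not real statistics.

def posterior(prior, sensitivity, false_positive_rate):
    """Updated belief in the hypothesis after a positive test."""
    # P(positive) sums over both ways a positive result can arise.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

prior = 0.003               # assumed prevalence: 0.3% of people have the disease
sensitivity = 0.99          # assumed P(test positive | disease)
false_positive_rate = 0.01  # assumed P(test positive | no disease)

print(round(posterior(prior, sensitivity, false_positive_rate), 2))  # → 0.23
```

Even with these made-up numbers, the example shows the rule at work: the evidence is consistent with the hypothesis, so the belief rises, here from 0.3 percent to roughly 23 percent, though far less than the test's 99 percent accuracy might suggest, because the disease was rare to begin with.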
Things get more interesting when you have many pieces of evidence, such as the results of multiple tests. To combine them all without suffering a combinatorial explosion, we need to make simplifying assumptions. Things get even more interesting when we consider many hypotheses at once, such as all the different possible diagnoses for a patient. Computing the probability of each disease from the patient's symptoms in a reasonable amount of time can take a lot of smarts. Once we know how to do all these things, we'll be ready to learn the Bayesian way. For Bayesians, learning is "just" another application of Bayes' theorem, with whole models as the hypotheses and the data as the evidence: as you see more data, some models become more likely and some less, until ideally one model stands out as the clear winner. Bayesians have invented fiendishly clever kinds of models. So let's get started.

This is ironic, since Laplace was also the father of probability theory, which he believed was just common sense reduced to calculation. At the heart of his explorations in probability was a preoccupation with Hume's question. For example, how do we know the sun will rise tomorrow? It has done so every day until today, but that's no guarantee it will continue. Laplace's answer had two parts. The first is what we now call the principle of indifference, or principle of insufficient reason. We wake up one day-at the beginning of time, let's say, which for Laplace was five thousand years or so ago-and after a beautiful afternoon, we see the sun go down. Will it come back? We've never seen the sun rise, and there is no particular reason to believe it will or won't. Therefore we should consider the two scenarios equally likely and say that the sun will rise again with a probability of one-half. But, Laplace went on, if the past is any guide to the future, every day that the sun rises should increase our confidence that it will continue to do so.
After five thousand years, the probability that the sun will rise yet again tomorrow should be very close to one, but not quite there, since we can never be completely certain. From this thought experiment, Laplace derived his so-called rule of succession, which estimates the probability that the sun will rise again after having risen n times as (n + 1) / (n + 2). When n = 0, this is just ½; and as n increases, so does the probability, approaching 1 as n approaches infinity.

Pearl realized that it's OK to have a complex network of dependencies among random variables, provided each variable depends directly on only a few others. We can represent these dependencies with a graph like the ones we saw for Markov chains and HMMs, except now the graph can have any structure (as long as the arrows don't form closed loops). One of Pearl's favorite examples is burglar alarms. The alarm at your house should go off if a burglar attempts to break in, but it could also be triggered by an earthquake. (In Los Angeles, where Pearl lives, earthquakes are almost as frequent as burglaries.) If you're working late one night and your neighbor Bob calls to say he just heard your alarm go off, but your neighbor Claire doesn't, should you call the police? Here's the graph of dependencies:

The inference problem

One solution is to combine The Times reports it and The Journal reports it into a single megavariable with four values: YesYes if they both do, YesNo if the Times reports a landing and the Journal doesn't, and so on. This turns the graph into a chain of three variables, and all is well. However, every time you add a news source, the number of values of the megavariable doubles. If instead of two news sources you have fifty, the megavariable has 2^50 values. So this method can only get you so far, and no other known method does any better.

Putting together birds of a feather

Here's an interesting experiment.
Take the video stream from Robby's eyes, treat each frame as a point in the space of images, and reduce that set of images to a single dimension. What will you discover? Time. Like a librarian arranging books on a shelf, time places each image next to its most similar ones. Perhaps our perception of it is just a natural result of our brains' dimensionality reduction prowess. In the road network of memory, time is the main thoroughfare, and we soon find it. Time, in other words, is the principal component of memory.

Clustering and dimensionality reduction get us closer to human learning, but there's still something very important missing. Children don't just passively observe the world; they do things. They pick up objects they see, play with them, run around, eat, cry, and ask questions. Even the most advanced visual system is of no use to Robby if it doesn't help him interact with the environment. Robby needs to know not just what's where but what to do at each moment. In principle we could teach him using step-by-step instructions, pairing sensor readings with the appropriate actions to take in response, but this is viable only for narrow tasks. The actions you take depend on your goals, not just whatever you are currently perceiving, and those goals can be far in the future. Step-by-step supervision shouldn't be needed, in any case. Parents don't teach their children to crawl, walk, or run; they figure it out on their own. But none of the learning algorithms we've seen so far can do this.

You can download the learner I've just described from alchemy.cs.washington.edu. We christened it Alchemy to remind ourselves that, despite all its successes, machine learning is still in the alchemy stage of science. If you do download it, you'll see that it includes a lot more than the basic algorithm I've described but also that it is still missing a few things I said the universal learner ought to have, like crossover.
Nevertheless, let's use the name Alchemy to refer to our candidate universal learner for simplicity.

The sobering (or perhaps reassuring) thought is that no learner in the world today has access to all this data (not even the NSA), and even if it did, it wouldn't know how to turn it into a real likeness of you. But suppose you took all your data and gave it to the-real, future-Master Algorithm, already seeded with everything we could teach it about human life. It would learn a model of you, and you could carry that model in a thumb drive in your pocket, inspect it at will, and use it for everything you pleased. It would surely be a wonderful tool for introspection, like looking at yourself in the mirror, but it would be a digital mirror that showed not just your looks but all things observable about you-a mirror that could come alive and converse with you. What would you ask it? Some of the answers you might not like, but that would be all the more reason to ponder them. And some would give you new ideas, new directions. The Master Algorithm's model of you might even help you become a better person.

Acknowledgments

Judea Pearl's pioneering work on Bayesian networks appears in his book Probabilistic Reasoning in Intelligent Systems* (Morgan Kaufmann, 1988). "Bayesian networks without tears,"* by Eugene Charniak (AI Magazine, 1991), is a largely nonmathematical introduction to them. "Probabilistic interpretation for MYCIN's certainty factors,"* by David Heckerman (Proceedings of the Second Conference on Uncertainty in Artificial Intelligence, 1986), explains when sets of rules with confidence estimates are and aren't a reasonable approximation to Bayesian networks. "Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data," by Eran Segal et al.
(Nature Genetics, 2003), is an example of using Bayesian networks to model gene regulation. "Microsoft virus fighter: Spam may be more difficult to stop than HIV," by Ben Paynter (Fast Company, 2012), tells how David Heckerman took inspiration from spam filters and used Bayesian networks to design a potential AIDS vaccine. The probabilistic or "noisy" OR is explained in Pearl's book.* "Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base," by M. A. Shwe et al. (Parts I and II, Methods of Information in Medicine, 1991), describes a noisy-OR Bayesian network for medical diagnosis. Google's Bayesian network for ad placement is described in Section 26.5.4 of Kevin Murphy's Machine Learning* (MIT Press, 2012). Microsoft's player rating system is described in "TrueSkill™: A Bayesian skill rating system,"* by Ralf Herbrich, Tom Minka, and Thore Graepel (Advances in Neural Information Processing Systems 19, 2007).