You may not know it, but machine learning is all around you. When you type a query into a search engine, it's how the engine figures out which results to show you (and which ads, as well). When you read your e-mail, you don't see most of the spam, because machine learning filtered it out. Go to Amazon.com to buy a book or Netflix to watch a video, and a machine-learning system helpfully recommends some you might like. Facebook uses machine learning to decide which updates to show you, and Twitter does the same for tweets. Whenever you use a computer, chances are machine learning is involved somewhere.

Where are we headed?

Our quest will take us across the territory of each of the five tribes. The border crossings, where they meet, negotiate, and skirmish, will be the trickiest part of the journey. Each tribe has a different piece of the puzzle, which we must gather. Machine learners, like all scientists, resemble the blind men and the elephant: one feels the trunk and thinks it's a snake, another leans against the leg and thinks it's a tree, yet another touches the tusk and thinks it's a bull. Our aim is to touch each part without jumping to conclusions; and once we've touched all of them, we will try to picture the whole elephant. It's far from obvious how to combine all the pieces into one solution (impossible, according to some), but this is what we will do.

But perhaps that's not such a big deal? With enough data, won't most cases be in the "trivial" category? No. We saw in the previous chapter why memorization won't work as a universal learner, but now we can make it more quantitative. Suppose you have a database with a trillion records, each with a thousand Boolean fields (i.e., each field is the answer to a yes/no question). That's pretty big. What fraction of the possible cases have you seen? (Take a guess before you read on.)
Well, the number of possible answers is two for each question, so for two questions it's two times two (yes-yes, yes-no, no-yes, and no-no), for three questions it's two cubed (2 × 2 × 2 = 2^3), and for a thousand questions it's two raised to the power of a thousand (2^1000). The trillion records in your database are one-gazillionth of 1 percent of 2^1000, where "gazillionth" means "zero point 286 zeros followed by 1." Bottom line: no matter how much data you have (tera- or peta- or exa- or zetta- or yottabytes), you've basically seen nothing. The chances that the new case you need to make a decision on is already in the database are so vanishingly small that, without generalization, you won't even get off the ground.

In the meantime, the practical consequence of the "no free lunch" theorem is that there's no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.

One such rule is: If Socrates is human, then he's mortal. This does the job, but it is not very useful because it's specific to Socrates. But now we apply Newton's principle and generalize the rule to all entities: If an entity is human, then it's mortal. Or, more succinctly: All humans are mortal. Of course, it would be rash to induce this rule from Socrates alone, but we know similar facts about other humans:

According to the decision tree above, you're either a Republican, a Democrat, or an independent; you can't be more than one, or none of the above. Sets of concepts with this property are called sets of classes, and the algorithm that predicts them is a classifier. A single concept implicitly defines two classes: the concept itself and its negation. (For example, spam and nonspam.) Classifiers are the most widespread form of machine learning.
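The earlier back-of-the-envelope arithmetic about the trillion-record database is easy to verify; a quick sketch in Python, working in logarithms since the fraction itself is far too small for a float:

```python
import math

# Check the chapter's claim: a trillion records, each with 1,000 Boolean
# fields, cover a vanishing fraction of the 2^1000 possible cases.
records = 10**12                     # one trillion
fields = 1000
possibilities = 2**fields            # exact, thanks to Python's big integers

# Fraction of the space seen, expressed as a power of ten.
log10_fraction = math.log10(records) - fields * math.log10(2)
print(f"fraction of possible cases seen ~ 10^{log10_fraction:.0f}")
# ~ 10^-289: effectively zero, no matter how big the database gets.
```

Adding another factor of a thousand or a million to the database barely moves the exponent, which is the point: generalization has to do the work that sheer coverage never can.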
The S curve is not just important as a model in its own right; it's also the jack-of-all-trades of mathematics. If you zoom in on its midsection, it approximates a straight line. Many phenomena we think of as linear are in fact S curves, because nothing can grow without limit. Because of relativity, and contra Newton, acceleration does not increase linearly with force, but follows an S curve centered at zero. So does electric current as a function of voltage in the resistors found in electronic circuits, or in a light bulb (until the filament melts, which is itself another phase transition). If you zoom out from an S curve, it approximates a step function, with the output suddenly changing from zero to one at the threshold. So depending on the input voltages, the same curve represents the workings of a transistor in both digital computers and analog devices like amplifiers and radio tuners. The early part of an S curve is effectively an exponential, and near the saturation point it approximates exponential decay. When someone talks about exponential growth, ask yourself: How soon will it turn into an S curve? When will the population bomb peter out, Moore's law lose steam, or the singularity fail to happen? Differentiate an S curve and you get a bell curve: slow, fast, slow becomes low, high, low. Add a succession of staggered upward and downward S curves, and you get something close to a sine wave. In fact, every function can be closely approximated by a sum of S curves: when the function goes up, you add an S curve; when it goes down, you subtract one. Children's learning is not a steady improvement but an accumulation of S curves. So is technological change. Squint at the New York City skyline and you can see a sum of S curves unfolding across the horizon, each as sharp as a skyscraper's corner.

Autoencoders were known in the 1980s, but they were very hard to learn, even though they had a single hidden layer.
Figuring out how to pack a lot of information into the same few bits is a hellishly difficult problem (one code for your grandmother, a slightly different one for your grandfather, another one for Jennifer Aniston, and so on). The landscape in hyperspace is just too rugged to get to a good peak; the hidden units need to learn what amounts to too many exclusive-ORs of the inputs. So autoencoders didn't really catch on. The trick that took over a decade to discover was to make the hidden layer larger than the input and output ones. Huh? Actually, that's only half the trick: the other half is to force all but a few of the hidden units to be off at any given time. This still prevents the hidden layer from just copying the input, and, crucially, it makes learning much easier. If we allow different bits to represent different inputs, the inputs no longer have to compete to set the same bits. Also, the network now has many more parameters, so the hyperspace you're in has many more dimensions, and you have many more ways to get out of what would otherwise be local maxima. This is called a sparse autoencoder, and it's a neat trick.

Evolutionaries and connectionists have something important in common: they both design learning algorithms inspired by nature. But then they part ways. Evolutionaries focus on learning structure; to them, fine-tuning an evolved structure by optimizing parameters is of secondary importance. In contrast, connectionists prefer to take a simple, hand-coded structure with lots of connections and let weight learning do all the work. This is machine learning's version of the nature-versus-nurture controversy, and there are good arguments on both sides.

The same idea still works if the graph is a tree instead of a chain. If instead of a platoon you're in command of a whole army, you can ask each of your company commanders how many soldiers are behind him and add up their answers.
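The commander's counting scheme is a short recursion; a toy sketch, where the command tree (ranks and names) is purely illustrative:

```python
# Each commander reports one (himself) plus the totals reported by his
# direct subordinates. On a tree, every person is counted exactly once.
def soldiers_behind(commander, subordinates):
    """Count a commander plus everyone below him in the command graph."""
    return 1 + sum(soldiers_behind(s, subordinates)
                   for s in subordinates.get(commander, []))

# An illustrative army: a general, two companies, platoons, soldiers.
army = {
    "general": ["company A", "company B"],
    "company A": ["platoon A1", "platoon A2"],
    "company B": ["platoon B1"],
    "platoon A1": ["soldier 1", "soldier 2"],
    "platoon A2": ["soldier 3"],
    "platoon B1": ["soldier 4", "soldier 5"],
}
print(soldiers_behind("general", army))  # 11: each person counted once

# But give the graph a loop - a liaison officer who belongs to two
# platoons - and the naive recursion counts him twice:
army["platoon A1"].append("liaison")
army["platoon B1"].append("liaison")
print(soldiers_behind("general", army))  # 13: the liaison counted twice
```

The same recursion that is exact on a tree silently overcounts as soon as two paths lead to the same node, which is the trouble with loops described next.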
Each company commander in turn asks each of his platoon commanders, and so on. But if the graph forms loops, you're in trouble. If there's a liaison officer who's a member of two platoons, he gets counted twice; in fact, everyone behind him gets counted twice. This is what happens in the "aliens have landed" scenario, if you want to compute, say, the probability of panic:

Decision trees are not immune to the curse of dimensionality either. Let's say the concept you're trying to learn is a sphere: points inside it are positive, and points outside it are negative. A decision tree can approximate a sphere by the smallest cube it fits inside. Not perfect, but not too bad either: only the corners of the cube get misclassified. But in high dimensions, almost the entire volume of the hypercube lies outside the hypersphere. For every example you correctly classify as positive, you incorrectly classify many negative ones as positive, causing your accuracy to plummet.

Again, you can probably surmise just by looking at this plot that the cities are on a bay, and if you draw a line running through them, you can locate each city using just one number: how far it is from San Francisco along that line. But PCA can't find this curve; instead, it draws a straight line running down the middle of the bay, where there are no cities at all. Far from elucidating the shape of the data, PCA obscures it.

Self-improvement aside, probably the first thing you'd want your model to do is negotiate the world on your behalf: let it loose in cyberspace, looking for all sorts of things for you. From all the world's books, it would suggest a dozen you might want to read next, with more insight than Amazon could dream of. Likewise for movies, music, games, clothes, electronics: you name it. It would keep your refrigerator stocked at all times, natch. It would filter your e-mail, voice mail, Facebook posts, and Twitter feed and, when appropriate, reply on your behalf.
It would take care of all the little annoyances of modern life for you, like checking credit-card bills, disputing improper charges, making arrangements, renewing subscriptions, and filling out tax returns. It would find a remedy for your ailment, run it by your doctor, and order it from Walgreens. It would bring interesting job opportunities to your attention, propose vacation spots, suggest which candidates to vote for on the ballot, and screen potential dates. And, after the match was made, it would team up with your date's model to pick some restaurants you might both like. Which is where things start to get really interesting.

Judea Pearl's pioneering work on Bayesian networks appears in his book Probabilistic Reasoning in Intelligent Systems (Morgan Kaufmann, 1988). "Bayesian networks without tears," by Eugene Charniak (AI Magazine, 1991), is a largely nonmathematical introduction to them. "Probabilistic interpretation for MYCIN's certainty factors," by David Heckerman (Proceedings of the Second Conference on Uncertainty in Artificial Intelligence, 1986), explains when sets of rules with confidence estimates are and aren't a reasonable approximation to Bayesian networks. "Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data," by Eran Segal et al. (Nature Genetics, 2003), is an example of using Bayesian networks to model gene regulation. "Microsoft virus fighter: Spam may be more difficult to stop than HIV," by Ben Paynter (Fast Company, 2012), tells how David Heckerman took inspiration from spam filters and used Bayesian networks to design a potential AIDS vaccine. The probabilistic or "noisy" OR is explained in Pearl's book. "Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base," by M. A. Shwe et al. (Parts I and II, Methods of Information in Medicine, 1991), describes a noisy-OR Bayesian network for medical diagnosis.
Google's Bayesian network for ad placement is described in Section 26.5.4 of Kevin Murphy's Machine Learning (MIT Press, 2012). Microsoft's player rating system is described in "TrueSkill™: A Bayesian skill rating system," by Ralf Herbrich, Tom Minka, and Thore Graepel (Advances in Neural Information Processing Systems 19, 2007).