Once the inevitable happens and learning algorithms become the middlemen, power becomes concentrated in them. Google's algorithms largely determine what information you find, Amazon's what products you buy, and Match.com's who you date. The last mile is still yours - choosing from among the options the algorithms present you with - but 99.9 percent of the selection was done by them. The success or failure of a company now depends on how much the learners like its products, and the success of a whole economy - whether everyone gets the best products for their needs at the best price - depends on how good the learners are.

Most of all, we have to worry about what the Master Algorithm could do in the wrong hands. The first line of defense is to make sure the good guys get it first - or, if it's not clear who the good guys are, to make sure it's open-sourced. The second is to realize that, no matter how good the learning algorithm is, it's only as good as the data it gets. He who controls the data controls the learner. Your reaction to the datafication of life should not be to retreat to a log cabin - the woods, too, are full of sensors - but to aggressively seek control of the data that matters to you. It's good to have recommenders that find what you want and bring it to you; you'd feel lost without them. But they should bring you what you want, not what someone else wants you to have. Control of data and ownership of the models learned from it is what many of the twenty-first century's battles will be about - between governments, corporations, unions, and individuals. But you also have an ethical duty to share data for the common good. Machine learning alone will not cure cancer; cancer patients will, by sharing their data for the benefit of future patients.

Newton's Principle: Whatever is true of everything we've seen is true of everything in the universe.

Or:
In his story "Funes the Memorious," Jorge Luis Borges tells of meeting a youth with perfect memory. This might at first seem like a great fortune, but it is in fact an awful curse. Funes can remember the exact shape of the clouds in the sky at an arbitrary time in the past, but he has trouble understanding that a dog seen from the side at 3:14 p.m. is the same dog seen from the front at 3:15 p.m. His own face in the mirror surprises him every time he sees it. Funes can't generalize; to him, two things are the same only if they look the same down to every last detail. An unrestricted rule learner is like Funes and is equally unable to function. Learning is forgetting the details as much as it is remembering the important parts. Computers are the ultimate idiot savants: they can remember everything with no trouble at all, but that's not what we want them to do.

[Figure: pic_7.jpg]

The key input to a genetic algorithm, as Holland's creation came to be known, is a fitness function. Given a candidate program and some purpose it is meant to fill, the fitness function assigns the program a numeric score reflecting how well it fits the purpose. In natural selection, it's questionable whether fitness can be interpreted this way: while the fitness of a wing for flight makes intuitive sense, evolution as a whole has no known purpose. Nevertheless, in machine learning having something like a fitness function is a no-brainer. If we need a program that can diagnose a patient, one that correctly diagnoses 60 percent of the patients in our database is better than one that only gets it right 55 percent of the time, and thus a possible fitness function is the fraction of correctly diagnosed cases.

We can get even fancier by allowing rules for intermediate concepts to evolve, and then chaining these rules at performance time.
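The diagnosis fitness function described above can be sketched in a few lines of Python. This is a minimal illustration, not Holland's actual encoding: the rule, the symptom dictionaries, and the patient records are all made-up assumptions.

```python
# A minimal sketch of a genetic algorithm's fitness function, assuming a
# candidate "program" is any function mapping symptoms to a diagnosis.
# The rule and the patient data below are illustrative, not from the book.

def fitness(candidate, patients):
    """Fraction of patients the candidate diagnoses correctly."""
    correct = sum(
        1 for symptoms, true_diagnosis in patients
        if candidate(symptoms) == true_diagnosis
    )
    return correct / len(patients)

# A toy candidate rule: diagnose flu if the patient has fever and cough.
def rule(symptoms):
    return "flu" if symptoms["fever"] and symptoms["cough"] else "healthy"

patients = [
    ({"fever": True,  "cough": True},  "flu"),
    ({"fever": True,  "cough": False}, "healthy"),
    ({"fever": False, "cough": False}, "healthy"),
    ({"fever": True,  "cough": True},  "healthy"),
]

print(fitness(rule, patients))  # 3 of 4 correct -> 0.75
```

In a genetic algorithm this score is what decides which candidate rules survive and get to breed into the next generation.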
For example, we could evolve the rules "If the e-mail contains the word loan, then it's a scam" and "If the e-mail is a scam, then it's spam." Since a rule's consequent is no longer always spam, this requires introducing additional bits in rule strings to represent their consequents. Of course, the computer doesn't literally use the word scam; it just comes up with some arbitrary bit string to represent the concept, but that's good enough for our purposes. Sets of rules like this, which Holland called classifier systems, are one of the workhorses of the machine-learning tribe he founded: the evolutionaries. Like multilayer perceptrons, classifier systems face the credit-assignment problem - what is the fitness of rules for intermediate concepts? - and Holland devised the so-called bucket brigade algorithm to solve it. Nevertheless, classifier systems are much less widely used than multilayer perceptrons.

In reality, a doctor doesn't diagnose the flu just based on whether you have a fever; she takes a whole bunch of symptoms into account, including whether you have a cough, a sore throat, a runny nose, a headache, chills, and so on. So what we really need to compute is P(flu | fever, cough, sore throat, runny nose, headache, chills, ...). By Bayes' theorem, we know that this is proportional to P(fever, cough, sore throat, runny nose, headache, chills, ... | flu). But now we run into a problem. How are we supposed to estimate this probability? If each symptom is a Boolean variable (you either have it or you don't) and the doctor takes n symptoms into account, a patient could have 2^n possible combinations of symptoms. If we have, say, twenty symptoms and a database of ten thousand patients, we've only seen a small fraction of the roughly one million possible combinations. Worse still, to accurately estimate the probability of a particular combination, we need at least tens of observations of it, meaning the database would need to include tens of millions of patients.
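The arithmetic behind those claims is easy to check. Here it is as a short sketch, assuming (as above) twenty Boolean symptoms, ten thousand patients, and ten observations per combination as the bare minimum for estimation:

```python
# The combinatorial explosion, by the numbers.
n_symptoms = 20
combinations = 2 ** n_symptoms           # Boolean symptoms: 2^n combinations
print(combinations)                      # 1048576 - roughly one million

patients_in_database = 10_000
# With ten thousand patients we can have seen at most about 1 percent
# of the possible symptom combinations:
print(patients_in_database / combinations)

# Estimating each combination's probability takes tens of observations;
# even at only ten per combination we already need:
print(10 * combinations)                 # 10485760 - tens of millions
```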
Add another ten symptoms, and we'd need more patients than there are people on Earth. With a hundred symptoms, even if we were somehow able to magically get the data, there wouldn't be enough space on all the hard disks in the world to store all the probabilities. And if a patient walks in with a combination of symptoms we haven't seen before, we won't know how to diagnose him. We're face-to-face with our old foe: the combinatorial explosion.

You might have noticed a certain resemblance between k-means and EM, in that they both alternate between assigning entities to clusters and updating the clusters' descriptions. This is not an accident: k-means itself is a special case of EM, which you get when all the attributes have "narrow" normal distributions, that is, normal distributions
with very small variance. When clusters overlap a lot, an entity could belong to, say, cluster A with a probability of 0.7 and cluster B with a probability of 0.3, and we can't just decide that it belongs to cluster A without losing information. EM takes this into account by fractionally assigning the entity to the two clusters and updating their descriptions accordingly. If the distributions are very concentrated, however, the probability that an entity belongs to the nearest cluster is always approximately 1, and all we have to do is assign entities to clusters and average the entities in each cluster to obtain its mean, which is just the k-means algorithm.

This type of curve is called a power law, because performance varies as time raised to some negative power. For example, in the figure above, time to completion is proportional to the number of trials raised to minus two (or, equivalently, one over the number of trials squared). Pretty much every human skill follows a power law, with different powers for different skills. (In contrast, Windows never gets faster with practice - something for Microsoft to work on.)

If we endow Robby the robot with all the learning abilities we've seen so far in this book, he'll be pretty smart but still a bit autistic. He'll see the world as a bunch of separate objects, which he can identify, manipulate, and even make predictions about, but he won't understand that the world is a web of interconnections. Robby the doctor would be very good at diagnosing someone with the flu based on his symptoms but unable to suspect that the patient has swine flu because he has been in contact with someone infected with it. Before Google, search engines decided whether a web page was relevant to your query by looking at its content - what else? Brin and Page's insight was that the strongest sign a page is relevant is that relevant pages link to it.
Similarly, if you want to predict whether a teenager is at risk of starting to smoke, by far the best thing you can do is check whether her close friends smoke. An enzyme's shape is as inseparable from the shapes of the molecules it brings together as a lock is from its key. Predator and prey have deeply entwined properties, each evolved to defeat the other's properties. In all of these cases, the best way to understand an entity - whether it's a person, an animal, a web page, or a molecule - is to understand how it relates to other entities. This requires a new kind of learning that doesn't treat the data as a random sample of unrelated objects but as a glimpse into a complex network. Nodes in the network interact; what you do to one affects the others and comes back to affect you. Relational learners, as they're called, may not quite have social intelligence, but they're the next best thing. In traditional statistical learning, every man is an island, entire of itself. In relational learning, every man is a piece of the continent, a part of the main. Humans are relational learners, wired to connect, and if we want Robby to grow into a perceptive, socially adept robot, we need to wire him to connect, too.

One of the cleverest metalearners is boosting, created by two learning theorists, Yoav Freund and Rob Schapire. Instead of combining different learners, boosting repeatedly applies the same classifier to the data, using each new model to correct the previous ones' mistakes. It does this by assigning weights to the training examples; the weight of each misclassified example is increased after each round of learning, causing later rounds to focus more on it. The name boosting comes from the notion that this process can boost a classifier that's only slightly better than random guessing, but consistently so, into one that's almost perfect.

You've reached the final stage of your quest. You knock on the door of the Tower of Support Vectors.
A menacing-looking guard opens it, and you suddenly realize that you don't know the password. "Kernel," you blurt out, trying to keep the panic from your voice. The guard bows and steps aside. Regaining your composure, you step in, mentally kicking yourself for your carelessness. The entire ground floor of the tower is taken up by a lavishly appointed circular chamber, with what seems to be a marble representation of an SVM occupying pride of place at the center. As you walk around it, you notice a door on the far side. It must lead to the central tower - the Tower of the Master Algorithm. The door seems unguarded. You decide to take a shortcut. Slipping through the doorway, you walk down a short corridor and find yourself in an even larger pentagonal chamber, with a door in each wall. In the center, a spiral staircase rises as high as the eye can see. You hear voices above and duck into the doorway opposite. This one leads to the Tower of Neural Networks. Once again you're in a circular chamber, this one with a sculpture of a multilayer perceptron as the centerpiece. Its parts are different from the SVM's, but their arrangement is remarkably similar. Suddenly you see it: an SVM is just a multilayer perceptron with a hidden layer composed of kernels instead of S curves and an output that's a linear combination instead of another S curve.

Principal-component analysis is one of the oldest techniques in machine learning and statistics, having been first proposed by Karl Pearson in 1901 in the paper "On lines and planes of closest fit to systems of points in space"* (Philosophical Magazine). The type of dimensionality reduction used to grade SAT essays was introduced by Scott Deerwester et al. in the paper "Indexing by latent semantic analysis"* (Journal of the American Society for Information Science, 1990).
Yehuda Koren, Robert Bell, and Chris Volinsky explain how Netflix-style collaborative filtering works in "Matrix factorization techniques for recommender systems"* (IEEE Computer, 2009). The Isomap algorithm was introduced in "A global geometric framework for nonlinear dimensionality reduction,"* by Josh Tenenbaum, Vin de Silva, and John Langford (Science, 2000).