The second goal of this book is thus to enable you to invent the Master Algorithm. You'd think this would require heavy-duty mathematics and severe theoretical work. On the contrary, what it requires is stepping back from the mathematical arcana to see the overarching pattern of learning phenomena; and for this the layman, approaching the forest from a distance, is in some ways better placed than the specialist, already deeply immersed in the study of particular trees. Once we have the conceptual solution, we can fill in the mathematical details; but that is not for this book, and not the most important part. Thus, as we visit each tribe, our goal is to gather its piece of the puzzle and understand where it fits, mindful that none of the blind men can see the whole elephant. In particular, we'll see what each tribe can contribute to curing cancer, and also what it's missing. Then, step-by-step, we'll assemble all the pieces into the solution, or rather into a solution that is not yet the Master Algorithm but is the closest anyone has come, and hopefully makes a good launch pad for your imagination. And we'll preview the use of this algorithm as a weapon in the fight against cancer. As you read the book, feel free to skim or skip any parts you find troublesome; it's the big picture that matters, and you'll probably get more out of those parts if you revisit them after the puzzle is assembled.

We can think of machine learning as the inverse of programming, in the same way that the square root is the inverse of the square, or integration is the inverse of differentiation. Just as we can ask "What number squared gives 16?" or "What is the function whose derivative is x + 1?" we can ask, "What is the algorithm that produces this output?" We will soon see how to turn this insight into concrete learning algorithms.

Bayes' theorem is a machine that turns data into knowledge.
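For reference (the formula itself is not spelled out in the text), the theorem fits on one line: the prior is what you believe before seeing the data, and the posterior is what you believe after.

```latex
P(\text{hypothesis} \mid \text{data}) =
  \frac{P(\text{data} \mid \text{hypothesis}) \; P(\text{hypothesis})}{P(\text{data})}
```

Turning data into knowledge, in this view, is just repeatedly updating the prior into the posterior as evidence arrives.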
According to Bayesian statisticians, it's the only correct way to turn data into knowledge. If they're right, either Bayes' theorem is the Master Algorithm or it's the engine that drives it. Other statisticians have serious reservations about the way Bayes' theorem is used and prefer different ways to learn from data. In the days before computers, Bayes' theorem could only be applied to very simple problems, and the idea of it as a universal learner would have seemed far-fetched. With big data and big computing to go with it, however, Bayes can find its way in vast hypothesis spaces and has spread to every conceivable field of knowledge. If there's a limit to what Bayes can learn, we haven't found it yet.

Priming the knowledge pump

Perceptrons was mathematically unimpeachable, searing in its clarity, and disastrous in its effects. Machine learning at the time was associated mainly with neural networks, and most researchers (not to mention funders) concluded that the only way to build an intelligent system was to explicitly program it. For the next fifteen years, knowledge engineering would hold center stage, and machine learning seemed to have been consigned to the ash heap of history.

One consequence of crossing over program trees instead of bit strings is that the resulting programs can have any size, which makes the learning more flexible. The overall tendency is toward bloat, however, with larger and larger trees growing as evolution goes on longer (also known as "survival of the fattest"). Evolutionaries can take comfort from the fact that human-written programs are no different (Microsoft Windows: forty-five million lines of code and counting), and that human-made code doesn't allow a solution as simple as adding a complexity penalty to the fitness function.

A basic search engine also uses an algorithm quite similar to Naïve Bayes to decide which web pages to return in answer to your query.
The main difference is that, instead of spam/not-spam, it's trying to predict relevant/not-relevant. The list of prediction problems Naïve Bayes has been applied to is practically endless. Peter Norvig, director of research at Google, told me at one point that it was the most widely used learner there, and Google uses machine learning in every nook and cranny of what it does. It's not hard to see why Naïve Bayes would be popular among Googlers. Surprising accuracy aside, it scales great; learning a Naïve Bayes classifier is just a matter of counting how many times each attribute co-occurs with each class, and takes barely longer than reading the data from disk.

Given all this, it's not surprising that analogy plays a prominent role in machine learning. It got off to a slow start, though, and was initially overshadowed by neural networks. Its first algorithmic incarnation appeared in an obscure technical report written in 1951 by two Berkeley statisticians, Evelyn Fix and Joe Hodges, and was not published in a mainstream journal until decades later. But in the meantime, other papers on Fix and Hodges's algorithm started to appear and then to multiply until it was one of the most researched in all of computer science. The nearest-neighbor algorithm, as it's called, is the first stop on our tour of analogy-based learning. The second is support vector machines, an idea that took machine learning by storm around
the turn of the millennium and was only recently overshadowed by deep learning. The third and last is full-blown analogical reasoning, which has been a staple of psychology and AI for several decades, and a background theme in machine learning for nearly as long.

Generally, the fewer support vectors an SVM selects, the better it generalizes. Any training example that is not a support vector would be correctly classified if it showed up as a test example instead, because the frontier between positive and negative examples would still be in the same place. So the expected error rate of an SVM is at most the fraction of examples that are support vectors. As the number of dimensions goes up, this fraction tends to go up as well, so SVMs are not immune to the curse of dimensionality. But they're more resistant to it than most.

Machine learners call this process dimensionality reduction because it reduces a large number of visible dimensions (the pixels) to a few implicit ones (expression, facial features). Dimensionality reduction is essential for coping with big data, like the data coming in through your senses every second. A picture may be worth a thousand words, but it's also a million times more costly to process and remember. Yet somehow your visual cortex does a pretty good job of whittling it down to a manageable amount of information, enough to navigate the world, recognize people and things, and remember what you saw. It's one of the great miracles of cognition, and so natural you're not even conscious of doing it.

Nevertheless, it's still the case that most shops are pretty close to University Avenue, and if you were allowed only one number to locate a shop, its distance from the Caltrain station along the avenue would be a pretty good choice: after walking that distance, looking around is probably enough to find the shop. So you've just reduced the dimensionality of "shop locations in Palo Alto" from two to one.
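The shop example can be sketched in a few lines of code: given 2-D coordinates, find the direction of greatest spread (here, roughly "along the avenue") and replace each coordinate pair with a single number, its projection onto that direction. The coordinates below are made up for illustration, and this particular method is the first component of principal component analysis, one standard dimensionality-reduction technique; the book's argument doesn't hinge on this specific algorithm.

```python
import math

# Hypothetical shop locations: (km east of the Caltrain station, km north of the avenue).
points = [(0.1, 0.0), (0.5, 0.1), (1.0, 0.05), (1.5, 0.2), (2.0, 0.1)]

n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n

# Covariance terms of the centered data.
sxx = sum((x - mx) ** 2 for x, _ in points) / n
syy = sum((y - my) ** 2 for _, y in points) / n
sxy = sum((x - mx) * (y - my) for x, y in points) / n

# Principal axis of a 2-D cloud: tan(2*theta) = 2*sxy / (sxx - syy).
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
ux, uy = math.cos(theta), math.sin(theta)

# Each shop's single new coordinate: its projection onto the principal axis.
coords_1d = [(x - mx) * ux + (y - my) * uy for x, y in points]
```

Each shop is now located by one number instead of two, and shops far apart along the street stay far apart in the single coordinate, which is exactly what the distance-along-University-Avenue trick exploits.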
[Image: pic_32.jpg]

Rosenbloom and Newell set their chunking program to work on a series of problems, measured the time it took in each trial, and lo and behold, out popped a series of power law curves. But that was only the beginning. Next they incorporated chunking into Soar, a general theory of cognition that Newell had been working on with John Laird, another one of his students. Instead of working only within a predefined hierarchy of goals, the Soar program could define and solve a new subproblem every time it hit a snag. Once it formed a new chunk, Soar generalized it to apply to similar problems, in a manner similar to inverse deduction. Chunking in Soar turned out to be a good model of lots of learning phenomena besides the power law of practice. It could even be applied to learning new knowledge by chunking data and analogies. This led Newell, Rosenbloom, and Laird to hypothesize that chunking is the only mechanism needed for learning; in other words, the Master Algorithm.

Now that you've toured the machine learning wonderland, let's switch gears and see what it all means to you. Like the red pill in The Matrix, the Master Algorithm is the gateway to a different reality: the one you already live in but didn't know it yet. From dating to work, from self-knowledge to the future of society, from data sharing to war, and from the dangers of AI to the next step in evolution, a new world is taking shape, and machine learning is the key that unlocks it. This chapter will help you make the most of it in your life and be ready for what comes next. Machine learning will not single-handedly determine the future, any more than any other technology will; it's what we decide to do with it that counts, and now you have the tools to decide.

The term singularity comes from mathematics, where it denotes a point at which a function becomes infinite. For example, the function 1/x has a singularity when x is 0, because 1 divided by 0 is infinity.
In physics, the quintessential example of a singularity is a black hole: a point of infinite density, where a finite amount of matter is crammed into infinitesimal space. The only problem with singularities is that they don't really exist. (When did you last divide a cake among zero people, and each one got an infinite slice?) In physics, if a theory predicts something is infinite, something's wrong with the theory. Case in point: general relativity presumably predicts that black holes have infinite density only because it ignores quantum effects. Likewise, intelligence cannot continue to increase forever. Kurzweil acknowledges this, but he points to a series of exponential curves in technology improvement (processor speed, memory capacity, and so on) and argues that the limits to this growth are so far away that we need not concern ourselves with them.