Machine learning is something new under the sun: a technology that builds itself. Ever since our remote ancestors started sharpening stones into tools, humans have been designing artifacts, whether they're hand built or mass produced. But learning algorithms are artifacts that design other artifacts. "Computers are useless," said Picasso. "They can only give you answers." Computers aren't supposed to be creative; they're supposed to do what you tell them to. If what you tell them to do is be creative, you get machine learning. A learning algorithm is like a master craftsman: every one of its productions is different and exquisitely tailored to the customer's needs. But instead of turning stone into masonry or gold into jewelry, learners turn data into algorithms. And the more data they have, the more intricate the algorithms can be.

The argument from neuroscience. The most important argument for the brain being the Master Algorithm, however, is that it's responsible for everything we can perceive and imagine. If something exists but the brain can't learn it, we don't know it exists. We may just not see it, or we may think it's random. Either way, if we implement the brain in a computer, that algorithm can learn everything we can. Thus one route (arguably the most popular one) to inventing the Master Algorithm is to reverse engineer the brain. Jeff Hawkins took a stab at this in his book On Intelligence. Ray Kurzweil pins his hopes for the Singularity (the rise of artificial intelligence that greatly exceeds the human variety) on doing just that, and takes a stab at it himself in his book How to Create a Mind. Nevertheless, this is only one of several possible approaches, as we'll see. It's not even necessarily the most promising one, because the brain is phenomenally complex and we're still in the very early stages of deciphering it. On the other hand, if we can't figure out the Master Algorithm, the Singularity won't happen any time soon.
The argument from computer science. The problem is not limited to memorizing instances wholesale. Whenever a learner finds a pattern in the data that is not actually true in the real world, we say that it has overfit the data. Overfitting is the central problem in machine learning. More papers have been written about it than about any other topic. Every powerful learner, whether symbolist, connectionist, or any other, has to worry about hallucinating patterns. The only safe way to avoid it is to severely restrict what the learner can learn, for example by requiring that it be a short conjunctive concept. Unfortunately, that throws out the baby with the bathwater, leaving the learner unable to see most of the true patterns that are visible in the data. Thus a good learner is forever walking the narrow path between blindness and hallucination.

As you stare uncomprehendingly at it, your Google Glass helpfully flashes: "Bayes' theorem." Now the crowd starts to chant "More data! More data!" A stream of sacrificial victims is being inexorably pushed toward the altar. Suddenly, you realize that you're in the middle of it. Too late. As the crank looms over you, you scream, "No! I don't want to be a data point! Let me gooooo!"

If we measure not just the probability of vowels versus consonants, but the probability of each letter in the alphabet following each other, we can have fun generating new texts with the same statistics as Onegin: choose the first letter, then choose the second based on the first, and so on. The result is complete gibberish, of course, but if we let each letter depend on several previous letters instead of just one, it starts to sound more like the ramblings of a drunkard, locally coherent even if globally meaningless.
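This letter-by-letter scheme is an order-k Markov chain over characters. A minimal sketch (the toy corpus and the order of 3 are illustrative choices, not anything from the text):

```python
import random
from collections import defaultdict

def train_char_model(text, order=3):
    """For each context of `order` letters, record every letter that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context].append(text[i + order])
    return model

def generate(model, order, length=200, seed=None):
    """Pick a starting context, then repeatedly sample the next letter in
    proportion to how often it followed the current context in training."""
    rng = random.Random(seed)
    context = rng.choice(list(model.keys()))
    out = list(context)
    for _ in range(length - order):
        followers = model.get(context)
        if not followers:  # dead end: restart from a random context
            context = rng.choice(list(model.keys()))
            out.extend(context)
            continue
        nxt = rng.choice(followers)  # a list of occurrences, so sampling is by frequency
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)

corpus = "the quick brown fox jumps over the lazy dog " * 20
model = train_char_model(corpus, order=3)
print(generate(model, order=3, length=80, seed=0))
```

Raising the order makes the output locally more coherent, exactly as described: with order 1 each letter depends only on its predecessor; with order 5 or so, whole words start to appear.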
Still not enough to pass the Turing test, but models like this are a key component of machine-translation systems, like Google Translate, which lets you see the whole web in English (or almost), regardless of the language the pages were originally written in.

Constrained optimization is the problem of maximizing or minimizing a function subject to constraints. The universe maximizes entropy subject to keeping energy constant. Problems of this type are widespread in business and technology. For example, we may want to maximize the number of widgets a factory produces, subject to the number of machine tools available, the widgets' specs, and so on. With SVMs, constrained optimization became crucial for machine learning as well. Unconstrained optimization is getting to the top of the mountain, and that's what gradient descent (or, in this case, ascent) does. Constrained optimization is going as high as you can while staying on the road. If the road goes up to the very top, the constrained and unconstrained problems have the same solution. More often, though, the road zigzags up the mountain and then back down without ever reaching the top. You know you've reached the highest point on the road when you can't go any higher without driving off the road; in other words, when the path to the top is at right angles to the road. If the road and the path to the top form an oblique angle, you can always get higher by driving farther along the road, even if that doesn't get you higher as quickly as aiming straight for the top of the mountain. So the way to solve a constrained optimization problem is to follow not the gradient but the part of it that's parallel to the constraint surface (in this case the road) and stop when that part is zero.
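In code, "follow only the part of the gradient that lies along the road" is projected gradient ascent. A minimal sketch under toy assumptions of my own: the mountain is f(x, y) = -(x - 2)^2 - (y - 3)^2, with its peak at (2, 3), and the road is the line x + y = 1, which does not pass through the peak:

```python
import math

# The "mountain": gradient of f(x, y) = -(x - 2)^2 - (y - 3)^2, peak at (2, 3).
def grad_f(x, y):
    return (-2.0 * (x - 2.0), -2.0 * (y - 3.0))

# The "road": the line x + y = 1, with unit direction vector d.
d = (1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0))

x, y = 0.5, 0.5  # start somewhere on the road
step = 0.1
for _ in range(200):
    gx, gy = grad_f(x, y)
    along = gx * d[0] + gy * d[1]  # the part of the gradient parallel to the road
    if abs(along) < 1e-9:          # gradient perpendicular to road: highest point on it
        break
    x += step * along * d[0]
    y += step * along * d[1]

print((round(x, 4), round(y, 4)))  # the constrained optimum, (0.0, 1.0)
```

Each step moves along the road by the gradient's component in that direction, and the loop stops exactly when that component vanishes, i.e., when the path to the top is at right angles to the road. With SVMs the constraint surface is more complicated and Lagrange multipliers handle the projection analytically, but the stopping condition is the same.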
So far we've only seen how to learn one level of clusters, but the world is, of course, much richer than that, with clusters within clusters all the way down to individual objects: living things cluster into plants and animals, animals into mammals, birds, fishes, and so on, all the way down to Fido the family dog. No problem: once we've learned one set of clusters, we can treat them as objects and cluster them in turn, and so on up to the cluster of all things. Alternatively, we can start with a coarse clustering and then further divide each cluster into subclusters: Robby's toys divide into stuffed animals, construction toys, and so on; stuffed animals into teddy bears, plush kittens, and so on. Children seem to start out in the middle and then work their way up and down. For example, they learn dog before they learn animal or beagle. This might be a good strategy for Robby as well.

Whether it's data pouring into Robby's brain through his senses or the click streams of millions of Amazon customers, grouping a large number of entities into a smaller number of clusters is only half the battle. The other half is shortening the description of each entity. The very first picture of Mom that Robby sees comprises perhaps a million pixels, each with its own color, but you hardly need a million variables to describe a face. Likewise, each thing you click on at Amazon provides an atom of information about you, but what Amazon would really like to know is your likes and dislikes, not your clicks. The former, which are fairly stable, are somehow immanent in the latter, which grow without limit as you use the site. Little by little, all those clicks should add up to a picture of your taste, in the same way that all those pixels add up to a picture of your face. The question is how to do the adding.

Despite its successes, Alchemy has some significant shortcomings.
It does not yet scale to truly big data, and someone without a PhD in machine learning will find it hard to use. Because of these problems, it's not yet ready for prime time. But let's see what we can do about them.

A better way for all concerned is to focus on your specific, unusual attributes that are highly predictive of a match, in the sense that they pick out people you like that not everyone else does, and that you therefore have less competition for. Your job (and your prospective date's) is to provide these attributes. The matcher's job is to learn from them, in the same way that an old-fashioned matchmaker would. Compared to a village matchmaker, Match.com's algorithm has the advantage that it knows vastly more people, but the disadvantage is that it knows them much more superficially. A naïve learner, such as a perceptron, will be content with broad generalizations like "gentlemen prefer blondes." A more sophisticated one will find patterns like "people with the same unusual musical tastes are often good matches." If Alice and Bob both like Beyoncé, that alone hardly singles them out for each other. But if they both like Bishop Allen, that makes them at least a little bit more likely to be potential soul mates. If they're both fans of a band the learner does not know about, that's even better, but only a relational algorithm like Alchemy can pick it up. The better the learner, the more it's worth your time to teach it about you. But as a rule of thumb, you want to differentiate yourself enough that it won't confuse you with the "average person" (remember Bob Burns from Chapter 8), but not be so unusual that it can't fathom you.

In sum, all four kinds of data sharing have problems. These problems all have a common solution: a new type of company that is to your data what your bank is to your money. Banks don't steal your money (with rare exceptions). They're supposed to invest it wisely, and your deposits are FDIC-insured.
Many companies today offer to consolidate your data somewhere in the cloud, but they're still a far cry from your personal data bank. If they're cloud providers, they try to lock you in, a big no-no. (Imagine depositing your money with Bank of America and not knowing if you'll be able to transfer it to Wells Fargo somewhere down the line.) Some startups offer to hoard your data and then mete it out to advertisers in return for discounts, but to me that misses the point. Sometimes you want to give information to advertisers for free because it's in your interests, sometimes you don't want to give it at all, and what to share when is a problem that only a good model of you can solve.

To sidestep the problem that infinitely dense points don't exist, Kurzweil proposes to instead equate the Singularity with a black hole's event horizon, the region within which gravity is so strong that not even light can escape. Similarly, he says, the Singularity is the point beyond which technological evolution is so fast that humans cannot predict or understand what will happen. If that's what the Singularity is, then we're already inside it. We can't predict in advance what a learner will come up with, and often we can't even understand it in retrospect. As a matter of fact, we've always lived in a world that we only partly understood. The main difference is that our world is now partly created by us, which is surely an improvement. The world beyond the Turing point will not be incomprehensible to us, any more than the Pleistocene was. We'll focus on what we can understand, as we always have, and call the rest random (or divine).