The power of machine learning is perhaps best explained by a low-tech analogy: farming. In an industrial society, goods are made in factories, which means that engineers have to figure out exactly how to assemble them from their parts, how to make those parts, and so on, all the way to raw materials. It’s a lot of work. Computers are the most complex goods ever invented, and designing them, the factories that make them, and the programs that run on them is a ton of work. But there’s another, much older way in which we can get some of the things we need: by letting nature make them. In farming, we plant the seeds, make sure they have enough water and nutrients, and reap the grown crops. Why can’t technology be more like this? It can, and that’s the promise of machine learning. Learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine-learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.

Cyberwar is an instance of asymmetric warfare, where one side can’t match the other’s conventional military power but can still inflict grievous damage. A handful of terrorists armed with little more than box cutters can knock down the Twin Towers and kill thousands of innocents. All the biggest threats to US security today are in the realm of asymmetric warfare, and there’s an effective weapon against all of them: information. If the enemy can’t hide, he can’t survive. The good news is that we have plenty of information, and that’s also the bad news.

Even more astonishing than the breadth of applications of machine learning is that it’s the same algorithms doing all of these different things. Outside of machine learning, if you have two different problems to solve, you need to write two different programs.
They might use some of the same infrastructure, like the same programming language or the same database system, but a program to, say, play chess is of no use if you want to process credit-card applications. In machine learning, the same algorithm can do both, provided you give it the appropriate data to learn from. In fact, just a few algorithms are responsible for the great majority of machine-learning applications, and we’ll take a look at them in the next few chapters.

Bottom line: learning is a race between the amount of data you have and the number of hypotheses you consider. More data exponentially reduces the number of hypotheses that survive, but if you start with a lot of them, you may still have some bad ones left at the end. As a rule of thumb, if the learner only considers an exponential number of hypotheses (for example, all possible conjunctive concepts), then the data’s exponential payoff cancels it and you’re OK, provided you have plenty of examples and not too many attributes. On the other hand, if it considers a doubly exponential number (for example, all possible rule sets), then the data cancels only one of the exponentials and you’re still in trouble. You can even figure out in advance how many examples you’ll need to be pretty sure that the learner’s chosen hypothesis is very close to the true one, provided it fits all the data; in other words, for the hypothesis to be probably approximately correct. Harvard’s Leslie Valiant received the Turing Award, the Nobel Prize of computer science, for inventing this type of analysis, which he describes in his book entitled, appropriately enough, Probably Approximately Correct.

For each pair of facts, we construct the rule that allows us to infer the second fact from the first one and generalize it by Newton’s principle. When the same general rule is induced over and over again, we can have some confidence that it’s true.

We can also induce rules purely from other rules.
If we know that all philosophers are human and mortal, we can induce that all humans are mortal. (We don’t induce that all mortals are human because we know other mortal creatures, like cats and dogs. On the other hand, scientists, artists, and so on are also human and mortal, reinforcing the rule.) In general, the more rules and facts we start out with, the more opportunities we have to induce new rules using “inverse deduction.” And the more rules we induce, the more rules we can induce. It’s a virtuous circle of knowledge creation, limited only by overfitting risk and computational cost. But here, too, having initial knowledge helps: if instead of one large hole we have many small ones to fill, our induction steps will be less risky and therefore less likely to overfit. (For example, given the same number of examples, inducing that all philosophers are human is less risky than inducing that all humans are mortal.)

In the meantime, one important application of inverse deduction is predicting whether new drugs will have harmful side effects. Failure during animal testing and clinical trials is the main reason new drugs take many years and billions of dollars to develop. By generalizing from known toxic molecular structures, we can form rules that quickly weed out many apparently promising compounds, greatly increasing the chances of successful trials on the remaining ones.

In an early demonstration of the power of backprop, Terry Sejnowski and Charles Rosenberg trained a multilayer perceptron to read aloud. Their NETtalk system scanned the text, selected the correct phonemes according to context, and fed them to a speech synthesizer. NETtalk not only generalized accurately to new words, which knowledge-based systems could not, but it learned to speak in a remarkably human-like way.
Sejnowski used to mesmerize audiences at research meetings by playing a tape of NETtalk’s progress: babbling at first, then starting to make sense, then speaking smoothly with only the occasional error. (You can find samples on YouTube by typing “sejnowski nettalk.”)

The path to optimal learning begins with a formula that many people have heard of: Bayes’ theorem. But here we’ll see it in a whole new light and realize that it’s vastly more powerful than you’d guess from its everyday uses. At heart, Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down. For example, if you test positive for AIDS, your probability of having it goes up. Things get more interesting when you have many pieces of evidence, such as the results of multiple tests. To combine them all without suffering a combinatorial explosion, we need to make simplifying assumptions. Things get even more interesting when we consider many hypotheses at once, such as all the different possible diagnoses for a patient. Computing the probability of each disease from the patient’s symptoms in a reasonable amount of time can take a lot of smarts. Once we know how to do all these things, we’ll be ready to learn the Bayesian way. For Bayesians, learning is “just” another application of Bayes’ theorem, with whole models as the hypotheses and the data as the evidence: as you see more data, some models become more likely and some less, until ideally one model stands out as the clear winner. Bayesians have invented fiendishly clever kinds of models. So let’s get started.

There’s a big snag in all of this, unfortunately. Just because a Bayesian network lets us compactly represent a probability distribution doesn’t mean we can also reason efficiently with it. Suppose you want to compute P(Burglary | Bob called, Claire didn’t).
By Bayes’ theorem, you know this is just P(Burglary) P(Bob called, Claire didn’t | Burglary) / P(Bob called, Claire didn’t), or equivalently, P(Burglary, Bob called, Claire didn’t) / P(Bob called, Claire didn’t). If you had the full table with the probabilities of all states, you could obtain both of these probabilities by adding up the corresponding lines in the table. For example, P(Bob called, Claire didn’t) is the sum of the probabilities of all the lines where Bob calls and Claire doesn’t. But the Bayesian network doesn’t give you the full table. You could always construct it from the individual tables, but that takes exponential time and space. What we really want is to compute P(Burglary | Bob called, Claire didn’t) without building the full table. That, in a nutshell, is the problem of inference in Bayesian networks.

Practical successes aside, SVMs also turned a lot of machine-learning conventional wisdom on its head. For example, they gave the lie to the notion, sometimes misidentified with Occam’s razor, that simpler models are more accurate. On the contrary, an SVM can have an infinite number of parameters and still not overfit, provided it has a large enough margin.

Here’s a challenge: you have fifteen minutes to combine decision trees, multilayer perceptrons, classifier systems, Naïve Bayes, and SVMs into a single algorithm possessing the best properties of each. Quick: what can you do? Clearly, it can’t involve the details of the individual algorithms; there’s no time for that. But how about the following? Think of each learner as an expert on a committee. Each looks carefully at the instance to be classified (what is the diagnosis for this patient?) and confidently makes its prediction. You’re not an expert yourself, but you’re the chair of the committee, and your job is to combine their recommendations into a final decision.
What you have on your hands is in fact a new classification problem, where instead of the patient’s symptoms, the input is the experts’ opinions. But you can apply machine learning to this problem in the same way the experts applied it to the original one. We call this metalearning because it’s learning about the learners. The metalearner can itself be any learner, from a decision tree to a simple weighted vote. To learn the weights, or the decision tree, we replace the attributes of each original example by the learners’ predictions. Learners that often predict the correct class will get high weights, and inaccurate ones will tend to be ignored. With a decision tree, the choice of whether to use a learner can be contingent on other learners’ predictions. Either way, to obtain a learner’s prediction for a given training example, we must first apply it to the original training set excluding that example and use the resulting classifier; otherwise the committee risks being dominated by learners that overfit, since they can predict the correct class just by remembering it. The Netflix Prize winner used metalearning to combine hundreds of different learners. Watson uses it to choose its final answer from the available candidates. Nate Silver combines polls in a similar way to predict election results.

Sasha Issenberg’s The Victory Lab (Broadway Books, 2012) dissects the use of data analysis in politics. “How President Obama’s campaign used big data to rally individual votes,” by the same author (MIT Technology Review, 2013), tells the story of its greatest success to date. Nate Silver’s The Signal and the Noise (Penguin Press, 2012) has a chapter on his poll aggregation method.

“The unreasonable effectiveness of data,” by Alon Halevy, Peter Norvig, and Fernando Pereira (IEEE Intelligent Systems, 2009), argues for machine learning as the new discovery paradigm. Benoît Mandelbrot explores the fractal geometry of nature in the eponymous book, The Fractal Geometry of Nature (Freeman, 1982).
James Gleick’s Chaos (Viking, 1987) discusses and depicts the Mandelbrot set. The Langlands program, a research effort that seeks to unify different subfields of mathematics, is described in Love and Math, by Edward Frenkel (Basic Books, 2014). The Golden Ticket, by Lance Fortnow (Princeton University Press, 2013), is an introduction to NP-completeness and the P = NP problem. The Annotated Turing, by Charles Petzold (Wiley, 2008), explains Turing machines by revisiting Turing’s original paper on them.
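
As a rough illustration of the metalearning idea described earlier (a committee of learners combined by a weighted vote, where each learner’s prediction for a training example comes from a model trained without that example), here is a minimal Python sketch. Everything in it is an assumption made for illustration and not taken from the text: the toy data, the two base learners (`fit_memorizer`, a 1-nearest-neighbour rule, and `fit_stump`, a single-threshold rule), and the choice of held-out accuracy as the vote weight. Other weighting schemes, or a decision tree as the metalearner, would fit the description equally well.

```python
def fit_memorizer(train):
    """1-nearest-neighbour: it can always recall a training label it has seen."""
    def predict(x):
        return min(train, key=lambda ex: abs(ex[0] - x))[1]
    return predict


def fit_stump(train):
    """Threshold rule: predict 1 when x >= t, with t minimizing training errors."""
    def errors(t):
        return sum(int(x >= t) != y for x, y in train)
    best = min(sorted({x for x, _ in train}), key=errors)
    return lambda x: int(x >= best)


def held_out_predictions(fit, data):
    """Predict each example with a model trained on all the other examples."""
    return [fit(data[:i] + data[i + 1:])(x) for i, (x, _) in enumerate(data)]


def accuracy(predictions, data):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, (_, y) in zip(predictions, data)) / len(data)


def fit_committee(learners, data):
    """Weighted vote; each weight is a learner's held-out accuracy (an assumption)."""
    weights = [accuracy(held_out_predictions(fit, data), data) for fit in learners]
    models = [fit(data) for fit in learners]  # final models use all the data

    def predict(x):
        vote_for_1 = sum(w for w, m in zip(weights, models) if m(x) == 1)
        return int(vote_for_1 > sum(weights) / 2)

    return predict, weights


# Toy data invented for illustration: label is 1 for x >= 6, with noise at x=2 and x=9.
labels = [0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1]
data = list(enumerate(labels))
learners = [fit_memorizer, fit_stump]

# In-sample accuracy rewards memorization; held-out accuracy does not.
in_sample = [accuracy([fit(data)(x) for x, _ in data], data) for fit in learners]
committee, weights = fit_committee(learners, data)

print("in-sample accuracy:", in_sample)
print("held-out accuracy (vote weights):", weights)
print("committee on x=2 and x=10:", committee(2), committee(10))
```

On this toy data the memorizer is perfect in-sample but scores lower than the stump once each example is held out, so the stump receives the larger weight and the committee does not repeat the noisy label at x=2. Weighting by in-sample accuracy instead would hand control to the learner that merely remembers the answers, which is the failure the held-out step is meant to prevent.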