Statistical Data Mining Tutorials

Tutorial Slides by Andrew Moore

Advertisement: In 2006 I joined Google. We are growing a Google Pittsburgh office on CMU's campus. We are hiring creative computer scientists who love programming, and Machine Learning is one of the focus areas of the office. We're also currently accepting resumes for Fall 2008 internships. If you might be interested, feel welcome to send me email: awm@google.com .

The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.

These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and case-based (aka non-parametric) learning. They include regression algorithms such as multivariate polynomial regression, MARS, Locally Weighted Regression, GMDH and neural nets. And they include other data mining operations such as clustering (mixture models, k-means and hierarchical), Bayesian networks and Reinforcement Learning.

I hope they're useful (and please let me know if they are, or if you have suggestions or error-corrections). Click here for a short list of topics.

  • Decision Trees. The Decision Tree is one of the most popular classification algorithms in current use in Data Mining and Machine Learning. This tutorial can be used as a self-contained introduction to the flavor and terminology of data mining without needing to review many statistical or probabilistic pre-requisites. If you're new to data mining you'll enjoy it, but your eyebrows will rise at how simple it all is! After having defined the job of classification, we explain how information gain (next Andrew Tutorial) can be used to find predictive input attributes. We show how applying this procedure recursively allows us to build a decision tree to predict future events. We then look carefully at a question so fundamental, it is the basis for much of all statistics and machine learning theory: how do you choose between a complicated model that fits the data really well and an "Occam's razor" model that is succinct yet not so good at fitting data (this topic will be revisited in later Andrew Lectures, including "Cross-validation" and "VC-dimension"). We also discuss the very wide world of improvements and tweaks on the basic decision tree idea.
  • Information Gain. This tutorial steps through the ideas from Information Theory that eventually lead to Information Gain...one of the most popular measures of association currently used in data mining. We visit the ideas of Entropy and Conditional Entropy along the way. Look at the lecture on Gaussians for discussion of Entropy in the case of continuous probability density functions.
  • Probability for Data Miners. This tutorial reviews Probability starting right at ground level. It is, arguably, a useful investment to be completely happy with probability before venturing into advanced algorithms from data mining, machine learning or applied statistics. In addition to setting the stage for techniques to be used over and over again throughout the remaining tutorials, this tutorial introduces the notion of Density Estimation as an important operation, and then introduces Bayesian Classifiers such as the overfitting-prone Joint-Density Bayes Classifier, and the overfitting-resistant Naive Bayes Classifier.
  • Probability Density Functions. A review of a world that you've probably encountered before: real-valued random variables, probability density functions, and how to deal with multivariate (i.e. high dimensional) probability densities. Here's where you can review things like Expectations, Covariance Matrices, Independence, Marginal Distributions and Conditional Distributions. Once you're happy with this stuff you won't be a data miner, but you'll have the tools to very quickly become one.
  • Gaussians. Gaussians, both the friendly univariate kind, and the slightly-reticent-but-nice-when-you-get-to-know-them multivariate kind, are extremely useful in many parts of statistical data mining, including many data mining models in which the underlying data assumption is highly non-Gaussian. You need to be friends with multivariate Gaussians.
  • Maximum Likelihood Estimation. MLE is a solid tool for learning parameters of a data mining model. It is a methodology which tries to do two things. First, it is a reasonably well-principled way to work out what computation you should be doing when you want to learn some kinds of model from data. Second, it is often fairly computationally tractable. In any case, the important thing is that in order to understand things like polynomial regression, neural nets, mixture models, hidden Markov models and many other things it's going to really help if you're happy with MLE.
  • Gaussian Bayes Classifiers. Once you are friends with Gaussians, it is easy to use them as subcomponents of Bayesian Classifiers. This tutorial shows you how.
  • Cross-Validation. Cross-validation is one of several approaches to estimating how well the model you've just learned from some training data is going to perform on future as-yet-unseen data. We'll review testset validation, leave-one-out cross validation (LOOCV) and k-fold cross-validation, and we'll discuss a wide variety of places that these techniques can be used. We'll also discuss overfitting...the terrible phenomenon that CV is supposed to prevent. And at the end, our hairs will stand on end as we realize that even when using CV, you can still overfit arbitrarily badly.
  • Neural Networks. We begin by talking about linear regression...the ancestor of neural nets. We look at how linear regression can use simple matrix operations to learn from data. We gurgle with delight as we see why one initial assumption leads inevitably to the decision to try to minimize sum squared error. We then explore an alternative way to compute linear parameters---gradient descent. And then we exploit gradient descent to allow classifiers in addition to regressors, and finally to allow highly non-linear models---full neural nets in all their glory.
  • Instance-based learning (aka Case-based or Memory-based or non-parametric). Over a century old, this form of data mining is still being used very intensively by statisticians and machine learners alike. We explore nearest neighbor learning, k-nearest-neighbor, kernel methods and locally weighted polynomial regression. Software and data for the algorithms in this tutorial are available from http://www.cs.cmu.edu/~awm/vizier. The example figures in this slide-set were created with the same software and data.
  • Eight Regression Algorithms. You'll have to wait to find out Andrew's ordering on them, but based on all the foundations you've covered so far we will quickly be able to run through: Regression Trees, Cascade Correlation, Group Method Data Handling (GMDH), Multivariate Adaptive Regression Splines (MARS), Multilinear Interpolation, Radial Basis Functions, Robust Regression, Cascade Correlation + Projection Pursuit.
  • Predicting Real-valued Outputs: An introduction to regression. This lecture is made up entirely from material from the start of the Neural Nets lecture and a subset of the topics in the "Favorite Regression Algorithms" lecture. We talk about linear regression, and then these topics: Varying noise, Non-linear regression (very briefly), Polynomial Regression, Radial Basis Functions, Robust Regression, Regression Trees, Multilinear Interpolation and MARS.
  • Bayesian Networks. The tutorial first reviews the fundamentals of probability (but to do that properly, please see the earlier Andrew lectures on Probability for Data Mining). It then discusses the use of Joint Distributions for representing and reasoning about uncertain knowledge. Having discussed the obvious drawback (the curse of dimensionality) for Joint Distributions as a general tool, we visit the world of clever tricks involving independence and conditional independence that allow us to express our uncertain knowledge much more succinctly. And then we beam with pleasure as we realize we've got most of the knowledge we need to understand and appreciate Bayesian Networks already. The remainder of the tutorial introduces the important question of how to do inference with Bayesian Networks (see also the next Andrew Lecture for that).
  • Inference in Bayesian Networks (by Scott Davies and Andrew Moore). The majority of these slides were conceived and created by Scott Davies (scottd@cs.cmu.edu). Once you've got hold of a Bayesian Network, there remains the question of how you do inference with it. Inference is the operation in which some subset of the attributes are given to us with known values, and we must use the Bayes net to estimate the probability distribution of one or more of the remaining attributes. A typical use of inference is "I've got a temperature of 101, I'm a 37-year-old Male and my tongue feels kind of funny but I have no headache. What's the chance that I've got bubonic plague?".
  • Learning Bayesian Networks. This short and simple tutorial overviews the problem of learning Bayesian networks from data, and the approaches that are used. This is an area of active research by many research groups, including Andrew and his students (see the Auton Lab Website for more details).
  • A Short Intro to Naive Bayesian Classifiers. I recommend using Probability For Data Mining for a more in-depth introduction to Density estimation and general use of Bayes Classifiers, with Naive Bayes Classifiers as a special case. But if you just want the executive summary bottom line on learning and using Naive Bayes classifiers on categorical attributes then these are the slides for you.
  • Short Overview of Bayes Nets. This is a very short 5 minute "executive overview" of the intuition and insight behind Bayesian Networks. Read the full Bayes Net Tutorial for more information.
  • Gaussian Mixture Models. Gaussian Mixture Models (GMMs) are among the most statistically mature methods for clustering (though they are also used intensively for density estimation). In this tutorial, we introduce the concept of clustering, and see how one form of clustering...in which we assume that individual datapoints are generated by first choosing one of a set of multivariate Gaussians and then sampling from them...can be a well-defined computational operation. We then see how to learn such a thing from data, and we discover that an optimization approach not used in any of the previous Andrew Tutorials can help considerably here. This optimization method is called Expectation Maximization (EM). We'll spend some time giving a few high level explanations and demonstrations of EM, which turns out to be valuable for many other algorithms beyond Gaussian Mixture Models (we'll meet EM again in the later Andrew Tutorial on Hidden Markov Models). The wild'n'crazy algebra mentioned in the text can be found (hand-written) here.
  • K-means and Hierarchical Clustering. K-means is the most famous clustering algorithm. In this tutorial we review just what it is that clustering is trying to achieve, and we show the detailed reason that the k-means approach is cleverly optimizing something very meaningful. Oh yes, and we'll tell you (and show you) what the k-means algorithm actually does. You'll also learn about another famous class of clusterers: hierarchical methods (much beloved in the life sciences). Phrases like "Hierarchical Agglomerative Clustering" and "Single Linkage Clustering" will be bandied about.
  • Hidden Markov Models. In this tutorial we'll begin by reviewing Markov Models (aka Markov Chains) and then...we'll hide them! This simulates a very common phenomenon... there is some underlying dynamic system running along according to simple and uncertain dynamics, but we can't see it. All we can see are some noisy signals arising from the underlying system. From those noisy observations we want to do things like predict the most likely underlying system state, or the time history of states, or the likelihood of the next observation. This has applications in fault diagnosis, robot localization, computational biology, speech understanding and many other areas. In the tutorial we will describe how to happily play with the mostly harmless math surrounding HMMs and how to use a heart-warming, and simple-to-implement, approach called dynamic programming (DP) to efficiently do most of the HMM computations you could ever want to do. These operations include state estimation, estimating the most likely path of underlying states, and a grand (and EM-filled) finale, learning HMMs from data.
  • VC dimension. This tutorial concerns a well-known piece of Machine Learning Theory. If you've got a learning algorithm in one hand and a dataset in the other hand, to what extent can you decide whether the learning algorithm is in danger of overfitting or underfitting? If you want to put some formal analysis into the fascinating question of how overfitting can happen, then this is the tutorial for you. In addition to getting good understanding of the overfitting phenomenon, you also end up with a method for estimating how well an algorithm will perform on future data that is solely based on its training set error, and a property (VC dimension) of the learning algorithm. VC-dimension thus gives an alternative to cross-validation, called Structural Risk Minimization (SRM), for choosing classifiers. We'll discuss that. We'll also very briefly compare both CV and SRM to two other model selection methods: AIC and BIC.
  • Support Vector Machines. We review the idea of the margin of a classifier, and why that may be a good criterion for measuring a classifier's desirability. Then we consider the computational problem of finding the largest margin linear classifier. At this point we look at our toes with embarrassment and note that we have only done work applicable to noise-free data. But we cheer up and show how to create a noise resistant classifier, and then a non-linear classifier. We then look under a microscope at the two things SVMs are renowned for---the computational ability to survive projecting data into a trillion dimensions and the statistical ability to survive what at first sight looks like a classic overfitting trap.
  • PAC Learning. PAC stands for "Probably Approximately Correct" and concerns a nice formalism for deciding how much data you need to collect in order for a given classifier to achieve a given probability of correct predictions on a given fraction of future test data. The resulting estimate is somewhat conservative but still represents an interesting avenue by which computer science has tried to muscle in on the kind of analytical problem that you would normally find in a statistics department.
  • Markov Decision Processes. How do you plan efficiently if the results of your actions are uncertain? There is some remarkably good news, and some significant computational hardship. We begin by discussing Markov Systems (which have no actions) and the notion of Markov Systems with Rewards. We then motivate and explain the idea of infinite horizon discounted future rewards. And then we look at two competing approaches to deal with the following computational problem: given a Markov System with Rewards, compute the expected long-term discounted rewards. The two methods, which usually sit at opposite corners of the ring and snarl at each other, are straight linear algebra and dynamic programming. We then make the leap up to Markov Decision Processes, and find that we've already done 82% of the work needed to compute not only the long term rewards of each MDP state, but also the optimal action to take in each state.
  • Reinforcement Learning. You need to be happy about Markov Decision Processes (the previous Andrew Tutorial) before venturing into Reinforcement Learning. It concerns the fascinating question of whether you can train a controller to perform optimally in a world where it may be necessary to suck up some short term punishment in order to achieve long term reward. We will discuss certainty-equivalent RL, Temporal Difference (TD) learning, and finally Q-learning. The curse of dimensionality will be constantly leering over our shoulder, salivating and cackling.
  • Biosurveillance: An example. We review methods described in other biosurveillance slides as applied to hospital admissions data from the Walkerton Cryptosporidium outbreak of 2000. This is work performed as part of the ECADS project.
  • Elementary probability and Naive Bayes classifiers. This slide repeats much of the material of the main Probability Slide from Andrew's tutorial series, but this slide-set focusses on disease surveillance examples, and includes a very detailed description for non-experts about how Bayes rule is used in practice, about Bayes Classifiers, and how to learn Naive Bayes classifiers from data.
  • Spatial Surveillance. This tutorial discusses Scan Statistics, a famous epidemiological method for discovering overdensities of disease cases.
  • Time Series Methods. This tutorial reviews some elementary univariate time series methods, with a focus on using the time series for alerting when a sequence of observations is starting to behave strangely.
  • Game Tree Search Algorithms, including Alpha-Beta Search. Introduction to algorithms for computer game playing. We describe the assumptions about two-player zero-sum discrete finite deterministic games of perfect information. We also practice saying that noun-phrase in a single breath. After the recovery teams have done their job we talk about solving such games with minimax and then alpha-beta search. We also discuss the dynamic programming approach, used most commonly for end-games. We also debate the theory and practice of heuristic evaluation functions in games.
  • Zero-Sum Game Theory. Want to know how and why to bluff in poker? How games can be compiled down to a matrix form? And general discussion of the basics of games of hidden information? Then these are the slides for you. It might help you to begin by reading the slides on game-tree search.
  • Non-zero-sum Game Theory. Auctions and electronic negotiations are a fascinating topic. These slides take you through most of the basic story of the assumptions, the formalism and the mathematics behind non-zero-sum game theory. It might help you to begin by reading the slides on game-tree search and Zero-sum Game Theory with Hidden Information, available from this same set of tutorials. In this tutorial we cover the definition of a multiplayer non-zero-sum game, domination of strategies, and Nash Equilibria. We deal with discrete games, and also games in which strategies include real numbers, such as your bid in a two player double auction negotiation. We cover prisoner's dilemma, tragedy of the commons, double auctions, and multi-player auctions such as the first price sealed auction and the second price auction. The math for the double auction analysis can be found at http://www.cs.cmu.edu/~awm/double_auction_math.pdf.
  • Introductory overview of time-series-based anomaly detection algorithms. This simple tutorial overviews some methods for detecting anomalies in biosurveillance time series. The slides are incomplete: verbal commentary from the presentation has not yet been included as explanatory textboxes. Please let me (awm@cs.cmu.edu) know if you would be interested in more detail on these slides and/or access to the software that implements and graphs the various univariate methods. If I receive enough requests I will try to make both of the above available.
  • AI Class introduction. A very quick informal discussion of the different kinds of AI research motivations out there.
  • Search Algorithms. What is a search algorithm? What job does it do and where can it be applied? We introduce various flavors of Breadth First Search and Depth First Search and then look at alternatives and improvements that include Iterative Deepening and Bidirectional Search. Then we look with furrowed brow at an idea called Best First Search. This will be our first view of a search algorithm that is able to exploit a heuristic function.
  • A-star Heuristic Search. The classic algorithm for finding shortest paths given an admissible heuristic. We'll deal with the notion of admissibility (summary: admissible = optimistic). We show how you can prove properties of A*. We'll also briefly discuss IDA* (iterative deepening A*).
  • Constraint Satisfaction Algorithms, with applications in Computer Vision and Scheduling. The tutorial teaches concepts from the AI literature on Constraint Satisfaction. Accompanying animations are in http://www.cs.cmu.edu/~awm/animations/constraint. This is a special case of uninformed search in which we want to find a solution configuration for some set of variables that satisfies a set of constraints. Example problems include graph coloring, 8-queens, magic squares, the Waltz algorithm for interpreting line drawings, many kinds of scheduling and, most important of all, the deduction phase of minesweeper. The algorithms we'll look at include backtracking search, forward checking search and constraint propagation search. We'll also look at general-purpose heuristics for additional search accelerations.
  • Robot Motion Planning. We review some algorithms for clever path planning once we arrive in real-valued continuous space instead of the safe and warm discrete space we've been sheltering in so far. We look at configuration spaces, visibility graphs, cell-decomposition, voronoi-based planning and potential field methods. Unfortunately some of the figures are missing from the PDF version.
  • HillClimbing, Simulated Annealing and Genetic Algorithms. Some very useful algorithms, to be used only in case of emergency.
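The tutorials above are slide decks and ship no code on this page, so here are a few minimal sketches of the core ideas, in Python. First, the Information Gain measure: IG(Y|X) = H(Y) - H(Y|X), computed from counts. The "test result vs. disease" data is hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute, labels):
    """IG(Y|X) = H(Y) - sum_x P(X=x) * H(Y | X=x)."""
    n = len(labels)
    cond = 0.0
    for value in set(attribute):
        subset = [y for x, y in zip(attribute, labels) if x == value]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

# Hypothetical toy data: a diagnostic test result and the true disease label.
test = ["pos", "pos", "neg", "neg"]
sick = ["yes", "yes", "no", "no"]
print(information_gain(test, sick))  # test perfectly predicts sick -> 1.0 bit
```

Because the attribute perfectly predicts the label here, the conditional entropy is zero and the gain equals the full one bit of label entropy; a useless attribute would score zero.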
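The Naive Bayes tutorials above describe learning P(class) and P(attribute | class) from counts over categorical data. A minimal sketch of that recipe, with add-one (Laplace) smoothing as a standard tweak (the slides do not necessarily use it) and a hypothetical two-attribute weather dataset:

```python
from collections import Counter

def train_nb(rows, labels):
    """Count-based training for a Naive Bayes classifier on categorical rows."""
    class_counts = Counter(labels)
    attr_counts = Counter()  # (class, attribute index, value) -> count
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            attr_counts[(y, i, v)] += 1
    return class_counts, attr_counts

def predict_nb(model, row, values_per_attr):
    """argmax_c P(c) * prod_i P(x_i | c), with add-one smoothing."""
    class_counts, attr_counts = model
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / n
        for i, v in enumerate(row):
            p *= (attr_counts[(c, i, v)] + 1) / (cc + values_per_attr[i])
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("rainy", "mild"), values_per_attr=[2, 2]))  # -> yes
```

Real implementations work in log-probabilities to avoid underflow on many attributes; the product form above matches the way the slides present the classifier.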
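The k-fold idea from the Cross-Validation tutorial fits in a few lines: hold out each fold in turn, train on the rest, and average the held-out scores. The `fit_majority`/`accuracy` pair below is a deliberately trivial, hypothetical "model", just to make the plumbing visible:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (sizes differ by at most 1)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_score(fit, score, xs, ys, k=5):
    """Mean held-out score over k folds. `fit(xs, ys)` returns a model;
    `score(model, xs, ys)` returns a number on held-out data."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held = set(fold)
        tr_x = [x for i, x in enumerate(xs) if i not in held]
        tr_y = [y for i, y in enumerate(ys) if i not in held]
        te_x = [xs[i] for i in fold]
        te_y = [ys[i] for i in fold]
        scores.append(score(fit(tr_x, tr_y), te_x, te_y))
    return sum(scores) / len(scores)

# Hypothetical trivial model: always predict the majority training label.
fit_majority = lambda xs, ys: max(sorted(set(ys)), key=ys.count)
accuracy = lambda m, xs, ys: sum(y == m for y in ys) / len(ys)
result = cross_val_score(fit_majority, accuracy, list(range(6)),
                         ["a", "a", "a", "a", "a", "b"], k=3)
print(result)  # fold accuracies 1.0, 1.0, 0.5 -> mean 0.8333...
```

LOOCV is just the k = n special case; the tutorial's warning still applies: if you use CV scores to pick among many models, you can overfit the validation procedure itself.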
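The Neural Networks tutorial starts from linear regression and minimizes sum squared error by gradient descent. A bare-bones one-variable sketch of that step (the learning rate and iteration count are arbitrary choices for this toy data, not values from the slides):

```python
def fit_linear_gd(xs, ys, lr=0.01, steps=5000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        # Gradient of (1/n) * sum_i (w*x_i + b - y_i)^2 w.r.t. w and b.
        gw = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        gb = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * gw
        b -= lr * gb
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # hypothetical noise-free data: exactly y = 2x + 1
w, b = fit_linear_gd(xs, ys)
print(round(w, 3), round(b, 3))  # converges close to 2.0 and 1.0
```

For plain linear regression the matrix (normal-equation) solution the tutorial mentions is exact and faster; gradient descent earns its keep once the model becomes non-linear, as in the full neural nets later in that lecture.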
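What the k-means algorithm "actually does", as promised by the K-means and Hierarchical Clustering tutorial, is Lloyd's two-step loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. A minimal sketch with naive initialization (real implementations prefer smarter seeding such as k-means++):

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = points[:k]  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign p to the centroid with smallest squared distance.
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids

# Hypothetical data: two obvious groups of 2-D points.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(sorted(kmeans(pts, 2)))  # centroids settle at (0, 0.5) and (10, 10.5)
```

Each iteration can only decrease the within-cluster sum of squared distances, which is exactly the "something very meaningful" the tutorial shows k-means to be optimizing.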
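One of the HMM computations that the Hidden Markov Models tutorial promises dynamic programming makes easy is finding the most likely sequence of hidden states, i.e. the Viterbi algorithm. A small sketch with a hypothetical rain/umbrella model (all names and probabilities are made up for illustration):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for `obs`, by dynamic programming."""
    # best[t][s] = probability of the best path ending in state s at time t.
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # Pick the best final state and trace the stored back-pointers.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Hypothetical HMM: hidden weather (Rain/Sun), observed umbrella use (yes/no).
states = ["Rain", "Sun"]
start = {"Rain": 0.5, "Sun": 0.5}
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3}, "Sun": {"Rain": 0.3, "Sun": 0.7}}
emit = {"Rain": {"yes": 0.9, "no": 0.1}, "Sun": {"yes": 0.2, "no": 0.8}}
print(viterbi(["yes", "yes", "no"], states, start, trans, emit))
```

The same table-filling pattern, with max replaced by sum, gives the forward algorithm for observation likelihoods; production code works in log space to avoid underflow on long sequences.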
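For the Markov Decision Processes tutorial, the dynamic-programming contender for computing long-term discounted rewards is value iteration. The sketch below uses a rewards-on-states formulation (matching the "Markov System with Rewards" framing above) and a hypothetical two-state MDP whose answer is easy to verify by hand: V(rich) = 1/(1 - 0.9) = 10 and V(poor) = 0.9 * V(rich) = 9.

```python
def value_iteration(states, actions, trans, reward, gamma=0.9, eps=1e-9):
    """Iterate V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s') to a fixed point."""
    V = {s: 0.0 for s in states}
    while True:
        new = {}
        for s in states:
            new[s] = reward[s] + gamma * max(
                sum(p * V[s2] for s2, p in trans[s][a].items()) for a in actions
            )
        if max(abs(new[s] - V[s]) for s in states) < eps:
            return new
        V = new

# Hypothetical 2-state MDP: "stay" keeps you put, "move" switches state.
states = ["poor", "rich"]
actions = ["stay", "move"]
trans = {
    "poor": {"stay": {"poor": 1.0}, "move": {"rich": 1.0}},
    "rich": {"stay": {"rich": 1.0}, "move": {"poor": 1.0}},
}
reward = {"poor": 0.0, "rich": 1.0}
V = value_iteration(states, actions, trans, reward)
print(round(V["rich"], 3), round(V["poor"], 3))  # -> 10.0 9.0
```

With gamma < 1 the update is a contraction, so the loop converges geometrically; the "straight linear algebra" rival from the tutorial solves the same fixed-point equations directly for a fixed policy.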