Since the book doesn't cover case-based reasoning (CBR), I am emailing a variant of today's (April 2's) lecture on CBR. Feel free to come to office hours or talk after class if you have questions.

Jude

In ML, probabilistic reasoning, and (later this term) logical reasoning we tend to focus on general 'rules' rather than on specific cases. However, another approach to intelligent behavior is to reason directly with specific cases. Both the 'general' and the 'specific' approaches have their value. Eg, compare reasoning with Newton's Laws in physics to learning about business by studying specific companies (eg, IBM).

The basic framework for CBR is:

   0) Store all cases (eg, the specific training examples in ML) in a 'case library'
   1) When given a new problem (eg, a testset example), find the k *most similar* cases in the case library
   2) Use these k cases to create a solution for the new problem
   3) If the correct answer for the new problem is later obtained, add this new problem and its solution to the case library
   4) Goto 1

The "k-nearest-neighbor" (k-NN) algorithm is a simple case of CBR:

   Store all training examples.
   For each testset example
      compute the distance to each training example
      keep the k closest training examples (ie, the k nearest neighbors)
      use the majority answer on these k examples as the predicted category for the testset example

To use CBR in general and k-NN in particular, one must choose a similarity function (or, equivalently, a distance function; eg, '-distance' or '1/distance' can be viewed as similarity functions). The most common distance function is Euclidean distance, ie, sum the squares of the feature-by-feature differences and then take the square root (for non-numeric features, one can say the difference is 0 if the feature values are the same and 1 otherwise). However, other similarity or distance functions are possible.
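
The distance function just described can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function names (feature_difference, euclidean_distance, similarity) are made up here, and treating every non-numeric feature as a 0/1 match is the simple convention mentioned above rather than the only option.

```python
import math

def feature_difference(a, b):
    """Difference for one feature (hypothetical helper).

    Numeric features use the plain numeric gap; anything else uses
    0 if the two values match and 1 otherwise.
    """
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a - b
    return 0.0 if a == b else 1.0

def euclidean_distance(x, y):
    """Square root of the summed squared per-feature differences."""
    return math.sqrt(sum(feature_difference(a, b) ** 2 for a, b in zip(x, y)))

def similarity(x, y):
    """One possible similarity function: the negated distance."""
    return -euclidean_distance(x, y)

print(euclidean_distance((0, 0), (3, 4)))            # 5.0
print(euclidean_distance((1, "red"), (1, "blue")))   # 1.0
```

Negating the distance is used here instead of 1/distance only because it avoids dividing by zero when two cases are identical.
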
(Note: we should also normalize all numeric features to the range [0,1] or [-1,1], since we don't want to get different results if we measured (a) age in years and height in inches versus (b) age in seconds and height in meters.)

In k-NN one also needs to choose k. We can use a tuning set (or, more generally, 'cross validation') to select a good value of k for a given dataset. Why not use k=1? That might be the best in some domains, but in others a larger value might be better, in order to "smooth out" noise.

Example (you might want to change to a fixed-width font):

           F1   F2   Category
   ex1      1    2      +
   ex2      4    3      -
   ex3      7    9      +

   test     6    6      ?

   distance(test, ex1)^2 = 5^2 + 4^2 = 41
   distance(test, ex2)^2 = 2^2 + 3^2 = 13
   distance(test, ex3)^2 = 1^2 + 3^2 = 10

(The square roots are left off, since taking them does not change which example is closest.)

Assuming k=1, we want the CLOSEST training example, which is ex3, so we would predict category(test) = +

You should also think about the "Venn Diagram View of Feature Space" and the k-NN algorithm (see Fig 20.12a in the textbook for something similar).

Pros of k-NN (pros = good aspects)
----------------------------------

   very simple algo

   training is very fast

   can be accurate on real-world problems (especially if k is 'tuned')

Cons of k-NN (cons = bad aspects)
---------------------------------

   can take a long time to classify a new example (in the simplest approach we need to compute the distance to all training examples; however, we really only need the k closest, so with some clever tricks we can find them w/o looking at ALL training examples)

   no learned model for the human user to inspect and see what was learned (*I forgot to mention this one in class today*)

   computes distance for IRRELEVANT features as well as for the relevant ones
      - so likely to be misled if there are many features ("curse of dimensionality")
      - might want to use, say, info gain to select the L best features before computing distances

   need to choose similarity/distance function and k (but most ML algos have parameters to set, so k-NN isn't that bad by comparison, and if we have a good amount of training data we can 'let the data decide' on good choices).
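
For those who want to try the worked example above, here is a minimal Python sketch of k-NN with majority voting, plus the [0,1] normalization from the note. It is an illustration under some assumptions, not the lecture's own code: the names knn_predict and make_scaler are invented here, all features are assumed numeric, ties in the vote go to the category seen first among the nearest neighbors, and math.dist needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(training, query, k=1):
    """Return the majority category among the k training cases nearest to query.

    training is a list of (features, category) pairs (hypothetical layout).
    """
    by_distance = sorted(training, key=lambda case: math.dist(case[0], query))
    votes = Counter(category for _, category in by_distance[:k])
    return votes.most_common(1)[0][0]

def make_scaler(rows):
    """Build a function that rescales features using the min/max of rows.

    Training rows map into [0,1]; a column with no spread maps to 0.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(row):
        return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                     for v, lo, hi in zip(row, lows, highs))
    return scale

# The three training examples and the test example from the table.
training = [((1, 2), "+"), ((4, 3), "-"), ((7, 9), "+")]
test = (6, 6)

# Squared distances from the test example to ex1, ex2, ex3.
squared = [round(math.dist(features, test) ** 2) for features, _ in training]
print(squared)                            # [41, 13, 10]
print(knn_predict(training, test, k=1))   # +

# Same prediction after rescaling both features to [0,1].
scale = make_scaler([features for features, _ in training])
scaled_training = [(scale(features), category) for features, category in training]
print(knn_predict(scaled_training, scale(test), k=1))   # +
```

Sorting all training examples is the 'simplest approach' mentioned under the cons: every prediction looks at the whole case library. Choosing k with a tuning set would amount to calling knn_predict with several values of k on held-out examples and keeping the value that gets the most of them right.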