CSE 439 – Data Mining Assist. Prof. Dr. Derya BİRANT Classification Part 1
Outline
- What Is Classification?
- Classification Examples
- Classification Methods: Decision Trees, Bayesian Classification, K-Nearest Neighbor, Neural Network, Genetic Algorithms, Support Vector Machines (SVM), Fuzzy Set Approaches
What Is Classification?
- Construction of a model to classify data.
- When constructing the model, use the training set and its class labels.
- After the model is constructed, use it to classify new data.
Classification (A Two-Step Process)
Model construction:
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, trees, or mathematical formulae.
Model usage (classifying future or unknown objects):
- Estimate the accuracy rate of the model. The accuracy rate is the percentage of test set samples that are correctly classified by the model.
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
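The accuracy rate defined above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name `accuracy_rate`, the sample labels, and the 75% acceptance threshold are all made up for the example.

```python
def accuracy_rate(true_labels, predicted_labels):
    """Percentage of test set samples whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    if not true_labels:
        raise ValueError("the test set is empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical test set: known class labels and what some model predicted.
test_truth = ["yes", "no", "yes", "yes", "no"]
test_predicted = ["yes", "no", "no", "yes", "no"]

rate = accuracy_rate(test_truth, test_predicted)

# Assumed acceptance threshold; the slides do not say what counts as "acceptable".
ACCEPTABLE = 75.0
model_is_usable = rate >= ACCEPTABLE
```

Here four of the five test samples are classified correctly, so the model would pass the assumed threshold and could be used on tuples whose labels are unknown.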
Classification (A Two-Step Process)
[Figure: the DM engine takes the training data and produces a mining model; the DM engine then takes the mining model and the data to predict and produces the predicted data.]
Classification Example
Process (1): Model Construction
- Training data is passed to a classification algorithm, which produces the classifier (model).
- Example model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
- The classifier is checked on testing data and then applied to unseen data.
- Example unseen tuple: (Jeff, Professor, 4). Tenured?
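The example rule can be written directly as a function and applied to the unseen tuple. This is only a sketch: the function name `predict_tenured` is invented here, and returning 'no' when the rule does not fire is an assumption, since the slide states only the 'yes' case.

```python
def predict_tenured(rank, years):
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    # Assumption: tuples not covered by the rule are classified as 'no'.
    return "no"


# Unseen tuple from the slide: (Jeff, Professor, 4)
jeff = predict_tenured("Professor", 4)
```

Jeff has only 4 years, but his rank is professor, so the first condition of the rule fires.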
Classification Example
Given old data about customers and payments, predict a new applicant’s loan eligibility.
[Figure: records of previous customers (Age, Salary, Profession, Location, Customer type), labeled as good or bad customers, are fed to a classifier; the classifier produces rules such as Salary > 5 L and Prof. = Exec, which are applied to the new applicant’s data to predict good/bad.]
Classification Techniques
- Decision Trees
- Bayesian Classification
- K-Nearest Neighbor
- Neural Network
- Genetic Algorithms
- Support Vector Machines (SVM)
- Fuzzy Set Approaches
Decision Trees
A decision tree is a tree where
- internal nodes are simple decision rules on one or more attributes,
- leaf nodes are predicted class labels.
Decision trees are used for deciding between several courses of action.
[Figure: example tree. Internal nodes test the attributes age?, student? and credit rating?; branches carry the values <=30, 31..40, >40, no, yes, Fair, Excellent; leaves carry the classification Yes or No.]
Decision Tree Applications
Decision trees are used extensively in data mining. They have been applied to classify:
- medical patients by disease,
- equipment malfunctions by cause,
- loan applicants by likelihood of payment,
- ...
[Figure: example tree with the tests Salary < 1 M, Job = teacher, Age < 30 and House Hiring, and the leaves Good and Bad.]
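Decision Trees (Different Representation)
A decision tree splits the attribute space into areas, which gives a different representation of the same tree.
[Figure: a tree testing Age (<30, >=30) and Car Type (Minivan; Sports, Truck) with leaves YES and NO, shown next to the same splits drawn as regions over Age (ticks at 30 and 60) and Car Type. A further panel uses the values short, medium, tall.]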
Decision Tree Advantages and Disadvantages
Positives (+)
- Reasonable training time
- Fast application
- Easy to interpret (can be re-represented as if-then-else rules)
- Easy to implement
- Can handle a large number of features
- Does not require any prior knowledge of the data distribution
Negatives (-)
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
- Output attribute must be categorical
- Limited to one output attribute
Rules Indicated by Decision Trees
Write one rule for each path in the decision tree from the root to a leaf.
Decision Tree Algorithms
- ID3: Quinlan (1981). Tries to reduce the expected number of comparisons.
- C4.5: Quinlan (1993). An extension of ID3. Just starting to be used in data mining applications. Also used for rule induction.
- CART: Breiman, Friedman, Olshen, and Stone (1984). Classification and Regression Trees.
- CHAID: Kass (1980). Oldest decision tree algorithm. Well established in the database marketing industry.
- QUEST: Loh and Shih (1997).
Decision Tree Construction
Which attribute is the best classifier?
- Calculate the information gain G(S,A) for each attribute A.
- The basic idea is to select the attribute with the highest information gain.
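The slide names G(S,A) without giving its formula. The usual entropy-based definition, as used by ID3, is shown below as a reminder; the symbols p_c (fraction of samples in S with class c) and S_v (subset of S where attribute A takes value v) are notation assumed here, not taken from the slides.

```latex
\mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c
\qquad
G(S, A) = \mathrm{Entropy}(S) \;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```

An attribute scores a high gain when splitting on it leaves subsets that are close to pure, that is, subsets with low entropy.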
Decision Tree Construction
Which attribute first?
Training data attributes and their values:
- Outlook: Sunny, Overcast, Rainy
- Temperature: Hot, Mild, Cool
- Humidity: High, Normal
- Wind: Weak, Strong
- Play Tennis (class): Yes, No
Decision Tree Construction
Information gain of each attribute on the training data S (attributes Outlook, Temperature, Humidity, Wind; class Play Tennis):
- Gain(S, Outlook) = 0.246
- Gain(S, Temperature) = 0.029
- Gain(S, Humidity) = 0.151
- Gain(S, Wind) = 0.048
Outlook has the highest gain, so it is selected first.
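The gains above can be reproduced with a short script. This is a sketch under one loud assumption: the slide's table did not survive intact, so the 14 rows below are the widely used textbook play-tennis data set, which is consistent with the attribute values and gains shown. Names are in English (Outlook for Hava, Temperature for Sıcaklık, Humidity for Nem, Wind for Rüzgar), and the function names are made up for the example.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

# ASSUMPTION: standard 14-row play-tennis table; last column is the class.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rainy", "Mild", "High", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rainy", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rainy", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)
```

Under this assumption the computed gains agree with the slide's figures to within about 0.001, and `best_attribute` is the outlook attribute, which is why it goes at the root.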
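Decision Tree Construction
Which attribute is next?
[Figure: partial tree with Outlook at the root and the branches Sunny, Overcast and Rainy. The Overcast branch ends in the leaf Yes; the branches marked "?" still need an attribute.]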
Decision Tree Construction
Training data: rows R1 to R14 with the attributes Outlook, Temperature, Humidity, Wind and the class Play Tennis.
Resulting tree, with the rows that reach each leaf:
- Outlook = Sunny: test Humidity
  - Humidity = High: No [R1, R2, R8]
  - Humidity = Normal: Yes [R9, R11]
- Outlook = Overcast: Yes [R3, R7, R12, R13]
- Outlook = Rainy: test Wind
  - Wind = Weak: Yes [R4, R5, R10]
  - Wind = Strong: No [R6, R14]
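Writing one rule per root-to-leaf path can be sketched for this tree as follows. The nested-dictionary encoding and the function names are choices made for this example, and the attribute and value names are English renderings of the slide's labels.

```python
# The finished tree as nested dicts: {attribute: {value: subtree or class label}}.
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rainy": {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}


def extract_rules(tree, conditions=()):
    """Return one (conditions, class_label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):  # a leaf: the path is complete
        return [(conditions, tree)]
    (attribute, branches), = tree.items()  # each node tests exactly one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules


def format_rule(conditions, label):
    """Render a path as an IF ... THEN ... rule."""
    tests = " AND ".join(f"{attr} = '{value}'" for attr, value in conditions)
    return f"IF {tests} THEN Play Tennis = '{label}'"


RULES = [format_rule(conds, label) for conds, label in extract_rules(TREE)]
```

The tree has five leaves, so `RULES` holds five rules, for example `IF Outlook = 'Overcast' THEN Play Tennis = 'Yes'`.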
Another Example
At the weekend:
- go shopping,
- watch a movie,
- play tennis, or
- just stay in.
What you do depends on three things:
- the weather (windy, rainy or sunny);
- how much money you have (rich or poor);
- whether your parents are visiting.
Classification Techniques 2- Bayesian Classification
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
- Foundation: based on Bayes’ theorem.
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X).
Classification Techniques 2- Bayesian Classification
- C1: buys_computer = ‘yes’
- C2: buys_computer = ‘no’
- Data sample X = (Age <= 30, Income = medium, Student = yes, Credit_rating = fair)
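For classifying X, the decision rule that the worked numbers rely on can be stated compactly. The slides do not spell it out, so treat the form below as the usual naive Bayes formulation: the attributes x_k of X are assumed conditionally independent given the class, and P(X) is dropped because it is the same for every class.

```latex
P(X \mid C_i) = \prod_{k} P(x_k \mid C_i)
\qquad
\hat{C} = \arg\max_{i} \; P(X \mid C_i)\, P(C_i)
```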
Classification Techniques 2- Bayesian Classification
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Priors P(Ci):
- P(C1): P(buys_computer = “yes”) = 9/14 = 0.643
- P(C2): P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for each class:
- P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
- P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
- P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
- P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
- P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
- P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
- P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
- P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
- P(X|C1): P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
- P(X|C2): P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
Compute P(X|Ci) * P(Ci):
- P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
- P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to the class buys_computer = “yes”.
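The arithmetic above can be checked with exact fractions. The sketch below only re-uses the counts quoted on the slide (9/14, 2/9, and so on); the variable names are invented for the example.

```python
from fractions import Fraction
from math import prod

# Class priors P(Ci), taken from the slide.
priors = {"yes": Fraction(9, 14), "no": Fraction(5, 14)}

# Conditional probabilities P(x_k | Ci) for X, in the order:
# age <= 30, income = medium, student = yes, credit_rating = fair.
likelihoods = {
    "yes": [Fraction(2, 9), Fraction(4, 9), Fraction(6, 9), Fraction(6, 9)],
    "no": [Fraction(3, 5), Fraction(2, 5), Fraction(1, 5), Fraction(2, 5)],
}

# P(X | Ci): product of the per-attribute probabilities.
p_x_given_class = {c: prod(likelihoods[c]) for c in priors}

# P(X | Ci) * P(Ci): the score that is compared across classes.
scores = {c: p_x_given_class[c] * priors[c] for c in priors}

predicted_class = max(scores, key=scores.get)
```

Rounded to three decimals, the scores match the slide's 0.028 and 0.007, so `predicted_class` is the "yes" class.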
K-Nearest Neighbor (k-NN)
- An object is classified by a majority vote of its neighbors (the k closest members).
- If k = 1, the object is simply assigned to the class of its nearest neighbor.
- The Euclidean distance measure is used to calculate how close two objects are.
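A minimal sketch of the majority vote with Euclidean distance follows. The function name `knn_classify` and the 2-D training points are hypothetical, made up for illustration; ties in the vote are broken in favor of the class seen first among the nearest neighbors, which is a choice of this sketch rather than something the slides specify.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def knn_classify(training, query, k=3):
    """Return the majority class among the k training points closest to query.

    training: list of (point, label) pairs, where point is a tuple of numbers.
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training points")
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set with two classes.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 7.5), "B"), ((6.5, 5.0), "B"),
]

near_a = knn_classify(training, (1.2, 1.5), k=3)
near_b = knn_classify(training, (6.2, 6.1), k=1)  # k = 1: class of the nearest neighbor
```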
Classification Evaluation (Testing)
[Figure: a training set with categorical and continuous attributes and a class column is used to learn the classifier (model); the model is then applied to the test set.]
Classification Accuracy
Confusion matrix cells:
- True Positive
- False Negative
- False Positive
- True Negative
Which classification model is better?
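The four cells can be counted from actual and predicted labels, which also shows why the question of the better model is not settled by accuracy alone. The labels and the two models below are hypothetical, chosen so that both reach the same accuracy with different kinds of error.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, FN, FP, TN) for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # positive sample, predicted positive
        elif a == positive:
            fn += 1  # positive sample, predicted negative
        elif p == positive:
            fp += 1  # negative sample, predicted positive
        else:
            tn += 1  # negative sample, predicted negative
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical test set: 4 positive and 6 negative samples.
actual = ["yes"] * 4 + ["no"] * 6
model_a = ["yes", "yes", "yes", "no", "no", "no", "no", "no", "no", "yes"]
model_b = ["yes", "yes", "no", "no", "no", "no", "no", "no", "no", "no"]

counts_a = confusion_counts(actual, model_a)
counts_b = confusion_counts(actual, model_b)
```

Both hypothetical models are right on 8 of 10 samples, but model A makes one false positive and one false negative while model B makes two false negatives and no false positives. Which one is better depends on which error is more costly in the application.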
Validation Techniques
- Simple Validation: the data is divided once into a training set and a test set.
- Cross Validation
- n-Fold Cross Validation: the test set changes from fold to fold, and the rest of the data forms the training set.
- Bootstrap Method
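Two of the listed techniques can be sketched as index splits. The slides give only the names, so the definitions used here are the common ones and should be read as assumptions: in n-fold cross validation every sample is in the test set exactly once, and in the bootstrap method the training set is drawn with replacement while the samples never drawn form the test set. The function names are invented for the example.

```python
import random


def n_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each fold, folds taken in order."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def bootstrap_split(n_samples, rng):
    """Draw n_samples training indices with replacement; unused ones are the test set."""
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    test = [i for i in range(n_samples) if i not in drawn]
    return train, test


folds = list(n_fold_splits(10, 3))
boot_train, boot_test = bootstrap_split(10, random.Random(0))
```

In practice the data would be shuffled before the folds are cut; the sketch keeps the original order so the splits are easy to inspect.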