Presentation transcript: "CSE 439 – Data Mining, Assist. Prof. Dr. Derya BİRANT"
1 CSE 439 – Data Mining
Assist. Prof. Dr. Derya BİRANT
Classification – Part 1
2 Outline
- What Is Classification?
- Classification Examples
- Classification Methods:
  - Decision Trees
  - Bayesian Classification
  - K-Nearest Neighbor
  - Neural Network
  - Genetic Algorithms
  - Support Vector Machines (SVM)
  - Fuzzy Set Approaches
3 What Is Classification?
- Construction of a model to classify data
- When constructing the model, use the training set and the class labels
- After the construction of the model, use it to classify new data
4 Classification (A Two-Step Process)
1. Model construction
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, trees, or mathematical formulae
2. Model usage (classifying future or unknown objects)
- Estimate the accuracy rate of the model: the percentage of test-set samples that are correctly classified by the model
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
5 Classification (A Two-Step Process)
[Diagram: the DM engine builds a mining model from training data; the DM engine then applies the mining model to the data to predict, producing predicted data]
6 Classification Example
Process (1): Model construction — classification algorithms learn a classifier (model) from the training data, e.g.
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the model in prediction — the classifier is evaluated on testing data, then applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured?
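The tenure rule on this slide can be turned into a toy program to illustrate the two-step process. This is a minimal sketch with made-up data values: the "learned" model is simply the slide's fixed rule (a real learner would induce it from the training set).

```python
def train(training_set):
    # Step 1, model construction: here the "learned" model is just the
    # rule from the slide; a real algorithm would induce it from the data.
    def model(rank, years):
        return "yes" if rank == "professor" or years > 6 else "no"
    return model

def accuracy(model, test_set):
    # Accuracy rate: fraction of test samples classified correctly.
    correct = sum(model(rank, years) == label
                  for rank, years, label in test_set)
    return correct / len(test_set)

training = [("professor", 2, "yes"), ("assistant", 7, "yes"),
            ("assistant", 3, "no")]
model = train(training)

# Step 2, model usage: estimate accuracy on a held-out test set, then
# classify the unseen tuple (Jeff, Professor, 4).
test_set = [("professor", 4, "yes"), ("assistant", 2, "no")]
print(accuracy(model, test_set))   # 1.0 -> acceptable, use the model
print(model("professor", 4))       # yes
```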
7 Classification Example
Given old data about customers and payments, predict a new applicant's loan eligibility.
[Diagram: previous customers, described by age, salary, profession, location, and customer type and labeled good or bad, are fed to a classifier; the classifier produces rules such as "Salary > 5L AND Prof. = Exec → good/bad", which are then applied to the new applicant's data]
10 Decision Trees
A decision tree is a tree where:
- internal nodes are simple decision rules on one or more attributes,
- leaf nodes are predicted class labels.
Decision trees are used for deciding between several courses of action.
[Example tree: the root splits on age (<=30, 31..40, >40); the <=30 branch splits on student (no/yes), the >40 branch on credit rating (fair/excellent); leaves predict Yes or No]
11 Decision Tree Applications
Decision trees are used extensively in data mining. They have been applied to classify:
- medical patients based on the disease,
- equipment malfunction by cause,
- loan applicants by likelihood of payment,
- ...
[Example tree with tests Salary < 1M, Job = teacher, Age < 30 and leaves Good, Bad, House Hiring]
12 Decision Trees (Different Representation)
A decision tree partitions the attribute space; the same tree can be drawn as rectangular splits of the area.
[Diagram: a tree splitting on Age (<30, >=30) and Car Type (Minivan vs. Sports, Truck) with YES/NO leaves, shown alongside the equivalent partition of the Age–Car Type plane at ages 30 and 60]
13 Decision Tree Advantages / Disadvantages
Positives (+):
- Reasonable training time
- Fast application
- Easy to interpret (can be re-represented as if-then-else rules)
- Easy to implement
- Can handle a large number of features
- Does not require any prior knowledge of the data distribution
Negatives (-):
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
- Output attribute must be categorical
- Limited to one output attribute
14Rules Indicated by Decision Trees Write a rule for each path in the decision tree from the root to a leaf.
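The root-to-leaf procedure above can be sketched in code. The nested-dict tree encoding is a hypothetical representation (not from the slides), using the play-tennis tree this lecture builds later.

```python
# One rule per root-to-leaf path: internal nodes contribute conditions,
# the leaf contributes the predicted class label.
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rainy":    {"Wind": {"Weak": "Yes", "Strong": "No"}},
}}

def paths_to_rules(node, conditions=()):
    if not isinstance(node, dict):      # leaf: emit the accumulated rule
        cond = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {cond} THEN Play = {node}"]
    (attr, branches), = node.items()    # internal node: one attribute test
    rules = []
    for value, child in branches.items():
        rules += paths_to_rules(child, conditions + ((attr, value),))
    return rules

for rule in paths_to_rules(tree):
    print(rule)
# e.g. IF Outlook = Sunny AND Humidity = High THEN Play = No
```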
15 Decision Tree Algorithms
- ID3 — Quinlan (1981). Tries to reduce the expected number of comparisons.
- C4.5 — Quinlan (1993). An extension of ID3; just starting to be used in data mining applications; also used for rule induction.
- CART — Breiman, Friedman, Olshen, and Stone (1984). Classification and Regression Trees.
- CHAID — Kass (1980). Oldest decision tree algorithm; well established in the database marketing industry.
- QUEST — Loh and Shih (1997).
16 Decision Tree Construction
Which attribute is the best classifier? Calculate the information gain G(S, A) for each attribute A:
G(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) Entropy(S_v), summed over the values v of A.
The basic idea is to select the attribute with the highest information gain.
17 Decision Tree Construction
Which attribute first?
[Play-tennis training data with attributes Outlook (Sunny, Overcast, Rainy), Temperature (Hot, Mild, Cool), Humidity (High, Normal), Wind (Weak, Strong), and class Play Tennis (Yes, No)]
18 Decision Tree Construction
For the play-tennis training data:
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Outlook has the highest information gain, so it is chosen as the root.
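These gains can be reproduced directly. The sketch below assumes the slide's table is the standard 14-example play-tennis dataset (the row groupings on the later tree slide are consistent with it); the computed gains match the slide's values up to rounding.

```python
from math import log2
from collections import Counter

data = [  # (Outlook, Temperature, Humidity, Wind, Play)
    ("Sunny","Hot","High","Weak","No"),     ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rainy","Mild","High","Weak","Yes"),
    ("Rainy","Cool","Normal","Weak","Yes"), ("Rainy","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rainy","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rainy","Mild","High","Strong","No"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(rows):
    # Entropy of the class label (last column) over the given rows.
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(rows, attr):
    # G(S, A) = Entropy(S) - sum over values v of (|S_v|/|S|) Entropy(S_v)
    i = ATTRS.index(attr)
    total = len(rows)
    g = entropy(rows)
    for value in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == value]
        g -= len(subset) / total * entropy(subset)
    return g

for a in ATTRS:
    print(a, round(gain(data, a), 3))  # gains ~ 0.247, 0.029, 0.152, 0.048
```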
19 Decision Tree Construction
Which attribute is next?
[Partial tree: Outlook — Sunny → ?, Overcast → Yes, Rainy → ?]
20 Decision Tree Construction
[The 14 training examples R1–R14 (Outlook, Temperature, Humidity, Wind, Tennis) and the finished tree:
- Outlook = Overcast → Yes [R3, R7, R12, R13]
- Outlook = Sunny → Humidity: High → No [R1, R2, R8]; Normal → Yes [R9, R11]
- Outlook = Rainy → Wind: Weak → Yes [R4, R5, R10]; Strong → No [R6, R14]]
21 Another Example
At the weekend you can: go shopping, watch a movie, play tennis, or just stay in. What you do depends on three things:
- the weather (windy, rainy or sunny);
- how much money you have (rich or poor);
- whether your parents are visiting.
24 Classification Techniques 2 – Bayesian Classification
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
Foundation: based on Bayes' theorem. Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
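A small worked instance of Bayes' theorem makes the formula concrete. All probabilities below are made-up numbers, loosely themed on the loan example from slide 7: H = "customer is a good payer", X = "salary > 5L".

```python
p_h = 0.6              # prior P(H): 60% of past customers were good payers
p_x_given_h = 0.8      # P(X|H): 80% of good payers have salary > 5L
p_x_given_not_h = 0.3  # P(X|not H): 30% of bad payers have salary > 5L

# Total probability: P(X) = P(X|H) P(H) + P(X|not H) P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 2))  # 0.8: seeing salary > 5L raises P(H) from 0.6
```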
28 K-Nearest Neighbor (k-NN)
An object is classified by a majority vote of its neighbors (the k closest members).
If k = 1, the object is simply assigned to the class of its nearest neighbor.
The Euclidean distance measure is used to calculate how close the neighbors are.
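The majority-vote idea above can be sketched in a few lines. The training points and labels here are made up for illustration.

```python
from math import dist          # Euclidean distance (Python 3.8+)
from collections import Counter

# Labeled training points: (coordinates, class label), made-up data.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B"), ((5.5, 4.5), "B")]

def knn_predict(point, train, k=3):
    # Sort the training set by Euclidean distance to the query point,
    # keep the k closest, and return the majority class among them.
    neighbors = sorted(train, key=lambda item: dist(point, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.5), train, k=1))  # A: class of the single nearest neighbor
print(knn_predict((4.8, 4.9), train, k=3))  # B: majority of the 3 nearest neighbors
```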