ERROR CHECKING and OUTLIERS
Prof. Dr. Hamit ACEMOĞLU
The Aim
By the end of this lecture, students will be aware of error checking and outliers.
The Goals
-Understand the importance of error checking before beginning to work with the data
-Be able to list the screening methods for data errors
  -frequency distribution
  -distribution width
-Be able to detect outliers
  -frequency
  -graphics
-Explain how to handle outliers
Errors may be introduced while collecting data or entering it into the computer. Carefully following the data entry rules reduces the possibility of data entry errors. Nevertheless, it is very important to review the database for incorrect data before starting the analysis phase.
After completing the analysis and writing our article, we may realise that some data were entered incorrectly or that some mistakes were made during measurement. In such cases, the analysis may have to be redone, even completely.
-Most of the mistakes we encounter are made while entering data into the computer.
-Sometimes, to speed up data entry for long forms, data are typed in without looking at the computer screen. In this case, if one variable field is skipped, all the remaining data will be shifted.
-Errors may also occur when the same key is pressed more than once. In this case, values such as 11 or 111 may be entered instead of 1.
To catch data entry errors, the data can be entered by two people into separate databases, which are then compared. Even after taking the utmost care during data entry, we still have to check the entered data for errors.
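The comparison of two separately entered databases can be sketched as follows. This is only an illustration: the records, the field names and the `compare_entries` helper are hypothetical and are not part of the lecture's data sets.

```python
# Hypothetical records keyed by questionnaire id: (age, sex code).
entry_a = {1: (34, 1), 2: (51, 2), 3: (47, 1), 4: (62, 2)}
entry_b = {1: (34, 1), 2: (51, 2), 3: (74, 1), 4: (62, 22)}

FIELDS = ("age", "sex")


def compare_entries(first, second):
    """Return (id, field, value_in_first, value_in_second) for every disagreement."""
    mismatches = []
    for record_id in sorted(first.keys() | second.keys()):
        row_a = first.get(record_id)
        row_b = second.get(record_id)
        if row_a is None or row_b is None:
            # The questionnaire was entered by only one of the two people.
            mismatches.append((record_id, "record", row_a, row_b))
            continue
        for field, value_a, value_b in zip(FIELDS, row_a, row_b):
            if value_a != value_b:
                mismatches.append((record_id, field, value_a, value_b))
    return mismatches


mismatches = compare_entries(entry_a, entry_b)
for item in mismatches:
    print(item)
```

Every reported disagreement should be resolved by going back to the paper questionnaire, since the comparison alone cannot tell which of the two entries is correct.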
Error search
Because only a limited set of values can be entered, categorical variables are relatively easy to check for errors. Numerical variables are more difficult to check. Since they have a certain expected range, the data can be sorted to see whether any value falls outside that range.
-We can check the data one by one by eye. This takes time in large databases.
-For a variable coded 1 “yes” and 2 “no”, it is easy to find any value other than 1 and 2.
-We can spot implausible values in numerical variables. In a study among high school students, we expect
  *the age range to be 14-20 and
  *the hemoglobin range to be 10-16 g/dl.
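The checks above can also be automated. The sketch below assumes the expected ranges from the high school example (age 14-20, hemoglobin 10-16) and the 1/2 coding for yes/no. The records and the `find_suspect_values` helper are hypothetical.

```python
# Hypothetical records: (id, answer code, age in years, hemoglobin).
records = [
    (1, 1, 15, 13.2),
    (2, 2, 17, 11.8),
    (3, 11, 16, 12.5),  # code outside {1, 2}
    (4, 1, 61, 14.0),   # age outside 14-20
    (5, 2, 18, 1.3),    # hemoglobin outside 10-16
]

VALID_CODES = {1, 2}  # 1 = "yes", 2 = "no"
EXPECTED_RANGE = {"age": (14, 20), "hemoglobin": (10, 16)}


def find_suspect_values(rows):
    """Return (id, variable, value) for every value that needs checking."""
    suspects = []
    for record_id, code, age, hemoglobin in rows:
        if code not in VALID_CODES:
            suspects.append((record_id, "answer", code))
        for name, value in (("age", age), ("hemoglobin", hemoglobin)):
            low, high = EXPECTED_RANGE[name]
            if not low <= value <= high:
                suspects.append((record_id, name, value))
    return suspects


suspects = find_suspect_values(records)
for item in suspects:
    print(item)
```

A flagged value is only a candidate for checking against the questionnaire. It is not automatically an error.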
In order to easily find inaccurate data, we can use the following methods in SPSS:
-Sorting
-Distribution width
-Frequency distribution
Let us debug the hataayiklama.sav database. The data are sorted by the “id” (survey number) variable. We will look at the “age” variable. We can check the data by sorting on this variable.
1-Sorting
Data > Sort Cases > [Let’s bring the “age” variable into the “Sort by” area] > OK
We see that the age variable ranges from 22 to 99. These values may be normal. We can find the questionnaire of the 99-year-old individual (questionnaire no. 34) and check the age.
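Outside SPSS, the same sorting idea can be sketched in a few lines. The records below are hypothetical, except that questionnaire no. 34 with age 99 and the minimum age of 22 mirror the values mentioned above.

```python
# Hypothetical (id, age) records in survey-number order.
records = [
    {"id": 3, "age": 47},
    {"id": 7, "age": 22},
    {"id": 12, "age": 54},
    {"id": 21, "age": 61},
    {"id": 34, "age": 99},
]

# Sort by age so that the smallest and largest values end up at the two ends.
by_age = sorted(records, key=lambda record: record["age"])
youngest, oldest = by_age[0], by_age[-1]

print("youngest:", youngest)
print("oldest:", oldest)
```

Keeping the “id” next to the sorted value is the important part, because it tells us which paper questionnaire to pull out and check.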
Let's have a look at a categorical variable. We can check the “sex” variable by sorting, as in the example above.
2-Distribution width
Another method is to look at the width (range) of the variable's distribution.
Analyze > Descriptive Statistics > Descriptives > [Let’s bring the “age” variable into the “Variable(s)” area] > OK
Descriptive Statistics
                     N     Minimum   Maximum   Mean    Std. Deviation
Age                  439   22        99        54.27   12.494
Valid N (listwise)
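A summary like the Descriptives output can be approximated with Python's standard library. This is a minimal sketch on hypothetical ages. It is not the lecture's data set, so the mean and standard deviation will not match the table above.

```python
from statistics import mean, stdev

# Hypothetical ages; None marks a value that was not entered.
ages = [22, 47, None, 54, 61, 99]


def describe(values):
    """Return N, minimum, maximum, mean and sample standard deviation of the valid values."""
    valid = [v for v in values if v is not None]
    return {
        "n": len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": mean(valid),
        "sd": stdev(valid),  # sample standard deviation (n - 1 in the denominator)
    }


summary = describe(ages)
print(summary)
```

The minimum and maximum are the figures to compare against the range we expect for the variable.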
3-Frequency distribution
Another method is to look at the frequency distribution of the variable:
Analyze > Descriptive Statistics > Frequencies > [Let’s bring the “sex” variable into the “Variable(s)” area] > OK
We see that 440 values were entered for sex: 241 are 1 (Male), 195 are 2 (Female), and there is one each of 3, 4, 11 and 22. The value entered as 11 is very likely 1 (Male), and the value entered as 22 is very likely 2 (Female). We should find the questionnaire numbers of these 4 records, check them, and correct the errors.
Sex of the patient
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Male     241         54.8      54.8            54.8
        Female   195         44.3      44.3            99.1
        3        1           0.2       0.2             99.3
        4        1           0.2       0.2             99.5
        11       1           0.2       0.2             99.8
        22       1           0.2       0.2             100.0
        Total    440         100.0     100.0
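The frequency table can be rebuilt from the counts reported for “sex”, and the invalid codes can be flagged at the same time. In the sketch below the counts come from the lecture. The `likely_intended` helper is a hypothetical heuristic for the "same key pressed twice" error and only produces a guess that still has to be checked against the questionnaire.

```python
from collections import Counter

# Counts reported in the lecture's frequency table for "sex".
counts = Counter({1: 241, 2: 195, 3: 1, 4: 1, 11: 1, 22: 1})
LABELS = {1: "Male", 2: "Female"}

total = sum(counts.values())
cumulative = 0
table = []  # rows of (code, frequency, percent, cumulative percent)
for code in sorted(counts):
    cumulative += counts[code]
    table.append((
        code,
        counts[code],
        round(100 * counts[code] / total, 1),
        round(100 * cumulative / total, 1),
    ))

# Any code without a label is a data entry error to be checked.
invalid = [code for code in sorted(counts) if code not in LABELS]


def likely_intended(code):
    """Guess the intended code when one valid digit was typed repeatedly, else None."""
    digits = set(str(code))
    if len(digits) == 1:
        candidate = int(digits.pop())
        if candidate in LABELS:
            return candidate
    return None


for row in table:
    print(row)
for code in invalid:
    print(code, "->", likely_intended(code))
```

For the codes 3 and 4 the helper returns `None`, because nothing in the value itself suggests what was meant.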
If missing data are present, what do we do?
There are several causes of missing data:
1-The data could not be obtained because the individual refused (e.g. an individual may not want to report alcohol use).
2-The data could not be obtained because the question does not apply to the individual (e.g. male respondents will leave a question such as “Do you use birth control pills?” blank).
3-The data were collected but not entered into the computer (clerical error).
Whatever the cause, missing data are undesirable. The situation is more serious if the variable with missing data is our main outcome measure. Some analyses cannot be performed, or the reliability of the results is affected. Our results may be severely biased.
It may be possible to reduce this bias by:
-using appropriate statistical methods,
-estimating the missing data in some way,
-minimizing the amount of missing data at the outset.
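Before deciding how to handle missing data, it helps to know how much is missing and in which questionnaires. The sketch below is one possible way to count it. The records, the variable names and the `missing_report` helper are hypothetical. It treats a question as missing only for individuals to whom it applies, in line with the birth control pill example.

```python
# Hypothetical records; None marks a value that was not entered. sex: 1 = Male, 2 = Female.
records = [
    {"id": 1, "sex": 1, "alcohol": 2, "pill": None},
    {"id": 2, "sex": 2, "alcohol": None, "pill": 1},
    {"id": 3, "sex": 2, "alcohol": 1, "pill": None},
    {"id": 4, "sex": 1, "alcohol": 1, "pill": None},
]


def missing_report(rows, variable, applies=lambda row: True):
    """Count missing values of a variable among the rows to which the question applies."""
    applicable = [row for row in rows if applies(row)]
    missing_ids = [row["id"] for row in applicable if row[variable] is None]
    return {
        "applicable": len(applicable),
        "missing": len(missing_ids),
        "ids": missing_ids,
    }


alcohol = missing_report(records, "alcohol")
# The pill question applies only to female respondents.
pill = missing_report(records, "pill", applies=lambda row: row["sex"] == 2)

print("alcohol:", alcohol)
print("pill:", pill)
```

The returned ids show which questionnaires to revisit, for example to see whether the value was collected but not typed in.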
Outliers
Outliers are values that differ markedly from all the others and appear inconsistent with the rest of the data set. Such values may be incorrect, but they may also reflect the truth. Therefore, error checking should be performed and their accuracy must be verified.
E.g.: A woman who is 190 cm tall is an outlier. However, although rare, this is possible. If available, we can also look at this individual's age and weight data to interpret the value.
When we decide that outliers reflect reality, we must keep these values. An outlier should be deleted only if its accuracy is in doubt. The presence of outliers may affect the statistical analysis. In this case we can apply a data transformation or choose a non-parametric test.
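To illustrate how an outlier can affect an analysis, the sketch below compares how far the mean and the median move when one extreme value is present. The weights are hypothetical. The median is used here only as a simple stand-in for rank-based summaries, since many non-parametric tests work with ranks rather than raw values.

```python
from statistics import mean, median

# Hypothetical weights in kg; 150 plays the role of a genuine but extreme value.
weights = [60, 62, 65, 68, 70, 150]
without_outlier = [w for w in weights if w != 150]

# How much each summary changes because of the single extreme value.
mean_shift = mean(weights) - mean(without_outlier)
median_shift = median(weights) - median(without_outlier)

print("shift in mean:", round(mean_shift, 2))
print("shift in median:", median_shift)
```

The mean moves by roughly 14 kg while the median moves by 1.5 kg, which is why methods built on the mean are the ones most affected when such a value is kept.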
-By sorting the data, we can check for outliers by eye.
1st-year Biostatistics 2009-2010
-Another method is to check by producing box plots.
*Graphs > Interactive > Boxplot [Let’s put the “Weight” variable on the Y axis and the “Marital status” variable on the X axis] > OK
*In the boxplot, individuals marked beyond the whiskers represent outliers.
Boxplots give information about 5 features of a variable:
-The smallest value (minimum)
-The value below which 25% of the values lie (first quartile)
-The median
-The value below which 75% of the values lie (third quartile)
-The largest value (maximum)
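The five features listed above can be computed with Python's standard library. This is a minimal sketch on hypothetical weights. The quartile method (`method="inclusive"`) is an assumption and may give slightly different quartiles from those SPSS uses when drawing a boxplot.

```python
from statistics import quantiles

# Hypothetical weights in kg.
weights = [52, 58, 60, 63, 65, 68, 70, 74, 80]


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))


summary = five_number_summary(weights)
print(summary)  # (52, 60.0, 65.0, 70.0, 80)
```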
When box-and-whisker plots are drawn in SPSS, outliers and extreme values are also shown beyond the whiskers of the box. A value is defined as an “outlier” if it lies more than 1.5 interquartile ranges away from the box, and as an “extreme” value if it lies 3 or more interquartile ranges away. In SPSS, outliers are shown with a circle and extreme values with an asterisk.
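The 1.5 and 3 interquartile range rule can be written out as a small function. This is a sketch under stated assumptions: the data are hypothetical, `classify_boxplot_points` is an illustrative name, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which may differ slightly from the quartiles SPSS uses for its boxplots.

```python
from statistics import quantiles


def classify_boxplot_points(values):
    """Split values into outliers (> 1.5 IQR from the box) and extremes (>= 3 IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    result = {"outlier": [], "extreme": []}
    for x in sorted(values):
        # Distance from the nearest edge of the box; 0 for values inside the box.
        distance = max(q1 - x, x - q3, 0)
        if distance == 0:
            continue
        if distance >= 3 * iqr:
            result["extreme"].append(x)
        elif distance > 1.5 * iqr:
            result["outlier"].append(x)
    return result


# Hypothetical weights in kg: Q1 = 60, Q3 = 70, so the interquartile range is 10.
weights = [50, 55, 58, 60, 62, 64, 65, 66, 68, 70, 72, 90, 110]
flagged = classify_boxplot_points(weights)
print(flagged)
```

Here 90 lies 20 kg above the box (more than 1.5 × 10) and is reported as an outlier, while 110 lies 40 kg above the box (more than 3 × 10) and is reported as an extreme value.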
In SPSS it is possible to draw boxplots for several variables at once and look at their outliers. Let’s draw boxplots of the “age”, “weight” and “height” variables in the Diyabet.sav data set.
Graphs > Legacy Dialogs > Boxplot [Select Simple and “Summaries of separate variables”, then press the Define button] > [Bring the “age”, “weight” and “height” variables into the “Boxes represent:” area] > OK
We get the following graphic:
[Figure: boxplots of “age”, “weight” and “height”]
As can be seen, in the “age” variable the age of the 112th individual (90 years old) is marked as an outlier. The “weight” variable has both outliers and extreme values. No outliers or extreme values were detected in the “height” variable.
Summary
Analysis should not be started immediately after data entry is finished. The analysis phase should always begin after error checking. Error checking is done by eye and by the frequency, graphics and distribution width methods.