ERROR CHECKING and OUTLIERS

Presentation transcript:

ERROR CHECKING and OUTLIERS
Prof. Dr. Hamit ACEMOĞLU

The Aim
By the end of this lecture, students will be aware of error checking and outliers.

The Goals
- Understand the importance of error checking before beginning to work with the data
- Be able to list the screening methods for data errors: frequency distribution and range (distribution width)
- Be able to detect outliers using frequencies and graphics
- Explain how to handle outliers

Errors can creep in while collecting data or entering it into the computer. Carefully following the data entry rules reduces the likelihood of errors. Even so, it is very important to review the database for incorrect data before starting the analysis phase.

After finishing the analysis and writing the article, we may realize that some data were entered incorrectly or that mistakes were made during measurement. In such cases the analysis may have to be repeated, sometimes from scratch.

- Most of the mistakes we encounter are made while entering data into the computer.
- With long forms, data are sometimes entered without looking at the screen in order to speed up entry. In this case, if one variable field is skipped, all remaining data are shifted.
- Errors may also occur when the same key is pressed more than once. In this case, values such as 11 or 111 may be entered instead of 1.

To catch data entry errors, two people can enter the data independently and the two separate databases can then be compared. Even after taking the utmost care during data entry, we still have to check the entered data for errors.
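The comparison of two independently entered databases can be sketched in a few lines of Python. This is only an illustration: the records, the variable names and the `compare_entries` helper are hypothetical and are not part of the lecture's SPSS workflow.

```python
# Hypothetical double-entry data: each dict maps a survey id to one record.
entry_a = {
    1: {"age": 34, "sex": 1},
    2: {"age": 51, "sex": 2},
    3: {"age": 47, "sex": 1},
}
entry_b = {
    1: {"age": 34, "sex": 1},
    2: {"age": 15, "sex": 2},   # digits transposed by the second operator
    3: {"age": 47, "sex": 11},  # key pressed twice
}


def compare_entries(first, second):
    """Return (survey id, variable, value in first, value in second) for every mismatch."""
    mismatches = []
    for survey_id in sorted(first.keys() | second.keys()):
        rec_a = first.get(survey_id, {})
        rec_b = second.get(survey_id, {})
        for variable in sorted(rec_a.keys() | rec_b.keys()):
            if rec_a.get(variable) != rec_b.get(variable):
                mismatches.append(
                    (survey_id, variable, rec_a.get(variable), rec_b.get(variable))
                )
    return mismatches


discrepancies = compare_entries(entry_a, entry_b)
for row in discrepancies:
    print(row)
```

Each reported mismatch only says that the two entries disagree; the paper questionnaire still has to be consulted to decide which value is correct.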

Error search
Because only a limited set of values can be entered, categorical variables are relatively easy to check for errors. Numerical variables are more difficult to check. Since they have a plausible range, sorting the data shows whether any values fall outside that range.

- We can check the data one by one by eye, but this takes time in large databases. For a variable coded 1 ("yes") and 2 ("no"), any value other than 1 or 2 is easy to spot.
- We can anticipate implausible values in numerical variables. In a study among high school students, we expect age to range from 14 to 20 and hemoglobin to range from 10 to 16 g/dL.
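The same checks can be automated instead of being done by eye. The sketch below assumes a small hypothetical data set; the variable names, the `smoker` question and the `find_suspect_values` helper are invented for illustration, while the age and hemoglobin limits are the ones quoted above.

```python
# Hypothetical records from a study of high school students.
records = [
    {"id": 1, "age": 15, "hemoglobin": 13.2, "smoker": 2},
    {"id": 2, "age": 17, "hemoglobin": 1.4, "smoker": 1},    # misplaced decimal?
    {"id": 3, "age": 61, "hemoglobin": 12.8, "smoker": 2},   # 16 typed as 61?
    {"id": 4, "age": 16, "hemoglobin": 14.0, "smoker": 11},  # key pressed twice?
]

# Plausible limits quoted in the lecture: age 14-20, hemoglobin 10-16.
NUMERIC_RANGES = {"age": (14, 20), "hemoglobin": (10, 16)}
# Allowed codes for categorical variables: 1 = yes, 2 = no.
VALID_CODES = {"smoker": {1, 2}}


def find_suspect_values(rows):
    """Return (id, variable, value) for each value outside its expected range or code list."""
    suspects = []
    for row in rows:
        for variable, (low, high) in NUMERIC_RANGES.items():
            if not low <= row[variable] <= high:
                suspects.append((row["id"], variable, row[variable]))
        for variable, codes in VALID_CODES.items():
            if row[variable] not in codes:
                suspects.append((row["id"], variable, row[variable]))
    return suspects


suspects = find_suspect_values(records)
for suspect in suspects:
    print(suspect)
```

A flagged value is a candidate for checking against the original questionnaire, not automatically an error.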

To find inaccurate data easily, we can use the following in SPSS:
- Sorting
- The range (width of the distribution)
- The frequency distribution

Let us check the hataayiklama.sav database for errors. The data are sorted by the "id" (survey number) variable. We will look at the "age" variable. We can check the data by sorting on this variable.
1-Sorting
Data > Sort Cases > [bring the "age" variable into the "Sort by" area] > OK
We see that the age variable ranges from 22 to 99. These values may be normal. We can find the questionnaire of the 99-year-old individual (survey no. 34) and check the age.

Let's have a look at a categorical variable. We can check the "sex" variable by sorting, as in the example above.

2-The range (width of the distribution)
Another method is to look at the range of the variable.
Analyze > Descriptive Statistics > Descriptives > [bring the "age" variable into the "Variable(s)" area] > OK

Descriptive Statistics

                      N     Minimum   Maximum   Mean    Std. Deviation
Age                   439   22        99        54.27   12.494
Valid N (listwise)
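Outside SPSS, the same kind of summary (N, minimum, maximum, mean, standard deviation) can be produced with Python's standard `statistics` module. The ages and survey ids below are made up for illustration and are not the values in hataayiklama.sav.

```python
import statistics

# Hypothetical ages keyed by survey id; 99 is the value to double-check.
ages = {12: 22, 7: 45, 34: 99, 3: 58, 21: 61, 15: 40}

values = list(ages.values())
summary = {
    "n": len(values),
    "minimum": min(values),
    "maximum": max(values),
    "mean": round(statistics.mean(values), 2),
    "std_dev": round(statistics.stdev(values), 3),  # sample standard deviation
}

# Sorting by value puts the smallest and largest entries at the ends of the list,
# so the survey ids that need checking are easy to read off.
sorted_by_age = sorted(ages.items(), key=lambda item: item[1])
youngest_id, youngest_age = sorted_by_age[0]
oldest_id, oldest_age = sorted_by_age[-1]

print(summary)
print("check survey", oldest_id, "with age", oldest_age)
```

The minimum and maximum show at a glance whether any value falls outside the plausible range, and the sorted list tells us which questionnaire to pull.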

3-The frequency distribution
Another method is to look at the frequency distribution of the variable:
Analyze > Descriptive Statistics > Frequencies > [bring the "sex" variable into the "Variable(s)" area] > OK
We see that 440 values were entered for sex: 241 are 1 (Male), 195 are 2 (Female), and there is one each of 3, 4, 11 and 22. The value entered as 11 is most likely 1 (Male), and the value entered as 22 is most likely 2 (Female). We should check all four of these values by finding their survey numbers, then locate and correct the errors.

Sex of the patient

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Male      241         54.8      54.8            54.8
        Female    195         44.3      44.3            99.1
        3         1           0.2       0.2             99.3
        4         1           0.2       0.2             99.5
        11        1           0.2       0.2             99.8
        22        1           0.2       0.2             100.0
        Total     440         100.0     100.0
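A frequency table of this kind can also be built with `collections.Counter`, which makes the codes that are not in the code book stand out. The coded answers below are hypothetical and much fewer than the 440 in the lecture's data set.

```python
from collections import Counter

# Hypothetical coded answers keyed by survey id (1 = Male, 2 = Female).
sex_by_id = {1: 1, 2: 2, 3: 1, 4: 11, 5: 2, 6: 3, 7: 1, 8: 22}
LABELS = {1: "Male", 2: "Female"}

counts = Counter(sex_by_id.values())
total = sum(counts.values())

# Print a frequency table with percent and cumulative percent columns.
cumulative = 0.0
for code in sorted(counts):
    percent = 100 * counts[code] / total
    cumulative += percent
    label = LABELS.get(code, str(code))
    print(f"{label:<8}{counts[code]:>5}{percent:>8.1f}{cumulative:>8.1f}")

# Survey ids whose code is not in the code book and must be checked on paper.
invalid_ids = sorted(sid for sid, code in sex_by_id.items() if code not in LABELS)
print("check surveys:", invalid_ids)
```

Listing the survey ids alongside the invalid codes saves a second pass through the data when the questionnaires are pulled for correction.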

What do we do if data are missing?
Missing data have several causes:
1. The data could not be obtained because the individual refused (e.g. an individual may not want to report alcohol use).
2. The data could not be obtained because the question does not apply to the individual (e.g. male respondents will leave a question such as "Do you use birth control pills?" blank).
3. The data were collected but not entered into the computer (data entry error).

What do we do if data are missing?
Whatever the cause, missing data are undesirable. The problem is more serious if the missing data concern our main outcome measure. Some analyses cannot be performed, or the reliability of the results is affected. Our results may be severely biased.

What do we do if data are missing?
It may be possible to reduce this bias by:
- using appropriate statistical methods
- estimating the missing data in some way
- minimizing the amount of missing data at the outset
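Before choosing how to deal with missing data, it helps to know how much is missing in each variable. The sketch below counts missing values in a hypothetical data set; the variable names, the choice of `hba1c` as the main outcome measure and the `missing_summary` helper are assumptions made for illustration.

```python
# Hypothetical records; None marks a value that was never entered.
records = [
    {"id": 1, "age": 45, "alcohol_use": 2, "hba1c": 7.1},
    {"id": 2, "age": 52, "alcohol_use": None, "hba1c": 6.4},
    {"id": 3, "age": None, "alcohol_use": None, "hba1c": 8.0},
    {"id": 4, "age": 61, "alcohol_use": 1, "hba1c": None},
]


def missing_summary(rows, variables):
    """Return {variable: (number missing, percent missing)} for the given variables."""
    summary = {}
    for variable in variables:
        n_missing = sum(1 for row in rows if row.get(variable) is None)
        summary[variable] = (n_missing, round(100 * n_missing / len(rows), 1))
    return summary


missing = missing_summary(records, ["age", "alcohol_use", "hba1c"])
# Ids of records that lack the (hypothetical) main outcome measure.
ids_missing_outcome = [row["id"] for row in records if row["hba1c"] is None]

print(missing)
print("missing outcome for surveys:", ids_missing_outcome)
```

Records that lack the main outcome measure deserve the first look, since those are the ones most likely to bias the results.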

Outliers
Outliers are values that differ greatly from all the others and appear not to fit the rest of the data set. Such values may be errors, but they may also reflect the truth. Therefore, they should be checked for errors and their accuracy verified.
E.g., a height of 190 cm for a woman is an outlier. However, although rare, it is possible. If available, we can also look at this individual's age and weight to interpret the value.

We must keep these values when we decide that the outliers reflect reality. An outlier should be deleted only when there is good reason to doubt its accuracy. The presence of many outliers may affect the statistical analysis. In this case we can apply a data transformation or choose non-parametric tests.

- By sorting the data, we can check for outliers by eye.

1st-year Biostatistics 2009-2010

- Another method is to check by producing box plots.
* Graphs > Interactive > Boxplot [put the "Weight" variable on the Y axis and the "Marital status" variable on the X axis] > OK
* In the boxplot, individuals marked beyond the whiskers represent outliers.

Boxplots give information about 5 features of a variable:
- The smallest value (minimum)
- The value below which 25% of the values lie (first quartile)
- The median
- The value below which 75% of the values lie (third quartile)
- The largest value (maximum)

When box-and-whisker plots are drawn in SPSS, outliers and extreme values are also shown beyond the whiskers of the box. A data point is defined as an "outlier" if it lies more than 1.5 interquartile ranges away from the box, and as an "extreme" value if it lies 3 or more interquartile ranges away. In SPSS, outliers are shown with a circle and extreme values with an asterisk.
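The 1.5 and 3 interquartile range rule can be written out as a short Python sketch. The weights below are hypothetical, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which is an assumption: SPSS may compute the box edges slightly differently, so borderline points can be classified differently.

```python
import statistics

# Hypothetical body weights in kg.
weights = [58, 62, 65, 67, 70, 71, 73, 75, 78, 80, 82, 105, 190]


def classify_outliers(values):
    """Split values into outliers (beyond 1.5 IQR) and extremes (beyond 3 IQR) from the box."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    outliers, extremes = [], []
    for v in values:
        # Distance from the nearer edge of the box; negative when v lies inside the box.
        distance = max(q1 - v, v - q3)
        if distance > 3 * iqr:
            extremes.append(v)
        elif distance > 1.5 * iqr:
            outliers.append(v)
    return outliers, extremes


outliers, extremes = classify_outliers(weights)
print("outliers:", outliers)
print("extremes:", extremes)
```

Here the box runs from 67 to 80 kg, so the interquartile range is 13 kg: 105 kg lies 25 kg above the box and counts as an outlier, while 190 kg lies 110 kg above it and counts as an extreme value.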


In SPSS it is possible to draw boxplots for several variables at once and look at their outliers. Let's draw boxplots of the "age", "weight" and "height" variables in the Diyabet.sav data set.
Graphs > Legacy Dialogs > Boxplot [Simple, select "Summaries of separate variables", then press the Define button] > [bring the "age", "weight" and "height" variables into the "Boxes represent:" area] > OK
We get the following graphic:

As can be seen, in the "age" variable the age of individual no. 112 (90 years old) is flagged as an outlier. The "weight" variable has both outliers and extreme values. No outliers or extreme values were found in the "height" variable.

Summary
Analysis should not be started immediately after data entry is finished. The analysis phase should always begin after error checking. Error checking is done by eye and with the frequency, graphics and range (distribution width) methods.