This R tutorial builds on the previous one and arms you with a few more fundamental analysis tools for capturing the essence of your data. These tools go a long way toward understanding the data better.
Continued from the previous post, R : Basic Data Analysis – Part 1
Please use the dataset: WHO (open the XLS file and save it as CSV for the rest of the discussion).
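Once the file is saved as CSV, it can be loaded with read.csv. The sketch below is self-contained, so it first writes a tiny made-up CSV to a temporary file; the rows and values are illustrative assumptions, not the real WHO data, and the name toy is hypothetical. With the real file you would point read.csv at wherever you saved it and assign the result to WHO.

```r
# Write a tiny, made-up CSV so the example runs without the real WHO file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Region,Population,LifeExpectancy",
  "A,Africa,12000,58",
  "B,Africa,45000,61",
  "C,Europe,8000,79",
  "D,Europe,60000,81"
), csv_path)

# stringsAsFactors = TRUE turns text columns such as Region into factors.
# Since R 4.0.0 the default is FALSE, which leaves them as character columns.
toy <- read.csv(csv_path, stringsAsFactors = TRUE)

str(toy)      # column types and a preview of the values
summary(toy)  # per-column summaries
```

table and tapply work on character columns as well, so the factor conversion is optional here; it mainly matters if you want to inspect levels(toy$Region).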
How to create summary tables
- To get to know the data better, we need to dig deeper by computing the sum, count, mean, etc. over the levels of factors or over ranges of numerical variables.
- Let's say we want to know how many countries are in each region, using the Region variable, which is a categorical variable (a factor). We can do this with the simple table command.
table(WHO$Region)

               Africa              Americas Eastern Mediterranean
                   46                    35                    22
               Europe       South-East Asia       Western Pacific
                   53                    11                    27
- Let's make it a little more complex. Say we want to know how many countries in each region have a population above 30000.
table(WHO$Region, WHO$Population > 30000)

                        FALSE TRUE
  Africa                   38    8
  Americas                 29    6
  Eastern Mediterranean    16    6
  Europe                   44    9
  South-East Asia           6    5
  Western Pacific          22    5
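To make the one-way and two-way counts easy to check by hand, here is a minimal sketch on a made-up data frame. The name toy and all of its values are assumptions for illustration only; the calls are the same table calls used on the WHO data.

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  Region     = c("Africa", "Africa", "Africa", "Europe", "Europe"),
  Population = c(12000, 45000, 9000, 8000, 60000)
)

# One-way table: number of rows (countries) per region.
region_counts <- table(toy$Region)
print(region_counts)

# Two-way table: region against a logical condition.
# The FALSE/TRUE columns come from evaluating Population > 30000 row by row.
region_by_size <- table(toy$Region, toy$Population > 30000)
print(region_by_size)

# Individual cells can be looked up by row and column name.
large_in_africa <- region_by_size["Africa", "TRUE"]
print(large_in_africa)
```

Note that the column names are the strings "FALSE" and "TRUE", so they need quotes when used for indexing.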
- The limitation of the table command is that it only gives you counts. To get a mean, sum, or standard deviation we need another function called tapply, which works much like a pivot table in Excel.
- Let's say we want the mean life expectancy of each region.
tapply(WHO$LifeExpectancy, WHO$Region, mean)

               Africa              Americas Eastern Mediterranean
             57.95652              74.34286              69.59091
               Europe       South-East Asia       Western Pacific
             76.73585              69.36364              72.33333
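One way to picture what tapply does is "split the values by group, then apply the function to each piece". The sketch below shows this on made-up data (the toy data frame and its values are assumptions) and compares the result with the same computation written out using split and sapply.

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  Region         = c("Africa", "Africa", "Europe", "Europe"),
  LifeExpectancy = c(58, 62, 78, 82)
)

# tapply(values, groups, fun): apply fun to the values within each group.
mean_by_region <- tapply(toy$LifeExpectancy, toy$Region, mean)
print(mean_by_region)

# The same computation spelled out: split into groups, then take each mean.
pieces  <- split(toy$LifeExpectancy, toy$Region)
by_hand <- sapply(pieces, mean)
print(by_hand)
```

Swapping mean for sum, sd, min, or max gives the other per-group summaries in the same way.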
- Now let's say I want the standard deviation of life expectancy for countries whose population is above or below 30000, ignoring any missing values. Extra arguments given to tapply are passed on to the function, so na.rm = TRUE reaches sd.
tapply(WHO$LifeExpectancy, WHO$Population > 30000, sd, na.rm = TRUE)

   FALSE     TRUE
9.372043 8.823353
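To see what ignoring missing values changes, here is a small sketch on made-up data with one deliberately missing life expectancy. The toy data frame and its values are assumptions; only tapply, sd, and the na.rm argument are standard R.

```r
# Hypothetical toy data with one missing LifeExpectancy value.
toy <- data.frame(
  Population     = c(12000, 9000, 20000, 45000, 60000, 80000),
  LifeExpectancy = c(58, 62, NA, 70, 74, 78)
)

is_large <- toy$Population > 30000

# Without na.rm, a group that contains an NA produces NA.
sd_default <- tapply(toy$LifeExpectancy, is_large, sd)
print(sd_default)

# Arguments after the function are passed on to it, so na.rm = TRUE
# is handed to sd() and the missing value is dropped before computing.
sd_clean <- tapply(toy$LifeExpectancy, is_large, sd, na.rm = TRUE)
print(sd_clean)
```

If the column has no missing values, both calls return the same numbers, so adding na.rm = TRUE is harmless.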
With this short introduction to peeking into the data, we are ready to go.
See you next time.