R : Basic Data Analysis – Part 1

This first R tutorial introduces you to data science and analytics using R. Each topic comes with how-to steps that you can follow and run yourself.

File for this discussion: WHO. Please open the xls file and save it as a CSV file (WHO.csv).


How to open a CSV file in R?

    • Set the working directory to the directory containing the CSV file, say /directory/filename.csv
      setwd("/directory")
    • A data frame is an in-memory representation of the data in which each column represents a variable and each row represents a single observation in the data file.
    • We will open the CSV file and store it in a data frame.
      DataFrame = read.csv("filename.csv")
    • In this post we will work with a sample data set from the World Health Organization (WHO.csv)
      WHO = read.csv("WHO.csv")
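
The loading step can be tried without the real file. The sketch below is only an illustration: it writes a tiny stand-in CSV to a temporary file (three rows taken from the str output shown later in this post; who_sample and csv_path are made-up names) and reads it back. One thing to be aware of: since R 4.0.0, read.csv no longer converts text columns to factors by default, so stringsAsFactors = TRUE is passed here to match the Factor columns shown in this post.

```r
# Stand-in for WHO.csv: three rows copied from the str() output in this post.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Region,Population,LifeExpectancy",
  "Afghanistan,Eastern Mediterranean,29825,60",
  "Albania,Europe,3162,74",
  "Algeria,Africa,38482,73"
), csv_path)

# stringsAsFactors = TRUE reproduces the Factor columns shown by str();
# since R 4.0.0 the default is FALSE and text columns load as character.
who_sample <- read.csv(csv_path, stringsAsFactors = TRUE)

# Quick checks that the file loaded as expected.
dim(who_sample)       # number of rows and columns
head(who_sample, 2)   # first two rows
```

With the real file, the same call would be read.csv("WHO.csv", stringsAsFactors = TRUE).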

How to do basic analysis of data

    • Once WHO.csv is loaded into the WHO data frame, we should inspect it to learn the basics about the data, including:
      • Which variables it contains
      • Their types
      • Min, max and mean of the numerical variables
      • Whether there are any missing values
    • This can be done using 2 basic commands, str and summary
      > str(WHO)
      'data.frame':	194 obs. of  13 variables:
       $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
       $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
       $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
       $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
       $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
       $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
       $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
       $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
       $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
       $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
       $ GNI                          : num  1140 8820 8310 NA 5230 ...
       $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
       $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
      
    • The str command gives us the data type of each variable as well as some sample values. For a factor variable it also gives the number of levels and some sample level values.
    • >summary(WHO)
                    Country                      Region     Population     
       Afghanistan        :  1   Africa               :46   Min.   :      1  
       Albania            :  1   Americas             :35   1st Qu.:   1696  
       Algeria            :  1   Eastern Mediterranean:22   Median :   7790  
       Andorra            :  1   Europe               :53   Mean   :  36360  
       Angola             :  1   South-East Asia      :11   3rd Qu.:  24535  
       Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000  
       (Other)            :188                                               
          Under15          Over60      FertilityRate   LifeExpectancy 
       Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00  
       1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00  
       Median :28.65   Median : 8.53   Median :2.400   Median :72.50  
       Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01  
       3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00  
       Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00  
                                       NA's   :11                     
       ChildMortality    CellularSubscribers  LiteracyRate        GNI       
       Min.   :  2.200   Min.   :  2.57      Min.   :31.10   Min.   :  340  
       1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60   1st Qu.: 2335  
       Median : 18.600   Median : 97.75      Median :91.80   Median : 7870  
       Mean   : 36.149   Mean   : 93.64      Mean   :83.71   Mean   :13321  
       3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85   3rd Qu.:17558  
       Max.   :181.600   Max.   :196.41      Max.   :99.80   Max.   :86440  
                         NA's   :10          NA's   :91      NA's   :32     
       PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
       Min.   : 37.20              Min.   : 32.50               
       1st Qu.: 87.70              1st Qu.: 87.30               
       Median : 94.70              Median : 95.10               
       Mean   : 90.85              Mean   : 89.63               
       3rd Qu.: 98.10              3rd Qu.: 97.90               
       Max.   :100.00              Max.   :100.00               
       NA's   :93                  NA's   :93
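    • The summary command is very useful for digging a little deeper into each variable. For numerical variables it gives the min, max and mean as well as the 1st quartile, median (2nd quartile) and 3rd quartile. Comparing the mean with the median gives a rough indication of whether a variable is symmetric or skewed: for Population, the mean (36360) is far above the median (7790), which points to a strongly right-skewed variable rather than a normally distributed one. The mean is a good representative of central tendency mainly when the distribution is roughly symmetric. The summary also tells us how many missing values (NA's) each variable has.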
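
The missing-value count and the mean versus median comparison can also be computed directly. This is a minimal sketch on the first ten FertilityRate values from the str output above (the vector name fertility is made up for illustration), not a full analysis of the data set.

```r
# First ten FertilityRate values, copied from the str() output above.
fertility <- c(5.4, 1.75, 2.83, NA, 6.1, 2.12, 2.2, 1.74, 1.89, 1.44)

# Count the missing values; summary() reports this count as NA's.
n_missing <- sum(is.na(fertility))

# mean() and median() return NA unless missing values are dropped explicitly.
fert_mean   <- mean(fertility, na.rm = TRUE)
fert_median <- median(fertility, na.rm = TRUE)

# A mean well above the median hints at a right-skewed variable.
fert_mean > fert_median
```

On the full data frame, colSums(is.na(WHO)) should give the number of missing values per column in one call.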
    • Another very useful command is subset. We often want to split a data set on one or more criteria, and subset does exactly that. Let us say we want the rows of our WHO data frame for countries whose Population value is greater than 30,000 (judging by the values, the column appears to be recorded in thousands, so this is roughly 30 million people).
      WHO_higher_population = subset(WHO, Population > 30000)
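
As a sketch of the "one or more criteria" case, the example below combines two conditions and keeps only two columns. The three-row who_sample data frame is a stand-in built from values shown in this post, and the life expectancy threshold of 70 is an arbitrary choice for illustration.

```r
# Small stand-in data frame with values copied from the str() output above.
who_sample <- data.frame(
  Country        = c("Afghanistan", "Albania", "Algeria"),
  Region         = c("Eastern Mediterranean", "Europe", "Africa"),
  Population     = c(29825, 3162, 38482),
  LifeExpectancy = c(60, 74, 73)
)

# Combine criteria with & and keep only some columns with select.
large_long_lived <- subset(who_sample,
                           Population > 30000 & LifeExpectancy > 70,
                           select = c(Country, Population))

# Equivalent bracket indexing. With missing values the two differ:
# subset() drops rows where the condition is NA, brackets keep them as NA rows.
same_rows <- who_sample[who_sample$Population > 30000 &
                          who_sample$LifeExpectancy > 70,
                        c("Country", "Population")]

large_long_lived
```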
    • Let's say we want to inspect the Population variable further and find its standard deviation and the row number of the record with the maximum value
      sd(WHO$Population)
      which.max(WHO$Population)
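
The row number returned by which.max is usually a means to an end. The sketch below, again on a made-up three-row stand-in (who_sample), uses it to look up the country name, and also shows that sd returns NA for a variable with missing values unless na.rm = TRUE is passed.

```r
# Small stand-in data frame with values copied from the str() output above.
who_sample <- data.frame(
  Country    = c("Afghanistan", "Albania", "Algeria"),
  Population = c(29825, 3162, 38482)
)

# which.max() returns the row index of the largest value ...
max_row <- which.max(who_sample$Population)

# ... which can then be used to look up the matching country or the whole row.
max_country <- who_sample$Country[max_row]
who_sample[max_row, ]

# Population has no missing values, but FertilityRate does:
fertility <- c(5.4, 1.75, 2.83, NA)
sd_with_na <- sd(fertility)                 # NA because of the missing value
sd_no_na   <- sd(fertility, na.rm = TRUE)   # computed on the non-missing values
```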
    • We can do basic plotting to inspect the nature of the data and the relationships between variables
    • Histogram
      hist(WHO$LifeExpectancy)
      hist(WHO$Population)
    • Scatter plot
      plot(WHO$FertilityRate, WHO$LifeExpectancy)
    • Box plot
      boxplot(WHO$FertilityRate)
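
The default plots are easier to read with titles and axis labels. The following is a minimal sketch using the first ten FertilityRate and LifeExpectancy values from the str output above; the vector names and label texts are illustrative choices. The pdf(NULL) line only discards the graphics so the snippet runs without a display, and can be dropped in an interactive session.

```r
# Discard graphics output so the sketch runs without a display (optional).
pdf(NULL)

# First ten values of each variable, copied from the str() output above.
fertility <- c(5.4, 1.75, 2.83, NA, 6.1, 2.12, 2.2, 1.74, 1.89, 1.44)
life_exp  <- c(60, 74, 73, 82, 51, 75, 76, 71, 82, 81)

# Histogram with a title and an x-axis label.
h <- hist(life_exp, main = "Life expectancy", xlab = "Years")

# Scatter plot; pairs with a missing value are simply not drawn.
plot(fertility, life_exp,
     xlab = "Fertility rate", ylab = "Life expectancy",
     main = "Life expectancy vs. fertility rate")

# Box plot; the returned object records how many non-missing values were used.
b <- boxplot(fertility, ylab = "Fertility rate")

invisible(dev.off())
```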

We will continue with the rest of our analysis and dive deeper into understanding our data in the next post of this series.
> Continue to Part 2 of Basic Data Analysis using R
