R: Basic Data Analysis – Part 1

This first R tutorial introduces you to the world of data science and analytics using R. It gives you how-to steps that you can follow and work through yourself.

File for this discussion: WHO. Please open the xls file and save it as a CSV file (WHO.csv).

How to open a CSV file in R?

Set the working directory to the directory containing the CSV file, say /directory/filename.csv:
```
setwd("/directory")  # the folder that contains filename.csv
```
A data frame is an in-memory representation of the data in which each column represents a variable and each row represents a single observation from the data file.
We will open the CSV file and store it in a data frame.
```
DataFrame = read.csv("filename.csv")
```
In this post we will work with a sample data set from the World Health Organization (WHO.csv).
```
WHO = read.csv("WHO.csv")
```
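If you want to try the loading step before downloading the WHO file, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back. The data frame `toy` and its values are invented for illustration and are not taken from WHO.csv.

```r
# Hypothetical toy data, not taken from WHO.csv
toy <- data.frame(
  Country = c("A", "B", "C"),
  Population = c(100L, 250L, 75L)
)

# Write the toy data to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_read <- read.csv(csv_path)

nrow(toy_read)   # number of observations (rows)
names(toy_read)  # variable (column) names
```

Passing a full path to read.csv, as done here, also works as an alternative to calling setwd first.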

How to do basic analysis of data

Once WHO.csv is loaded into the WHO data frame, we should inspect it to learn the basics of the data, including:
- Which variables it contains
- Their types
- Min, max, and mean of the numerical variables
- Whether there are any missing values

This can be done using two basic commands, str and summary.

```
> str(WHO)
'data.frame':	194 obs. of  13 variables:
 $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
 $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
 $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
 $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
 $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
 $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
 $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
 $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
 $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
 $ GNI                          : num  1140 8820 8310 NA 5230 ...
 $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
 $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
```

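The str command shows the data type of each variable along with a few sample values. For a factor variable it also reports the number of levels and some sample level values. (Since R 4.0.0, read.csv reads text columns as character by default, so Country and Region appear as chr unless you pass stringsAsFactors = TRUE.)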
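To pull out the same type and level information programmatically, something like the following sketch may help. It uses a small hypothetical data frame `toy` rather than the WHO data, with Region built explicitly as a factor.

```r
# Hypothetical toy data frame with one factor and one numeric column
toy <- data.frame(
  Region = factor(c("Africa", "Europe", "Africa")),
  LifeExpectancy = c(60, 82, 51)
)

col_types <- sapply(toy, class)       # class of each column
region_levels <- levels(toy$Region)   # distinct levels of the factor
n_levels <- nlevels(toy$Region)       # how many levels there are

print(col_types)
print(region_levels)
```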

```
> summary(WHO)
              Country                      Region     Population     
 Afghanistan        :  1   Africa               :46   Min.   :      1  
 Albania            :  1   Americas             :35   1st Qu.:   1696  
 Algeria            :  1   Eastern Mediterranean:22   Median :   7790  
 Andorra            :  1   Europe               :53   Mean   :  36360  
 Angola             :  1   South-East Asia      :11   3rd Qu.:  24535  
 Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000  
 (Other)            :188                                               
    Under15          Over60      FertilityRate   LifeExpectancy 
 Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00  
 1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00  
 Median :28.65   Median : 8.53   Median :2.400   Median :72.50  
 Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01  
 3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00  
 Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00  
                                 NA's   :11                     
 ChildMortality    CellularSubscribers  LiteracyRate        GNI       
 Min.   :  2.200   Min.   :  2.57      Min.   :31.10   Min.   :  340  
 1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60   1st Qu.: 2335  
 Median : 18.600   Median : 97.75      Median :91.80   Median : 7870  
 Mean   : 36.149   Mean   : 93.64      Mean   :83.71   Mean   :13321  
 3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85   3rd Qu.:17558  
 Max.   :181.600   Max.   :196.41      Max.   :99.80   Max.   :86440  
                   NA's   :10          NA's   :91      NA's   :32     
 PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
 Min.   : 37.20              Min.   : 32.50               
 1st Qu.: 87.70              1st Qu.: 87.30               
 Median : 94.70              Median : 95.10               
 Mean   : 90.85              Mean   : 89.63               
 3rd Qu.: 98.10              3rd Qu.: 97.90               
 Max.   :100.00              Max.   :100.00               
 NA's   :93                  NA's   :93
```

The summary command is very useful for diving a little deeper into variable inspection. It gives us the min, max, and mean as well as the first quartile, the median (second quartile), and the third quartile. Comparing these can indicate whether a variable is roughly symmetric or skewed: a mean far from the median, as with Population here, suggests the variable is not normally distributed. The mean is usually a good representative of central tendency only when the variable is approximately symmetric, as with a normal distribution. The summary also shows how many missing values (NA's) each variable has.
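The missing-value counts and the mean-versus-median comparison can also be computed directly. The following is a minimal sketch on a hypothetical data frame `toy` whose values are made up for illustration; the column names merely mirror two of the WHO variables.

```r
# Hypothetical toy data with some missing values
toy <- data.frame(
  FertilityRate = c(5.4, 1.75, NA, 2.83),
  LiteracyRate = c(NA, NA, 70.1, 99)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# mean() and median() return NA if any value is missing,
# unless na.rm = TRUE is passed to drop the missing values first
mean_fert <- mean(toy$FertilityRate, na.rm = TRUE)
median_fert <- median(toy$FertilityRate, na.rm = TRUE)
```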
Another very useful command is subset. Often we want to split our data set on one or more criteria, and subset does exactly that. Say we want the rows of our WHO data frame for countries whose population is greater than 30,000.

```
WHO_higher_population = subset(WHO, Population > 30000)
```
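Since subset accepts more than one criterion, a sketch combining two conditions and keeping only some columns might look like this. The data frame `toy` and its values are hypothetical, chosen only to show the mechanics.

```r
# Hypothetical toy data, not taken from WHO.csv
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Region = c("Europe", "Africa", "Europe", "Africa"),
  Population = c(45000, 31000, 900, 12000)
)

# Keep rows matching both conditions, and only two of the columns
big_europe <- subset(
  toy,
  Population > 30000 & Region == "Europe",
  select = c(Country, Population)
)
print(big_europe)
```

Rows for which the condition evaluates to NA are dropped by subset, which is worth remembering for columns with missing values.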

Let's say we want to inspect the Population variable further and find its standard deviation and which record number holds the maximum value.

```
sd(WHO$Population)
which.max(WHO$Population)
```
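The index returned by which.max is most useful when it is used to look up the matching row. A small sketch on a hypothetical data frame `toy` (values made up for illustration):

```r
# Hypothetical toy data, not taken from WHO.csv
toy <- data.frame(
  Country = c("A", "B", "C"),
  Population = c(100, 400, 250)
)

# Standard deviation; na.rm = TRUE matters for columns that contain NA
pop_sd <- sd(toy$Population, na.rm = TRUE)

# Row index of the largest value, then the country in that row
idx <- which.max(toy$Population)
largest <- toy$Country[idx]

print(pop_sd)
print(largest)
```

On the real data, the equivalent lookup would be WHO$Country[which.max(WHO$Population)].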

We can do basic plotting to inspect the nature of the data and the relationships between variables.
Histogram

```
hist(WHO$LifeExpectancy)
hist(WHO$Population)
```

Scatter plot

```
plot(WHO$FertilityRate, WHO$LifeExpectancy)
```

Box plot

```
boxplot(WHO$FertilityRate)
```
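The default plots are easier to read with titles and axis labels. Here is a sketch of the three plot types with labels added, using only the first seven FertilityRate and LifeExpectancy values visible in the str output above, so the picture will differ from the full data set. The label text is my own wording.

```r
# First seven values of each column, copied from the str() output
toy <- data.frame(
  FertilityRate = c(5.4, 1.75, 2.83, NA, 6.1, 2.12, 2.2),
  LifeExpectancy = c(60, 74, 73, 82, 51, 75, 76)
)

# Histogram: distribution of a single numeric variable
h <- hist(toy$LifeExpectancy,
          main = "Life expectancy", xlab = "Years")

# Scatter plot: relationship between two variables
# (points with a missing value are left out)
plot(toy$FertilityRate, toy$LifeExpectancy,
     main = "Life expectancy vs. fertility rate",
     xlab = "Fertility rate", ylab = "Life expectancy (years)")

# Box plot: median, quartiles, and potential outliers
b <- boxplot(toy$FertilityRate,
             main = "Fertility rate", ylab = "Fertility rate")
```

Both hist and boxplot also return the numbers behind the plot (bin counts and box statistics), which is why the results are stored in `h` and `b` here.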

We will continue with the rest of our analysis and dive deeper into understanding our data in the next post of this series.
> Continue to Part 2 of Basic Data Analysis using R

Published by

Shantanu Deo
