# R: Basic Data Analysis – Part 1

This first R tutorial introduces you to data science and analytics using R. Each step comes with commands that you can follow along with and run yourself.

File for this discussion: WHO. Please open the xls file and save it as CSV.

How to open a CSV file in R

• Set the working directory to the directory containing the CSV file, for example /directory for the file /directory/filename.csv
``setwd("/directory")``
• A data frame is an in-memory representation of the data in which each column represents a variable and each row represents a single observation from the data file.
• We will open the CSV file and store it in a data frame.
``DataFrame = read.csv("filename.csv")``
• In this post we will work with a sample data set from the World Health Organization (WHO.csv).
``WHO = read.csv("WHO.csv")``
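
As a minimal sketch of the steps above, the snippet below writes a few rows to a temporary CSV file and reads them back with `read.csv`. The file name `WHO_sample.csv`, the temporary directory and the variable names are assumptions made for illustration; the six rows are the first values visible in the `str(WHO)` output further down this post.

```r
# Six rows taken from the str(WHO) output in this post (illustrative sample only)
sample_rows <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  Population = c(29825, 3162, 38482, 78, 20821, 89),
  FertilityRate = c(5.40, 1.75, 2.83, NA, 6.10, 2.12)
)

# Hypothetical location: a temporary directory stands in for your data folder
csv_dir <- tempdir()
write.csv(sample_rows, file.path(csv_dir, "WHO_sample.csv"), row.names = FALSE)

# setwd() returns the previous working directory, so it can be restored later
old_dir <- setwd(csv_dir)
WHO_sample <- read.csv("WHO_sample.csv")
setwd(old_dir)

dim(WHO_sample)      # number of rows and columns
head(WHO_sample, 3)  # first three observations
```

With the real file, only the `setwd()` and `read.csv()` calls are needed.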

How to do basic analysis of data

• Once WHO.csv is loaded into the WHO data frame, we should inspect it to learn the basics about the data, including:
• Which variables it contains
• Their types
• Min, max and mean of the numerical variables
• Whether there are any missing values
• This can be done using two basic commands, str and summary
```
> str(WHO)
'data.frame':	194 obs. of  13 variables:
 $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
 $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
 $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
 $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
 $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
 $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
 $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
 $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
 $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
 $ GNI                          : num  1140 8820 8310 NA 5230 ...
 $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
 $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
```
• The str command shows the data type of each variable along with a few sample values. For a factor variable it also shows the number of levels and some sample level values.
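
One point worth knowing when reproducing the output above: since R 4.0.0, `read.csv` and `data.frame` no longer convert text columns to factors by default, so `Country` and `Region` would show up as `chr` unless you pass `stringsAsFactors = TRUE`. The sketch below illustrates this on a small hand-built data frame; the name `who_small` is an assumption, and the rows are the first six shown in the `str(WHO)` output.

```r
# Small illustrative data frame; stringsAsFactors = TRUE mimics the Factor columns
who_small <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  Region = c("Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa",
             "Americas"),
  LifeExpectancy = c(60L, 74L, 73L, 82L, 51L, 75L),
  stringsAsFactors = TRUE
)

str(who_small)                       # types plus sample values, as in the post

col_types <- sapply(who_small, class)  # one class name per column
col_types

nlevels(who_small$Region)            # number of distinct regions in this sample
levels(who_small$Region)             # the level labels, sorted alphabetically
```

For the full file, the equivalent call would be `read.csv("WHO.csv", stringsAsFactors = TRUE)`.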
```
> summary(WHO)
Country                      Region     Population
Afghanistan        :  1   Africa               :46   Min.   :      1
Albania            :  1   Americas             :35   1st Qu.:   1696
Algeria            :  1   Eastern Mediterranean:22   Median :   7790
Andorra            :  1   Europe               :53   Mean   :  36360
Angola             :  1   South-East Asia      :11   3rd Qu.:  24535
Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000
(Other)            :188
Under15          Over60      FertilityRate   LifeExpectancy
Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00
1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00
Median :28.65   Median : 8.53   Median :2.400   Median :72.50
Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01
3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00
Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00
                                NA's   :11
ChildMortality    CellularSubscribers  LiteracyRate        GNI
Min.   :  2.200   Min.   :  2.57      Min.   :31.10   Min.   :  340
1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60   1st Qu.: 2335
Median : 18.600   Median : 97.75      Median :91.80   Median : 7870
Mean   : 36.149   Mean   : 93.64      Mean   :83.71   Mean   :13321
3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85   3rd Qu.:17558
Max.   :181.600   Max.   :196.41      Max.   :99.80   Max.   :86440
                  NA's   :10          NA's   :91      NA's   :32
PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
Min.   : 37.20              Min.   : 32.50
1st Qu.: 87.70              1st Qu.: 87.30
Median : 94.70              Median : 95.10
Mean   : 90.85              Mean   : 89.63
3rd Qu.: 98.10              3rd Qu.: 97.90
Max.   :100.00              Max.   :100.00
NA's   :93                  NA's   :93
```
• The summary command is very useful for digging a little deeper into each variable. It gives us the min, max and mean as well as the 1st quartile, the median (2nd quartile) and the 3rd quartile. Comparing the mean with the median gives a first hint about the shape of the distribution: a mean far from the median suggests skew, in which case the variable is unlikely to be normally distributed. The mean is a good representative of central tendency mainly when the distribution is roughly symmetric, as a normal distribution is. The summary also shows how many missing values (NA's) each variable has.
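
The two checks described above, counting missing values and comparing mean with median, can also be done directly. In the summary output, for example, Population has a mean of 36360 against a median of 7790, a gap that points to a right-skewed variable. The sketch below repeats the idea on the first six rows from the `str(WHO)` output; `who_small`, `na_counts`, `pop_mean` and `pop_median` are illustrative names.

```r
# First six rows of three columns, as shown in the str(WHO) output
who_small <- data.frame(
  Population = c(29825, 3162, 38482, 78, 20821, 89),
  FertilityRate = c(5.40, 1.75, 2.83, NA, 6.10, 2.12),
  LiteracyRate = c(NA, NA, NA, NA, 70.1, 99)
)

# Missing values per column: is.na() marks each NA, colSums() counts them
na_counts <- colSums(is.na(who_small))
na_counts

# Most statistics return NA if any value is missing, unless na.rm = TRUE
mean(who_small$FertilityRate)
mean(who_small$FertilityRate, na.rm = TRUE)

# A mean well above the median hints at a right-skewed variable
pop_mean <- mean(who_small$Population)
pop_median <- median(who_small$Population)
pop_mean - pop_median
```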
• Another very useful command is subset. We often want to split our data set on one or more criteria, and subset does exactly that for us. Let us say that in our WHO data frame we want the data for countries whose population is greater than 30,000.
```
WHO_higher_population = subset(WHO, Population > 30000)
```
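
A short sketch of a few `subset` variations may help here: combining criteria with `&`, keeping only some columns with `select`, and how `subset` treats missing values compared with bracket indexing. The data frame `who_small` and the result names are assumptions for illustration, built from the first six rows in the `str(WHO)` output.

```r
who_small <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  Region = c("Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa",
             "Americas"),
  Population = c(29825, 3162, 38482, 78, 20821, 89),
  FertilityRate = c(5.40, 1.75, 2.83, NA, 6.10, 2.12)
)

# Single criterion, as in the post
big <- subset(who_small, Population > 30000)

# Two criteria combined with &, keeping only two columns
africa_big <- subset(who_small, Region == "Africa" & Population > 20000,
                     select = c(Country, Population))

# subset() drops rows where the condition is NA ...
high_fert <- subset(who_small, FertilityRate > 5)

# ... while plain bracket indexing keeps them as all-NA rows
high_fert_bracket <- who_small[who_small$FertilityRate > 5, ]

nrow(high_fert)          # rows with a known fertility rate above 5
nrow(high_fert_bracket)  # one extra row caused by the NA in FertilityRate
```

The NA behaviour is one reason `subset` is convenient for interactive exploration of columns with missing values.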
• Let's say we want to inspect the Population variable further and find its standard deviation and the record number (row index) that holds the maximum value
```
sd(WHO$Population)
which.max(WHO$Population)
```
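
Since `which.max` returns a row index rather than a value, a natural next step is to use that index to look up the matching record. The sketch below shows this on the first six rows from the `str(WHO)` output; `who_small`, `idx_max` and `idx_min` are illustrative names.

```r
who_small <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  Population = c(29825, 3162, 38482, 78, 20821, 89),
  FertilityRate = c(5.40, 1.75, 2.83, NA, 6.10, 2.12)
)

# Spread of the population values
sd(who_small$Population)

# Index of the largest and smallest population
idx_max <- which.max(who_small$Population)
idx_min <- which.min(who_small$Population)

# Use the index to pull out the country name or the whole row
who_small$Country[idx_max]
who_small[idx_min, ]

# sd() returns NA when values are missing unless na.rm = TRUE is given
sd(who_small$FertilityRate)
sd(who_small$FertilityRate, na.rm = TRUE)
```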
• We can do basic plotting to inspect the nature of data and relationship between variables
• Histogram
```
hist(WHO$LifeExpectancy)
hist(WHO$Population)
```
• Scatter plot
`plot(WHO$FertilityRate, WHO$LifeExpectancy)`
• Box plot
`boxplot(WHO$FertilityRate)`
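
The same three plots can be given titles and axis labels, which makes them easier to read later. This is a minimal sketch on the first six rows from the `str(WHO)` output; the data frame name `who_small` and the label texts are assumptions. Rows with a missing fertility rate are left out of the scatter plot and the box plot.

```r
who_small <- data.frame(
  FertilityRate = c(5.40, 1.75, 2.83, NA, 6.10, 2.12),
  LifeExpectancy = c(60, 74, 73, 82, 51, 75)
)

# Histogram: hist() also returns the bin counts it drew
h <- hist(who_small$LifeExpectancy,
          main = "Life expectancy",
          xlab = "Years")

# Scatter plot of two numeric variables
plot(who_small$FertilityRate, who_small$LifeExpectancy,
     main = "Life expectancy vs. fertility rate",
     xlab = "Fertility rate",
     ylab = "Life expectancy (years)")

# Box plot: boxplot() returns the statistics behind the box
b <- boxplot(who_small$FertilityRate,
             main = "Fertility rate",
             ylab = "Fertility rate")

sum(h$counts)  # number of values that went into the histogram
b$n            # number of non-missing values in the box plot
```

When run non-interactively with `Rscript`, the plots are written to a file named `Rplots.pdf` in the working directory instead of a window.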

We will continue with the rest of our analysis and dive deeper into understanding our data in the next post of this series.
> Continue to Part 2 of Basic Data Analysis using R