This first R tutorial introduces you to the fascinating world of data science and analytics using a powerful tool called R. Each step comes with commands that you can follow along and run yourself.
File for this discussion: WHO. Please open the xls file and save it as a CSV file (WHO.csv).
How to open a CSV file in R
- Set the working directory to the directory containing the CSV file, say /directory/filename.csv.
setwd("/directory")
- A data frame is an in-memory representation of the data, where each column represents a variable and each row represents a single observation in the data file.
- We will open the CSV file and store it in a data frame.
DataFrame = read.csv("filename.csv")
- In this post we will work with a sample data set from the World Health Organization (WHO.csv).
WHO = read.csv("WHO.csv")
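If WHO.csv is not at hand yet, the loading step can be tried on a tiny stand-in file. The sketch below writes three rows (taken from the first values shown in the str(WHO) output of this post) to a temporary file and reads them back. The names csv_path and who_sample are made up for illustration. Note that since R 4.0.0 read.csv keeps text columns as character by default, so stringsAsFactors = TRUE is passed here to get factor columns like the ones shown in this post.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere.
# csv_path and who_sample are illustrative names, not part of the WHO data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Region,Population,LifeExpectancy",
  "Afghanistan,Eastern Mediterranean,29825,60",
  "Albania,Europe,3162,74",
  "Algeria,Africa,38482,73"
), csv_path)

# stringsAsFactors = TRUE turns text columns into factors.
# From R 4.0.0 onward the default is FALSE (plain character columns).
who_sample <- read.csv(csv_path, stringsAsFactors = TRUE)

print(dim(who_sample))            # number of rows and columns
print(class(who_sample$Country))  # class of a text column
```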
How to do basic analysis of data
- Once WHO.csv is loaded into the WHO data frame, we should inspect it to get a basic picture of the data, which includes:
- Which variables it contains
- Their types
- Min, Max and Mean of the numerical variables
- Whether there are any missing values
- This can be done using 2 basic commands, str and summary.
>str(WHO)
'data.frame':	194 obs. of  13 variables:
 $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
 $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
 $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
 $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
 $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
 $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
 $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
 $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
 $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
 $ GNI                          : num  1140 8820 8310 NA 5230 ...
 $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
 $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
- The str command shows the data type of each variable along with a few sample values. For a factor variable it also shows the number of levels and the first few level names.
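The same type and level information can also be pulled out programmatically, which helps when a data frame has many columns. This is a minimal sketch on a five-row stand-in built from the first values in the str(WHO) output above; who_sample and col_types are illustrative names.

```r
# Five-row stand-in for the WHO data frame (values from the str output).
who_sample <- data.frame(
  Country = factor(c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola")),
  Region = factor(c("Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa")),
  Population = c(29825L, 3162L, 38482L, 78L, 20821L),
  FertilityRate = c(5.4, 1.75, 2.83, NA, 6.1)
)

# One class name per column, as a named character vector.
col_types <- sapply(who_sample, class)
print(col_types)

# For a factor: how many distinct levels, and what they are.
print(nlevels(who_sample$Region))
print(levels(who_sample$Region))
```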
>summary(WHO)
                Country                      Region     Population
 Afghanistan        :  1   Africa               :46   Min.   :      1
 Albania            :  1   Americas             :35   1st Qu.:   1696
 Algeria            :  1   Eastern Mediterranean:22   Median :   7790
 Andorra            :  1   Europe               :53   Mean   :  36360
 Angola             :  1   South-East Asia      :11   3rd Qu.:  24535
 Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000
 (Other)            :188
    Under15          Over60      FertilityRate   LifeExpectancy
 Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00
 1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00
 Median :28.65   Median : 8.53   Median :2.400   Median :72.50
 Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01
 3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00
 Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00
                                 NA's   :11
 ChildMortality    CellularSubscribers  LiteracyRate        GNI
 Min.   :  2.200   Min.   :  2.57      Min.   :31.10   Min.   :  340
 1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60   1st Qu.: 2335
 Median : 18.600   Median : 97.75      Median :91.80   Median : 7870
 Mean   : 36.149   Mean   : 93.64      Mean   :83.71   Mean   :13321
 3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85   3rd Qu.:17558
 Max.   :181.600   Max.   :196.41      Max.   :99.80   Max.   :86440
                   NA's   :10          NA's   :91      NA's   :32
 PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
 Min.   : 37.20              Min.   : 32.50
 1st Qu.: 87.70              1st Qu.: 87.30
 Median : 94.70              Median : 95.10
 Mean   : 90.85              Mean   : 89.63
 3rd Qu.: 98.10              3rd Qu.: 97.90
 Max.   :100.00              Max.   :100.00
 NA's   :93                  NA's   :93
- The summary command is very useful for digging a little deeper into each variable. It gives us the Min, Max and Mean as well as the 1st, 2nd (Median) and 3rd quartiles. Comparing the mean with the median gives a rough indication of whether a variable is symmetric, as a normal distribution is, or skewed: for Population the mean (36360) is far above the median (7790), which suggests a strong right skew. The mean is a good representative of central tendency mainly when the distribution is roughly symmetric. summary also tells us how many missing values (NA's) each variable has.
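The missing-value counts and the mean-versus-median comparison can also be computed directly. The sketch below uses a five-row stand-in built from the first values in the str(WHO) output; the variable names are illustrative. It also shows that mean() returns NA when a value is missing unless na.rm = TRUE is passed.

```r
# Five-row stand-in with one missing FertilityRate value.
who_sample <- data.frame(
  Population = c(29825L, 3162L, 38482L, 78L, 20821L),
  FertilityRate = c(5.4, 1.75, 2.83, NA, 6.1)
)

# Count missing values per column.
na_counts <- colSums(is.na(who_sample))
print(na_counts)

# mean() propagates NA unless told to drop missing values.
mean_with_na <- mean(who_sample$FertilityRate)
mean_no_na <- mean(who_sample$FertilityRate, na.rm = TRUE)
print(mean_with_na)
print(mean_no_na)

# A large gap between mean and median hints at a skewed variable.
mean_pop <- mean(who_sample$Population)
median_pop <- median(who_sample$Population)
print(c(mean = mean_pop, median = median_pop))
```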
- Another very useful command is subset. We often want to split a dataset on one or more criteria, and subset does exactly that. Let us say we want the rows of our WHO data frame for countries whose population is greater than 30,000.
WHO_higher_population = subset(WHO,Population >30000)
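subset also accepts several criteria at once and can keep only selected columns. The following is a small sketch on a five-row stand-in built from the first values in the str(WHO) output; the thresholds and variable names are chosen only for illustration. The bracket-indexing form is shown as an equivalent alternative.

```r
# Five-row stand-in for the WHO data frame.
who_sample <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"),
  Population = c(29825L, 3162L, 38482L, 78L, 20821L),
  LifeExpectancy = c(60L, 74L, 73L, 82L, 51L),
  stringsAsFactors = FALSE
)

# Combine criteria with & (and) or | (or); select= keeps the listed columns.
large_long_lived <- subset(
  who_sample,
  Population > 20000 & LifeExpectancy > 55,
  select = c(Country, Population)
)
print(large_long_lived)

# Same rows with bracket indexing; which() skips rows where the test is NA.
keep <- which(who_sample$Population > 20000 & who_sample$LifeExpectancy > 55)
same_rows <- who_sample[keep, c("Country", "Population")]
print(same_rows)
```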
- Let's say we want to inspect the variable Population further and find its standard deviation and the row number of the record with the maximum value.
sd(WHO$Population)
which.max(WHO$Population)
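The row number from which.max is most useful as an index back into the data frame, for example to find which country it belongs to. A minimal sketch on a five-row stand-in (values from the str(WHO) output, names illustrative) follows; it also shows that sd() needs na.rm = TRUE for a variable with missing values, while which.min and which.max skip NA on their own.

```r
# Five-row stand-in for the WHO data frame.
who_sample <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"),
  Population = c(29825L, 3162L, 38482L, 78L, 20821L),
  FertilityRate = c(5.4, 1.75, 2.83, NA, 6.1),
  stringsAsFactors = FALSE
)

pop_sd <- sd(who_sample$Population)
max_row <- which.max(who_sample$Population)   # row index of the largest value
max_country <- who_sample$Country[max_row]    # use the index to look up the country
print(pop_sd)
print(max_country)

# sd() returns NA if any value is missing unless na.rm = TRUE.
fert_sd <- sd(who_sample$FertilityRate, na.rm = TRUE)
# which.min() and which.max() ignore NA values.
min_fert_row <- which.min(who_sample$FertilityRate)
print(who_sample$Country[min_fert_row])
```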
- We can do basic plotting to inspect the nature of data and relationship between variables
- Histogram
hist(WHO$LifeExpectancy)
hist(WHO$Population)
- ScatterPlot
plot(WHO$FertilityRate,WHO$LifeExpectancy)
- BoxPlot
boxplot(WHO$FertilityRate)
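The three plot types above can be given titles and axis labels, and written to a file instead of the screen. This is a rough sketch on a five-row stand-in built from the first values in the str(WHO) output; plot_path and the titles are made up for illustration. It writes all three plots into one temporary PDF, one plot per page.

```r
# Five-row stand-in for the WHO data frame.
who_sample <- data.frame(
  FertilityRate = c(5.4, 1.75, 2.83, NA, 6.1),
  LifeExpectancy = c(60L, 74L, 73L, 82L, 51L)
)

# Open a PDF device; every plot until dev.off() goes into this file.
plot_path <- tempfile(fileext = ".pdf")
pdf(plot_path)

# Histogram: distribution of a single numeric variable.
h <- hist(who_sample$LifeExpectancy,
          main = "Histogram of LifeExpectancy", xlab = "LifeExpectancy")

# Scatter plot: relationship between two variables (rows with NA are dropped).
plot(who_sample$FertilityRate, who_sample$LifeExpectancy,
     main = "LifeExpectancy vs FertilityRate",
     xlab = "FertilityRate", ylab = "LifeExpectancy")

# Box plot: median, quartiles and outliers of one variable.
b <- boxplot(who_sample$FertilityRate,
             main = "Box plot of FertilityRate", ylab = "FertilityRate")

invisible(dev.off())  # close the device so the file is written
print(plot_path)
```

hist and boxplot also return the numbers behind the picture (for example h$counts and b$stats), which can be inspected without looking at the plot.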
We will continue with the rest of our analysis and dive deeper into understanding our data in the next post of this series.
> Continue to Part 2 of Basic Data Analysis using R