Quick tour of RStudio

The console, on the left side of the RStudio window, provides direct access to the R interpreter. We can use it as a simple calculator or to interact with the current R session. Type or paste the commands shown in the dark grey boxes into the console.

print("Hello World!")
## [1] "Hello World!"
x <- 2 * (1:10)
print(x)
##  [1]  2  4  6  8 10 12 14 16 18 20

The Environment tab shows the data that has been loaded into the current R session. We can see the x variable that we just created. This data is lost once we end the session, unless we choose to save the workspace.
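The contents of the Environment tab can also be inspected from the console. As a minimal sketch using only base R functions, ls lists the objects in the session and rm removes one of them, after which it also disappears from the Environment tab.

```r
x <- 2 * (1:10)

# List the names of the objects in the current session; "x" is among them
ls()

# Remove x from the session, then check whether it still exists
rm(x)
exists("x", inherits = FALSE)
```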

The Files tab is a simple file manager for your project. You can use it to organise your files.

You can get more info on the RStudio IDE by looking at the cheatsheet at “Help > Cheatsheets > RStudio IDE Cheat Sheet”.

Exercises

We shall use an R script file to record our commands for future use.

  • Open a new R script by clicking “File > New File > R Script”. This opens an untitled tab in RStudio. Click on “File > Save” and, in the dialog, name the file “code.R”.
  • In the editor type x <- sum(1:10) and on a new line type print(x). Save the file.
  • Execute the script by clicking the “Source” button so that output appears on the console.
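The “Source” button has a console counterpart, the base R function source. The sketch below is not part of the exercise. So that it runs anywhere, it writes the two lines from the exercise to a temporary file (a stand-in for “code.R”) and then sources that file.

```r
# Write the two exercise lines to a temporary script (a stand-in for code.R)
script <- tempfile(fileext = ".R")
writeLines(c("x <- sum(1:10)", "print(x)"), script)

# source() runs every line of the file in order;
# echo = TRUE also prints each command before its output
source(script, echo = TRUE)
```

With your own file saved in the working directory, the call would simply be source("code.R").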

Loading data in R

We shall use the “titanic.csv” file to illustrate how we can manipulate larger data sets in R.

To load the data, click on the “Import Dataset” button in the Environment tab in RStudio, and then click on “From Text (readr)”. This opens a dialog from which you should browse to the “titanic.csv” file and then click on the Import button.

You should see the following commands on the console.

library(readr)
titanic <- read_csv("titanic.csv")
## Parsed with column specification:
## cols(
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   `Siblings/Spouses Aboard` = col_double(),
##   `Parents/Children Aboard` = col_double(),
##   Fare = col_double()
## )
View(titanic)

This loads the data into the R session as a tabular data structure called a “tibble”. You can view the tibble in RStudio by clicking on the grid icon to the right of the dataset in the Environment tab.

You can get more information on loading data into R by looking at the Data Import Cheatsheet.
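The column specification that read_csv prints is its guess at the type of each column. If a guess is not what we want, the col_types argument lets us state the types ourselves. The following is a minimal sketch under stated assumptions: it uses a made-up two-row file with the same column names as “titanic.csv” (the passengers and the names toy_file and toy are invented for illustration) and asks for Survived and Pclass to be read as integers.

```r
library(readr)

# A made-up two-row file with the same column names as titanic.csv
toy_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare",
  "1,1,Passenger A,female,30,0,0,50.5",
  "0,3,Passenger B,male,25,1,0,8.25"
), toy_file)

# Declare two columns as integers; the remaining columns are still guessed
toy <- read_csv(toy_file,
                col_types = cols(Survived = col_integer(),
                                 Pclass = col_integer()))
toy
```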

Exercises

  • Open the “Import Dataset” dialog again and browse to the titanic data set.
  • Copy the code shown in the Code Preview pane of the dialog, paste it into your file “code.R” and save it.
  • Execute the script and observe the output on the R console.

Manipulating data

The R package “dplyr” allows us to manipulate tibbles. Often it is best to build up the manipulation of data in stages so that you can see the effect of each command.

Type the following into your R script “code.R”. This loads the “dplyr” package and summarises the data.

library(dplyr)
titanic %>% 
  summarise(Total_Survived = sum(Survived), 
            Number_Passengers = n(), 
            Survival_Rate = 100 * Total_Survived / Number_Passengers)
## # A tibble: 1 x 3
##   Total_Survived Number_Passengers Survival_Rate
##            <dbl>             <int>         <dbl>
## 1            342               887          38.6

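The pipe operator %>% takes the result of the expression on its left-hand side and passes it as the first argument to the function on its right-hand side. The summarise function aggregates the titanic tibble and returns a one-row tibble whose columns hold the aggregated values. Note that a column defined earlier in the same summarise call, such as Total_Survived, can be used in a later one. In this case, we find the survival rate for all passengers.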
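To make the role of the pipe concrete, here is a minimal sketch on a made-up four-row tibble (toy, piped and nested are names chosen for this illustration and are not part of the titanic data). It computes the same summary once with the pipe and once as an ordinary function call, then compares the two results.

```r
library(dplyr)

# Made-up data: 1 = survived, 0 = did not survive
toy <- tibble(Survived = c(1, 0, 1, 1))

# With the pipe: toy becomes the first argument of summarise
piped <- toy %>%
  summarise(Total_Survived = sum(Survived),
            Number_Passengers = n())

# Without the pipe: the same call written out in full
nested <- summarise(toy,
                    Total_Survived = sum(Survived),
                    Number_Passengers = n())

# Check whether the two results are the same
identical(piped, nested)
```

The piped form reads from top to bottom in the order the steps are carried out, which becomes more useful as further steps are added.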

If we want to find the survival rate for each passenger class we first group_by this column and then summarise. Type the following into the R console.

titanic_Pclass_rate <- titanic %>% 
  group_by(Pclass) %>% 
  summarise(Survival_Rate = 100 * sum(Survived) / n()) 
titanic_Pclass_rate
## # A tibble: 3 x 2
##   Pclass Survival_Rate
##    <dbl>         <dbl>
## 1      1          63.0
## 2      2          47.3
## 3      3          24.4

We see the survival rate per ticket class in percent.
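It can help to look at the two steps separately. The sketch below uses a made-up five-row tibble (toy, grouped and rates are illustrative names) to show that group_by on its own leaves the rows unchanged and only records the grouping, while summarise then collapses each group to a single row.

```r
library(dplyr)

# Made-up data: two passengers in class 1 and three in class 3
toy <- tibble(Pclass   = c(1, 1, 3, 3, 3),
              Survived = c(1, 0, 0, 0, 1))

# group_by keeps every row and only marks which group each row belongs to
grouped <- toy %>% group_by(Pclass)
nrow(grouped)      # number of rows, same as in toy
n_groups(grouped)  # number of distinct Pclass values

# summarise now works group by group and returns one row per class
rates <- grouped %>%
  summarise(Survival_Rate = 100 * sum(Survived) / n())
rates
```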

You can find more information on data manipulation on the cheatsheet.

Exercises

  • Refine your code to use the mean function from R to calculate survival rates rather than using sum.
  • Copy the data manipulation code chunks into your file “code.R” and ensure that the script runs correctly.

Visualising data

Often we can gain additional insight into the data by visualisation techniques. We shall use the R package “ggplot2” for this task.

The “ggplot2” package rivals commercial plotting applications such as Tableau in its plotting capabilities. The package implements a language for building plots based on the “Grammar of Graphics”. Whereas “dplyr” chains its steps together with the %>% operator, “ggplot2” uses the + operator to add components to a plot.

As a simple example, type the following into the R console.

library(ggplot2)
ggplot(data = titanic_Pclass_rate)

ggplot(data = titanic_Pclass_rate, aes(x = Pclass, y = Survival_Rate))

ggplot(data = titanic_Pclass_rate, aes(x = Pclass, y = Survival_Rate)) + geom_col() 

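The plot is built up in stages. The first command draws an empty plot, the second adds the axes and the third adds the bars. The function aes maps columns of the data to visual properties of the plot, here the x and y axes, while geom_col selects the type of graphic, which in this case is a bar chart.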
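Each stage is an ordinary R object, so the stages can also be stored in variables and extended later with +. The following is a minimal sketch: to keep it self-contained it re-creates the survival rates printed earlier in a small data frame (rates, base and bars are names chosen for this illustration).

```r
library(ggplot2)

# Survival rates per class, copied from the summary table shown earlier
rates <- data.frame(Pclass = c(1, 2, 3),
                    Survival_Rate = c(63.0, 47.3, 24.4))

# Stage 1: data and aesthetic mapping only, no graphic yet
base <- ggplot(data = rates, aes(x = Pclass, y = Survival_Rate))

# Stage 2: adding a geom with + returns a new plot object with one layer
bars <- base + geom_col()

# Count the layers in each object to see what + has added
length(base$layers)
length(bars$layers)

# Typing the name of a plot object draws it
bars
```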

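We can generate a more sophisticated plot by splitting the bars based on the Sex column. Mapping fill = Sex inside aes colours the bars by sex, and the argument position = "dodge" places the bars for each class side by side rather than stacking them on top of each other.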

titanic_Sex_Pclass_rate <- titanic %>% 
  group_by(Pclass, Sex) %>% 
  summarise(Survival_Rate = 100 * mean(Survived)) 
ggplot(data = titanic_Sex_Pclass_rate, aes(x = Pclass, y = Survival_Rate, fill = Sex)) +
  geom_col(position = "dodge")

More information can be found in the ggplot cheat sheet. You can see some more features of ggplot used in the plot below.

Exercises

  • Using the cheatsheet, add proper labels and a title to the plot as shown above.
  • Move the legend to inside the plot area.
  • Copy your code into the file “code.R” and execute the script.
  • Reorganise your “code.R” file so that the library commands appear at the top and each section is introduced by a comment, which starts with the # character.