Data manipulation and visualisation

Marine Ecosystem Dynamics

Kinlan Jan, Baptiste Serandour

Plan for today’s lecture

  • Introduction to tidyverse
  • Pipe the data using magrittr
  • Clean the data using tidyr
  • Arrange the data using dplyr
  • Plot using ggplot2

Tidyverse

  • The tidyverse is a collection of R packages that share a common design
  • It has become a de facto standard for data analysis in R
  • Code is easier to read and follow when the steps are chained with the pipe operator %>%

Pipe the data using magrittr

%>% takes the value on its left and passes it as the first argument of the function on its right

  • x %>% f() is equivalent to f(x)
  • x %>% f(y) is equivalent to f(x, y)
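A minimal sketch of this equivalence. The vector x and the rounding example are made up for illustration and are not part of the lecture data:

```r
library(magrittr)

x <- c(1, 4, 9)

# Piped: read from left to right
piped <- x %>% sqrt() %>% sum()

# Nested: read from the inside out, same result
nested <- sum(sqrt(x))

identical(piped, nested)

# Extra arguments follow the piped value: same as round(3.14159, 2)
rounded <- 3.14159 %>% round(2)
rounded
```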

Without the pipe operator:

iris_subset <- iris[, c("Species", "Sepal.Length", "Sepal.Width")]
iris_subset$Sepal_Ratio <- iris_subset$Sepal.Length / iris_subset$Sepal.Width

aggregate(Sepal_Ratio ~ Species, data = iris_subset, 
                    FUN = function(x) c(Average_ratio = mean(x), 
                                        standard_deviation = sd(x)))

With the pipe operator:

library(magrittr)
iris %>% 
  dplyr::select(Species, Sepal.Length, Sepal.Width) %>%
  dplyr::mutate(Sepal_Ratio = Sepal.Length/Sepal.Width) %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(Average_ratio = mean(Sepal_Ratio),
                   standard_deviation = sd(Sepal_Ratio))
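The following slides also use |> and %<>%. As a brief sketch: |> is the native pipe built into R since version 4.1, and for simple calls like the ones here it behaves like %>%. %<>% is the assignment pipe from magrittr, which pipes the object and then overwrites it with the result. The vector v is a made-up example:

```r
library(magrittr)

# Native pipe and magrittr pipe give the same result for a simple call
a <- iris %>% head(3)
b <- iris |> head(3)
identical(a, b)

# Assignment pipe: v %<>% sqrt() is shorthand for v <- v %>% sqrt()
v <- c(1, 4, 9)
v %<>% sqrt()
v
```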

Tidy the data with tidyr

A table is tidy if:

  • Each variable is in its own column
  • Each observation is in its own row
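As an illustration of these two rules, here is a small sketch with a hypothetical table of counts per station (the table and its column names are invented for this example). The years are values, but they are stored as column names, so the table is not tidy:

```r
library(tidyr)

# Hypothetical counts: the variable "year" is hidden in the column names
untidy <- data.frame(station = c("A", "B"),
                     `2021` = c(10, 7),
                     `2022` = c(12, 9),
                     check.names = FALSE)

# Tidy version: one column per variable (station, year, count)
# and one row per observation (one station in one year)
tidy <- untidy |>
  pivot_longer(cols = c(`2021`, `2022`),
               names_to = "year",
               values_to = "count")
tidy
```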

Key functions:

  • pivot_longer
  • pivot_wider
  • unite
  • separate

Tidy the data with tidyr - iris example

# Add an id column; %<>% overwrites iris with the result
iris %<>% dplyr::mutate(id = 1:150)
iris |> head(4)

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species  id
         5.1          3.5           1.4          0.2  setosa    1
         4.9          3.0           1.4          0.2  setosa    2
         4.7          3.2           1.3          0.2  setosa    3
         4.6          3.1           1.5          0.2  setosa    4

Tidy the data with tidyr - pivot_longer

long_iris <- iris |>
  tidyr::pivot_longer(1:4,
                      names_to = "Parameter",
                      values_to = "Size")
long_iris |> head(4)

Species  id  Parameter     Size
setosa    1  Sepal.Length   5.1
setosa    1  Sepal.Width    3.5
setosa    1  Petal.Length   1.4
setosa    1  Petal.Width    0.2
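The columns to pivot are given by position (1:4) here. A sketch of the same call with the columns given by name, which keeps working if the column order changes. It starts from a fresh copy of iris (iris_id is a name chosen for this example) so that it runs on its own:

```r
library(tidyr)

# Fresh copy with an id column, independent of earlier slides
iris_id <- datasets::iris
iris_id$id <- seq_len(nrow(iris_id))

# Columns selected by position
by_position <- iris_id |>
  pivot_longer(1:4, names_to = "Parameter", values_to = "Size")

# Same columns selected by name, as a range from first to last
by_name <- iris_id |>
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "Parameter", values_to = "Size")

# Each flower contributes four rows, one per measured parameter
nrow(by_name)
all.equal(by_position, by_name)
```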

Tidy the data with tidyr - separate

sep_iris <- long_iris |>
  tidyr::separate(Parameter, into = c("Organ", "Measure"))
sep_iris |> head(4)

Species  id  Organ  Measure  Size
setosa    1  Sepal  Length    5.1
setosa    1  Sepal  Width     3.5
setosa    1  Petal  Length    1.4
setosa    1  Petal  Width     0.2
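No separator is given in this call: by default separate() splits wherever it finds a non-alphanumeric character, which here is the dot in names such as Sepal.Length. A small sketch that makes the separator explicit, using a made-up two-row table called params:

```r
library(tidyr)

# Made-up table with the same naming pattern as the iris parameters
params <- data.frame(Parameter = c("Sepal.Length", "Petal.Width"),
                     Size = c(5.1, 0.2))

# sep is a regular expression, so the literal dot must be escaped
explicit <- params |>
  separate(Parameter, into = c("Organ", "Measure"), sep = "\\.")
explicit
```

Recent tidyr releases mark separate() as superseded in favour of separate_wider_delim(). The older function still works, so the slide code does not need to change.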

Tidy the data with tidyr - pivot_wider

wide_iris <- sep_iris |>
  tidyr::pivot_wider(names_from = "Organ",
                     values_from = "Size")
wide_iris |> head(4)

Species  id  Measure  Sepal  Petal
setosa    1  Length     5.1    1.4
setosa    1  Width      3.5    0.2
setosa    2  Length     4.9    1.4
setosa    2  Width      3.0    0.2

Tidy the data with tidyr - unite

unite_iris <- wide_iris |>
  tidyr::unite(col = "Sepal/Petal", c(Sepal, Petal), sep = "/")
unite_iris |> head(4)

Species  id  Measure  Sepal/Petal
setosa    1  Length   5.1/1.4
setosa    1  Width    3.5/0.2
setosa    2  Length   4.9/1.4
setosa    2  Width    3/0.2

Arrange the data with dplyr

dplyr simplifies data manipulation with self-explanatory functions:

  • filter observations based on their values
  • mutate a new column as a function of others
  • select variables based on their names
  • group_by one or more variables
  • summarise the data

iris %>%
  dplyr::filter(Petal.Length >= 1.4) %>%
  dplyr::mutate(Sepal_Ratio = Sepal.Length/Sepal.Width) %>%
  # on the original iris columns, equivalent to
  # select(-c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width))
  dplyr::select(Species, Sepal_Ratio) %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(Average_ratio = mean(Sepal_Ratio),
                   standard_deviation = sd(Sepal_Ratio))
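To see what each verb does to the shape of the table, the same pipeline can be run one step at a time. This is a sketch on a fresh copy of iris; the intermediate object names (flowers, filtered, and so on) are chosen for this example:

```r
library(dplyr)

flowers <- datasets::iris  # fresh copy: 150 rows, 5 columns

filtered <- flowers |> filter(Petal.Length >= 1.4)                        # fewer rows
mutated  <- filtered |> mutate(Sepal_Ratio = Sepal.Length / Sepal.Width)  # one more column
selected <- mutated |> select(Species, Sepal_Ratio)                       # only two columns
summarised <- selected |>
  group_by(Species) |>
  summarise(Average_ratio = mean(Sepal_Ratio))                            # one row per species

# Compare the dimensions (rows, columns) after each step
dim(flowers)
dim(filtered)
dim(mutated)
dim(selected)
dim(summarised)
```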

Data visualisation

Always look at the data: very different datasets can share almost identical summary statistics.

statistics value
mean_x 54.27
mean_y 47.83
sd_x 16.77
sd_y 26.94
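The dataset behind the table above is not shown here. As a stand-in, the same point can be sketched with Anscombe's quartet, which ships with R as the anscombe data frame: four x/y sets with nearly identical means, standard deviations and correlations, but very different shapes once plotted. The object names stats, long and p are chosen for this example:

```r
library(ggplot2)

# Summary statistics for each of the four sets (x1/y1 ... x4/y4)
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y), cor_xy = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)
round(stats, 2)

# Stack the four sets into one long table so they can be faceted
long <- data.frame(
  set = rep(paste0("set", 1:4), each = nrow(anscombe)),
  x = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# One panel per set: the differences are obvious only in the plot
p <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~set)
p
```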

Visualise the data with ggplot2

With base R:

plot(x = iris$Sepal.Length,
     y = iris$Sepal.Width)

With ggplot2:

library(ggplot2)
ggplot(iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point()

Visualise the data with ggplot2

With base R:

species_palette <- c("#F8766D", "#00BA38", "#619CFF")
size_vector <- iris$Petal.Width
plot(x = iris$Sepal.Length,
     y = iris$Sepal.Width,
     col = species_palette[iris$Species],
     bg = species_palette[iris$Species],
     pch = 21,
     cex = size_vector,
     xlim = c(min(iris$Sepal.Length), max(iris$Sepal.Length)),
     ylim = c(min(iris$Sepal.Width), max(iris$Sepal.Width)),
     xlab = "Sepal.Length",
     ylab = "Sepal.Width")
legend("topright",
       legend = levels(iris$Species),
       col = species_palette,
       pch = 21,
       pt.bg = species_palette,
       cex = 1.2,
       title = "Species")
grid(lwd = 1, lty = "dotted")

With ggplot2:

library(ggplot2)
ggplot(iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width,
                     fill = Species,
                     size = Petal.Width)) +
  geom_point(shape = 21,
             col = 1) +
  theme_minimal()

Grammar of graphics (gg)

  • Data
  • Aesthetics - Visual characteristics (e.g., x, y, size)
  • Geometry - How to represent the data (e.g., lines, points, boxplots)
  • Statistics - What statistics to show
  • Facets - Split the data
  • Coordinates - Position of the geometry
  • Themes - Visual changes

ggplot(data = <DATA>,
       mapping = aes(<AESTHETICS>)) +
  geom_<GEOMETRY>()
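One way to read this template: a plot is an object that starts from data and aesthetics, and every further component is added with +. A minimal sketch, with object names (base_plot, scatter, scatter_by_species) chosen for this example:

```r
library(ggplot2)

# Data + aesthetics only: an empty canvas with axes
base_plot <- ggplot(data = iris,
                    mapping = aes(x = Sepal.Length,
                                  y = Sepal.Width))

# Adding a geometry draws the observations
scatter <- base_plot + geom_point()

# Further components are added the same way, here a facet
scatter_by_species <- scatter + facet_wrap(~Species)

scatter_by_species
```

Storing intermediate plots this way makes it possible to reuse the same base with different layers.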

Let’s plot using ggplot2 - Data

ggplot(data = iris)

Let’s plot using ggplot2 - Aesthetics

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width))

Let’s plot using ggplot2 - Geometry

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point()

Let’s plot using ggplot2 - Geometry

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species))
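Inside aes(), col is mapped to a variable, so the colour depends on the data and a legend is drawn. Outside aes(), col is a fixed setting applied to every point. A short sketch of the difference ("steelblue" is an arbitrary colour chosen for this example):

```r
library(ggplot2)

# Mapped: one colour per species, chosen by the colour scale
mapped <- ggplot(data = iris,
                 mapping = aes(x = Sepal.Length,
                               y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species))

# Set: the same colour for every point, no legend
fixed <- ggplot(data = iris,
                mapping = aes(x = Sepal.Length,
                              y = Sepal.Width)) +
  geom_point(col = "steelblue")

mapped
fixed
```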

Let’s plot using ggplot2 - Statistics

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species)) +
  stat_smooth(method = "lm") +
  stat_smooth(method = "lm",
              mapping = aes(col = Species,
                            fill = Species))
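The two stat_smooth() layers draw least-squares lines: one for all flowers together and one per species. As a sketch of what is being fitted, the corresponding slopes can be computed directly with lm(), assuming the default formula y ~ x that stat_smooth() uses with method = "lm". The object names are chosen for this example:

```r
# One regression for all flowers (the layer without a Species mapping)
fit_all <- lm(Sepal.Width ~ Sepal.Length, data = iris)
slope_all <- coef(fit_all)[["Sepal.Length"]]

# One regression per species (the layer with col = Species)
slopes_by_species <- sapply(split(iris, iris$Species), function(d) {
  coef(lm(Sepal.Width ~ Sepal.Length, data = d))[["Sepal.Length"]]
})

slope_all
slopes_by_species
```

The pooled slope is negative while the slope within each species is positive, which is why the overall line and the per-species lines point in different directions.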

Let’s plot using ggplot2 - Facets

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species)) +
  stat_smooth(method = "lm") +
  stat_smooth(method = "lm",
              mapping = aes(col = Species,
                            fill = Species)) +
  facet_wrap(~Species)

Let’s plot using ggplot2 - Coordinates

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species)) +
  stat_smooth(method = "lm") +
  stat_smooth(method = "lm",
              mapping = aes(col = Species,
                            fill = Species)) +
  facet_wrap(~Species) +
  coord_polar()

Let’s plot using ggplot2 - Themes

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species)) +
  stat_smooth(method = "lm") +
  stat_smooth(method = "lm",
              mapping = aes(col = Species,
                            fill = Species)) +
  facet_wrap(~Species) +
  coord_polar() +
  theme_light()

Let’s plot using ggplot2 - Themes

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species)) +
  stat_smooth(method = "lm") +
  stat_smooth(method = "lm",
              mapping = aes(col = Species,
                            fill = Species)) +
  facet_wrap(~Species) +
  coord_polar() +
  theme_light() +
  theme(axis.line = element_line(color = "red"))

Let’s plot using ggplot2 - Themes

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species)) +
  stat_smooth(method = "lm") +
  stat_smooth(method = "lm",
              mapping = aes(col = Species,
                            fill = Species)) +
  facet_wrap(~Species) +
  coord_polar() +
  theme_light() +
  theme(axis.line = element_line(color = "red"),
        strip.text = element_text(size = 13, color = "pink"))
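A theme is itself an object, so the settings above can be stored once and added to several plots. A sketch, where lecture_theme, p1 and p2 are hypothetical names chosen for this example:

```r
library(ggplot2)

# Bundle the complete theme and the custom tweaks in one object
lecture_theme <- theme_light() +
  theme(axis.line = element_line(color = "red"),
        strip.text = element_text(size = 13, color = "pink"))

# Reuse it on two different plots
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  facet_wrap(~Species) +
  lecture_theme

p2 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  facet_wrap(~Species) +
  lecture_theme

p1
p2
```

The order matters: theme() tweaks must come after a complete theme such as theme_light(), otherwise the complete theme overwrites them.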

Let’s plot using ggplot2 - Extra tips

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species)) +
  stat_smooth(method = "lm") +
  stat_smooth(method = "lm",
              mapping = aes(col = Species,
                            fill = Species)) +
  facet_wrap(~Species) +
  coord_polar() +
  theme_light() +
  theme(axis.line = element_line(color = "red"),
        strip.text = element_text(size = 13, color = "pink")) +
  labs(title = "Sepal",
       subtitle = "relationship between width\nand length",
       x = "Length",
       y = "Width",
       caption = "Caption appears here")

Let’s plot using ggplot2 - Extra tips

ggplot(data = iris,
       mapping = aes(x = Sepal.Length,
                     y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species)) +
  stat_smooth(method = "lm") +
  stat_smooth(method = "lm",
              mapping = aes(col = Species,
                            fill = Species)) +
  facet_wrap(~Species) +
  coord_polar() +
  theme_light() +
  theme(axis.line = element_line(color = "red"),
        strip.text = element_text(size = 13, color = "pink")) +
  labs(title = "Sepal",
       subtitle = "relationship between width\nand length",
       x = "Length",
       y = "Width",
       caption = "Caption appears here") +
  scale_color_manual(values = c("forestgreen", "firebrick", "burlywood"))
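scale_color_manual() changes only the colour aesthetic, so the fill of the confidence bands keeps the default palette. A sketch of one way to handle this: a named vector (species_colours, a name chosen for this example) ties each colour to a species explicitly, and the same vector is passed to scale_fill_manual(). The formula = y ~ x argument only makes the default explicit:

```r
library(ggplot2)

# Named values are matched to the species by name, not by order
species_colours <- c(setosa = "forestgreen",
                     versicolor = "firebrick",
                     virginica = "burlywood")

p <- ggplot(data = iris,
            mapping = aes(x = Sepal.Length,
                          y = Sepal.Width)) +
  geom_point(mapping = aes(col = Species)) +
  stat_smooth(method = "lm",
              formula = y ~ x,
              mapping = aes(col = Species,
                            fill = Species)) +
  scale_color_manual(values = species_colours) +
  scale_fill_manual(values = species_colours)
p
```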