PSTAT 10 Worksheet 3 Solutions

Problem 1: Contains Duplicate

Write the function contains_duplicate(v) that takes a numeric vector v and returns TRUE if any value appears at least twice in the vector and FALSE otherwise.

There are several ways to do this. The easiest is to use the duplicated function.

contains_duplicate <- function(v) {
  any(duplicated(v))
}
contains_duplicate(c(1, 2, 3, 1))

## [1] TRUE

contains_duplicate(c(1, 2, 3, 4))

## [1] FALSE

contains_duplicate(c(1, 1, 1, 3, 3, 4, 3, 2, 4, 2))

## [1] TRUE
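As a sketch of what duplicated returns and of two other base R one-liners that give the same answer, consider the following. The names contains_duplicate_unique and contains_duplicate_any are hypothetical and not part of the worksheet.

```r
v <- c(1, 2, 3, 1)

# duplicated() flags each element that repeats an earlier one
duplicated(v)
## [1] FALSE FALSE FALSE  TRUE

# Alternative 1 (hypothetical name): unique() drops repeats, so a shorter
# result means at least one value appeared twice
contains_duplicate_unique <- function(v) {
  length(unique(v)) < length(v)
}

# Alternative 2 (hypothetical name): anyDuplicated() returns the index of
# the first repeated element, or 0 if there is none
contains_duplicate_any <- function(v) {
  anyDuplicated(v) > 0
}

contains_duplicate_unique(v)
## [1] TRUE
contains_duplicate_any(c(1, 2, 3, 4))
## [1] FALSE
```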

Another way involves using a loop:

contains_duplicate <- function(v) {
  seen <- rep(NA, length(v)) # Initialize with NA values
  for (i in seq_along(v)) {
    if (v[i] %in% seen) {
      return(TRUE)
    }
    seen[i] <- v[i]
  }
  return(FALSE)
}

We haven’t talked about NA yet, but we need it above for a technical reason: initializing with vector(length = length(v)) would create a logical vector filled with FALSE, and R treats FALSE as equal to 0 when matching it against numbers. This leads to wrong output if v contains a zero.
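To see the pitfall concretely, the sketch below builds the placeholder with vector(length = ...) and runs a deliberately broken copy of the loop. The names contains_duplicate_buggy and contains_duplicate_prefix are hypothetical and not part of the worksheet. The second is one possible variant that needs no placeholder, because it only searches the elements already visited.

```r
# vector(length = n) defaults to a logical vector of FALSE values
placeholder <- vector(length = 3)
placeholder
## [1] FALSE FALSE FALSE

# FALSE is coerced to 0 during matching, so 0 is "found" in the placeholder
0 %in% placeholder
## [1] TRUE

# Deliberately broken version (hypothetical name), for illustration only
contains_duplicate_buggy <- function(v) {
  seen <- vector(length = length(v))
  for (i in seq_along(v)) {
    if (v[i] %in% seen) {
      return(TRUE)
    }
    seen[i] <- v[i]
  }
  return(FALSE)
}
contains_duplicate_buggy(c(0, 1, 2)) # wrong: there is no duplicate
## [1] TRUE

# Possible variant (hypothetical name): compare v[i] with the prefix before it
contains_duplicate_prefix <- function(v) {
  for (i in seq_along(v)) {
    if (v[i] %in% v[seq_len(i - 1)]) {
      return(TRUE)
    }
  }
  return(FALSE)
}
contains_duplicate_prefix(c(0, 1, 2))
## [1] FALSE
```

Note that the NA placeholder has a similar limitation: if v itself contains an NA, the check v[i] %in% seen matches a placeholder. The prefix variant avoids this because it never looks at values that were not in v.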

Problem 2: More on iris

Convert the iris data frame to a tibble and call it iris_tbl.

library(dplyr) # provides as_tibble(), filter(), mutate(), and select()
iris_tbl <- as_tibble(iris)

Find the median Petal.Width and then create a tibble that contains only the rows whose petal width is greater than the median.

median(iris_tbl$Petal.Width)

## [1] 1.3

iris_tbl |>
  filter(Petal.Width > median(iris_tbl$Petal.Width))

## # A tibble: 72 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          7           3.2          4.7         1.4 versicolor
##  2          6.4         3.2          4.5         1.5 versicolor
##  3          6.9         3.1          4.9         1.5 versicolor
##  4          6.5         2.8          4.6         1.5 versicolor
##  5          6.3         3.3          4.7         1.6 versicolor
##  6          5.2         2.7          3.9         1.4 versicolor
##  7          5.9         3            4.2         1.5 versicolor
##  8          6.1         2.9          4.7         1.4 versicolor
##  9          6.7         3.1          4.4         1.4 versicolor
## 10          5.6         3            4.5         1.5 versicolor
## # ... with 62 more rows

Define the area of a petal as its length times its width. Create a tibble containing only the variables Sepal.Length, Sepal.Width, Species, and Petal.Area and only the rows where the petal width is greater than the median.

iris_tbl |>
  filter(Petal.Width > median(Petal.Width)) |>
  mutate(Petal.Area = Petal.Width * Petal.Length) |>
  select(-Petal.Length, -Petal.Width)

## # A tibble: 72 x 4
##    Sepal.Length Sepal.Width Species    Petal.Area
##           <dbl>       <dbl> <fct>           <dbl>
##  1          7           3.2 versicolor       6.58
##  2          6.4         3.2 versicolor       6.75
##  3          6.9         3.1 versicolor       7.35
##  4          6.5         2.8 versicolor       6.9
##  5          6.3         3.3 versicolor       7.52
##  6          5.2         2.7 versicolor       5.46
##  7          5.9         3   versicolor       6.3
##  8          6.1         2.9 versicolor       6.58
##  9          6.7         3.1 versicolor       6.16
## 10          5.6         3   versicolor       6.75
## # ... with 62 more rows
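As a cross-check, the same filter, mutate, and select steps can be sketched in base R without any packages. This is an alternative formulation for comparison, not the worksheet's intended dplyr solution. The names above and result are placeholders.

```r
# Keep only the rows whose petal width exceeds the overall median
above <- iris[iris$Petal.Width > median(iris$Petal.Width), ]

# Add the petal area as a new column
above$Petal.Area <- above$Petal.Width * above$Petal.Length

# Keep the four requested variables, in the requested order
result <- above[, c("Sepal.Length", "Sepal.Width", "Species", "Petal.Area")]

dim(result)
## [1] 72  4
head(result$Petal.Area, 3)
## [1] 6.58 6.75 7.35
```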

Problem 3: More on heights data

Load the heights_df data frame from worksheet 1.

heights_df <- read.csv("heights.csv")

Recall the height variable is given in centimeters (cm). In worksheet 2, we created cm_to_ft_inch that converts from cm to a string representation of feet and inches.

Using dplyr functionality, create a tibble with a variable height_ft_in in place of height.

heights_df |>
  as_tibble() |>
  mutate(height_ft_in = cm_to_ft_inch(height)) |>
  select(-height)

## # A tibble: 506 x 4
##     id_. gender   age height_ft_in
##    <int> <chr>  <int> <chr>
##  1     1 Female    19 5 2
##  2     2 Female    19 5 7
##  3     3 Female    22 5 6
##  4     4 Male      19 5 11
##  5     5 Female    21 5 8
##  6     6 Male      19 6 2
##  7     7 Female    21 5 1
##  8     8 Female    21 5 5
##  9     9 Male      18 6 4
## 10    10 Female    18 5 4
## # ... with 496 more rows
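The definition of cm_to_ft_inch from worksheet 2 is not reproduced here. For mutate to work as above, the function must be vectorized, that is, it must accept a whole column of heights and return one string per element. A minimal stand-in might look like the sketch below. The name cm_to_ft_inch_sketch, the rounding to the nearest whole inch, and the space-separated output format are assumptions and may differ from the worksheet 2 version.

```r
# Hypothetical stand-in for cm_to_ft_inch; rounding rule is an assumption
cm_to_ft_inch_sketch <- function(cm) {
  total_in <- round(cm / 2.54) # total height in whole inches
  feet <- total_in %/% 12      # integer division gives whole feet
  inches <- total_in %% 12     # remainder gives leftover inches
  paste(feet, inches)          # vectorized: one string per input element
}

cm_to_ft_inch_sketch(c(157, 170, 188))
## [1] "5 2" "5 7" "6 2"
```

Rounding the total inches before splitting into feet and inches avoids results such as "5 12" that can appear when the inches are rounded separately.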