PSTAT 10 Homework 2 Solutions

Problem 1: Convergence to e

The number \(e\approx 2.718\) can be expressed in many different ways. One way is as a limit and another is as an infinite series: \[\begin{align*} e &= \lim_{n\to\infty} \left(1+\frac{1}{n}\right)^n &\text{limit representation} \\ e &= \sum_{k=0}^\infty \frac{1}{k!} &\text{series representation} \end{align*}\]

Loop Approach

# First initialize the vector of values
x <- 1:100
e_limit <- vector(length = length(x))
e_limit[1] <- 1
e_series <- vector(length = length(x))
e_series[1] <- 1

# Next, use a loop to fill in the values.
for(i in seq(2, length(x))) {
  e_limit[i] <- (1 + 1/i)^i
  e_series[i] <- e_series[i-1] + 1/factorial(i-1)
}

The above loop can be written in several ways. I started the loop index at 2 instead of 1 since I had already filled in the value at index 1. One must take care when dealing with loop indices at the boundary of the range.

Vectorized Approach

A simpler, faster, and more elegant way is to use vectorization. Think about how this works.

x <- 0:100
e_limit <- (1 + 1/x)^x
e_series <- cumsum(1/factorial(x))

Regardless of the approach, here is the code to generate the plot.

plot(x, e_limit, type="n", main="Convergence to e", ylab="y")
points(x, e_limit, type="l", col = "blue")
points(x, e_series, type="l", col = "red")
legend("bottomright", legend=c("Limit", "Series"), col=c("blue", "red"), lty=1)

Problem 2: nycflights13

For this problem, we will work with the flights data set in the nycflights13 package. Install and load the nycflights13 package. Also make sure you load tidyverse in case you need it.

library(nycflights13)
library(tidyverse)

Provide a brief description of the data set and a few of the variables (use ?flights). Is flights a tibble?
flights is a tibble. This can be seen by printing flights to the console or by calling is_tibble(flights).
Extract a tibble containing American Airlines (AA) flights to LAX that departed before 1030. Return only the columns month, day, dep_time, dest, and carrier. How many flights fit these criteria?

flights |>
  filter(dest == "LAX", carrier == "AA", dep_time < 1030) |>
  select(month, day, dep_time, dest, carrier)

## # A tibble: 977 x 5
##    month   day dep_time dest  carrier
##    <int> <int>    <int> <chr> <chr>  
##  1     1     1      743 LAX   AA     
##  2     1     1      856 LAX   AA     
##  3     1     1     1026 LAX   AA     
##  4     1     2      732 LAX   AA     
##  5     1     2      855 LAX   AA     
##  6     1     3      730 LAX   AA     
##  7     1     3      855 LAX   AA     
##  8     1     3     1024 LAX   AA     
##  9     1     4      728 LAX   AA     
## 10     1     4      858 LAX   AA     
## # ... with 967 more rows

We see 977 rows. Hence 977 flights fit these criteria

Extract a tibble containing all flights on Christmas day; 12/25/2013. What is the total number of miles traveled across all flights on this day?

flights_christmas <- flights |>
  filter(month == 12, day == 25)
sum(flights_christmas$distance)

## [1] 803747

We created a new tibble flights_christmas and then computed the sum of the distances directly. Alternatively, you could use the summarize function:

flights |>
  filter(month == 12, day == 25) |>
  summarize(sum(distance))

## # A tibble: 1 x 1
##   `sum(distance)`
##             <dbl>
## 1          803747

The air_time variable gives the duration of the flight in minutes. Create a tibble containing flights on Christmas day and only the variables month, day, origin, dest, and air_time_hour where the last variable gives the duration of the flight in hours.

flights |>
  filter(month == 12, day == 25) |>
  mutate(air_time_hour = air_time / 60) |>
  select(month, day, origin, dest, air_time_hour)

## # A tibble: 719 x 5
##    month   day origin dest  air_time_hour
##    <int> <int> <chr>  <chr>         <dbl>
##  1    12    25 EWR    CLT            1.63
##  2    12    25 EWR    IAH            3.38
##  3    12    25 JFK    MIA            2.43
##  4    12    25 JFK    BQN            3.18
##  5    12    25 LGA    ORD            2.05
##  6    12    25 LGA    DTW            1.47
##  7    12    25 LGA    ATL            1.97
##  8    12    25 LGA    FLL            2.45
##  9    12    25 EWR    FLL            2.48
## 10    12    25 JFK    MCO            2.28
## # ... with 709 more rows

The order here is important; we must create the variable with mutate before selecting it.

Problem 3: mtcars

For this problem we will work with the mtcars data set in the datasets library.

library(datasets)

Provide a brief description of this data set and some of the variables. Is it a tibble?
Not a tibble, as can be seen by running mtcars in the console or using is_tibble.
Create the following histogram of displacement along with vertical lines indicating the mean and median displacement. Hint: I set the breaks to go from 0 to 500 in increments of 25.

hist(mtcars$disp, breaks=seq(0, 500, 25),
     main= "Hist of Displacement",
     xlab = "Displacement (cu.in.)")
abline(v = c(mean(mtcars$disp), median(mtcars$disp)),
       lty=c(2,3), lwd=2, col="red")
legend("topright", legend = c("mean disp", "median disp"),
       lty=c(2,3), lwd=2, col="red")

Create the following boxplot of the 1/4 mile time against the number of cylinders. There is one outlier. Which car is the outlier? To answer this, note that the names of the cars are row names in the data set.

boxplot(mtcars$qsec ~ mtcars$cyl,
        xlab = "Number of Cylinders", ylab="1/4 mile time (sec)")

The outlier is the car with the largest value of qsec. You can find this using which or which.max:

# Find the largest qsec
max_qsec <- max(mtcars$qsec)
# Find the index of the largest
max_qsec_idx <- which(mtcars$qsec == max_qsec)
# Get the row corresponding to this index
mtcars[max_qsec_idx, ]

##           mpg cyl  disp hp drat   wt qsec vs am gear carb
## Merc 230 22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2

The outlier is the Merc 230. The code below uses which.max:

mtcars[which.max(mtcars$qsec),]

##           mpg cyl  disp hp drat   wt qsec vs am gear carb
## Merc 230 22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2

Create the following barplot of counts of cars with different numbers of cylinders.

barplot(table(mtcars$cyl), ylab="Count", xlab="Number of cylinders")

Problem 4: Search insert position

Write a function search_insert_position(v, target) which takes a sorted numerical vector v, a numerical target target. If target is present in v, return the index of target. Otherwise, return the index in v of target where it would be if it were inserted in order.

The input v is guaranteed to be sorted and to contain unique values (i.e. no duplicates).

search_insert_position <- function(v, target) {
# target is present in v; return index of target
  if (target %in% v) {
    return(which(target == v))
  }
# if target exceeds all values in v, locate the last index of v: length(v)
  if (all(v < target)) {
    return(length(v) + 1)
  }
# Otherwise, find the first index i where v[i] > target
  return(which(v > target)[1])
}
x <- c(1, 3, 5, 6)
search_insert_position(x, 5)

## [1] 3

search_insert_position(x, 2)

## [1] 2

search_insert_position(x, 7)

## [1] 5