jeremydata: Find Index of First Instance

Jeremy Allen

Yes, another data.table and tidyverse speed test, but this is more than that, I swear! This is real code from me working through a specific real-world issue.

The problem: get the position of the first instance of a value in one column, and use that position to pick the corresponding value from another column. In my case, that means returning the first date on which a specified number of cases was reached.

Load packages and make some fake data.


library(data.table)
library(dplyr)
library(purrr)

# lots of dates
date = seq.Date(from = as.Date("1900-01-01"),
                to = as.Date("2900-12-31"),
                by = "day")

# lots of cases: 0, 1, 2, ..., one value per date
cases = seq_along(date) - 1

# make a dataframe
df <- data.frame(date = date, cases = cases)

Let’s use which.max() to get the first date on which cases reached 10,000.


# Get the position of the first instance of 10,000 in the cases col,
# and use that number to index the date col, returning the first date
# on which 10,000 cases occurred.
dt <- as.data.table(df) # convert to data.table first
dt[, date[which.max(cases >= 10000)]]


[1] "1927-05-20"


# this only works because a value of 10000 or more can be found in that column.

However, when no value meets the condition, which.max() still returns 1: every comparison is FALSE, and the first position of that maximum is 1. That indexes our first date, which we do not want because there are no days with 400,000 or more cases.

We need NA returned when we don’t find an instance of the value we are looking for.


# which.max returns 1 when it fails, thus indexing
# our first date, which we do not want because there
# are no days with 400,000 or more cases. We expect NA.
dt[, date[which.max(cases >= 400000)]]


[1] "1900-01-01"
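To see why that 1 is a problem, here is a minimal sketch on a toy vector (the vector `x` and its values are made up for illustration). A miss gives the same answer as a genuine hit at position 1, so the result alone cannot tell the two apart.

```r
# Toy vector, hypothetical values chosen only for illustration
x <- c(3, 7, 12, 20)

# Threshold that is reached: position of the first TRUE
found_at <- which.max(x >= 10)

# Threshold that is never reached: every comparison is FALSE,
# so the first position of the maximum (FALSE) is returned
not_found <- which.max(x >= 100)

# Threshold that is genuinely reached at the first element
true_first <- which.max(x >= 1)

print(c(found_at, not_found, true_first))
```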

Let’s test many methods. We want to speed test them at the end, so I’m putting each method inside a function because it’s easier to add them as functions in the speed test once we get there.


# which.max(), does NOT return NA when it fails. Bad.
dt_which_max_method <- function() {
  dt <- as.data.table(df)
  dt[, date[which.max(cases >= 400000)]]
} 

# match(TRUE, x) will return NA when it fails, which
# is what we want so that we don't get a date returned
# when there are no days with 400,000 or more cases
dt_match_true_method <- function() {
  dt <- as.data.table(df)
  dt[, date[match(TRUE, cases >= 400000)]]
} 

# which()[1], test them all and return the first one, also returns NA
dt_which_first_method <- function() {
  dt <- as.data.table(df)
  dt[, date[which(cases >= 400000)[1L]]]  
}

# use base R's Position function, also returns NA
dt_position_method <- function() {
  dt <- as.data.table(df)
  dt[, date[Position(function(x) x >= 400000, cases)]]
}

# Tidyverse's purrr::detect_index() returns 0 when nothing matches,
# so slice() keeps no rows and we get 'Date of length 0' instead of NA
tv_purrr_method <- function() {
  tb <- tibble::as_tibble(df)
  tb %>%
    slice(purrr::detect_index(cases, ~.x >= 400000)) %>% 
    pull(date)
}

# Tidyverse mixed with base R's match function,
# using the same 400,000 threshold as the other methods
tv_match_method <- function() {
  tb <- tibble::as_tibble(df)
  tb %>%
    slice(match(TRUE, cases >= 400000)) %>% 
    pull(date)
}
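As a quick sanity check of the base R approaches, the sketch below runs them on a tiny made-up dataset (the `toy_dates` and `toy_cases` vectors are assumptions for illustration, not the data above). On a miss, each one yields NA, and indexing a Date vector with NA gives a single NA date rather than a wrong date.

```r
# Tiny hypothetical dataset: four consecutive days
toy_dates <- as.Date("2020-01-01") + 0:3
toy_cases <- c(3, 7, 12, 20)

# No value reaches 100, so each approach should report a miss as NA
idx_match    <- match(TRUE, toy_cases >= 100)
idx_which    <- which(toy_cases >= 100)[1L]
idx_position <- Position(function(x) x >= 100, toy_cases)

# Indexing with NA returns one NA date
miss_date <- toy_dates[idx_match]

# A threshold that is reached returns the matching date
hit_date <- toy_dates[match(TRUE, toy_cases >= 10)]

print(miss_date)
print(hit_date)
```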

Get each function into microbenchmark and test each one 100 times.


#--- Speed test them each 100 times

microbenchmark::microbenchmark(
  dt_which_max_method(),
  dt_match_true_method(),
  dt_which_first_method(),
  dt_position_method(),
  tv_purrr_method(),
  tv_match_method(),
  times = 100L
)


Unit: milliseconds
                    expr        min         lq       mean     median         uq        max neval
   dt_which_max_method()   1.945617   2.426501   3.481390   2.705386   3.306638   10.57839   100
  dt_match_true_method()   1.822808   2.400708   3.650070   2.729901   3.797825   16.06448   100
 dt_which_first_method()   1.945631   2.366789   3.678050   2.605177   4.860582   10.31487   100
    dt_position_method() 160.590374 178.364530 188.521428 184.895142 194.051222  265.28539   100
       tv_purrr_method() 785.695242 901.754074 947.888644 931.474398 977.642047 1403.64966   100
       tv_match_method()   1.444482   1.897420   3.139463   2.094831   2.812000   67.40537   100

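The vectorized methods, such as match() used on either a data.table or a tidyverse tibble, are clear winners: their median times are about 2 to 3 milliseconds, versus about 185 milliseconds for base R's Position() and about 931 milliseconds for purrr's detect_index(). Of the fast methods, only match() and which()[1] also return NA when the value is never reached.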
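If you want to reuse the match() approach, one option is to wrap it in a small helper. This is only a sketch: the function name `first_date_reaching()` and its arguments are my own invention for illustration, and the toy data is made up.

```r
# Hypothetical helper: first date on which cases reached a threshold.
# Returns an NA date when the threshold is never reached.
first_date_reaching <- function(dates, cases, threshold) {
  stopifnot(length(dates) == length(cases))
  dates[match(TRUE, cases >= threshold)]
}

# Toy data for illustration only
toy_dates <- as.Date("2020-01-01") + 0:3
toy_cases <- c(3, 7, 12, 20)

reached     <- first_date_reaching(toy_dates, toy_cases, 10)
not_reached <- first_date_reaching(toy_dates, toy_cases, 100)

print(reached)
print(not_reached)
```

Because the helper takes plain vectors, it should work the same way inside a data.table `j` expression or a dplyr `summarise()` call.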
