Find one thing with another thing. We’ll speed test various data.table and tidyverse methods for finding the position of the first match and use that to index another column.
Yes, another data.table and tidyverse speed test, but this is more than that! I swear. This is real code of me working through a specific real-world issue.
The problem: get the position of the first instance of a value in one column, and use that position to pick a value from another column. In my case, that means returning the first date on which a specified number of cases occurred.
Load packages and make some fake data.
library(data.table)
library(dplyr)
library(purrr)
# lots of dates
date = seq.Date(from = as.Date("1900-01-01"),
to = as.Date("2900-12-31"),
by = "day")
# lots of cases: 0, 1, 2, ... (one value per date)
cases = seq_along(date) - 1
# make a dataframe
df <- data.frame(date = date, cases = cases)
Let’s use which.max() to get the date on which the first instance of 10,000 cases occurred.
# Get the position of the first instance of 10,000 in the cases col,
# and use that number to index the date col, returning the first date
# on which 10,000 cases occurred.
dt <- as.data.table(df) # convert to data.table first
dt[, date[which.max(cases >= 10000)]]
[1] "1927-05-20"
# This only works because the cases column contains a value of at least 10000.
However, which.max() returns 1 when nothing matches: the comparison is FALSE everywhere, so the maximum is FALSE and its first position is 1. That indexes our first date, which we do not want, because there are no days with 400,000 or more cases.
We need NA returned when we don’t find an instance of the value we are looking for.
# which.max returns 1 when it fails, thus indexing
# our first date, which we do not want because there
# are no days with 400,000 or more cases. We expect NA.
dt[, date[which.max(cases >= 400000)]]
[1] "1900-01-01"
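The difference is easy to see on a toy vector. The sketch below uses made-up values and base R only; toy_cases, toy_hits, and guarded_idx are illustrative names that do not appear in the original code, and the guarded which.max() at the end is just one possible workaround, not a method benchmarked in this post.

```r
# Toy data: no value reaches the threshold of 10
toy_cases <- c(0, 1, 2, 3)
toy_hits  <- toy_cases >= 10     # FALSE FALSE FALSE FALSE

which.max(toy_hits)   # 1: the maximum is FALSE, and its first position is 1
match(TRUE, toy_hits) # NA: there is no TRUE to match
which(toy_hits)[1L]   # NA: first element of an empty integer vector

# One possible guard if you want to keep which.max() (an assumption, not from the post)
guarded_idx <- if (any(toy_hits)) which.max(toy_hits) else NA_integer_
guarded_idx           # NA
```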
Let’s test several methods. We want to speed test them at the end, so I’m wrapping each method in a function, which makes it easy to pass them to the benchmark once we get there.
# which.max(), does NOT return NA when it fails. Bad.
dt_which_max_method <- function() {
dt <- as.data.table(df)
dt[, date[which.max(cases >= 400000)]]
}
# match(TRUE, x) will return NA when it fails, which
# is what we want so that we don't get a date returned
# when there are no days with 400,000 or more cases
dt_match_true_method <- function() {
dt <- as.data.table(df)
dt[, date[match(TRUE, cases >= 400000)]]
}
# which()[1], test them all and return the first one, also returns NA
dt_which_first_method <- function() {
dt <- as.data.table(df)
dt[, date[which(cases >= 400000)[1L]]]
}
# use base R's Position function, also returns NA
dt_position_method <- function() {
dt <- as.data.table(df)
dt[, date[Position(function(x) x >= 400000, cases)]]
}
# Tidyverse's purrr::detect_index(), returns 'Date of length 0'
tv_purrr_method <- function() {
tb <- tibble::as_tibble(df)
tb %>%
slice(purrr::detect_index(cases, ~.x >= 400000)) %>%
pull(date)
}
# Tidyverse mixed with the base R's match function
tv_match_method <- function() {
tb <- tibble::as_tibble(df)
tb %>%
slice(match(TRUE, cases >= 400000)) %>%
pull(date)
}
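The methods above signal "no match" in two different ways: match(), which()[1L], and Position() give an NA index, while purrr::detect_index() gives 0. The base R sketch below shows what each kind of index does when used on a Date vector. The names toy_dates, idx_zero, idx_na, and zero_to_na() are hypothetical and exist only for this illustration.

```r
# Four made-up dates to index into
toy_dates <- as.Date("2020-01-01") + 0:3

idx_zero <- 0L            # what purrr::detect_index() returns when nothing matches
idx_na   <- NA_integer_   # what match(), which()[1L], and Position() return

length(toy_dates[idx_zero])  # 0: an empty Date vector
toy_dates[idx_na]            # NA: a single missing Date

# Hypothetical helper: turn a "not found" index of 0 into NA
zero_to_na <- function(i) {
  if (identical(as.integer(i), 0L)) NA_integer_ else i
}

toy_dates[zero_to_na(idx_zero)]  # NA, same as the other methods
```

If a downstream step expects exactly one value per lookup, normalizing the index like this keeps the results of all the methods comparable.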
Get each function into microbenchmark and test each one 100 times.
#--- Speed test them each 100 times
microbenchmark::microbenchmark(
dt_which_max_method(),
dt_match_true_method(),
dt_which_first_method(),
dt_position_method(),
tv_purrr_method(),
tv_match_method(),
times = 100L
)
Unit: milliseconds
                    expr        min         lq       mean     median         uq        max neval
   dt_which_max_method()   1.945617   2.426501   3.481390   2.705386   3.306638   10.57839   100
  dt_match_true_method()   1.822808   2.400708   3.650070   2.729901   3.797825   16.06448   100
 dt_which_first_method()   1.945631   2.366789   3.678050   2.605177   4.860582   10.31487   100
    dt_position_method() 160.590374 178.364530 188.521428 184.895142 194.051222  265.28539   100
       tv_purrr_method() 785.695242 901.754074 947.888644 931.474398 977.642047 1403.64966   100
       tv_match_method()   1.444482   1.897420   3.139463   2.094831   2.812000   67.40537   100
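A plausible explanation for the slow Position() timing is how the function works: base R's Position() loops over the vector and calls the predicate once per element until it finds a match, whereas a comparison such as cases >= 400000 is evaluated in a single vectorized call. The sketch below counts predicate calls on a toy vector; n_calls, counting_pred, toy_values, and pos_idx are hypothetical names used only here.

```r
# Count how many times Position() calls its predicate
n_calls <- 0L
counting_pred <- function(x) {
  n_calls <<- n_calls + 1L   # side effect: record each call
  x >= 5
}

toy_values <- 0:9

pos_idx <- Position(counting_pred, toy_values)
pos_idx   # 6: the value 5 sits at position 6
n_calls   # 6: one R function call per element inspected

# The vectorized equivalent does one comparison over the whole vector
match(TRUE, toy_values >= 5)   # 6
```

When no element matches, the predicate is called once for every element, which on a vector of several hundred thousand dates adds up to that many R-level function calls per lookup.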
The vectorized methods, such as match(), used on either a data.table or a tidyverse tibble, are clear winners over base R's Position() and purrr's detect_index(): median times of roughly 2 to 3 milliseconds versus about 185 and 931 milliseconds.
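For reuse, the match() approach can be wrapped in a small function that takes the threshold as an argument. This is a minimal base R sketch, not code from the benchmark: first_date_at_least(), demo_dates, and demo_cases are hypothetical names, and the demo data is a shorter version of the fake data built earlier.

```r
# Hypothetical helper: first date on which cases reached the threshold, or NA
first_date_at_least <- function(dates, cases, threshold) {
  stopifnot(length(dates) == length(cases))
  dates[match(TRUE, cases >= threshold)]
}

# Smaller demo data in the same shape as the post's fake data
demo_dates <- seq.Date(from = as.Date("1900-01-01"), by = "day", length.out = 20000)
demo_cases <- seq_along(demo_dates) - 1

first_date_at_least(demo_dates, demo_cases, 10000)   # "1927-05-20"
first_date_at_least(demo_dates, demo_cases, 400000)  # NA
```

Because the function works on plain vectors, it can be called inside a data.table j expression or on columns pulled from a tibble.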
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Allen (2020, May 11). jeremydata: Find Index of First Instance. Retrieved from https://jeremydata.com/posts/2020-05-11-find-index-of-first-instance/
BibTeX citation
@misc{allen2020find,
  author = {Allen, Jeremy},
  title = {jeremydata: Find Index of First Instance},
  url = {https://jeremydata.com/posts/2020-05-11-find-index-of-first-instance/},
  year = {2020}
}