This blog has relocated to https://coolbutuseless.github.io and associated packages are now hosted at https://github.com/coolbutuseless.

29 April 2018

mikefc

tidyverse split(): The journey to the pull request

Last week I searched for a replacement for group_by + do, and this ended with split + map_dfr being my favourite alternative. Conceptually it was the most compact representation of the idea (just 2 commands) and avoided the extra work that seemed necessary to nest a data.frame and operate on the nested data.

I then looked more closely at the split() function, in particular its runtime characteristics and its idiosyncrasies, and noted that in the base R split() function:

  1. runtime is quadratic in number of splitting variables - something nobody ever wants
  2. runtime is quadratic in number of groups within each variable - something nobody ever wants
  3. the splitting variable gets recycled if it’s not as long as the data.frame being split - something nobody ever wants
  4. values corresponding to NA values of the split variable are completely dropped from the data - something nobody ever wants
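Items 3 and 4 are easy to reproduce in base R. The following is a minimal sketch on a made-up four-row data.frame; `df`, its columns, and the splitting values "p" and "q" are purely illustrative and not taken from the earlier posts.

```r
# Toy data: 4 rows, with one NA in the grouping column
df <- data.frame(x = 1:4, g = c("a", NA, "b", "a"))

# Item 4: the row whose split value is NA silently disappears
parts     <- split(df, df$g)
rows_kept <- sum(vapply(parts, nrow, integer(1)))
rows_kept < nrow(df)   # TRUE: only 3 of the 4 rows survive the split

# Item 3: a length-2 splitting vector is recycled across all 4 rows,
# so rows 1 and 3 land in "p" and rows 2 and 4 land in "q"
recycled <- split(df, c("p", "q"))
recycled$p$x
```

No error is raised in either case, which is what makes both behaviours easy to miss.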

After seeing these problems, I sketched out a tidyverse version of split(), which I called cleave_by(), and wrote a post about how it solved some of split()'s issues.

hadleywickham suggested that I submit the code to the tidyr package. So I tidied the code, renamed the function to chop(), and opened a pull request.

The rest of this post briefly shows how chop() + map_dfr() is a workable replacement for group_by() + do().

cleave_by() tidied up => chop()

gshotwell suggested that cleave_by() ought to respect any groupings on the data.frame and hadleywickham suggested that the behaviour should probably be similar to tidyr::nest().

With that in mind, I rewrote cleave_by() to be more nest()-like, and renamed it to chop() (as no-one really liked the verb cleave and I wasn’t attached to it either).

library(rlang)

chop <- function(data, ...) {

  # Resolve the tidyselect specification in ... to a character vector of column names
  chop_vars <- unname(tidyselect::vars_select(names(data), ...))

  # Only use group vars if no chop vars specified
  if (is_empty(chop_vars)) {
    chop_vars <- dplyr::group_vars(data)
  }

  data <- dplyr::ungroup(data)   # Same as nest() - chopped data frames are ungrouped.
  data <- dplyr::as_tibble(data) # Ensure we consistently return a list of tibbles

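  # Nothing to chop by (or no rows): return the whole tibble as a one-element list
  if (is_empty(chop_vars) || nrow(data) == 0) {
    return(list(data))
  }

  # One integer group index per row, computed across all chop variables
  idx <- dplyr::group_indices(data, !!! syms(chop_vars))

  # Split on that index and drop the names, leaving a plain list of tibbles
  unname(split(data, idx))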
}

Using chop() + map_dfr() to replace group_by() + do()

In the following (very simplified!) application of split-apply-combine, I show how group_by() + do() and chop() + map_dfr() can be used to apply complex_func() to mtcars subsetted into groups by the value of cyl.

Using chop() + map_dfr() turns out to be a tiny bit simpler than group_by() + do(), as there is no need for the final ungroup() (forgetting it can be disastrous!).
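The definition of complex_func() isn't shown in this post; any function that takes a data.frame and returns a data.frame will do. For anyone following along, here is a hypothetical stand-in, assumed only because it is consistent with the new_value column in the output below. The real function may well differ.

```r
# Hypothetical stand-in for complex_func(): NOT the original definition.
# Assumes every row of `df` shares the same `cyl` value (true after splitting by cyl).
complex_func <- function(df) {
  df$new_value <- paste("cyl plus one is", df$cyl[1] + 1)
  df
}

# Quick check on the 4-cylinder subset of mtcars, using base R only
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "cyl", "disp")]
res <- complex_func(four_cyl)
unique(res$new_value)
```

Because it only appends a column, the same function works unchanged inside do() and map_dfr().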

library(dplyr)  # %>%, select(), group_by(), do(), ungroup()
library(purrr)  # map_dfr()

mtcars %>%
  select(mpg, cyl, disp) %>%
  group_by(cyl) %>%
  do(complex_func(.)) %>%
  ungroup()
## # A tibble: 32 x 4
##      mpg   cyl  disp new_value        
##    <dbl> <dbl> <dbl> <chr>            
##  1  22.8    4. 108.  cyl plus one is 5
##  2  24.4    4. 147.  cyl plus one is 5
##  3  22.8    4. 141.  cyl plus one is 5
##  4  32.4    4.  78.7 cyl plus one is 5
##  5  30.4    4.  75.7 cyl plus one is 5
##  6  33.9    4.  71.1 cyl plus one is 5
##  7  21.5    4. 120.  cyl plus one is 5
##  8  27.3    4.  79.0 cyl plus one is 5
##  9  26.0    4. 120.  cyl plus one is 5
## 10  30.4    4.  95.1 cyl plus one is 5
## # ... with 22 more rows

mtcars %>%
  select(mpg, cyl, disp) %>%
  chop(cyl) %>%
  map_dfr(complex_func)
## # A tibble: 32 x 4
##      mpg   cyl  disp new_value        
##    <dbl> <dbl> <dbl> <chr>            
##  1  22.8    4. 108.  cyl plus one is 5
##  2  24.4    4. 147.  cyl plus one is 5
##  3  22.8    4. 141.  cyl plus one is 5
##  4  32.4    4.  78.7 cyl plus one is 5
##  5  30.4    4.  75.7 cyl plus one is 5
##  6  33.9    4.  71.1 cyl plus one is 5
##  7  21.5    4. 120.  cyl plus one is 5
##  8  27.3    4.  79.0 cyl plus one is 5
##  9  26.0    4. 120.  cyl plus one is 5
## 10  30.4    4.  95.1 cyl plus one is 5
## # ... with 22 more rows

Conclusion

  • A case for a tidyverse split() has been made.
  • The code has been written.
  • A pull request has been opened.
  • Now I wait…