Introduction
I currently process a lot of data one entity at a time, but my input is a single data.frame representing multiple entities.
I have specialist functions that do a lot of work on the data.frame for a single entity, so I want to split the original data.frame into multiple data.frames containing just one entity each and then process them one at a time.
This is a classic case of split-apply-combine, as outlined in Hadley Wickham’s JStatSoft paper (pdf) and Jenny Bryan’s Stat545 notes.
Currently I use dplyr group_by then do to achieve this.
But as of 2016, dplyr::do() is “basically deprecated” according to Hadley Wickham on Twitter:
@ijlyttle btw dplyr::do() is now basically deprecated in favour of the purrr approach
— Hadley Wickham (@hadleywickham) April 11, 2016
On a more recent community.rstudio.com thread, Hadley expanded:
do() is definitely going away in the long term, but I’m not yet sure we have comprehensive alternative solutions to all problems that do() solves.
(Also “going away” means that we won’t make improvements to it and we won’t mention it in documentation and tutorials, but the code will continue to exist for a number of years)
Given that the original tweet was from 2 years ago, and I’m still using group_by/do, it’s time I started searching for a usable “purrr approach” that suits my needs.
If there are tidyverse options I haven’t yet discovered, please let me know on twitter!
- Split-Apply-Combine - Prehistoric times - split, lapply, do.call(rbind, …)
- Split-Apply-Combine - Stone Age with plyr - plyr::ddply
- Split-Apply-Combine - Early tidyverse era - group_by, do
- Split-Apply-Combine - Early-mid tidyverse era - group_by & by_slice
- Split-Apply-Combine - Current era tidyverse: group_by, nest, mutate(map())
- Split-Apply-Combine - tidyverse/base hybrid: split, map_dfr
- Summary Table
The test data - big_df
My data is always in a single data.frame, with information for multiple entities contained within it. There are always one or more indexing variables to identify, group or split the data.
The test data used here (big_df) is just a small subset of the mtcars data set.
ID | mpg | disp |
---|---|---|
6 | 21.0 | 160.0 |
6 | 21.4 | 258.0 |
8 | 15.5 | 318.0 |
8 | 19.2 | 400.0 |
8 | 16.4 | 275.8 |
8 | 15.2 | 304.0 |
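The code that builds big_df is not shown, so as a minimal sketch the six rows above can simply be typed in by hand. How the original subset of mtcars was taken is not stated; the ID values look like the mtcars cyl column, but that is an assumption.

```r
# Minimal sketch: recreate the six rows of big_df shown in the table above.
# Assumption: values are typed in directly; the original subsetting of
# mtcars is not shown in the post.
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 21.4, 15.5, 19.2, 16.4, 15.2),
  disp = c(160.0, 258.0, 318.0, 400.0, 275.8, 304.0)
)

# Number of rows per entity
table(big_df$ID)
```

Every method below starts from a data.frame of this shape: one row per observation and an ID column saying which entity the row belongs to.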
The complex function to run on the data.frame for each entity
This is a (dummy) function to run on the data.frame for each entity.
This function is usually quite complex and consists of multiple processing steps to produce a result.
I am also interested in whether or not this inner function has access to the ID of the entity, i.e. the grouping variable. The dplyr::do() approach does have access to the grouping variable, but other methods may not.
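complex_func <- function(df) {
  # number of rows belonging to this entity
  df$N <- nrow(df)
  # TRUE if the grouping variable is visible inside the function
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}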
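To make the func_has_ID flag concrete, here is a small sketch that calls complex_func() on one entity twice: once with the ID column present and once with it dropped. The big_df values are re-typed from the test data table, and the names with_id and without_id are only illustrative.

```r
# big_df re-typed from the table of test data (assumption: typed by hand)
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 21.4, 15.5, 19.2, 16.4, 15.2),
  disp = c(160.0, 258.0, 318.0, 400.0, 275.8, 304.0)
)

complex_func <- function(df) {
  df$N <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}

# One entity with the ID column kept
with_id <- complex_func(big_df[big_df$ID == 6, ])

# The same entity with the ID column dropped, mimicking a method that
# strips the grouping variable before calling the function
without_id <- complex_func(big_df[big_df$ID == 6, c("mpg", "disp")])

with_id$func_has_ID     # TRUE TRUE
without_id$func_has_ID  # FALSE FALSE
```

The result tables in the following sections show which of these two situations each method produces.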
Split-Apply-Combine - Prehistoric times - split, lapply, do.call(rbind, ...)
In the dark ages before dplyr and pipes, the code looked like this.
split_df <- split(big_df, big_df$ID)
result_list_df <- lapply(split_df, complex_func)
result_df <- do.call(rbind, result_list_df)
ID | mpg | disp | N | func_has_ID |
---|---|---|---|---|
6 | 21.0 | 160.0 | 2 | TRUE |
6 | 21.4 | 258.0 | 2 | TRUE |
8 | 15.5 | 318.0 | 4 | TRUE |
8 | 19.2 | 400.0 | 4 | TRUE |
8 | 16.4 | 275.8 | 4 | TRUE |
8 | 15.2 | 304.0 | 4 | TRUE |
Notes
- the data.frame passed to complex_func() contains the ID variable
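One side effect of this approach is that do.call(rbind, ...) on a named list builds the row names of the result from the list names. A minimal sketch of resetting them, assuming sequential row names are what is wanted (big_df is re-typed from the test data table):

```r
big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 21.4, 15.5, 19.2, 16.4, 15.2),
  disp = c(160.0, 258.0, 318.0, 400.0, 275.8, 304.0)
)

complex_func <- function(df) {
  df$N <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}

split_df  <- split(big_df, big_df$ID)
result_df <- do.call(rbind, lapply(split_df, complex_func))

# Row names are now derived from the names of the split list
rownames(result_df)

# Reset to plain sequential row names if they are not wanted
rownames(result_df) <- NULL
```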
Split-Apply-Combine - Stone Age with plyr - plyr::ddply
One plyr function call does the split, apply and combine. plyr hasn’t been updated since 2016.
result_df <- plyr::ddply(big_df, "ID", complex_func)
ID | mpg | disp | N | func_has_ID |
---|---|---|---|---|
6 | 21.0 | 160.0 | 2 | TRUE |
6 | 21.4 | 258.0 | 2 | TRUE |
8 | 15.5 | 318.0 | 4 | TRUE |
8 | 19.2 | 400.0 | 4 | TRUE |
8 | 16.4 | 275.8 | 4 | TRUE |
8 | 15.2 | 304.0 | 4 | TRUE |
Split-Apply-Combine - Early tidyverse era - group_by, do
In the early days of the tidyverse, the group_by/do approach was the way to go, and is the way I still write most of the code for split-apply-combine situations.
library(dplyr)  # group_by(), do(), ungroup() and the %>% pipe

result_df <- big_df %>%
  group_by(ID) %>%
  do(complex_func(.)) %>%
  ungroup()
ID | mpg | disp | N | func_has_ID |
---|---|---|---|---|
6 | 21.0 | 160.0 | 2 | TRUE |
6 | 21.4 | 258.0 | 2 | TRUE |
8 | 15.5 | 318.0 | 4 | TRUE |
8 | 19.2 | 400.0 | 4 | TRUE |
8 | 16.4 | 275.8 | 4 | TRUE |
8 | 15.2 | 304.0 | 4 | TRUE |
Notes
- the data.frame passed to complex_func() contains the ID variable
- an explicit ungroup() is required to remove the grouping from the result
Split-Apply-Combine - Early-mid tidyverse era - group_by & by_slice
For a brief moment in time, purrr had a by_slice() function which offered the same features as dplyr::do().
This function was then relegated to purrrlyr as it wasn’t quite purrr and it wasn’t quite dplyr.
According to the purrrlyr NEWS file, functions in this package are unlikely to be updated, so using them would probably be a mistake. This example is included for posterity.
result_df <- big_df %>%
  group_by(ID) %>%
  purrrlyr::by_slice(~complex_func(.x), .collate = 'rows')
ID | mpg | disp | N | func_has_ID |
---|---|---|---|---|
6 | 21.0 | 160.0 | 2 | FALSE |
6 | 21.4 | 258.0 | 2 | FALSE |
8 | 15.5 | 318.0 | 4 | FALSE |
8 | 19.2 | 400.0 | 4 | FALSE |
8 | 16.4 | 275.8 | 4 | FALSE |
8 | 15.2 | 304.0 | 4 | FALSE |
Notes
- the data.frame passed to complex_func() does not contain the ID variable
- the resulting data.frame does not have any grouping variables, and therefore no explicit ungroup() is required
- The purrrlyr NEWS.md file does however offer the advice that instead of by_slice, the preferred method is a combination of tidyr::nest() and dplyr::mutate() using an inner purrr::map
Split-Apply-Combine - Current era tidyverse: group_by, nest, mutate(map())
The current suggested route in the tidyverse is to nest the data, and then operate on the list column by mutating it via purrr::map.
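library(tidyr)  # nest() and unnest()

result_df <- big_df %>%
  group_by(ID) %>%
  nest() %>%
  mutate(data = purrr::map(data, complex_func)) %>%
  # name the list-column explicitly; newer tidyr versions warn when it is omitted
  unnest(data)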
ID | mpg | disp | N | func_has_ID |
---|---|---|---|---|
6 | 21.0 | 160.0 | 2 | FALSE |
6 | 21.4 | 258.0 | 2 | FALSE |
8 | 15.5 | 318.0 | 4 | FALSE |
8 | 19.2 | 400.0 | 4 | FALSE |
8 | 16.4 | 275.8 | 4 | FALSE |
8 | 15.2 | 304.0 | 4 | FALSE |
Notes
- the data.frame passed to complex_func() does not contain the ID variable
- an explicit unnest is required at the end to get the data back in its original form
- having to both mutate and map seems an extra step over all other methods.
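If the inner function does need the ID, one possible workaround is to hand it back in with purrr::map2() before calling the function. This is only a sketch under assumptions, not a recommended pattern: it assumes tidyr 1.0.0 or later (for the nest(data = -ID) and unnest(cols = ...) forms), and it drops the outer ID column before unnesting so that the result does not end up with two columns named ID.

```r
library(dplyr)
library(tidyr)
library(purrr)

big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 21.4, 15.5, 19.2, 16.4, 15.2),
  disp = c(160.0, 258.0, 318.0, 400.0, 275.8, 304.0)
)

complex_func <- function(df) {
  df$N <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}

result_df <- big_df %>%
  # nest every column except ID; the result is not grouped
  nest(data = -ID) %>%
  # put the ID back into each nested data.frame before applying the function
  mutate(data = map2(data, ID, ~ complex_func(mutate(.x, ID = .y)))) %>%
  # drop the outer ID so unnesting does not create a duplicate column name
  select(-ID) %>%
  unnest(cols = data)

result_df$func_has_ID
```

This makes the ID visible inside complex_func(), at the cost of yet another step in an already verbose pipeline.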
Split-Apply-Combine - tidyverse/base hybrid: split, map_dfr
This hybrid approach uses split() to create a list of data.frames, and then uses map_dfr to map a function over each data.frame and then combine (by rows) into a single data.frame.
result_df <- big_df %>%
  split(.$ID) %>%
  purrr::map_dfr(complex_func)
ID | mpg | disp | N | func_has_ID |
---|---|---|---|---|
6 | 21.0 | 160.0 | 2 | TRUE |
6 | 21.4 | 258.0 | 2 | TRUE |
8 | 15.5 | 318.0 | 4 | TRUE |
8 | 19.2 | 400.0 | 4 | TRUE |
8 | 16.4 | 275.8 | 4 | TRUE |
8 | 15.2 | 304.0 | 4 | TRUE |
Notes
- This seems pretty compact - except for the very un-tidyverse split(.$ID)
- the data.frame passed to complex_func() contains the ID variable
- No explicit ungrouping required.
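In purrr 1.0.0 and later, map_dfr() is marked as superseded in favour of map() followed by list_rbind(). A minimal sketch of the same hybrid written that way, assuming purrr 1.0.0 or later is installed (the intermediate name pieces is only illustrative):

```r
library(purrr)

big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 21.4, 15.5, 19.2, 16.4, 15.2),
  disp = c(160.0, 258.0, 318.0, 400.0, 275.8, 304.0)
)

complex_func <- function(df) {
  df$N <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}

# split: base R creates a named list with one data.frame per ID
pieces <- split(big_df, big_df$ID)

# apply with map(), then combine by rows with list_rbind()
result_df <- list_rbind(map(pieces, complex_func))
```

The split step is unchanged, so the ID column is still visible inside complex_func().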
Split-Apply-Combine - Alternative universe data.table
Update: MattSummersgill and michael_chirico suggested using data.table.
This doesn’t really fit within the scope of my search (I’m definitely in the tidyverse ecosystem), but it’s included for the sake of comparison.
library(data.table)
setDT(big_df)
result_df <- big_df[, complex_func(.SD), by = .(this_ID = ID), .SDcols=colnames(big_df)]
this_ID | ID | mpg | disp | N | func_has_ID |
---|---|---|---|---|---|
6 | 6 | 21.0 | 160.0 | 2 | TRUE |
6 | 6 | 21.4 | 258.0 | 2 | TRUE |
8 | 8 | 15.5 | 318.0 | 4 | TRUE |
8 | 8 | 19.2 | 400.0 | 4 | TRUE |
8 | 8 | 16.4 | 275.8 | 4 | TRUE |
8 | 8 | 15.2 | 304.0 | 4 | TRUE |
Notes
- the data.frame passed to complex_func() contains the ID variable
- No explicit ungrouping required.
- Without being immersed in day-to-day use of data.table this solution is a little opaque to me.
- There are issues around how data.table handles the ID column. Calling it one way gave 2 ID columns in the result, so I’ve explicitly set a new grouping variable name (this_ID) to avoid this.
Summary Table
Below is a summary table on how the split-apply-combine is achieved for various implementations.
A blank entry means that the action of split, apply or combine is handled by the previous entry.
Method | Split | Apply | Combine | Group var available in applied function |
---|---|---|---|---|
Prehistoric | split | lapply | do.call(rbind) | Yes |
Stone Age | plyr::ddply | | | Yes |
Early Tidyverse | dplyr::group_by | dplyr::do | dplyr::ungroup | Yes |
Early-Mid Tidyverse | dplyr::group_by | purrrlyr::by_slice | | No |
Current Era Tidyverse | dplyr::group_by + tidyr::nest | dplyr::mutate + purrr::map | tidyr::unnest | No |
Hybrid | split | purrr::map_dfr | | Yes |
Alternate Universe | data.table | | | Yes |
Conclusion
My aim is to find a replacement for group_by/do now that dplyr::do() is “basically deprecated”.
I looked at a number of tidyverse/tidyr/purrr replacements for do() - if you know of a technique I missed, please let me know on twitter.
- My least favourite technique is to use current era tidyverse with nest/mutate/map/unnest
  - I find it too verbose with a high cognitive overhead
  - have to nest, operate on a new data list-column, and then unnest
  - the nested data.frame has a new list-column of data.frames which doesn’t contain the grouping variable
  - applying the function to the data.frame subsets requires a mutate AND a map
- Favourite replacement
  - tidyverse/base hybrid with split/map_dfr
    - short and sweet.
    - No need for purrr::map within a mutate
    - No need for dplyr::do() syntax with the . e.g. complex_func(.)
Outstanding issue: Why isn’t there a purrr version of the base function split? According to Hadley on github:
A function that acts rowwise on a data frame doesn’t seem like it should live in purrr.
Next step: Make my own tidyverse split_by?
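As a rough sketch of what such a helper could look like: split_by() below is hypothetical and not part of purrr, dplyr or base R. It only wraps base split() so that the grouping column is given by name, which removes the need for .$ID in a pipeline. It uses base R only and takes the column name as a string; a version accepting bare column names would need tidy evaluation, which is not attempted here.

```r
# Hypothetical helper (not from any package): split a data.frame into a
# named list of data.frames, one per value of the column named in .col
split_by <- function(.data, .col) {
  stopifnot(
    is.data.frame(.data),
    is.character(.col),
    length(.col) == 1,
    .col %in% names(.data)
  )
  split(.data, .data[[.col]])
}

big_df <- data.frame(
  ID   = c(6, 6, 8, 8, 8, 8),
  mpg  = c(21.0, 21.4, 15.5, 19.2, 16.4, 15.2),
  disp = c(160.0, 258.0, 318.0, 400.0, 275.8, 304.0)
)

complex_func <- function(df) {
  df$N <- nrow(df)
  df$func_has_ID <- 'ID' %in% colnames(df)
  df
}

pieces <- split_by(big_df, "ID")

# Apply and combine with base R; with purrr loaded this line could be
# purrr::map_dfr(pieces, complex_func) instead
result_df <- do.call(rbind, lapply(pieces, complex_func))
```

Because the split is still done by base split(), the ID column stays in each piece and complex_func() can see it.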