Base R split has issues - Part 2 idiosyncrasies

March 4, 2018 mikefc

Base R `split()`

My prior post on tidyverse split-apply-combine ended with me favouring split + map_dfr as a replacement for group_by + do. In this post I look at some idiosyncrasies of the split function from base R.

The two main gotchas with base R split() are:

the splitting variable gets recycled if it’s not as long as the data.frame being split
NA levels are dropped from the data

Idiosyncrasy 1: the splitting variable gets recycled if it’s not long enough

When you split() a data.frame, the splitting factor needs one element per row of the data.frame.

If the splitting factor is shorter than the data, then split() will assume (foolishly!) that you want to keep recycling the factor to cover as many rows as necessary. Almost nobody ever wants this behaviour!

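R only warns about the length mismatch when the number of rows is not a multiple of the factor’s length, and it goes ahead with the split either way. In the example below, 6 is a multiple of 2, so there is no warning at all.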

In the following example, note how the rows are assigned alternately to group 1 and group 2, as split() keeps cycling through the bad_factor variable until it has covered every row of the data.frame.

test_df <- data.frame(a = letters[1:6], good_factor=c(1, 1, 1, 1, 2, 2))

bad_factor <- c(1, 2)

split(test_df, bad_factor)  # oops I've used bad_factor by mistake!

## $`1`
##   a good_factor
## 1 a           1
## 3 c           1
## 5 e           2
## 
## $`2`
##   a good_factor
## 2 b           1
## 4 d           1
## 6 f           2
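One way to guard against this is to check the lengths before splitting. The sketch below wraps split() in a small helper. The name strict_split is hypothetical (it is not part of base R), and it assumes the splitting variable is a single vector or factor rather than a list of factors.

```r
# Hypothetical helper (not part of base R): refuse to split when the
# splitting variable does not have exactly one element per row.
# Assumes `f` is a single vector or factor, not a list of factors.
strict_split <- function(df, f, ...) {
  stopifnot(is.data.frame(df))
  if (length(f) != nrow(df)) {
    stop("splitting variable has length ", length(f),
         " but the data.frame has ", nrow(df), " rows")
  }
  split(df, f, ...)
}

test_df    <- data.frame(a = letters[1:6], good_factor = c(1, 1, 1, 1, 2, 2))
bad_factor <- c(1, 2)

# The intended split works as usual
good_split <- strict_split(test_df, test_df$good_factor)

# The accidental split now fails instead of silently recycling;
# tryCatch() captures the error message so the script keeps running
bad_split <- tryCatch(
  strict_split(test_df, bad_factor),
  error = function(e) conditionMessage(e)
)
```

Failing loudly is a design choice here: a recycled factor is almost always a mistake, so an error seems more useful than a warning that may never appear.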

Idiosyncrasy 2: NA levels are dropped from the data

It’s very unlikely I want to throw away data unless I make a specific request to do so, e.g. using keep or filter.

However, split() assumes that you never want to keep an NA level and just drops it during the splitting process.

In the following example, ideally I want 3 groups representing the 3 levels within good_factor, i.e. 1, 2, and NA. However, split just throws away all the data where good_factor is NA.

test_df <- data.frame(a = letters[1:6], good_factor=c(1, 1, 2, 2, NA, NA))
split(test_df, test_df$good_factor)

## $`1`
##   a good_factor
## 1 a           1
## 2 b           1
## 
## $`2`
##   a good_factor
## 3 c           2
## 4 d           2
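If the NA rows should be kept, one option is to turn NA into an explicit factor level before splitting. The sketch below uses base R’s addNA() for this; the variable names with_na and na_rows are only for illustration.

```r
test_df <- data.frame(a = letters[1:6], good_factor = c(1, 1, 2, 2, NA, NA))

# addNA() adds NA as an explicit (last) level of the factor, so split()
# keeps those rows as their own group instead of dropping them
with_na <- split(test_df, addNA(test_df$good_factor))

length(with_na)   # number of groups, including the NA group

# The NA group's name is itself NA rather than a string, so it is
# easiest to pick it out by position (addNA() puts the NA level last)
na_rows <- with_na[[length(with_na)]]
```

An alternative with the same effect is to recode NA to an ordinary label (for example a string such as "missing") before splitting, which gives the extra group a usable name.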
