Base R Nostalgia -- by, tapply, ave, ...

Steve Simpson bio photo By Steve Simpson

photo credit: Paul Yoakum

This evening I was feeling nostalgic for base R group-bys. Before there was dplyr, there was apply and its cousins. I thought it’d be nice to get out the ol’ photo-album.

To start off, the base R proto-ancestor of magrittr piping for me was the with function, especially with apply. It just cleaned up the syntax and visual appearance of the code by pulling out the redundancy of declaring the data.frame. So even though it isn’t necessary to use with for the functions below, I think it makes things easier on the eyes and brain.

Aggregate Group-Bys

In terms of exploratory analysis, base R’s equivalents to dplyr::summarize are by and tapply. In the case below for both tapply and by you have some a factor variable cyl for which you want to execute a function mean over the corresponding cases in vector of numbers mpg. So since mtcars cylinder variable cyl has 3 levels (4, 6, 8), we take the average miles-per-gallon for cars grouped by each of those cylinder categories.

with(mtcars, by(mpg, cyl, mean))
cyl: 4
[1] 26.66364
------------------------------------------------------------------------------------- 
cyl: 6
[1] 19.74286
------------------------------------------------------------------------------------- 
cyl: 8
[1] 15.1


with(mtcars, tapply(mpg, cyl, mean))
       4        6        8 
26.66364 19.74286 15.10000 

We can even get a similar behavior out of sapply by adding split to the mix. Since sapply doesn’t natively have a way to handle the grouped aspects of the calculation, we use the function split to break up mpg into the 3 groups first, like so:

$`4`
[1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4

$`6`
[1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7

$`8`
[1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0

Using split returns a ragged list of 3 groups which sapply handles nicely:

with(mtcars, sapply(split(mpg, cyl), mean))
       4        6        8 
26.66364 19.74286 15.10000 

I was delighted to see I could hack out the same output using 2 xtabs (sum/n):

with(mtcars, xtabs(mpg ~ cyl) / xtabs(~ cyl))     
cyl
       4        6        8 
26.66364 19.74286 15.10000 

tapply is the most compact for my taste, both in terms of code and output; but I must confess by does the vertically stacked display of output I got initially used to from my earliest exposures with SPSS and Stata. We can get that and a data frame to boot from aggregate, as long as we pass in our group variable as a list:

with(mtcars, aggregate(mpg, list(cyl), mean))
  Group.1        x
1       4 26.66364
2       6 19.74286
3       8 15.10000

And this brings us back to dplyr with its dataframe output:

library(dplyr)
mtcars %>% group_by(cyl) %>% summarize(mean(mpg))
Source: local data frame [3 x 2]
    cyl mean(mpg)
  (dbl)     (dbl)
1     4  26.66364
2     6  19.74286
3     8  15.10000

Non-Aggregate Group-Bys

If tapply resembles dplyr’s group_by() %>% summarize(), then ave somewhat resembles dplyr’s group_by() %>% mutate(). ave’s syntax works just like tapply’s, though their outputs differ notably. Unlike tapply, ave returns a single vector answer of the same length of the data passed in.

with(mtcars, ave(mpg, cyl, FUN=mean))
[1] 19.74 19.74 26.66 19.74 15.10 19.74 15.10 26.66 26.66
[10] 19.74 19.74 15.10 15.10 15.10 15.10 15.10 15.10 26.66
[19] 26.66 26.66 26.66 15.10 15.10 15.10 15.10 26.66 26.66
[28] 26.66 15.10 19.74 15.10 26.66

This is because if tapply is for summarizing the data, then ave is for prepping those data for assignment <- back into the parent data.frame, as with mutate.

mtcars %>% group_by(cyl) %>% mutate(mean(mpg))

And again, with some cleverness we can get sapply return the same result as ave; this time passing in the levels of cyl to subset mpg and take its mean.

with(mtcars, sapply(cyl, function(x) mean(mpg[cyl==x])))

If you want to get dplyr to have somewhat similar behavior as ave, returning only the variables at play, use transmute instead of mutate. mutate returns the whole data.frame with the new variable included; transmute returns only the variables called or created in the code chunk.

mtcars %>% group_by(cyl) %>% transmute(mean(mpg))