A small note on this tweet from @KevinUshey and this tweet from @ChengHLee:
The number of rows, while important, is only one of the factors that influence the time taken to perform a join. From my benchmarking experience, the two features that I've found to influence join speed much more, especially in hash-table-based approaches (e.g. dplyr), are:
- The number of unique groups.
- The number of columns to join on. Note that this is related to the previous point, since in most cases the more columns there are, the more unique groups there are.
That is, these features influence join speed even when the number of rows stays the same.
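To make those two points concrete, here is a tiny sketch (my own illustration, not part of the benchmark below): for the same six rows, joining on more columns usually means more unique groups.
require(data.table)
toy <- data.table(a = c(1L, 1L, 2L, 2L, 3L, 3L),
                  b = c(1L, 2L, 1L, 2L, 1L, 2L))
nrow(unique(toy, by = "a"))           ## 3 unique groups on one column
nrow(unique(toy, by = c("a", "b")))   ## 6 unique groups on two columns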
To illustrate the idea, I'll use a self-join on 10 million rows, comparing:
- ~1,000 and ~100,000 groups with one column to join by (cases A and B), and
- 100 and ~100,000 groups with three columns to join by (cases C and D).
I'll compare both dplyr and data.table, as this is a very nice case for comparing the performance tradeoffs of dplyr's hash-table approach against data.table's radix sort + binary search approach.
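Before getting to the benchmarks, here is a rough single-column illustration of what those two strategies mean, using base R functions as stand-ins (this is only a sketch of the idea, not how either package is implemented): match() looks values up through a hash table, while findInterval() binary-searches an already-sorted vector.
## ~100,000 distinct keys, 10 million lookups
lookups <- sample(1e5L, 1e7L, TRUE)
keys <- unique(lookups)
## hash-based lookup
system.time(pos_hash <- match(lookups, keys))
## sort once, then binary search
system.time({
  sorted_keys <- sort(keys)
  pos_bsearch <- findInterval(lookups, sorted_keys)
})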
## prepare functions
require(dplyr)
require(data.table)
set.seed(1L)
N = 1e7L
getDT <- function(N) {
    data.table(v1 = sample(10L, N, TRUE), v2 = 2L,
               v3 = sample(10L, N, TRUE), v4 = sample(1e3L, N, TRUE),
               v5 = sample(1e5L, N, TRUE), val = runif(N))
}
## create the first data set
x.DT <- getDT(N)
x.DF <- tbl_df(as.data.frame(x.DT))
## second data to perform self-join
i.DT = unique(x.DT, by="v4")[, list(v4)]
i.DF = tbl_df(as.data.frame(i.DT))
## data.table
system.time({
setkey(x.DT, v4)
ans1 = x.DT[i.DT]
})
# user system elapsed
# 2.312 0.249 2.736
## dplyr
system.time({
ans2 = left_join(i.DF, x.DF, by="v4")
})
# user system elapsed
# 2.291 0.403 2.749
## create the first data set
x.DT <- getDT(N)
x.DF <- tbl_df(as.data.frame(x.DT))
## second data to perform self-join
i.DT = unique(x.DT, by="v5")[, list(v5)]
i.DF = tbl_df(as.data.frame(i.DT))
## data.table
system.time({
setkey(x.DT, v5)
ans1 = x.DT[i.DT]
})
# user system elapsed
# 2.379 0.251 2.664
## dplyr
system.time({
ans2 = left_join(i.DF, x.DF, by="v5")
})
# user system elapsed
# 6.901 0.493 7.760
Note that the data.table join time is not much different between ~1,000 and ~100,000 groups, unlike dplyr's. In other words, the number of groups doesn't seem to influence the run time much for the data.table approach (2.73s vs 2.66s), whereas that is clearly not the case for dplyr (2.75s vs 7.76s).
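As a quick aside (my own sanity check, not part of the original timings), both approaches should return equivalent results; comparing row counts and the sum of the value column is a cheap way to confirm that.
## both joins should yield the same 10 million rows, possibly in a different order
nrow(ans1) == nrow(ans2)
all.equal(sum(ans1$val), sum(ans2$val))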
Now on to cases C and D.
## create the first data set
x.DT <- getDT(N)
x.DF <- tbl_df(as.data.frame(x.DT))
## second data to perform self-join
i.DT = unique(x.DT, by=c("v1", "v2", "v3"))[, list(v1, v2, v3)]
i.DF = tbl_df(as.data.frame(i.DT))
## data.table
system.time({
setkey(x.DT, v1, v2, v3)
ans1 = x.DT[i.DT]
})
# user system elapsed
# 2.243 0.023 2.283
## dplyr
system.time({
ans2 = left_join(i.DF, x.DF, by=c("v1", "v2", "v3"))
})
# user system elapsed
# 1.724 0.566 2.325
## create the first data set
x.DT <- getDT(N)
x.DF <- tbl_df(as.data.frame(x.DT))
## second data to perform self-join
i.DT = unique(x.DT, by=c("v1", "v3", "v4"))[, list(v1, v3, v4)]
i.DF = tbl_df(as.data.frame(i.DT))
## data.table
system.time({
setkey(x.DT, v1, v3, v4)
ans1 = x.DT[i.DT]
})
# user system elapsed
# 2.787 0.298 3.120
## dplyr
system.time({
ans2 = left_join(i.DF, x.DF, by=c("v1", "v3", "v4"))
})
# user system elapsed
# 12.344 0.591 14.051
First, with few groups (cases A and C), dplyr and data.table have more or less the same timings, even with three join columns.
Second, when we go to ~100,000 groups on three columns (case D), the dplyr join takes ~1.8x as long as the corresponding one-column case B (14.1s vs 7.8s)! Hashing multiple columns seems to perform worse than hashing a single column with the same total number of unique groups.
Also note that the data.table run time doesn't seem to be much influenced by the number of columns or groups here.
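One way to probe the multi-column-hashing point (a hedged sketch of my own, not from the benchmark above) is to collapse the three join columns into a single key and let dplyr hash that one column instead; if the explanation holds, this should land closer to the one-column timings.
## build a single character key from the three join columns
## (key construction is deliberately left outside the timing)
x.DF2 <- mutate(x.DF, k = paste(v1, v3, v4, sep = "."))
i.DF2 <- mutate(i.DF, k = paste(v1, v3, v4, sep = "."))
system.time({
  ans3 <- left_join(i.DF2, select(x.DF2, k, val), by = "k")
})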
Case   Groups     Join columns   dplyr (s)   data.table (s)
A      ~1,000     1              2.75        2.73
B      ~100,000   1              7.76        2.66
C      100        3              2.32        2.28
D      ~100,000   3              14.05       3.12
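For reference, the Groups column above can be checked directly from the last x.DT, using the same unique()-based idiom as the benchmark (the counts are approximate since the columns are sampled):
nrow(unique(x.DT, by = "v4"))                  ## case A, ~1,000 groups
nrow(unique(x.DT, by = "v5"))                  ## case B, ~100,000 groups
nrow(unique(x.DT, by = c("v1", "v2", "v3")))   ## case C, 100 groups
nrow(unique(x.DT, by = c("v1", "v3", "v4")))   ## case D, ~100,000 groups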