Code
# dplyr version
system.time({
= gene_lookup_old(gene_frag, open_frag, frag_int, my_gene_id, open_oe_only, selected_cells)
old_result
})# user system elapsed
# 2.727 0.122 2.898
Chun Su
June 25, 2020
dplyr
and tidyr
have totally changed the way we code in R. I particularly love the pipe %>%
which comes from magrittr
package and makes the code easy to read and understand. While I am obessed with simplicity of dplyr
coding style, I am also fascinated with how fast data.table
wrangles data. Below is an example showing the speed difference using dplyr
and data.table
in a user function gene_lookup
for one of my shiny app.
Last year, Hadley Wickham and his team launched a hybrid package dtplyr
which uses the syntax of dplyr
but runs data.table
on the backend. It is a brilliant idea and I believe a lot of dplyr fan would love it, but dtplyr is still slower than data.table. Considering speed is the highest priority in some applications, I would like to switch to data.table completely. It took me only half day to get used to data.table syntax anyway. Plus, this post, which shows using pipe in data.table, makes this transition more smooth and pleasant.
In this blog, I will list data.table
and dplyr
/tidyr
syntax back to back, providing an easy reference for dplyr users who want to quickly pick up data.table
.
read_delim
and read_csv
are the two functions commonly used to read-in data from dplyr
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
The equivalent command in data.table
is fread
. Like “read_delim”, fread
automatically searches for header and defines column data type, but it is faster and automatically searches for separators as well.
[1] "data.table" "data.frame"
To switch between tibble and data.table class, use tbl_df
to convert data.table to tibble, and as.data.table()
to convert tibble/data.frame to data.table.
Subsetting rows is done by the function filter
from dplyr
.
Subsetting rows in data.table
is very similar to the base R, placing boolean vector at row index. The index separator ,
can even be omitted. Also if boolean expression contains column variables, we can use the variable names directly instead of using df$var.
Pipe can be used in the format of %>% .[]
to connect the code.
Subsetting columns is done by the function select
from dplyr
data.table
uses variable names to subset columns. Like base R, the variable name vector is a character class placed at column index position. Index separator ,
cannot be omitted here.
dplyr
uses mutate
to add column.
data.table
uses :=
to assign values to column variables. Be aware that, different from mutate
which returns the updated data frame without assignment, data.table is modified by reference and returned invisibly when :=
or any set*
functions are used. If you do not want to change the original data.table, take a copy first DT2 = copy(DT)
.
:=
can also be used for multiple column assignment and functional form. More details refer to the usage manual
Table pivoting is done by the functions spread
and gather
(or pivot_wider
and pivot_longer
) from tidyr
.
# wide to long (gather) -- combine automatic, manual two columns into transmission
tidy_mtcars_df = untidy_mtcars_df %>%
gather(key="transmission",value="tmp",automatic,manual) %>%
filter(tmp==1) %>%
select(-tmp)
# wide to long (pivot_longer) -- combine automatic, manual two columns into transmission
untidy_mtcars_df %>%
pivot_longer(cols=c("automatic","manual"),names_to="transmission",values_to="tmp") %>%
filter(tmp==1) %>%
select(-tmp)
# A tibble: 32 × 12
model mpg cyl disp hp drat wt qsec vs gear carb trans…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 4 4 manual
2 Mazda RX… 21 6 160 110 3.9 2.88 17.0 0 4 4 manual
3 Datsun 7… 22.8 4 108 93 3.85 2.32 18.6 1 4 1 manual
4 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 3 1 automa…
5 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 3 2 automa…
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 3 1 automa…
7 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 3 4 automa…
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 4 2 automa…
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 4 2 automa…
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 4 4 automa…
# … with 22 more rows, and abbreviated variable name ¹transmission
# ℹ Use `print(n = ...)` to see more rows
# A tibble: 32 × 13
model mpg cyl disp hp drat wt qsec vs gear carb autom…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 3 1 1
2 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 3 2 1
3 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 3 1 1
4 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 3 4 1
5 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 4 2 1
6 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 4 2 1
7 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 4 4 1
8 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 4 4 1
9 Merc 450… 16.4 8 276. 180 3.07 4.07 17.4 0 3 3 1
10 Merc 450… 17.3 8 276. 180 3.07 3.73 17.6 0 3 3 1
# … with 22 more rows, 1 more variable: manual <dbl>, and abbreviated variable
# name ¹automatic
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
# A tibble: 32 × 13
model mpg cyl disp hp drat wt qsec vs gear carb autom…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 3 1 1
2 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 3 2 1
3 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 3 1 1
4 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 3 4 1
5 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 4 2 1
6 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 4 2 1
7 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 4 4 1
8 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 4 4 1
9 Merc 450… 16.4 8 276. 180 3.07 4.07 17.4 0 3 3 1
10 Merc 450… 17.3 8 276. 180 3.07 3.73 17.6 0 3 3 1
# … with 22 more rows, 1 more variable: manual <dbl>, and abbreviated variable
# name ¹automatic
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
The data.table
uses dcast
for long-to-wide and melt
for wide-to-long pivoting.
# wide to long (melt) -- combine automatic, manual two columns into transmission
tidy_mtcars_dt = untidy_mtcars_dt %>%
melt(measure.vars=c("automatic","manual"), variable.name="transmission", value.name="tmp") %>%
.[tmp==1] %>%
.[,-c("tmp")]
# long to wide (dcast) -- split transmission column to automatic manual
tidy_mtcars_dt %>%
.[, tmp:=1]
untidy_mtcars_dt = tidy_mtcars_dt %>%
dcast(model + mpg + cyl + disp + hp + drat + wt + qsec + vs + gear + carb ~transmission, value.var="tmp")
untidy_mtcars_dt %>%
.[, c("manual","automatic") := replace(.SD,is.na(.SD),0), .SDcols=c("manual","automatic")]
dplyr
uses SQL-like join functions to merge two or more tables together. There are inner_join
, full_join
, left_join
, right_join
, semi_join
and anti_join
.
data.table
, from the other end, use merge
for all the joins. By default, merge
is inner_join
, but we can adjust by
variables to fulfill other join functions.
tidy_mtcars_dt_engine = tidy_mtcars_df_engine %>% as.data.table()
tidy_mtcars_df_other = tidy_mtcars_df_other %>% as.data.table()
# inner_join
merge(
tidy_mtcars_dt_engine,
tidy_mtcars_df_other,
all=FALSE
)
# left_join
merge(
tidy_mtcars_dt_engine,
tidy_mtcars_df_other,
all.x=T
)
# right_join
merge(
tidy_mtcars_dt_engine,
tidy_mtcars_df_other,
all.y=T
)
# full_join
merge(
tidy_mtcars_dt_engine,
tidy_mtcars_df_other,
all=T
)
dplyr
uses group_by
and summarize
to calculate the new variable based on the group.
# A tibble: 3 × 2
cyl wt_mean
<dbl> <dbl>
1 4 2.29
2 6 3.12
3 8 4.00
data.table
uses by
to specify the group, and =
to summarize.
dplyr
uses group_by
and slice
to subset rows within the group.
# A tibble: 3 × 13
model mpg cyl disp hp drat wt qsec vs gear carb autom…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 4 2 1
2 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 3 1 1
3 Lincoln C… 10.4 8 460 215 3 5.42 17.8 0 3 4 1
# … with 1 more variable: manual <dbl>, and abbreviated variable name
# ¹automatic
# ℹ Use `colnames()` to see all variable names
data.table
uses .SD
and by
to subset within the group.
cyl model mpg disp hp drat wt qsec vs gear carb
1: 8 Lincoln Continental 10.4 460.0 215 3.00 5.424 17.82 0 3 4
2: 4 Merc 240D 24.4 146.7 62 3.69 3.190 20.00 1 4 2
3: 6 Valiant 18.1 225.0 105 2.76 3.460 20.22 1 3 1
automatic manual
1: 1 0
2: 1 0
3: 1 0
For more .SD
reading, refer to https://cran.r-project.org/web/packages/data.table/vignettes/datatable-sd-usage.html
There are other accessory functions useful in dplyr
and some of them have corresponding functions in data.table
too.
usage | dplyr | data.table |
---|---|---|
get unique rows | distinct() | unique() |
sort by column(s) | arrange() | setorder()/setorderv() |
change column names | rename(new=old) | setnames(old, new) |