Code
# dplyr version
system.time({
old_result = gene_lookup_old(gene_frag, open_frag, frag_int, my_gene_id, open_oe_only, selected_cells)
})
# user system elapsed
# 2.727 0.122 2.898 Chun Su
June 25, 2020
dplyr and tidyr have totally changed the way we code in R. I particularly love the pipe %>% which comes from magrittr package and makes the code easy to read and understand. While I am obessed with simplicity of dplyr coding style, I am also fascinated with how fast data.table wrangles data. Below is an example showing the speed difference using dplyr and data.table in a user function gene_lookup for one of my shiny app.
Last year, Hadley Wickham and his team launched a hybrid package dtplyr which uses the syntax of dplyr but runs data.table on the backend. It is a brilliant idea and I believe a lot of dplyr fan would love it, but dtplyr is still slower than data.table. Considering speed is the highest priority in some applications, I would like to switch to data.table completely. It took me only half day to get used to data.table syntax anyway. Plus, this post, which shows using pipe in data.table, makes this transition more smooth and pleasant.
In this blog, I will list data.table and dplyr/tidyr syntax back to back, providing an easy reference for dplyr users who want to quickly pick up data.table.
read_delim and read_csv are the two functions commonly used to read-in data from dplyr
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
The equivalent command in data.table is fread. Like “read_delim”, fread automatically searches for header and defines column data type, but it is faster and automatically searches for separators as well.
[1] "data.table" "data.frame"
To switch between tibble and data.table class, use tbl_df to convert data.table to tibble, and as.data.table() to convert tibble/data.frame to data.table.
Subsetting rows is done by the function filter from dplyr.
Subsetting rows in data.table is very similar to the base R, placing boolean vector at row index. The index separator , can even be omitted. Also if boolean expression contains column variables, we can use the variable names directly instead of using df$var.
Pipe can be used in the format of %>% .[] to connect the code.
Subsetting columns is done by the function select from dplyr
data.table uses variable names to subset columns. Like base R, the variable name vector is a character class placed at column index position. Index separator , cannot be omitted here.
dplyr uses mutate to add column.
data.table uses := to assign values to column variables. Be aware that, different from mutate which returns the updated data frame without assignment, data.table is modified by reference and returned invisibly when := or any set* functions are used. If you do not want to change the original data.table, take a copy first DT2 = copy(DT).
:= can also be used for multiple column assignment and functional form. More details refer to the usage manual
Table pivoting is done by the functions spread and gather (or pivot_wider and pivot_longer) from tidyr.
# wide to long (gather) -- combine automatic, manual two columns into transmission
tidy_mtcars_df = untidy_mtcars_df %>%
gather(key="transmission",value="tmp",automatic,manual) %>%
filter(tmp==1) %>%
select(-tmp)
# wide to long (pivot_longer) -- combine automatic, manual two columns into transmission
untidy_mtcars_df %>%
pivot_longer(cols=c("automatic","manual"),names_to="transmission",values_to="tmp") %>%
filter(tmp==1) %>%
select(-tmp)# A tibble: 32 × 12
model mpg cyl disp hp drat wt qsec vs gear carb trans…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 4 4 manual
2 Mazda RX… 21 6 160 110 3.9 2.88 17.0 0 4 4 manual
3 Datsun 7… 22.8 4 108 93 3.85 2.32 18.6 1 4 1 manual
4 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 3 1 automa…
5 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 3 2 automa…
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 3 1 automa…
7 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 3 4 automa…
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 4 2 automa…
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 4 2 automa…
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 4 4 automa…
# … with 22 more rows, and abbreviated variable name ¹transmission
# ℹ Use `print(n = ...)` to see more rows
# A tibble: 32 × 13
model mpg cyl disp hp drat wt qsec vs gear carb autom…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 3 1 1
2 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 3 2 1
3 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 3 1 1
4 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 3 4 1
5 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 4 2 1
6 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 4 2 1
7 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 4 4 1
8 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 4 4 1
9 Merc 450… 16.4 8 276. 180 3.07 4.07 17.4 0 3 3 1
10 Merc 450… 17.3 8 276. 180 3.07 3.73 17.6 0 3 3 1
# … with 22 more rows, 1 more variable: manual <dbl>, and abbreviated variable
# name ¹automatic
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
# A tibble: 32 × 13
model mpg cyl disp hp drat wt qsec vs gear carb autom…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 3 1 1
2 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 3 2 1
3 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 3 1 1
4 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 3 4 1
5 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 4 2 1
6 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 4 2 1
7 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 4 4 1
8 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 4 4 1
9 Merc 450… 16.4 8 276. 180 3.07 4.07 17.4 0 3 3 1
10 Merc 450… 17.3 8 276. 180 3.07 3.73 17.6 0 3 3 1
# … with 22 more rows, 1 more variable: manual <dbl>, and abbreviated variable
# name ¹automatic
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
The data.table uses dcast for long-to-wide and melt for wide-to-long pivoting.
# wide to long (melt) -- combine automatic, manual two columns into transmission
tidy_mtcars_dt = untidy_mtcars_dt %>%
melt(measure.vars=c("automatic","manual"), variable.name="transmission", value.name="tmp") %>%
.[tmp==1] %>%
.[,-c("tmp")]
# long to wide (dcast) -- split transmission column to automatic manual
tidy_mtcars_dt %>%
.[, tmp:=1]
untidy_mtcars_dt = tidy_mtcars_dt %>%
dcast(model + mpg + cyl + disp + hp + drat + wt + qsec + vs + gear + carb ~transmission, value.var="tmp")
untidy_mtcars_dt %>%
.[, c("manual","automatic") := replace(.SD,is.na(.SD),0), .SDcols=c("manual","automatic")]dplyr uses SQL-like join functions to merge two or more tables together. There are inner_join, full_join, left_join, right_join, semi_join and anti_join.
data.table, from the other end, use merge for all the joins. By default, merge is inner_join, but we can adjust by variables to fulfill other join functions.
tidy_mtcars_dt_engine = tidy_mtcars_df_engine %>% as.data.table()
tidy_mtcars_df_other = tidy_mtcars_df_other %>% as.data.table()
# inner_join
merge(
tidy_mtcars_dt_engine,
tidy_mtcars_df_other,
all=FALSE
)
# left_join
merge(
tidy_mtcars_dt_engine,
tidy_mtcars_df_other,
all.x=T
)
# right_join
merge(
tidy_mtcars_dt_engine,
tidy_mtcars_df_other,
all.y=T
)
# full_join
merge(
tidy_mtcars_dt_engine,
tidy_mtcars_df_other,
all=T
)dplyr uses group_by and summarize to calculate the new variable based on the group.
# A tibble: 3 × 2
cyl wt_mean
<dbl> <dbl>
1 4 2.29
2 6 3.12
3 8 4.00
data.table uses by to specify the group, and = to summarize.
dplyr uses group_by and slice to subset rows within the group.
# A tibble: 3 × 13
model mpg cyl disp hp drat wt qsec vs gear carb autom…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 4 2 1
2 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 3 1 1
3 Lincoln C… 10.4 8 460 215 3 5.42 17.8 0 3 4 1
# … with 1 more variable: manual <dbl>, and abbreviated variable name
# ¹automatic
# ℹ Use `colnames()` to see all variable names
data.table uses .SD and by to subset within the group.
cyl model mpg disp hp drat wt qsec vs gear carb
1: 8 Lincoln Continental 10.4 460.0 215 3.00 5.424 17.82 0 3 4
2: 4 Merc 240D 24.4 146.7 62 3.69 3.190 20.00 1 4 2
3: 6 Valiant 18.1 225.0 105 2.76 3.460 20.22 1 3 1
automatic manual
1: 1 0
2: 1 0
3: 1 0
For more .SD reading, refer to https://cran.r-project.org/web/packages/data.table/vignettes/datatable-sd-usage.html
There are other accessory functions useful in dplyr and some of them have corresponding functions in data.table too.
| usage | dplyr | data.table |
|---|---|---|
| get unique rows | distinct() | unique() |
| sort by column(s) | arrange() | setorder()/setorderv() |
| change column names | rename(new=old) | setnames(old, new) |