Transition from dplyr to data.table

data wrangle

Author

Chun Su

Published

June 25, 2020

dplyr and tidyr have totally changed the way we code in R. I particularly love the pipe %>% which comes from magrittr package and makes the code easy to read and understand. While I am obessed with simplicity of dplyr coding style, I am also fascinated with how fast data.table wrangles data. Below is an example showing the speed difference using dplyr and data.table in a user function gene_lookup for one of my shiny app.

Code

# dplyr version
system.time({
        old_result = gene_lookup_old(gene_frag, open_frag, frag_int, my_gene_id, open_oe_only, selected_cells)
})
# user  system elapsed 
# 2.727   0.122   2.898

Code

# data.table version
system.time({
        new_result = gene_lookup(gene_frag, open_frag, frag_int, my_gene_id, open_oe_only, selected_cells)
})
# user  system elapsed 
# 0.505   0.054   0.293

Last year, Hadley Wickham and his team launched a hybrid package dtplyr which uses the syntax of dplyr but runs data.table on the backend. It is a brilliant idea and I believe a lot of dplyr fan would love it, but dtplyr is still slower than data.table. Considering speed is the highest priority in some applications, I would like to switch to data.table completely. It took me only half day to get used to data.table syntax anyway. Plus, this post, which shows using pipe in data.table, makes this transition more smooth and pleasant.

In this blog, I will list data.table and dplyr/tidyr syntax back to back, providing an easy reference for dplyr users who want to quickly pick up data.table.

read file

read_delim and read_csv are the two functions commonly used to read-in data from dplyr

Code

untidy_mtcars_df = read_delim("https://raw.githubusercontent.com/sckinta/example_code/master/data_examples/untidy_mtcars1.csv", delim=",")
untidy_mtcars_df = read_csv("https://raw.githubusercontent.com/sckinta/example_code/master/data_examples/untidy_mtcars1.csv")
class(untidy_mtcars_df)

[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

The equivalent command in data.table is fread. Like “read_delim”, fread automatically searches for header and defines column data type, but it is faster and automatically searches for separators as well.

Code

untidy_mtcars_dt = fread("https://raw.githubusercontent.com/sckinta/example_code/master/data_examples/untidy_mtcars1.csv")
class(untidy_mtcars_dt)

[1] "data.table" "data.frame"

To switch between tibble and data.table class, use tbl_df to convert data.table to tibble, and as.data.table() to convert tibble/data.frame to data.table.

subset rows

Subsetting rows is done by the function filter from dplyr.

Code

# subset Merc cars 
untidy_mtcars_df %>% 
  filter(grepl("Merc",model))

Subsetting rows in data.table is very similar to the base R, placing boolean vector at row index. The index separator , can even be omitted. Also if boolean expression contains column variables, we can use the variable names directly instead of using df$var.

Pipe can be used in the format of %>% .[] to connect the code.

Code

# subset Merc cars 
untidy_mtcars_dt %>% 
  .[grepl("Merc",model)]

subset columns

Subsetting columns is done by the function select from dplyr

Code

# subset columns -- model, mpg, cyl
untidy_mtcars_df %>% 
  select(model, mpg, cyl)

# de-select columns with partial names "ar"
untidy_mtcars_df %>% 
  select(-contains("ar"))

data.table uses variable names to subset columns. Like base R, the variable name vector is a character class placed at column index position. Index separator , cannot be omitted here.

Code

# subset columns -- model, mpg, cyl
untidy_mtcars_dt %>% 
  .[,c("model","mpg","cyl")]

# de-select columns with partial names "ar"
col_ar = colnames(untidy_mtcars_dt)[grepl("ar",colnames(untidy_mtcars_dt))]
untidy_mtcars_dt %>% 
  .[,-..col_ar]

add new variable

dplyr uses mutate to add column.

Code

# assign car size based on the weight
untidy_mtcars_df %>% 
  mutate(size=case_when(
    wt < 3 ~ "small",
    wt > 3 & wt < 5 ~ "median",
    wt > 5 ~ "large"
  ))

data.table uses := to assign values to column variables. Be aware that, different from mutate which returns the updated data frame without assignment, data.table is modified by reference and returned invisibly when := or any set* functions are used. If you do not want to change the original data.table, take a copy first DT2 = copy(DT).

Code

untidy_mtcars_dt2=copy(untidy_mtcars_dt)
untidy_mtcars_dt2 %>% 
  .[,size:=case_when(wt < 3 ~ "small",
    wt > 3 & wt < 5 ~ "median",
    wt > 5 ~ "large")]

:= can also be used for multiple column assignment and functional form. More details refer to the usage manual

pivot

Table pivoting is done by the functions spread and gather (or pivot_wider and pivot_longer) from tidyr.

Code

# wide to long (gather) -- combine automatic, manual two columns into transmission
tidy_mtcars_df = untidy_mtcars_df %>% 
  gather(key="transmission",value="tmp",automatic,manual) %>% 
  filter(tmp==1) %>% 
  select(-tmp)

# wide to long (pivot_longer) -- combine automatic, manual two columns into transmission
untidy_mtcars_df %>% 
  pivot_longer(cols=c("automatic","manual"),names_to="transmission",values_to="tmp") %>% 
  filter(tmp==1) %>% 
  select(-tmp)

# A tibble: 32 × 12
   model       mpg   cyl  disp    hp  drat    wt  qsec    vs  gear  carb trans…¹
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  
 1 Mazda RX4  21       6  160    110  3.9   2.62  16.5     0     4     4 manual 
 2 Mazda RX…  21       6  160    110  3.9   2.88  17.0     0     4     4 manual 
 3 Datsun 7…  22.8     4  108     93  3.85  2.32  18.6     1     4     1 manual 
 4 Hornet 4…  21.4     6  258    110  3.08  3.22  19.4     1     3     1 automa…
 5 Hornet S…  18.7     8  360    175  3.15  3.44  17.0     0     3     2 automa…
 6 Valiant    18.1     6  225    105  2.76  3.46  20.2     1     3     1 automa…
 7 Duster 3…  14.3     8  360    245  3.21  3.57  15.8     0     3     4 automa…
 8 Merc 240D  24.4     4  147.    62  3.69  3.19  20       1     4     2 automa…
 9 Merc 230   22.8     4  141.    95  3.92  3.15  22.9     1     4     2 automa…
10 Merc 280   19.2     6  168.   123  3.92  3.44  18.3     1     4     4 automa…
# … with 22 more rows, and abbreviated variable name ¹transmission
# ℹ Use `print(n = ...)` to see more rows

Code

# long to wide (spread) -- make transmission to group automatic and manual
tidy_mtcars_df %>% 
  mutate(tmp=1) %>% 
  spread(transmission, tmp) %>% 
  mutate_if(function(x){any(is.na(x))}, function(x){ifelse(is.na(x),0,1)})

# A tibble: 32 × 13
   model       mpg   cyl  disp    hp  drat    wt  qsec    vs  gear  carb autom…¹
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 Hornet 4…  21.4     6  258    110  3.08  3.22  19.4     1     3     1       1
 2 Hornet S…  18.7     8  360    175  3.15  3.44  17.0     0     3     2       1
 3 Valiant    18.1     6  225    105  2.76  3.46  20.2     1     3     1       1
 4 Duster 3…  14.3     8  360    245  3.21  3.57  15.8     0     3     4       1
 5 Merc 240D  24.4     4  147.    62  3.69  3.19  20       1     4     2       1
 6 Merc 230   22.8     4  141.    95  3.92  3.15  22.9     1     4     2       1
 7 Merc 280   19.2     6  168.   123  3.92  3.44  18.3     1     4     4       1
 8 Merc 280C  17.8     6  168.   123  3.92  3.44  18.9     1     4     4       1
 9 Merc 450…  16.4     8  276.   180  3.07  4.07  17.4     0     3     3       1
10 Merc 450…  17.3     8  276.   180  3.07  3.73  17.6     0     3     3       1
# … with 22 more rows, 1 more variable: manual <dbl>, and abbreviated variable
#   name ¹automatic
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Code

# long to wide (pivot_wider) -- 
tidy_mtcars_df %>% 
  mutate(tmp=1) %>% 
  pivot_wider(names_from=transmission, values_from=tmp) %>% 
  mutate_if(function(x){any(is.na(x))}, function(x){ifelse(is.na(x),0,1)})

# A tibble: 32 × 13
   model       mpg   cyl  disp    hp  drat    wt  qsec    vs  gear  carb autom…¹
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
 1 Hornet 4…  21.4     6  258    110  3.08  3.22  19.4     1     3     1       1
 2 Hornet S…  18.7     8  360    175  3.15  3.44  17.0     0     3     2       1
 3 Valiant    18.1     6  225    105  2.76  3.46  20.2     1     3     1       1
 4 Duster 3…  14.3     8  360    245  3.21  3.57  15.8     0     3     4       1
 5 Merc 240D  24.4     4  147.    62  3.69  3.19  20       1     4     2       1
 6 Merc 230   22.8     4  141.    95  3.92  3.15  22.9     1     4     2       1
 7 Merc 280   19.2     6  168.   123  3.92  3.44  18.3     1     4     4       1
 8 Merc 280C  17.8     6  168.   123  3.92  3.44  18.9     1     4     4       1
 9 Merc 450…  16.4     8  276.   180  3.07  4.07  17.4     0     3     3       1
10 Merc 450…  17.3     8  276.   180  3.07  3.73  17.6     0     3     3       1
# … with 22 more rows, 1 more variable: manual <dbl>, and abbreviated variable
#   name ¹automatic
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

The data.table uses dcast for long-to-wide and melt for wide-to-long pivoting.

Code

# wide to long (melt) -- combine automatic, manual two columns into transmission
tidy_mtcars_dt = untidy_mtcars_dt %>% 
  melt(measure.vars=c("automatic","manual"), variable.name="transmission", value.name="tmp") %>% 
  .[tmp==1] %>% 
  .[,-c("tmp")]

# long to wide (dcast) -- split transmission column to automatic manual
tidy_mtcars_dt %>% 
  .[, tmp:=1]

untidy_mtcars_dt = tidy_mtcars_dt %>%   
  dcast(model + mpg + cyl + disp + hp + drat + wt + qsec + vs + gear + carb ~transmission, value.var="tmp")

untidy_mtcars_dt %>% 
  .[, c("manual","automatic") := replace(.SD,is.na(.SD),0), .SDcols=c("manual","automatic")]

merge

dplyr uses SQL-like join functions to merge two or more tables together. There are inner_join, full_join, left_join, right_join, semi_join and anti_join.

Code

# generate two tables
tidy_mtcars_df_engine = tidy_mtcars_df %>% 
  select(model, mpg, cyl, hp, vs)
tidy_mtcars_df_other = tidy_mtcars_df %>% 
  select(model, disp, qsec, transmission, gear, carb)

# left_join
left_join(
  tidy_mtcars_df_engine,
  tidy_mtcars_df_other
)

data.table, from the other end, use merge for all the joins. By default, merge is inner_join, but we can adjust by variables to fulfill other join functions.

Code

tidy_mtcars_dt_engine = tidy_mtcars_df_engine %>% as.data.table()
tidy_mtcars_df_other = tidy_mtcars_df_other %>% as.data.table()

# inner_join
merge(
  tidy_mtcars_dt_engine,
  tidy_mtcars_df_other,
  all=FALSE
)

# left_join
merge(
  tidy_mtcars_dt_engine,
  tidy_mtcars_df_other,
  all.x=T
)

# right_join
merge(
  tidy_mtcars_dt_engine,
  tidy_mtcars_df_other,
  all.y=T
)

# full_join
merge(
  tidy_mtcars_dt_engine,
  tidy_mtcars_df_other,
  all=T
)

group and summarize

dplyr uses group_by and summarize to calculate the new variable based on the group.

Code

# calculate weight mean for each group of differrent cylinder number
untidy_mtcars_df %>% 
  group_by(cyl) %>% 
  summarise(wt_mean=mean(wt)) %>% 
  ungroup()

# A tibble: 3 × 2
    cyl wt_mean
  <dbl>   <dbl>
1     4    2.29
2     6    3.12
3     8    4.00

data.table uses by to specify the group, and = to summarize.

Code

untidy_mtcars_dt %>% 
  .[,.(wt_mean=mean(wt)),by=cyl]

   cyl  wt_mean
1:   8 3.999214
2:   4 2.285727
3:   6 3.117143

group and subsetting

dplyr uses group_by and slice to subset rows within the group.

Code

# choose the heaviest cart for each cylinder group
untidy_mtcars_df %>% 
  group_by(cyl) %>% 
  slice(which.max(wt)) %>% 
  ungroup()

# A tibble: 3 × 13
  model        mpg   cyl  disp    hp  drat    wt  qsec    vs  gear  carb autom…¹
  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 Merc 240D   24.4     4  147.    62  3.69  3.19  20       1     4     2       1
2 Valiant     18.1     6  225    105  2.76  3.46  20.2     1     3     1       1
3 Lincoln C…  10.4     8  460    215  3     5.42  17.8     0     3     4       1
# … with 1 more variable: manual <dbl>, and abbreviated variable name
#   ¹automatic
# ℹ Use `colnames()` to see all variable names

data.table uses .SD and by to subset within the group.

Code

# choose the heaviest cart for each cylinder group
untidy_mtcars_dt %>% 
  .[,.SD[which.max(wt)], by=cyl]

   cyl               model  mpg  disp  hp drat    wt  qsec vs gear carb
1:   8 Lincoln Continental 10.4 460.0 215 3.00 5.424 17.82  0    3    4
2:   4           Merc 240D 24.4 146.7  62 3.69 3.190 20.00  1    4    2
3:   6             Valiant 18.1 225.0 105 2.76 3.460 20.22  1    3    1
   automatic manual
1:         1      0
2:         1      0
3:         1      0

For more .SD reading, refer to https://cran.r-project.org/web/packages/data.table/vignettes/datatable-sd-usage.html

last bits

There are other accessory functions useful in dplyr and some of them have corresponding functions in data.table too.

usage	dplyr	data.table
get unique rows	distinct()	unique()
sort by column(s)	arrange()	setorder()/setorderv()
change column names	rename(new=old)	setnames(old, new)