This article describes the translations provided by duckplyr for
different data types, verbs, and functions within verbs. If a
translation is not provided, duckplyr falls back to dplyr, see
vignette("fallback")
for details.
library(conflicted)
library(dplyr)
conflict_prefer("filter", "dplyr")
#> [conflicted] Will prefer dplyr::filter over
#> any other package.
conflict_prefer("lag", "dplyr")
#> [conflicted] Will prefer dplyr::lag over
#> any other package.
Data types
duckplyr supports the following data types:
is.logical()
is.integer()
is.numeric()
is.character()
is.Date()
-
is.POSIXct()
(with UTC time zone) is.difftime()
duckplyr::duckdb_tibble(
logical = TRUE,
integer = 1L,
numeric = 1.1,
character = "a",
Date = as.Date("2025-01-11"),
POSIXct = as.POSIXct("2025-01-11 19:23:00", tz = "UTC"),
difftime = as.difftime(1, units = "secs"),
) |>
compute()
#> # A duckplyr data frame: 7 variables
#> logical integer numeric character Date POSIXct
#> <lgl> <int> <dbl> <chr> <date> <dttm>
#> 1 TRUE 1 1.1 a 2025-01-11 2025-01-11 19:23:00
#> # ℹ 1 more variable: difftime <drtn>
Generally, zero-column tibbles are not supported by duckplyr, neither as input nor as a result.
duckplyr::duckdb_tibble()
#> Error in `duckplyr::duckdb_tibble()`:
#> ! Can't convert empty data frame to relational.
duckplyr::duckdb_tibble(a = 1, .prudence = "stingy") |>
select(-a)
#> Error in `select()`:
#> ! This operation cannot be carried out by DuckDB, and the input is
#> a stingy duckplyr frame.
#> • Zero-column result set not supported.
#> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage
#> and continue with duckplyr.
#> ℹ See `vignette("prudence")` for other options.
Support for more data types, and passthrough of unknown data types, is planned. Let’s discuss any additional data types you would like to see supported.
Verbs
Not all dplyr verbs are implemented within duckplyr. For unsupported
verbs, duckplyr automatically falls back to dplyr. See
?unsupported
for a list of verbs for which duckplyr does
not provide a method.
See the reference index for a list of verbs with corresponding duckplyr methods.
Let’s discuss any additional verbs you would like to see supported.
Functions within verbs
For all functions used in dplyr verbs, translations must be provided. If an expression contains a function for which no translation is provided, duckplyr falls back to dplyr. With some exceptions, only positional matching is implemented.
As of now, here are the translations provided:
Parentheses
Implemented: (
.
Reference: ?Paren
.
duckplyr::duckdb_tibble(a = 1, b = 2, c = 3, .prudence = "stingy") |>
mutate((a + b) * c)
#> # A duckplyr data frame: 4 variables
#> a b c `(a + b) * c`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 9
Comparison operators
Implemented: >
, >=
,
<
, <=
, ==
,
!=
.
Reference: ?Comparison
.
duckplyr::duckdb_tibble(
a = c(1, 2, NA),
b = c(2, NA, 3),
c = c(NA, 3, 4),
.prudence = "stingy"
) |>
mutate(a > b, b != c, c < a, a >= b, b <= c)
#> # A duckplyr data frame: 8 variables
#> a b c `a > b` `b != c` `c < a` `a >= b` `b <= c`
#> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 1 2 NA FALSE NA NA FALSE NA
#> 2 2 NA 3 NA NA FALSE NA NA
#> 3 NA 3 4 NA TRUE NA NA TRUE
Basic arithmetics
Implemented: +
, -
, *
,
/
.
Reference: ?Arithmetic
.
duckplyr::duckdb_tibble(a = 1, b = 2, c = 3, .prudence = "stingy") |>
mutate(a + b, a / b, a - b, a * b)
#> # A duckplyr data frame: 7 variables
#> a b c `a + b` `a/b` `a - b` `a * b`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 3 0.5 -1 2
Math functions
Implemented: log()
, log10()
,
abs()
.
Reference: ?Math
.
duckplyr::duckdb_tibble(a = 1, b = 2, c = -3, .prudence = "stingy") |>
mutate(log10(a), log(b), abs(c))
#> # A duckplyr data frame: 6 variables
#> a b c `log10(a)` `log(b)` `abs(c)`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 -3 0 0.693 3
Logical operators
Implemented: !
, &
, |
.
Reference: ?Logic
.
duckplyr::duckdb_tibble(a = FALSE, b = TRUE, c = NA, .prudence = "stingy") |>
mutate(!a, a & b, b | c)
#> # A duckplyr data frame: 6 variables
#> a b c `!a` `a & b` `b | c`
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 FALSE TRUE NA TRUE FALSE TRUE
Branching and conversion
Implemented:
-
is.na()
,dplyr::if_else()
,as.integer()
strftime(x, format)
duckplyr::duckdb_tibble(a = 1, b = NA, .prudence = "stingy") |>
mutate(is.na(b), if_else(is.na(b), 0, 1), as.integer(b))
#> # A duckplyr data frame: 5 variables
#> a b `is.na(b)` `if_else(is.na(b), 0, 1)` `as.integer(b)`
#> <dbl> <lgl> <lgl> <dbl> <int>
#> 1 1 NA TRUE 0 NA
duckplyr::duckdb_tibble(
a = as.POSIXct("2025-01-11 19:23:46", tz = "UTC"),
.prudence = "stingy") |>
mutate(strftime(a, "%H:%M:%S"))
#> # A duckplyr data frame: 2 variables
#> a `strftime(a, "%H:%M:%S")`
#> <dttm> <chr>
#> 1 2025-01-11 19:23:46 19:23:46
String manipulation
Implemented: grepl()
, substr()
,
sub()
, gsub()
.
duckplyr::duckdb_tibble(a = "abbc", .prudence = "stingy") |>
mutate(grepl("b", a), substr(a, 2L, 3L), sub("b", "B", a), gsub("b", "B", a))
#> # A duckplyr data frame: 5 variables
#> a `grepl("b", a)` `substr(a, 2L, 3L)` `sub("b", "B", a)`
#> <chr> <lgl> <chr> <chr>
#> 1 abbc TRUE bbc aBbc
#> # ℹ 1 more variable: `gsub("b", "B", a)` <chr>
Date manipulation
Implemented: lubridate::hour()
,
lubridate::minute()
, lubridate::second()
,
lubridate::wday()
.
duckplyr::duckdb_tibble(
a = as.POSIXct("2025-01-11 19:23:46", tz = "UTC"),
.prudence = "stingy"
) |>
mutate(
hour = lubridate::hour(a),
minute = lubridate::minute(a),
second = lubridate::second(a),
wday = lubridate::wday(a)
)
#> # A duckplyr data frame: 5 variables
#> a hour minute second wday
#> <dttm> <dbl> <dbl> <dbl> <int>
#> 1 2025-01-11 19:23:46 19 23 46 7
Aggregation
Implemented:
-
sum(x, na.rm)
,dplyr::n()
,dplyr::n_distinct()
-
mean(x, na.rm)
,median(x, na.rm)
,sd(x, na.rm)
-
min()
,max()
,any()
,all()
duckplyr::duckdb_tibble(a = 1:3, b = c(1, 2, 2), .prudence = "stingy") |>
summarize(
sum(a),
n(),
n_distinct(b),
)
#> # A duckplyr data frame: 3 variables
#> `sum(a)` `n()` `n_distinct(b)`
#> <dbl> <int> <dbl>
#> 1 6 3 2
duckplyr::duckdb_tibble(a = 1:3, b = c(1, 2, NA), .prudence = "stingy") |>
summarize(
mean(b, na.rm = TRUE),
median(a),
sd(b),
)
#> # A duckplyr data frame: 3 variables
#> `mean(b, na.rm = TRUE)` `median(a)` `sd(b)`
#> <dbl> <dbl> <dbl>
#> 1 1.5 2 NA
duckplyr::duckdb_tibble(a = 1:3, .prudence = "stingy") |>
summarize(
min(a),
max(a),
any(a > 1),
all(a > 1),
)
#> # A duckplyr data frame: 4 variables
#> `min(a)` `max(a)` `any(a > 1)` `all(a > 1)`
#> <int> <int> <lgl> <lgl>
#> 1 1 3 TRUE FALSE
Shifting
All optional arguments to dplyr::lag()
and
dplyr::lead()
are supported.
duckplyr::duckdb_tibble(a = 1:3, .prudence = "stingy") |>
mutate(lag(a), lead(a))
#> # A duckplyr data frame: 3 variables
#> a `lag(a)` `lead(a)`
#> <int> <int> <int>
#> 1 1 NA 2
#> 2 2 1 3
#> 3 3 2 NA
duckplyr::duckdb_tibble(a = 1:3, .prudence = "stingy") |>
mutate(lag(a, 2), lead(a, n = 2))
#> # A duckplyr data frame: 3 variables
#> a `lag(a, 2)` `lead(a, n = 2)`
#> <int> <int> <int>
#> 1 1 NA 3
#> 2 2 NA NA
#> 3 3 1 NA
duckplyr::duckdb_tibble(a = 1:3, .prudence = "stingy") |>
mutate(lag(a, default = 0), lead(a, default = 4))
#> # A duckplyr data frame: 3 variables
#> a `lag(a, default = 0)` `lead(a, default = 4)`
#> <int> <int> <int>
#> 1 1 0 2
#> 2 2 1 3
#> 3 3 2 4
duckplyr::duckdb_tibble(a = 1:3, b = c(2, 3, 1), .prudence = "stingy") |>
mutate(lag(a, order_by = b), lead(a, order_by = b))
#> # A duckplyr data frame: 4 variables
#> a b `lag(a, order_by = b)` `lead(a, order_by = b)`
#> <int> <dbl> <int> <int>
#> 1 3 1 NA 1
#> 2 1 2 3 2
#> 3 2 3 1 NA
Ranking
Ranking in DuckDB is very different from dplyr. Most functions in DuckDB rank only by the current row number, whereas in dplyr, ranking is done by a column. It will be difficult to provide translations for the following ranking functions.
Implementing dplyr::ntile()
is feasible for the
n
argument. The only ranking function currently implemented
is dplyr::row_number()
.
duckplyr::duckdb_tibble(a = c(1, 2, 2, 3), .prudence = "stingy") |>
mutate(row_number())
#> # A duckplyr data frame: 2 variables
#> a `row_number()`
#> <dbl> <int>
#> 1 1 1
#> 2 2 2
#> 3 2 3
#> 4 3 4
Special cases
$
(?Extract
) is implemented if the LHS is
.data
or .env
:
b <- 4
duckplyr::duckdb_tibble(a = 1, b = 2, .prudence = "stingy") |>
mutate(.data$a + .data$b, .env$b)
#> # A duckplyr data frame: 4 variables
#> a b `.data$a + .data$b` `.env$b`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 4
%in%
(?match
) is implemented if the RHS is
a constant with up to 100 values:
duckplyr::duckdb_tibble(a = 1:3, .prudence = "stingy") |>
mutate(a %in% c(1, 3)) |>
collect()
#> # A tibble: 3 × 2
#> a `a %in% c(1, 3)`
#> <int> <lgl>
#> 1 1 TRUE
#> 2 2 FALSE
#> 3 3 TRUE
duckplyr::last_rel()
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [a as a, ___coalesce("|"("r_base::=="(a, 1.0), "r_base::=="(a, 3.0)), false) as a %in% c(1, 3)]
#> r_dataframe_scan(0xdeadbeef)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (INTEGER)
#> - a %in% c(1, 3) (BOOLEAN)
dplyr::desc()
is only implemented in the context of
dplyr::arrange()
:
duckplyr::duckdb_tibble(a = 1:3, .prudence = "stingy") |>
arrange(desc(a)) |>
explain()
#> ┌---------------------------┐
#> │ ORDER_BY │
#> │ -------------------- │
#> │ dataframe_42_42 │
#> │ 40572985.a DESC │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ R_DATAFRAME_SCAN │
#> │ -------------------- │
#> │ data.frame │
#> │ │
#> │ Projections: a │
#> │ │
#> │ ~3 Rows │
#> └---------------------------┘
suppressWarnings()
is a no-op:
duckplyr::duckdb_tibble(a = 1, .prudence = "stingy") |>
mutate(suppressWarnings(a + 1))
#> # A duckplyr data frame: 2 variables
#> a `suppressWarnings(a + 1)`
#> <dbl> <dbl>
#> 1 1 2
Contributing
Refer to our contributing guide to learn how to contribute new translations to the package. Ideally, duckplyr will also support adding custom translations for functions for the duration of the current R session.
Known incompatibilities
This section tracks known incompatibilities between dplyr and duckplyr. Changing these is likely to require substantial effort, and might be best addressed by providing new functions with consistent behavior in both dplyr and DuckDB.
Output order stability
DuckDB does not guarantee order stability for the output. For performance reasons, duckplyr does not enable output order stability by default.
duckplyr::flights_df() |>
duckplyr::as_duckdb_tibble() |>
distinct(day) |>
summarize(paste(day, collapse = " ")) # fallback
#> # A duckplyr data frame: 1 variable
#> `paste(day, collapse = " ")`
#> <chr>
#> 1 5 9 11 14 15 16 17 22 26 30 31 1 7 8 13 19 21 25 27 2 10 18 28 29 3 4 6…
duckplyr::flights_df() |>
distinct(day) |>
summarize(paste(day, collapse = " "))
#> # A tibble: 1 × 1
#> `paste(day, collapse = " ")`
#> <chr>
#> 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27…
This can be changed globally with the
DUCKPLYR_OUTPUT_ORDER
environment variable, see
?config
for details. With this setting, the output order is
stable, but the plans are more complicated, and DuckDB needs to do more
work.
duckplyr::flights_df() |>
duckplyr::as_duckdb_tibble() |>
distinct(day) |>
explain()
#> ┌---------------------------┐
#> │ HASH_GROUP_BY │
#> │ -------------------- │
#> │ Groups: #0 │
#> │ │
#> │ ~336776 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ day │
#> │ │
#> │ ~336776 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ R_DATAFRAME_SCAN │
#> │ -------------------- │
#> │ data.frame │
#> │ │
#> │ Projections: day │
#> │ │
#> │ ~336776 Rows │
#> └---------------------------┘
withr::with_envvar(
c(DUCKPLYR_OUTPUT_ORDER = "TRUE"),
duckplyr::flights_df() |>
duckplyr::as_duckdb_tibble() |>
distinct(day) |>
explain()
)
#> ┌---------------------------┐
#> │ ORDER_BY │
#> │ -------------------- │
#> │ dataframe_42_42 │
#> │ 42.___row_number ASC│
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ day │
#> │ ___row_number │
#> │ │
#> │ ~67355 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ FILTER │
#> │ -------------------- │
#> │ (___row_number_by = 1) │
#> │ │
#> │ ~67355 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ #0 │
#> │ #1 │
#> │ #2 │
#> │ │
#> │ ~336776 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ WINDOW │
#> │ -------------------- │
#> │ Projections: │
#> │ ROW_NUMBER() OVER │
#> │ (PARTITION BY day ORDER BY│
#> │ ___row_number ASC NULLS │
#> │ LAST) │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ PROJECTION │
#> │ -------------------- │
#> │ #0 │
#> │ #1 │
#> │ │
#> │ ~336776 Rows │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ STREAMING_WINDOW │
#> │ -------------------- │
#> │ Projections: │
#> │ ROW_NUMBER() OVER () │
#> └-------------┬-------------┘
#> ┌-------------┴-------------┐
#> │ R_DATAFRAME_SCAN │
#> │ -------------------- │
#> │ data.frame │
#> │ │
#> │ Projections: day │
#> │ │
#> │ ~336776 Rows │
#> └---------------------------┘
sum()
In duckplyr, this function returns a numeric value also for integers, due to DuckDB’s type stability requirement.
duckplyr::duckdb_tibble(a = 1:100) |>
summarize(sum(a))
#> # A duckplyr data frame: 1 variable
#> `sum(a)`
#> <dbl>
#> 1 5050
duckplyr::duckdb_tibble(a = 1:1000000) |>
summarize(sum(a))
#> # A duckplyr data frame: 1 variable
#> `sum(a)`
#> <dbl>
#> 1 500000500000
tibble(a = 1:100) |>
summarize(sum(a))
#> # A tibble: 1 × 1
#> `sum(a)`
#> <int>
#> 1 5050
tibble(a = 1:1000000) |>
summarize(sum(a))
#> # A tibble: 1 × 1
#> `sum(a)`
#> <dbl>
#> 1 500000500000
Empty vectors in aggregate functions
At the time of writing, empty vectors only occur when summarizing an
empty table without grouping. In all cases, duckplyr returns
NA
, and the behavior of dplyr is different:
-
sum()
for an empty vector returns0
-
any()
andall()
returnFALSE
-
min()
andmax()
return infinity values (with a warning)
duckplyr::duckdb_tibble(a = integer(), b = logical()) |>
summarize(sum(a), any(b), all(b), min(a), max(a))
#> # A duckplyr data frame: 5 variables
#> `sum(a)` `any(b)` `all(b)` `min(a)` `max(a)`
#> <dbl> <lgl> <lgl> <int> <int>
#> 1 NA NA NA NA NA
tibble(a = integer(), b = logical()) |>
summarize(sum(a), any(b), all(b), min(a), max(a))
#> Warning: There were 2 warnings in `summarize()`.
#> The first warning was:
#> ℹ In argument: `min(a)`.
#> Caused by warning in `min()`:
#> ! no non-missing arguments to min; returning Inf
#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> # A tibble: 1 × 5
#> `sum(a)` `any(b)` `all(b)` `min(a)` `max(a)`
#> <int> <lgl> <lgl> <dbl> <dbl>
#> 1 0 FALSE TRUE Inf -Inf
min()
and max()
for logical input
For completeness, duckplyr returns a logical for min()
and max()
when the input is logical, while dplyr returns an
integer.
n_distinct()
and missing values
Unlike most other aggregation functions, n_distinct()
ignores missing values and does not support the na.rm
argument. This is tracked in https://github.com/tidyverse/duckplyr/issues/572.
duckplyr::duckdb_tibble(a = c(1, 2, NA, 1)) |>
summarize(n_distinct(a))
#> # A duckplyr data frame: 1 variable
#> `n_distinct(a)`
#> <dbl>
#> 1 2
duckplyr::duckdb_tibble(a = c(1, 2, NA, 1), .prudence = "stingy") |>
summarize(n_distinct(a, na.rm = TRUE))
#> Error in `summarize()`:
#> ! This operation cannot be carried out by DuckDB, and the input is
#> a stingy duckplyr frame.
#> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage
#> and continue with duckplyr.
#> ℹ See `vignette("prudence")` for other options.
#> Caused by error in `summarize()`:
#> ! Can't translate named argument `n_distinct(na.rm = )`.
tibble(a = c(1, 2, NA, 1)) |>
summarize(n_distinct(a))
#> # A tibble: 1 × 1
#> `n_distinct(a)`
#> <int>
#> 1 3
tibble(a = c(1, 2, NA, 1)) |>
summarize(n_distinct(a, na.rm = TRUE))
#> # A tibble: 1 × 1
#> `n_distinct(a, na.rm = TRUE)`
#> <int>
#> 1 2
is.na()
and NaN
values
This function returns FALSE
for NaN
values
in duckplyr, while it returns TRUE
in dplyr.
Other differences
Does the same pipeline give different results with
tibble()
and duckdb_tibble()
? We would love to
hear about it, please file an issue.